LECTURE NOTES
ON
CLOUD COMPUTING
IV B. Tech I semester
Ms. V DIVYA VANI
Assistant Professor
Mr. C.PRAVEEN KUMAR
Assistant Professor
Mr. CH.SRIKANTH
Assistant Professor
COMPUTER SCIENCE AND ENGINEERING
INSTITUTE OF AERONAUTICAL ENGINEERING
(Autonomous)
Dundigal, Hyderabad-500043
Chapter 1: System Models and Enabling Technologies
Summary:
Parallel, distributed, and cloud computing systems are advancing all walks of life. This chapter reviews the evolutionary changes in computing and IT trends over the past 30 years. These changes are driven by killer applications with variable amounts of workload and datasets at different periods of time. We study high-performance computing (HPC) and high-throughput computing (HTC) systems in clusters/MPP, service-oriented architecture (SOA), grids, P2P networks, and Internet clouds. These systems are distinguished by their architectures, OS platforms, processing algorithms, communication protocols, security demands, and service models. This chapter introduces the essential issues in scalability, performance, availability, security, energy efficiency, workload outsourcing, datacenter protection, etc. The intent is to pave the way for our readers to study the details in subsequent chapters.
1.1 Scalable Computing Towards Massive Parallelism
    1.1.1 High-Performance vs. High-Throughput Computing
    1.1.2 Analysis of Top 500 Supercomputers
    1.1.3 Killer Applications and Grand Challenges
1.2 Enabling Technologies for Distributed Computing
    1.2.1 System Components and Wide-Area Networking
    1.2.2 Virtual Machines and Virtualization Middleware
    1.2.3 Trends in Distributed Operating Systems
    1.2.4 Parallel Programming Environments
1.3 Distributed Computing System Models
    1.3.1 Clusters of Cooperative Computers
    1.3.2 Grid Computing Infrastructures
    1.3.3 Service-Oriented Architecture (SOA)
    1.3.4 Peer-to-Peer Network Families
    1.3.5 Cloud Computing over The Internet
1.4 Performance, Security, and Energy-Efficiency
    1.4.1 Performance Metrics and System Scalability
    1.4.2 Fault-Tolerance and System Availability
    1.4.3 Network Threats and Data Integrity
    1.4.4 Energy-Efficiency in Distributed Computing
1.5 References and Homework Problems
_____________________________________________________________________
Source text: Distributed Computing: Clusters, Grids and Clouds, by Kai Hwang, Geoffrey Fox, and Jack Dongarra, revised May 2, 2010. All rights reserved.
1.1 Scalable Computing Towards Massive Parallelism
Over the past 60 years, the state of computing has gone through a series of platform and
environmental changes. We review below the evolutionary changes in machine architecture, operating system
platform, network connectivity, and application workloads. Instead of using a centralized computer to solve
computational problems, a parallel and distributed computing system uses multiple computers to solve
large-scale problems over the Internet. Distributed computing has become data-intensive and network-centric.
We will identify the killer applications of modern systems that practice parallel and distributed computing.
These large-scale applications have significantly improved the quality of life in all aspects of our
civilization.
1.1.1 High-Performance versus High-Throughput Computing
For a long time, high-performance computing (HPC) systems emphasized raw speed performance.
The speed of HPC systems increased from GFlops in the early 1990s to PFlops in 2010. This
improvement was driven mainly by demands from the scientific, engineering, and manufacturing communities.
Speed performance in terms of floating-point capability on a single system is, however, being challenged by
business computing users. This flops rating measures the time to complete the execution of a single large
computing task, such as the Linpack benchmark used in the Top-500 ranking. In reality, users of the Top-500
HPC computers make up only a small fraction (perhaps 10%) of all computer users. Today, the majority of
computer users still use desktop computers and servers, either locally or in huge datacenters, when they
conduct Internet searches and market-driven computing tasks.
The development of market-oriented high-end computing systems is undergoing a strategic change from the
HPC paradigm to a high-throughput computing (HTC) paradigm. The HTC paradigm pays more attention
to high-flux computing, whose main applications are Internet searches and web services used by millions or
more users simultaneously. The performance goal thus shifts to measuring high throughput, that is, the
number of tasks completed per unit of time. HTC technology must not only deliver high speed in batch
processing, but also address the acute problems of cost, energy saving, security, and reliability at many
datacenters and enterprise computing centers. This book is designed to address both HPC and HTC systems,
which together meet the demands of all computer users.
Electronic computers have gone through five generations of development. Each generation
lasted 10 to 20 years, and adjacent generations overlapped by about 10 years. During 1950-1970, a handful of
mainframes, such as the IBM 360 and CDC 6400, were built to satisfy the demands of large businesses and
government organizations. During 1960-1980, lower-cost minicomputers, like DEC's PDP-11 and VAX
series, became popular in small businesses and on college campuses. During 1970-1990, personal computers
built with VLSI microprocessors came into widespread use by the mass population. During 1980-2000,
massive numbers of portable computers and pervasive devices appeared in both wired and wireless
applications. Since 1990, both HPC and HTC systems hidden in Internet clouds have proliferated. They
offer web-scale services to the general public in a digital society.
Levels of Parallelism: Let us first review the types of parallelism before we proceed with the
computing trends. When hardware was bulky and expensive 50 years ago, most computers were designed in
a bit-serial fashion. Bit-level parallelism (BLP) gradually converted bit-serial processing into word-level
processing. Over the years, we moved from 4-bit microprocessors to 8-, 16-, 32-, and 64-bit CPUs. The next
wave of improvement was instruction-level parallelism (ILP). As we shifted from processors that execute a
single instruction at a time to processors that issue multiple instructions simultaneously, we practiced ILP
through pipelining, superscalar execution, VLIW (very long instruction word), and multithreading over the past 30
years. ILP demands branch prediction, dynamic scheduling, speculation, and a higher degree of compiler
support to make it work efficiently.
Data-level parallelism (DLP) was made popular through SIMD (single instruction, multiple data)
and vector machines using vector or array types of instructions. DLP demands even more hardware
support and compiler assistance to work properly. Ever since the introduction of multicore processors and
chip multiprocessors (CMPs), we have explored task-level parallelism (TLP). A modern processor exploits all
of the above parallelism types. BLP, ILP, and DLP are well supported by advances in hardware and
compilers. However, TLP is far from successful, due to the difficulty of programming and compiling code
for efficient execution on multicores and CMPs. As we move from parallel processing to distributed
processing, computing granularity increases to job-level parallelism (JLP). It is fair to say that coarse-grain
parallelism is built on top of fine-grain parallelism.
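As a small illustration (added here, not part of the original text), the sketch below contrasts data-level parallelism, where one operation is applied across a whole array, with task-level parallelism, where independent tasks run on separate threads; the array sizes and thread count are arbitrary choices.

```python
# Sketch contrasting data-level parallelism (one operation over many data
# elements) with task-level parallelism (independent tasks on separate threads).
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Data-level parallelism: a single vector expression is applied element-wise;
# SIMD/vector hardware can process many elements per instruction.
a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)
c = a * b + 1.0

# Task-level parallelism: independent tasks are scheduled onto multiple
# threads (and, on a multicore chip, onto multiple cores).
def task(chunk):
    return float(chunk.sum())

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(task, np.array_split(c, 4)))

print(sum(partial_sums))
```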
The Age of Internet Computing: The rapid development of the Internet has resulted in billions of people
logging on every day. As a result, supercomputer sites and datacenters have changed from providing
high-performance floating-point computing capabilities to concurrently servicing huge numbers of requests from
billions of users. The development of computing clouds and the wide adoption of provided computing
services demand HTC systems, which are often built with parallel and distributed computing technologies.
We cannot meet the future computing demand by pursuing only the Linpack performance of a handful of
computers. We must build efficient datacenters using low-cost servers, storage systems, and
high-bandwidth networks.
In the future, both HPC and HTC will demand multicore processors that can handle hundreds or thousands
of computing threads, nodes with tens of thousands of threads, and mobile cloud service platforms.
Both types of systems emphasize parallelism and distributed computing. Future HPC and HTC systems
must satisfy the huge demand for computing power in terms of throughput, efficiency, scalability, reliability,
etc. The term high efficiency used here refers not only to the speed performance of computing systems, but
also to work efficiency (including programming efficiency) and energy efficiency in terms of
throughput per watt of energy consumed. To achieve these goals, three key scientific issues must be
addressed:
(1) Efficiency, measured in the building blocks and execution model needed to exploit massive
parallelism as in HPC. For HTC, this also includes the data access and storage model and
energy efficiency.
(2) Dependability, in terms of reliability and self-management from the chip to the system and
application levels. The purpose is to provide high-throughput service with QoS assurance,
even under failure conditions.
(3) Adaptation in the programming model, which should support billions of job requests over
massive datasets, virtualized cloud resources, and flexible application service models.
The Platform Evolution: The general computing trend is to leverage shared web resources over the
Internet more and more. As illustrated in Fig. 1.1, we see the evolution of two tracks of system
development: distributed computing systems (DCS) and high-performance computing (HPC) systems. On
the HPC side, homogeneous supercomputers (massively parallel processors, MPP) are gradually being replaced
by clusters of cooperative computers out of the desire to share computing resources. A cluster is often a
collection of computer nodes that are physically connected in close range to one another. Clusters, MPPs, and
grid systems are studied in Chapters 3 and 4. On the DCS side, peer-to-peer (P2P) networks appeared for
distributed file sharing and content delivery applications. A P2P system is built over many client machines
(to be studied in Chapter 5). Peer machines are globally distributed in nature. Both P2P and cloud computing
and web service platforms emphasize HTC rather than HPC.
Figure 1.1 Evolutionary trend towards web-scale distributed high-throughput computing and
integrated web services to satisfy heterogeneous applications.
Distributed Computing Families: Since the mid-1990s, technologies for building peer-to-peer (P2P)
networks and networks of clusters have been consolidated into many national projects to establish wide-area
computing infrastructures, known as computational grids or data grids. We will study grid computing
technology in Chapter 4. More recently, there has been a surge of interest in exploring Internet cloud resources
for web-scale supercomputing. Internet clouds result from moving desktop computing to service-oriented
computing using server clusters and huge databases at datacenters. This chapter introduces the
basics of the various parallel and distributed families. Grids and clouds are disparate systems that place great
emphasis on resource sharing in hardware, software, and datasets.
Design theory, enabling technologies, and case studies of these massively distributed systems are
treated in this book. Massively distributed systems are intended to exploit a high degree of parallelism or
concurrency among many machines. In 2009, the largest cluster ever built had 224,162 processor cores in a
Cray XT5 system. The largest computational grids connect anywhere from tens to hundreds of server
clusters. A typical P2P network may involve millions of client machines simultaneously. Experimental
cloud computing clusters have been built with thousands of processing nodes. We devote the material in
Chapters 7 and 8 to cloud computing. Case studies of HPC systems such as clusters and grids, and of HTC
systems such as P2P networks and datacenter-based cloud platforms, are examined in Chapter 9.
1.1.2 Analysis of Top-500 Supercomputers
Figure 1.2 plots the measured performance of the Top-500 fastest computers from 1993 to 2009. The Y-axis
is scaled by sustained speed performance in terms of GFlops, TFlops, and PFlops. The middle curve
plots the performance of the No. 1 fastest computer recorded over the years; its peak performance
increased from 58.7 GFlops to 1.76 PFlops over 16 years. The bottom curve corresponds to the No. 500
computer's speed in each year; it increased from 0.42 GFlops to 20 TFlops over the same 16 years. The top
curve plots the sum of all 500 fastest computers' speeds over the same period. These plots give a fairly good
performance projection for years to come. For example, 1 PFlops was achieved by the IBM Roadrunner in
June 2008. It is interesting to observe that the total sum increases almost linearly over the years.
Figure 1.2 The Top-500 supercomputer performance from 1993 to 2009
(Courtesy of Top 500 Organization, 2009)
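A quick calculation (not in the original text) shows the annual growth factor implied by the numbers quoted above for the No. 1 and No. 500 systems:

```python
# Implied annual growth factor of Top-500 performance, based on the figures
# quoted above for the 16-year span from 1993 to 2009.
def annual_factor(start_flops, end_flops, years):
    return (end_flops / start_flops) ** (1.0 / years)

no1   = annual_factor(58.7e9, 1.76e15, 16)   # No. 1 system: 58.7 GFlops -> 1.76 PFlops
no500 = annual_factor(0.42e9, 20e12, 16)     # No. 500 system: 0.42 GFlops -> 20 TFlops

print(f"No. 1 system grows about {no1:.2f}x per year")     # roughly 1.9x per year
print(f"No. 500 system grows about {no500:.2f}x per year") # roughly 2.0x per year
```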
It is interesting to observe in Fig. 1.3 the architectural evolution of the Top-500 supercomputers
over the years. In 1993, 250 systems assumed the SMP (symmetric multiprocessor) architecture, shown in the
yellow area. Most SMPs are built with shared memory and shared I/O devices. The word "symmetric"
refers to the fact that all processors are equally capable of executing the supervisory and/or application codes.
There were 120 MPP systems (dark orange area) built then. The SIMD (single instruction stream over
multiple data streams) machines (sometimes called array processors) and uniprocessor systems disappeared in
1997, while the cluster architecture (light orange) appeared in 1999. Clustered systems grew rapidly
from a few to 375 out of 500 systems by 2005. On the other hand, the SMP architecture gradually
disappeared, falling to zero by 2002. Today, the dominating architecture classes in the Top-500 list are clusters,
MPPs, and constellations (pink). More than 85% of the Top-500 computers in 2010 adopted cluster
configurations, and the remaining 15% chose the MPP (massively parallel processor) architecture.
Figure 1.3 Architectural evolution of the Top-500 supercomputers from 1993 to 2009.
(Courtesy of Top 500 Organization, 2009)
In Table 1.1, we summarize the key architecture features, sustained Linpack benchmark performance,
and power consumption of the top five supercomputers reported in November 2009. We will present the details
of the top two systems, the Cray Jaguar and IBM Roadrunner, as case studies in Chapter 8. These two machines
have exceeded PFlops performance. The power consumption of these systems is enormous, including
the electricity for cooling. This has triggered an increasing demand for green information technology in recent
years. These state-of-the-art systems will be in use far beyond 2010, when this book was written.
Table 1.1 Top Five Supercomputers Evaluated in Nov. 2009

1. Jaguar at Oak Ridge National Laboratory, US
   Architecture: Cray XT5-HE, an MPP built with 224,162 cores of 2.6 GHz six-core Opteron processors, interconnected by a 3-D torus network.
   Sustained speed: 1.759 PFlops. Power: 6.95 MW.

2. Roadrunner at DOE/NNSA/LANL, US
   Architecture: IBM BladeCenter QS22/LS21 cluster of 122,400 cores in 12,960 3.2 GHz PowerXCell 8i processors and 6,480 AMD 1.8 GHz dual-core Opteron processors, running Linux and interconnected by an InfiniBand network.
   Sustained speed: 1.042 PFlops. Power: 2.35 MW.

3. Kraken at NICS, University of Tennessee, US
   Architecture: Cray XT5-HE, an MPP built with 98,928 cores of 2.6 GHz six-core Opteron processors, interconnected by a 3-D torus network.
   Sustained speed: 831 TFlops. Power: 3.09 MW.

4. JUGENE at FZJ, Germany
   Architecture: IBM BlueGene/P solution built with 294,912 PowerPC cores in 4-way SMP nodes with 144 TB of memory in 72 racks, interconnected by a 3-D torus network.
   Sustained speed: 825.5 TFlops. Power: 2.27 MW.

5. Tianhe-1 at NSC/NUDT, China
   Architecture: NUDT TH-1 cluster of 71,680 cores in Xeon processors and ATI Radeon GPUs, interconnected by an InfiniBand network.
   Sustained speed: 563 TFlops. Power: 1.48 MW.
1.1.3 Killer Applications and Grand Challenges
High-performance computing systems offer transparency in many application aspects. For example,
data access, resource allocation, process location, concurrency in execution, job replication, and failure
recovery should be made transparent to both users and system management. In Table 1.2, we identify
a few key applications that have driven the development of parallel and distributed systems in recent years.
These applications spread across many important domains in our society: science, engineering, business,
education, health care, traffic control, Internet and web services, military, and government applications.
Almost all applications demand computing economics, web-scale data collection, system reliability, and
scalable performance.
For example, distributed transaction processing is often practiced in the banking and finance industry.
Distributed banking systems must be designed to scale and tolerate faults as demand grows.
Transactions represent 90% of the existing market for reliable banking systems. We have to deal with
multiple database servers in distributed transactions, and maintaining the consistency of replicated
transaction records is crucial in real-time banking services. Other complications include the shortage of
software support, network saturation, and security threats in these applications. We will study some of the
killer applications and the software standards needed in Chapters 8 and 9.
Table 1.2 Killer Applications of HPC and HTC Systems

Science and Engineering: Scientific simulations, genomic analysis, etc.; earthquake prediction, global warming, weather forecasting, etc.

Business, Education, Service Industry, and Health Care: Telecommunication, content delivery, e-commerce, etc.; banking, stock exchanges, transaction processing, etc.; air traffic control, electric power grids, distance education, etc.; health care, hospital automation, telemedicine, etc.

Internet and Web Services, and Government: Internet search, datacenters, decision-making systems, etc.; traffic monitoring, worm containment, cyber security, etc.; digital government, online tax returns, social networking, etc.

Mission-Critical Applications: Military command and control, intelligent systems, crisis management, etc.
1.2 Enabling Technologies for Distributed Parallelism
This section reviews hardware, software, and network technologies for distributed computing system
design and applications. Viable approaches to building distributed operating systems are assessed for handling
massive parallelism in a distributed environment.
1.2.1 System Components and Wide-Area Networking
In this section, we assess the growth of component and network technologies used in building HPC or HTC
systems in recent years. In Fig. 1.4, processor speed is measured in MIPS (million instructions per second).
Network bandwidth is measured in Mbps or Gbps (megabits or gigabits per second). The unit GE refers to
1 Gbps Ethernet bandwidth.
Advances in Processors: The upper curve in Fig. 1.4 plots the growth of processor speed in modern
microprocessors and chip multiprocessors (CMPs). We see a growth from 1 MIPS for the VAX 780 in 1978,
to 1,800 MIPS for the Intel Pentium 4 in 2002, to a 22,000 MIPS peak for the Sun Niagara 2 in 2008. By Moore's
law, processor speed doubles roughly every 18 months, and this projection held fairly accurately over the past
30 years. The clock rate of these processors increased from 12 MHz in the Intel 286 to 4 GHz in the Pentium 4
over 30 years. However, the clock rate has stopped increasing due to the need to limit power consumption.
ILP (instruction-level parallelism) is highly exploited in modern processors. ILP mechanisms include
multiple-issue superscalar architecture, dynamic branch prediction, and speculative execution, all of which
require hardware and compiler support. In addition, DLP (data-level parallelism) and TLP
(thread-level parallelism) are also highly exploited in today's processors.
Many processors have now been upgraded to multicore and multithreaded micro-architectures. The
architecture of a typical multicore processor is shown in Fig. 1.5. Each core is essentially a processor with its
own private (L1) cache. Multiple cores are housed in the same chip with an L2 cache that is shared by
all cores. In the future, multiple CMPs could be built on the same CPU chip, with even the L3 cache on chip.
Multicore and multithreaded designs are now built into many high-end processors, such as the Intel Xeon and
Montecito, Sun Niagara, IBM Power 6, and Cell processors. Each core may also be multithreaded. For
example, the Niagara II is built with 8 cores and 8 threads handled by each core, which implies that the
maximum parallelism that can be exploited in the Niagara II equals 64 (= 8 x 8) threads.
[Figure 1.4 plot: processor speed in MIPS (VAX 11/780, Intel 286, Motorola 68030, Motorola 68060, Intel Pentium Pro, Pentium III, Pentium 4, Intel Core 2 QX9770, Sun Niagara 2) and network bandwidth in Mbps (Ethernet, Fast Ethernet, Gigabit Ethernet, 10 GE, 40 GE), plotted from 1978 to 2008.]
Figure 1.4 Improvement of processor and network technologies over 30 years.
Multicore Architecture: With multiple of the multicore processors of Fig. 1.5 built on an even larger chip, the number
of working cores on the same CPU chip could reach hundreds in the next few years. Both IA-32 and IA-64
instruction set architectures are built into commercial processors today. x86 processors have now been
extended to serve HPC and HTC in some high-end server products. Many RISC processors have been
replaced by multicore x86 processors in the Top-500 supercomputer systems. The trend is that x86
upgrades will dominate in datacenters and supercomputers. Graphics processing units (GPUs) have also appeared in
HPC systems. In the future, exascale (EFlops, or 10^18 Flops) systems could be built with a large number of
multicore CPUs and GPUs. In 2009, the No. 1 supercomputer in the Top-500 list (a Cray XT5 named
Jaguar) already had over 30,000 AMD six-core Opteron processors, for a total of
224,162 cores in the entire HPC system.
Figure 1.5 The schematic of a modern multicore processor using a hierarchy of caches
Wide-Area Networking: The lower curve in Fig. 1.4 plots the rapid growth of Ethernet bandwidth, from 10
Mbps in 1979 to 1 Gbps in 1999 and 40 GE in 2007. It has been speculated that 1 Tbps network links will be
available by 2012. According to Berman, Fox, and Hey [3], network links of 1,000, 1,000, 100, 10, and 1 Gbps
were expected at international, national, organization, optical desktop, and copper desktop
connections, respectively, in 2006. An increase factor of two per year in network performance has been reported,
which is faster than Moore's law of CPU speed doubling every 18 months. The implication is that more computers
will be used concurrently in the future. High-bandwidth networking increases the capability of building
massively distributed systems. The IDC 2010 report predicted that InfiniBand and Ethernet will be
the two major interconnect choices in the HPC arena.
Memory, SSD, and Disk Arrays: Figure 1.6 plots the growth of DRAM chip capacity from 16 Kb in 1976 to
16 Gb in 2008. This shows that memory chip capacity has increased by about four times every three years.
Memory access time, however, did not improve much in the past; in fact, the memory wall problem is getting
worse as processors get faster. For hard drives, capacity increased from 260 MB to 250 GB in 2004, and the
Seagate Barracuda 7200.11 drive reached 1.5 TB in 2008. This is an increase of about 10 times in capacity
every 8 years. The capacity increase of disk arrays will be even greater in the years to come. On the other hand,
faster processor speed and larger memory capacity result in a wider gap between processors and memory. The
memory wall has become an even more serious problem than before, and it still limits the performance
scalability of multicore processors.
[Figure 1.6 plot: memory chip capacity in Mbit (64 Kb, 256 Kb, 1 Mb, 4 Mb, 16 Mb, 64 Mb, 256 Mb, 1 Gb, 16 Gb) and disk capacity in GB (drives including the Morrow Designs DISCUS M26, Iomega, Maxtor, ST43400N, DiamondMax 2160, WDC WD1200JB, and Seagate Barracuda 7200), plotted from 1978 to 2008.]
Figure 1.6 Improvement of memory and disk technologies over 30 years
The rapid growth of flash memory and solid-state drives (SSDs) also impacts the future of HPC and
HTC systems. The wear-out rate of SSDs is not bad at all: a typical SSD can handle 300,000 to 1,000,000
write cycles per block, so an SSD can last for several years even under heavy write usage. Flash and SSDs
will demonstrate impressive speedups in many applications. For example, the Apple MacBook Pro offered a 128
GB solid-state drive for only about $150 more than a 500 GB 7,200 RPM SATA drive; however, 256 GB or
512 GB SSDs cost significantly more. At present, SSDs are still too expensive to replace disk arrays in the
storage market. Eventually, power consumption, cooling, and packaging will limit large system development.
Power increases linearly with clock frequency and quadratically with the voltage applied to the chip. We
cannot increase the clock rate indefinitely, and lowering the supply voltage is very much in demand.
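The relationship just described is the standard CMOS dynamic-power relation, roughly P = C * V^2 * f, where C is the switched capacitance. The small sketch below (an illustration added here, with arbitrary numbers) shows what it implies for frequency and voltage scaling; leakage power is ignored.

```python
# Rough dynamic-power model P = C * V^2 * f (CMOS switching power only;
# leakage and other terms are ignored in this sketch).
def dynamic_power(c_eff, voltage, frequency):
    return c_eff * voltage ** 2 * frequency

base = dynamic_power(c_eff=1.0, voltage=1.2, frequency=3.0e9)

# Doubling the clock alone doubles power ...
print(dynamic_power(1.0, 1.2, 6.0e9) / base)              # -> 2.0

# ... while scaling voltage and frequency down together (as in DVFS) saves
# power roughly cubically: 0.8x V and 0.8x f gives about 0.51x the power.
print(dynamic_power(1.0, 1.2 * 0.8, 3.0e9 * 0.8) / base)  # -> ~0.512
```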
1.2.2 Virtual Machines and Virtualization Middleware
A conventional computer has a single OS image. This offers a rigid architecture that tightly couples
application software to a specific hardware platform. Some software that runs well on one machine may not
be executable on another platform with a different instruction set under a fixed OS. Virtual
machines (VMs) offer novel solutions to underutilized resources, application inflexibility, software
manageability, and security concerns in existing physical machines.
Virtual Machines: The concept of virtual machines is illustrated in Fig. 1.7. The host machine is equipped
with the physical hardware shown at the bottom; for example, Fig. 1.7(a) shows a desktop with the x86
architecture running its installed Windows OS. A VM can be provisioned to any hardware system. The
VM is built with virtual resources managed by a guest OS to run a specific application. Between the VMs
and the host platform, we need to deploy a middleware layer called a virtual machine monitor (VMM).
Figure 1.7(b) shows a native VM installed with a VMM, called a hypervisor, running in privileged mode.
For example, the hardware may have an x86 architecture running the Windows system; the guest OS could be a
Linux system, and the hypervisor the Xen system developed at Cambridge University. This hypervisor
approach is also called a bare-metal VM, because the hypervisor handles the bare hardware (CPU, memory,
and I/O) directly.
Another architecture is the hosted VM shown in Fig. 1.7(c). Here the VMM runs in non-privileged
mode, and the host OS need not be modified. The VM can also be implemented in a dual mode, as shown in
Fig. 1.7(d): part of the VMM runs at the user level and another portion runs at the supervisor level, in which
case the host OS may have to be modified to some extent. Multiple VMs can be ported to a given hardware
system to support the virtualization process. The VM approach offers hardware independence for the OS
and applications. A user application and its dedicated OS can be bundled together as a virtual appliance
that can easily be ported to various hardware platforms.
[Figure 1.7 diagrams: (a) a physical machine running applications directly on the host OS and hardware; (b) a native VM, where guest applications and a guest OS run on a VMM (hypervisor) in privileged mode directly above the hardware; (c) a hosted VM, where the VMM runs on top of the host OS; (d) a dual-mode VM, where the VMM is split between non-privileged and privileged modes alongside the host OS.]
Figure 1.7 Three ways of constructing a virtual machine (VM) embedded in a physical
machine. The VM can run an OS different from that of the host computer.
Virtualization Operations: The VMM provides the VM abstraction to the guest OS. With full
virtualization, the VMM exports a VM abstraction identical to the physical machine, so that a standard OS
such as Windows 2000 or Linux can run just as it would on the physical hardware. Low-level VMM
operations are described by Mendel Rosenblum [29] and illustrated in Fig. 1.8. First, VMs can be
multiplexed between hardware machines, as shown in Fig. 1.8(a). Second, a VM can be suspended and
stored in stable storage, as shown in Fig. 1.8(b). Third, a suspended VM can be resumed or provisioned on
a new hardware platform, as in Fig. 1.8(c). Finally, a VM can be migrated from one hardware platform to
another, as shown in Fig. 1.8(d).
These VM operations enable a virtual machine to be provisioned on any available hardware platform.
They also make it flexible to port distributed application executions. Furthermore, the VM approach
significantly enhances the utilization of server resources. Multiple server functions can be consolidated on
the same hardware platform to achieve higher system efficiency, which eliminates server sprawl by
deploying systems as VMs that move transparently over the shared hardware. According to a
claim by VMware, server utilization could be increased from the current 5-15% to 60-80%.
_____________________________________________________________________
Distributed Computing : Clusters, Grids and Clouds, All rights reserved
Fox, and Jack Dongarra, May 2, 2010.
1 - 11
by Kai Hwang, Geoffrey
Chapter 1: System Models and Enabling Technologies (42 pages)
revised May 2, 2010
Figure 1.8 Virtual machine multiplexing (a), suspension to storage (b), provision or resume (c), and live
migration (d) in a distributed computing environment. (Courtesy of M. Rosenblum, Keynote address,
ACM ASPLOS 2006 [29])
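As a hedged illustration of these four operations (not part of the original text), the sketch below uses the libvirt Python bindings; the hypervisor URIs, the saved-image path, and the domain name "vm1" are placeholder assumptions, and a real deployment would add error handling.

```python
# Hedged sketch of the VM operations in Fig. 1.8 using the libvirt Python
# bindings (URIs, paths, and the domain name "vm1" are assumptions).
import libvirt

src = libvirt.open("qemu:///system")        # connect to the local hypervisor
dom = src.lookupByName("vm1")               # an already-defined guest VM

# (b) Suspension: save the VM's state to stable storage and stop it.
dom.save("/var/tmp/vm1.img")

# (c) Provision/resume: restore the saved state on an available host.
src.restore("/var/tmp/vm1.img")
dom = src.lookupByName("vm1")

# (d) Live migration: move the running VM to another hardware platform.
dst = libvirt.open("qemu+ssh://other-host/system")
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

# (a) Multiplexing corresponds to running several such domains side by side
# on one physical host, each created from its own image.
```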
Virtual Infrastructures: Virtual infrastructure is very much needed in distributed computing. Physical resources for
compute, storage, and networking at the bottom are mapped to the applications embedded in various
VMs at the top, so that hardware and software are separated. Virtual infrastructure is what connects resources
to distributed applications; it is a dynamic mapping of system resources to specific applications. The
result is decreased costs and increased efficiency and responsiveness. Virtualization for server
consolidation and containment is a good example. We will study virtual machines and virtualization support
in Chapter 2. Virtualization support for clusters, grids, and clouds is studied in Chapters 3, 4, and 6,
respectively.
1.2.3 Trends in Distributed Operating Systems
The computers in most distributed systems are loosely coupled. Thus a distributed system
inherently has multiple system images, mainly because all node machines run
independent operating systems. To promote resource sharing and fast communication among node
machines, it is desirable to have a distributed OS that manages all resources coherently and efficiently. Such a
system is most likely a closed system, relying on message passing and remote procedure calls (RPC)
for internode communication. It should be pointed out that a distributed OS is crucial for upgrading the
performance, efficiency, and flexibility of distributed applications. Distributed systems will continue to face
shortcomings such as restricted applications and a lack of software and security support until well-built
distributed OSs are in widespread use.
Distributed Operating Systems: Tanenbaum [26] classifies three approaches to distributing the resource
management functions in a distributed computer system. The first approach is to build a network OS over a
large number of heterogeneous OS platforms. Such a network OS offers the lowest transparency to users; it is
essentially a distributed file system, with independent computers relying on file sharing as a means
of communication. The second approach is to develop middleware that offers a limited degree of resource
sharing, similar to what was built for clustered systems (Section 1.2.1). The third approach is to develop a
distributed OS to achieve a higher degree of user and system transparency.
Amoeba vs. DCE: Table 1.3 summarizes the functionalities of the distributed OS Amoeba and the
middleware-based DCE, both developed in the last two decades. To balance the resource management workload, the
functionalities of such a distributed OS should be distributed to any available server; in this sense, the conventional
OS runs only on a centralized platform. With the distribution of OS services, the distributed OS design should
either take a lightweight microkernel approach, as in Amoeba [27], or extend an existing OS, as DCE [5] does
by extending UNIX. The trend is to free users from most resource management duties. We need new
web-based operating systems to support the virtualization of resources in distributed environments. We shall
study distributed OSs installed in distributed systems in subsequent chapters.
Table 1.3 Feature Comparison of Two Distributed Operating Systems

AMOEBA (developed at Vrije University, Amsterdam [32]):
- History and current system status: Developed at VU and tested in the European Community; version 5.2 released in 1995; written in C.
- Distributed OS architecture: Microkernel-based and location-transparent; uses many servers to handle file, directory, replication, run, boot, and TCP/IP services.
- Microkernel or packages: A special microkernel handles low-level process, memory, I/O, and communication functions.
- Communication mechanisms: Uses the network-layer FLIP protocol and RPC to implement point-to-point and group communication.

DCE (released as OSF/1 by the Open Software Foundation [5]):
- History and current system status: Released as the OSF/1 product; built as a user extension on top of an existing OS such as UNIX, VMS, Windows, or OS/2.
- Distributed OS architecture: A middleware OS providing a platform for running distributed applications; supports RPC, security, and DCE threads.
- Microkernel or packages: DCE packages handle file, time, directory, and security services, RPC, and authentication at user space.
- Communication mechanisms: DCE RPC supports authenticated communication and other security services in user programs.
1.2.4 Parallel and Distributed Programming Environments
Four programming models are introduced below for distributed computing with
expected scalable performance and application flexibility. Table 1.4 summarizes three of these models
and some software toolsets developed in recent years. MPI is the most popular programming model
for message-passing systems. Google's MapReduce and BigTable are for the effective use of resources from
Internet clouds and datacenters. Service clouds demand extending Hadoop, EC2, and S3 to facilitate
distributed computing applications over distributed storage systems.
Message-Passing Interface (MPI) is the primary programming standard used to develop parallel programs
that run on a distributed system. MPI is essentially a library of subprograms that can be called from C or
Fortran to write parallel programs running on a distributed system. Clusters, grids, and
P2P systems need to be embodied with upgraded web services and utility computing applications. Besides MPI,
distributed programming can also be supported with low-level primitives such as PVM (parallel virtual machine).
Both MPI and PVM are described in Hwang and Xu [20].
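As a concrete illustration of the message-passing style MPI supports (added here, not taken from the original text), the sketch below uses the mpi4py Python bindings rather than the C or Fortran API mentioned above; the C calls MPI_Comm_rank, MPI_Send, and MPI_Recv map directly onto these methods.

```python
# Minimal MPI sketch using the mpi4py bindings (assumed installed).
# Run with, for example:  mpiexec -n 4 python mpi_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD          # default communicator spanning all processes
rank = comm.Get_rank()         # this process's id: 0 .. size-1
size = comm.Get_size()         # total number of processes

if rank == 0:
    # Rank 0 collects a short message from every other rank.
    for src in range(1, size):
        msg = comm.recv(source=src, tag=0)
        print(f"rank 0 received: {msg}")
else:
    # All other ranks send a point-to-point message to rank 0.
    comm.send(f"hello from rank {rank} of {size}", dest=0, tag=0)
```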
MapReduce: This is a web-programming model for scalable data processing on large clusters over large
datasets [11]. The model is applied mainly in web-scale search and cloud computing applications. The user
specifies a map function to generate a set of intermediate key/value pairs, and then applies a reduce
function to merge all intermediate values that share the same intermediate key. MapReduce is highly scalable and
exploits a high degree of parallelism at the job level. A typical MapReduce computation may handle
terabytes of data on tens of thousands or more client machines. Hundreds of MapReduce programs are likely
to be executed simultaneously; thousands of MapReduce jobs are executed on Google's clusters every day.
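To make the map/reduce contract concrete, here is a minimal single-machine sketch of the classic word-count example (an illustration only; a real MapReduce runtime distributes the map tasks, shuffles the intermediate pairs by key, and runs the reduce tasks across a cluster).

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Merge all intermediate values that share the same key.
    return (word, sum(counts))

def mapreduce(documents):
    # Map phase: apply map_fn to every input record.
    intermediate = [pair for doc in documents for pair in map_fn(doc)]
    # Shuffle phase: group intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: apply reduce_fn once per distinct key.
    return [reduce_fn(key, (v for _, v in group))
            for key, group in groupby(intermediate, key=itemgetter(0))]

print(mapreduce(["the cloud serves the web", "the grid serves science"]))
# [('cloud', 1), ('grid', 1), ('science', 1), ('serves', 2), ('the', 3), ('web', 1)]
```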
Table 1.4 Parallel and Distributed Programming Models and Toolsets

MPI:
- Objectives and web link: The Message-Passing Interface is a library of subprograms that can be called from C or Fortran to write parallel programs running on distributed computer systems [2, 21].
- Attractive features implemented: Specifies synchronous or asynchronous point-to-point and collective communication commands and I/O operations in user programs for message-passing execution.

MapReduce:
- Objectives and web link: A web programming model for scalable data processing on large clusters over large datasets, applied in web search operations [12].
- Attractive features implemented: A map function generates a set of intermediate key/value pairs; a reduce function merges all intermediate values with the same key.

Hadoop:
- Objectives and web link: A software platform to write and run large user applications on vast datasets in business and advertising applications. http://hadoop.apache.org/core/
- Attractive features implemented: Hadoop is scalable, economical, efficient, and reliable in providing users with easy access to commercial clusters.
Hadoop Library: Hadoop offers a software platform that was originally developed by a Yahoo! group. The
package enables users to write and run applications over vast amounts of distributed data. Its attractive features
include: (1) Scalability: Hadoop can easily scale to store and process petabytes of data in the web space. (2)
Economy: an open-source MapReduce implementation minimizes the overheads of task spawning and massive
data communication. (3) Efficiency: data is processed with a high degree of parallelism across a large number
of commodity nodes. (4) Reliability: multiple data copies are kept automatically to facilitate the redeployment of
computing tasks upon unexpected system failures.
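As a hedged illustration of how such applications are commonly written for Hadoop's streaming interface (an assumption of this rewrite; the platform also has a native Java API), the mapper below reads raw text lines from standard input and emits tab-separated key/value pairs, and the reducer receives those pairs already sorted by key.

```python
# wordcount_streaming.py -- run either as a mapper or a reducer under Hadoop
# Streaming (a sketch; file names and the streaming setup are assumptions).
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reducer():
    # The framework delivers mapper output sorted by key, so equal words are adjacent.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```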
Open Grid Service Architecture (OGSA): The development of grid infrastructure is driven by the pressing
need for large-scale distributed computing applications. These applications must count on a high degree of
resource and data sharing. Table 1.5 introduces OGSA (Open Grid Service Architecture) as a common
standard for the general public use of grid services; Genesis II is one realization of it. The key features cover a
distributed execution environment, PKI (Public Key Infrastructure) services using a local certificate authority
(CA), trust management, and security policies in grid computing.
Globus Toolkits and Extensions: Globus is a middleware library jointly developed by the US Argonne
National Laboratory and the USC Information Sciences Institute over the past decade. This library
implements some of the OGSA standards for resource discovery, allocation, and security enforcement in a
grid environment. The Globus packages support multi-site mutual authentication with PKI certificates.
Globus has gone through several releases; the version GT4 was in use in 2008.
Sun SGE and IBM Grid Toolbox: Both Sun Microsystems and IBM have extended Globus for business
applications. We will cover grid computing principles and technology in Chapter 5 and grid applications in
Chapter 9.
Table 1.5 Grid Standards and Toolkits for Scientific and Engineering Applications
OGSA Standard:
- Major grid service functionalities: The Open Grid Service Architecture offers common grid service standards for general public use.
- Key features and security infrastructure: Supports a heterogeneous distributed environment, bridging CAs, multiple trusted intermediaries, dynamic policies, multiple security mechanisms, etc.

Globus Toolkits:
- Major grid service functionalities: Resource allocation, the Globus security infrastructure (GSI), and a generic security service API.
- Key features and security infrastructure: Sign-in and multi-site authentication with PKI, Kerberos, SSL, proxies, delegation, and the GSS API for message integrity and confidentiality.

Sun Grid Engine (SGE):
- Major grid service functionalities: Supports local grids and clusters in enterprise or campus intranet grid applications.
- Key features and security infrastructure: Uses reserved ports, Kerberos, DCE, SSL, and authentication of classified hosts at various trust levels with resource access restrictions.

IBM Grid Toolbox:
- Major grid service functionalities: AIX and Linux grids built on top of the Globus Toolkit, autonomic computing, and replica services.
- Key features and security infrastructure: Uses a simple CA, access granting, grid services (ReGS), the Grid Application Framework for Java (GAF4J), GridMap in IBM IntraGrid for security updates, etc.
1.3 Distributed Computing System Models
A massively parallel and distributed computing system, or in short a massive system, is built over a large
number of autonomous computer nodes. These node machines are interconnected by system-area networks
(SANs), local-area networks (LANs), or wide-area networks (WANs) in a hierarchical manner. With today's
networking technology, a few LAN switches can easily connect hundreds of machines as a working cluster,
and a WAN can connect many local clusters to form a very large cluster of clusters. In this sense, one can build
a massive system with millions of computers connected to edge networks in various Internet domains.
System Classification: Massive systems are considered highly scalable, reaching web-scale connectivity
either physically or logically. In Table 1.6, we classify massive systems into four classes: clusters,
P2P networks, computing grids, and Internet clouds over huge datacenters. In terms of node
number, these four system classes may involve hundreds, thousands, or even millions of computers as
participating nodes. These machines work collectively, cooperatively, or collaboratively at various levels.
The table entries characterize the four system classes in various technical and application aspects.
From the application perspective, clusters are most popular in supercomputing applications. In 2009,
417 of the Top-500 supercomputers were built with a cluster architecture. It is fair to say that clusters
have laid the necessary foundation for building large-scale grids and clouds. P2P networks appeal most to
business applications; however, the content industry was reluctant to accept P2P technology because of the lack
of copyright protection in ad hoc networks. Many national grids built in the past decade were underutilized for
lack of reliable middleware or well-coded applications. The potential advantages of cloud computing lie in its
low cost and simplicity for both providers and users.
New Challenges: Utility computing focuses on a business model in which customers receive computing
resources from a paid service provider. All grid/cloud platforms are regarded as utility service providers.
However, cloud computing offers a broader concept than utility computing. Distributed cloud applications
run on any available servers in some edge networks. The major technological challenges span all aspects of
computer science and engineering. For example, we need network-efficient processors, scalable
memory and storage schemes, distributed OSs, middleware for machine virtualization, new programming
models, effective resource management, and application program development for distributed systems that
exploit massive parallelism at all processing levels.
Table 1.6 Classification of Distributed Parallel Computing Systems

Multicomputer Clusters [11, 21]:
- Architecture, network connectivity, and size: Network of compute nodes interconnected by SAN, LAN, or WAN hierarchically.
- Control and resources management: Homogeneous nodes with distributed control, running Unix or Linux.
- Applications and network-centric services: High-performance computing, search engines, web services, etc.
- Representative operational systems: Google search engine, SunBlade, IBM BlueGene, Roadrunner, Cray XT4, etc.

Peer-to-Peer Networks [13, 33]:
- Architecture, network connectivity, and size: Flexible network of client machines logically connected by an overlay network.
- Control and resources management: Autonomous client nodes, free to join and leave, with distributed self-organization.
- Applications and network-centric services: Most appealing to business file sharing, content delivery, and social networking.
- Representative operational systems: Gnutella, eMule, BitTorrent, Napster, Tapestry, KaZaA, Skype, JXTA, and .NET.

Data/Computational Grids [4, 14, 33]:
- Architecture, network connectivity, and size: Heterogeneous cluster of clusters connected by high-speed network links over selected resource sites.
- Control and resources management: Centralized control, server-oriented with authenticated security, and static resource management.
- Applications and network-centric services: Distributed supercomputing, global problem solving, and datacenter services.
- Representative operational systems: TeraGrid, GriPhyN, UK EGEE, D-Grid, ChinaGrid, IBM IntraGrid, etc.

Cloud Platforms [7, 8, 22, 31]:
- Architecture, network connectivity, and size: Virtualized cluster of servers over many datacenters via service-level agreements.
- Control and resources management: Dynamic resource provisioning of servers, storage, and networks over massive datasets.
- Applications and network-centric services: Upgraded web search, utility computing, and outsourced computing services.
- Representative operational systems: Google App Engine, IBM BlueCloud, Amazon Web Services (AWS), and Microsoft Azure.
1.3.1 Clusters of Cooperative Computers
A computing cluster is built from a collection of interconnected stand-alone computers which work
cooperatively as a single integrated computing resource. Clustered computer systems have demonstrated
impressive results in handling heavy workloads with large datasets.
Cluster Architecture: Figure 1.9 shows the architecture of a typical server cluster built around a
low-latency, high-bandwidth interconnection network. This network can be as simple as a SAN (e.g.,
Myrinet) or a LAN (e.g., Ethernet). To build a larger cluster with more nodes, the interconnection network can be
built with multiple levels of Gigabit Ethernet, Myrinet, or InfiniBand switches. Through hierarchical construction
using SANs, LANs, or WANs, one can build scalable clusters with an increasing number of nodes. The whole
cluster is connected to the Internet via a VPN gateway, whose IP address can be used to locate the
cluster in cyberspace.
Single-System Image: The system image of a computer is decided by the way the OS manages the shared
cluster resources. Most clusters have loosely coupled node computers, and all resources of a server node are
managed by its own OS. Thus, most clusters have multiple system images coexisting simultaneously. Greg
Pfister [27] has indicated that an ideal cluster should merge multiple system images into a single-system
image (SSI) at various operational levels. We need an idealized cluster operating system or some middleware
to support SSI at various levels, including the sharing of CPUs, memory, and I/O across all computer
nodes attached to the cluster.
Figure 1.9 A cluster of servers (S1, S2, ..., Sn) interconnected by a high-bandwidth system-area, local-area,
or storage-area network (Ethernet, Myrinet, InfiniBand, etc.) with shared I/O devices and disk arrays. The
cluster acts as a single computing node attached to the Internet through a gateway.
A single-system image is the illusion, created by software or hardware, that presents a collection of
resources as one integrated, powerful resource. SSI makes the cluster appear to be a single machine to the user,
to applications, and to the network. A cluster with multiple system images is nothing but a collection of independent
computers. Figure 1.10 shows the hardware and software architecture of a typical cluster system. Each node
computer has its own operating system; on top of all the operating systems, we deploy two layers of
middleware at the user space to support high availability and SSI features for shared resources and
fast MPI communication.
Figure 1.10 The architecture of a working cluster with full hardware, software, and
middleware support for availability and a single system image.
For example, since memory modules are distributed at different server nodes, they are managed
independently over disjoint address spaces. This implies that the cluster has multiple images at the
memory-reference level. On the other hand, we may want all distributed memory to be shared by all servers by
forming distributed shared memory (DSM) with a single address space. A DSM cluster thus has a
single-system image (SSI) at the memory-sharing level. Clusters explore data parallelism at the job level with high
system availability.
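As a single-node analogy (an illustration added here, not cluster-level DSM, which needs dedicated middleware or hardware across server nodes), the sketch below shows two processes sharing one memory segment through Python's multiprocessing shared-memory facility; a DSM layer plays a similar role across physically separate machines.

```python
# Toy analogy of a shared address space on one node (not real cluster DSM).
from multiprocessing import Process
from multiprocessing import shared_memory

def worker(name):
    # Attach to the existing shared segment by name and modify it in place.
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[0] = 42            # visible to every process attached to the segment
    shm.close()

if __name__ == "__main__":
    # Create a small shared segment; both processes see one address range for it.
    shm = shared_memory.SharedMemory(create=True, size=16)
    p = Process(target=worker, args=(shm.name,))
    p.start(); p.join()
    print(shm.buf[0])          # prints 42, written by the other process
    shm.close(); shm.unlink()  # release the shared segment
```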
Cluster Design Issues: Unfortunately, a cluster-wide OS for complete resource sharing is not available yet.
Middleware or OS extensions were developed at the user space to achieve SSI at selected functional
levels. Without this middleware, the cluster nodes cannot work together effectively to achieve cooperative
computing. The software environments and applications must rely on the middleware to achieve high
performance. The cluster benefits come from scalable performance, efficient message passing, high system
availability, seamless fault tolerance, and cluster-wide job management, as summarized in Table 1.7.
Clusters and MPP designs are treated in Chapter 3.
Table 1.7 Critical Cluster Design Issues and Feasible Implementations

Availability Support:
- Functional characterization: Hardware and software support for sustained high availability in the cluster.
- Feasible implementations: Failover, failback, checkpointing, rollback recovery, non-stop OS, etc.

Hardware Fault-Tolerance:
- Functional characterization: Automated failure management to eliminate all single points of failure.
- Feasible implementations: Component redundancy, hot swapping, RAID, multiple power supplies, etc.

Single-System Image (SSI):
- Functional characterization: Achieving SSI at the functional level with hardware and software support, middleware, or OS extensions.
- Feasible implementations: Hardware mechanisms or middleware support to achieve distributed shared memory (DSM) at the coherent-cache level.

Efficient Communications:
- Functional characterization: Reducing message-passing system overhead and hiding latencies.
- Feasible implementations: Fast message passing, active messages, enhanced MPI libraries, etc.

Cluster-wide Job Management:
- Functional characterization: Use a global job management system with better scheduling and monitoring.
- Feasible implementations: Apply single-job management systems such as LSF, Codine, etc.

Dynamic Load Balancing:
- Functional characterization: Balance the workload of all processing nodes along with failure recovery.
- Feasible implementations: Workload monitoring, process migration, job replication, gang scheduling, etc.

Scalability and Programmability:
- Functional characterization: Adding more servers to a cluster, or more clusters to a grid, as the workload or dataset increases.
- Feasible implementations: Use scalable interconnects, performance monitoring, a distributed execution environment, and better software tools.

1.3.2 Grid Computing Infrastructures
Over the past 30 years, we have experienced a natural growth path from the Internet to web and grid computing
services. Internet services such as the Telnet command enable a connection from one computer to a remote
computer. Web services such as the HTTP protocol enable remote access to remote web pages. Grid computing
is envisioned to allow close interaction among applications running on distant computers simultaneously.
Forbes Magazine projected that the global IT-based economy would grow from $1 trillion in 2001 to $20
trillion by 2015. The evolution from Internet to web and grid services is certainly playing a major role toward
this end.
Computing Grids: Like an electric-utility power grid, a computing grid offers an infrastructure that
couples computers, software/middleware, special instruments, people, and sensors together. Grids are
often constructed across LANs, WANs, or Internet backbone networks at regional, national, or global scales.
Enterprises or organizations present grids as integrated computing resources, which can also be viewed as
virtual platforms supporting virtual organizations. The computers used in a grid are primarily workstations,
servers, clusters, and supercomputers; personal computers, laptops, and PDAs can be used as access devices
to a grid system. Grid software and middleware are needed as application and utility libraries and
databases. Special instruments may be used, for example, to search for life in the galaxy.
Figure 1.11 shows the concept of a computational grid built over three resource sites at the
University of Wisconsin at Madison, University of Illinois at Champaign-Urbana, and California Institute of
Technology. The three sites offer complementary computing resources, including workstations, large
servers, meshes of processors, and Linux clusters to satisfy a chain of computational needs. Three steps are shown in the chain of weather data collection, distributed computation, and result analysis in atmospheric simulations. Many other, even larger, computational grids such as the NSF TeraGrid, EGEE, and ChinaGrid have built similar national infrastructures to run distributed scientific grid applications.
Figure 1.11 An example computational Grid built over specialized computers at three
resource sites at Wisconsin, Caltech, and Illinois. (Courtesy of Michel Waldrop,
“Grid Computing”, IEEE Computer Magazine, 2000. [34])
Grid Families: Grid technology demands new distributed computing models, software/middleware support, network protocols, and hardware infrastructures. National grid projects are followed by industrial grid platform development by IBM, Microsoft, Sun, HP, Dell, Cisco, EMC, Platform Computing, etc. New grid service providers (GSPs) and new grid applications are emerging rapidly, similar to the growth of Internet and web services in the past two decades. In Table 1.8, we classify grid systems developed in the past decade into two families: computational or data grids and P2P grids. These computing grids are mostly built at the national level. We identify their major applications, representative systems, and the lessons learned so far. Grid computing will be studied in Chapters 4 and 8.
Table 1.8 Two Grid Computing Infrastructures and Representative Systems

Design Issues | Computational and Data Grids | P2P Grids
Grid Applications reported | Distributed supercomputing, national grid initiatives, etc. | Open grid with P2P flexibility, all resources from client machines
Representative Systems | TeraGrid in US, ChinaGrid, UK e-Science, etc. | JXTA, FightAid@home, SETI@home
Development Lessons learned | Restricted user groups, middleware bugs, rigid protocols to acquire resources | Unreliable user-contributed resources, limited to a few apps

1.3.3 Service-Oriented Architectures (SOA)
Technology has advanced at breakneck speed over the last decade, with many changes that
are still occurring. However, amid this chaos, the value of building systems in terms of services has grown in acceptance, and it has become a core idea of most distributed systems. One always builds systems in a layered fashion, as sketched in Fig. 1.12. Here we use the rather clumsy term "entity" to denote the abstraction used as the basic building block. In Grids/Web Services, Java, and CORBA, an entity is, respectively, a service, a Java object, or a CORBA distributed object in a variety of languages.
The architectures build on the traditional seven OSI layers, which provide the base networking abstractions. On top of this we have a base software environment, which would be .NET or Apache Axis for Web Services, the Java Virtual Machine for Java, or a broker network for CORBA. On top of this base environment, one builds a higher-level environment reflecting the special features of the distributed computing environment, represented by the upper layers in Fig. 1.12. This starts with entity interfaces and inter-entity communication, which can be thought of as rebuilding the top four OSI layers, but at the entity level rather than the bit level.
The entity interfaces correspond to the WSDL, Java method, and CORBA IDL specifications in these example distributed systems. These interfaces are linked with customized high-level communication systems: SOAP, RMI, and IIOP in the three examples. These communication systems support features including particular message patterns (such as RPC, or remote procedure call), fault recovery, and specialized routing. Often these communication systems are built on message-oriented middleware (enterprise bus) infrastructure such as WebSphereMQ or JMS (Java Message Service), which provides rich functionality and supports virtualization of routing, senders, and recipients.
In the case of fault tolerance, we find features in the Web Service Reliable Messaging framework
that mimic the OSI layer capability (as in TCP fault tolerance) modified to match the different abstractions
(such as messages versus packets, virtualized addressing) at the entity levels. Security is a critical capability
that either uses or re-implements the capabilities seen in concepts like IPSec and secure sockets in the OSI
layers. Entity communication is supported by higher level services for registries, metadata and management
of the entities discussed in Section 4.4.
Fig. 1.12 General layered architecture for distributed entities. (The figure shows the bit-level Internet stack of Data Link/Physical, Network IP, Transport TCP/UDP, Session SSH, Presentation XDR, and protocols such as HTTP, FTP, and DNS, topped by the Base Software Environment and the distributed-entity layers: Entity Interfaces, Inter-Entity Communication, Entity Discovery and Information, Entity Management, Entity Coordination, Generally Useful Entities and Systems, and Application-Specific Entities and Systems.)
Here one might get several models, with, for example, Jini and JNDI illustrating different approaches within the Java distributed object model. The CORBA Trader Service, UDDI, LDAP, and ebXML are other examples of discovery and information services described in Section 4.4. Management services include service state and lifetime support; examples include the CORBA Life Cycle and Persistent State services, the different Enterprise JavaBeans models, Jini's lifetime model, and a suite of web service specifications that
we will study further in Chapter 4.
We often term this collection of entity-level capabilities that extend the OSI stack the "Internet on the Internet", or the "Entity Internet built on the Bit Internet". The above describes a classic distributed computing model. Alongside the intense debate on the best ways of implementing distributed systems, there is competition from "centralized but still modular" approaches, where systems are built in terms of components in an Enterprise JavaBeans or equivalent approach.
The latter can have performance advantages and offer a "shared memory" model allowing more convenient exchange of information. However, the distributed model has two critical advantages: namely, higher performance (from multiple CPUs when communication is unimportant) and a cleaner separation of software functions, with clear software reuse and maintenance advantages. We expect the distributed model to gain in popularity as the default approach to software systems. Here the early CORBA and Java approaches to distributed systems are being replaced by the service model shown in Fig. 1.13.
Loose coupling and support of heterogeneous implementations make services more attractive than distributed objects. The architecture of this figure underlies modern systems, with typically two choices of service architecture: Web Services or REST systems. These are further discussed in Chapter 4 and have very distinct approaches to building reliable interoperable systems. In Web Services, one aims to fully specify all aspects of the service and its environment. This specification is carried with communicated messages using the SOAP protocol. The hosting environment then becomes a universal distributed operating system with fully distributed capability carried by SOAP messages.
Figure 1.13 Layered architecture for web services and grids. (The figure shows the same bit-level Internet stack topped by a Base Hosting Environment and the service layers: Service Interfaces, the Service Internet transport protocol, Service Discovery and Information, Service Management, Workflow, Generally Useful Services and Grids, and Application-Specific Services/Grids, grouped into service context, service Internet, and higher-level services.)
Experience has seen mixed success for this approach, as it has been hard to agree on key parts of the protocol and even harder to robustly and efficiently implement the universal processing of the protocol (by software like Apache Axis). In the REST approach, one adopts simplicity as the universal principle and delegates most of the hard problems to application (implementation-specific) software. In Web Service terms, REST has minimal information in the header, and the message body (which is opaque to generic message processing) carries all the needed information. REST architectures are clearly more appropriate to the rapidly evolving technology environments that we see today.
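To make the REST style concrete, here is a minimal sketch of "XML over HTTP" using only Python's standard library; the endpoint URL and element names are hypothetical placeholders, not services mentioned in this text.

    # Minimal REST-style "XML over HTTP" request (illustrative sketch).
    # The endpoint and element names below are hypothetical placeholders.
    import urllib.request
    import xml.etree.ElementTree as ET

    def get_resource(url):
        # All request semantics live in the URL; the XML body carries the payload.
        req = urllib.request.Request(url, headers={"Accept": "application/xml"})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()              # opaque to generic message processing
        return ET.fromstring(body)          # the application interprets the XML

    # Example usage (hypothetical service):
    # doc = get_resource("http://example.org/stations/42")
    # print(doc.findtext("temperature"))

The generic layer does nothing beyond moving bytes; all interpretation is left to the application, which is exactly the simplicity REST trades on.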
However, the ideas in Web Services are important and probably will be needed in mature systems at a different level in the stack (as part of the application). Note that REST can use XML schemas, but not those that are part of SOAP; "XML over HTTP" is a popular design choice. Above the communication and management layers, we have the capability to compose new entities or distributed programs by integrating several
entities together as sketched in Fig.1.14. In CORBA and Java, the distributed entities are linked with remote
procedure calls and the simplest way to build composite applications is to view the entities as objects and
use the traditional ways of linking them together. For Java, this could be as simple as writing a Java
program with method calls replaced by RMI (Remote Method Invocation) while CORBA supports a similar
model with a syntax reflecting the C++ style of its entity (object) interfaces.
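As a rough analogue of this RPC-style composition (a sketch in Python's standard xmlrpc module rather than the Java RMI or CORBA code itself; the service name, method, and port are invented for illustration):

    # Server side: expose an "entity" whose method can be invoked remotely.
    from xmlrpc.server import SimpleXMLRPCServer

    def add_readings(a, b):
        return a + b                        # a trivial operation offered by the entity

    server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
    server.register_function(add_readings, "add_readings")
    # server.serve_forever()                # uncomment to run the entity

    # Client side: compose entities by replacing local calls with remote ones.
    import xmlrpc.client
    proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    # total = proxy.add_readings(3, 4)      # looks like a local method call, runs remotely

The client code reads like an ordinary method call, which is precisely the appeal of the RMI/CORBA style of linking entities.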
Figure 1.14 Grids of Clouds and Grids, where SS refers to a sensor service and FS to a filter or transforming service. (The figure shows sensor services and data interchange services feeding filter services, filter clouds, compute, storage, and discovery clouds, databases, a traditional grid with exposed services, and other grids, transforming raw data into data, information, knowledge, wisdom, and decisions.)
There are also very many distributed programming models built on top of these basic constructs. For Web Services, workflow technologies are used to coordinate or orchestrate services, with special specifications used to define critical business process models such as two-phase transactions. In Section 4.2, we describe the general approach used in workflow, the BPEL Web Service standard, and several important workflow approaches: Pegasus, Taverna, Kepler, Trident, and Swift. In all approaches one is building collections of services which together tackle all or part of a problem. As always, one ends up with systems of systems as the basic architecture.
Allowing the term grid to refer to a single service or to represent a collection of services, we find the architecture of Fig. 1.14. Here sensors represent entities (such as instruments) that output data (as messages), and grids and clouds represent collections of services that have multiple message-based inputs and outputs. The figure emphasizes the system-of-systems or "Grids and Clouds of Grids and Clouds" architecture. Most distributed systems require a web interface or portal, shown in Fig. 1.14, and two examples (OGCE and HUBzero) are described in Section 4.3 using both Web Service (portlet) and Web 2.0 (gadget) technologies.
1.3.4 Peer-to-Peer Network Families
A well-established distributed system is the client-server architecture, in which client machines (PCs and workstations) are connected to a central server for compute, email, file access, and database applications. The peer-to-peer (P2P) architecture offers a distributed model of networked systems. A P2P network is client-oriented instead of server-oriented. In this section, we introduce P2P systems at the physical level and overlay networks at the logical level.
P2P Networks: In a P2P system, every node acts as both a client and a server, providing part of the system
resources. Peer machines are simply client computers connected to the Internet. All client machines act
autonomously to join or leave the system freely. This implies that no master-slave relationship exists among
the peers. No central coordination or central database is needed. In other words, no peer machine has a
global view of the entire P2P system. The system is self-organizing with distributed control.
The architecture of a P2P network is shown in Fig.1.15 at two abstraction levels. Initially, the peers
are totally unrelated. Each peer machine joins or leaves the P2P network voluntarily. Only the participating
peers form the physical network at any time. Unlike the cluster or grid, a P2P network does not use a
dedicated interconnection network. The physical network is simply an ad hoc network formed at various
Internet domains randomly using TCP/IP and NAI protocols. Thus, the physical network varies in size and
topology dynamically due to the free membership in the P2P network.
Figure 1.15 The structure of a peer-to-peer system formed by mapping a physical network to a virtual overlay network. (Courtesy of JXTA, http://www.jxta.com)
Overlay Networks: Data items or files are distributed among the participating peers. Based on communication or file-sharing needs, the peer IDs form an overlay network at the logical level. This overlay is a virtual network formed by mapping each physical machine to its ID, logically, through the virtual mapping shown in Fig. 1.15. When a new peer joins the system, its peer ID is added as a node in the overlay network. When an existing peer leaves the system, its peer ID is removed from the overlay network automatically. Therefore, it is the P2P overlay network that characterizes the logical connectivity among the peers.
There are two types of overlay networks: unstructured and structured. An unstructured overlay network is characterized by a random graph. There is no fixed route for sending messages or files among the nodes. Often, flooding is applied to send a query to all nodes in an unstructured overlay, thus ending up with heavy network traffic and nondeterministic search results. Structured overlay networks follow certain connectivity topologies and rules for inserting or removing nodes (peer IDs) from the overlay graph. Routing mechanisms are developed to take advantage of the structured overlays.
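Here is a toy sketch (our own simplification, not a production DHT such as Chord or Pastry) showing how a structured overlay makes lookup deterministic: peers and keys are hashed onto one identifier ring, and each key is served by the first peer ID at or after it.

    # Toy structured-overlay sketch: hash peers and keys onto one ID ring and
    # route each key to the first peer ID at or after it (wrapping around).
    import hashlib
    from bisect import bisect_left

    def ring_id(name, bits=16):
        # Map a peer name or data key onto a 2^bits identifier ring.
        return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (1 << bits)

    peers = sorted(ring_id(p) for p in ["peer-A", "peer-B", "peer-C", "peer-D"])

    def responsible_peer(key):
        # Deterministic lookup: no flooding, unlike an unstructured overlay.
        idx = bisect_left(peers, ring_id(key))
        return peers[idx % len(peers)]      # wrap around the ring

    print(responsible_peer("song.mp3"))     # same answer from any peer knowing the ring

When a peer joins or leaves, only the keys adjacent to its ring position move, which is what keeps structured overlays scalable.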
P2P Application Families: Based on applications, we classify P2P networks into four classes in Table 1.9. The first family is for distributed file sharing of digital content (music, video, etc.) on the P2P network. This includes many popular P2P networks like Gnutella, Napster, BitTorrent, etc. Collaboration P2P networks include MSN or Skype chatting, instant messaging, collaborative design, etc. The third family is for distributed P2P computing in specific applications. For example, SETI@home provides 25 Tflops of distributed computing power, collectively, over 3 million Internet host machines. Other P2P platforms, like JXTA, .NET, and FightingAID@home, support naming, discovery, communication, security, and resource aggregation in some P2P applications. We will study these topics in Chapters 5 and 8.
Table 1.9 Major Categories of Peer-to-Peer Network Families

System Features | Attractive Applications | Operational Problems | Example Systems
Distributed File Sharing | Content distribution of MP3 music, video, open software, etc. | Loose security and on-line copyright violations | Gnutella, Napster, eMule, BitTorrent, Aimster, KaZaA, etc.
Collaborative Platform | Instant messaging, collaborative design and gaming | Lack of trust, disturbed by spam, privacy, and peer collusions | ICQ, AIM, Groove, Magi, Multiplayer Games, Skype, etc.
Distributed P2P Computing | Scientific exploration and social networking | Security holes, selfish partners, and peer collusion | SETI@home, Geonome@home, etc.
Peer-to-Peer Platform | Open networks for public resources | Lack of standards or protection protocols | JXTA, .NET, FightingAid@home, etc.
P2P Computing Challenges: P2P computing faces three types of heterogeneity problems in hardware, software, and network requirements. There are too many hardware models and architectures to select from. Incompatibility exists between software and the OS. Different network connections and protocols make P2P too complex to apply in real applications. We need system scalability as the workload increases. System scaling is directly related to performance and bandwidth.
Data location also significantly affects collective performance. Data locality, network proximity, and interoperability are three design objectives in distributed applications. P2P performance is affected by routing efficiency and the self-organization of the participating peers. Fault tolerance, failure management, and load balancing are other important issues in using overlay networks. Lack of trust among the peers poses another problem; peers are strangers to each other. Security, privacy, and copyright violations are major worries for industry in applying P2P technology to business applications.
1.3.5 Virtualized Cloud Computing Infrastructure
Gordon Bell, Jim Gray, and Alex Szalay [3] have advocated: "Computational science is changing to be data-intensive. Supercomputers must be balanced systems, not just CPU farms but also petascale I/O and networking arrays." In the future, working with large data sets will typically mean sending the computations (programs) to the data, rather than copying the data to the workstations. This reflects the trend in IT of moving computing and data from desktops to large datacenters, where software, hardware, and data are provisioned on demand as a service. This data explosion has led to the idea of cloud computing.
Cloud computing has been defined differently by many users and designers. To cite just a few, IBM, being a major developer of cloud computing, has defined it as follows: "A cloud is a pool of virtualized computer resources. A cloud can host a variety of different workloads, including batch-style backend jobs and interactive, user-facing applications, allow workloads to be deployed and scaled out
quickly through the rapid provisioning of virtual machines or physical machines, support redundant, self-recovering, highly scalable programming models that allow workloads to recover from many unavoidable hardware/software failures, and monitor resource use in real time to enable rebalancing of allocations when needed."
Internet Clouds: Cloud computing applies a virtualized platform with elastic resources provisioned on demand in the form of hardware, software, and datasets, dynamically. The idea is to move desktop computing to a service-oriented platform using server clusters and huge databases at datacenters. Cloud computing leverages its low cost and simplicity to benefit both users and providers. Machine virtualization has enabled such cost-effectiveness. Cloud computing intends to satisfy many heterogeneous user applications simultaneously. The cloud ecosystem must be designed to be secure, trustworthy, and dependable.
Ian Foster defined cloud computing as follows: "A large-scale distributed computing paradigm that is driven by economies of scale, in which a pool of abstracted, virtualized, dynamically-scalable, managed computing power, storage, platforms, and services are delivered on demand to external customers over the Internet". Despite some minor differences among the above definitions, we identify six common characteristics of Internet clouds, as depicted in Fig. 1.16.
Figure 1.16 Concept of virtualized resource provisioning through the Internet cloud, where the hardware, software, storage, network, and services are put together to form a cloud platform.
(1) The cloud platform offers a scalable computing paradigm built around datacenters.
(2) Cloud resources are dynamically provisioned by datacenters upon user demand.
(3) The cloud system provides computing power, storage space, and flexible platforms for upgraded web-scale application services.
(4) Cloud computing relies heavily on the virtualization of all sorts of resources.
(5) Cloud computing defines a new paradigm for collective computing, data consumption, and delivery of information services over the Internet.
(6) Clouds stress the reduction of ownership cost in mega datacenters.
Basic Cloud Models: Traditionally, a distributed computing system tends to be owned and operated by an autonomous administrative domain (e.g., a research laboratory or company) for on-premises computing needs. However, these traditional systems have encountered several performance bottlenecks: constant system maintenance, poor utilization, and increasing costs associated with hardware/software upgrades. Cloud computing, as an on-demand computing paradigm, resolves or relieves these problems. In
Figure 1.17, we introduce the basic concepts of three cloud computing service models. More cloud details
are given in Chapters 7, 8 and 9.
Figure 1.17 Basic concept of cloud computing models and services provided
(Courtesy of IBM Corp. 2009)
Infrastructure as a Service (IaaS): This model allows users to use server, storage, network, and datacenter fabric resources. The user can deploy and run specific applications on multiple VMs running guest OSes. The user does not manage or control the underlying cloud infrastructure, but can specify when to request and release the needed resources.
Platform as a Service (PaaS): This model enables the user to deploy user-built applications onto a virtualized cloud platform. The platform includes both hardware and software integrated with specific programming interfaces. The provider supplies the API and software tools (e.g., Java, Python, Web 2.0, .NET). The user is freed from managing the underlying cloud infrastructure.
Software as a Service (SaaS): This refers to browser-initiated application software served to thousands of paid cloud customers. The SaaS model applies to business processes, industry applications, CRM (customer relationship management), ERP (enterprise resource planning), HR (human resources), and collaborative applications. On the customer side, there is no upfront investment in servers or software licensing. On the provider side, costs are rather low, compared with conventional hosting of user applications.
Internet clouds offer four deployment modes: private, public, managed, and hybrid [22]. These modes have different security implications. The different service-level agreements and service deployment modalities imply that security is a shared responsibility of the cloud providers, the cloud resource consumers, and the third-party cloud-enabled software providers. The advantages of cloud computing have been advocated by many IT experts, industry leaders, and computer science researchers.
Benefits of Outsourcing to The Cloud: Outsourcing local workload and/or resources to the cloud has
become an appealing alternative in terms of operational efficiency and cost effectiveness. This outsourcing practice particularly gains momentum from the flexibility of cloud services, with no lock-in contracts with the provider and the use of a pay-as-you-go pricing model. Clouds are primarily driven by economics: the pay-per-use pricing model is similar to that of basic utilities such as electricity, water, and gas. From the consumer's perspective, this pricing model for computing has relieved many issues in IT practices, such as the burden of new equipment purchases and the ever-increasing costs of operating computing facilities (e.g., salaries for technical support personnel and electricity bills).
Specifically, a sudden surge of workload can be dealt with effectively; this also has an economic benefit in that it helps avoid over-provisioning of resources for such a surge. From the provider's perspective, charges imposed for processing consumers' service requests, often exploiting underutilized resources, are an additional source of revenue. Since the cloud service provider has to deal with a diverse set of consumers, including both regular and new/one-off consumers, and their requests most likely differ from one another, the judicious scheduling of these requests plays a key role in the efficient use of resources, for the provider to maximize its profit and for the consumer to receive satisfactory service quality (e.g., response time). Recently, Amazon introduced EC2 Spot Instances, for which the pricing dynamically changes based on the demand-supply relationship (http://aws.amazon.com/ec2/spot-instances/). Accountability and security are two other major concerns associated with the adoption of clouds. These will be treated in Chapter 7.
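Returning to the spot-pricing example, the toy sketch below (our own simplification for illustration, not Amazon's actual pricing algorithm) clears spare capacity to the highest bidders and sets the spot price at the lowest accepted bid.

    # Toy spot-market sketch: spare capacity goes to the highest bidders, and the
    # spot price is the lowest bid that still wins capacity. Illustrative only;
    # this is not Amazon's real EC2 Spot pricing method.
    def spot_price(bids, capacity):
        winners = sorted(bids, reverse=True)[:capacity]
        return winners[-1] if winners else None   # lowest accepted bid clears the market

    bids = [0.12, 0.09, 0.30, 0.05, 0.11]         # $/instance-hour offered by consumers
    print(spot_price(bids, capacity=3))           # -> 0.11 when 3 instances are spare

As demand rises relative to spare capacity, the clearing price rises, which is the dynamic the spot-instance model exposes to consumers.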
Chapter 6 offers details of datacenter design, cloud platform architecture, and resource deployment; Chapter 7 covers the major cloud platforms built and the various cloud services being offered. Listed below are eight motivations for adopting the cloud to upgrade Internet applications and web services in general.
(1) Desired location in areas with protected space and better energy efficiency.
(2) Sharing of peak-load capacity among a large pool of users, improving overall utilization.
(3) Separation of infrastructure maintenance duties from domain-specific application development.
(4) Significant reduction in cloud computing cost, compared with traditional computing paradigms.
(5) Cloud computing programming and application development.
(6) Service and data discovery and content/service distribution.
(7) Privacy, security, copyright, and reliability issues.
(8) Service agreements, business models, and pricing policies.
Representative Cloud Providers: In Table 1.10, we summarize the features of three cloud platforms built up to 2008. The Google platform is a closed system, dynamically built over a cluster of servers selected from over 460,000 Google servers worldwide. This platform is proprietary in nature, programmable only by Google staff. Users must order the standard services through Google. The IBM BlueCloud offers a total system solution by selling the entire server cluster plus software packages for resource management and monitoring, WebSphere 2.0 applications, DB2 databases, and virtualization middleware. The third cloud platform is offered by Amazon as a custom-service utility cluster. Users lease a special subcluster configuration and storage space to run custom-coded applications.
The IBM BlueCloud allows cloud users to fill out a form defining their hardware platform, CPU, memory, storage, operating system, middleware, and team members and their associated roles. A SaaS bureau may order travel or secretarial services from a common cloud platform. The MSP coordinates service delivery and pricing according to user specifications. Many IT companies are now offering cloud computing services. We desire a software environment that provides many useful tools to build cloud applications over large datasets. Examples include MapReduce, BigTable, EC2, and 3S, as well as established environment packages like Hadoop, AWS, AppEngine, and WebSphere2. Details of these cloud systems are given in Chapters 7 and 8.
Table 1.10 Three Cloud Computing Platforms and Underlying Technologies [21]

Features | Google Cloud [18] | IBM BlueCloud [7] | Amazon Elastic Cloud
Architecture and Service Models applied | Highly scalable server clusters, GFS, and datacenters operating with PaaS or SaaS models | A server cluster with limited scalability for distributed problem solving and web-scale applications under a PaaS model | A 2000-node utility cluster (iDataPlex) for distributed computing/storage services under the IaaS model
Technology, Virtualization, and Reliability | Commodity hardware, application-level API, simple service, and high reliability | Custom hardware, open software, Hadoop library, virtualization with XEN and PowerVM, high reliability | e-commerce platform, virtualization based on XEN, and simple reliability
System Vulnerability and Security Resilience | Datacenter security is loose, no copyright protection, Google rewrites desktop applications for the web | WebSphere-2 security, PowerVM can be tuned for security protection, with access control and VPN support | Relies on PKI and VPN for authentication and access control, lacks security defense mechanisms
1.4 Performance, Security, and Energy-Efficiency
In this section, we introduce the fundamental design principles and rules of thumb for building
massively distributed computing systems. We study scalability, availability, programming models, and
security issues that are encountered in clusters, grids, P2P networks, and Internet clouds.
1.4.1 System Performance and Scalability Analysis
Performance metrics are needed to measure various distributed systems. We present various dimensions of scalability and performance laws. Then we examine system scalability against OS image count and the limiting factors encountered.
Performance Metrics: We have used CPU speed in MIPS and network bandwidth in Mbps in Section 1.3.1 to estimate processor and network performance. In a distributed system, performance is attributed to a large number of factors. The system throughput is often measured by the MIPS rate, Tflops (tera floating-point operations per second), TPS (transactions per second), etc. Other measures include the job response time and network latency.
We desire to use an interconnection network that has low latency and high bandwidth. System overhead is often attributed to OS boot time, compile time, I/O data rate, and the run-time support system used. Other performance-related metrics include the quality of service (QoS) for Internet and web services; system availability and dependability; and security resilience for system defense against network attacks. We will study some of these in the remaining subsections.
Dimensions of Scalability: We want to design a distributed system to achieve scalable performance. Any
resource upgrade in a system should be backward compatible with the existing hardware and software
resources. Overdesign may not be cost-effective. System scaling can increase or decrease resources
depending on many practical factors. We characterize the following dimensions of scalability in parallel
and distributed systems.
a) Size Scalability: This refers to achieving higher performance or more functionality by increasing the machine size. The word "size" refers to adding processors, cache, memory, storage, or I/O channels. The most obvious measure is simply counting the number of processors installed. Not all parallel computers or distributed architectures are equally size-scalable. For example, the IBM S2 was scaled up to 512 processors in 1997, while in 2008 the IBM BlueGene/L system could scale up to 65,000 processors.
b) Software Scalability: This refers to upgrades in the OS or compilers, adding mathematical and engineering libraries, porting new application software, and installing more user-friendly programming environments. Some software upgrades may not work with large system configurations. Testing and fine-tuning new software on a larger system is a non-trivial job.
c) Application Scalability: This refers to matching problem size scalability with machine size scalability. Problem size affects the size of the data set or the workload increase. Instead of increasing machine size, we enlarge the problem size to enhance system efficiency or cost-effectiveness.
d) Technology Scalability: This refers to a system that can adapt to changes in building technologies, such as the component and networking technologies discussed in Section 3.1. Scaling a system design with new technology must consider three aspects: time, space, and heterogeneity. Time refers to generation scalability: when changing to new-generation processors, one must consider the impact on the motherboard, power supply, packaging, cooling, etc. Based on past experience, most systems upgrade their commodity processors every three to five years. Space is more related to packaging and energy concerns. Heterogeneity scalability demands harmony and portability among different component suppliers.
Scalability vs. OS Image Count: In Fig. 1.18, we estimate scalable performance against the multiplicity of OS images in distributed systems deployed up to 2010. Scalable performance implies that the system can achieve higher speed by adding more processors or servers, enlarging the physical node memory size, extending the disk capacity, or adding more I/O channels, etc. The OS image count is the number of independent OS images observed in a cluster, grid, P2P network, or cloud. We include SMP and NUMA in the comparison. An SMP server has a single system image, which could be a single node in a large cluster. By the 2010 standard, the largest shared-memory SMP node has at most hundreds of processors. The low scalability of an SMP system is constrained by the packaging and the system interconnect used.
Figure 1.18 System scalability versus multiplicity of OS images in HPC clusters, MPP, and grids, and in HTC systems like P2P networks and the clouds. (The magnitude of scalability and OS image count are estimated based on system configurations deployed up to 2010. SMP and NUMA are included for comparison purposes.)
NUMA machines are often made out of SMP nodes with distributed shared memory. A NUMA machine can run with multiple operating systems, and it can scale to a few thousand processors communicating through the MPI library. For example, a NUMA machine may have 2048 processors running under 32 SMP operating systems; thus, there are 32 OS images in the 2048-processor NUMA system. Cluster nodes can be either SMP servers or high-end machines that are loosely coupled together. Therefore, clusters have much higher scalability than NUMA machines. The number of OS images in a cluster is counted by the number of cluster nodes concurrently in use. The cloud could be a virtualized cluster. By 2010, the largest cloud in commercial use can scale up to a few thousand VMs at most.
Given that many cluster nodes are SMP (multiprocessor) or multicore servers, the total number of processors or cores in a cluster system is one or two orders of magnitude greater than the number of OS images running in the cluster. The node in a computational grid could be a server cluster, a mainframe, a supercomputer, or a massively parallel processor (MPP). Therefore, the OS image count in a large grid structure could be hundreds or thousands of times smaller than the total number of processors in the grid. A P2P network can easily scale to millions of independent peer nodes, essentially desktop machines. The performance of a P2P file-sharing network depends on the quality of service (QoS) received in a public network. We plot the low-speed P2P networks in Fig. 1.18. Internet clouds are evaluated similarly to the way we assess cluster performance.
Amdahl’s Law: Consider the execution of a given program on a uniprocessor workstation with a total
execution time of T minutes. Now, the program has been parallelized or partitioned for parallel execution
on a cluster of many processing nodes. Assume that a fraction α of the code must be executed sequentially,
called the sequential bottleneck. Therefore, (1- α) of the code can be compiled for parallel execution by n
processors. The total execution time of the program is calculated by α T + (1-α)T/n , where the first term is
the sequential execution time on a single processor. The second term is the parallel execution time on n
processing nodes.
We will ignore all system or communication overheads, I/O time, or exception handling time in the following speedup analysis. Amdahl's Law states that the speedup factor of using the n-processor system over the use of a single processor is expressed by:
Speedup = S = T / [ αT + (1-α)T/ n ] = 1 / [ α + (1-α) /n ]
(1.1)
The maximum speedup of n is achieved only if the sequential bottleneck α is reduced to zero or the code is fully parallelizable with α = 0. As the cluster becomes sufficiently large, i.e. n → ∞, we have S = 1/α, an upper bound on the speedup S. Surprisingly, this upper bound is independent of the cluster size n. The sequential bottleneck is the portion of the code that cannot be parallelized. For example, the maximum achievable speedup is 4 if α = 0.25 (1-α = 0.75), even if we use hundreds of processors. Amdahl's law teaches us that we should make the sequential bottleneck as small as possible. Increasing the cluster size alone may not give us the speedup we expect.
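A short numerical sketch of Eq. (1.1) makes the bound visible; the values of α and n below are illustrative only.

    # Amdahl's Law (Eq. 1.1): speedup on n processors with sequential fraction alpha.
    def amdahl_speedup(alpha, n):
        return 1.0 / (alpha + (1.0 - alpha) / n)

    for n in (4, 16, 256, 10**6):
        print(n, round(amdahl_speedup(0.25, n), 2))   # approaches 1/alpha = 4 as n grows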
Problem with Fixed Workload: In Amdahl’s law, we have assumed the same amount of workload for
both sequential and parallel execution of the program with a fixed problem size or dataset. This was called
fixed-workload speedup by Hwang and Xu [14]. To execute a fixed workload on n processors, parallel
processing may lead to a system efficiency defined as follows:
E = S / n = 1 / [ α n + 1-α ]
(1.2)
Very often the system efficiency is rather low, especially when the cluster size is very large. To execute the
aforementioned program on a cluster with n = 256 nodes, an extremely low efficiency E = 1/[0.25 × 256 + 0.75] = 1.5% is observed. This is because only a few processors (say, 4) are kept busy, while the
majority of the nodes are left idling.
Scaled-Workload Speedup: To achieve higher efficiency in using a large cluster, we must consider scaling the problem size to match the cluster capability. This leads to the following speedup law proposed by John Gustafson (1988). Let W be the workload in a given program. When we use an n-processor system, we
scale the workload to W’ = αW+(1-α )nW. Note that only the parallelizable portion of the workload is
scaled n times in the second term. This scaled workload W’ is essentially the sequential execution time on a
single processor. The parallel execution time of W’ workload on n processors is kept at the level of the
original workload W. Thus, a scaled-workload speedup is defined as follows:
S’ = W’/W = [ αW+(1 – α )nW ] /W = α +(1 – α )n
(1.3)
This speedup is known as Gustafson’s Law. By fixing the parallel execution time at level W, we
achieve the following efficiency expression:
E’ = S’ / n = α/n + (1- α)
(1.4)
For the above program with a scaled workload, we can improve the efficiency of using a 256-node cluster to E' = 0.25/256 + 0.75 = 0.751. We should apply either Amdahl's Law or Gustafson's Law under different workload conditions. For a fixed workload, we apply Amdahl's Law. To solve scaled problems, we apply Gustafson's Law.
1.4.2 System Availability and Application Flexibility
In addition to performance, system availability and application flexibility are two other important design goals in a distributed computing system. We examine these two related concerns separately.
System Availability: High availability (HA) is desired in all clusters, grids, P2P, and cloud systems. A
system is highly available if it has long mean time to failure (MTTF) and short mean time to repair (MTTR).
System availability is formally defined as follows:
System Availability = MTTF / ( MTTF + MTTR )
(1.5)
System availability is attributed to many factors. All hardware, software, and network components may fail. Any failure that will pull down the operation of the entire system is called a single point of failure. The rule of thumb is to design a dependable computing system with no single point of failure. Adding hardware redundancy, increasing component reliability, and designing for testability will all help enhance system availability and dependability.
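A quick sketch of Eq. (1.5) shows how MTTF and MTTR trade off; the hour figures below are made-up illustrative values.

    # System availability (Eq. 1.5) for a few illustrative MTTF/MTTR pairs (in hours).
    def availability(mttf, mttr):
        return mttf / (mttf + mttr)

    for mttf, mttr in [(1000, 10), (1000, 1), (10000, 1)]:
        print(mttf, mttr, round(availability(mttf, mttr), 5))
    # Longer MTTF and shorter MTTR both push availability toward 1.0.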
In Fig. 1.19, we estimate the effects on system availability of scaling the system size in terms of the number of processor cores in a system. In general, as a distributed system increases in size, availability decreases due to a higher chance of failure and the difficulty of isolating failures. Both SMP and MPP are most vulnerable, being under the management of a single OS. Increasing the system size results in a higher chance of breakdown. The NUMA machine has limited improvement in availability over an SMP due to its use of multiple system managers.
Most clusters are designed to have high availability (HA) with failover capability, even as the cluster gets much bigger. Virtualized clouds form a subclass of the hosting server clusters at various datacenters; hence a cloud has an estimated availability similar to that of the hosting cluster. A grid is visualized as a hierarchical cluster of clusters; grids have even higher availability due to the isolation of faults. Therefore, clusters, clouds, and grids have decreasing availability as the system gets larger. A P2P file-sharing network has the highest aggregation of client machines. However, the peers operate essentially independently, with low availability even when many peer nodes depart or fail simultaneously.
Figure 1.19 Estimated effects on the system availability by the size of clusters, MPP, grids, P2P file-sharing networks, and computing clouds. (The estimate is based on reported experiences in hardware, OS, storage, network, and packaging technologies in available system configurations in 2010.)
1.4.3 Security Threats and Defense Technologies
Clusters, Grids, P2P, and Clouds all demand security and copyright protection. These are crucial to
their acceptance by a digital society. In this section, we introduce the system vulnerability, network threats,
defense countermeasures, and copyright protection in distributed or cloud computing systems.
Threats to Systems and Networks: Network viruses have threatened many users in widespread attacks
constantly. These incidents have created worm epidemics by pulling down many routers and servers. Such attacks have caused billions of dollars in losses to businesses, governments, and services. The various attack types and the potential damage to users are summarized in Fig. 1.20. Information leakage leads to a loss of confidentiality. Loss of data integrity may be caused by user alteration, Trojan horses, and service spoofing attacks. Denial of service (DoS) attacks result in a loss of system operation and Internet connections.
Lack of authentication or authorization leads to the illegitimate use of computing resources by attackers. Open resources like datacenters, P2P networks, and grid and cloud infrastructures could well become the next targets. We need to protect clusters, grids, clouds, and P2P systems; otherwise, users will not dare to use or trust them for outsourced work. Malicious intrusions into these systems may destroy valuable hosts, networks, and storage resources. Internet anomalies found in routers, gateways, and distributed hosts may hinder the acceptance of these public-resource computing services.
Security Responsibilities: We identify three security requirements for most Internet service providers and cloud users: confidentiality, integrity, and availability. As shown in Fig. 1.21, in the order of SaaS, PaaS, and IaaS, the providers gradually release the responsibility for security control to the cloud users. In summary, the SaaS model relies on the cloud provider to perform all security functions. At the other extreme, the IaaS model expects the users to assume almost all security functions, leaving only availability in the hands of the providers. The PaaS model relies on the provider to maintain data integrity and availability, but burdens the user with confidentiality and privacy control.
Figure 1.20 Various system attacks and network threats to cyberspace.
System Defense Technologies: Three generations of network defense technologies have appeared in the
past. In the first generation, tools were designed to prevent or avoid intrusions. These tools usually
manifested as access control policies or tokens, cryptographic systems, etc. However, the intruder can
always penetrate a secure system because there is always a weakest link in the security provisioning
process. The second generation detects intrusions in a timely manner so that remedial actions can be taken. These techniques
include firewalls, Intrusion Detection Systems (IDS), PKI service, reputation systems, etc. The third
generation provides more intelligent responses to intrusions.
Figure 1.21: Internet security responsibilities by cloud service providers and by the user mass.
Copyright Protection: Collusive piracy is the main source of intellectual property violations within the
boundary of a P2P network. Paid clients (colluders) may illegally share copyrighted content files with
unpaid clients (pirates). On-line piracy has hindered the use of open P2P networks for commercial content
delivery. One can develop a proactive content poisoning scheme to stop colluders and pirates from alleged
copyright infringements in P2P file sharing. Pirates are detected in a timely manner with identity-based signatures and
time-stamped tokens. The scheme stops collusive piracy without hurting legitimate P2P clients. We will
cover grid security, P2P reputation systems, and copyright-protection issues in Chapters 5 and 7.
Data Protection Infrastructure: A security infrastructure is needed to safeguard web and cloud services. At the user level, we need to perform trust negotiation and reputation aggregation over all users. At the application end, we need to establish security precautions for worm containment and intrusion detection against virus, worm, and DDoS attacks. We also need to deploy mechanisms to prevent on-line piracy and copyright violations of digital content. In Chapter 6, we will study reputation systems for protecting distributed systems and datacenters.
1.4.4 Energy-Efficiency in Distributed Computing
The primary performance goals in conventional parallel and distributed computing systems are high performance and high throughput, considering some form of performance reliability, e.g., fault tolerance and security. However, these systems have recently encountered new challenges, including energy efficiency and workload and resource outsourcing. These emerging issues are crucial not only in their own right, but also for the sustainability of large-scale computing systems in general. In this section, we review energy consumption issues in servers and HPC systems. The issue of workload and resource outsourcing for cloud computing is discussed. Then we introduce the protection issues of datacenters and explore solutions.
Energy consumption in parallel and distributed computing systems raises various monetary, environmental, and system performance issues. For example, the Earth Simulator and a Petaflop system are two examples with 12 and 100 megawatts of peak power, respectively. With an approximate price of 100 dollars per megawatt-hour, their energy costs during peak operation are 1,200 and 10,000 dollars per hour; this is
beyond the acceptable budget of many (potential) system operators. In addition to power cost, cooling is another issue that must be addressed due to the negative effects of high temperature on electronic components. The rising temperature of a circuit not only drives the circuit out of its normal operating range but also shortens the lifetime of its components.
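The cost figures above follow from a one-line calculation, sketched below; the energy price used is the assumed figure quoted in the text.

    # Hourly energy cost at peak power: cost ($/hour) = power (MW) x price ($/MWh).
    def hourly_energy_cost(power_mw, price_per_mwh=100.0):
        return power_mw * price_per_mwh

    print(hourly_energy_cost(12))     # Earth Simulator example: about $1,200 per hour
    print(hourly_energy_cost(100))    # Petaflop-class example: about $10,000 per hour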
Energy consumption of unused servers: To run a server farm (data center), a company has to spend a huge amount of money on hardware, software (software licences), operational support, and energy every year. Therefore, the company should thoroughly identify whether the installed server farm (more specifically, the volume of provisioned resources) is at an appropriate level, particularly in terms of utilization. Some analysts estimate that, on average, around one-sixth (15%) of the full-time servers in a company are left powered on without being actively used (i.e., idling) on a daily basis. This indicates that, with 44 million servers in the world, around 4.7 million servers are not doing any useful work.
The potential savings from turning off these servers are large: globally, $3.8 billion in energy costs alone and $24.7 billion in the total cost of running non-productive servers, according to a study by 1E Company in partnership with the Alliance to Save Energy (ASE). With respect to the environment, this energy waste is equal to emitting 11.8 million tons of carbon dioxide per year, which is equivalent to the CO2 pollution of 2.1 million cars. In the U.S., this comes to 3.17 million tons of carbon dioxide, or 580,678 cars. Therefore, the first step for IT departments is to analyze their servers to find unused and/or underutilized servers.
Reducing energy in active servers: In addition to identifying unused or under-utilized servers for energy savings, it is necessary to apply appropriate techniques to decrease energy consumption in active distributed systems with negligible influence on their performance. The power management issue in distributed computing platforms can be categorized into four layers (Fig. 1.22): the application layer, middleware layer, resource layer, and network layer.
Figure 1.22 Four operational layers of distributed computing systems
Application layer: Until now, most user applications in science, business, engineering, and financial areas have tended to increase speed or quality of performance. In introducing energy-aware applications, the challenge is how to design sophisticated multilevel and multi-domain energy management applications without hurting performance. The first step is to explore the relationship between performance and energy consumption. Indeed, the energy consumption of an application has a strong dependency on the number of instructions needed to execute the application and the number of transactions with the storage unit (or memory). These two factors (computation and storage) are correlated, and both affect application completion time.
Middleware layer: The middleware layer acts as a bridge between the application layer and the resource layer. This layer provides resource brokering, communication services, task analysis, task scheduling, security access, reliability control, and information services. This layer is well suited to applying energy-efficient techniques, particularly in task scheduling. Until recently, scheduling aimed to minimize a cost function, generally the makespan, i.e., the total execution time of a set of tasks. Distributed computing systems necessitate a new cost function covering both makespan and energy consumption.
Resource layer: The resource layer consists of a wide range of resources, including computing nodes and storage units. This layer generally interacts with the hardware devices and the operating system, and is therefore responsible for controlling all distributed resources in distributed computing systems. In the recent past, several mechanisms have been developed for more efficient power management of hardware and operating systems. The majority of them are hardware approaches, particularly for processors. Dynamic power
management (DPM) and dynamic voltage-frequency scaling (DVFS) are two popular methods incorporated in recent computer hardware systems. In DPM, hardware devices, such as the CPU, have the capability to switch from idle mode to one or more lower-power modes. In DVFS, energy savings are achieved based on the fact that the power consumption of CMOS circuits is directly related to frequency and to the square of the supply voltage. In this case, the execution time and power consumption are controllable by switching among different frequencies and voltages. Figure 1.23 shows the principle of the DVFS method. This method enables the exploitation of the slack time (idle time) typically incurred by inter-task relationships (e.g., precedence constraints) [24]. Specifically, the slack time associated with a task is utilized to execute the task at a lower voltage-frequency setting. The relationship between energy and voltage-frequency in CMOS circuits is given by the following expression:
circuits is related by the following expression:
E
f
2
 C eff fv t
 K (v  vt )
v
(1.6)
2
where v, Ceff, K, and vt are the voltage, circuit switching capacity, a technology dependent factor, and
threshold voltage, respectively. The parameter t is the execution time of the task under clock frequency f .
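A small numeric sketch of Eq. (1.6) may help; the constants C_eff, K, v_t and the cycle count below are illustrative assumptions rather than values from this chapter, and the point is only that the energy of a fixed amount of work scales with the square of the supply voltage while the execution time grows as the frequency drops.

# Minimal numeric sketch of Eq. (1.6): E = C_eff * f * v^2 * t and
# f = K * (v - v_t)^2 / v. The constants C_EFF, K, V_T and the cycle
# count are illustrative assumptions, not values given in the text.

C_EFF = 1.0e-9    # effective switching capacitance (farads), assumed
K = 1.0e9         # technology-dependent factor, assumed
V_T = 0.3         # threshold voltage (volts), assumed
CYCLES = 2.0e9    # task length in clock cycles, assumed

def frequency(v):
    """Clock frequency reachable at supply voltage v (second part of Eq. 1.6)."""
    return K * (v - V_T) ** 2 / v

def energy(v):
    """Energy to complete the task at supply voltage v (first part of Eq. 1.6)."""
    f = frequency(v)
    t = CYCLES / f                 # execution time grows as the frequency drops
    return C_EFF * f * v ** 2 * t  # equals C_EFF * v^2 * CYCLES

for v in (1.2, 1.0, 0.8):
    print("v = %.1f V, f = %.2f GHz, E = %.2f J"
          % (v, frequency(v) / 1e9, energy(v)))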
By reducing the voltage and frequency, the energy consumption of a device can be reduced. However, both DPM and DVFS techniques may have negative effects on the power consumption of a device in both active and idle states, and they create a transition overhead for switching between states or voltage/frequency levels. The transition overhead is especially important in the DPM technique: if the transition latencies between lower-power modes are assumed to be negligible, then energy can be saved by simply switching between these modes. However, this assumption is rarely valid, and therefore switching between low-power modes affects performance.
Figure 1.23 The DVFS technique: original task (right) and voltage-frequency scaled task (left). (Courtesy of R. Ge, et al., "Performance-Constrained Distributed DVS Scheduling for Scientific Applications on Power-Aware Clusters", Proc. of ACM Supercomputing Conf., Washington, DC, 2005 [17].)
Another important issue in the resource layer is the storage area. Storage units interact heavily with the computing nodes, and this large volume of interactions keeps the storage units almost always active, which results in large energy consumption. Storage devices account for about 27% of the total energy consumption in a datacenter, and this share is growing rapidly because storage needs increase by about 60% annually.
Network layer: Routing and transferring packets and providing network services to the resource layer are the main responsibilities of the network layer in distributed computing systems. The major challenge in building energy-efficient networks is, again, how to measure, predict, and balance energy consumption against performance. Two major challenges in designing energy-efficient networks are identified below:
 The models should represent the networks comprehensively, as they should give a full understanding
of interactions between time, space and energy.
 New energy-efficient routing algorithms need to be developed, and new energy-efficient protocols should be designed to withstand network attacks.
As information resources drive economic and social development, datacenters become increasingly important as the places where information is stored and processed and where services are provided. Datacenters become another core infrastructure, just like the power grid and transportation systems. Traditional datacenters suffer from high construction and operational costs, complex resource management, poor usability, low security and reliability, and huge energy consumption. It is necessary to adopt new technologies in next-generation datacenter designs, as studied in Chapter 7.
1.5 References and Homework Problems
In the past four decades, parallel processing and distributed computing have been hot topics for research and development. Earlier work in this area was treated in several classic books [1, 11, 20, 21]. More recent coverage can be found in newer books [6, 13, 14, 16, 18, 26] published after 2000. Cluster computing was covered in [21, 27] and grid computing in [3, 4, 14, 34]. P2P networks are introduced in [13, 33]. Cloud computing is studied in [7-10, 15, 19, 22, 23, 31]. Virtualization techniques are treated in [28-30]. Distributed algorithms and parallel programming are studied in [2, 12, 18, 21, 25]. Distributed operating systems and software tools are covered in [5, 32]. Energy efficiency and power management are studied in [17, 24, 35]. Clusters serve as the foundation of distributed and cloud computing. All of these topics will be studied in more detail in subsequent chapters.
References
[1] G. Almasi and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 1989.
[2] G. Andrews, Foundations of Multithreaded, Parallel and Distributed Programming, Addison-Wesley, 2000.
[3] G. Bell, J. Gray, and A. Szalay, "Petascale Computational Systems: Balanced Cyberstructure in a Data-Centric World", IEEE Computer Magazine, 2006.
[4] F. Berman, G. Fox, and T. Hey (editors), Grid Computing, Wiley and Sons, 2003, ISBN 0-470-85319-0.
[5] M. Bever, et al., "Distributed Systems, OSF DCE, and Beyond", in DCE - The OSF Distributed Computing Environment, A. Schill (editor), Berlin, Springer-Verlag, pp. 1-20, 1993.
[6] K. Birman, Reliable Distributed Systems: Technologies, Web Services, and Applications, Springer-Verlag, 2005.
[7] G. Boss, et al., "Cloud Computing - The BlueCloud Project", www.ibm.com/developerworks/websphere/zones/hipods/, Oct. 2007.
[8] R. Buyya, C. Yeo, and S. Venugopal, "Market-Oriented Cloud Computing: Vision, Hype, and Reality for Delivering IT Services as Computing Utilities", 10th IEEE Int'l Conf. on High Performance Computing and Communications, Sept. 2008.
[9] F. Chang, et al., "Bigtable: A Distributed Storage System for Structured Data", OSDI 2006.
[10] T. Chou, Introduction to Cloud Computing: Business and Technology, Lecture Notes at Stanford University and at Tsinghua University, Active Book Press, 2010.
[11] D. Culler, J. Singh, and A. Gupta, Parallel Computer Architecture, Morgan Kaufmann Publishers, 1999.
[12] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Proc. of OSDI 2004.
[13] J. Dollimore, T. Kindberg, and G. Coulouris, Distributed Systems: Concepts and Design (4th Edition), Addison-Wesley, May 2005, ISBN-10 0321263545.
[14] J. Dongarra, et al. (editors), Sourcebook of Parallel Computing, Morgan Kaufmann, 2003.
[15] I. Foster, Y. Zhao, I. Raicu, and S. Lu, "Cloud Computing and Grid Computing 360-Degree Compared", Grid Computing Environments Workshop, 12-16 Nov. 2008.
[16] V. K. Garg, Elements of Distributed Computing, Wiley-IEEE Press, 2002.
[17] R. Ge, X. Feng, and K. W. Cameron, "Performance-Constrained Distributed DVS Scheduling for Scientific Applications on Power-Aware Clusters", Proc. Supercomputing Conf., Washington, DC, 2005.
[18] S. Ghosh, Distributed Systems - An Algorithmic Approach, Chapman & Hall/CRC, 2007.
[19] Google, Inc., "Google and the Wisdom of Clouds", http://www.businessweek.com/magazine/content/0752/b4064048925836.htm
[20] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993.
[21] K. Hwang and Z. Xu, Scalable Parallel Computing, McGraw-Hill, 1998.
[22] K. Hwang, S. Kulkarni, and Y. Hu, "Cloud Security with Virtualized Defense and Reputation-Based Trust Management", IEEE Conf. on Dependable, Autonomic, and Secure Computing (DASC 2009), Chengdu, China, Dec. 14, 2009.
[23] K. Hwang and D. Li, "Security and Data Protection for Trusted Cloud Computing", IEEE Internet Computing, September 2010.
[24] Kelton Research, "1E / Alliance to Save Energy Server Energy & Efficiency Report", http://www.1e.com/EnergyCampaign/downloads/Server_Energy_and_Efficiency_Report_2009.pdf, Sept. 2009.
[25] Y. C. Lee and A. Y. Zomaya, "A Novel State Transition Method for Metaheuristic-Based Scheduling in Heterogeneous Computing Systems", IEEE Trans. Parallel and Distributed Systems, Sept. 2008.
[26] D. Peleg, Distributed Computing: A Locality-Sensitive Approach, SIAM Publisher, 2000.
[27] G. F. Pfister, In Search of Clusters (2nd Edition), Prentice-Hall, 2001.
[28] M. Rosenblum and T. Garfinkel, "Virtual Machine Monitors: Current Technology and Future Trends", IEEE Computer, May 2005, pp. 39-47.
[29] M. Rosenblum, "Recent Advances in Virtual Machines and Operating Systems", Keynote Address, ACM ASPLOS 2006.
[30] J. Smith and R. Nair, Virtual Machines, Morgan Kaufmann, 2005.
[31] B. Sotomayor, R. Montero, and I. Foster, "Virtual Infrastructure Management in Private and Hybrid Clouds", IEEE Internet Computing, Sept. 2009.
[32] A. Tanenbaum, Distributed Operating Systems, Prentice-Hall, 1995.
[33] I. Taylor, From P2P to Web Services and Grids, Springer-Verlag, London, 2005.
[34] M. Waldrop, "Grid Computing", IEEE Computer Magazine, 2000.
[35] Z. Zong, "Energy-Efficient Resource Management for High-Performance Computing Platforms", PhD Dissertation, Auburn University, August 9, 2008.
Homework Problems
Problem 1.1: Map the ten abbreviated terms and system models on the left with the best-match descriptions on the right. Just enter the description label (a, b, c, ..., j) in the underlined blank in front of each term.
________ Globus
________ BitTorrent
________ Gnutella
________ EC2
________ TeraGrid
________ EGEE
________ Hadoop
________ SETI@home
________ Napster
________ Bigtable
(a) A scalable software platform promoted by Apache for web users to write and run applications over vast amounts of distributed data.
(b) A P2P network for MP3 music delivery using a centralized directory server.
(c) The programming model and associated implementation by Google for distributed mapping and reduction of very large data sets.
(d) A middleware library jointly developed by USC/ISI and Argonne National Lab. for Grid resource management and job scheduling.
(e) A distributed storage program by Google for managing structured data that can scale to very large size.
(f) A P2P file-sharing network using multiple file index trackers.
(g) A critical design goal of clusters of computers to tolerate nodal faults or recover from host failures.
(h) The service architecture specification as an open Grid standard.
(i) An elastic and flexible computing environment that allows web application developers to acquire cloud resources effectively.
(j) A P2P Grid over 3 million desktops for distributed signal processing in search of extra-terrestrial intelligence.
Problem 1.2: Circle only one correct answer in each of the following questions.
(1) In today's Top 500 list of the fastest computing systems, which architecture class dominates the population?
a. Symmetric shared-memory multiprocessor systems
b. Centralized massively parallel processor (MPP) systems
c. Clusters of cooperative computers
(2) Which of the following software packages is particularly designed as a distributed storage management system for scalable datasets over Internet clouds?
a. MapReduce
b. Hadoop
c. Bigtable
(3) Which global network system was best designed to eliminate isolated resource islands?
a. The Internet for computer-to-computer interaction using the Telnet command
b. The Web service for page-to-page visits using the http:// command
c. The Grid service using middleware to establish interactions between applications running on a federation of cooperative machines
(4) Which of the following software tools is specifically designed for scalable storage services in distributed cloud computing applications?
a. Amazon EC2
b. Amazon S3
c. Apache Hadoop library
(5) In a cloud formed by a cluster of servers, the servers must be selected as follows:
a. All cloud machines must be built on physical servers
b. All cloud machines must be built with virtual servers
c. The cloud machines can be either physical or virtual servers
Problem 1.3: Content delivery networks have gone through three generations of development: namely the
client-server architecture, massive network of content servers, and P2P networks. Discuss the advantages
and shortcomings of using these content delivery networks.
Problem 1.4: Conduct a deeper study of the three cloud platform models presented in Table 1.6. Compare their advantages and shortcomings in the development of distributed applications on each cloud platform. The material in Table 1.7 and Table 1.8 is useful in your assessment.
Problem 1.5: Consider parallel execution of an MPI-coded C program in SPMD (single program and
multiple data streams) mode on a server cluster consisting of n identical Linux servers. SPMD mode means
that the same MPI program is running simultaneously on all servers but over different data sets of identical
workload. Assume that 25% of the program execution is attributed to the execution of MPI commands. For
simplicity, assume that all MPI commands take the same amount of execution time. Answer the following
questions using Amdahl’s law:
(a) Given that the total execution time of the MPI program on a 4-server cluster is T minutes, what is the speedup factor of executing the same MPI program on a 256-server cluster, compared with using the 4-server cluster? Assume that the program execution is deadlock-free and ignore all other run-time execution overheads in the calculation.
(b) Suppose that all MPI commands are now enhanced by a factor of 2 by using active messages executed by message handlers at the user space. The enhancement can reduce the execution time of all MPI commands by half. What is the speedup of the 256-server cluster installed with this MPI enhancement, compared with the old 256-server cluster without the MPI enhancement?
Problem 1.6: Consider a program to multiply two large-scale N x N matrices, where N is the matrix size. The sequential multiply time on a single server is T1 = c N^3 minutes, where c is a constant determined by the server used. An MPI-coded parallel program requires Tn = c N^3 / n + d N^2 / n^0.5 minutes to complete execution on an n-server cluster system, where d is a constant determined by the MPI version used. You can assume the program has a zero sequential bottleneck (α = 0). The second term in Tn accounts for the total message-passing overhead experienced by the n servers.
Answer the following questions for a given cluster configuration with n = 64 servers, c = 0.8, and d = 0.1. Parts (a, b) have a fixed workload corresponding to the matrix size N = 15,000. Parts (c, d) have a scaled workload associated with an enlarged matrix size N' = n^(1/3) N = 64^(1/3) x 15,000 = 4 x 15,000 = 60,000. Assume the same cluster configuration to process both workloads, so the system parameters n, c, and d stay unchanged. When running the scaled workload, the overhead also increases with the enlarged matrix size N'.
(a) Using Amdahl's law, calculate the speedup of the n-server cluster over a single server.
(b) What is the efficiency of the cluster system used in Part (a)?
(c) Calculate the speedup in executing the scaled workload for an enlarged N' x N' matrix on the same cluster configuration using Gustafson's law.
(d) Calculate the efficiency of running the scaled workload in Part (c) on the 64-processor cluster.
(e) Compare the above speedup and efficiency results and comment on their implications.
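For readers who want to sanity-check results of this kind, the two speedup laws can be evaluated with a short Python helper; the values of n and the sequential fraction alpha below are illustrative placeholders, not the answers to Problems 1.5 and 1.6.

# Helpers for the two speedup laws used in Problems 1.5 and 1.6.
# The values of n and alpha below are illustrative placeholders,
# not the answers to the problems.

def amdahl_speedup(alpha, n):
    """Fixed-workload speedup on n servers; alpha is the sequential fraction."""
    return 1.0 / (alpha + (1.0 - alpha) / n)

def gustafson_speedup(alpha, n):
    """Scaled-workload speedup on n servers; alpha is the sequential fraction."""
    return alpha + (1.0 - alpha) * n

def efficiency(speedup, n):
    return speedup / n

n, alpha = 64, 0.05   # illustrative values only
sa = amdahl_speedup(alpha, n)
sg = gustafson_speedup(alpha, n)
print("Amdahl:    speedup = %.1f, efficiency = %.1f%%" % (sa, 100 * efficiency(sa, n)))
print("Gustafson: speedup = %.1f, efficiency = %.1f%%" % (sg, 100 * efficiency(sg, n)))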
Problem 1.7: Cloud computing is an emerging distributed computing paradigm. An increasing number of
organizations in industry and business sectors adopt cloud systems as their system of choice. Answer the
following questions on cloud computing.
(a) List and describe main characteristics of cloud computing systems.
(b) Discuss key enabling technologies in cloud computing systems.
(c) Discuss different ways for cloud service providers to maximize their revenue.
Problem 1.8: Compare the similarities and differences between traditional computing clusters/grids and the computing clouds launched in recent years. You should consider all technical and economic aspects as listed below. Answer the following questions against real example systems or platforms built in recent years. Also discuss the possible convergence of the two computing paradigms in the future.
(a) Hardware, software, and networking support
(b) Resource allocation and provisioning methods
(c) Infrastructure management and protection
(d) Support of utility computing services
(e) Operational and cost models applied
Problem 1.9: Answer the following questions on personal computing (PC) and high-performance
computing (HPC) systems:
(a) Explain why the changes in personal computing (PC) and high-performance computing (HPC) were evolutionary rather than revolutionary in the past 30 years.
(b) Discuss the drawbacks of disruptive changes in processor architecture. Why is the memory wall a major problem in achieving scalable performance?
(c) Explain why x86 processors still dominate the PC and HPC markets.
Problem 1.10: Multi-core and many-core processors have appeared in widespread use in both desktop computers and HPC systems. Answer the following questions on the use of advanced processors, memory devices, and system interconnects.
(a) What are the differences between multi-core CPUs and GPUs in architecture and usage?
(b) Explain why parallel programming cannot keep pace with the progress of processor technology.
(c) Suggest ideas, and defend your argument with some plausible solutions, for this mismatch problem between core scaling and the effective programming and use of multicores.
(d) Explain why flash-memory SSDs can deliver better speedups in some HPC or HTC applications.
(e) Justify the prediction that InfiniBand and Ethernet will continue to dominate the HPC market.
Problem 1.11 Compare the HPC and HTC computing paradigms and systems. Discuss their commonality
and differences in hardware and software support and application domains.
Problem 1.12 Describe the roles of multicore processors, memory chips, solid-state drives, and disk arrays in building current and future distributed and cloud computing systems.
Problem 1.13 What are the development trends of operating systems and programming paradigms in modern
distributed systems and cloud computing platforms?
Problem 1.14 Distinguish P2P networks from Grids and P2P Grids by filling in the missing entries in Table 1.11. Some entries are already given. You need to study the entries in Table 1.3, Table 1.5, and Table 1.9 before you try to distinguish these systems precisely.

Table 1.11 Comparison among P2P Networks, Grids, and P2P Grids

Applications and Peer or Node Roles
  P2P Networks: Distributed file sharing, content distribution; peer machines acting as both clients and servers
  Grid Systems: ____________
  P2P Grids: ____________
System Control and Service Model
  P2P Networks: ____________
  Grid Systems: Policy-based control in a grid infrastructure; all services from client machines
  P2P Grids: ____________
System Connectivity
  P2P Networks: ____________
  Grid Systems: Static connections with high-speed links over grid resource sites
  P2P Grids: ____________
Resource Discovery and Job Management
  P2P Networks: Autonomous peers without discovery; no use of a central job scheduler
  Grid Systems: ____________
  P2P Grids: ____________
Representative Systems
  P2P Networks: ____________
  Grid Systems: NSF TeraGrid, UK EGEE Grid, China Grid
  P2P Grids: ____________

Problem 1.15: Explain the impacts of machine virtualization on business computing and HPC systems. Discuss the major advantages and disadvantages in the following challenge areas:
(a) Why are virtual machines and virtual clusters suggested in cloud computing systems?
(b) What are the breakthrough areas needed to build virtualized cloud systems cost-effectively?
(c) What are your observations of the impact of cloud platforms on the future of the HPC industry?

Problem 1.16: Briefly explain each of the following cloud computing services. Identify two cloud providers in each service category.
(a) Application cloud services
(b) Platform cloud services
(c) Compute and storage services
(d) Co-location cloud services
(e) Network cloud services
Problem 1.17: Briefly explain the following terms associated with network threats or security defenses in a distributed computing system:
(a) Denial of service (DoS)
(b) Trojan horse
(c) Network worms
(d) Masquerade
(e) Eavesdropping
(f) Service spoofing
(g) Authorization
(h) Authentication
(i) Data integrity
(j) Confidentiality
Problem 1.18: Briefly answer the following questions on green information technology and energy efficiency in distributed systems. You can find answers in later chapters or search over the Web.
(a) Why is power consumption critical to datacenter operations?
(b) Justify Equation (1.6) by reading a cited information source.
(c) What is the dynamic voltage-frequency scaling (DVFS) technique?
Problem 1.19: Distinguish the following terminologies associated with multithreaded processor architectures:
(a) What is a fine-grain multithreading architecture? Identify two example processors.
(b) What is a coarse-grain multithreading architecture? Identify two example processors.
(c) What is a simultaneous multithreading (SMT) architecture? Identify two example processors.
Problem 1.20: Characterize the following three cloud computing models:
(a) What is an IaaS (Infrastructure as a Service) cloud? Give one example system.
(b) What is a PaaS (Platform as a Service) cloud? Give one example system.
(c) What is a SaaS (Software as a Service) cloud? Give one example system.
INTRODUCTION TO CLOUD
COMPUTING
CLOUD COMPUTING IN A NUTSHELL
Computing itself, to be considered fully virtualized, must allow computers to
be built from distributed components such as processing, storage, data, and
software resources.
Technologies such as cluster, grid, and now, cloud computing, have all
aimed at allowing access to large amounts of computing power in a fully
virtualized manner, by aggregating resources and offering a single system
view. Utility computing describes a business model for on-demand delivery of
computing power; consumers pay providers based on usage ("pay-as-you-go"), similar to the way in which we currently obtain services from traditional public utility services such as water, electricity, gas, and telephony.
Cloud computing has been coined as an umbrella term to describe a
category of sophisticated on-demand computing services initially offered by
commercial providers, such as Amazon, Google, and Microsoft. It denotes a
model in which a computing infrastructure is viewed as a "cloud," from which businesses and individuals access applications from anywhere in the world on demand. The main principle behind this model is offering computing, storage, and software "as a service."
Many practitioners in the commercial and academic spheres have attempted to define exactly what "cloud computing" is and what unique characteristics it presents. Buyya et al. have defined it as follows: "Cloud is a parallel and distributed computing system consisting of a collection of inter-connected and virtualised computers that are dynamically provisioned and presented as one or more unified computing resources based on service-level agreements (SLA) established through negotiation between the service provider and consumers."
Vaquero et al. have stated "clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically reconfigured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized Service Level Agreements."
A recent McKinsey and Co. report claims that "Clouds are hardware-based services offering compute, network, and storage capacity where: Hardware management is highly abstracted from the buyer, buyers incur infrastructure costs as variable OPEX, and infrastructure capacity is highly elastic."
A report from the University of California Berkeley summarized the key characteristics of cloud computing as: "(1) the illusion of infinite computing resources; (2) the elimination of an up-front commitment by cloud users; and (3) the ability to pay for use . . . as needed."
The National Institute of Standards and Technology (NIST) characterizes cloud computing as "... a pay-per-use model for enabling available, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."
In a more generic definition, Armbrust et al. define cloud as the "data center hardware and software that provide services." Similarly, Sotomayor et al. point out that "cloud" is more often used to refer to the IT infrastructure deployed on an Infrastructure as a Service provider data center. While there are countless other definitions, there seem to be common characteristics among the most notable ones listed above, which a cloud should have: (i) pay-per-use (no ongoing commitment, utility prices); (ii) elastic capacity and the illusion of infinite resources; (iii) a self-service interface; and (iv) resources that are abstracted or virtualised.
ROOTS OF CLOUD COMPUTING
We can track the roots of cloud computing by observing the advancement of
several technologies, especially in hardware (virtualization, multi-core chips),
Internet technologies (Web services, service-oriented architectures, Web 2.0),
distributed computing (clusters, grids), and systems management (autonomic
computing, data center automation). Figure 1.1 shows the convergence of
technology fields that significantly advanced and contributed to the advent
of cloud computing.
Some of these technologies have been tagged as hype in their early stages
of development; however, they later received significant attention from
academia and were sanctioned by major industry players. Consequently, a
specification and standardization process followed, leading to maturity and
wide adoption. The emergence of cloud computing itself is closely linked to
the maturity of such technologies. We present a closer look at the technologies
that form the base of cloud computing, with the aim of providing a clearer
picture of the cloud ecosystem as a whole.
From Mainframes to Clouds
We are currently experiencing a switch in the IT world, from in-house generated computing power to utility-supplied computing resources delivered over the Internet as Web services. This trend is similar to what occurred about a century ago when factories, which used to generate their own electric power, realized that it was cheaper to just plug their machines into the newly formed electric power grid.
Computing delivered as a utility can be defined as "on demand delivery of infrastructure, applications, and business processes in a security-rich, shared, scalable, and based computer environment over the Internet for a fee".
FIGURE 1.1. Convergence of various advances leading to the advent of cloud computing: hardware (hardware virtualization, multi-core chips), Internet technologies (SOA, Web services, Web 2.0, mashups), distributed computing (utility and grid computing), and systems management (autonomic computing, data center automation).
This model brings benefits to both consumers and providers of IT services.
Consumers can attain a reduction in IT-related costs by choosing to obtain cheaper services from external providers as opposed to heavily investing in IT infrastructure and personnel hiring. The "on-demand" component of this
model allows consumers to adapt their IT usage to rapidly increasing or
unpredictable computing needs.
Providers of IT services achieve better operational costs; hardware and
software infrastructures are built to provide multiple solutions and serve many
users, thus increasing efficiency and ultimately leading to faster return on
investment (ROI) as well as lower total cost of ownership (TCO).
The mainframe era collapsed with the advent of fast and inexpensive
microprocessors and IT data centers moved to collections of commodity servers.
The advent of increasingly fast fiber-optics networks has relit the fire, and
new technologies for enabling sharing of computing power over great distances
have appeared.
SOA, Web Services, Web 2.0, and Mashups
Web services:
• allow applications running on different messaging product platforms to interoperate
• enable information from one application to be made available to others
• enable internal applications to be made available over the Internet
SOA:
• addresses the requirements of loosely coupled, standards-based, and protocol-independent distributed computing
• builds on WS, HTTP, and XML as a common mechanism for delivering services
• views an application as a collection of services that together perform complex business logic
• serves as a building block in IaaS (e.g., user authentication, payroll management, calendar services)
Grid Computing
Grid computing enables aggregation of distributed resources and transparent access to them. Most production grids such as TeraGrid and EGEE seek to
share compute and storage resources distributed across different administrative
domains, with their main focus being speeding up a broad range of scientific
applications, such as climate modeling, drug design, and protein analysis.
Globus Toolkit is a middleware that implements several standard Grid
services and over the years has aided the deployment of several service-oriented
Grid infrastructures and applications. An ecosystem of tools is available to
interact with service grids, including grid brokers, which facilitate user
interaction with multiple middleware and implement policies to meet QoS
needs.
Virtualization technology has been identified as a perfect fit for issues that have caused frustration when using grids, such as hosting many dissimilar software applications on a single physical platform. In this direction, a number of research projects have explored adding a virtualization layer on top of grid resources.
Utility Computing
In utility computing environments, users assign a ―utility‖ value to their jobs,
where utility is a fixed or time-varying valuation that captures various QoS
constraints (deadline, importance, satisfaction). The valuation is the amount
they are willing to pay a service provider to satisfy their demands. The service
providers then attempt to maximize their own utility, where said utility may
directly correlate with their profit. Providers can choose to prioritize high yield
(i.e., profit per unit of resource) user jobs, leading to a scenario where shared
systems are viewed as a marketplace, where users compete for resources based
on the perceived utility or value of their jobs.
Hardware Virtualization
The idea of virtualizing a computer system's resources, including processors, memory, and I/O devices, has been well established for decades, aiming at improving the sharing and utilization of computer systems. Hardware
virtualization allows running multiple operating systems and software stacks on
a single physical platform. As depicted in Figure 1.2, a software layer, the
virtual machine monitor (VMM), also called a hypervisor, mediates access to
the physical hardware presenting to each guest operating system a virtual
machine (VM), which is a set of virtual platform interfaces.
FIGURE 1.2. A hardware virtualized server hosting three virtual machines, each one running a distinct operating system and user-level software stack (e.g., an e-mail server, a database, a Web server, or a Ruby on Rails/Java application on a Linux guest OS) on top of the virtual machine monitor (hypervisor) and the physical hardware.
Workload isolation is achieved since all program instructions are fully
confined inside a VM, which leads to improvements in security. Better
reliability is also achieved because software failures inside one VM do not
affect others. Moreover, better performance control is attained since the execution of one VM should not affect the performance of another VM.
VMWare ESXi. VMware is a pioneer in the virtualization market. Its ecosystem
of tools ranges from server and desktop virtualization to high-level
management tools. ESXi is a VMM from VMWare. It is a bare-metal
hypervisor, meaning that it installs directly on the physical server, whereas
others may require a host operating system.
Xen. The Xen hypervisor started as an open-source project and has served as a
base to other virtualization products, both commercial and open-source. In addition to an open-source distribution, Xen currently forms the base of
commercial hypervisors of a number of vendors, most notably Citrix
XenServer and Oracle VM.
KVM. The kernel-based virtual machine (KVM) is a Linux virtualization subsystem. It has been part of the mainline Linux kernel since version 2.6.20, and is thus natively supported by several distributions. In addition, activities such as memory management and scheduling are carried out by existing kernel features, making KVM simpler and smaller than hypervisors that take control of the entire machine.
KVM leverages hardware-assisted virtualization, which improves performance and allows it to support unmodified guest operating systems; currently, it supports several versions of Windows, Linux, and UNIX.
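As a hands-on illustration (not part of the original text), the following minimal sketch uses the libvirt Python bindings to list the guests of a KVM host; it assumes libvirt-python is installed and that a local qemu:///system hypervisor is reachable.

import libvirt

# Connect to the local KVM/QEMU driver; change the URI for a remote host.
conn = libvirt.open("qemu:///system")
try:
    # listAllDomains() returns both running and defined-but-stopped guests.
    for dom in conn.listAllDomains():
        state, _reason = dom.state()
        running = (state == libvirt.VIR_DOMAIN_RUNNING)
        print("%-20s running=%s" % (dom.name(), running))
finally:
    conn.close()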
Virtual Appliances and the Open Virtualization
Format
An application combined with the environment needed to run it (operating
system, libraries, compilers, databases, application containers, and so forth) is
referred to as a "virtual appliance." Packaging application environments in the
shape of virtual appliances eases software customization, configuration, and
patching and improves portability. Most commonly, an appliance is shaped as
a VM disk image associated with hardware requirements, and it can be readily
deployed in a hypervisor.
With a multitude of hypervisors, where each one supports a different VM image format and the formats are incompatible with one another, a great deal of interoperability issues arise. For instance, Amazon has its Amazon machine
image (AMI) format, made popular on the Amazon EC2 public cloud. Other
formats are used by Citrix XenServer, several Linux distributions that ship with
KVM, Microsoft Hyper-V, and VMware ESX.
The Open Virtualization Format (OVF), a DMTF standard, was proposed to address this problem by defining an open, hypervisor-agnostic packaging format for virtual appliances. OVF's extensibility has encouraged additions relevant to the management of data centers and clouds. Mathews et al. have devised virtual machine contracts (VMC) as an extension to OVF. A VMC aids in communicating and managing the complex expectations that VMs have of their runtime environment and vice versa.
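As a rough illustration of how such a descriptor can be inspected programmatically, the sketch below walks an OVF envelope with the standard-library XML parser and prints its referenced files and virtual systems; the file name is a placeholder and error handling is omitted.

import xml.etree.ElementTree as ET

def summarize_ovf(path):
    """Print the referenced files and virtual systems of an OVF descriptor."""
    root = ET.parse(path).getroot()          # the OVF <Envelope> element
    for elem in root.iter():
        tag = elem.tag.split('}')[-1]        # strip the XML namespace prefix
        if tag == "File":                    # disk images and other referenced files
            print("referenced file:", dict(elem.attrib))
        elif tag == "VirtualSystem":
            print("virtual system:", dict(elem.attrib))

# summarize_ovf("appliance.ovf")             # "appliance.ovf" is a placeholder path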
Autonomic Computing
The increasing complexity of computing systems has motivated research on
autonomic computing, which seeks to improve systems by decreasing human
involvement in their operation. In other words, systems should manage
themselves, with high-level guidance from humans .
In this sense, the concepts of autonomic computing inspire software
technologies for data center automation, which may perform tasks such as:
management of service levels of running applications; management of data
center capacity; proactive disaster recovery; and automation of VM
provisioning .
LAYERS AND TYPES OF CLOUDS
Cloud computing services are divided into three classes, according to the
abstraction level of the capability provided and the service model of providers,
namely: (1) Infrastructure as a Service, (2) Platform as a Service, and (3) Software
as a Service . Figure 1.3 depicts the layered organization of the cloud stack
from physical infrastructure to applications.
These abstraction levels can also be viewed as a layered architecture where
services of a higher layer can be composed from services of the underlying
layer.
Infrastructure as a Service
Offering virtualized resources (computation, storage, and communication) on
demand is known as Infrastructure as a Service (IaaS).
FIGURE 1.3. The cloud computing stack:
SaaS - main access and management tool: Web browser - service content: cloud applications (social networks, office suites, CRM, video processing)
PaaS - main access and management tool: cloud development environment - service content: cloud platform (programming languages, frameworks, mashup editors, structured data)
IaaS - main access and management tool: virtual infrastructure manager - service content: cloud infrastructure (compute servers, data storage, firewall, load balancer)
A cloud infrastructure
enables on-demand provisioning of servers running several choices of operating
systems and a customized software stack. Infrastructure services are considered
to be the bottom layer of cloud computing systems .
Platform as a Service
In addition to infrastructure-oriented clouds that provide raw computing and
storage services, another approach is to offer a higher level of abstraction to
make a cloud easily programmable, known as Platform as a Service (PaaS).
Google AppEngine, an example of Platform as a Service, offers a scalable
environment for developing and hosting Web applications, which should
be written in specific programming languages such as Python or Java, and use
the service's own proprietary structured object data store.
Software as a Service
Applications reside on the top of the cloud stack. Services provided by this
layer can be accessed by end users through Web portals. Therefore, consumers
are increasingly shifting from locally installed computer programs to on-line software services that offer the same functionality. Traditional desktop applications such as word processing and spreadsheets can now be accessed as services on the Web.
Deployment Models
Although cloud computing has emerged mainly from the appearance of public computing utilities, other deployment models, with variations in physical location and distribution, have been adopted. In this sense, regardless of its service class, a cloud can be classified as public, private, community, or hybrid based on its model of deployment, as shown in Figure 1.4.
FIGURE 1.4. Types of clouds based on deployment models:
Public/Internet clouds: third-party, multi-tenant cloud infrastructure and services, available on a subscription basis (pay as you go).
Private/Enterprise clouds: the cloud computing model run within a company's own data center/infrastructure, for internal and/or partner use.
Hybrid/Mixed clouds: mixed usage of private and public clouds; public cloud services are leased when private cloud capacity is insufficient.
Armbrust et al. propose definitions for a public cloud as a "cloud made available in a pay-as-you-go manner to the general public" and a private cloud as the "internal data center of a business or other organization, not made available to the general public."
A community cloud is "shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations)."
A hybrid cloud takes shape when a private cloud is supplemented with
computing capacity from public clouds. The approach of temporarily renting capacity to handle spikes in load is known as "cloud-bursting".
DESIRED FEATURES OF A CLOUD
Certain features of a cloud are essential to enable services that truly represent
the cloud computing model and satisfy expectations of consumers, and cloud
offerings must be (i) self-service, (ii) per-usage metered and billed, (iii) elastic,
and (iv) customizable.
Self-Service
Consumers of cloud computing services expect on-demand, nearly instant
access to resources. To support this expectation, clouds must allow self-service
access so that customers can request, customize, pay, and use services without
intervention of human operators .
Per-Usage Metering and Billing
Cloud computing eliminates up-front commitment by users, allowing them to
request and use only the necessary amount. Services must be priced on a short-term basis (e.g., by the hour), allowing users to release (and not pay for)
resources as soon as they are not needed.
Elasticity
Cloud computing gives the illusion of infinite computing resources available on
demand . Therefore users expect clouds to rapidly provide resources in any
quantity at any time. In particular, it is expected that the additional resources
can be (a) provisioned, possibly automatically, when an application load
increases and (b) released when load decreases (scale up and down) .
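A toy sketch of this scale-up/scale-down behavior is shown below; the utilization thresholds and server limits are illustrative assumptions, and a real deployment would call the provider's API instead of just returning a number.

# Toy sketch of the elasticity idea: grow capacity when load is high and
# release it when load drops. The thresholds and server counts are
# placeholders; a real system would call a provider API to act on them.

def autoscale(current_servers, cpu_utilization,
              scale_up_at=0.75, scale_down_at=0.25,
              min_servers=1, max_servers=20):
    """Return the desired number of servers for the observed utilization."""
    if cpu_utilization > scale_up_at and current_servers < max_servers:
        return current_servers + 1        # provision one more server
    if cpu_utilization < scale_down_at and current_servers > min_servers:
        return current_servers - 1        # release an under-used server
    return current_servers                # load is within the target band

servers = 2
for load in (0.40, 0.80, 0.85, 0.60, 0.20, 0.15):
    servers = autoscale(servers, load)
    print("load = %.2f -> servers = %d" % (load, servers))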
Customization
In a multi-tenant cloud a great disparity between user needs is often the case.
Thus, resources rented from the cloud must be highly customizable. In the case
of infrastructure services, customization means allowing users to deploy
specialized virtual appliances and to be given privileged (root) access to the
virtual servers. Other service classes (PaaS and SaaS) offer less flexibility and
are not suitable for general-purpose computing , but still are expected to
provide a certain level of customization.
CLOUD INFRASTRUCTURE MANAGEMENT
A key challenge IaaS providers face when building a cloud infrastructure is
managing physical and virtual resources, namely servers, storage, and
networks, in a holistic fashion . The orchestration of resources must be
performed in a way to rapidly and dynamically provision resources to
applications .
The availability of a remote cloud-like interface and the ability to manage many users and their permissions are the primary features that distinguish "cloud toolkits" from virtual infrastructure managers (VIMs). However, in this chapter, we place both categories of tools under the same group (the VIMs) and, when applicable, we highlight the availability of a remote interface as a feature.
Virtually all VIMs we investigated present a set of basic features related to
managing the life cycle of VMs, including networking groups of VMs together
and setting up virtual disks for VMs. These basic features pretty much define
whether a tool can be used in practical cloud deployments or not. On the other hand, only a handful of tools present advanced features (e.g., high availability) that allow them to be used in large-scale production clouds.
Features
We now present a list of both basic and advanced features that are usually
available in VIMs.
Virtualization Support. The multi-tenancy aspect of clouds requires multiple
customers with disparate requirements to be served by a single hardware
infrastructure.
Self-Service, On-Demand Resource Provisioning. Self-service access to resources has been perceived as one of the most attractive features of clouds. This feature enables users to directly obtain services from clouds.
Multiple Backend Hypervisors. Different virtualization models and tools offer
different benefits, drawbacks, and limitations. Thus, some VI managers
provide a uniform management layer regardless of the virtualization
technology used.
Storage Virtualization. Virtualizing storage means abstracting logical storage
from physical storage. By consolidating all available storage devices in a data
center, it allows creating virtual disks independent from device and location.
In the VI management sphere, storage virtualization support is often
restricted to commercial products of companies such as VMWare and Citrix.
Other products feature ways of pooling and managing storage devices, but
administrators are still aware of each individual device.
Interface to Public Clouds. Researchers have perceived that extending the
capacity of a local in-house computing infrastructure by borrowing resources
from public clouds is advantageous. In this fashion, institutions can make good
use of their available resources and, in case of spikes in demand, extra load can
be offloaded to rented resources .
Virtual Networking. Virtual networks allow creating an isolated network on
top of a physical infrastructure independently from physical topology and
locations. A virtual LAN (VLAN) allows isolating traffic that shares a
switched network, allowing VMs to be grouped into the same broadcast
domain.
Dynamic Resource Allocation. Increased awareness of energy consumption in data centers has encouraged the practice of dynamically consolidating VMs onto a smaller number of servers. In cloud infrastructures, where applications
have variable and dynamic needs, capacity management and demand
prediction are especially complicated. This fact triggers the need for dynamic
resource allocation aiming at obtaining a timely match of supply and
demand.
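One common way to approach this, sketched below under strong simplifying assumptions (each VM reduced to a single CPU demand, identical hosts), is to treat consolidation as a bin-packing problem and apply a first-fit-decreasing heuristic.

# Minimal sketch of VM consolidation as bin packing (first-fit decreasing).
# Each VM is reduced to a single CPU demand (a fraction of one host) and all
# hosts are identical, which is a deliberate simplification.

def consolidate(vm_demands, host_capacity=1.0):
    """Return how many hosts are needed to pack the given VM CPU demands."""
    free = []                                     # remaining capacity of powered-on hosts
    for demand in sorted(vm_demands, reverse=True):
        for i, capacity in enumerate(free):
            if demand <= capacity:                # fits on an already powered-on host
                free[i] = capacity - demand
                break
        else:
            free.append(host_capacity - demand)   # power on a new host
    return len(free)

vms = [0.6, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1]
print("hosts needed:", consolidate(vms))          # unused hosts can be powered off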
Virtual Clusters. Several VI managers can holistically manage groups of VMs.
This feature is useful for provisioning computing virtual clusters on demand,
and interconnected VMs for multi-tier Internet applications.
Reservation and Negotiation Mechanism. When users request computational resources to be available at a specific time, the requests are termed advance reservations (AR), in contrast to best-effort requests, in which users request resources whenever they are available.
Additionally, leases may be negotiated and renegotiated, allowing provider
and consumer to modify a lease or present counter proposals until an
agreement is reached.
High Availability and Data Recovery. The high availability (HA) feature of VI
managers aims at minimizing application downtime and preventing business
disruption.
For mission critical applications, when a failover solution involving
restarting VMs does not suffice, additional levels of fault tolerance that rely on
redundancy of VMs are implemented.
Data backup in clouds should take into account the high data volume
involved in VM management.
Case Studies
In this section, we describe the main features of the most popular VI managers
available. Only the most prominent and distinguishing features of each tool are
discussed in detail. A detailed side-by-side feature comparison of VI managers
is presented in Table 1.1.
Apache VCL. The Virtual Computing Lab [60, 61] project was started in 2004 by researchers at North Carolina State University as a way to provide customized environments to computer lab users. The software components that support NCSU's initiative have been released as open source and incorporated by the Apache Foundation.
AppLogic. AppLogic is a commercial VI manager, the flagship product of
3tera Inc. from California, USA. The company has labeled this product as a
Grid Operating System.
AppLogic provides a fabric to manage clusters of virtualized servers,
focusing on managing multi-tier Web applications. It views an entire
application as a collection of components that must be managed as a single
entity.
In summary, 3tera AppLogic provides the following features: Linux-based
controller; CLI and GUI interfaces; Xen backend; Global Volume Store (GVS)
storage virtualization; virtual networks; virtual clusters; dynamic resource
allocation; high availability; and data protection.
TABLE 1.1. Feature Comparison of Virtual Infrastructure Managers. The original table compares each VI manager along the following dimensions: license, installation platform of the controller, client UI/API/language bindings, backend hypervisor(s), storage virtualization, interface to public clouds, virtual networks, dynamic resource allocation, advance reservation of capacity, high availability, and data protection. The clearly recoverable entries are:
Apache VCL: Apache v2 license; multi-platform (Apache/PHP) controller; portal and XML-RPC interfaces; VMware ESX/ESXi backend.
AppLogic: proprietary license; Linux controller; GUI and CLI; Xen backend; Global Volume Store (GVS) storage virtualization.
Citrix Essentials: proprietary license; Windows controller; GUI, CLI, portal, and XML-RPC interfaces; XenServer and Hyper-V backends; Citrix Storage Link storage virtualization.
Enomaly ECP: GPL v3 license; Linux controller; portal and Web service interfaces; Xen backend; interface to Amazon EC2.
Eucalyptus: BSD license; Linux controller; EC2-compatible WS and CLI interfaces; Xen and KVM backends.
Nimbus: Apache v2 license; Linux controller; EC2 WS, WSRF, and CLI interfaces; Xen and KVM backends; interface to Amazon EC2.
OpenNebula: Apache v2 license; Linux controller; XML-RPC, CLI, and Java bindings; Xen and KVM backends; interfaces to Amazon EC2 and ElasticHosts; advance reservation of capacity via Haizea.
OpenPEX: GPL v2 license; multi-platform (Java) controller; portal and WS interfaces; XenServer backend.
oVirt: GPL v2 license; Fedora Linux controller; portal interface; KVM backend.
Platform ISF: proprietary license; Linux controller; portal interface; XenServer, Hyper-V, and VMware ESX backends; interfaces to Amazon EC2 and IBM CoD.
Platform VMO: proprietary license; Linux/Windows controller; portal interface; XenServer backend.
VMware vSphere: proprietary license; Linux/Windows controller; CLI, GUI, portal, and WS interfaces; VMware ESX/ESXi backend; VMware vStorage VMFS storage virtualization; interface to VMware vCloud partners; dynamic resource allocation via VMware DRM.
Citrix Essentials. The Citrix Essentials suite is one of the most feature-complete VI management solutions available, focusing on the management and automation of data centers. It is essentially a hypervisor-agnostic solution, currently supporting Citrix XenServer and Microsoft Hyper-V.
Enomaly ECP. The Enomaly Elastic Computing Platform, in its most complete
edition, offers most features a service provider needs to build an IaaS cloud.
In summary, Enomaly ECP provides the following features: Linux-based
controller; Web portal and Web services (REST) interfaces; Xen back-end;
interface to the Amazon EC2 public cloud; virtual networks; virtual clusters
(ElasticValet).
Eucalyptus. The Eucalyptus framework was one of the first open-source
projects to focus on building IaaS clouds. It has been developed with the intent
of providing an open-source implementation nearly identical in functionality to
Amazon Web Services APIs.
Nimbus. The Nimbus toolkit is built on top of the Globus framework. Nimbus
provides most features in common with other open-source VI managers, such
as an EC2-compatible front-end API, support to Xen, and a backend interface
to Amazon EC2.
Nimbus' core was engineered around the Spring framework to be easily extensible, allowing several internal components to be replaced; this also eases integration with other systems.
In summary, Nimbus provides the following features: Linux-based
controller; EC2-compatible (SOAP) and WSRF interfaces; Xen and KVM
backend and a Pilot program to spawn VMs through an LRM; interface to the
Amazon EC2 public cloud; virtual networks; one-click virtual clusters.
OpenNebula. OpenNebula is one of the most feature-rich open-source VI
managers. It was initially conceived to manage local virtual infrastructure, but
has also included remote interfaces that make it viable to build public clouds.
Altogether, four programming APIs are available: XML-RPC and libvirt for
local interaction; a subset of EC2 (Query) APIs and the OpenNebula Cloud
API (OCA) for public access [7, 65].
In summary, OpenNebula provides the following features: Linux-based controller; XML-RPC, CLI, EC2 Query, and OCA interfaces; Xen and KVM backends; interfaces to public clouds (Amazon EC2, ElasticHosts); virtual networks; dynamic resource allocation; advance reservation of capacity.
OpenPEX. OpenPEX (Open Provisioning and EXecution Environment) was
constructed around the notion of using advance reservations as the primary
method for allocating VM instances.
oVirt. oVirt is an open-source VI manager, sponsored by Red Hat's Emergent
Technology group. It provides most of the basic features of other VI managers,
including support for managing physical server pools, storage pools, user
accounts, and VMs. All features are accessible through a Web interface.
Platform ISF. Infrastructure Sharing Facility (ISF) is the VI manager offering
from Platform Computing [68]. The company, mainly through its LSF family
of products, has been serving the HPC market for several years.
ISF is built upon Platform's VM Orchestrator, which, as a standalone
product, aims at speeding up delivery of VMs to end users. It also provides high
availability by restarting VMs when hosts fail and duplicating the VM that
hosts the VMO controller.
VMWare vSphere and vCloud. vSphere is VMware's suite of tools aimed at transforming IT infrastructures into private clouds. It is distinguished from other VI managers as one of the most feature-rich, due to the company's several offerings at all levels of the architecture.
In the vSphere architecture, servers run on the ESXi platform. A separate
server runs vCenter Server, which centralizes control over the entire virtual
infrastructure. Through the vSphere Client software, administrators connect to
vCenter Server to perform various tasks.
In summary, vSphere provides the following features: VMware ESX and ESXi backends; VMware vStorage VMFS storage virtualization; interface to external clouds (VMware vCloud partners); virtual networks (VMware Distributed Switch); dynamic resource allocation (VMware DRM); high availability; data protection (VMware Consolidated Backup).
INFRASTRUCTURE AS A SERVICE PROVIDERS
Public Infrastructure as a Service providers commonly offer virtual servers
containing one or more CPUs, running several choices of operating systems
and a customized software stack. In addition, storage space and
communication facilities are often provided.
Features
In spite of being based on a common set of features, IaaS offerings can be
distinguished by the availability of specialized features that influence the
cost—benefit ratio to be experienced by user applications when moved to
the cloud. The most relevant features are: (i) geographic distribution of data
centers; (ii) variety of user interfaces and APIs to access the system; (iii)
specialized components and services that aid particular applications (e.g.,
loadbalancers, firewalls); (iv) choice of virtualization platform and operating
systems; and (v) different billing methods and period (e.g., prepaid vs. postpaid, hourly vs. monthly).
Geographic Presence. To improve availability and responsiveness, a provider
of worldwide services would typically build several data centers distributed
around the world. For example, Amazon Web Services presents the concept of
"availability zones" and "regions" for its EC2 service.
User Interfaces and Access to Servers. Ideally, a public IaaS provider must
provide multiple access means to its cloud, thus catering for various users and
their preferences. Different types of user interfaces (UI) provide different levels
of abstraction, the most common being graphical user interfaces (GUI),
command-line tools (CLI), and Web service (WS) APIs.
GUIs are preferred by end users who need to launch, customize, and monitor a few virtual servers and do not necessarily need to repeat the process
several times. On the other hand, CLIs offer more flexibility and the possibility
of automating repetitive tasks via scripts.
Advance Reservation of Capacity. Advance reservations allow users to request that an IaaS provider reserve resources for a specific time frame in the future, thus ensuring that cloud resources will be available at that time. However, most clouds only support best-effort requests; that is, user requests are served whenever resources are available.
Automatic Scaling and Load Balancing. As mentioned earlier in this chapter,
elasticity is a key characteristic of the cloud computing model. Applications
often need to scale up and down to meet varying load conditions. Automatic
scaling is a highly desirable feature of IaaS clouds.
Service-Level Agreement. Service-level agreements (SLAs) are offered by IaaS providers to express their commitment to delivering a certain QoS. To customers, an SLA serves as a warranty. An SLA usually includes availability and performance guarantees. Additionally, metrics must be agreed upon by all parties, as well as penalties for violating these expectations.
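As a quick illustration of what such an availability guarantee means in practice, the following sketch converts an SLA percentage into the downtime it allows per month, assuming a 30-day month.

# Minimal sketch: translate an SLA availability percentage into the downtime
# it allows per month. A 30-day month is assumed for the calculation.

def allowed_downtime_minutes(availability_pct, days=30):
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)

for pct in (99.9, 99.95, 99.99):
    print("%.2f%% availability allows %.1f minutes of downtime per month"
          % (pct, allowed_downtime_minutes(pct)))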
Hypervisor and Operating System Choice. Traditionally, IaaS offerings have
been based on heavily customized open-source Xen deployments. IaaS
providers needed expertise in Linux, networking, virtualization, metering,
resource management, and many other low-level aspects to successfully deploy
and maintain their cloud offerings.
Case Studies
In this section, we describe the main features of the most popular public IaaS
clouds. Only the most prominent and distinguishing features of each one are
discussed in detail. A detailed side-by-side feature comparison of IaaS offerings
is presented in Table 1.2.
Amazon Web Services. Amazon WS (AWS) is one of the major players in the
cloud computing market. It pioneered the introduction of IaaS clouds in
2006.
The Elastic Compute Cloud (EC2) offers Xen-based virtual servers (instances)
that can be instantiated from Amazon Machine Images (AMIs). Instances are
available in a variety of sizes, operating systems, architectures, and prices. The CPU capacity of instances is measured in Amazon Compute Units and, although fixed for each instance, varies among instance types from 1 (small instance) to 20 (high-CPU instance).
In summary, Amazon EC2 provides the following features: multiple data centers available in the United States (East and West) and Europe; CLI, Web services (SOAP and Query), and Web-based console user interfaces; access to instances mainly via SSH (Linux) and Remote Desktop (Windows); advance reservation of capacity (aka reserved instances) that guarantees availability for periods of 1 or 3 years; a 99.95% availability SLA; per-hour pricing; Linux and Windows operating systems; automatic scaling; and load balancing.
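As a hedged illustration of programmatic access to EC2, the sketch below uses the modern boto3 Python SDK rather than the 2010-era SOAP/Query interfaces described above; the AMI id, key pair name, and region are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",        # placeholder Amazon Machine Image
    InstanceType="t2.micro",
    KeyName="my-keypair",          # placeholder key pair used for SSH access
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]
desc = ec2.describe_instances(InstanceIds=[instance_id])
print(instance_id, desc["Reservations"][0]["Instances"][0]["State"]["Name"])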
TABLE 1.2. Feature Comparison of Public Cloud Offerings (Infrastructure as a Service)

Amazon EC2:
  Geographic presence: US East, Europe
  Client UI / API language bindings: CLI, WS, Portal
  Primary access to server: SSH (Linux), Remote Desktop (Windows)
  Advance reservation of capacity: Amazon reserved instances (available in 1- or 3-year terms, starting from reservation time)
  SLA uptime: 99.95%
  Smallest billing unit: Hour
  Hypervisor: Xen
  Guest operating systems: Linux, Windows
  Automated horizontal scaling: Available with Amazon CloudWatch
  Load balancing: Elastic Load Balancing
  Runtime server resizing / vertical scaling: No
  Instance hardware capacity: 1-20 EC2 compute units; 1.7-15 GB memory; 160-1690 GB storage, plus 1 GB-1 TB per EBS volume

Flexiscale:
  Geographic presence: UK
  Client UI / API language bindings: Web Console
  Primary access to server: SSH
  Advance reservation of capacity: No
  SLA uptime: 100%
  Smallest billing unit: Hour
  Hypervisor: Xen
  Guest operating systems: Linux, Windows
  Automated horizontal scaling: No
  Load balancing: Zeus software
  Runtime server resizing / vertical scaling: Processors, memory
  Instance hardware capacity: 1-4 CPUs; 0.5-16 GB memory; 20-270 GB storage

GoGrid:
  Geographic presence: US
  Client UI / API language bindings: REST, Java, PHP, Python, Ruby
  Primary access to server: SSH
  Advance reservation of capacity: No
  SLA uptime: 100%
  Smallest billing unit: Hour
  Hypervisor: Xen
  Guest operating systems: Linux, Windows
  Automated horizontal scaling: No
  Load balancing: Hardware (F5)
  Runtime server resizing / vertical scaling: Memory, disk (requires reboot)
  Instance hardware capacity: 1-6 CPUs; 0.5-8 GB memory; 30-480 GB storage

Joyent Cloud:
  Geographic presence: US (Emeryville, CA; San Diego, CA; Andover, MA; Dallas, TX)
  Primary access to server: SSH, VirtualMin (Web-based system administration)
  Advance reservation of capacity: No
  SLA uptime: 100%
  Smallest billing unit: Month
  Hypervisor: OS level (Solaris Containers)
  Guest operating systems: OpenSolaris
  Automated horizontal scaling: No
  Load balancing: Both hardware (F5 networks) and software (Zeus)
  Runtime server resizing / vertical scaling: Automatic CPU bursting (up to 8 CPUs)
  Instance hardware capacity: 1/16-8 CPUs; 0.25-32 GB memory; 5-100 GB storage

Rackspace Cloud Servers:
  Geographic presence: US (Dallas, TX)
  Client UI / API language bindings: Portal, REST, Python, PHP, Java, C#/.NET
  Primary access to server: SSH
  Advance reservation of capacity: No
  SLA uptime: 100%
  Smallest billing unit: Hour
  Hypervisor: Xen
  Guest operating systems: Linux
  Automated horizontal scaling: No
  Load balancing: No
  Runtime server resizing / vertical scaling: Memory, disk (requires reboot); automatic CPU bursting (up to 100% of the available CPU power of the physical host)
  Instance hardware capacity: Quad-core CPU (CPU power is weighed proportionally to memory size); 0.25-16 GB memory; 10-620 GB storage
Flexiscale. Flexiscale is a UK-based provider offering services similar in nature to Amazon Web Services. However, its virtual servers offer some distinct features, most notably: persistent storage by default, fixed IP addresses, dedicated VLAN, a wider range of server sizes, and runtime adjustment of CPU capacity (aka CPU bursting/vertical scaling). Like the other IaaS clouds, this service is also priced by the hour.
Joyent. Joyent's Public Cloud offers servers based on Solaris containers virtualization technology. These servers, dubbed accelerators, allow deploying various specialized software stacks based on a customized version of the OpenSolaris operating system, which includes by default a Web-based configuration tool and several pre-installed software packages, such as Apache, MySQL, PHP, Ruby on Rails, and Java. Software load balancing is available as an accelerator in addition to hardware load balancers.
In summary, the Joyent public cloud offers the following features: multiple geographic locations in the United States; Web-based user interface; access to virtual servers via SSH and a Web-based administration tool; 100% availability SLA; per-month pricing; OS-level virtualization with Solaris containers; OpenSolaris operating system; automatic scaling (vertical).
GoGrid. GoGrid, like many other IaaS providers, allows its customers to utilize a range of pre-made Windows and Linux images, in a range of fixed instance sizes. GoGrid also offers "value-added" stacks on top for applications such as high-volume Web serving, e-Commerce, and database stores.
Rackspace Cloud Servers. Rackspace Cloud Servers is an IaaS solution that
provides fixed size instances in the cloud. Cloud Servers offers a range of
Linux-based pre-made images. A user can request different-sized images, where
the size is measured by requested RAM, not CPU.
PLATFORM AS A SERVICE PROVIDERS
Public Platform as a Service providers commonly offer a development and deployment environment that allows users to create and run their applications with little or no concern for the low-level details of the platform. In addition, specific programming languages and frameworks are made available in the platform, as well as other services such as persistent data storage and in-memory caches.
Features
Programming Models, Languages, and Frameworks. Programming models made available by PaaS providers define how users can express their applications using higher levels of abstraction and efficiently run them on the cloud platform. Each model aims at efficiently solving a particular problem. In the cloud computing domain, the most common activities that require specialized models are the processing of large datasets in clusters of computers (the MapReduce model) and the development of request-based Web services and applications.
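To illustrate the MapReduce model named above, the following self-contained Python sketch runs the classic word-count example in-process; a real deployment would distribute the map and reduce phases across a cluster (e.g., Hadoop), which this toy version only imitates.

from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Reduce: group intermediate pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the cloud scales", "the cloud stores data"]
print(reduce_phase(map_phase(docs)))   # {'the': 2, 'cloud': 2, 'scales': 1, ...}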
Persistence Options. A persistence layer is essential to allow applications to record their state and recover it in case of crashes, as well as to store user data. Traditionally, Web and enterprise application developers have chosen relational databases as the preferred persistence method. These databases offer fast and reliable structured data storage and transaction processing, but may lack scalability to handle several petabytes of data stored in commodity computers.
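As a small illustration of the relational persistence style described above, the sketch below uses Python's built-in sqlite3 module; the table and values are made up, and a cloud deployment would of course target a managed database service rather than an in-memory database.

import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for a hosted relational database
conn.execute("CREATE TABLE session_state (user_id TEXT PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO session_state VALUES (?, ?)", ("u42", '{"cart": ["item-1"]}'))
conn.commit()
row = conn.execute("SELECT payload FROM session_state WHERE user_id = ?", ("u42",)).fetchone()
print(row[0])   # application state recovered from the persistence layer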
Case Studies
In this section, we describe the main features of some Platform as a Service (PaaS) offerings. A more detailed side-by-side feature comparison of these offerings is presented in Table 1.3.
Aneka. Aneka is a .NET-based service-oriented resource management and development platform. Each server in an Aneka deployment (dubbed an Aneka cloud node) hosts the Aneka container, which provides the base infrastructure that consists of services for persistence, security (authorization, authentication, and auditing), and communication (message handling and dispatching). Several programming models are supported, such as the Task model, which enables execution of legacy HPC applications, and MapReduce, which enables a variety of data-mining and search applications.
App Engine. Google App Engine lets you run your Python and Java Web applications on elastic infrastructure supplied by Google. The App Engine serving architecture is notable in that it allows real-time auto-scaling without virtualization for many common types of Web applications.
TABLE 1.3. Feature Comparison of Platform-as-a-Service Cloud Offerings

Aneka:
  Target use: .NET enterprise applications, HPC
  Programming language / frameworks: .NET
  Developer tools: Standalone SDK
  Programming models: Threads, Task, MapReduce
  Persistence options: Flat files, RDBMS, HDFS
  Automatic scaling: No
  Backend infrastructure providers: Amazon EC2

App Engine:
  Target use: Web applications
  Programming language / frameworks: Python, Java
  Developer tools: Eclipse-based IDE
  Programming models: Request-based Web programming
  Persistence options: BigTable
  Automatic scaling: Yes
  Backend infrastructure providers: Own data centers

Force.com:
  Target use: Enterprise applications (esp. CRM)
  Programming language / frameworks: Apex
  Developer tools: Eclipse-based IDE, Web-based wizard
  Programming models: Workflow, Excel-like formula language, request-based Web programming
  Persistence options: Own object database
  Automatic scaling: Unclear
  Backend infrastructure providers: Own data centers

Microsoft Windows Azure:
  Target use: Enterprise and Web applications
  Programming language / frameworks: .NET
  Developer tools: Azure tools for Microsoft Visual Studio
  Programming models: Unrestricted
  Persistence options: Table/BLOB/queue storage, SQL services
  Automatic scaling: Yes
  Backend infrastructure providers: Own data centers

Heroku:
  Target use: Web applications
  Programming language / frameworks: Ruby on Rails
  Developer tools: Command-line tools
  Programming models: Request-based Web programming
  Persistence options: PostgreSQL, Amazon RDS
  Automatic scaling: Yes
  Backend infrastructure providers: Amazon EC2

Amazon Elastic MapReduce:
  Target use: Data processing
  Programming language / frameworks: Hive and Pig, Cascading, Java, Ruby, Perl, Python, PHP, R, C++
  Developer tools: Karmasphere Studio for Hadoop (NetBeans-based)
  Programming models: MapReduce
  Persistence options: Amazon S3
  Automatic scaling: No
  Backend infrastructure providers: Amazon EC2
However, such auto-scaling is dependent on the application developer using a limited subset of the native APIs on each platform, and in some instances you need to use specific Google APIs such as URLFetch, Datastore, and memcache in place of certain native API calls.
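As a hedged sketch of what using specific Google APIs in place of native calls looks like, the snippet below targets the legacy App Engine Python runtime, where urlfetch replaces the standard HTTP client and memcache provides the shared cache; it only runs inside that environment.

from google.appengine.api import urlfetch, memcache

def fetch_with_cache(url):
    cached = memcache.get(url)                 # shared, distributed cache lookup
    if cached is not None:
        return cached
    result = urlfetch.fetch(url)               # replaces e.g. urllib2.urlopen on App Engine
    if result.status_code == 200:
        memcache.set(url, result.content, time=300)   # cache the body for five minutes
        return result.content
    return None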
Microsoft Azure. Microsoft Azure Cloud Services offers developers a hosted .NET stack (C#, VB.NET, ASP.NET). In addition, a Java and Ruby SDK for .NET Services is also available. The Azure system consists of a number of elements.
Force.com. In conjunction with the Salesforce.com service, the Force.com PaaS allows developers to create add-on functionality that integrates into the main Salesforce CRM SaaS application.
Heroku. Heroku is a platform for instant deployment of Ruby on Rails Web
applications. In the Heroku system, servers are invisibly managed by the
platform and are never exposed to users.
CHALLENGES AND RISKS
Despite the initial success and popularity of the cloud computing paradigm and
the extensive availability of providers and tools, a significant number of
challenges and risks are inherent to this new model of computing. Providers,
developers, and end users must consider these challenges and risks to take good
advantage of cloud computing.
Security, Privacy, and Trust
Armbrust et al. cite information security as a main issue: "current cloud offerings are essentially public . . . exposing the system to more attacks." For this reason there are potentially additional challenges to making cloud computing environments as secure as in-house IT systems. At the same time, existing, well-understood technologies can be leveraged, such as data encryption, VLANs, and firewalls.
Data Lock-In and Standardization
A major concern of cloud computing users is about having their data locked-in
by a certain provider. Users may want to move data and applications out from
a provider that does not meet their requirements. However, in their current
form, cloud computing infrastructures and platforms do not employ standard
methods of storing user data and applications. Consequently, they do not
interoperate and user data are not portable.
Availability, Fault-Tolerance, and Disaster Recovery
It is expected that users will have certain expectations about the service level to be provided once their applications are moved to the cloud. These expectations include availability of the service, its overall performance, and what measures are to be taken when something goes wrong in the system or its components. In summary, users seek a warranty before they can comfortably move their business to the cloud.
Resource Management and Energy-Efficiency
One important challenge faced by providers of cloud computing services is the
efficient management of virtualized resource pools. Physical resources such as
CPU cores, disk space, and network bandwidth must be sliced and shared
among virtual machines running potentially heterogeneous workloads.
Another challenge concerns the sheer amount of data to be managed in various VM management activities. This data volume is a result of particular abilities of virtual machines, including the ability to travel through space (i.e., migration) and time (i.e., checkpointing and rewinding), operations that may be required in load balancing, backup, and recovery scenarios. In addition, dynamic provisioning of new VMs and replicating existing VMs require efficient mechanisms to make VM block storage devices (e.g., image files) quickly available at selected hosts.
2.2 MIGRATING INTO A CLOUD
The promise of cloud computing has raised the IT expectations of small and medium enterprises beyond measure. Large companies are deeply debating it. Cloud computing is a disruptive model of IT whose innovation is part technology and part business model; in short, a "disruptive techno-commercial model" of IT. This tutorial chapter focuses on the key issues and associated dilemmas faced by decision makers, architects, and systems managers in trying to understand and leverage cloud computing for their IT needs. Questions asked and discussed in this chapter include: when and how to migrate one's application into a cloud; what part or component of the IT application to migrate into a cloud and what not to migrate into a cloud; what kind of customers really benefit from migrating their IT into the cloud; and so on. We describe the key factors underlying each of the above questions and share a Seven-Step Model of Migration into the Cloud.
Several efforts have been made in the recent past to define the term "cloud computing," and many have not been able to provide a comprehensive one. This has been made more challenging by the scorching pace of the technological advances as well as the newer business model formulations for the cloud services being offered.
The Promise of the Cloud
Most users of cloud computing services offered by some of the large-scale data centers are least bothered about the complexities of the underlying systems or their functioning, more so given the heterogeneity of either the systems or the software running on them.
Cloudonomics and Technology
• 'Pay per use': lower cost barriers
• On-demand resources: autoscaling
• Capex vs. Opex: no capital expenses (CAPEX), only operational expenses (OPEX)
• SLA-driven operations: much lower TCO
• Attractive NFR support: availability, reliability
• 'Infinite' elastic availability: compute, storage, bandwidth
• Automatic usage monitoring and metering
• Jobs/tasks virtualized and transparently 'movable'
• Integration and interoperability 'support' for hybrid operations
• Transparently encapsulated and abstracted IT features
FIGURE 2.1. The promise of the cloud computing services.
As shown in Figure 2.1, the promise of the cloud both on the business front
(the attractive cloudonomics) and the technology front widely aided the CxOs
to spawn out several non-mission critical IT needs from the ambit of their
captive traditional data centers to the appropriate cloud service. Invariably,
these IT needs had some common features: They were typically Web-oriented;
they represented seasonal IT demands; they were amenable to parallel batch
processing; they were non-mission critical and therefore did not have high
security demands.
The Cloud Service Offerings and Deployment Models
Cloud computing has been an attractive proposition both for the CFO and the CTO of an enterprise, primarily due to its ease of usage. This has been achieved by large data center service vendors, now better known as cloud service vendors, again primarily due to their scale of operations. Google, Amazon,
Microsoft, and a few others have been the key players, apart from open-source Hadoop built around the Apache ecosystem.

FIGURE 2.2. The cloud computing service offerings and deployment models. IaaS: abstract compute/storage/bandwidth resources (e.g., Amazon Web Services [10, 9] EC2, S3, SDB, CDN, CloudWatch), aimed at IT folks. PaaS: an abstracted programming platform with encapsulated infrastructure (e.g., Google App Engine (Java/Python), Microsoft Azure, Aneka [13]), aimed at programmers. SaaS: applications with encapsulated infrastructure and platform (e.g., Salesforce.com, Gmail, Yahoo Mail, Facebook, Twitter), aimed at architects and users. Cloud application deployment and consumption models: public, private, and hybrid clouds.

As shown in Figure 2.2, the cloud
service offerings from these vendors can broadly be classified into three major
streams: the Infrastructure as a Service (IaaS), the Platform as a Service (PaaS),
and the Software as a Service (SaaS). While IT managers and system
administrators preferred IaaS as offered by Amazon for many of their
virtualized IT needs, the programmers preferred PaaS offerings like Google
AppEngine (Java/Python programming) or Microsoft Azure (.Net
programming). Users of large-scale enterprise software invariably found that
if they had been using the cloud, it was because their usage of the specific
software package was available as a service—it was, in essence, a SaaS
offering. Salesforce.com was an exemplary SaaS offering on the Internet.
From a technology viewpoint, as of today, the IaaS-type cloud offerings have been the most successful and widespread in usage. Invariably these offerings hide the cloud underneath: storage is easily scalable, and most users do not know on which system, or where, their data are stored.
Challenges in the Cloud
While the cloud service offerings present a simplistic view of IT in the case of IaaS, a simplistic view of programming in the case of PaaS, or a simplistic view of resource usage in the case of SaaS, the underlying systems-level support challenges are huge and highly complex. These stem from the need to offer a uniformly consistent and robustly simplistic view of computing while the underlying systems are highly failure-prone, heterogeneous, resource hogging, and exhibiting serious security shortcomings. As observed in Figure 2.3, the promise of the cloud seems very similar to the typical distributed systems properties that most would prefer to have.
FIGURE 2.3. 'Under the hood' challenges of the cloud computing services implementations. Distributed-system fallacies and the promise of the cloud: full network reliability; zero network latency; infinite bandwidth; secure network; no topology changes; centralized administration; zero transport costs; homogeneous networks and systems. Challenges in cloud technologies: security; performance monitoring; consistent and robust service abstractions; meta-scheduling; energy-efficient load balancing; scale management; SLA and QoS architectures; interoperability and portability; green IT.
Many of these are listed in Figure 2.3. Prime among them are the challenges of security. The Cloud Security Alliance seeks to address many of these issues.
BROAD APPROACHES TO MIGRATING INTO THE CLOUD
Given that cloud computing is a "techno-business disruptive model" and is at the top of the top 10 strategic technologies to watch for 2010 according to Gartner, migrating into the cloud is poised to become a large-scale effort in leveraging the cloud in several enterprises. "Cloudonomics" deals with the economic rationale for leveraging the cloud and is central to the success of cloud-based enterprise usage.
Why Migrate?
There are economic and business reasons why an enterprise application can be
migrated into the cloud, and there are also a number of technological reasons.
Many of these efforts come up as initiatives in adoption of cloud technologies
in the enterprise, resulting in integration of enterprise applications running off
the captive data centers with the new ones that have been developed on the
cloud. Adoption of or integration with cloud computing services is a use case of
migration.
With due simplification, the migration of an enterprise application is best captured by the following:

P → P'_C + P'_l → P'_OFC + P'_l

where P is the application before migration, running in the captive data center; P'_C is the application part migrated into a (hybrid) cloud; P'_l is the part of the application that continues to run in the captive local data center; and P'_OFC is the application part optimized for the cloud. If an enterprise application cannot be migrated fully, some parts are run in the captive local data center while the rest are migrated into the cloud, essentially a case of hybrid cloud usage. However, when the entire application is migrated onto the cloud, then P'_l is null. Indeed, the migration of the enterprise application P can happen at the five levels of application, code, design, architecture, and usage. It can be that the P'_C migration happens at any of the five levels without any P'_l component. Compound this with the kind of cloud computing service offering being applied (the IaaS, PaaS, or SaaS model) and we have a variety of migration use cases that need to be thought through thoroughly by the migration architects.
Cloudonomics. Invariably, migrating into the cloud is driven by economic
reasons of cost cutting in both the IT capital expenses (Capex) as well as
operational expenses (Opex). There are both the short-term benefits of
opportunistic migration to offset seasonal and highly variable IT loads as well
as the long-term benefits to leverage the cloud. For the long-term sustained
usage, as of 2009, several impediments and shortcomings of the cloud
computing services need to be addressed.
Deciding on the Cloud Migration
In fact, several proofs of concept and prototypes of the enterprise application are experimented with on the cloud to help in making a sound decision on migrating into the cloud. Post-migration, the ROI on the migration should be positive for a broad range of pricing variability. Assume that among the M classes of questions, there is a class with a maximum of N questions. We can then model the weightage-based decision making as an M × N weightage matrix as follows:

C_l ≤ Σ_{i=1}^{M} B_i ( Σ_{j=1}^{N} A_ij X_ij ) ≤ C_h

where C_l is the lower weightage threshold and C_h is the higher weightage threshold, B_i is the weightage of question class i, A_ij is the specific constant assigned to a question, and X_ij is the fraction between 0 and 1 that represents the degree to which the answer to that question is relevant and applicable.
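A minimal Python rendering of this weightage-based check is given below; the class weightages B, constants A, and applicability fractions X are made-up illustrative values.

def migration_score(B, A, X):
    # sum over classes i of B[i] * sum over questions j of A[i][j] * X[i][j]
    return sum(b * sum(a * x for a, x in zip(Ai, Xi))
               for b, Ai, Xi in zip(B, A, X))

def within_thresholds(B, A, X, C_l, C_h):
    s = migration_score(B, A, X)
    return C_l <= s <= C_h, s

B = [0.6, 0.4]                       # weightage of each question class
A = [[5, 3], [4, 2]]                 # constants assigned to the questions
X = [[1.0, 0.5], [0.8, 0.0]]         # degree of relevance/applicability in [0, 1]
print(within_thresholds(B, A, X, C_l=2.0, C_h=8.0))   # (True, 5.18)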
THE SEVEN-STEP MODEL OF MIGRATION INTO A CLOUD
Typically migration initiatives into the cloud are implemented in phases or in
stages. A structured and process-oriented approach to migration into a cloud has
several advantages of capturing within itself the best practices of many migration
projects. While migration has been a difficult and vague subject—of not much
interest to the academics and left to the industry practitioners—not many efforts
across the industry have been put in to consolidate what has been found to be
both a top revenue earner and a long standing customer pain. After due study
and practice, we share the Seven-Step Model of Migration into the Cloud as part
of our efforts in understanding and leveraging the cloud computing service
offerings in the enterprise context. In a succinct way, Figure 2.4 captures the
essence of the steps in the model of migration into the cloud, while Figure 2.5
captures the iterative process of the seven-step migration into the cloud.
The first step of the iterative process of the seven-step model of migration is basically at the assessment level. Proofs of concept or prototypes for various approaches to the migration, along with the leveraging of pricing parameters, enable one to make appropriate assessments.
1. Conduct Cloud Migration Assessments
2. Isolate the Dependencies
3. Map the Messaging & Environment
4. Re-architect & Implement the lost Functionalities
5. Leverage Cloud Functionalities & Features
6. Test the Migration
7. Iterate and Optimize
FIGURE 2.4. The Seven-Step Model of Migration into the Cloud. (Source: Infosys
Research.)
FIGURE 2.5. The iterative Seven-Step Model of Migration into the Cloud: Assess, Isolate, Map, Re-architect, Augment, Test, and Optimize, repeated from start to end until the migration converges. (Source: Infosys Research.)
Having done the augmentation, we validate and test the new form of the
enterprise application with an extensive test suite that comprises testing the
components of the enterprise application on the cloud as well. These test results
could be positive or mixed. In the latter case, we iterate and optimize as
appropriate. After several such optimizing iterations, the migration is deemed
successful. Our best practices indicate that it is best to iterate through this
Seven-Step Model process for optimizing and ensuring that the migration into
the cloud is both robust and comprehensive. Figure 2.6 captures the typical
components of the best practices accumulated in the practice of the Seven-Step
Model of Migration into the Cloud. Though not comprehensive in enumeration,
it is representative.
FIGURE 2.6. Some details of the iterative Seven-Step Model of Migration into the Cloud:
Assess: cloudonomics; migration costs; recurring costs; database data segmentation; database migration; functionality migration; NFR support.
Isolate: runtime environment; licensing; libraries dependency; applications dependency; latency bottlenecks; performance bottlenecks; architectural dependencies.
Map: messages mapping (marshalling and de-marshalling); mapping environments; mapping libraries and runtime approximations.
Re-architect: approximate lost functionality using cloud runtime support APIs; new use cases; analysis; design.
Augment: exploit additional cloud features; seek low-cost augmentations; autoscaling; storage; bandwidth; security.
Test: augment test cases and test automation; run proof-of-concepts; test migration strategy; test new test cases due to cloud augmentation; test for production loads.
Optimize: optimize, rework, and iterate; significantly satisfy the cloudonomics of migration; optimize compliance with standards and governance; deliver the best migration ROI; develop a roadmap for leveraging new cloud features.
Compared with the typical approach to migration into Amazon AWS, our Seven-Step Model is more generic, versatile, and comprehensive. The typical migration into Amazon AWS is phased over several steps; about six steps are discussed in several white papers on the Amazon website, as follows: The first phase is the cloud migration assessment phase, wherein dependencies are isolated and strategies worked out to handle these dependencies. The next phase is trying out proofs of concept to build a reference migration architecture. The third phase is the data migration phase, wherein database data segmentation and cleansing are completed. This phase also tries to leverage the various cloud storage options as best suited. The fourth phase comprises the application migration, wherein a "forklift strategy" of migrating the key enterprise application along with its dependencies (other applications) into the cloud is pursued.
Migration Risks and Mitigation
The biggest challenge to any cloud migration project is how effectively the
migration risks are identified and mitigated. In the Seven-Step Model of
Migration into the Cloud, the process step of testing and validating includes
efforts to identify the key migration risks. In the optimization step, we address
various approaches to mitigate the identified migration risks.
There are issues of consistent identity management as well. These and several other issues are discussed in Section 2.1. The issues and challenges listed in Figure 2.3 continue to be persistent research and engineering challenges in coming up with appropriate cloud computing implementations.
2.3 ENRICHING THE 'INTEGRATION AS A SERVICE' PARADIGM FOR THE CLOUD ERA
AN INTRODUCTION
The trend-setting cloud paradigm actually represents the cool conglomeration of a number of proven and promising Web and enterprise technologies. Cloud infrastructure providers are establishing cloud centers to host a variety of ICT services and platforms for worldwide individuals, innovators, and institutions. Cloud service providers (CSPs) are very aggressive in experimenting with and embracing the cool cloud ideas, and today every business and technical service is being hosted in clouds to be delivered to global customers, clients, and consumers over the Internet communication infrastructure. For example, security as a service is a prominent cloud-hosted security service that can be subscribed to by a spectrum of users on any connected device, and the users just pay for the exact amount or time of usage. In a nutshell, on-premise and local applications are becoming online, remote, hosted, on-demand, and off-premise applications.
Business-to-business (B2B). It is logical to take the integration middleware to clouds to simplify and streamline enterprise-to-enterprise (E2E), enterprise-to-cloud (E2C), and cloud-to-cloud (C2C) integration.
THE EVOLUTION OF SaaS
The SaaS paradigm is on the fast track due to its innate powers and potentials. Executives, entrepreneurs, and end-users are ecstatic about the tactical as well as strategic success of the emerging and evolving SaaS paradigm. A number of positive and progressive developments started to grip this model. Newer resources and activities are being consistently readied to be delivered as a service. Experts and evangelists are in unison that the cloud is to rock the total IT community as the best possible infrastructural solution for effective service delivery.
IT as a Service (ITaaS) is the most recent and efficient delivery method in the decisive IT landscape. With the meteoric and mesmerizing rise of the service-orientation principles, every single IT resource, activity, and infrastructure is being viewed and visualized as a service that sets the tone for the grand unfolding of the dreamt service era. Integration as a Service (IaaS) is the budding and distinctive capability of clouds in fulfilling the business integration requirements. Increasingly, business applications are deployed in clouds to reap the business and technical benefits. On the other hand, there are still innumerable applications and data sources locally stationed and sustained, primarily for security reasons.
B2B systems are capable of driving this new on-demand integration
model because they are traditionally employed to automate business
processes between manufacturers and their trading partners. That
means they provide application-to-application connectivity along with
the functionality that is very crucial for linking internal and external
software securely.
The use of hub-and-spoke (H&S) architecture further simplifies the implementation and avoids placing an excessive processing burden on the customer sides. The hub is installed at the SaaS provider's cloud center to do the heavy lifting, such as reformatting files. The Web is the largest digital information superhighway:
1. The Web is the largest repository of all kinds of resources, such as web pages, applications comprising enterprise components, business services, beans, POJOs, blogs, corporate data, etc.
2. The Web is turning out to be the open, cost-effective, and generic business execution platform (e-commerce, business, auctions, etc. happen on the Web for global users) comprising a wide variety of containers, adaptors, drivers, connectors, etc.
3. The Web is the global-scale communication infrastructure (VoIP, video conferencing, IPTV, etc.).
4. The Web is the next-generation discovery, connectivity, and integration middleware.
Thus the unprecedented absorption and adoption of the Internet is the key driver for the continued success of cloud computing.
THE CHALLENGES OF SaaS PARADIGM
As with any new technology, SaaS and cloud concepts too suffer a
number of limitations. These technologies are being diligently examined
for specific situations and scenarios. The prickling and tricky issues in
different layers and levels are being looked into. The overall views are
listed out below. Loss or lack of the following features deters the massive adoption of clouds:
1. Controllability
2. Visibility and flexibility
3. Security and privacy
4. High performance and availability
5. Integration and composition
6. Standards
A number of approaches are being investigated for resolving the identified issues and flaws. Private clouds, hybrid clouds, and the latest community clouds are being prescribed as the solution for most of these inefficiencies and deficiencies. As rightly pointed out in various weblogs, there are still miles to go. There are several companies focusing on this issue. Boomi (http://www.dell.com/) is one among them. This company has published several well-written white papers elaborating the issues confronting enterprises thinking about and trying to embrace third-party public clouds for hosting their services and applications.
Integration Conundrum. While SaaS applications offer outstanding
value in terms of features and functionalities relative to cost, they have
introduced several challenges specific to integration.
APIs Are Insufficient. Many SaaS providers have responded to the integration challenge by developing application programming interfaces (APIs). Unfortunately, accessing and managing data via an API requires a significant amount of coding as well as maintenance due to frequent API modifications and updates.
Data Transmission Security. SaaS providers go to great lengths to ensure that customer data are secure within the hosted environment. However, the need to transfer data between on-premise systems or applications behind the firewall and SaaS applications introduces its own security challenges.
For any relocated application to provide the promised value to businesses and users, the minimum requirement is interoperability between SaaS applications and on-premise enterprise packages.
The Impacts of Clouds. On the infrastructural front, in the recent past, clouds have arrived onto the scene powerfully and have extended the horizon and the boundary of business applications, events, and data. Thus there is a clarion call for adaptive integration engines that seamlessly and spontaneously connect enterprise applications with cloud applications. Integration is being stretched further to the level of the expanding Internet, and this is really a litmus test for system architects and integrators.
The perpetual integration puzzle has to be solved meticulously for the originally visualised success of the SaaS style.
APPROACHING THE SaaS INTEGRATION ENIGMA
Integration as a Service (IaaS) is all about the migration of the functionality of a typical enterprise application integration (EAI) hub / enterprise service bus (ESB) into the cloud to provide smooth data transport between any enterprise and SaaS applications. Users subscribe to IaaS as they would for any other SaaS application. Cloud middleware is the next logical evolution of traditional middleware solutions.
Service orchestration and choreography enable process integration. Service interaction through an ESB integrates loosely coupled systems, whereas CEP connects decoupled systems.
With the unprecedented rise in cloud usage, all this integration software is bound to move to clouds. Amazon SQS, for example, does not promise in-order and exactly-once delivery. These simplifications let Amazon make SQS more scalable, but they also mean that developers must use SQS differently from an on-premise message-queuing technology.
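The hedged boto3 sketch below shows one common way of coping with SQS's at-least-once, possibly out-of-order delivery: deduplicating on a message id before processing. The queue URL is a placeholder and the in-memory seen_ids set stands in for a durable store.

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"   # placeholder

def process(body):
    print("processing", body)       # application-specific handler

seen_ids = set()                    # would be a durable store in production
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10)
for msg in resp.get("Messages", []):
    if msg["MessageId"] not in seen_ids:            # skip redelivered duplicates
        seen_ids.add(msg["MessageId"])
        process(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])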
As per one of David Linthicum's white papers, approaching SaaS-to-enterprise integration is really a matter of making informed and intelligent choices, chiefly about how to integrate remote cloud platforms with on-premise enterprise platforms.
Why SaaS Integration Is Hard. As indicated in the white paper, consider a mid-sized paper company that recently became a Salesforce.com CRM customer. The company currently leverages an on-premise custom system that uses an Oracle database to track inventory and sales. The use of the Salesforce.com system provides the company with significant value in terms of customer and sales management.
Having understood and defined the "to be" state, data synchronization technology is proposed as the best fit between the source, meaning Salesforce.com, and the target, meaning the existing legacy system that leverages Oracle. First of all, we need to gain insights about the special traits and tenets of SaaS applications in order to arrive at a suitable integration route. The constraining attributes of SaaS applications are:
SaaS applications are
● Dynamic nature of the SaaS interfaces that constantly change
● Dynamic nature of the metadata native to a SaaS provider such as
Salesforce.com
● Managing assets that exist outside of the firewall
● Massive amounts of information that need to move between SaaS and on-premise systems daily, and the need to maintain data quality and integrity (a minimal synchronization sketch follows below).
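A minimal sketch of that daily SaaS-to-on-premise movement, assuming the third-party simple_salesforce library and using SQLite as a stand-in for the legacy Oracle system; credentials and object names are placeholders.

from simple_salesforce import Salesforce
import sqlite3

sf = Salesforce(username="user@example.com", password="***", security_token="***")
accounts = sf.query("SELECT Id, Name FROM Account")["records"]

local = sqlite3.connect("legacy_inventory.db")       # stand-in for the on-premise database
local.execute("CREATE TABLE IF NOT EXISTS account (id TEXT PRIMARY KEY, name TEXT)")
for acc in accounts:
    # Upsert so the on-premise copy tracks the SaaS master record.
    local.execute("INSERT OR REPLACE INTO account VALUES (?, ?)", (acc["Id"], acc["Name"]))
local.commit()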
As SaaS applications are being deployed in cloud infrastructures vigorously, we need to ponder the obstructions imposed by clouds and prescribe proven solutions. If we face difficulty with local integration, then cloud integration is bound to be more complicated. The most probable reasons are:
● New integration scenarios
● Access to the cloud may be limited
● Dynamic resources
● Performance
Limited Access. Access to cloud resources (SaaS, PaaS, and the infrastructures) is more limited than access to local applications. Accessing local applications is quite simple and faster. Embedding integration points in local as well as custom applications is easier.
Dynamic Resources. Cloud resources are virtualized and service-oriented. That is, everything is expressed and exposed as a service. Due to the dynamism sweeping the whole cloud ecosystem, application versions and infrastructure are liable to change dynamically.
Performance. Clouds support application scalability and resource elasticity. However, the network distances between elements in the cloud are no longer under our control.
NEW INTEGRATION SCENARIOS
Before the cloud model, we had to stitch and tie local systems together. With the shift to the cloud model on the anvil, we now have to connect local applications to the cloud, and we also have to connect cloud applications to each other, which adds new permutations to the complex integration channel matrix. All of this means integration must criss-cross firewalls somewhere.
Cloud Integration Scenarios. We have identified three major integration
scenarios as discussed below.
Within a Public Cloud (Figure 3.1). Two different applications are hosted in a cloud. The role of the cloud integration middleware (say, a cloud-based ESB or Internet service bus (ISB)) is to seamlessly enable these applications to talk to each other. In one possible sub-scenario, the applications are owned by two different companies; they may live on a single physical server but run on different virtual machines.

FIGURE 3.1. Within a Public Cloud: App 1 and App 2 communicate through an ISB.
FIGURE 3.2. Across Homogeneous Clouds: Cloud 1 and Cloud 2 communicate through an ISB.
FIGURE 3.3. Across Heterogeneous Clouds: a public cloud and a private cloud communicate through an ISB.
Homogeneous Clouds (Figure 3.2). The applications to be integrated are posited in two geographically separated cloud infrastructures. The integration middleware can be in cloud 1, in cloud 2, or in a separate cloud. There is a need for data and protocol transformation, and it gets done by the ISB. The approach is more or less comparable to the enterprise application integration procedure.
Heterogeneous Clouds (Figure 3.3). One application is in a public cloud and the other application is in a private cloud.
THE INTEGRATION METHODOLOGIES
Excluding custom integration through hand-coding, there are three types of cloud integration:
1. Traditional enterprise integration tools can be empowered with special connectors to access cloud-located applications. This is the most likely approach for IT organizations that have already invested a lot in integration suites for their application integration needs.
2. Traditional enterprise integration tools are hosted in the cloud. This approach is similar to the first option except that the integration software suite is now hosted in a third-party cloud infrastructure, so that the enterprise does not worry about procuring and managing the hardware or installing the integration software.
3. Integration-as-a-Service (IaaS) or on-demand integration offerings. These are SaaS applications that are designed to deliver the integration service securely over the Internet and are able to integrate cloud applications with on-premise systems, as well as cloud-to-cloud applications.
In a nutshell, the integration requirements can be realised using any one of the following methods and middleware products:
1. Hosted and extended ESB (Internet service bus / cloud integration bus)
2. Online message queues, brokers, and hubs
3. Wizard and configuration-based integration platforms (niche integration solutions)
4. Integration service portfolio approach
5. Appliance-based integration (standalone or hosted)
With the emergence of the cloud space, the integration scope grows
further and hence people are looking out for robust and resilient
solutions and services that would speed up and simplify the whole
process of integration.
Characteristics of Integration Solutions and Products. The key attributes of integration platforms and backbones, gleaned and gained from integration project experience, are connectivity, semantic mediation, data mediation, integrity, security, governance, etc.:
● Connectivity refers to the ability of the integration engine to engage
with both the source and target systems using available native
interfaces.
● Semantic Mediation refers to the ability to account for the
differences between application semantics between two or more
systems.
● Data Mediation converts data from a source data format into
destination data format.
● Data Migration is the process of transferring data between storage
types, formats, or systems.
● Data Security means the ability to ensure that information extracted from the source systems is securely placed into the target systems.
● Data Integrity means data is complete and consistent. Thus, integrity
has to be guaranteed when data is getting mapped and maintained
during integration operations, such as data synchronization between
on-premise and SaaS-based systems.
● Governance refers to the processes and technologies that surround a
system or systems, which control how those systems are accessed
and leveraged.
These are the prominent qualities to be carefully and critically analyzed when deciding on cloud / SaaS integration providers.
Data Integration Engineering Lifecycle. As business data are still stored and sustained in local, on-premise servers and storage machines, a lean data integration lifecycle is imperative. The pivotal phases, as per Mr. David Linthicum, a world-renowned integration expert, are understanding, definition, design, implementation, and testing.
1. Understanding the existing problem domain means defining the
metadata that is native within the source system (say
Salesforce.com) and the target system.
2. Definition refers to the process of taking the information culled
during the previous step and defining it at a high level including
what the information represents, ownership, and physical
attributes.
3. Design the integration solution around the movement of data from one point to another, accounting for the differences in semantics using the underlying data transformation and mediation layer, by mapping one schema from the source to the schema of the target.
4. Implementation refers to actually implementing the data
integration solution within the selected technology.
5. Testing refers to assuring that the integration is properly
designed and implemented and that the data synchronizes
properly between the involved systems.
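A small Python sketch of the design and implementation phases above, mapping a source (SaaS-style) schema onto a target (legacy) schema with a simple field-level transformation; the field names and the revenue conversion are illustrative assumptions.

FIELD_MAP = {
    "Id": "crm_id",
    "Name": "customer_name",
    "AnnualRevenue": "annual_revenue_usd",
}

def transform(record):
    # Map one source record to the target schema, normalising types on the way.
    out = {}
    for src, dst in FIELD_MAP.items():
        value = record.get(src)
        if dst == "annual_revenue_usd" and value is not None:
            value = float(value)             # a tiny semantic/data mediation step
        out[dst] = value
    return out

print(transform({"Id": "001xx", "Name": "Acme", "AnnualRevenue": "125000"}))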
SaaS INTEGRATION PRODUCTS AND PLATFORMS
Cloud-centric integration solutions are being developed and demonstrated to showcase their capabilities for integrating enterprise and cloud applications. The integration puzzle has long been the toughest assignment due to heterogeneity- and multiplicity-induced complexity.
Jitterbit
Force.com is a Platform as a Service (PaaS), enabling developers to
create and deliver any kind of on-demand business application.
FIGURE 3.4. The smooth and spontaneous cloud interaction via open clouds (Salesforce, Google, Microsoft, Zoho, Amazon, Yahoo).
Until now, integrating Force.com applications with other on-demand applications and systems within an enterprise has seemed like a daunting and doughty task that required too much time, money, and expertise.
Jitterbit is a fully graphical integration solution that provides users a versatile platform and a suite of productivity tools to reduce the integration effort sharply. Jitterbit comprises two major components:
● Jitterbit Integration Environment: an intuitive point-and-click graphical UI that enables users to quickly configure, test, deploy, and manage integration projects on the Jitterbit server.
● Jitterbit Integration Server: a powerful and scalable run-time engine that processes all the integration operations, fully configurable and manageable from the Jitterbit application.
Jitterbit is making integration easier, faster, and more affordable than ever before. Using Jitterbit, one can connect Force.com with a wide variety of on-premise systems, including ERP, databases, flat files, and custom applications. Figure 3.5 vividly illustrates how Jitterbit links a number of functional and vertical enterprise systems with on-demand applications.

FIGURE 3.5. Linkage of on-premise with online and on-demand applications: functional units such as manufacturing, sales, R&D, and consumer marketing linked to on-demand applications.
Boomi Software
Boomi AtomSphere is an integration service that is completely on-demand and connects any combination of SaaS, PaaS, cloud, and on-premise applications without the burden of installing and maintaining software packages or appliances. Anyone can securely build, deploy, and manage simple to complex integration processes using only a web browser, whether connecting SaaS applications found in various lines of business or integrating across geographic boundaries.
Bungee Connect
For professional developers, Bungee Connect enables cloud computing
by offering an application development and deployment platform
that enables highly interactive applications integrating multiple data
sources and facilitating instant deployment.
OpSource Connect
OpSource Connect expands on the OpSource Services Bus (OSB) by providing the infrastructure for two-way web services interactions, allowing customers to consume and publish applications across a common web services infrastructure.
The Platform Architecture. OpSource Connect is made up of key features including:
● OpSource Services Bus
● OpSource Service Connectors
● OpSource Connect Certified Integrator Program
● OpSource Connect ServiceXchange
● OpSource Web Services Enablement Program
The OpSource Services Bus (OSB) is the foundation for OpSource's turnkey development and delivery environment for SaaS and web companies.
SnapLogic
SnapLogic is a capable, clean, and uncluttered solution for data integration that can be deployed in enterprise as well as in cloud landscapes. The free community edition can be used for the most common point-to-point data integration tasks, giving a huge productivity boost beyond custom code.
● Changing data sources. SaaS and on-premise applications, Web
APIs, and RSS feeds
● Changing deployment options. On-premise, hosted, private and
public cloud platforms
● Changing delivery needs. Databases, files, and data services
Transformation Engine and Repository. SnapLogic is a single data integration platform designed to meet data integration needs. The SnapLogic server is built on a core of connectivity and transformation components, which can be used to solve even the most complex data integration scenarios.
The SnapLogic designer provides an initial hint of the web principles at work behind the scenes. The SnapLogic server is based on the web architecture and exposes all its capabilities through web interfaces to the outside world.
The Pervasive DataCloud
The Pervasive DataCloud platform (Figure 3.6) is a unique multi-tenant platform. It provides dynamic "compute capacity in the sky" for deploying on-demand integration and other data-centric applications.

FIGURE 3.6. Pervasive Integrator connects different resources: users, customers, and SaaS applications reach the platform through a load balancer; management, scheduling, event, e-commerce, and resource services feed message queues that are serviced by multiple engine/queue-listener pairs running on a scalable computing cluster.

Pervasive DataCloud is the first multi-tenant
platform for delivering the following.
1. Integration as a Service (IaaS) for both hosted and on-premises
applications and data sources
2. Packaged turnkey integration
3. Integration that supports every integration scenario
4. Connectivity to hundreds of different applications and data
sources
Pervasive DataCloud hosts Pervasive and its partners' data-centric applications. Pervasive uses Pervasive DataCloud as a platform for deploying on-demand integration via:
● The Pervasive DataSynch family of packaged integrations. These
are highly affordable, subscription-based, and packaged integration
solutions.
● Pervasive Data Integrator. This runs on the cloud or on-premises and is a design-once, deploy-anywhere solution to support every integration scenario:
● Data migration, consolidation and conversion
● ETL / Data warehouse
● B2B / EDI integration
● Application integration (EAI)
● SaaS /Cloud integration
● SOA / ESB / Web Services
● Data Quality/Governance
● Hubs
Pervasive DataCloud provides multi-tenant, multi-application, and multi-customer deployment. Pervasive DataCloud is a platform to deploy applications that are:
● Scalable. Its multi-tenant architecture can support multiple users and applications for delivery of diverse data-centric solutions such as data integration. The applications themselves scale to handle fluctuating data volumes.
● Flexible. Pervasive DataCloud supports SaaS-to-SaaS, SaaS-to-on-premise, or on-premise-to-on-premise integration.
● Easy to Access and Configure. Customers can access, configure, and run Pervasive DataCloud-based integration solutions via a browser.
● Robust. Provides automatic delivery of updates as well as monitoring of activity by account, application, or user, allowing effortless result tracking.
● Secure. Uses the best technologies in the market coupled with the best data centers and hosting services to ensure that the service remains secure and available.
● Affordable. The platform enables delivery of packaged solutions in a SaaS-friendly pay-as-you-go model.
Bluewolf
Bluewolf has announced its expanded "Integration-as-a-Service" solution, the first to offer ongoing support of integration projects, guaranteeing successful integration between diverse SaaS solutions, such as Salesforce.com, BigMachines, eAutomate, and OpenAir, and back-office systems (e.g., Oracle, SAP, Great Plains, SQL Server, and MySQL). Called the Integrator, the solution is the only one to include proactive monitoring and consulting services to ensure integration success. With remote monitoring of integration jobs via a dashboard included as part of the Integrator solution, Bluewolf proactively alerts its customers of any issues with integration and helps to solve them quickly.
Online MQ
Online MQ is an Internet-based queuing system. It is a complete and
secure online messaging solution for sending and receiving messages
over any network. It is a cloud messaging queuing service.
● Ease of Use. It is an easy way for programs that may each be
running on different platforms, in different systems and different
networks, to communicate with each other without having to write
any low-level communication code.
● No Maintenance. No need to install any queuing software/server
and no need to be concerned with MQ server uptime, upgrades and
maintenance.
● Load Balancing and High Availability. Load balancing can be
achieved on a busy system by arranging for more than one program
instance to service a queue. The performance and availability
features are being met through clustering. That is, if one system
fails, then the second system can take care of users' requests without any delay.
● Easy Integration. Online MQ can be used as a web-service (SOAP)
and as a REST service. It is fully JMS-compatible and can hence
integrate easily with any Java EE application servers. Online MQ is
not limited to any specific platform, programming language or
communication protocol.
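Purely as an illustration of REST-style access to such a hosted queue, the Python snippet below posts and reads a message; the endpoint, paths, and payload format are hypothetical and not Online MQ's documented API.

import requests

BASE = "https://onlinemq.example.com/api"     # hypothetical endpoint

# enqueue a message
requests.post(f"{BASE}/queues/orders/messages", json={"body": "order-1001"}, timeout=10)

# dequeue up to one message
resp = requests.get(f"{BASE}/queues/orders/messages", params={"max": 1}, timeout=10)
print(resp.json())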
CloudMQ
CloudMQ leverages the power of the Amazon cloud to provide enterprise-grade message queuing capabilities on demand. Messaging allows us to reliably break up a single process into several parts which can then be executed asynchronously.
Linxter
Linxter is a cloud messaging framework for connecting all kinds of applications, devices, and systems. Linxter is a behind-the-scenes, message-oriented, cloud-based middleware technology that smoothly automates the complex tasks developers face when creating communication-based products and services.
Online MQ, CloudMQ, and Linxter all accomplish message-based application and service integration. As these suites are hosted in clouds, messaging is being provided as a service to hundreds of distributed and enterprise applications using the much-maligned multi-tenancy property. "Messaging middleware as a service (MMaaS)" is the grand derivative of the SaaS paradigm.
SaaS INTEGRATION SERVICES
We have seen the state-of-the-art cloud-based data integration platforms for real-time data sharing among enterprise information systems and cloud applications.
There are fresh endeavours to achieve service composition in the cloud ecosystem. Existing frameworks such as the service component architecture (SCA) are being revitalised to make them fit for cloud environments. Composite applications, services, data, views, and processes will become cloud-centric and hosted in order to support spatially separated and heterogeneous systems.
Informatica On-Demand
Informatica offers a set of innovative on-demand data integration
solutions called Informatica On-Demand Services. This is a cluster of
easy-to-use SaaS offerings, which facilitate integrating data in SaaS
applications, seamlessly and securely across the Internet with data in
on-premise applications. There are a few key benefits to leveraging this
maturing technology.
● Rapid development and deployment with zero maintenance of the
integration technology.
● Automatically upgraded and continuously enhanced by vendor.
● Proven SaaS integration solutions, such as integration with Salesforce.com, meaning that the connections and the metadata understanding are provided.
● Proven data transfer and translation technology, meaning that
core integration services such as connectivity and semantic
mediation are built into the technology.
Informatica On-Demand has taken the unique approach of moving
its industry leading PowerCenter Data Integration Platform to the
hosted model and then configuring it to be a true multi-tenant
solution.
Microsoft Internet Service Bus (ISB)
Azure is an upcoming cloud operating system from Microsoft. It makes developing, deploying, and delivering Web and Windows applications on cloud centers easier and more cost-effective.
Microsoft .NET Services is a set of Microsoft-built and hosted cloud infrastructure services for building Internet-enabled applications, and the ISB acts as the cloud middleware providing diverse applications with a common infrastructure to name, discover, expose, secure, and orchestrate web services. The following are the three broad areas.
.NET Service Bus. The .NET Service Bus (Figure 3.7) provides a hosted, secure, and broadly accessible infrastructure for pervasive communication, large-scale event distribution, naming, and service publishing.

FIGURE 3.7. The .NET Service Bus: end users and applications (e.g., a console application exposing Web services, applications on Windows Azure and the Azure Services Platform, or Google App Engine) communicate via the .NET Services Service Bus.

Services
can be exposed through the Service Bus Relay, providing connectivity
options for service endpoints that would otherwise be difficult or
impossible to reach.
.NET Access Control Service. The .NET Access Control Service is a
hosted, secure, standards-based infrastructure for multiparty, federated
authentication, rules-driven, and claims-based authorization.
.NET Workflow Service. The .NET Workflow Service provides a hosted environment for service orchestration based on the familiar Windows Workflow Foundation (WWF) development experience.
The most important part of Azure is actually the Service Bus, represented as a WCF architecture. The key capabilities of the Service Bus are:
● A federated namespace model that provides a shared, hierarchical namespace into which services can be mapped.
● A service registry service that provides an opt-in model for publishing service endpoints into a lightweight, hierarchical, and RSS-based discovery mechanism.
● A lightweight and scalable publish/subscribe event bus.
● A relay and connectivity service with advanced NAT traversal and pull-mode message delivery capabilities, acting as a "perimeter network (also known as DMZ, demilitarized zone, or screened subnet) in the sky."
Relay Services. Often the service we want to connect to is located behind a firewall and behind a load balancer. Its address is dynamic and can be resolved only on the local network. When the service calls back to the client, the connectivity challenges lead to scalability, availability, and security issues. The solution to these Internet connectivity challenges is, instead of connecting the client directly to the service, to use a relay service, as pictorially represented in Figure 3.8.

FIGURE 3.8. The .NET Relay Service: the client communicates with the service through the relay service.
BUSINESS-TO-BUSINESS INTEGRATION (B2Bi) SERVICES
B2Bi has been a mainstream activity for connecting geographically distributed businesses for purposeful and beneficial cooperation. Product vendors have come out with competent B2B hubs and suites for enabling smooth data sharing in a standards-compliant manner among the participating enterprises.
Just as these abilities ensure smooth communication between manufacturers and their external suppliers or customers, they also enable reliable interchange between hosted and installed applications. The IaaS model also leverages the adapter libraries developed by B2Bi vendors to provide rapid integration with various business systems.
Cloud-Based Enterprise Mashup Integration Services for B2B Scenarios.
There is a vast need for infrequent, situational, and ad hoc B2B
applications desired by the mass of business end-users.
Especially in the area of applications to support B2B collaborations,
current offerings are characterized by high richness but low reach,
like B2B hubs that focus on many features enabling electronic
collaboration but lack availability, especially for small organizations
or even individuals.
Enterprise Mashups, a kind of new-generation Web-based
applications,
seem to adequately fulfill the individual and
heterogeneous requirements of end-users and foster End User
Development (EUD).
Another challenge in B2B integration is the ownership of and
responsibility for processes. In many inter-organizational settings,
business processes are only sparsely structured and formalized, rather
loosely coupled and/or based on ad hoc cooperation. Inter-organizational
collaborations tend to involve more and more participants, and the
growing number of participants brings with it a wide range of differing
requirements.
Now, in supporting supplier and partner co-innovation and customer
co-creation, the focus is shifting to collaboration which has to embrace
participants who are influenced yet restricted by multiple domains
of control and disparate processes and practices.
Both electronic data interchange (EDI) translators and managed file
transfer (MFT) have a longer history, while B2B gateways have only
emerged during the last decade.
Enterprise Mashup Platforms and Tools.
Mashups are the adept combination of different and distributed
resources including content, data or application functionality. Resources
represent the core building blocks for mashups. Resources can be
accessed through APIs, which encapsulate the resources and describe
the interface through which they are made available. Widgets or gadgets
primarily put a face on the underlying resources by providing a
graphical representation for them and piping the data received from the
resources. Piping can include operators like aggregation, merging, or
filtering. A Mashup platform is a Web-based tool that allows the creation
of Mashups by piping resources into gadgets and wiring gadgets
together.
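As a rough illustration of piping and wiring (purely a sketch; real
platforms such as FAST expose these operations through graphical gadget
editors rather than code, and the resources and field names below are
invented):

# Illustrative sketch of mashup "piping": resources are fetched through
# their APIs, piped through operators (filter, merge), and fed to a gadget
# that renders them. Names and data are invented for illustration only.

def fetch_orders():                      # a "resource" exposed via an API
    return [{"id": 1, "status": "open"}, {"id": 2, "status": "closed"}]

def fetch_shipments():                   # another resource
    return [{"order_id": 1, "eta": "2010-06-01"}]

def pipe_filter(items, predicate):       # piping operator: filtering
    return [i for i in items if predicate(i)]

def pipe_merge(orders, shipments):       # piping operator: merging by key
    eta = {s["order_id"]: s["eta"] for s in shipments}
    return [dict(o, eta=eta.get(o["id"])) for o in orders]

def table_gadget(rows):                  # gadget: puts a "face" on the data
    for row in rows:
        print(row)

# Wiring the gadget to the piped resources:
open_orders = pipe_filter(fetch_orders(), lambda o: o["status"] == "open")
table_gadget(pipe_merge(open_orders, fetch_shipments()))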
The Mashup integration services are being implemented as a
prototype in the FAST project. The layers of the prototype are
illustrated in Figure 3.9, which describes how these services work
together. The authors of this framework have given an outlook on the
technical realization of the services using cloud infrastructures and
services.
FIGURE 3.9. Cloud-based Enterprise Mashup Integration Platform
architecture: browsers at Company A and Company B access Enterprise
Mashup platforms (e.g., FAST, SAP Research RoofTop) over HTTP; the
Mashup integration service logic runs on an integration services platform
(e.g., Google App Engine) comprising a routing engine, identity
management (OpenID/OAuth via Google), error handling and monitoring,
a translation engine, semantic services, a message queue, and persistent
storage; these in turn use cloud-based services such as Amazon SQS,
Amazon S3, and Mule onDemand, connected via REST.
To simplify this, a Gadget could be provided for the end-user. The
routing engine is also connected to a message queue via an API. Thus,
different message queue engines are attachable. The message queue is
responsible for storing and forwarding the messages controlled by the
routing engine. Beneath the message queue, a persistent storage, also
connected via an API to allow exchangeability, is available to store
large data. The error handling and monitoring service allows tracking
the message-flow to detect errors and to collect statistical data. The
Mashup integration service is hosted as a cloud-based service. Also,
there are cloud-based services available which provide the functionality
required by the integration service. In this way, the Mashup integration
service can reuse and leverage the existing cloud services to speed up
the implementation.
Message Queue. The message queue could be realized by using
Amazon's Simple Queue Service (SQS). SQS is a web service which
provides a queue for messages and stores them until they can be
processed. The Mashup integration services, especially the routing
engine, can put messages into the queue and retrieve them when they are
needed.
Persistent Storage. Amazon Simple Storage Service (S3) is also a
web service. The routing engine can use this service to store large files.
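A minimal sketch of how a routing engine might combine these two
services, written against the boto3 Python SDK (the queue URL, bucket
name, and message layout are hypothetical; the FAST prototype does not
prescribe a particular client library):

# Sketch: the routing engine stores large payloads in S3 and passes a
# lightweight reference through SQS. Queue/bucket names are examples only.
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/mashup-routing"
BUCKET = "mashup-integration-payloads"

def route_message(recipient, payload, key):
    # Large data goes to persistent storage (S3)...
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
    # ...and only a small routing message goes through the queue (SQS).
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"recipient": recipient, "s3_key": key}),
    )

def deliver_next():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
    for msg in resp.get("Messages", []):
        ref = json.loads(msg["Body"])
        body = s3.get_object(Bucket=BUCKET, Key=ref["s3_key"])["Body"].read()
        print(f"deliver to {ref['recipient']}: {len(body)} bytes")
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])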
Translation Engine. This is primarily focused on translating between
different protocols which the Mashup platforms it connects can
understand, e.g., REST or SOAP Web services. However, if the need to
translate the objects transferred arises, this capability could be attached
to the translation engine.
Interaction between the Services. The diagram describes the process of
a message being delivered and handled by the Mashup Integration
Services Platform. The precondition for this process is that a user
already established a route to a recipient.
A FRAMEWORK OF SENSOR-CLOUD INTEGRATION
In the past few years, wireless sensor networks (WSNs) have been
gaining significant attention because of their potential for enabling
novel and attractive solutions in areas such as industrial automation,
environmental monitoring, transportation business, and health care.
With the faster adoption of micro and nano technologies, everyday
things are destined to become digitally empowered and smart in their
operations and offerings. Thus the goal is to link smart materials,
appliances, devices, federated messaging middleware, enterprise
information systems and packages, ubiquitous services, handhelds, and
sensors with one another smartly to build and sustain cool, charismatic
and catalytic situation-aware applications.
Consider a virtual community consisting of a team of researchers who have
come together to solve a complex problem; they need data storage, compute
capability, and security, and they need it all provided now. For example, this
team is working on an outbreak of a new virus strain moving through a
population. This requires more than a wiki or other social organization tool.
They deploy bio-sensors on each patient's body to monitor the patient's
condition continuously and use this data for large and multi-scale simulations
to track the spread of infection as well as the virus mutation and possible cures.
This may require computational resources and a platform for sharing data
and results that are not immediately available to the team.
A traditional HPC approach like the sensor-grid model can be used in this
case, but setting up the infrastructure to deploy it so that it can scale out
quickly is not easy in this environment. The cloud paradigm, however, is
an excellent fit.
Here, the researchers need to register their interests to get various
patients' state (blood pressure, temperature, pulse rate, etc.) from
bio-sensors for large-scale parallel analysis and to share this information
with each other to find a useful solution to the problem. So the sensor
data needs to be aggregated, processed, and disseminated based on
subscriptions.
To integrate sensor networks with the cloud, the authors have proposed a
content-based pub-sub model. In this framework, as in MQTT-S, all of
the system complexities reside on the broker's side, but it differs from
MQTT-S in that it uses a content-based pub-sub broker rather than a
topic-based one, which is more suitable for the application scenarios
considered.
To deliver published sensor data or events to subscribers, an efficient
and scalable event matching algorithm is required by the pub-sub
broker.
Moreover, several SaaS applications may have an interest in the same
sensor data but for different purposes. In this case, the SA nodes would
need to manage and maintain communication with multiple applications
in parallel. This might exceed the limited capabilities of the simple and
low-cost SA devices. So a pub-sub broker is needed, and it is located on
the cloud side because of the cloud's higher performance in terms of
bandwidth and capabilities. It has four components, described as
follows:
FIGURE 3.10. The framework architecture of sensor-cloud integration:
WSNs with sensors and actuators connect through gateways to the cloud
provider (CLP), where application-specific SaaS services (e.g., a social
network of doctors monitoring patients for virus infection, an
environmental data analysis and sharing portal, an urban traffic
prediction and analysis network) consume sensor data through a pub/sub
broker comprising stream monitoring and processing, registry, analyzer,
and disseminator components, supported by a mediator, a policy
repository, a collaborating agent, a provisioning manager, servers, and
monitoring and metering services.
Stream monitoring and processing component (SMPC). The sensor
stream comes in many different forms. In some cases, it is raw data that
must be captured, filtered and analyzed on the fly and in other cases, it is
stored or cached. The style of computation required depends on the
nature of the streams. So the SMPC component running on the cloud
monitors the event streams and invokes the correct analysis method.
Depending on the data rates and the amount of processing required,
the SMPC manages a parallel execution framework on the cloud.
Registry component (RC). Different SaaS applications register with the
pub-sub broker for the various sensor data required by the community users.
Analyzer component (AC). When sensor data or events come to the pub-sub
broker, the analyzer component determines which applications they belong
to and whether they need periodic or emergency delivery.
Disseminator component (DC). For each SaaS application, it disseminates
sensor data or events to subscribed users using the event matching
algorithm. It can utilize the cloud's parallel execution framework for fast
event delivery. The pub-sub components' workflow in the framework is as
follows:
Users register their information and subscriptions with various SaaS
applications, which then transfer all this information to the pub/sub broker
registry. When sensor data reaches the system from the gateways, the
event/stream monitoring and processing component (SMPC) in the pub/sub
broker determines whether it needs processing, or should just be stored for
periodic sending, or requires immediate delivery.
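Content-based matching sits at the heart of this workflow. The sketch below
is only illustrative (the subscription predicates, attribute names, and broker
structure are assumptions, not the authors' actual algorithm):

# Illustrative content-based pub/sub broker sketch: the registry holds
# subscriptions as attribute predicates; the disseminator matches incoming
# sensor events against them. Attribute names and values are invented.

registry = []   # list of (application, predicate) pairs

def subscribe(application, predicate):
    """A SaaS application registers interest in events matching a predicate."""
    registry.append((application, predicate))

def disseminate(event):
    """Match an incoming sensor event against all subscriptions (content-based)."""
    for application, predicate in registry:
        if predicate(event):
            print(f"deliver {event} -> {application}")

# A doctors' portal subscribes to high-temperature readings from any patient.
subscribe("doctor-portal", lambda e: e["type"] == "temperature" and e["value"] > 38.5)
# A traffic application subscribes to congestion events in a given area.
subscribe("traffic-app", lambda e: e["type"] == "congestion" and e["area"] == "downtown")

disseminate({"type": "temperature", "patient": 17, "value": 39.2})  # -> doctor-portal
disseminate({"type": "congestion", "area": "suburb", "level": 3})   # no match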
Mediator. The (resource) mediator is a policy-driven entity within a VO to
ensure that the participating entities are able to adapt to changing
circumstances and are able to achieve their objectives in a dynamic and
uncertain environment.
Policy Repository (PR). The PR virtualizes all of the policies within the
VO. It includes the mediator policies, VO creation policies along with any
policies for resources delegated to the VO as a result of a collaborating
arrangement.
Collaborating Agent (CA). The CA is a policy-driven resource discovery
module for VO creation and is used as a conduit by the mediator to
exchange policy and resource information with other CLPs.
SaaS INTEGRATION APPLIANCES
Appliances are a good fit for high-performance requirements. Clouds have
gone down the same path, and today there are cloud appliances (also
termed "cloud in a box"). In this section, we look at one integration
appliance.
Cast Iron Systems. This is quite different from the above-mentioned
schemes. Appliances with the relevant software etched inside are being
established as a high-performance, hardware-centric solution for several
IT needs.
Cast Iron Systems (www.ibm.com) provides pre-configured solutions for
each of today's leading enterprise and on-demand applications. These
solutions, built using the Cast Iron product offerings, offer out-of-the-box
connectivity to specific applications and template integration processes
(TIPs) for the most common integration scenarios.
2.4 THE ENTERPRISE CLOUD COMPUTING
PARADIGM
Cloud computing is still in its early stages and is constantly undergoing
change as new vendors, offers, and services appear in the cloud market.
Enterprises will place stringent requirements on cloud providers to pave
the way for more widespread adoption of cloud computing, leading
to what is known as the enterprise cloud computing paradigm.
Enterprise cloud computing is the alignment of a cloud computing
model with an organization's business objectives (profit, return on
investment, reduction of operations costs) and processes. This chapter
explores this paradigm with respect to its motivations, objectives,
strategies and methods.
Section 4.2 describes a selection of deployment models and strategies
for enterprise cloud computing, while Section 4.3 discusses the issues of
moving [traditional] enterprise applications to the cloud. Section 4.4
describes the technical and market evolution for enterprise cloud
computing,
describing some potential opportunities for multiple
stakeholders in the provision of enterprise cloud computing.
BACKGROUND
According to NIST [1], cloud computing is composed of five essential
characteristics: on-demand self-service, broad network access, resource
pooling, rapid elasticity, and measured service. The ways in which these
characteristics are manifested in an enterprise context vary according to the
deployment model employed.
Relevant Deployment Models for Enterprise Cloud Computing
There are some general cloud deployment models that are accepted by the
majority of cloud stakeholders today, as suggested by references such as [1],
and discussed in the following:
● Public clouds are provided by a designated service provider for the general
public under a utility-based, pay-per-use consumption model.
● Private clouds are built, operated, and managed by an organization for its
internal use only, to support its business operations exclusively.
● Virtual private clouds are a derivative of the private cloud deployment
model but are further characterized by an isolated and secure segment
of resources, created as an overlay on top of public cloud infrastructure
using advanced network virtualization capabilities.
● Community clouds are shared by several organizations and support a
specific community that has shared concerns (e.g., mission, security
requirements, policy, and compliance considerations).
● Managed clouds arise when the physical infrastructure is owned by and/or
physically located in the organization's data centers, with the management
and security control plane controlled by the managed service provider.
● Hybrid clouds are a composition of two or more clouds (private,
community, or public) that remain unique entities but are bound
together by standardized or proprietary technology that enables data
and application portability (e.g., cloud bursting for load-balancing
between clouds).
Adoption and Consumption Strategies
The selection of strategies for enterprise cloud computing is critical for IT
capability as well as for the earnings and costs the organization experiences,
motivating efforts toward convergence of business strategies and IT. Some
critical questions toward this convergence in the enterprise cloud paradigm are
as follows:
● Will an enterprise cloud strategy increase overall business value?
● Are the effort and risks associated with transitioning to an enterprise
cloud strategy worth it?
● Which areas of business and IT capability should be considered for the
enterprise cloud?
● Which cloud offerings are relevant for the purposes of an organization?
● How can the process of transitioning to an enterprise cloud strategy be
piloted and systematically executed?
These questions are addressed from two strategic perspectives: (1) adoption
and (2) consumption. Figure 4.1 illustrates a framework for enterprise cloud
adoption strategies, where an organization makes a decision to adopt a
cloud computing model based on fundamental drivers for cloud computing—
scalability, availability, cost and convenience. The notion of a Cloud Data
Center (CDC) is used, where the CDC could be an external, internal or
federated provider of infrastructure, platform or software services.
An optimal adoption decision cannot be established for all cases, because the
types of resources (infrastructure, storage, software) obtained from a CDC
depend on the size of the organization, its understanding of the impact of IT
on the business, the predictability of workloads, the flexibility of the existing
IT landscape, and the available budget/resources for testing and piloting. The
strategic decisions using these four basic drivers are described in the
following, stating objectives, conditions, and actions.
FIGURE 4.1. Enterprise cloud adoption strategies using fundamental cloud
drivers. The Cloud Data Center(s) (CDC) sit at the center, approached from
four directions: scalability-driven (use of cloud resources to support
additional load or as back-up), availability-driven (use of load-balanced and
localized cloud resources to increase availability and reduce response time),
market-driven (users and providers of cloud resources make decisions based
on the potential saving and profit), and convenience-driven (use cloud
resources so that there is no need to maintain local resources).
1. Scalability-Driven Strategy. The objective is to support increasing
workloads of the organization without investment and expenses
exceeding returns.
2. Availability-Driven Strategy. Availability has close relations to scalability
but is more concerned with the assurance that IT capabilities and functions
are accessible, usable and acceptable by the standards of users.
3. Market-Driven Strategy. This strategy is more attractive and viable for
small, agile organizations that do not have (or wish to have) massive
investments in their IT infrastructure.
FIGURE 4.2. Enterprise cloud consumption strategies: (1) Software
Provision, where the cloud provides instances of software but data is
maintained within the user's data center; (2) Storage Provision, where the
cloud provides data management and software accesses the data remotely
from the user's data center; (3) Solution Provision, where software and
storage are maintained in the cloud and the user does not maintain a data
center; and (4) Redundancy Services, where the cloud is used as an
alternative to or extension of the user's data center for software and storage.
4. Convenience-Driven Strategy. The objective is to reduce the load and
need for dedicated system administrators and to make access to IT
capabilities by users easier, regardless of their location and connectivity
(e.g. over the Internet).
There are four consumption strategies identified, where the differences in
objectives, conditions, and actions reflect the decision of an organization to
trade off hosting costs, controllability, and resource elasticity of IT resources
for software and data. These are discussed in the following; a small
illustrative sketch of the trade-off appears after the list.
1. Software Provision. This strategy is relevant when the elasticity
requirement is high for software and low for data, the controllability
concerns are low for software and high for data, and the cost reduction
concerns for software are high, while cost reduction is not a priority for
data, given the high controllability concerns for data, that is, data are
highly sensitive.
2. Storage Provision. This strategy is relevant when the elasticity
requirement is high for data and low for software, while the
controllability of software is more critical than that of data. This can be the
case for data-intensive applications, where the results from processing in
the application are more critical and sensitive than the data itself.
3. Solution Provision. This strategy is relevant when the elasticity and cost
reduction requirements are high for software and data, but the
controllability requirements can be entrusted to the CDC.
4. Redundancy Services. This strategy can be considered as a hybrid
enterprise cloud strategy, where the organization switches between
traditional, software, storage or solution management based on changes
in its operational conditions and business demands.
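The following toy sketch illustrates how the elasticity and controllability
requirements map onto these strategies (the thresholds, labels, and ordering
are assumptions made for illustration, not a prescribed decision procedure):

# Illustrative sketch only: a toy mapping from elasticity / controllability
# requirements (per software and data) to one of the four consumption
# strategies described above. Labels and rules are assumptions.

def consumption_strategy(elasticity, controllability):
    """elasticity / controllability: dicts with 'software' and 'data' in {'low','high'}."""
    if elasticity["software"] == "high" and elasticity["data"] == "high":
        return "Solution Provision"      # both software and data entrusted to the CDC
    if elasticity["software"] == "high" and controllability["data"] == "high":
        return "Software Provision"      # software in the cloud, sensitive data kept in-house
    if elasticity["data"] == "high" and controllability["software"] == "high":
        return "Storage Provision"       # data in the cloud, critical software kept in-house
    return "Redundancy Services"         # switch between modes as conditions change

print(consumption_strategy({"software": "high", "data": "low"},
                           {"software": "low", "data": "high"}))   # -> Software Provision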
Even though an organization may find a strategy that appears to provide it
significant benefits, this does not mean that immediate adoption of the strategy
is advised or that the returns on investment will be observed immediately.
ISSUES FOR ENTERPRISE APPLICATIONS ON THE CLOUD
Enterprise Resource Planning (ERP) is the most comprehensive definition of
enterprise application today. For these reasons, ERP solutions have emerged as
the core of successful information management and the enterprise
backbone of nearly any organization. Organizations that have successfully
implemented ERP systems are reaping the benefits of an integrated
working environment, standardized processes, and operational benefits to
the organization.
One of the first issues is that of infrastructure availability. Al-Mashari and
Yasser argued that adequate IT infrastructure, hardware, and networking are
crucial for an ERP system's success.
One of the ongoing discussions concerning future scenarios considers varying
infrastructure requirements and constraints given different workloads and
development phases. Recent surveys among companies in North America
and Europe with enterprise-wide IT systems showed that nearly all kinds of
workloads are seen to be suitable to be transferred to IaaS offerings.
Considering Transactional and Analytical Capabilities
Transactional applications, or so-called OLTP (On-Line Transaction
Processing) applications, refer to a class of systems that manage
transaction-oriented workloads, typically using relational databases. These
applications rely on strong ACID (atomicity, consistency, isolation,
durability) properties and are relatively write/update-intensive. Typical
OLTP-type ERP components are sales and distribution (SD), banking and
financials, customer relationship management (CRM), and supply chain
management (SCM).
One can conclude that analytical applications will benefit more than their
transactional counterparts from the opportunities created by cloud computing,
especially on compute elasticity and efficiency.
2.4.1 TRANSITION CHALLENGES
The very concept of cloud represents a leap from the traditional approach for
IT to deliver mission-critical services. With any leap comes a gap of risks and
challenges to overcome. These challenges can be classified into five different
categories, which are the five aspects of the enterprise cloud stages: build,
develop, migrate, run, and consume (Figure 4.3).
The requirement for a company-wide cloud approach should then become
the number one priority of the CIO, especially when it comes to having a
coherent and cost effective development and migration of services on this
architecture.
FIGURE 4.3. The five stages of the cloud: build, develop, migrate, run, and
consume.
A second challenge is the migration of existing or "legacy" applications to
"the cloud." The expected average lifetime of an ERP product is about 15
years, which means that companies will need to face this aspect sooner rather
than later as they try to evolve toward the new IT paradigm.
The ownership of enterprise data, together with the integration of
applications in and from outside the cloud, is one of the key challenges.
Future enterprise application development frameworks will need to enable
the separation of data management from ownership. From this, it can be
extrapolated that SOA, as a style, underlies the architecture and, moreover,
the operation of the enterprise cloud.
One of these has been notoriously hard to upgrade: the human factor;
bringing staff up to speed on the requirements of cloud computing with respect
to architecture, implementation, and operation has always been a tedious task.
Once the IT organization has either been upgraded to provide cloud or is
able to tap into cloud resources, it faces the difficulty of maintaining the
services in the cloud. The first challenge is to maintain interoperability
between the in-house infrastructure and services and the CDC (Cloud Data
Center).
Before leveraging such features, much more basic functionalities are
problematic: monitoring, troubleshooting, and comprehensive capacity
planning are actually missing in most offers. Without such features it becomes
very hard to gain visibility into the return on investment and the consumption
of cloud services.
Today there are two major cloud pricing models: allocation-based and
usage-based. The first one is provided by the poster child of cloud computing,
namely Amazon. The principle relies on the allocation of resources for a fixed
amount of time. As companies evaluate the offers, they also need to include
hidden costs such as lost IP, risk, migration, delays, and provider overheads.
This combination can be compared to trying to choose a new mobile phone
together with a carrier plan. The market dynamics will hence evolve alongside
the technology for the enterprise cloud computing paradigm.
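As a rough, invented illustration of how the two pricing models differ in cost
behavior (the prices and hours below are made-up numbers, not any
provider's actual rates):

# Toy comparison of the two pricing models (all numbers are invented):
# allocation-based charges for reserved capacity per hour regardless of use;
# usage-based charges only for what is actually consumed.

HOURS_PER_MONTH = 730

def allocation_based(instances, price_per_instance_hour):
    return instances * price_per_instance_hour * HOURS_PER_MONTH

def usage_based(consumed_instance_hours, price_per_instance_hour):
    return consumed_instance_hours * price_per_instance_hour

# 10 reserved instances vs. an actual consumption of 2,500 instance-hours:
print(allocation_based(10, 0.10))    # 730.0  (pay for idle capacity too)
print(usage_based(2500, 0.12))       # 300.0  (higher unit price, pay per use)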
ENTERPRISE CLOUD TECHNOLOGY AND MARKET EVOLUTION
This section discusses the potential factors that will influence the evolution of
cloud computing and today's enterprise landscapes toward the enterprise
cloud computing paradigm, featuring the convergence of business and IT and
an open, service-oriented marketplace.
Technology Drivers for Enterprise Cloud Computing Evolution
This will put pressure on cloud providers to build their offering on open
interoperable standards to be considered as a candidate by enterprises. There
have been a number of initiatives emerging in this space; however, major
providers such as Amazon, Google, and Microsoft currently do not actively
participate in these efforts, and true interoperability across the board in the
near future seems unlikely. If achieved, however, it could facilitate advanced
scenarios and thus drive mainstream adoption of the enterprise cloud
computing paradigm.
Part of preserving investments is maintaining the assurance that cloud
resources and services powering the business operations perform according
to the business requirements. Underperforming resources or service disruptions
lead to business and financial loss, reduced business credibility, reputation,
and marginalized user productivity. Another important factor in this regard is
the lack of insight into the performance and health of the resources and
services deployed on the cloud, so this is another area of technology evolution
that will be pushed.
This would prove to be a critical capability empowering third-party
organizations to act as independent auditors especially with respect to SLA
compliance auditing and for mediating the SLA penalty related issues.
An emerging trend in the cloud application space is the divergence from the
traditional RDBMS-based data store backend. Cloud computing has given rise
to alternative data storage technologies (Amazon Dynamo, Facebook
Cassandra, Google BigTable, etc.) based on key-value storage models, as
compared to the relational model, which has been the mainstream choice for
data storage for enterprise applications.
As these technologies evolve into maturity, the PaaS market will consolidate
into a smaller number of service providers. Moreover, big traditional software
vendors will also join this market which will potentially trigger this
consolidation through acquisitions and mergers. These views are along the
lines of the research published by Gartner. Gartner predicts that from 2011 to
2015, market competition and maturing developer practices will drive
consolidation around a small group of industry-dominant cloud technology
providers.
A recent report published by Gartner presents an interesting perspective on
cloud evolution. The report argues that, as cloud services proliferate, services
will become too complex to be handled directly by consumers. To cope
with these scenarios, meta-services or cloud brokerage services will emerge.
These brokerages will use several types of brokers and platforms to enhance
service delivery and, ultimately, service value. According to Gartner, before
these scenarios can be enabled, there is a need for brokerage businesses to use
these brokers and platforms. Gartner foresees the following types of cloud
service brokerages (CSBs):
● Cloud Service Intermediation. An intermediation broker provides a
service that directly enhances a given service delivered to one or more service
consumers, essentially adding value on top of a given service to enhance a
specific capability.
● Aggregation. An aggregation brokerage service combines multiple
services into one or more new services.
● Cloud Service Arbitrage. These services will provide flexibility and
opportunistic choices for the service aggregator.
The above shows that there is potential for various large, medium, and
small organizations to become players in the enterprise cloud marketplace.
The dynamics of such a marketplace are still to be explored as the enabling
technologies and standards continue to mature.
BUSINESS DRIVERS TOWARD A MARKETPLACE FOR
ENTERPRISE CLOUD COMPUTING
In order to create an overview of offerings and consuming players on the
market, it is important to understand the forces on the market and motivations
of each player.
The Porter model consists of five influencing factors/views (forces) on the
market (Figure 4.4). The intensity of rivalry on the market is traditionally
influenced by industry-specific characteristics:
● Rivalry: The number of companies dealing with cloud and virtualization
technology is quite high at the moment; this might be a sign of high
rivalry. But the products and offers are also quite varied, so many niche
products tend to become established.
FIGURE 4.4. Porter's five forces market model (adjusted for the cloud
market): the cloud market (cost structure, product/service ranges,
differentiation and strategy, number/size of players) is shaped by new market
entrants (geographical factors, entrant strategy, routes to market), suppliers
(level of quality, supplier size, bidding processes/capabilities),
buyers/consumers (buyer size, number of buyers, product/service
requirements), and technology development (substitutes, trends, legislative
effects).
● Obviously, the cloud-virtualization market is presently booming and will
keep growing during the next years. Therefore the fight for customers and
struggle for market share will begin once the market becomes saturated
and companies start offering comparable products.
● The initial costs for huge data centers are enormous. By building up
federations of computing and storing utilities, smaller companies can try
to make use of this scale effect as well.
● Low switching costs or high exit barriers influence rivalry. When a
customer can freely switch from one product to another, there is a greater
struggle to capture customers. From the opposite point of view, high exit
barriers discourage customers from buying into a new technology. The trends
toward standardization of formats and architectures try to address this
problem. Most current cloud providers are only paying
attention to standards related to the interaction with the end user.
However, standards for clouds interoperability are still to be developed .
FIGURE 4.5. Dynamic business models (based on [49], extended by influence
factors identified by [50]): the business model is influenced by the market,
regulations, the hype cycle phase, and technology.
THE CLOUD SUPPLY CHAIN
One indicator of what such a business model would look like is in the complexity
of deploying, securing, interconnecting and maintaining enterprise landscapes
and solutions such as ERP, as discussed in Section 4.3. The concept of a Cloud
Supply Chain (C-SC) and hence Cloud Supply Chain Management (C-SCM)
appear to be viable future business models for the enterprise cloud computing
paradigm. The idea of C-SCM represents the management of a network of
interconnected businesses involved in the end-to-end provision of product and
service packages required by customers. The established understanding of a
supply chain is two or more parties linked by a flow of goods, information,
and funds [55, 56]. A specific definition for a C-SC is hence: "two or more
parties linked by the provision of cloud services, related information,
and funds." Figure 4.6 represents a concept for the C-SC, showing the flow
of products along different organizations such as hardware suppliers, software
component suppliers, data center operators, distributors and the end customer.
Figure 4.6 also makes a distinction between innovative and functional
products in the C-SC. Fisher classifies products primarily on the basis of their
demand patterns into two categories: primarily functional or primarily
innovative [57]. Due to their stability, functional products favor competition,
which leads to low profit margins and, as a consequence of their properties, to
low inventory costs, low product variety, low stockout costs, and low
obsolescence [58], [57]. Innovative products are characterized by additional
(other) reasons for a customer in addition to basic needs that lead to purchase,
unpredictable demand (that is high uncertainties, difficult to forecast and
variable demand), and short product life cycles (typically 3 months to 1
year).
FIGURE 4.6. Cloud supply chain (C-SC): hardware suppliers and (software)
component suppliers feed data center operators and distributors, which serve
the end customer; cloud services, information, and funds flow along the
chain, which carries both functional and innovative products, with potential
closed-loop cooperation. Cloud services
should fulfill basic needs of customers and favor competition due to their
reproducibility. Table 4.1 presents a comparison of traditional supply chain
concepts, such as the efficient SC and the responsive SC, with a new concept
for emerging ICT, namely the cloud SC, with cloud services as the traded
good.
TABLE 4.1. Comparison of Traditional and Emerging ICT Supply Chains
(based on references 54 and 57)
Primary goal. Efficient SC: supply demand at the lowest level of cost.
Responsive SC: respond quickly to demand (changes). Cloud SC: supply
demand at the lowest level of cost and respond quickly to demand.
Product design strategy. Efficient SC: maximize performance at the minimum
product cost. Responsive SC: create modularity to allow postponement of
product differentiation. Cloud SC: create modularity to allow individual
setting while maximizing the performance of services.
Pricing strategy. Efficient SC: lower margins, because price is a prime
customer driver. Responsive SC: higher margins, because price is not a prime
customer driver. Cloud SC: lower margins, as there is high competition and
comparable products.
Manufacturing strategy. Efficient SC: lower costs through high utilization.
Responsive SC: maintain capacity flexibility to meet unexpected demand.
Cloud SC: high utilization while reacting flexibly to demand.
Supplier strategy. Efficient SC: select based on cost and quality. Responsive
SC: select based on speed, flexibility, and quantity. Cloud SC: select on a
complex optimum of speed, cost, and flexibility.
Inventory strategy. Efficient SC: minimize inventory to lower cost.
Responsive SC: maintain buffer inventory to meet unexpected demand.
Cloud SC: optimize buffers for unpredicted demand and best utilization.
Lead time strategy. Efficient SC: reduce, but not at the expense of costs.
Responsive SC: aggressively reduce, even if the costs are significant. Cloud
SC: strong service-level agreements (SLAs) for ad hoc provision.
Transportation strategy. Efficient SC: greater reliance on low-cost modes.
Responsive SC: greater reliance on responsive modes. Cloud SC: implement
highly responsive and low-cost modes.
INTRODUCTION TO CLOUD
COMPUTING
CLOUD COMPUTING IN A NUTSHELL
Computing itself, to be considered fully virtualized, must allow computers to
be built from distributed components such as processing, storage, data, and
software resources.
Technologies such as cluster, grid, and now, cloud computing, have all
aimed at allowing access to large amounts of computing power in a fully
virtualized manner, by aggregating resources and offering a single system
view. Utility computing describes a business model for on-demand delivery of
computing power; consumers pay providers based on usage ("pay-as-you-go"), similar to the way in which we currently obtain services from traditional
public utility services such as water, electricity, gas, and telephony.
Cloud computing has been coined as an umbrella term to describe a
category of sophisticated on-demand computing services initially offered by
commercial providers, such as Amazon, Google, and Microsoft. It denotes a
model in which a computing infrastructure is viewed as a "cloud," from which
businesses and individuals access applications from anywhere in the world on
demand. The main principle behind this model is offering computing, storage,
and software ―as a service.‖
Many practitioners in the commercial and academic spheres have attempted
to define exactly what "cloud computing" is and what unique characteristics it
presents. Buyya et al. have defined it as follows: "Cloud is a parallel and
distributed computing system consisting of a collection of inter-connected
and virtualised computers that are dynamically provisioned and presented as one
or more unified computing resources based on service-level agreements (SLA)
established through negotiation between the service provider and consumers."
Vaquero et al. have stated "clouds are a large pool of easily usable and
accessible virtualized resources (such as hardware, development platforms
and/or services). These resources can be dynamically reconfigured to adjust
to a variable load (scale), allowing also for an optimum resource utilization.
This pool of resources is typically exploited by a pay-per-use model in which
guarantees are offered by the Infrastructure Provider by means of customized
Service Level Agreements."
A recent McKinsey and Co. report claims that "Clouds are hardware-based
services offering compute, network, and storage capacity where: hardware
management is highly abstracted from the buyer, buyers incur infrastructure
costs as variable OPEX, and infrastructure capacity is highly elastic."
A report from the University of California, Berkeley, summarized the key
characteristics of cloud computing as: "(1) the illusion of infinite computing
resources; (2) the elimination of an up-front commitment by cloud users; and
(3) the ability to pay for use . . . as needed . . ."
The National Institute of Standards and Technology (NIST) characterizes
cloud computing as ". . . a pay-per-use model for enabling available,
convenient, on-demand network access to a shared pool of configurable
computing resources (e.g., networks, servers, storage, applications, services)
that can be rapidly provisioned and released with minimal management effort
or service provider interaction."
In a more generic definition, Armbrust et al. define a cloud as the "data
center hardware and software that provide services." Similarly, Sotomayor
et al. point out that "cloud" is more often used to refer to the IT infrastructure
deployed on an Infrastructure-as-a-Service provider data center. While there are
countless other definitions, there seem to be common characteristics among
the most notable ones listed above, which a cloud should have: (i) pay-per-use
(no ongoing commitment, utility prices); (ii) elastic capacity and the illusion of
infinite resources; (iii) a self-service interface; and (iv) resources that are
abstracted or virtualised.
ROOTS OF CLOUD COMPUTING
We can track the roots of cloud computing by observing the advancement of
several technologies, especially in hardware (virtualization, multi-core chips),
Internet technologies (Web services, service-oriented architectures, Web 2.0),
distributed computing (clusters, grids), and systems management (autonomic
computing, data center automation). Figure 1.1 shows the convergence of
technology fields that significantly advanced and contributed to the advent
of cloud computing.
Some of these technologies have been tagged as hype in their early stages
of development; however, they later received significant attention from
academia and were sanctioned by major industry players. Consequently, a
specification and standardization process followed, leading to maturity and
wide adoption. The emergence of cloud computing itself is closely linked to
the maturity of such technologies. We present a closer look at the technologies
that form the base of cloud computing, with the aim of providing a clearer
picture of the cloud ecosystem as a whole.
From Mainframes to Clouds
We are currently experiencing a switch in the IT world, from in-house
generated computing power into utility-supplied computing resources delivered
over the Internet as Web services. This trend is similar to what occurred about a
century ago when factories, which used to generate their own electric power,
realized that it was cheaper to just plug their machines into the newly
formed electric power grid.
Computing delivered as a utility can be defined as "on demand delivery of
infrastructure, applications, and business processes in a security-rich, shared,
scalable, and standards-based computer environment over the Internet for a fee."
FIGURE 1.1. Convergence of various advances leading to the advent of
cloud computing: hardware (hardware virtualization, multi-core chips),
Internet technologies (Web services, Web 2.0, mashups, SOA), distributed
computing (utility and grid computing), and systems management
(autonomic computing, data center automation).
This model brings benefits to both consumers and providers of IT services.
Consumers can attain a reduction in IT-related costs by choosing to obtain
cheaper services from external providers, as opposed to heavily investing in IT
infrastructure and personnel hiring. The "on-demand" component of this
model allows consumers to adapt their IT usage to rapidly increasing or
unpredictable computing needs.
Providers of IT services achieve better operational costs; hardware and
software infrastructures are built to provide multiple solutions and serve many
users, thus increasing efficiency and ultimately leading to faster return on
investment (ROI) as well as lower total cost of ownership (TCO).
The mainframe era collapsed with the advent of fast and inexpensive
microprocessors and IT data centers moved to collections of commodity servers.
The advent of increasingly fast fiber-optics networks has relit the fire, and
new technologies for enabling sharing of computing power over great distances
have appeared.
SOA, Web Services, Web 2.0, and Mashups
Web services:
• connect applications running on different messaging product platforms
• enable information from one application to be made available to others
• enable internal applications to be made available over the Internet
SOA:
• addresses the requirements of loosely coupled, standards-based, and
protocol-independent distributed computing (WS, HTTP, XML)
• provides a common mechanism for delivering services
• treats an application as a collection of services that together perform
complex business logic
• serves as a building block in IaaS
• typical services: user authentication, payroll management, calendar
Grid Computing
Grid computing enables the aggregation of distributed resources and
transparent access to them. Most production grids such as TeraGrid and EGEE seek to
share compute and storage resources distributed across different administrative
domains, with their main focus being speeding up a broad range of scientific
applications, such as climate modeling, drug design, and protein analysis.
Globus Toolkit is a middleware that implements several standard Grid
services and over the years has aided the deployment of several service-oriented
Grid infrastructures and applications. An ecosystem of tools is available to
interact with service grids, including grid brokers, which facilitate user
interaction with multiple middleware and implement policies to meet QoS
needs.
Virtualization technology has been identified as the perfect fit for issues that
have caused frustration when using grids, such as hosting many dissimilar
software applications on a single physical platform. In this direction, several
research projects have explored combining virtual machines with grid
middleware.
Utility Computing
In utility computing environments, users assign a ―utility‖ value to their jobs,
where utility is a fixed or time-varying valuation that captures various QoS
constraints (deadline, importance, satisfaction). The valuation is the amount
they are willing to pay a service provider to satisfy their demands. The service
providers then attempt to maximize their own utility, where said utility may
directly correlate with their profit. Providers can choose to prioritize high yield
(i.e., profit per unit of resource) user jobs, leading to a scenario where shared
systems are viewed as a marketplace, where users compete for resources based
on the perceived utility or value of their jobs.
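As a toy illustration of this market view (the jobs, utility values, and the
yield-based admission rule below are invented for the example, not taken
from any specific provider):

# Sketch: a provider with limited capacity admits the jobs with the highest
# yield (user-declared utility per unit of resource). All numbers are invented.

jobs = [
    {"name": "render",    "utility": 90.0,  "cpu_hours": 30},
    {"name": "batch-etl", "utility": 40.0,  "cpu_hours": 5},
    {"name": "sim",       "utility": 200.0, "cpu_hours": 120},
]

CAPACITY = 60  # CPU-hours available in this scheduling window

def admit(jobs, capacity):
    # Sort by yield = utility per unit of resource, then greedily fill capacity.
    ranked = sorted(jobs, key=lambda j: j["utility"] / j["cpu_hours"], reverse=True)
    admitted, used = [], 0
    for job in ranked:
        if used + job["cpu_hours"] <= capacity:
            admitted.append(job["name"])
            used += job["cpu_hours"]
    return admitted

print(admit(jobs, CAPACITY))   # ['batch-etl', 'render'] -> highest yield first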
Hardware Virtualization
The idea of virtualizing a computer system‘s resources, including processors,
memory, and I/O devices, has been well established for decades, aiming at
improving the sharing and utilization of computer systems. Hardware
virtualization allows running multiple operating systems and software stacks on
a single physical platform. As depicted in Figure 1.2, a software layer, the
virtual machine monitor (VMM), also called a hypervisor, mediates access to
the physical hardware, presenting to each guest operating system a virtual
machine (VM), which is a set of virtual platform interfaces.
FIGURE 1.2. A hardware-virtualized server hosting three virtual machines,
each running a distinct operating system and user-level software stack (for
example, VM 1 with an email server, database, and Web server; VM 2 with a
Facebook app on Ruby on Rails and a Java application on Linux; VM N with
its own guest OS and applications App A/B and App X/Y), all mediated by
the virtual machine monitor (hypervisor) running on the hardware.
Workload isolation is achieved since all program instructions are fully
confined inside a VM, which leads to improvements in security. Better
reliability is also achieved because software failures inside one VM do not
affect others. Moreover, better performance control is attained since the
execution of one VM should not affect the performance of another VM.
VMware ESXi. VMware is a pioneer in the virtualization market. Its ecosystem
of tools ranges from server and desktop virtualization to high-level
management tools. ESXi is a VMM from VMware. It is a bare-metal
hypervisor, meaning that it installs directly on the physical server, whereas
others may require a host operating system.
Xen. The Xen hypervisor started as an open-source project and has served as a
base for other virtualization products, both commercial and open-source. In
addition to an open-source distribution, Xen currently forms the base of
commercial hypervisors from a number of vendors, most notably Citrix
XenServer and Oracle VM.
KVM. The kernel-based virtual machine (KVM) is a Linux virtualization
subsystem. It has been part of the mainline Linux kernel since version 2.6.20,
thus being natively supported by several distributions. In addition, activities
such as memory management and scheduling are carried out by existing kernel
features, thus making KVM simpler and smaller than hypervisors that take
control of the entire machine.
KVM leverages hardware-assisted virtualization, which improves
performance and allows it to support unmodified guest operating systems;
currently, it supports several versions of Windows, Linux, and UNIX.
Virtual Appliances and the Open Virtualization
Format
An application combined with the environment needed to run it (operating
system, libraries, compilers, databases, application containers, and so forth) is
referred to as a ―virtual appliance.‖ Packaging application environments in the
shape of virtual appliances eases software customization, configuration, and
patching and improves portability. Most commonly, an appliance is shaped as
a VM disk image associated with hardware requirements, and it can be readily
deployed in a hypervisor.
With a multitude of hypervisors, where each one supports a different VM image
format and the formats are incompatible with one another, a great deal of
interoperability issues arise. For instance, Amazon has its Amazon Machine
Image (AMI) format, made popular on the Amazon EC2 public cloud. Other
formats are used by Citrix XenServer, several Linux distributions that ship with
KVM, Microsoft Hyper-V, and VMware ESX.
OVF's extensibility has encouraged additions relevant to management of
data centers and clouds. Mathews et al. have devised virtual machine contracts
(VMC) as an extension to OVF. A VMC aids in communicating and managing
the complex expectations that VMs have of their runtime environment and vice
versa.
Autonomic Computing
The increasing complexity of computing systems has motivated research on
autonomic computing, which seeks to improve systems by decreasing human
involvement in their operation. In other words, systems should manage
themselves, with high-level guidance from humans.
In this sense, the concepts of autonomic computing inspire software
technologies for data center automation, which may perform tasks such as:
management of service levels of running applications; management of data
center capacity; proactive disaster recovery; and automation of VM
provisioning.
LAYERS AND TYPES OF CLOUDS
Cloud computing services are divided into three classes, according to the
abstraction level of the capability provided and the service model of providers,
namely: (1) Infrastructure as a Service, (2) Platform as a Service, and (3) Software
as a Service. Figure 1.3 depicts the layered organization of the cloud stack
from physical infrastructure to applications.
These abstraction levels can also be viewed as a layered architecture where
services of a higher layer can be composed from services of the underlying
layer.
Infrastructure as a Service
Offering virtualized resources (computation, storage, and communication) on
demand is known as Infrastructure as a Service (IaaS).
FIGURE 1.3. The cloud computing stack: at the SaaS layer, cloud applications
(social networks, office suites, CRM, video processing) are accessed mainly
through a Web browser; at the PaaS layer, the cloud platform (programming
languages, frameworks, mashup editors, structured data) is accessed through a
development environment; at the IaaS layer, the cloud infrastructure (compute
servers, data storage, firewall, load balancer) is managed through a virtual
infrastructure manager.
A cloud infrastructure enables on-demand provisioning of servers running
several choices of operating systems and a customized software stack.
Infrastructure services are considered to be the bottom layer of cloud
computing systems.
Platform as a Service
In addition to infrastructure-oriented clouds that provide raw computing and
storage services, another approach is to offer a higher level of abstraction to
make a cloud easily programmable, known as Platform as a Service (PaaS).
Google App Engine, an example of Platform as a Service, offers a scalable
environment for developing and hosting Web applications, which should
be written in specific programming languages such as Python or Java and use
the service's own proprietary structured object data store.
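For instance, a minimal Web application for the classic App Engine Python
runtime looks roughly like the sketch below (assuming the webapp2
framework bundled with that runtime; the handler and route names are
arbitrary, and deployment details such as app.yaml are omitted):

# Minimal sketch of a Google App Engine (classic Python runtime) Web
# application using the bundled webapp2 framework; the platform supplies
# scaling, the data store, and request routing, so only application code is written.
import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers["Content-Type"] = "text/plain"
        self.response.write("Hello from a PaaS-hosted application!")

# App Engine discovers this WSGI application through its configuration file.
app = webapp2.WSGIApplication([("/", MainPage)], debug=True)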
Software as a Service
Applications reside on the top of the cloud stack. Services provided by this
layer can be accessed by end users through Web portals. Therefore, consumers
are increasingly shifting from locally installed computer programs to on-line
software services that offer the same functionality. Traditional desktop
applications such as word processing and spreadsheet can now be accessed as a
service in the Web.
Deployment Models
Cloud computing has emerged mainly from the appearance of public
computing utilities. In this sense, regardless of its service class, a cloud can be
classified as public, private, community, or hybrid based on its model of
deployment, as shown in Figure 1.4.
FIGURE 1.4. Types of clouds based on deployment models: public/Internet
clouds (third-party, multi-tenant cloud infrastructure and services, available
on a subscription basis, pay as you go); private/enterprise clouds (a cloud
computing model run within a company's own data center/infrastructure for
internal and/or partner use); and hybrid/mixed clouds (mixed usage of private
and public clouds, e.g., leasing public cloud services when private cloud
capacity is insufficient).
Armbrust et al. propose definitions for a public cloud as a "cloud made
available in a pay-as-you-go manner to the general public" and a private cloud
as the "internal data center of a business or other organization, not made
available to the general public."
A community cloud is "shared by several organizations and supports a
specific community that has shared concerns (e.g., mission, security
requirements, policy, and compliance considerations)."
A hybrid cloud takes shape when a private cloud is supplemented with
computing capacity from public clouds. The approach of temporarily renting
capacity to handle spikes in load is known as "cloud-bursting."
DESIRED FEATURES OF A CLOUD
Certain features of a cloud are essential to enable services that truly represent
the cloud computing model and satisfy expectations of consumers, and cloud
offerings must be (i) self-service, (ii) per-usage metered and billed, (iii) elastic,
and (iv) customizable.
Self-Service
Consumers of cloud computing services expect on-demand, nearly instant
access to resources. To support this expectation, clouds must allow self-service
access so that customers can request, customize, pay for, and use services
without the intervention of human operators.
Per-Usage Metering and Billing
Cloud computing eliminates up-front commitment by users, allowing them to
request and use only the necessary amount. Services must be priced on a
short-term basis (e.g., by the hour), allowing users to release (and not pay for)
resources as soon as they are not needed.
Elasticity
Cloud computing gives the illusion of infinite computing resources available on
demand. Therefore, users expect clouds to rapidly provide resources in any
quantity at any time. In particular, it is expected that the additional resources
can be (a) provisioned, possibly automatically, when an application load
increases and (b) released when load decreases (scale up and down).
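A threshold-based scaling rule is one simple way to realize this behavior (a
sketch only; the thresholds, metric, and provisioning calls are placeholders,
not a specific provider's API):

# Sketch of a threshold-based elasticity rule: add capacity when average
# load is high, release it when load is low. The provisioning functions
# are placeholders standing in for a real IaaS API.

def provision_instance():  print("provisioning one more instance")
def release_instance():    print("releasing one instance")

def autoscale(avg_cpu_utilization, current_instances,
              scale_up_at=0.75, scale_down_at=0.25, min_instances=1):
    if avg_cpu_utilization > scale_up_at:
        provision_instance()                       # scale up under load
        return current_instances + 1
    if avg_cpu_utilization < scale_down_at and current_instances > min_instances:
        release_instance()                         # scale down and stop paying
        return current_instances - 1
    return current_instances

instances = 2
instances = autoscale(0.90, instances)   # -> 3 (load spike)
instances = autoscale(0.10, instances)   # -> 2 (load dropped)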
Customization
In a multi-tenant cloud a great disparity between user needs is often the case.
Thus, resources rented from the cloud must be highly customizable. In the case
of infrastructure services, customization means allowing users to deploy
specialized virtual appliances and to be given privileged (root) access to the
virtual servers. Other service classes (PaaS and SaaS) offer less flexibility and
are not suitable for general-purpose computing, but are still expected to
provide a certain level of customization.
CLOUD INFRASTRUCTURE MANAGEMENT
A key challenge IaaS providers face when building a cloud infrastructure is
managing physical and virtual resources, namely servers, storage, and
networks, in a holistic fashion. The orchestration of resources must be
performed in a way that rapidly and dynamically provisions resources to
applications.
The availability of a remote cloud-like interface and the ability to manage
many users and their permissions are the primary features that would
distinguish "cloud toolkits" from "VIMs" (virtual infrastructure managers).
However, in this chapter, we place both categories of tools under the same
group (of the VIMs) and, when applicable, we highlight the availability of a
remote interface as a feature.
Virtually all VIMs we investigated present a set of basic features related to
managing the life cycle of VMs, including networking groups of VMs together
and setting up virtual disks for VMs. These basic features pretty much define
whether a tool can be used in practical cloud deployments or not. On the other
hand, only a handful of software products present advanced features (e.g., high
availability) which allow them to be used in large-scale production clouds.
Features
We now present a list of both basic and advanced features that are usually
available in VIMs.
Virtualization Support. The multi-tenancy aspect of clouds requires multiple
customers with disparate requirements to be served by a single hardware
infrastructure.
Self-Service, On-Demand Resource Provisioning. Self-service access to
resources has been perceived as one of the most attractive features of clouds.
This feature enables users to directly obtain services from clouds.
Multiple Backend Hypervisors. Different virtualization models and tools offer
different benefits, drawbacks, and limitations. Thus, some VI managers
provide a uniform management layer regardless of the virtualization
technology used.
Storage Virtualization. Virtualizing storage means abstracting logical storage
from physical storage. By consolidating all available storage devices in a data
center, it allows creating virtual disks independent from device and location.
In the VI management sphere, storage virtualization support is often
restricted to commercial products of companies such as VMWare and Citrix.
Other products feature ways of pooling and managing storage devices, but
administrators are still aware of each individual device.
Interface to Public Clouds. Researchers have perceived that extending the
capacity of a local in-house computing infrastructure by borrowing resources
from public clouds is advantageous. In this fashion, institutions can make good
use of their available resources and, in case of spikes in demand, extra load can
be offloaded to rented resources .
Virtual Networking. Virtual networks allow creating an isolated network on
top of a physical infrastructure independently from physical topology and
locations. A virtual LAN (VLAN) allows isolating traffic that shares a
switched network, allowing VMs to be grouped into the same broadcast
domain.
Dynamic Resource Allocation. Increased awareness of energy consumption in
data centers has encouraged the practice of dynamically consolidating VMs onto
a smaller number of servers. In cloud infrastructures, where applications
have variable and dynamic needs, capacity management and demand
prediction are especially complicated. This fact triggers the need for dynamic
resource allocation aiming at obtaining a timely match of supply and
demand.
Virtual Clusters. Several VI managers can holistically manage groups of VMs.
This feature is useful for provisioning virtual clusters of compute nodes on demand,
as well as interconnected VMs for multi-tier Internet applications.
Reservation and Negotiation Mechanism. When users request computational
resources to be available at a specific time, requests are termed advance
reservations (AR), in contrast to best-effort requests, where users request
resources whenever available.
Additionally, leases may be negotiated and renegotiated, allowing provider
and consumer to modify a lease or present counter proposals until an
agreement is reached.
High Availability and Data Recovery. The high availability (HA) feature of VI
managers aims at minimizing application downtime and preventing business
disruption.
For mission critical applications, when a failover solution involving
restarting VMs does not suffice, additional levels of fault tolerance that rely on
redundancy of VMs are implemented.
Data backup in clouds should take into account the high data volume
involved in VM management.
Case Studies
In this section, we describe the main features of the most popular VI managers
available. Only the most prominent and distinguishing features of each tool are
discussed in detail. A detailed side-by-side feature comparison of VI managers
is presented in Table 1.1.
Apache VCL. The Virtual Computing Lab [60, 61] project was started in
2004 by researchers at North Carolina State University as a way to provide
customized environments to computer lab users. The software components that
support NCSU's initiative have been released as open source and incorporated
by the Apache Foundation.
AppLogic. AppLogic is a commercial VI manager, the flagship product of
3tera Inc. from California, USA. The company has labeled this product as a
Grid Operating System.
AppLogic provides a fabric to manage clusters of virtualized servers,
focusing on managing multi-tier Web applications. It views an entire
application as a collection of components that must be managed as a single
entity.
In summary, 3tera AppLogic provides the following features: Linux-based
controller; CLI and GUI interfaces; Xen backend; Global Volume Store (GVS)
storage virtualization; virtual networks; virtual clusters; dynamic resource
allocation; high availability; and data protection.
TABLE 1.1. Feature Comparison of Virtual Infrastructure Managers
(Columns: installation platform of controller; client UI, API, and language bindings; backend hypervisors; storage virtualization; interface to public clouds; virtual networks; dynamic resource allocation; advance reservation of capacity; high availability; data protection; license.)
Apache VCL: Apache v2 license; multi-platform (Apache/PHP) controller; Portal and XML-RPC interfaces; VMware ESX, ESXi, and Server backends; advance reservation of capacity.
AppLogic: proprietary license; Linux controller; GUI and CLI interfaces; Xen backend; Global Volume Store (GVS) storage virtualization; virtual networks; dynamic resource allocation; high availability; data protection.
Citrix Essentials: proprietary license; Windows controller; GUI, CLI, Portal, and XML-RPC interfaces; XenServer and Hyper-V backends; Citrix Storage Link storage virtualization; virtual networks; dynamic resource allocation; high availability; data protection.
Enomaly ECP: GPL v3 license; Linux controller; Portal and WS interfaces; Xen backend; interface to Amazon EC2; virtual networks.
Eucalyptus: BSD license; Linux controller; EC2 WS and CLI interfaces; Xen and KVM backends; EC2-compatible interface; virtual networks.
Nimbus: Apache v2 license; Linux controller; EC2 WS, WSRF, and CLI interfaces; Xen and KVM backends; interface to Amazon EC2; virtual networks; dynamic resource allocation and advance reservation via integration with OpenNebula.
OpenNebula: Apache v2 license; Linux controller; XML-RPC, CLI, and Java interfaces; Xen and KVM backends; interfaces to Amazon EC2 and ElasticHosts; virtual networks; dynamic resource allocation; advance reservation of capacity (via Haizea).
OpenPEX: GPL v2 license; multiplatform (Java) controller; Portal and WS interfaces; XenServer backend; advance reservation of capacity.
oVirt: GPL v2 license; Fedora Linux controller; Portal interface; KVM backend.
Platform ISF: proprietary license; Portal interface; XenServer, VMware ESX, and Hyper-V backends; interfaces to EC2, IBM CoD, and HP Enterprise Services; virtual networks; dynamic resource allocation; advance reservation of capacity; high availability and data protection unclear.
Platform VMO: proprietary license; Linux controller; Portal interface; XenServer backend; high availability.
VMware vSphere: proprietary license; Linux and Windows controller; CLI, GUI, Portal, and WS interfaces; VMware ESX and ESXi backends; VMware vStorage VMFS storage virtualization; interface to VMware vCloud partners; virtual networks; dynamic resource allocation (VMware DRM); high availability; data protection.
Citrix Essentials. The Citrix Essentials suite is one of the most feature-complete
VI management offerings available, focusing on management and automation
of data centers. It is essentially a hypervisor-agnostic solution, currently
supporting Citrix XenServer and Microsoft Hyper-V.
Enomaly ECP. The Enomaly Elastic Computing Platform, in its most complete
edition, offers most features a service provider needs to build an IaaS cloud.
In summary, Enomaly ECP provides the following features: Linux-based
controller; Web portal and Web services (REST) interfaces; Xen back-end;
interface to the Amazon EC2 public cloud; virtual networks; virtual clusters
(ElasticValet).
Eucalyptus. The Eucalyptus framework was one of the first open-source
projects to focus on building IaaS clouds. It has been developed with the intent
of providing an open-source implementation nearly identical in functionality to
Amazon Web Services APIs.
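Because Eucalyptus mimics the EC2 API, standard EC2 client libraries can usually be pointed at a private endpoint. The sketch below uses Python's boto3 client with an endpoint_url override; the endpoint address, path, and credentials are placeholders, and the exact URL depends on the Eucalyptus installation.

import boto3

# Point a standard EC2 client at an EC2-compatible (e.g., Eucalyptus) endpoint.
# The URL and credentials below are placeholders for a private installation.
ec2 = boto3.client(
    "ec2",
    endpoint_url="https://cloud.example.org:8773/services/compute",  # assumed path
    region_name="eucalyptus",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# The same call works against Amazon EC2 or a compatible private cloud.
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])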
Nimbus. The Nimbus toolkit is built on top of the Globus framework. Nimbus
provides most features in common with other open-source VI managers, such
as an EC2-compatible front-end API, support to Xen, and a backend interface
to Amazon EC2.
Nimbus' core was engineered around the Spring framework to be easily
extensible, allowing several internal components to be replaced and easing
the integration with other systems.
In summary, Nimbus provides the following features: Linux-based
controller; EC2-compatible (SOAP) and WSRF interfaces; Xen and KVM
backend and a Pilot program to spawn VMs through an LRM; interface to the
Amazon EC2 public cloud; virtual networks; one-click virtual clusters.
OpenNebula. OpenNebula is one of the most feature-rich open-source VI
managers. It was initially conceived to manage local virtual infrastructure, but
has also included remote interfaces that make it viable to build public clouds.
Altogether, four programming APIs are available: XML-RPC and libvirt for
local interaction; a subset of EC2 (Query) APIs and the OpenNebula Cloud
API (OCA) for public access [7, 65].
In summary, OpenNebula provides the following features: Linux-based
controller; CLI, XML-RPC, and Java interfaces; Xen and KVM backends;
interface to public clouds (Amazon EC2, ElasticHosts); virtual networks;
dynamic resource allocation; advance reservation of capacity.
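As a minimal sketch of the XML-RPC route (the endpoint, credentials, and the exact argument list of the one.vmpool.info call are assumptions that may vary across OpenNebula versions), a client can list VMs roughly as follows.

import xmlrpc.client

# Assumed OpenNebula front-end endpoint and "user:password" session string.
ONE_ENDPOINT = "http://frontend.example.org:2633/RPC2"
SESSION = "oneadmin:opennebula"

server = xmlrpc.client.ServerProxy(ONE_ENDPOINT)

# Assumed signature: one.vmpool.info(session, filter_flag, start_id, end_id, state);
# -2/-1 are commonly used to mean "all users" and "no range / any state".
response = server.one.vmpool.info(SESSION, -2, -1, -1, -1)
ok, result = response[0], response[1]
if ok:
    print(result)                 # XML document describing the VM pool
else:
    print("request failed:", result)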
OpenPEX. OpenPEX (Open Provisioning and EXecution Environment) was
constructed around the notion of using advance reservations as the primary
method for allocating VM instances.
oVirt. oVirt is an open-source VI manager, sponsored by Red Hat's Emergent
Technology group. It provides most of the basic features of other VI managers,
including support for managing physical server pools, storage pools, user
accounts, and VMs. All features are accessible through a Web interface.
Platform ISF. Infrastructure Sharing Facility (ISF) is the VI manager offering
from Platform Computing [68]. The company, mainly through its LSF family
of products, has been serving the HPC market for several years.
ISF is built upon Platform's VM Orchestrator, which, as a standalone
product, aims at speeding up the delivery of VMs to end users. It also provides high
availability by restarting VMs when hosts fail and by duplicating the VM that
hosts the VMO controller.
VMWare vSphere and vCloud. vSphere is VMware's suite of tools aimed at
transforming IT infrastructures into private clouds. It distinguishes itself from other
VI managers as one of the most feature-rich, due to the company's several
offerings at all levels of the architecture.
In the vSphere architecture, servers run on the ESXi platform. A separate
server runs vCenter Server, which centralizes control over the entire virtual
infrastructure. Through the vSphere Client software, administrators connect to
vCenter Server to perform various tasks.
In summary, vSphere provides the following features: VMware ESX, ESXi
backend; VMware vStorage VMFS storage virtualization; interface to external
clouds (VMware vCloud partners); virtual networks (VMware Distributed
Switch); dynamic resource allocation (VMware DRM); high availability; data
protection (VMware Consolidated Backup).
INFRASTRUCTURE AS A SERVICE PROVIDERS
Public Infrastructure as a Service providers commonly offer virtual servers
containing one or more CPUs, running several choices of operating systems
and a customized software stack. In addition, storage space and
communication facilities are often provided.
Features
In spite of being based on a common set of features, IaaS offerings can be
distinguished by the availability of specialized features that influence the
cost-benefit ratio experienced by user applications when moved to
the cloud. The most relevant features are: (i) geographic distribution of data
centers; (ii) variety of user interfaces and APIs to access the system; (iii)
specialized components and services that aid particular applications (e.g.,
load balancers, firewalls); (iv) choice of virtualization platform and operating
systems; and (v) different billing methods and periods (e.g., prepaid vs. postpaid, hourly vs. monthly).
Geographic Presence. To improve availability and responsiveness, a provider
of worldwide services would typically build several data centers distributed
around the world. For example, Amazon Web Services presents the concept of
"availability zones" and "regions" for its EC2 service.
User Interfaces and Access to Servers. Ideally, a public IaaS provider must
provide multiple access means to its cloud, thus catering for various users and
their preferences. Different types of user interfaces (UI) provide different levels
of abstraction, the most common being graphical user interfaces (GUI),
command-line tools (CLI), and Web service (WS) APIs.
GUIs are preferred by end users who need to launch, customize, and
monitor a few virtual servers and do not necessarily need to repeat the process
several times. On the other hand, CLIs offer more flexibility and the possibility
of automating repetitive tasks via scripts.
Advance Reservation of Capacity. Advance reservations allow users to request
that an IaaS provider reserve resources for a specific time frame in the future,
thus ensuring that cloud resources will be available at that time. However, most
clouds only support best-effort requests; that is, user requests are served
whenever resources are available.
Automatic Scaling and Load Balancing. As mentioned earlier in this chapter,
elasticity is a key characteristic of the cloud computing model. Applications
often need to scale up and down to meet varying load conditions. Automatic
scaling is a highly desirable feature of IaaS clouds.
Service-Level Agreement. Service-level agreements (SLAs) are offered by IaaS
providers to express their commitment to delivery of a certain QoS. To
customers it serves as a warranty. An SLA usually includes availability and
performance guarantees. Additionally, metrics must be agreed upon by all
parties, as well as penalties for violating these expectations.
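To make an availability guarantee concrete, the allowed downtime implied by an uptime percentage can be computed directly; the 99.95% figure below matches the EC2 SLA quoted later in this section, and the 30-day month is an assumption.

# Allowed downtime implied by an availability (uptime) guarantee.
def allowed_downtime_minutes(availability: float, period_hours: float = 30 * 24) -> float:
    """Minutes of downtime permitted per billing period (default: 30-day month)."""
    return (1.0 - availability) * period_hours * 60

print(round(allowed_downtime_minutes(0.9995), 1))  # 99.95% -> about 21.6 minutes/month
print(round(allowed_downtime_minutes(0.999), 1))   # 99.9%  -> about 43.2 minutes/month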
Hypervisor and Operating System Choice. Traditionally, IaaS offerings have
been based on heavily customized open-source Xen deployments. IaaS
providers needed expertise in Linux, networking, virtualization, metering,
resource management, and many other low-level aspects to successfully deploy
and maintain their cloud offerings.
Case Studies
In this section, we describe the main features of the most popular public IaaS
clouds. Only the most prominent and distinguishing features of each one are
discussed in detail. A detailed side-by-side feature comparison of IaaS offerings
is presented in Table 1.2.
Amazon Web Services. Amazon WS (AWS) is one of the major players in the
cloud computing market. It pioneered the introduction of IaaS clouds in
2006.
The Elastic Compute Cloud (EC2) offers Xen-based virtual servers (instances)
that can be instantiated from Amazon Machine Images (AMIs). Instances are
available in a variety of sizes, operating systems, architectures, and prices. CPU
capacity of instances is measured in Amazon Compute Units and, although fixed
for each instance, varies among instance types from 1 (small instance) to 20 (high
CPU instance).
In summary, Amazon EC2 provides the following features: multiple data
centers available in the United States (East and West) and Europe; CLI, Web
services (SOAP and Query), Web-based console user interfaces; access to
instances mainly via SSH (Linux) and Remote Desktop (Windows); advance
reservation of capacity (aka reserved instances) that guarantees availability for
periods of 1 and 3 years; 99.95% availability SLA; per-hour pricing; Linux and
Windows operating systems; automatic scaling; load balancing.
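A hedged sketch of launching and later releasing an EC2 instance with the boto3 library is shown below; the AMI ID, key pair name, and instance type are placeholders, and per-hour billing stops once the instance is terminated.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one small instance from a placeholder AMI.
response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",        # placeholder AMI ID
    InstanceType="m1.small",       # illustrative instance type
    KeyName="my-keypair",          # placeholder SSH key pair for Linux access
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print("launched", instance_id)

# ... use the instance (e.g., over SSH), then release it to stop hourly billing.
ec2.terminate_instances(InstanceIds=[instance_id])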
TABLE 1.2. Feature Comparison of Public Cloud Offerings (Infrastructure as a Service)
(Columns: geographic presence; client UI, API, and language bindings; primary access to server; advance reservation of capacity; SLA uptime; smallest billing unit; hypervisor; guest operating systems; automated horizontal scaling; load balancing; runtime server resizing/vertical scaling; instance hardware capacity.)
Amazon EC2: US East and Europe; CLI, WS, and Portal interfaces; access via SSH (Linux) and Remote Desktop (Windows); advance reservation via Amazon reserved instances (available in 1- or 3-year terms, starting from reservation time); 99.95% uptime SLA; billed by the hour; Xen hypervisor; Linux and Windows guests; automated horizontal scaling available with Amazon CloudWatch; Elastic Load Balancing; no runtime vertical scaling; instance capacity of 1-20 EC2 compute units, 1.7-15 GB memory, and 160-1690 GB storage (plus 1 GB-1 TB per EBS volume).
Flexiscale: UK; Web Console; access via SSH; no advance reservation; 100% uptime SLA; billed by the hour; Xen hypervisor; Linux and Windows guests; no automated horizontal scaling; Zeus software load balancing; runtime resizing of processors and memory (requires reboot); instance capacity of 1-4 CPUs, 0.5-16 GB memory, and 20-270 GB storage.
GoGrid: REST, Java, PHP, Python, and Ruby interfaces; access via SSH; no advance reservation; 100% uptime SLA; billed by the hour; Xen hypervisor; Linux and Windows guests; no automated horizontal scaling; hardware (F5) load balancing; instance capacity of 1-6 CPUs, 0.5-8 GB memory, and 30-480 GB storage.
Joyent Cloud: US (Emeryville, CA; San Diego, CA; Andover, MA; Dallas, TX); access via SSH and VirtualMin (Web-based system administration); no advance reservation; 100% uptime SLA; billed by the month; OS-level virtualization (Solaris Containers); OpenSolaris guests; no automated horizontal scaling; both hardware (F5 networks) and software (Zeus) load balancing; automatic CPU bursting (up to 8 CPUs); instance capacity of 1/16-8 CPUs, 0.25-32 GB memory, and 5-100 GB storage.
Rackspace Cloud Servers: US (Dallas, TX); Portal, REST, Python, PHP, Java, and C#/.NET interfaces; access via SSH; no advance reservation; 100% uptime SLA; billed by the hour; Xen hypervisor; Linux guests; no automated horizontal scaling; no load balancing; runtime resizing of memory and disk (requires reboot) plus automatic CPU bursting (up to 100% of the available CPU power of the physical host); quad-core processors (CPU power is weighed proportionally to memory size), 0.25-16 GB memory, and 10-620 GB storage.
Flexiscale. Flexiscale is a UK-based provider offering services similar in
nature to Amazon Web Services. However, its virtual servers offer some
distinct features, most notably: persistent storage by default, fixed IP addresses,
dedicated VLAN, a wider range of server sizes, and runtime adjustment of CPU
capacity (aka CPU bursting/vertical scaling). Similar to other IaaS offerings, this
service is also priced by the hour.
Joyent. Joyent's Public Cloud offers servers based on Solaris containers
virtualization technology. These servers, dubbed accelerators, allow deploying
various specialized software stacks based on a customized version of the
OpenSolaris operating system, which includes by default a Web-based
configuration tool and several pieces of pre-installed software, such as Apache,
MySQL, PHP, Ruby on Rails, and Java. Software load balancing is available as an
accelerator in addition to hardware load balancers.
In summary, the Joyent public cloud offers the following features: multiple
geographic locations in the United States; Web-based user interface; access to
virtual server via SSH and Web-based administration tool; 100% availability
SLA; per month pricing; OS-level virtualization Solaris containers;
OpenSolaris operating systems; automatic scaling (vertical).
GoGrid. GoGrid, like many other IaaS providers, allows its customers to
utilize a range of pre-made Windows and Linux images, in a range of fixed
instance sizes. GoGrid also offers ―value-added‖ stacks on top for applications
such as high-volume Web serving, e-Commerce, and database stores.
Rackspace Cloud Servers. Rackspace Cloud Servers is an IaaS solution that
provides fixed size instances in the cloud. Cloud Servers offers a range of
Linux-based pre-made images. A user can request different-sized images, where
the size is measured by requested RAM, not CPU.
PLATFORM AS A SERVICE PROVIDERS
Public Platform as a Service providers commonly offer a development and
deployment environment that allows users to create and run their applications
with little or no concern for the low-level details of the platform. In addition,
specific programming languages and frameworks are made available in the
platform, as well as other services such as persistent data storage and
in-memory caches.
Features
Programming Models, Languages, and Frameworks. Programming models
made available by PaaS providers define how users can express their
applications using higher levels of abstraction and run them efficiently on the
cloud platform. Each model aims at efficiently solving a particular problem. In
the cloud computing domain, the most common activities that require
specialized models are processing of large datasets in clusters of computers
(the MapReduce model) and development of request-based Web services and
applications.
Persistence Options. A persistence layer is essential to allow applications to
record their state and recover it in case of crashes, as well as to store user data.
Traditionally, Web and enterprise application developers have chosen
relational databases as the preferred persistence method. These databases offer
fast and reliable structured data storage and transaction processing, but may
lack scalability to handle several petabytes of data stored in commodity
computers.
Case Studies
In this section, we describe the main features of some Platform as a Service
(PaaS) offerings. A detailed side-by-side feature comparison of PaaS offerings
is presented in Table 1.3.
Aneka. Aneka is a .NET-based service-oriented resource management and
development platform. Each server in an Aneka deployment (dubbed Aneka
cloud node) hosts the Aneka container, which provides the base infrastructure
that consists of services for persistence, security (authorization, authentication
and auditing), and communication (message handling and dispatching).
Several programming models are supported, including task models that enable
the execution of legacy HPC applications and MapReduce, which enables a variety
of data-mining and search applications.
App Engine. Google App Engine lets you run your Python and Java Web
applications on elastic infrastructure supplied by Google. The App Engine
serving architecture is notable in that it allows real-time auto-scaling
without virtualization for many common types of Web applications.
However, such auto-scaling depends on the application developer using a
limited subset of the native APIs on each platform, and in some instances
specific Google APIs such as URLFetch, Datastore, and memcache must be
used in place of certain native API calls.
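For the legacy Python runtime, a hedged sketch of substituting these Google-specific APIs for native calls (urlfetch in place of urllib/sockets, memcache in place of a local cache) looks roughly like this; the URL is a placeholder and handler wiring is omitted.

# Legacy Google App Engine (Python) sketch: Google-specific APIs replace
# native calls so the platform can auto-scale the application.
from google.appengine.api import urlfetch, memcache

def fetch_quote(symbol):
    """Fetch a page via URLFetch and cache it in memcache for 60 seconds."""
    cache_key = "quote:" + symbol
    cached = memcache.get(cache_key)
    if cached is not None:
        return cached                       # served from the shared cache

    # urlfetch.fetch() is used instead of urllib/sockets on App Engine.
    result = urlfetch.fetch("http://quotes.example.com/" + symbol)  # placeholder URL
    if result.status_code == 200:
        memcache.set(cache_key, result.content, time=60)
        return result.content
    return None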
TABLE 1.3. Feature Comparison of Platform-as-a-Service Cloud Offerings
(Columns: target use; programming languages and frameworks; developer tools; programming models; persistence options; automatic scaling; backend infrastructure providers.)
Aneka: .NET enterprise applications and HPC; .NET; standalone SDK; Threads, Task, and MapReduce models; persistence in flat files, RDBMS, and HDFS; no automatic scaling; backend on Amazon EC2.
AppEngine: Web applications; Python and Java; Eclipse-based IDE; request-based Web programming model; persistence in BigTable; automatic scaling; backend on own data centers.
Force.com: enterprise applications (esp. CRM); Apex; Eclipse-based IDE and Web-based wizard; workflow, Excel-like formula language, and request-based web programming models; persistence in its own object database; automatic scaling unclear; backend on own data centers.
Microsoft Windows Azure: enterprise and Web applications; .NET; Azure tools for Microsoft Visual Studio; unrestricted programming models; persistence in Table/BLOB/queue storage and SQL services; automatic scaling; backend on own data centers.
Heroku: Web applications; Ruby on Rails; command-line tools; request-based web programming model; persistence in PostgreSQL and Amazon RDS; automatic scaling; backend on Amazon EC2.
Amazon Elastic MapReduce: data processing; Hive and Pig, Cascading, Java, Ruby, Perl, Python, PHP, R, and C++; Karmasphere Studio for Hadoop (NetBeans-based); MapReduce model; persistence in Amazon S3; no automatic scaling; backend on Amazon EC2.
Microsoft Azure. Microsoft Azure Cloud Services offers developers a hosted
.NET stack (C#, VB.NET, ASP.NET). In addition, Java and Ruby SDKs for
.NET Services are also available. The Azure system consists of a number of
elements.
Force.com. In conjunction with the Salesforce.com service, the Force.com
PaaS allows developers to create add-on functionality that integrates into the main
Salesforce CRM SaaS application.
Heroku. Heroku is a platform for instant deployment of Ruby on Rails Web
applications. In the Heroku system, servers are invisibly managed by the
platform and are never exposed to users.
CHALLENGES AND RISKS
Despite the initial success and popularity of the cloud computing paradigm and
the extensive availability of providers and tools, a significant number of
challenges and risks are inherent to this new model of computing. Providers,
developers, and end users must consider these challenges and risks to take good
advantage of cloud computing.
Security, Privacy, and Trust
Armbrust et al. cite information security as a main issue: "current cloud
offerings are essentially public . . . exposing the system to more attacks." For
this reason there are potentially additional challenges to make cloud computing
environments as secure as in-house IT systems. At the same time, existing,
well-understood technologies can be leveraged, such as data encryption,
VLANs, and firewalls.
Data Lock-In and Standardization
A major concern of cloud computing users is having their data locked in
by a certain provider. Users may want to move data and applications out from
a provider that does not meet their requirements. However, in their current
form, cloud computing infrastructures and platforms do not employ standard
methods of storing user data and applications. Consequently, they do not
interoperate and user data are not portable.
Availability, Fault-Tolerance, and Disaster Recovery
It is expected that users will have certain expectations about the service level to
be provided once their applications are moved to the cloud. These expectations
include availability of the service, its overall performance, and what measures
are to be taken when something goes wrong in the system or its components. In
summary, users seek a warranty before they can comfortably move their
business to the cloud.
Resource Management and Energy-Efficiency
One important challenge faced by providers of cloud computing services is the
efficient management of virtualized resource pools. Physical resources such as
CPU cores, disk space, and network bandwidth must be sliced and shared
among virtual machines running potentially heterogeneous workloads.
Another challenge concerns the outstanding amount of data to be managed
in various VM management activities. Such data amount is a result of
particular abilities of virtual machines, including the ability of traveling through
space (i.e., migration) and time (i.e., checkpointing and rewinding), operations
that may be required in load balancing, backup, and recovery scenarios. In
addition, dynamic provisioning of new VMs and replicating existing VMs
require efficient mechanisms to make VM block storage devices (e.g., image
files) quickly available at selected hosts.
2.2 MIGRATING INTO A CLOUD
The promise of cloud computing has raised the IT expectations of small and
medium enterprises beyond measure. Large companies are deeply debating it.
Cloud computing is a disruptive model of IT whose innovation is part
technology and part business model—in short a ―disruptive techno-commercial
model‖ of IT. This tutorial chapter focuses on the key issues and associated
dilemmas faced by decision makers, architects, and systems managers in trying
to understand and leverage cloud computing for their IT needs. Questions
asked and discussed in this chapter include: when and how to migrate one‘s
application into a cloud; what part or component of the IT application to
migrate into a cloud and what not to migrate into a cloud; what kind of
customers really benefit from migrating their IT into the cloud; and so on. We
describe the key factors underlying each of the above questions and share a
Seven-Step Model of Migration into the Cloud.
Several efforts have been made in the recent past to define the term "cloud
computing" and many have not been able to provide a comprehensive one. This
has been made more challenging by the scorching pace of technological
advances as well as the newer business model formulations for the cloud services
being offered.
The Promise of the Cloud
Most users of cloud computing services offered by some of the large-scale data
centers are least bothered about the complexities of the underlying systems or
their functioning, more so given the heterogeneity of the systems and the
software running on them.
Cloudonomics and Technology
• 'Pay per use' – lower cost barriers
• On-demand resources – autoscaling
• Capex vs. Opex – no capital expenses (CAPEX), only operational expenses (OPEX)
• SLA-driven operations – much lower TCO
• Attractive NFR support: availability, reliability
• 'Infinite' elastic availability – compute/storage/bandwidth
• Automatic usage monitoring and metering
• Jobs/tasks virtualized and transparently 'movable'
• Integration and interoperability 'support' for hybrid ops
• Transparently encapsulated and abstracted IT features
FIGURE 2.1. The promise of the cloud computing services.
As shown in Figure 2.1, the promise of the cloud, both on the business front
(the attractive cloudonomics) and on the technology front, widely aided CxOs
in spawning out several non-mission-critical IT needs from the ambit of their
captive traditional data centers to the appropriate cloud service. Invariably,
these IT needs had some common features: they were typically Web-oriented;
they represented seasonal IT demands; they were amenable to parallel batch
processing; and they were non-mission-critical and therefore did not have high
security demands.
The Cloud Service Offerings and Deployment Models
Cloud computing has been an attractive proposition both for the CFO and the
CTO of an enterprise, primarily due to its ease of usage. This has been achieved
by large data center service vendors, now better known as cloud service vendors,
primarily due to their scale of operations. As summarized in Figure 2.2, the main
offerings and their typical consumers are:
• IaaS: abstract compute/storage/bandwidth resources (e.g., Amazon Web Services [10, 9]: EC2, S3, SDB, CDN, CloudWatch), consumed mainly by IT folks
• PaaS: an abstracted programming platform with encapsulated infrastructure (e.g., Google App Engine (Java/Python), Microsoft Azure, Aneka [13]), consumed mainly by programmers
• SaaS: applications with encapsulated infrastructure and platform (e.g., Salesforce.com, Gmail, Yahoo Mail, Facebook, Twitter), consumed by architects and end users
• Cloud application deployment and consumption models: public clouds, hybrid clouds, and private clouds
FIGURE 2.2. The cloud computing service offering and deployment models.
Google, Amazon, Microsoft, and a few others have been the key players, apart
from open source Hadoop built around the Apache ecosystem. As shown in
Figure 2.2, the cloud
service offerings from these vendors can broadly be classified into three major
streams: the Infrastructure as a Service (IaaS), the Platform as a Service (PaaS),
and the Software as a Service (SaaS). While IT managers and system
administrators preferred IaaS as offered by Amazon for many of their
virtualized IT needs, the programmers preferred PaaS offerings like Google
AppEngine (Java/Python programming) or Microsoft Azure (.Net
programming). Users of large-scale enterprise software invariably found that
if they had been using the cloud, it was because their usage of the specific
software package was available as a service—it was, in essence, a SaaS
offering. Salesforce.com was an exemplary SaaS offering on the Internet.
From a technology viewpoint, as of today, the IaaS type of cloud offerings
have been the most successful and widespread in usage. Invariably, these
offerings hide the cloud underneath: storage is easily scalable, and most users do
not know on which system their data is stored or where it is located.
Challenges in the Cloud
While the cloud service offerings present a simplistic view of IT in case of IaaS
or a simplistic view of programming in case PaaS or a simplistic view of
resources usage in case of SaaS, the underlying systems level support challenges
are huge and highly complex. These stem from the need to offer a uniformly
consistent and robustly simplistic view of computing while the underlying
systems are highly failure-prone, heterogeneous, resource hogging, and
exhibiting serious security shortcomings. As observed in Figure 2.3, the
promise of the cloud seems very similar to the typical distributed systems
properties that most would prefer to have.
Distributed system fallacies and the promise of the cloud: full network
reliability; zero network latency; infinite bandwidth; secure network; no topology
changes; centralized administration; zero transport costs; homogeneous networks
and systems.
Challenges in cloud technologies: security; performance monitoring; consistent
and robust service abstractions; meta scheduling; energy-efficient load balancing;
scale management; SLA and QoS architectures; interoperability and portability;
green IT.
FIGURE 2.3. 'Under the hood' challenges of the cloud computing services
implementations.
Many of these challenges are listed in Figure 2.3. Prime among them are the
challenges of security. The Cloud Security Alliance seeks to address many of these issues.
BROAD APPROACHES TO MIGRATING INTO THE CLOUD
Given that cloud computing is a "techno-business disruptive model" and is at
the top of the top 10 strategic technologies to watch for 2010 according to
Gartner, migrating into the cloud is poised to become a large-scale effort in
leveraging the cloud in several enterprises. "Cloudonomics" deals with the
economic rationale for leveraging the cloud and is central to the success of
cloud-based enterprise usage.
Why Migrate?
There are economic and business reasons why an enterprise application can be
migrated into the cloud, and there are also a number of technological reasons.
Many of these efforts come up as initiatives in adoption of cloud technologies
in the enterprise, resulting in integration of enterprise applications running off
the captive data centers with the new ones that have been developed on the
cloud. Adoption of or integration with cloud computing services is a use case of
migration.
With due simplification, the migration of an enterprise application is best
captured by the following:
P → P'C + P'l → P'OFC + P'l

where P is the application before migration, running in the captive data center;
P'C is the part of the application migrated into a (hybrid) cloud; P'l is the part
of the application that continues to run in the captive local data center; and
P'OFC is the application part optimized for the cloud. If an enterprise application
cannot be migrated fully, some parts are run on the captive local data center
while the rest are migrated into the cloud, essentially a case of hybrid cloud
usage. However, when the entire application is migrated onto the cloud, P'l is
null. Indeed, the migration of the enterprise application P can happen at five
levels: application, code, design, architecture, and usage. The P'C migration can
happen at any of the five levels without any P'l
component. Compound this with the kind of cloud computing service offering
being applied—the IaaS model or PaaS or SaaS model—and we have a variety
of migration use cases that need to be thought through thoroughly by the
migration architects.
Cloudonomics. Invariably, migrating into the cloud is driven by economic
reasons of cost cutting in both the IT capital expenses (Capex) as well as
operational expenses (Opex). There are both the short-term benefits of
opportunistic migration to offset seasonal and highly variable IT loads as well
as the long-term benefits to leverage the cloud. For the long-term sustained
usage, as of 2009, several impediments and shortcomings of the cloud
computing services need to be addressed.
Deciding on the Cloud Migration
In fact, several proofs of concept and prototypes of the enterprise application
are experimented with on the cloud to help in making a sound decision on
migrating into the cloud. Post migration, the ROI on the migration should be
positive for a broad range of pricing variability. Assume that among the M classes
of questions, the largest class has N questions. We can then model the
weightage-based decision making with an M x N weightage matrix as follows:

C_l <= Sum over i = 1..M of B_i * ( Sum over j = 1..N of A_ij * X_ij ) <= C_h

where C_l is the lower weightage threshold and C_h is the higher weightage
threshold, while A_ij is the specific constant assigned for a question and X_ij is the
fraction between 0 and 1 that represents the degree to which that answer to
the question is relevant and applicable.
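A small worked example of this weightage computation is sketched below; the class weights, question constants, and thresholds are entirely hypothetical and only illustrate how the double sum is compared against C_l and C_h.

# Hypothetical weightage-based migration decision: M question classes,
# each class i weighted by B[i]; question j in class i has constant A[i][j]
# and an answer relevance X[i][j] in [0, 1].
B = [0.5, 0.3, 0.2]                       # class weights (assumed)
A = [[4, 3], [5, 2], [3, 3]]              # per-question constants (assumed)
X = [[0.9, 0.6], [0.4, 1.0], [0.7, 0.2]]  # answer relevance fractions (assumed)
C_LOW, C_HIGH = 2.0, 6.0                  # lower/higher weightage thresholds

score = sum(
    B[i] * sum(A[i][j] * X[i][j] for j in range(len(A[i])))
    for i in range(len(A))
)

print(round(score, 2))   # -> 4.44 for these assumed inputs
print("migration looks favorable" if C_LOW <= score <= C_HIGH else "re-examine the decision")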
THE SEVEN-STEP MODEL OF MIGRATION INTO A CLOUD
Typically migration initiatives into the cloud are implemented in phases or in
stages. A structured and process-oriented approach to migration into a cloud has
several advantages of capturing within itself the best practices of many migration
projects. While migration has been a difficult and vague subject—of not much
interest to the academics and left to the industry practitioners—not many efforts
across the industry have been put in to consolidate what has been found to be
both a top revenue earner and a long standing customer pain. After due study
and practice, we share the Seven-Step Model of Migration into the Cloud as part
of our efforts in understanding and leveraging the cloud computing service
offerings in the enterprise context. In a succinct way, Figure 2.4 captures the
essence of the steps in the model of migration into the cloud, while Figure 2.5
captures the iterative process of the seven-step migration into the cloud.
The first step of the iterative process of the seven-step model of migration is
basically at the assessment level. Proofs of concept or prototypes for various
approaches to the migration, along with the leveraging of pricing parameters,
enable one to make appropriate assessments.
1. Conduct Cloud Migration Assessments
2. Isolate the Dependencies
3. Map the Messaging & Environment
4. Re-architect & Implement the Lost Functionalities
5. Leverage Cloud Functionalities & Features
6. Test the Migration
7. Iterate and Optimize
FIGURE 2.4. The Seven-Step Model of Migration into the Cloud. (Source: Infosys
Research.)
FIGURE 2.5. The iterative Seven-Step Model of Migration into the Cloud (START → Assess → Isolate → Map → Re-architect → Augment → Test → Optimize → END). (Source: Infosys Research.)
Having done the augmentation, we validate and test the new form of the
enterprise application with an extensive test suite that comprises testing the
components of the enterprise application on the cloud as well. These test results
could be positive or mixed. In the latter case, we iterate and optimize as
appropriate. After several such optimizing iterations, the migration is deemed
successful. Our best practices indicate that it is best to iterate through this
Seven-Step Model process for optimizing and ensuring that the migration into
the cloud is both robust and comprehensive. Figure 2.6 captures the typical
components of the best practices accumulated in the practice of the Seven-Step
Model of Migration into the Cloud. Though not comprehensive in enumeration,
it is representative.
Assess: cloudonomics; migration costs; recurring costs; database data segmentation; database migration; functionality migration; NFR support.
Isolate: runtime environment; licensing; libraries dependency; applications dependency; latencies and bottlenecks; performance bottlenecks; architectural dependencies.
Map: messages mapping (marshalling and de-marshalling); mapping environments; mapping libraries and runtime approximations.
Re-architect: approximate lost functionality using cloud runtime support API; new use cases; analysis; design.
Augment: exploit additional cloud features; seek low-cost augmentations; autoscaling; storage; bandwidth; security.
Test: augment test cases and test automation; run proofs of concept; test migration strategy; test new test cases due to cloud augmentation; test for production loads.
Optimize: optimize, rework, and iterate; significantly satisfy the cloudonomics of migration; optimize compliance with standards and governance; deliver the best migration ROI; develop a roadmap for leveraging new cloud features.
FIGURE 2.6. Some details of the iterative Seven-Step Model of Migration into the Cloud.
Compared with the typical approach to migration into Amazon AWS, our
Seven-Step Model is more generic, versatile, and comprehensive. The typical
migration into Amazon AWS is phased over several steps, about six, as
discussed in several white papers on the Amazon website, as follows: The first
phase is the cloud migration assessment phase, wherein
dependencies are isolated and strategies worked out to handle these
dependencies. The next phase is in trying out proof of concepts to build a
reference migration architecture. The third phase is the data migration phase
wherein database data segmentation and cleansing is completed. This phase
also tries to leverage the various cloud storage options as best suited. The
fourth phase comprises the application migration, wherein a "forklift
strategy" of migrating the key enterprise application along with its
dependencies (other applications) into the cloud is pursued.
Migration Risks and Mitigation
The biggest challenge to any cloud migration project is how effectively the
migration risks are identified and mitigated. In the Seven-Step Model of
Migration into the Cloud, the process step of testing and validating includes
efforts to identify the key migration risks. In the optimization step, we address
various approaches to mitigate the identified migration risks.
There are issues of consistent identity management as well. These and
several other issues are discussed in Section 2.1. Issues and challenges listed in
Figure 2.3 continue to be the persistent research and engineering challenges in
coming up with appropriate cloud computing implementations.
2.3 ENRICHING THE 'INTEGRATION AS A SERVICE' PARADIGM FOR THE CLOUD ERA
AN INTRODUCTION
The trend-setting cloud paradigm actually represents the cool
conglomeration of a number of proven and promising Web and enterprise
technologies. Cloud Infrastructure providers are establishing cloud centers
to host a variety of ICT services and platforms of worldwide individuals,
innovators, and institutions. Cloud service providers (CSPs) are very
aggressive in experimenting with and embracing the cool cloud ideas, and today
every business and technical service is being hosted in clouds to be
delivered to global customers, clients, and consumers over the Internet
communication infrastructure. For example, security as a service (SaaS) is
a prominent cloud-hosted security service that can be subscribed to by a
spectrum of users on any connected device, and the users just pay for the
exact amount or time of usage. In a nutshell, on-premise and local
applications are becoming online, remote, hosted, on-demand, and
off-premise applications.
Business-to-business (B2B). It is logical to take the integration
middleware to clouds to simplify and streamline enterprise-to-enterprise
(E2E), enterprise-to-cloud (E2C) and cloud-to-cloud (C2C) integration.
THE EVOLUTION OF SaaS
The SaaS paradigm is on the fast track due to its innate powers and potential.
Executives, entrepreneurs, and end-users are ecstatic about the tactical as
well as strategic success of the emerging and evolving SaaS paradigm.
A number of positive and progressive developments started to grip this
model. Newer resources and activities are being consistently readied
to be delivered as a service. Experts and evangelists are in unison
that cloud is to rock the total IT community as the best possible
infrastructural solution for effective service delivery.
IT as a Service (ITaaS) is the most recent and efficient delivery
method in the decisive IT landscape. With the meteoric and
mesmerizing rise of the service orientation principles, every single IT
resource, activity and infrastructure is being viewed and visualized as a
service that sets the tone for the grand unfolding of the dreamt service
era. Integration as a service (IaaS) is the budding and distinctive
capability of clouds in fulfilling the business integration requirements.
Increasingly business applications are deployed in clouds to reap the
business and technical benefits. On the other hand, there are still
innumerable applications and data sources locally stationed and
sustained, primarily due to security reasons.
B2B systems are capable of driving this new on-demand integration
model because they are traditionally employed to automate business
processes between manufacturers and their trading partners. That
means they provide application-to-application connectivity along with
the functionality that is very crucial for linking internal and external
software securely.
The use of a hub-and-spoke (H&S) architecture further simplifies the
implementation and avoids placing an excessive processing burden on
the customer side. The hub is installed at the SaaS provider's cloud
center to do the heavy lifting, such as reformatting files. The Web is the
largest digital information superhighway:
1. The Web is the largest repository of all kinds of resources such as
web pages, applications comprising enterprise components, business
services, beans, POJOs, blogs, corporate data, etc.
2. The Web is turning out to be the open, cost-effective and generic
business execution platform (E-commerce, business, auction, etc.
happen in the web for global users) comprising a wider variety of
containers, adaptors, drivers, connectors, etc.
3. The Web is the global-scale communication infrastructure (VoIP,
video conferencing, IPTV, etc.).
4. The Web is the next-generation discovery, connectivity, and
integration middleware.
Thus the unprecedented absorption and adoption of the Internet is the
key driver for the continued success of cloud computing.
THE CHALLENGES OF SaaS PARADIGM
As with any new technology, SaaS and cloud concepts too suffer from a
number of limitations. These technologies are being diligently examined
for specific situations and scenarios. The prickling and tricky issues in
different layers and levels are being looked into. The overall views are
listed out below. Loss or lack of the following features deters the
massive adoption of clouds:
1. Controllability
2. Visibility & flexibility
3. Security and Privacy
4. High Performance and Availability
5. Integration and Composition
6. Standards
A number of approaches are being investigated for resolving the
identified issues and flaws. Private cloud, hybrid and the latest
community cloud are being prescribed as the solution for most of these
inefficiencies and deficiencies. As rightly pointed out by someone in his
weblogs, still there are miles to go. There are several companies
focusing on this issue. Boomi (http://www.dell.com/) is one among
them. This company has published several well-written white papers
elaborating the issues confronting those enterprises thinking and trying
to embrace the third-party public clouds for hosting their services
and applications.
Integration Conundrum. While SaaS applications offer outstanding
value in terms of features and functionalities relative to cost, they have
introduced several challenges specific to integration.
APIs are Insufficient. Many SaaS providers have responded to the
integration challenge by developing application programming interfaces
(APIs). Unfortunately, accessing and managing data via an API requires
a significant amount of coding as well as maintenance due to frequent
API modifications and updates.
Data Transmission Security. SaaS providers go to great lengths to
ensure that customer data is secure within the hosted environment.
However, the need to transfer data between on-premise systems or
applications behind the firewall and SaaS applications raises new
security challenges.
For any relocated application to provide the promised value for
businesses and users, the minimum requirement is the interoperability
between SaaS applications and on-premise enterprise packages.
The Impacts of Clouds. On the infrastructural front, in the recent past,
the clouds have arrived onto the scene powerfully and have extended
the horizon and the boundary of business applications, events and data.
Thus there is a clarion call for adaptive integration engines that
seamlessly and spontaneously connect enterprise applications with
cloud applications. Integration is being stretched further to the level of
the expanding Internet and this is really a litmus test for system
architects and integrators.
The perpetual integration puzzle has to be solved meticulously for the
originally envisioned success of the SaaS style.
APPROACHING THE SaaS INTEGRATION ENIGMA
Integration as a Service (IaaS) is all about the migration of the
functionality of a typical enterprise application integration (EAI) hub /
enterprise service bus (ESB) into the cloud for providing for smooth
data transport between any enterprise and SaaS applications. Users
subscribe to IaaS as they would do for any other SaaS application.
Cloud middleware is the next logical evolution of traditional
middleware solutions.
Service orchestration and choreography enables process integration.
Service interaction through ESB integrates loosely coupled systems
whereas CEP connects decoupled systems.
With the unprecedented rise in cloud usage, all this integration
software is bound to move to clouds. Amazon's Simple Queue Service (SQS),
for example, does not promise in-order and exactly-once delivery. These
simplifications let Amazon make SQS more scalable, but they also mean that
developers must use SQS differently from an on-premise message queuing
technology.
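Because delivery may be out of order and more than once, consumers typically make processing idempotent; the hedged boto3 sketch below deduplicates on the SQS message ID before handling each message (the queue URL and handler are placeholders, and a real system would persist the seen IDs durably).

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder

seen_ids = set()   # in a real system this would be durable storage

def handle(body):  # placeholder business logic
    print("processing:", body)

while True:
    response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
    for message in response.get("Messages", []):
        # SQS may deliver a message more than once, so skip duplicates.
        if message["MessageId"] not in seen_ids:
            seen_ids.add(message["MessageId"])
            handle(message["Body"])
        # Delete in all cases so the message is not redelivered.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])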
As per one of David Linthicum's white papers, approaching
SaaS-to-enterprise integration is really a matter of making informed and
intelligent choices. The primary consideration is the need for integration
between remote cloud platforms and on-premise enterprise platforms.
Why SaaS Integration Is Hard. As indicated in the white paper, consider
a mid-sized paper company that recently became a Salesforce.com
CRM customer. The company currently leverages an on-premise
custom system that uses an Oracle database to track inventory and sales.
The use of the Salesforce.com system provides the company with a
significant value in terms of customer and sales management.
Having understood and defined the "to be" state, data
synchronization technology is proposed as the best fit between the
source, meaning Salesforce.com, and the target, meaning the existing
legacy system that leverages Oracle. First of all, we need to gain
insights into the special traits and tenets of SaaS applications in order
to arrive at a suitable integration route. The constraining attributes of
SaaS applications are
● Dynamic nature of the SaaS interfaces that constantly change
● Dynamic nature of the metadata native to a SaaS provider such as
Salesforce.com
● Managing assets that exist outside of the firewall
● Massive amounts of information that need to move between
SaaS and on-premise systems daily and the need to maintain data
quality and integrity.
As SaaS applications are being deposited in cloud infrastructures vigorously,
we need to ponder the obstructions imposed by clouds and prescribe proven
solutions. If we face difficulty with local integration, then cloud integration is
bound to be more complicated. The most probable reasons are:
● New integration scenarios
● Access to the cloud may be limited
● Dynamic resources
● Performance
Limited Access. Access to cloud resources (SaaS, PaaS, and the
infrastructures) is more limited than access to local applications. Accessing
local applications is quite simple and fast. Embedding integration points in
local as well as custom applications is easier.
Dynamic Resources. Cloud resources are virtualized and service-oriented.
That is, everything is expressed and exposed as a service. Due to the dynamism
factor that is sweeping the whole cloud ecosystem, application versioning and
infrastructural changes are subject to dynamic change.
Performance. Clouds support application scalability and resource
elasticity. However, the network distances between elements in the
cloud are no longer under our control.
NEW INTEGRATION SCENARIOS
Before the cloud model, we had to stitch and tie local systems together.
With the shift to the cloud model, we now have to connect local
applications to the cloud, and we also have to connect cloud
applications to each other, which adds new permutations to the complex
integration channel matrix. All of this means integration must criss-cross
firewalls somewhere.
Cloud Integration Scenarios. We have identified three major integration
scenarios as discussed below.
Within a Public Cloud (figure 3.1). Two different applications are
hosted in a cloud. The role of the cloud integration middleware (say
cloud-based ESB or internet service bus (ISB)) is to seamlessly enable
these applications to talk to each other. A possible sub-scenario is that the
applications are owned by two different companies; they may live on a single
physical server but run on different virtual machines.
FIGURE 3.1. Within a Public Cloud (App1 - ISB - App2).
FIGURE 3.2. Across Homogeneous Clouds (Cloud 1 - ISB - Cloud 2).
FIGURE 3.3. Across Heterogeneous Clouds (Public Cloud - ISB - Private Cloud).
Homogeneous Clouds (Figure 3.2). The applications to be integrated are
posited in two geographically separated cloud infrastructures. The
integration middleware can be in cloud 1, in cloud 2, or in a separate cloud.
There is a need for data and protocol transformation, and this gets
done by the ISB. The approach is more or less similar to the
enterprise application integration procedure.
Heterogeneous Clouds (Figure 3.3). One application is in a public cloud
and the other application is in a private cloud.
THE INTEGRATION METHODOLOGIES
Excluding custom integration through hand-coding, there are three
types of cloud integration:
1. Traditional Enterprise Integration Tools can be empowered with
special connectors to access cloud-located applications—This is
the most likely approach for IT organizations which have already
invested a lot in integration suites for their application integration
needs.
2. Traditional Enterprise Integration Tools are hosted in the
Cloud—This approach is similar to the first option except that
the integration software suite is now hosted in any third-party
cloud infrastructures so that the enterprise does not worry
about procuring and managing the hardware or installing the
integration software.
3. Integration-as-a-Service (IaaS) or On-Demand Integration
Offerings— These are SaaS applications that are designed to
deliver the integration service securely over the Internet and
are able to integrate cloud applications with the on-premise
systems, cloud-to-cloud applications.
In a nutshell, the integration requirements can be realised using
any one of the following methods and middleware products.
1. Hosted and extended ESB (Internet service bus / cloud integration bus)
2. Online Message Queues, Brokers and Hubs
3. Wizard and configuration-based integration platforms (niche integration solutions)
4. Integration Service Portfolio Approach
5. Appliance-based Integration (standalone or hosted)
With the emergence of the cloud space, the integration scope grows
further and hence people are looking out for robust and resilient
solutions and services that would speed up and simplify the whole
process of integration.
Characteristics of Integration Solutions and Products. The key
attributes of integration platforms and backbones, gleaned and gained
from integration project experience, are connectivity, semantic
mediation, data mediation, integrity, security, governance, etc.:
● Connectivity refers to the ability of the integration engine to engage
with both the source and target systems using available native
interfaces.
● Semantic Mediation refers to the ability to account for the
differences between application semantics between two or more
systems.
● Data Mediation converts data from a source data format into a
destination data format.
● Data Migration is the process of transferring data between storage
types, formats, or systems.
● Data Security means the ability to ensure that information extracted
from the source systems is securely placed into target systems.
● Data Integrity means data is complete and consistent. Thus, integrity
has to be guaranteed when data is getting mapped and maintained
during integration operations, such as data synchronization between
on-premise and SaaS-based systems.
● Governance refers to the processes and technologies that surround a
system or systems, which control how those systems are accessed
and leveraged.
These are the prominent qualities to carefully and critically analyze
when deciding on cloud / SaaS integration providers.
Data Integration Engineering Lifecycle. As business data are still
stored and sustained in local and on-premise server and storage
machines, a lean data integration lifecycle is imperative. The
pivotal phases, as per Mr. David Linthicum, a world-renowned
integration expert, are understanding, definition, design,
implementation, and testing.
1. Understanding the existing problem domain means defining the
metadata that is native within the source system (say
Salesforce.com) and the target system.
2. Definition refers to the process of taking the information culled
during the previous step and defining it at a high level, including
what the information represents, ownership, and physical
attributes.
3. Design the integration solution around the movement of data from
one point to another, accounting for the differences in semantics
using the underlying data transformation and mediation layer by
mapping one schema from the source to the schema of the target.
4. Implementation refers to actually implementing the data
integration solution within the selected technology.
5. Testing refers to assuring that the integration is properly
designed and implemented and that the data synchronizes
properly between the involved systems.
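A minimal sketch of the design and implementation steps is given below, mapping records from a hypothetical Salesforce-style source schema into a hypothetical Oracle-backed target schema; all field names and the revenue conversion rule are invented for illustration.

# Hypothetical field-level mapping from a SaaS (source) schema to an
# on-premise (target) schema, applied during data synchronization.
FIELD_MAP = {                     # source field -> target column (assumed names)
    "AccountName": "CUSTOMER_NAME",
    "BillingCity": "CITY",
    "AnnualRevenue": "REVENUE_USD",
}

def mediate(record: dict) -> dict:
    """Convert one source record into the target format (data mediation)."""
    target = {dst: record.get(src) for src, dst in FIELD_MAP.items()}
    # Simple semantic mediation: the target stores revenue in thousands (assumed rule).
    if target["REVENUE_USD"] is not None:
        target["REVENUE_USD"] = target["REVENUE_USD"] / 1000.0
    return target

print(mediate({"AccountName": "Acme Corp", "BillingCity": "Chennai",
               "AnnualRevenue": 2500000}))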
SaaS INTEGRATION PRODUCTS AND PLATFORMS
Cloud-centric integration solutions are being developed and
demonstrated for showcasing their capabilities for integrating enterprise
and cloud applications. The integration puzzle has been the toughest
assignment for long due to heterogeneity and multiplicity-induced
complexity.
Jitterbit
Force.com is a Platform as a Service (PaaS), enabling developers to
create and deliver any kind of on-demand business application.
FIGURE 3.4. The Smooth and Spontaneous Cloud Interaction via Open Clouds (Salesforce, Google, Microsoft, Zoho, Amazon, Yahoo).
Until now, integrating force.com applications with other on-demand
applications and systems within an enterprise has seemed like a
daunting and doughty task that required too much time, money, and
expertise.
Jitterbit is a fully graphical integration solution that provides users a
versatile platform and a suite of productivity tools to reduce the
integration effort sharply. Jitterbit comprises two major components:
● Jitterbit Integration Environment: An intuitive point-and-click
graphical UI that enables users to quickly configure, test, deploy and
manage integration projects on the Jitterbit server.
● Jitterbit Integration Server: A powerful and scalable run-time engine
that processes all the integration operations, fully configurable and
manageable from the Jitterbit application.
Jitterbit is making integration easier, faster, and more affordable
than ever before. Using Jitterbit, one can connect force.com with a
wide variety of on-premise systems including ERP, databases, flat files,
and custom applications. Figure 3.5 vividly illustrates how Jitterbit links
a number of functional and vertical enterprise systems (manufacturing,
sales, R&D, consumer marketing) with on-demand applications.
FIGURE 3.5. Linkage of On-Premise with Online and On-Demand Applications.
Boomi Software
Boomi AtomSphere is an integration service that is completely on-demand and connects any combination of SaaS, PaaS, cloud, and on-premise applications without the burden of installing and maintaining software packages or appliances. Anyone can securely build, deploy and manage simple to complex integration processes using only a web browser, whether connecting SaaS applications found in various lines of business or integrating across geographic boundaries.
Bungee Connect
For professional developers, Bungee Connect enables cloud computing
by offering an application development and deployment platform
that enables highly interactive applications integrating multiple data
sources and facilitating instant deployment.
OpSource Connect
OpSource Connect expands on the OpSource Services Bus (OSB) by providing the infrastructure for two-way web services interactions, allowing customers to consume and publish applications across a common web services infrastructure.
The Platform Architecture. OpSource Connect is made up of key features including:
● OpSource Services Bus
● OpSource Service Connectors
● OpSource Connect Certified Integrator Program
● OpSource Connect ServiceXchange
● OpSource Web Services Enablement Program
The OpSource Services Bus (OSB) is the foundation for OpSource‘s
turnkey development and delivery environment for SaaS and web
companies.
SnapLogic
SnapLogic is a capable, clean, and uncluttered solution for data integration that can be deployed in enterprise as well as in cloud landscapes. The free community edition can be used for the most common point-to-point data integration tasks, giving a huge productivity boost beyond custom code. SnapLogic is designed to accommodate:
● Changing data sources. SaaS and on-premise applications, Web
APIs, and RSS feeds
● Changing deployment options. On-premise, hosted, private and
public cloud platforms
● Changing delivery needs. Databases, files, and data services
Transformation Engine and Repository. SnapLogic is a single data
integration platform designed to meet data integration needs. The
SnapLogic server is built on a core of connectivity and transformation
components, which can be used to solve even the most complex data
integration scenarios.
The SnapLogic designer provides an initial hint of the web principles at work behind the scenes. The SnapLogic server is based on a web architecture and exposes all of its capabilities through web interfaces to the outside world.
The Pervasive DataCloud
The Pervasive DataCloud platform (Figure 3.6) is a unique multi-tenant platform. It provides dynamic "compute capacity in the sky" for deploying on-demand integration and other data-centric applications.
FIGURE 3.6. Pervasive Integrator Connects Different Resources (management, scheduling and events, eCommerce users, a load balancer, message queues feeding a scalable computing cluster of engine/queue-listener nodes, SaaS applications, and customers).
Pervasive DataCloud is the first multi-tenant platform for delivering the following:
1. Integration as a Service (IaaS) for both hosted and on-premises applications and data sources
2. Packaged turnkey integration
3. Integration that supports every integration scenario
4. Connectivity to hundreds of different applications and data sources
Pervasive DataCloud hosts Pervasive and its partners' data-centric applications. Pervasive uses Pervasive DataCloud as a platform for deploying on-demand integration via:
● The Pervasive DataSynch family of packaged integrations. These
are highly affordable, subscription-based, and packaged integration
solutions.
● Pervasive Data Integrator. This runs on the cloud or on-premises and is a design-once, deploy-anywhere solution to support every integration scenario, including:
● Data migration, consolidation and conversion
● ETL / Data warehouse
● B2B / EDI integration
● Application integration (EAI)
● SaaS /Cloud integration
● SOA / ESB / Web Services
● Data Quality/Governance
● Hubs
Pervasive DataCloud provides multi-tenant, multi-application and
multicustomer deployment. Pervasive DataCloud is a platform to deploy
applications that are
● Scalable—Its multi-tenant architecture can support multiple users and applications for delivery of diverse data-centric solutions such as data integration. The applications themselves scale to handle fluctuating data volumes.
● Flexible—Pervasive DataCloud supports SaaS-to-SaaS, SaaS-to-on-premise, or on-premise-to-on-premise integration.
● Easy to Access and Configure—Customers can access, configure and run Pervasive DataCloud-based integration solutions via a browser.
● Robust—Provides automatic delivery of updates as well as monitoring of activity by account, application, or user, allowing effortless result tracking.
● Secure—Uses the best technologies in the market coupled with the best data centers and hosting services to ensure that the service remains secure and available.
● Affordable—The platform enables delivery of packaged solutions in a SaaS-friendly pay-as-you-go model.
Bluewolf
Bluewolf has announced its expanded "Integration-as-a-Service" solution, the first to offer ongoing support of integration projects, guaranteeing successful integration between diverse SaaS solutions, such as Salesforce.com, BigMachines, eAutomate, OpenAir, and back office systems (e.g., Oracle, SAP, Great Plains, SQL Server and MySQL). Called the Integrator, the solution is the only one to include proactive monitoring and consulting services to ensure integration success. With remote monitoring of integration jobs via a dashboard included as part of the Integrator solution, Bluewolf proactively alerts its customers of any issues with integration and helps to solve them quickly.
Online MQ
Online MQ is an Internet-based queuing system. It is a complete and
secure online messaging solution for sending and receiving messages
over any network. It is a cloud messaging queuing service.
● Ease of Use. It is an easy way for programs that may each be
running on different platforms, in different systems and different
networks, to communicate with each other without having to write
any low-level communication code.
● No Maintenance. No need to install any queuing software/server
and no need to be concerned with MQ server uptime, upgrades and
maintenance.
● Load Balancing and High Availability. Load balancing can be achieved on a busy system by arranging for more than one program instance to service a queue. The performance and availability features are met through clustering; that is, if one system fails, then the second system can take care of users' requests without any delay.
● Easy Integration. Online MQ can be used as a web service (SOAP) and as a REST service. It is fully JMS-compatible and can hence integrate easily with any Java EE application server. Online MQ is not limited to any specific platform, programming language or communication protocol. A minimal REST-style sketch follows this list.
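As a rough illustration of the "Easy Integration" point above, the following Python sketch sends and receives a message through a hosted queuing service over REST; the endpoint URL, resource paths, and token are hypothetical placeholders, not Online MQ's actual API.

import json
import urllib.request

BASE_URL = "https://queue.example.com/api/queues/orders"   # assumed endpoint
HEADERS = {"Content-Type": "application/json",
           "Authorization": "Bearer <api-token>"}           # assumed auth scheme

def send_message(payload: dict) -> None:
    # POST a JSON message onto the hosted queue.
    req = urllib.request.Request(BASE_URL + "/messages",
                                 data=json.dumps(payload).encode("utf-8"),
                                 headers=HEADERS, method="POST")
    with urllib.request.urlopen(req) as resp:
        print("send status:", resp.status)

def receive_message() -> dict:
    # GET the next pending message from the queue.
    req = urllib.request.Request(BASE_URL + "/messages/next",
                                 headers=HEADERS, method="GET")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    send_message({"orderId": 42, "status": "shipped"})
    print(receive_message())

Because the queue is reached purely over HTTP, programs written in any language and running on any platform can exchange messages without low-level communication code.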
CloudMQ
CloudMQ leverages the power of the Amazon cloud to provide enterprise-grade message queuing capabilities on demand. Messaging allows us to reliably break up a single process into several parts, which can then be executed asynchronously.
Linxter
Linxter is a cloud messaging framework for connecting all kinds of applications, devices, and systems. Linxter is a behind-the-scenes, message-oriented, cloud-based middleware technology that smoothly automates the complex tasks developers face when creating communication-based products and services.
Online MQ, CloudMQ and Linxter all accomplish message-based application and service integration. As these suites are hosted in clouds, messaging is provided as a service to hundreds of distributed and enterprise applications using the much-maligned multi-tenancy property. "Messaging middleware as a service (MMaaS)" is the grand derivative of the SaaS paradigm.
SaaS INTEGRATION SERVICES
We have seen the state-of-the-art cloud-based data integration
platforms for real-time data sharing among enterprise information
systems and cloud applications.
There are fresh endeavours to achieve service composition in the cloud ecosystem. Existing frameworks such as the service component architecture (SCA) are being revitalised to make them fit for cloud environments. Composite applications, services, data, views and processes will become cloud-centric and cloud-hosted in order to support spatially separated and heterogeneous systems.
Informatica On-Demand
Informatica offers a set of innovative on-demand data integration
solutions called Informatica On-Demand Services. This is a cluster of
easy-to-use SaaS offerings, which facilitate integrating data in SaaS
applications, seamlessly and securely across the Internet with data in
on-premise applications. There are a few key benefits to leveraging this
maturing technology.
● Rapid development and deployment with zero maintenance of the
integration technology.
● Automatically upgraded and continuously enhanced by vendor.
● Proven SaaS integration solutions, such as integration with Salesforce.com, meaning that the connections and the metadata understanding are provided.
● Proven data transfer and translation technology, meaning that
core integration services such as connectivity and semantic
mediation are built into the technology.
Informatica On-Demand has taken the unique approach of moving
its industry leading PowerCenter Data Integration Platform to the
hosted model and then configuring it to be a true multi-tenant
solution.
Microsoft Internet Service Bus (ISB)
Azure is an upcoming cloud operating system from Microsoft. It makes developing, deploying and delivering Web and Windows applications on cloud centers easier and more cost-effective.
Microsoft .NET Services is a set of Microsoft-built and hosted cloud infrastructure services for building Internet-enabled applications, and the ISB acts as the cloud middleware providing diverse applications with a common infrastructure to name, discover, expose, secure and orchestrate web services. The following are the three broad areas.
.NET Service Bus. The .NET Service Bus (Figure 3.7) provides a hosted, secure, and broadly accessible infrastructure for pervasive communication, large-scale event distribution, naming, and service publishing. Services can be exposed through the Service Bus Relay, providing connectivity options for service endpoints that would otherwise be difficult or impossible to reach.
FIGURE 3.7. .NET Service Bus (end users and console applications exposing web services connect through the .NET Services Service Bus to applications on the Azure Services Platform and Windows Azure).
.NET Access Control Service. The .NET Access Control Service is a
hosted, secure, standards-based infrastructure for multiparty, federated
authentication, rules-driven, and claims-based authorization.
.NET Workflow Service. The .NET Workflow Service provides a hosted environment for service orchestration based on the familiar Windows Workflow Foundation (WF) development experience.
The most important part of Azure is actually the Service Bus, represented as a WCF architecture. The key capabilities of the Service Bus are:
● A federated namespace model that provides a shared, hierarchical namespace into which services can be mapped.
● A service registry service that provides an opt-in model for publishing service endpoints into a lightweight, hierarchical, and RSS-based discovery mechanism.
● A lightweight and scalable publish/subscribe event bus.
● A relay and connectivity service with advanced NAT traversal and pull-mode message delivery capabilities, acting as a "perimeter network (also known as DMZ, demilitarized zone, or screened subnet) in the sky."
Relay Services. Often the service we want to connect to is located behind a firewall and behind a load balancer; its address is dynamic and can be resolved only on the local network. When the service must call back to the client, these connectivity constraints lead to scalability, availability and security issues. The solution to these Internet connectivity challenges is, instead of connecting the client directly to the service, to route the interaction through a relay service, as pictorially represented in Figure 3.8.
FIGURE 3.8. The .NET Relay Service (client and service communicating through the relay).
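The following minimal, in-process Python sketch illustrates the relay pattern just described; it is a conceptual stand-in, not the .NET Service Bus API. Both the client and the firewalled service interact only outward with a relay object, which forwards requests and responses between them.

import queue
import threading

class Relay:
    """Relay rendezvous point reachable by both parties."""
    def __init__(self):
        self.requests = queue.Queue()
        self.responses = queue.Queue()

    def call(self, request):
        """Client side: hand a request to the relay and wait for the reply."""
        self.requests.put(request)
        return self.responses.get(timeout=5)

    def serve_one(self, handler):
        """Service side: pull one relayed request and push back the result."""
        request = self.requests.get(timeout=5)
        self.responses.put(handler(request))

relay = Relay()

def backend_service():
    # The "firewalled" service never accepts inbound connections; it only
    # pulls work from the relay and pushes results back.
    relay.serve_one(lambda req: {"echo": req, "handled_by": "on-premise service"})

threading.Thread(target=backend_service, daemon=True).start()
print(relay.call({"op": "ping"}))

In a real deployment the relay would be a hosted, Internet-facing endpoint, so neither party needs a publicly routable address.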
BUSINESS-TO-BUSINESS INTEGRATION (B2Bi) SERVICES
B2Bi has been a mainstream activity for connecting geographically
distributed businesses for purposeful and beneficial cooperation.
Product vendors have come out with competent B2B hubs and suites for enabling smooth data sharing in a standards-compliant manner among the participating enterprises.
Just as these abilities ensure smooth communication between
manufacturers and their external suppliers or customers, they also
enable reliable interchange between hosted and installed applications.
The IaaS model also leverages the adapter libraries developed by
B2Bi vendors to provide rapid integration with various business
systems.
Cloud-based Enterprise Mashup Integration Services for B2B Scenarios. There is a vast need for infrequent, situational and ad-hoc B2B applications desired by the mass of business end-users. Especially in the area of applications to support B2B collaborations, current offerings are characterized by high richness but low reach, like B2B hubs that focus on many features enabling electronic collaboration but lack availability, especially for small organizations or even individuals.
Enterprise Mashups, a kind of new-generation Web-based application, seem to adequately fulfill the individual and heterogeneous requirements of end-users and foster End User Development (EUD).
Another challenge in B2B integration is the ownership of and
responsibility for processes. In many inter-organizational settings,
business processes are only sparsely structured and formalized, rather
loosely coupled and/or based on ad-hoc cooperation. Inter-organizational collaborations tend to involve more and more participants, and the growing number of participants also brings a huge amount of differing requirements.
Now, in supporting supplier and partner co-innovation and customer co-creation, the focus is shifting to collaboration, which has to embrace participants who are influenced yet restricted by multiple domains of control and disparate processes and practices.
Both electronic data interchange (EDI) translators and managed file transfer (MFT) have a longer history, while B2B gateways have only emerged during the last decade.
Enterprise Mashup Platforms and Tools.
Mashups are the adept combination of different and distributed
resources including content, data or application functionality. Resources
represent the core building blocks for mashups. Resources can be
accessed through APIs, which encapsulate the resources and describe
the interface through which they are made available. Widgets or gadgets
primarily put a face on the underlying resources by providing a
graphical representation for them and piping the data received from the
resources. Piping can include operators like aggregation, merging or
filtering. A mashup platform is a Web-based tool that allows the creation of mashups by piping resources into gadgets and wiring gadgets together.
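As a toy illustration of piping, the following Python sketch merges two hypothetical feeds, filters them, and aggregates the result before a gadget would render it; the feed contents and field names are made up for the example.

def merge(*feeds):
    # Piping operator: concatenate several resources into one stream.
    for feed in feeds:
        yield from feed

def keep(items, predicate):
    # Piping operator: filter out items that do not satisfy the predicate.
    return (item for item in items if predicate(item))

def aggregate(items, key):
    # Piping operator: sum the "amount" field grouped by the given key.
    totals = {}
    for item in items:
        totals[item[key]] = totals.get(item[key], 0) + item["amount"]
    return totals

sales_feed = [{"region": "EMEA", "amount": 120}, {"region": "APAC", "amount": 80}]
crm_feed = [{"region": "EMEA", "amount": 40}, {"region": "AMER", "amount": 15}]

piped = aggregate(keep(merge(sales_feed, crm_feed),
                       lambda r: r["amount"] >= 20),
                  key="region")
print(piped)   # a gadget would render this result, e.g. {'EMEA': 160, 'APAC': 80}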
The Mashup integration services are being implemented as a prototype in the FAST project. The layers of the prototype are illustrated in Figure 3.9, which describes the architecture and how these services work together. The authors of this framework have given an outlook on the technical realization of the services using cloud infrastructures and services.
FIGURE 3.9. Cloud-based Enterprise Mashup Integration Platform Architecture (browsers at Company A and Company B connect over HTTP to enterprise mashup platforms such as FAST and SAP Research Rooftop, which talk via REST to the Mashup integration service logic hosted on an integration services platform, e.g., Google App Engine; that logic comprises a routing engine, identity management, error handling and monitoring, a translation engine, a message queue and persistent storage, backed by cloud-based services such as Amazon SQS, Amazon S3, Mule onDemand and OpenID/OAuth).
To simplify this, a Gadget could be provided for the end-user. The
routing engine is also connected to a message queue via an API. Thus,
different message queue engines are attachable. The message queue is
responsible for storing and forwarding the messages controlled by the
routing engine. Beneath the message queue, a persistent storage, also
connected via an API to allow exchangeability, is available to store
large data. The error handling and monitoring service allows tracking
the message-flow to detect errors and to collect statistical data. The
Mashup integration service is hosted as a cloud-based service. Also,
there are cloud-based services available which provide the functionality
required by the integration service. In this way, the Mashup integration
service can reuse and leverage the existing cloud services to speed up
the implementation.
Message Queue. The message queue could be realized by using Amazon's Simple Queue Service (SQS). SQS is a web service which provides a queue for messages and stores them until they can be processed. The Mashup integration services, especially the routing engine, can put messages into the queue and recall them when they are needed.
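A hedged sketch of how the routing engine might use SQS through the boto3 Python SDK is shown below; it assumes AWS credentials are already configured, and the queue name and message body are arbitrary examples.

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.create_queue(QueueName="mashup-routing-queue")["QueueUrl"]

# The routing engine puts a message into the queue ...
sqs.send_message(QueueUrl=queue_url,
                 MessageBody='{"route": "companyB", "payload": "order-123"}')

# ... and recalls it later when it can be processed.
reply = sqs.receive_message(QueueUrl=queue_url,
                            MaxNumberOfMessages=1, WaitTimeSeconds=2)
for msg in reply.get("Messages", []):
    print("processing:", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])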
Persistent Storage. Amazon Simple Storage Service (S3) is also a web service. The routing engine can use this service to store large files.
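A corresponding sketch for S3, again using boto3 and assuming the bucket already exists and credentials are configured; bucket and key names are illustrative.

import boto3

s3 = boto3.client("s3")

# Store a large payload under a key ...
s3.put_object(Bucket="mashup-integration-store",
              Key="payloads/order-123.xml",
              Body=b"<order id='123'>...</order>")

# ... and retrieve it later when the routing engine needs it.
obj = s3.get_object(Bucket="mashup-integration-store",
                    Key="payloads/order-123.xml")
print(obj["Body"].read())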
Translation Engine. This is primarily focused on translating between the different protocols that the connected Mashup platforms can understand, e.g. REST or SOAP web services. However, if the need to translate the transferred objects arises, this capability could also be attached to the translation engine.
Interaction between the Services. The diagram describes the process of
a message being delivered and handled by the Mashup Integration
Services Platform. The precondition for this process is that a user
already established a route to a recipient.
A FRAMEWORK OF SENSOR—CLOUD INTEGRATION
In the past few years, wireless sensor networks (WSNs) have been gaining significant attention because of their potential to enable novel and attractive solutions in areas such as industrial automation, environmental monitoring, transportation business, health care, etc.
With the faster adoption of micro and nano technologies, everyday
things are destined to become digitally empowered and smart in their
operations and offerings. Thus the goal is to link smart materials,
appliances, devices, federated messaging middleware, enterprise
information systems and packages, ubiquitous services, handhelds, and
sensors with one another smartly to build and sustain cool, charismatic
and catalytic situation-aware applications.
A virtual community consisting of a team of researchers has come together to solve a complex problem; they need data storage, compute capability and security, and they need it all provided now. For example, this team is working on an outbreak of a new virus strain moving through a population. This requires more than a Wiki or other social organization tool. They deploy bio-sensors on patients' bodies to monitor patient condition continuously and use this data for large and multi-scale simulations to track the spread of infection as well as the virus mutation and possible cures.
This may require computational resources and a platform for sharing data
and results that are not immediately available to the team.
A traditional HPC approach like the sensor-grid model can be used in this case, but setting up the infrastructure to deploy it so that it can scale out quickly is not easy in this environment. The cloud paradigm, however, is an excellent fit.
Here, the researchers need to register their interests to get various patients' state (blood pressure, temperature, pulse rate, etc.) from bio-sensors for large-scale parallel analysis and to share this information with each other to find a useful solution to the problem. So the sensor data needs to be aggregated, processed and disseminated based on subscriptions.
To integrate sensor networks with the cloud, the authors have proposed a content-based pub-sub model. In this framework, as in MQTT-S, all of the system complexities reside on the broker's side, but it differs from MQTT-S in that it uses a content-based pub-sub broker rather than a topic-based one, which is suitable for the application scenarios considered.
To deliver published sensor data or events to subscribers, an efficient
and scalable event matching algorithm is required by the pub-sub
broker.
Moreover, several SaaS applications may have an interest in the same
sensor data but for different purposes. In this case, the SA nodes would
need to manage and maintain communication means with multiple
applications in parallel. This might exceed the limited capabilities of the
simple and low-cost SA devices. So a pub-sub broker is needed, and it is located on the cloud side because of its higher performance in terms of bandwidth and capabilities. It has four components, described as follows:
FIGURE 3.10. The Framework Architecture of Sensor-Cloud Integration. WSNs with sensors and actuators connect through gateways to the cloud provider (CLP), which hosts the pub/sub broker with its stream monitoring and processing, registry, analyzer and disseminator components, along with application-specific services (SaaS) such as a social network of doctors monitoring patient healthcare for virus infection, environmental data analysis and sharing portals, and urban traffic prediction networks, plus monitoring and metering, provisioning and system managers, a service registry, a mediator, a policy repository and a collaborating agent.
Stream monitoring and processing component (SMPC). The sensor stream comes in many different forms. In some cases it is raw data that must be captured, filtered and analyzed on the fly, and in other cases it is stored or cached. The style of computation required depends on the nature of the streams. So the SMPC component running on the cloud monitors the event streams and invokes the correct analysis method. Depending on the data rates and the amount of processing that is required, the SMPC manages a parallel execution framework on the cloud.
Registry component (RC). Different SaaS applications register to pub-sub
broker for various sensor data required by the community user.
Analyzer component (AC). When sensor data or events come to the pub-sub broker, the analyzer component determines which applications they belong to and whether they need periodic or emergency delivery.
Disseminator component (DC). For each SaaS application, it disseminates sensor data or events to subscribed users using the event matching algorithm. It can utilize the cloud's parallel execution framework for fast event delivery. The pub-sub components' workflow in the framework is as follows:
Users register their information and subscriptions to various SaaS
applications which then transfer all this information to pub/sub broker
registry. When sensor data reaches the system from the gateways, the event/stream monitoring and processing component (SMPC) in the pub/sub broker determines whether the data needs processing, should just be stored for periodic delivery, or requires immediate delivery.
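The following Python sketch illustrates the idea of content-based matching in such a pub/sub broker: subscriptions are predicates over event attributes rather than topic names. The application names, attribute names, and thresholds are illustrative assumptions, not part of the proposed framework's actual implementation.

# Each subscription is a predicate over the content of an event.
subscriptions = {
    "ward-monitoring-app": lambda e: e["type"] == "pulse" and e["value"] > 120,
    "epidemic-analysis-app": lambda e: e["type"] == "temperature" and e["value"] >= 39.0,
}

def disseminate(event):
    """Deliver an incoming sensor event to every subscriber whose predicate matches."""
    for app, matches in subscriptions.items():
        if matches(event):
            print(f"deliver {event} -> {app}")

# Events arriving from the sensor gateways
disseminate({"patient": "P-17", "type": "pulse", "value": 133})
disseminate({"patient": "P-17", "type": "temperature", "value": 37.2})

A scalable broker would replace the linear scan with an efficient event matching algorithm and run the matching in parallel on the cloud, as the text describes.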
Mediator. The (resource) mediator is a policy-driven entity within a VO to
ensure that the participating entities are able to adapt to changing
circumstances and are able to achieve their objectives in a dynamic and
uncertain environment.
Policy Repository (PR). The PR virtualizes all of the policies within the
VO. It includes the mediator policies, VO creation policies along with any
policies for resources delegated to the VO as a result of a collaborating
arrangement.
Collaborating Agent (CA). The CA is a policy-driven resource discovery
module for VO creation and is used as a conduit by the mediator to
exchange policy and resource information with other CLPs.
SaaS INTEGRATION APPLIANCES
Appliances are a good fit for high-performance requirements. Clouds too have gone down the same path, and today there are cloud appliances (also termed "cloud in a box"). In this section, we look at an integration appliance.
Cast Iron Systems. This is quite different from the above-mentioned schemes. Appliances with the relevant software etched inside are being established as a high-performance and hardware-centric solution for several IT needs. Cast Iron Systems (www.ibm.com) provides pre-configured solutions for each of today's leading enterprise and on-demand applications. These solutions, built using the Cast Iron product offerings, offer out-of-the-box connectivity to specific applications and template integration processes (TIPs) for the most common integration scenarios.
2.4 THE ENTERPRISE CLOUD COMPUTING
PARADIGM
Cloud computing is still in its early stages and constantly undergoing
changes as new vendors, offers, services appear in the cloud market.
Enterprises will place stringent requirements on cloud providers to pave the way for more widespread adoption of cloud computing, leading to what is known as the enterprise cloud computing paradigm. Enterprise cloud computing is the alignment of a cloud computing model with an organization's business objectives (profit, return on investment, reduction of operations costs) and processes. This chapter
explores this paradigm with respect to its motivations, objectives,
strategies and methods.
Section 4.2 describes a selection of deployment models and strategies
for enterprise cloud computing, while Section 4.3 discusses the issues of
moving [traditional] enterprise applications to the cloud. Section 4.4
describes the technical and market evolution for enterprise cloud
computing,
describing some potential opportunities for multiple
stakeholders in the provision of enterprise cloud computing.
BACKGROUND
According to NIST [1], cloud computing is composed of five essential
characteristics: on-demand self-service, broad network access, resource
pooling, rapid elasticity, and measured service. The ways in which these
characteristics are manifested in an enterprise context vary according to the
deployment model employed.
Relevant Deployment Models for Enterprise Cloud Computing
There are some general cloud deployment models that are accepted by the majority of cloud stakeholders today, as suggested by reference [1] and discussed in the following:
● Public clouds are provided by a designated service provider for the general public under a utility-based, pay-per-use consumption model.
● Private clouds are built, operated, and managed by an organization for its internal use only, to support its business operations exclusively.
● Virtual private clouds are a derivative of the private cloud deployment model but are further characterized by an isolated and secure segment of resources, created as an overlay on top of public cloud infrastructure using advanced network virtualization capabilities.
● Community clouds are shared by several organizations and support a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations).
● Managed clouds arise when the physical infrastructure is owned by and/or physically located in the organization's data centers, with an extension of the management and security control plane controlled by the managed service provider.
● Hybrid clouds are a composition of two or more clouds (private,
community, or public) that remain unique entities but are bound
together by standardized or proprietary technology that enables data
and application portability (e.g., cloud bursting for load-balancing
between clouds).
Adoption and Consumption Strategies
The selection of strategies for enterprise cloud computing is critical for IT
capability as well as for the earnings and costs the organization experiences,
motivating efforts toward convergence of business strategies and IT. Some
critical questions toward this convergence in the enterprise cloud paradigm are
as follows:
● Will an enterprise cloud strategy increase overall business value?
● Are the effort and risks associated with transitioning to an enterprise
cloud strategy worth it?
● Which areas of business and IT capability should be considered for the
enterprise cloud?
● Which cloud offerings are relevant for the purposes of an organization?
● How can the process of transitioning to an enterprise cloud strategy be
piloted and systematically executed?
These questions are addressed from two strategic perspectives: (1) adoption
and (2) consumption. Figure 4.1 illustrates a framework for enterprise cloud
adoption strategies, where an organization makes a decision to adopt a
cloud computing model based on fundamental drivers for cloud computing—
scalability, availability, cost and convenience. The notion of a Cloud Data
Center (CDC) is used, where the CDC could be an external, internal or
federated provider of infrastructure, platform or software services.
An optimal adoption decision cannot be established for all cases, because the types of resources (infrastructure, storage, software) obtained from a CDC depend on the size of the organization, its understanding of the impact of IT on the business, the predictability of workloads, the flexibility of the existing IT landscape, and the available budget/resources for testing and piloting. The strategic decisions using these four basic drivers are described in the following, stating objectives, conditions and actions.
FIGURE 4.1. Enterprise cloud adoption strategies using fundamental cloud drivers. Around the Cloud Data Center(s) (CDC): scalability-driven (use of cloud resources to support additional load or as back-up), availability-driven (use of load-balanced and localised cloud resources to increase availability and reduce response time), market-driven (users and providers of cloud resources make decisions based on the potential saving and profit), and convenience-driven (use cloud resources so that there is no need to maintain local resources).
1. Scalability-Driven Strategy. The objective is to support increasing workloads of the organization without investment and expenses exceeding returns.
2. Availability-Driven Strategy. Availability is closely related to scalability but is more concerned with the assurance that IT capabilities and functions are accessible, usable and acceptable by the standards of users.
3. Market-Driven Strategy. This strategy is more attractive and viable for small, agile organizations that do not have (or wish to have) massive investments in their IT infrastructure; they select cloud offerings based on their profiles and requested service requirements.
4. Convenience-Driven Strategy. The objective is to reduce the load and need for dedicated system administrators and to make access to IT capabilities by users easier, regardless of their location and connectivity (e.g., over the Internet).
FIGURE 4.2. Enterprise cloud consumption strategies: (1) Software Provision, where the cloud provides instances of software but data is maintained within the user's data center; (2) Storage Provision, where the cloud provides data management and software accesses data remotely from the user's data center; (3) Solution Provision, where software and storage are maintained in the cloud and the user does not maintain a data center; and (4) Redundancy Services, where the cloud is used as an alternative or extension of the user's data center for software and storage.
There are four consumption strategies identified, where the differences in objectives, conditions and actions reflect the decision of an organization to trade off hosting costs, controllability and resource elasticity of IT resources for software and data. These are discussed in the following.
1. Software Provision. This strategy is relevant when the elasticity requirement is high for software and low for data, the controllability concerns are low for software and high for data, and the cost reduction concerns for software are high, while cost reduction is not a priority for data, given the high controllability concerns for data, that is, data are highly sensitive.
2. Storage Provision. This strategy is relevant when the elasticity requirement is high for data and low for software, while the controllability of software is more critical than that of data. This can be the case for data-intensive applications, where the results from processing in the application are more critical and sensitive than the data itself.
3. Solution Provision. This strategy is relevant when the elasticity and cost reduction requirements are high for software and data, but the controllability requirements can be entrusted to the CDC.
4. Redundancy Services. This strategy can be considered as a hybrid enterprise cloud strategy, where the organization switches between traditional, software, storage or solution management based on changes in its operational conditions and business demands.
Even though an organization may find a strategy that appears to provide it
significant benefits, this does not mean that immediate adoption of the strategy
is advised or that the returns on investment will be observed immediately.
ISSUES FOR ENTERPRISE APPLICATIONS ON THE CLOUD
Enterprise Resource Planning (ERP) is the most comprehensive definition of
enterprise application today. For these reasons, ERP solutions have emerged as
the core of successful information management and the enterprise
backbone of nearly any organization. Organizations that have successfully implemented ERP systems are reaping the benefits of an integrated working environment, standardized processes and the operational benefits these bring to the organization.
One of the first issues is that of infrastructure availability. Al-Mashari and Yasser argued that adequate IT infrastructure, hardware and networking are crucial for an ERP system's success.
One of the ongoing discussions concerning future scenarios considers varying
infrastructure requirements and constraints given different workloads and
development phases. Recent surveys among companies in North America
and Europe with enterprise-wide IT systems showed that nearly all kinds of
workloads are seen to be suitable to be transferred to IaaS offerings.
Considering Transactional and Analytical Capabilities
Transactional types of applications, or so-called OLTP (On-Line Transaction Processing) applications, refer to a class of systems that manage transaction-oriented applications, typically using relational databases. These applications rely on strong ACID (atomicity, consistency, isolation, durability) properties and are relatively write/update-intensive. Typical OLTP-type ERP components are sales and distribution (SD), banking and financials, customer relationship management (CRM) and supply chain management (SCM).
One can conclude that analytical applications will benefit more than their
transactional counterparts from the opportunities created by cloud computing,
especially on compute elasticity and efficiency.
2.4.1 TRANSITION CHALLENGES
The very concept of the cloud represents a leap from the traditional approach for IT to deliver mission-critical services. With any leap comes the gap of risk and
challenges to overcome. These challenges can be classified in five different
categories, which are the five aspects of the enterprise cloud stages: build,
develop, migrate, run, and consume (Figure 4.3).
The requirement for a company-wide cloud approach should then become
the number one priority of the CIO, especially when it comes to having a
coherent and cost effective development and migration of services on this
architecture.
FIGURE 4.3. Five stages of the cloud: build, develop, migrate, run, and consume.
A second challenge is migration of existing or "legacy" applications to "the cloud." The expected average lifetime of an ERP product is about 15 years, which means that companies will need to face this aspect sooner rather than later as they try to evolve toward the new IT paradigm.
The ownership of enterprise data, conjugated with the integration with other applications both in and outside the cloud, is one of the key challenges. Future enterprise application development frameworks will need to
enable the separation of data management from ownership. From this, it can
be extrapolated that SOA, as a style, underlies the architecture and, moreover,
the operation of the enterprise cloud.
One of these has been notoriously hard to upgrade: the human factor;
bringing staff up to speed on the requirements of cloud computing with respect
to architecture, implementation, and operation has always been a tedious task.
Once the IT organization has either been upgraded to provide cloud or is able to tap into cloud resources, it faces the difficulty of maintaining the services in the cloud. The first challenge will be to maintain interoperability between in-house infrastructure and services and the CDC (Cloud Data Center).
Before leveraging such features, much more basic functionalities are
problematic: monitoring, troubleshooting, and comprehensive capacity
planning are actually missing in most offers. Without such features it becomes
very hard to gain visibility into the return on investment and the consumption
of cloud services.
Today there are two major cloud pricing models: allocation-based and usage-based. The first one is provided by the poster child of cloud computing, namely, Amazon. The principle relies on the allocation of a resource for a fixed amount of time. As companies evaluate the offers, they also need to include hidden costs such as lost IP, risk, migration, delays and provider overheads. This combination can be compared to trying to choose a new mobile phone with a carrier plan. The market dynamics will hence evolve alongside the technology for the enterprise cloud computing paradigm.
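A back-of-the-envelope sketch of the two pricing models is given below; all rates and hours are hypothetical and serve only to show how a company might compare an allocation-based offer with a usage-based one for a single server.

HOURS_PER_MONTH = 730

def allocation_cost(hourly_rate):
    # Allocation-based: the resource is reserved for the full month,
    # so every hour is billed regardless of actual use.
    return hourly_rate * HOURS_PER_MONTH

def usage_cost(hourly_rate, hours_used):
    # Usage-based: only the hours actually consumed are billed.
    return hourly_rate * hours_used

print("allocation-based:", allocation_cost(0.10))      # 73.0 per month (assumed rate)
print("usage-based (200 h):", usage_cost(0.12, 200))   # 24.0 per month (assumed rate)

Hidden costs such as migration, risk, and provider overheads would have to be added on top of either figure before a fair comparison can be made.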
ENTERPRISE CLOUD TECHNOLOGY AND MARKET EVOLUTION
This section discusses the potential factors which will influence this evolution of
cloud computing and today‘s enterprise landscapes to the enterprise computing
paradigm, featuring the convergence of business and IT and an open, service
oriented marketplace.
Technology Drivers for Enterprise Cloud Computing Evolution
This will put pressure on cloud providers to build their offerings on open, interoperable standards in order to be considered as candidates by enterprises. A number of initiatives have emerged in this space, although major providers such as Amazon, Google, and Microsoft currently do not actively participate in these efforts, so true interoperability across the board seems unlikely in the near future. However, if achieved, it could lead to the facilitation of advanced scenarios and thus drive mainstream adoption of the enterprise cloud computing paradigm.
Part of preserving investments is maintaining the assurance that cloud
resources and services powering the business operations perform according
to the business requirements. Underperforming resources or service disruptions
lead to business and financial loss, reduced business credibility, reputation,
and marginalized user productivity. Another important factor in this regard is the lack of insight into the performance and health of the resources and services deployed on the cloud, making this another area of technology evolution that will be pushed.
This would prove to be a critical capability empowering third-party
organizations to act as independent auditors especially with respect to SLA
compliance auditing and for mediating the SLA penalty related issues.
An emerging trend in the cloud application space is the divergence from the traditional RDBMS-based data store backend. Cloud computing has given rise to alternative data storage technologies (Amazon Dynamo, Facebook Cassandra, Google BigTable, etc.) based on key-value storage models, as compared to the relational model, which has been the mainstream choice for data storage for enterprise applications.
As these technologies evolve into maturity, the PaaS market will consolidate
into a smaller number of service providers. Moreover, big traditional software
vendors will also join this market which will potentially trigger this
consolidation through acquisitions and mergers. These views are along the
lines of the research published by Gartner. Gartner predicts that from 2011 to
2015 market competition and maturing developer practises will drive
consolidation around a small group of industry-dominant cloud technology
providers.
A recent report published by Gartner presents an interesting perspective on
cloud evolution. The report argues that as cloud services proliferate, they will become too complex to be handled directly by consumers. To cope with these scenarios, meta-services or cloud brokerage services will emerge. These brokerages will use several types of brokers and platforms to enhance service delivery and, ultimately, service value. According to Gartner, before these scenarios can be enabled, a brokerage business is needed to use these brokers and platforms. Gartner foresees the following types of cloud service brokerages (CSB):
● Cloud Service Intermediation. An intermediation broker provides a service that directly enhances a given service delivered to one or more service consumers, essentially adding value on top of a given service to enhance a specific capability.
● Aggregation. An aggregation brokerage service combines multiple
services into one or more new services.
● Cloud Service Arbitrage. These services will provide flexibility and
opportunistic choices for the service aggregator.
The above shows that there is potential for various large, medium, and
small organizations to become players in the enterprise cloud marketplace.
The dynamics of such a marketplace are still to be explored as the enabling
technologies and standards continue to mature.
BUSINESS DRIVERS TOWARD A MARKETPLACE FOR
ENTERPRISE CLOUD COMPUTING
In order to create an overview of offerings and consuming players on the
market, it is important to understand the forces on the market and motivations
of each player.
The Porter model consists of five influencing factors/views (forces) on the
market (Figure 4.4). The intensity of rivalry on the market is traditionally
influenced by industry-specific characteristics :
● Rivalry: The number of companies dealing with cloud and virtualization technology is quite high at the moment; this might be a sign of high rivalry. But the products and offers are also quite varied, so many niche products tend to become established.
FIGURE 4.4. Porter's five forces market model (adjusted for the cloud market): the cloud market (cost structure, product/service ranges, differentiation strategy, number/size of players) is shaped by new market entrants (geographical factors, entrant strategy, routes to market), suppliers (level of quality, supplier's size, bidding processes/capabilities), buyers/consumers (buyer size, number of buyers, product/service requirements), and technology development (substitutes, trends, legislative effects).
● Obviously, the cloud-virtualization market is presently booming and will
keep growing during the next years. Therefore the fight for customers and
struggle for market share will begin once the market becomes saturated
and companies start offering comparable products.
● The initial costs for huge data centers are enormous. By building up
federations of computing and storing utilities, smaller companies can try
to make use of this scale effect as well.
● Low switching costs or high exit barriers influence rivalry. When a customer can freely switch from one product to another, there is a greater struggle to capture customers. From the opposite point of view, high exit barriers discourage customers from buying into a new technology. The trend towards standardization of formats and architectures tries to address this problem. Most current cloud providers are only paying attention to standards related to the interaction with the end user; standards for cloud interoperability are still to be developed.
FIGURE 4.5. Dynamic business models (based on [49], extended by influence factors identified by [50]): the business model is shaped by the market, regulations, the hype cycle phase, and technology.
THE CLOUD SUPPLY CHAIN
One indicator of what such a business model would look like is in the complexity
of deploying, securing, interconnecting and maintaining enterprise landscapes
and solutions such as ERP, as discussed in Section 4.3. The concept of a Cloud
Supply Chain (C-SC) and hence Cloud Supply Chain Management (C-SCM)
appear to be viable future business models for the enterprise cloud computing
paradigm. The idea of C-SCM represents the management of a network of
interconnected businesses involved in the end-to-end provision of product and
service packages required by customers. The established understanding of a
supply chain is two or more parties linked by a flow of goods, information,
and funds [55], [56] A specific definition for a C-SC is hence: ―two or more
parties linked by the provision of cloud services, related information
and funds.‖ Figure 4.6 represents a concept for the C-SC, showing the flow
of products along different organizations such as hardware suppliers, software
component suppliers, data center operators, distributors and the end customer.
Figure 4.6 also makes a distinction between innovative and functional
products in the C-SC. Fisher classifies products primarily on the basis of their
demand patterns into two categories: primarily functional or primarily
innovative [57]. Due to their stability, functional products favor competition,
which leads to low profit margins and, as a consequence of their properties, to
low inventory costs, low product variety, low stockout costs, and low
obsolescence [58], [57]. Innovative products are characterized by additional
(other) reasons for a customer in addition to basic needs that lead to purchase,
unpredictable demand (that is high uncertainties, difficult to forecast and
variable demand), and short product life cycles (typically 3 months to 1
year). Cloud services should fulfill basic needs of customers and favor competition due to their reproducibility.
FIGURE 4.6. Cloud supply chain (C-SC): cloud services, information, and funds flow from hardware and component suppliers through data center operators and distributors to the end customer, with potential closed-loop cooperation; the traded product may be functional or innovative.
Table 4.1 presents a comparison of traditional supply chain concepts, such as the efficient SC and the responsive SC, with a new concept for the emerging ICT area of cloud computing, in which cloud services are the traded products.
TABLE 4.1. Comparison of Traditional and Emerging ICT Supply Chains (based on references 54 and 57)
Primary goal
  Efficient SC: supply demand at the lowest level of cost.
  Responsive SC: respond quickly to demand (changes).
  Cloud SC: supply demand at the lowest level of costs and respond quickly to demand.
Product design strategy
  Efficient SC: maximize performance at the minimum product cost.
  Responsive SC: create modularity to allow postponement of product differentiation.
  Cloud SC: create modularity to allow individual setting while maximizing the performance of services.
Pricing strategy
  Efficient SC: lower margins, because price is a prime customer driver.
  Responsive SC: higher margins, because price is not a prime customer driver.
  Cloud SC: lower margins, as there is high competition and comparable products.
Manufacturing strategy
  Efficient SC: lower costs through high utilization.
  Responsive SC: maintain capacity flexibility to meet unexpected demand.
  Cloud SC: high utilization while reacting flexibly to demand.
Supplier strategy
  Efficient SC: select based on cost and quality.
  Responsive SC: select based on speed, flexibility, and quantity.
  Cloud SC: select based on a complex optimum of speed, cost, and flexibility.
Inventory strategy
  Efficient SC: minimize inventory to lower cost.
  Responsive SC: maintain buffer inventory to meet unexpected demand.
  Cloud SC: optimize buffer for unpredicted demand and best utilization.
Lead time strategy
  Efficient SC: reduce, but not at the expense of costs.
  Responsive SC: aggressively reduce, even if the costs are significant.
  Cloud SC: strong service-level agreements (SLAs) for ad hoc provision.
Transportation strategy
  Efficient SC: greater reliance on low-cost modes.
  Responsive SC: greater reliance on responsive modes.
  Cloud SC: implement highly responsive and low-cost modes.
INTRODUCTION TO CLOUD
COMPUTING
CLOUD COMPUTING IN A NUTSHELL
Computing itself, to be considered fully virtualized, must allow computers to
be built from distributed components such as processing, storage, data, and
software resources.
Technologies such as cluster, grid, and now, cloud computing, have all
aimed at allowing access to large amounts of computing power in a fully
virtualized manner, by aggregating resources and offering a single system
view. Utility computing describes a business model for on-demand delivery of computing power; consumers pay providers based on usage ("pay-as-you-go"), similar to the way in which we currently obtain services from traditional public utilities such as water, electricity, gas, and telephony.
Cloud computing has been coined as an umbrella term to describe a
category of sophisticated on-demand computing services initially offered by
commercial providers such as Amazon, Google, and Microsoft. It denotes a model in which a computing infrastructure is viewed as a "cloud," from which businesses and individuals access applications from anywhere in the world on demand. The main principle behind this model is offering computing, storage, and software "as a service."
Many practitioners in the commercial and academic spheres have attempted
to define exactly what ―cloud computing‖ is and what unique characteristics it
presents. Buyya et al. have defined it as follows: "Cloud is a parallel and distributed computing system consisting of a collection of inter-connected and virtualised computers that are dynamically provisioned and presented as one or more unified computing resources based on service-level agreements (SLA) established through negotiation between the service provider and consumers."
Vaquero et al. have stated "clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically reconfigured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized Service Level Agreements."
A recent McKinsey and Co. report claims that "Clouds are hardware-based services offering compute, network, and storage capacity where: Hardware management is highly abstracted from the buyer, buyers incur infrastructure costs as variable OPEX, and infrastructure capacity is highly elastic."
A report from the University of California, Berkeley summarized the key characteristics of cloud computing as: "(1) the illusion of infinite computing resources; (2) the elimination of an up-front commitment by cloud users; and (3) the ability to pay for use . . . as needed . . ."
The National Institute of Standards and Technology (NIST) characterizes cloud computing as "... a pay-per-use model for enabling available, convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, servers, storage, applications, services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."
In a more generic definition, Armbrust et al. define cloud as the "data center hardware and software that provide services." Similarly, Sotomayor et al. point out that "cloud" is more often used to refer to the IT infrastructure deployed on an Infrastructure-as-a-Service provider data center. While there are countless other definitions, there seem to be common characteristics among the most notable ones listed above, which a cloud should have: (i) pay-per-use (no ongoing commitment, utility prices); (ii) elastic capacity and the illusion of infinite resources; (iii) a self-service interface; and (iv) resources that are abstracted or virtualised.
ROOTS OF CLOUD COMPUTING
We can track the roots of cloud computing by observing the advancement of
several technologies, especially in hardware (virtualization, multi-core chips),
Internet technologies (Web services, service-oriented architectures, Web 2.0),
distributed computing (clusters, grids), and systems management (autonomic
computing, data center automation). Figure 1.1 shows the convergence of
technology fields that significantly advanced and contributed to the advent
of cloud computing.
Some of these technologies have been tagged as hype in their early stages
of development; however, they later received significant attention from
academia and were sanctioned by major industry players. Consequently, a
specification and standardization process followed, leading to maturity and
wide adoption. The emergence of cloud computing itself is closely linked to
the maturity of such technologies. We present a closer look at the technologies
that form the base of cloud computing, with the aim of providing a clearer
picture of the cloud ecosystem as a whole.
From Mainframes to Clouds
We are currently experiencing a switch in the IT world, from in-house
generated computing power into utility-supplied computing resources delivered
over the Internet as Web services. This trend is similar to what occurred about a
century ago when factories, which used to generate their own electric power, realized that it was cheaper just to plug their machines into the newly formed electric power grid.
Computing delivered as a utility can be defined as "on-demand delivery of infrastructure, applications, and business processes in a security-rich, shared, scalable, and standards-based computer environment over the Internet for a fee."
FIGURE 1.1. Convergence of various advances leading to the advent of cloud computing: hardware (hardware virtualization, multi-core chips), Internet technologies (Web services, Web 2.0, SOA, mashups), distributed computing (utility and grid computing), and systems management (autonomic computing, data center automation).
This model brings benefits to both consumers and providers of IT services.
Consumers can attain a reduction in IT-related costs by choosing to obtain cheaper services from external providers as opposed to heavily investing in IT infrastructure and personnel hiring. The "on-demand" component of this
model allows consumers to adapt their IT usage to rapidly increasing or
unpredictable computing needs.
Providers of IT services achieve better operational costs; hardware and
software infrastructures are built to provide multiple solutions and serve many
users, thus increasing efficiency and ultimately leading to faster return on
investment (ROI) as well as lower total cost of ownership (TCO).
The mainframe era collapsed with the advent of fast and inexpensive
microprocessors and IT data centers moved to collections of commodity servers.
The advent of increasingly fast fiber-optics networks has relit the fire, and
new technologies for enabling sharing of computing power over great distances
have appeared.
SOA, Web Services, Web 2.0, and Mashups
Web services:
● allow applications running on different messaging product platforms to interoperate
● enable information from one application to be made available to others
● enable internal applications to be made available over the Internet
SOA:
● addresses the requirements of loosely coupled, standards-based, and protocol-independent distributed computing
● builds on WS, HTTP, and XML as a common mechanism for delivering services
● views an application as a collection of services that together perform complex business logic
● is a building block in IaaS offerings such as user authentication, payroll management, and calendar services
A minimal sketch of exposing application functionality as a web service follows.
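The Python sketch below illustrates the idea of making an internal function available over HTTP/XML so that applications on other platforms can invoke it; it uses the standard library's XML-RPC modules purely as a lightweight stand-in for a SOAP/WSDL toolchain, and the function name and values are invented for the example.

from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy
import threading

def payroll_total(employee_count, average_salary):
    """Internal business logic exposed to other applications."""
    return employee_count * average_salary

# Expose the function over HTTP/XML on a local port.
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False, allow_none=True)
server.register_function(payroll_total)
threading.Thread(target=server.serve_forever, daemon=True).start()

# A consumer (possibly written in another language on another platform)
# invokes the service over the network.
client = ServerProxy("http://localhost:8000")
print(client.payroll_total(10, 4500))   # prints 45000
server.shutdown()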
Grid Computing
Grid computing enables aggregation of distributed resources and transparent access to them. Most production grids such as TeraGrid and EGEE seek to
share compute and storage resources distributed across different administrative
domains, with their main focus being speeding up a broad range of scientific
applications, such as climate modeling, drug design, and protein analysis.
Globus Toolkit is a middleware that implements several standard Grid
services and over the years has aided the deployment of several service-oriented
Grid infrastructures and applications. An ecosystem of tools is available to
interact with service grids, including grid brokers, which facilitate user
interaction with multiple middleware and implement policies to meet QoS
needs.
Virtualization technology has been identified as the perfect fit to issues that
have caused frustration when using grids, such as hosting many dissimilar
software applications on a single physical platform. In this direction, some
research projects.
Utility Computing
In utility computing environments, users assign a "utility" value to their jobs,
where utility is a fixed or time-varying valuation that captures various QoS
constraints (deadline, importance, satisfaction). The valuation is the amount
they are willing to pay a service provider to satisfy their demands. The service
providers then attempt to maximize their own utility, where said utility may
directly correlate with their profit. Providers can choose to prioritize high yield
(i.e., profit per unit of resource) user jobs, leading to a scenario where shared
systems are viewed as a marketplace, where users compete for resources based
on the perceived utility or value of their jobs.
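To make the yield idea concrete, the sketch below orders competing jobs by payment per unit of resource, the quantity a profit-maximizing provider might prioritize. The job records and pricing figures are hypothetical and serve only to illustrate the marketplace view described above.

# Minimal sketch of yield-based prioritization in a utility computing market.
# Job data and the pricing model are hypothetical.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    offered_payment: float   # what the user is willing to pay (its "utility")
    cpu_hours: float         # resources the job would consume

def prioritize_by_yield(jobs: list[Job]) -> list[Job]:
    """Order jobs by yield, i.e., payment per unit of resource consumed."""
    return sorted(jobs, key=lambda j: j.offered_payment / j.cpu_hours, reverse=True)

if __name__ == "__main__":
    queue = [Job("render", 12.0, 6), Job("batch-report", 3.0, 1), Job("simulation", 40.0, 25)]
    for job in prioritize_by_yield(queue):
        print(job.name, round(job.offered_payment / job.cpu_hours, 2))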
Hardware Virtualization
The idea of virtualizing a computer system‘s resources, including processors,
memory, and I/O devices, has been well established for decades, aiming at
improving sharing and utilization of computer systems. Hardware
virtualization allows running multiple operating systems and software stacks on
a single physical platform. As depicted in Figure 1.2, a software layer, the
virtual machine monitor (VMM), also called a hypervisor, mediates access to
the physical hardware presenting to each guest operating system a virtual
machine (VM), which is a set of virtual platform interfaces.
[Figure 1.2 depicts several virtual machines sharing one physical server; each VM runs its own user software (e.g., an email server, database, and Web server; a Facebook-style application on Ruby on Rails and Java; or generic applications App A/B and App X/Y) on top of its own guest OS (e.g., Linux), with all VMs mediated by the virtual machine monitor (hypervisor) running directly on the hardware.]
FIGURE 1.2. A hardware virtualized server hosting three virtual machines, each one running a distinct operating system and user-level software stack.
Workload isolation is achieved since all program instructions are fully
confined inside a VM, which leads to improvements in security. Better
reliability is also achieved because software failures inside one VM do not
affect others. Moreover, better performance control is attained since execution
of one VM should not affect the performance of another VM.
VMware ESXi. VMware is a pioneer in the virtualization market. Its ecosystem
of tools ranges from server and desktop virtualization to high-level
management tools. ESXi is a VMM from VMware. It is a bare-metal
hypervisor, meaning that it installs directly on the physical server, whereas
others may require a host operating system.
Xen. The Xen hypervisor started as an open-source project and has served as a
base for other virtualization products, both commercial and open-source. In
addition to an open-source distribution, Xen currently forms the base of
commercial hypervisors of a number of vendors, most notably Citrix
XenServer and Oracle VM.
KVM. The kernel-based virtual machine (KVM) is a Linux virtualization
subsystem. It has been part of the mainline Linux kernel since version 2.6.20,
thus being natively supported by several distributions. In addition, activities
such as memory management and scheduling are carried out by existing kernel
features, thus making KVM simpler and smaller than hypervisors that take
control of the entire machine.
KVM leverages hardware-assisted virtualization, which improves
performance and allows it to support unmodified guest operating systems;
currently, it supports several versions of Windows, Linux, and UNIX.
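Hypervisors such as KVM and Xen are commonly managed through the libvirt API. The minimal sketch below, which assumes the libvirt-python package is installed and a local KVM host is reachable at the standard "qemu:///system" URI, only lists the guests defined on one host; a VI manager would perform such operations across many hosts.

# Minimal sketch: inspecting KVM guests through libvirt's Python bindings.
# Assumes libvirt-python is installed and a local libvirtd/KVM host is
# reachable at the standard "qemu:///system" URI; adjust the URI as needed.
import libvirt

def list_kvm_guests(uri: str = "qemu:///system") -> None:
    conn = libvirt.open(uri)                 # connect to the hypervisor
    try:
        for dom in conn.listAllDomains():    # every defined guest (VM)
            state = "running" if dom.isActive() else "shut off"
            print(f"{dom.name():20s} {state}")
    finally:
        conn.close()

if __name__ == "__main__":
    list_kvm_guests()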
Virtual Appliances and the Open Virtualization
Format
An application combined with the environment needed to run it (operating
system, libraries, compilers, databases, application containers, and so forth) is
referred to as a "virtual appliance." Packaging application environments in the
shape of virtual appliances eases software customization, configuration, and
patching and improves portability. Most commonly, an appliance is shaped as
a VM disk image associated with hardware requirements, and it can be readily
deployed in a hypervisor.
With a multitude of hypervisors, each supporting a different VM image
format, and with these formats being incompatible with one another, a great deal
of interoperability issues arise. For instance, Amazon has its Amazon machine
image (AMI) format, made popular on the Amazon EC2 public cloud. Other
formats are used by Citrix XenServer, several Linux distributions that ship with
KVM, Microsoft Hyper-V, and VMware ESX.
The Open Virtualization Format (OVF), an open standard for packaging and
distributing virtual appliances in a hypervisor-neutral way, was proposed to
address this interoperability problem. OVF's extensibility has encouraged
additions relevant to the management of data centers and clouds. Mathews et al.
have devised virtual machine contracts (VMC) as an extension to OVF. A VMC
aids in communicating and managing the complex expectations that VMs have
of their runtime environment and vice versa.
Autonomic Computing
The increasing complexity of computing systems has motivated research on
autonomic computing, which seeks to improve systems by decreasing human
involvement in their operation. In other words, systems should manage
themselves, with high-level guidance from humans.
In this sense, the concepts of autonomic computing inspire software
technologies for data center automation, which may perform tasks such as:
management of service levels of running applications; management of data
center capacity; proactive disaster recovery; and automation of VM
provisioning.
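The sketch below illustrates the general shape of such an autonomic (monitor-analyze-plan-execute) control loop for data center automation. The utilization source, the thresholds, and the provisioning calls are hypothetical stubs, not any real system's API; a production implementation would call a monitoring service and a VM provisioning interface instead.

# Illustrative sketch of an autonomic control loop: monitor a metric,
# analyze it against policy, plan an action, and execute it.
# The metric source and the provisioning calls are hypothetical stubs.
import random
import time

SCALE_UP_THRESHOLD = 0.80    # high-level guidance supplied by a human operator
SCALE_DOWN_THRESHOLD = 0.30

def monitor_cpu_utilization() -> float:
    return random.uniform(0.0, 1.0)          # stand-in for a real monitoring API

def provision_vm() -> None:
    print("load high -> provision one more VM")

def release_vm() -> None:
    print("load low  -> release one VM")

def autonomic_loop(cycles: int = 5, interval_s: float = 1.0) -> None:
    for _ in range(cycles):
        utilization = monitor_cpu_utilization()          # monitor
        if utilization > SCALE_UP_THRESHOLD:             # analyze + plan
            provision_vm()                               # execute
        elif utilization < SCALE_DOWN_THRESHOLD:
            release_vm()
        else:
            print(f"utilization {utilization:.2f}: no action")
        time.sleep(interval_s)

if __name__ == "__main__":
    autonomic_loop()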
LAYERS AND TYPES OF CLOUDS
Cloud computing services are divided into three classes, according to the
abstraction level of the capability provided and the service model of providers,
namely: (1) Infrastructure as a Service, (2) Platform as a Service, and (3) Software
as a Service. Figure 1.3 depicts the layered organization of the cloud stack
from physical infrastructure to applications.
These abstraction levels can also be viewed as a layered architecture where
services of a higher layer can be composed from services of the underlying
layer.
Infrastructure as a Service
Offering virtualized resources (computation, storage, and communication) on
demand is known as Infrastructure as a Service (IaaS).
[Figure 1.3 shows the cloud stack organized into three service classes, each with its main access and management tool and typical service content: SaaS is accessed through a Web browser and offers cloud applications such as social networks, office suites, CRM, and video processing; PaaS is accessed through a cloud development environment and offers a cloud platform of programming languages, frameworks, mashup editors, and structured data; IaaS is accessed through a virtual infrastructure manager and offers cloud infrastructure such as compute servers, data storage, firewalls, and load balancers.]
FIGURE 1.3. The cloud computing stack.
A cloud infrastructure enables on-demand provisioning of servers running
several choices of operating systems and a customized software stack.
Infrastructure services are considered to be the bottom layer of cloud
computing systems.
Platform as a Service
In addition to infrastructure-oriented clouds that provide raw computing and
storage services, another approach is to offer a higher level of abstraction to
make a cloud easily programmable, known as Platform as a Service (PaaS).
Google AppEngine, an example of Platform as a Service, offers a scalable
environment for developing and hosting Web applications, which should
be written in specific programming languages such as Python or Java, and use
the services‘ own proprietary structured object data store.
Software as a Service
Applications reside on the top of the cloud stack. Services provided by this
layer can be accessed by end users through Web portals. Therefore, consumers
are increasingly shifting from locally installed computer programs to online
software services that offer the same functionality. Traditional desktop
applications such as word processors and spreadsheets can now be accessed as
services on the Web.
Deployment Models
Although cloud computing has emerged mainly from the appearance of public
computing utilities, other deployment models have been adopted, with variations
in physical location and distribution. In this sense, regardless of its service class,
a cloud can be classified as public, private, community, or hybrid based on its
model of deployment, as shown in Figure 1.4.
[Figure 1.4 distinguishes clouds by deployment model: public/Internet clouds are third-party, multi-tenant cloud infrastructure and services available on a subscription (pay-as-you-go) basis; private/enterprise clouds follow the cloud computing model but run within a company's own data center/infrastructure for internal and/or partner use; hybrid/mixed clouds combine private and public clouds, for example by leasing public cloud services when private cloud capacity is insufficient.]
FIGURE 1.4. Types of clouds based on deployment models.
Armbrust et al. propose definitions for public cloud as a "cloud made available in
a pay-as-you-go manner to the general public" and private cloud as "internal
data center of a business or other organization, not made available to the
general public."
A community cloud is "shared by several organizations and supports a
specific community that has shared concerns (e.g., mission, security
requirements, policy, and compliance considerations)."
A hybrid cloud takes shape when a private cloud is supplemented with
computing capacity from public clouds. The approach of temporarily renting
capacity to handle spikes in load is known as "cloud-bursting."
DESIRED FEATURES OF A CLOUD
Certain features of a cloud are essential to enable services that truly represent
the cloud computing model and satisfy consumer expectations: cloud offerings
must be (i) self-service, (ii) per-usage metered and billed, (iii) elastic, and
(iv) customizable.
Self-Service
Consumers of cloud computing services expect on-demand, nearly instant
access to resources. To support this expectation, clouds must allow self-service
access so that customers can request, customize, pay, and use services without
intervention of human operators.
Per-Usage Metering and Billing
Cloud computing eliminates up-front commitment by users, allowing them to
request and use only the necessary amount. Services must be priced on a
short-term basis (e.g., by the hour), allowing users to release (and not pay for)
resources as soon as they are not needed.
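The effect of such short-term pricing can be sketched in a few lines: usage is rounded up to the smallest billing unit (here, one hour), so resources released early stop accruing charges. The hourly rate below is a purely illustrative figure.

# Minimal sketch of per-usage metering and billing with an hourly unit.
# The hourly rate is a hypothetical figure, for illustration only.
import math
from datetime import datetime

HOURLY_RATE = 0.10   # assumed price per instance-hour

def billable_amount(started: datetime, stopped: datetime, rate: float = HOURLY_RATE) -> float:
    elapsed_hours = (stopped - started).total_seconds() / 3600.0
    billed_hours = max(1, math.ceil(elapsed_hours))   # partial hours bill as full hours
    return billed_hours * rate

if __name__ == "__main__":
    start = datetime(2010, 5, 2, 9, 0)
    stop = datetime(2010, 5, 2, 12, 20)               # 3 h 20 min of usage
    print(f"charge: ${billable_amount(start, stop):.2f}")  # 4 billed hours -> $0.40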
Elasticity
Cloud computing gives the illusion of infinite computing resources available on
demand. Therefore users expect clouds to rapidly provide resources in any
quantity at any time. In particular, it is expected that the additional resources
can be (a) provisioned, possibly automatically, when an application load
increases and (b) released when load decreases (scale up and down).
Customization
In a multi-tenant cloud a great disparity between user needs is often the case.
Thus, resources rented from the cloud must be highly customizable. In the case
of infrastructure services, customization means allowing users to deploy
specialized virtual appliances and to be given privileged (root) access to the
virtual servers. Other service classes (PaaS and SaaS) offer less flexibility and
are not suitable for general-purpose computing, but still are expected to
provide a certain level of customization.
CLOUD INFRASTRUCTURE MANAGEMENT
A key challenge IaaS providers face when building a cloud infrastructure is
managing physical and virtual resources, namely servers, storage, and
networks, in a holistic fashion. The orchestration of resources must be
performed in a way to rapidly and dynamically provision resources to
applications.
Tools that manage these resources are commonly called virtual infrastructure
managers (VIMs). The availability of a remote cloud-like interface and the
ability to manage many users and their permissions are the primary features
that would distinguish "cloud toolkits" from "VIMs." However, in this chapter,
we place both categories of tools under the same group (of the VIMs) and, when
applicable, we highlight the availability of a remote interface as a feature.
Virtually all VIMs we investigated present a set of basic features related to
managing the life cycle of VMs, including networking groups of VMs together
and setting up virtual disks for VMs. These basic features largely define
whether a tool can be used in practical cloud deployments or not. On the other
hand, only a handful of tools offer advanced features (e.g., high availability)
that allow them to be used in large-scale production clouds.
Features
We now present a list of both basic and advanced features that are usually
available in VIMs.
Virtualization Support. The multi-tenancy aspect of clouds requires multiple
customers with disparate requirements to be served by a single hardware
infrastructure.
Self-Service, On-Demand Resource Provisioning. Self-service access to
resources has been perceived as one of the most attractive features of clouds.
This feature enables users to directly obtain services from clouds.
Multiple Backend Hypervisors. Different virtualization models and tools offer
different benefits, drawbacks, and limitations. Thus, some VI managers
provide a uniform management layer regardless of the virtualization
technology used.
Storage Virtualization. Virtualizing storage means abstracting logical storage
from physical storage. By consolidating all available storage devices in a data
center, it allows creating virtual disks independent from device and location.
In the VI management sphere, storage virtualization support is often
restricted to commercial products of companies such as VMWare and Citrix.
Other products feature ways of pooling and managing storage devices, but
administrators are still aware of each individual device.
Interface to Public Clouds. Researchers have perceived that extending the
capacity of a local in-house computing infrastructure by borrowing resources
from public clouds is advantageous. In this fashion, institutions can make good
use of their available resources and, in case of spikes in demand, extra load can
be offloaded to rented resources.
Virtual Networking. Virtual networks allow creating an isolated network on
top of a physical infrastructure independently from physical topology and
locations. A virtual LAN (VLAN) allows isolating traffic that shares a
switched network, allowing VMs to be grouped into the same broadcast
domain.
Dynamic Resource Allocation. Increased awareness of energy consumption in
data centers has encouraged the practice of dynamically consolidating VMs
onto fewer servers. In cloud infrastructures, where applications
have variable and dynamic needs, capacity management and demand
prediction are especially complicated. This fact triggers the need for dynamic
resource allocation aiming at obtaining a timely match of supply and
demand.
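One common way to frame consolidation is as a bin-packing problem. The sketch below uses a first-fit-decreasing heuristic to pack VMs (by CPU demand) onto as few hosts as possible so that idle hosts can be powered down; the capacities and demands are hypothetical numbers, and real VI managers use richer, multi-dimensional policies.

# Illustrative sketch of dynamic consolidation via first-fit decreasing.
# Capacities and demands are hypothetical, normalized CPU shares.
def consolidate(vm_demands: list[float], host_capacity: float) -> list[list[float]]:
    hosts: list[list[float]] = []                 # each host holds the demands placed on it
    for demand in sorted(vm_demands, reverse=True):
        for host in hosts:
            if sum(host) + demand <= host_capacity:
                host.append(demand)               # fits on an already-used host
                break
        else:
            hosts.append([demand])                # otherwise power on a new host
    return hosts

if __name__ == "__main__":
    placement = consolidate([0.5, 0.2, 0.7, 0.1, 0.4, 0.3], host_capacity=1.0)
    print(f"{len(placement)} hosts needed:", placement)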
Virtual Clusters. Several VI managers can holistically manage groups of VMs.
This feature is useful for provisioning computing virtual clusters on demand,
and interconnected VMs for multi-tier Internet applications.
Reservation and Negotiation Mechanism. When users request computational
resources to be available at a specific time, such requests are termed advance
reservations (AR), in contrast to best-effort requests, where users request
resources whenever available.
Additionally, leases may be negotiated and renegotiated, allowing provider
and consumer to modify a lease or present counter proposals until an
agreement is reached.
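A simple admission check underlies advance reservations: a new lease is accepted only if, at every point in the requested window, enough capacity remains after already-accepted reservations. The data structures below are hypothetical and intentionally coarse (hourly slots, VM counts).

# Minimal sketch of an advance reservation (AR) admission check.
from dataclasses import dataclass

@dataclass
class Reservation:
    start: int      # start hour
    end: int        # end hour (exclusive)
    vms: int        # number of VMs reserved

def can_admit(new: Reservation, accepted: list[Reservation], total_vms: int) -> bool:
    for hour in range(new.start, new.end):
        in_use = sum(r.vms for r in accepted if r.start <= hour < r.end)
        if in_use + new.vms > total_vms:
            return False                 # the provider would be over-committed
    return True

if __name__ == "__main__":
    booked = [Reservation(9, 12, 6), Reservation(11, 14, 3)]
    print(can_admit(Reservation(10, 13, 2), booked, total_vms=10))   # False at hour 11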
High Availability and Data Recovery. The high availability (HA) feature of VI
managers aims at minimizing application downtime and preventing business
disruption.
For mission critical applications, when a failover solution involving
restarting VMs does not suffice, additional levels of fault tolerance that rely on
redundancy of VMs are implemented.
Data backup in clouds should take into account the high data volume
involved in VM management.
Case Studies
In this section, we describe the main features of the most popular VI managers
available. Only the most prominent and distinguishing features of each tool are
discussed in detail. A detailed side-by-side feature comparison of VI managers
is presented in Table 1.1.
Apache VCL. The Virtual Computing Lab [60, 61] project was started in
2004 by researchers at North Carolina State University as a way to provide
customized environments to computer lab users. The software components that
support NCSU's initiative have been released as open source and incorporated
by the Apache Foundation.
AppLogic. AppLogic is a commercial VI manager, the flagship product of
3tera Inc. from California, USA. The company has labeled this product as a
Grid Operating System.
AppLogic provides a fabric to manage clusters of virtualized servers,
focusing on managing multi-tier Web applications. It views an entire
application as a collection of components that must be managed as a single
entity.
In summary, 3tera AppLogic provides the following features: Linux-based
controller; CLI and GUI interfaces; Xen backend; Global Volume Store (GVS)
storage virtualization; virtual networks; virtual clusters; dynamic resource
allocation; high availability; and data protection.
TABLE 1.1. Feature Comparison of Virtual Infrastructure Managers
Apache VCL: Apache v2 license; multi-platform controller (Apache/PHP); portal and XML-RPC interfaces; VMware ESX, ESXi, and Server backends; no storage virtualization; no interface to public clouds; advance reservation of capacity.
AppLogic: proprietary; Linux controller; GUI and CLI; Xen backend; Global Volume Store (GVS) storage virtualization; no public cloud interface; virtual networks; dynamic resource allocation; high availability; data protection.
Citrix Essentials: proprietary; Windows controller; GUI, CLI, portal, and XML-RPC; XenServer and Hyper-V backends; Citrix Storage Link storage virtualization; no public cloud interface; virtual networks; dynamic resource allocation; high availability; data protection.
Enomaly ECP: GPL v3; Linux controller; portal and Web services; Xen backend; no storage virtualization; interface to Amazon EC2; virtual networks.
Eucalyptus: BSD license; Linux controller; EC2 WS and CLI; Xen and KVM backends; no storage virtualization; EC2-compatible interface; virtual networks.
Nimbus: Apache v2; Linux controller; EC2 WS, WSRF, and CLI interfaces; Xen and KVM backends; no storage virtualization; interface to Amazon EC2; virtual networks; dynamic resource allocation and advance reservation via integration with OpenNebula.
OpenNebula: Apache v2; Linux controller; XML-RPC, CLI, and Java bindings; Xen and KVM backends; no storage virtualization; interfaces to Amazon EC2 and ElasticHosts; virtual networks; dynamic resource allocation; advance reservation of capacity (via Haizea).
OpenPEX: GPL v2; multi-platform (Java) controller; portal and Web services; XenServer backend; no storage virtualization; no public cloud interface; advance reservation of capacity.
oVirt: GPL v2; Fedora Linux controller; portal; KVM backend; no storage virtualization; no public cloud interface.
Platform ISF: proprietary; Linux controller; portal; Hyper-V, XenServer, and VMware ESX backends; no storage virtualization; interfaces to EC2, IBM CoD, and HP Enterprise Services; virtual networks; dynamic resource allocation; advance reservation of capacity; high availability and data protection unclear.
Platform VMO: proprietary; Linux controller; portal; XenServer backend; no storage virtualization; no public cloud interface; virtual networks; dynamic resource allocation; high availability.
VMware vSphere: proprietary; Linux and Windows controller; CLI, GUI, portal, and Web services; VMware ESX and ESXi backends; VMware vStorage VMFS storage virtualization; interface to VMware vCloud partners; virtual networks; dynamic resource allocation (VMware DRM); high availability; data protection.
Citrix Essentials. The Citrix Essentials suite is one of the most feature-complete
VI management solutions available, focusing on management and automation
of data centers. It is essentially a hypervisor-agnostic solution, currently
supporting Citrix XenServer and Microsoft Hyper-V.
Enomaly ECP. The Enomaly Elastic Computing Platform, in its most complete
edition, offers most features a service provider needs to build an IaaS cloud.
In summary, Enomaly ECP provides the following features: Linux-based
controller; Web portal and Web services (REST) interfaces; Xen back-end;
interface to the Amazon EC2 public cloud; virtual networks; virtual clusters
(ElasticValet).
Eucalyptus. The Eucalyptus framework was one of the first open-source
projects to focus on building IaaS clouds. It has been developed with the intent
of providing an open-source implementation nearly identical in functionality to
Amazon Web Services APIs.
Nimbus. The Nimbus toolkit is built on top of the Globus framework. Nimbus
provides most features in common with other open-source VI managers, such
as an EC2-compatible front-end API, support to Xen, and a backend interface
to Amazon EC2.
Nimbus' core was engineered around the Spring framework to be easily
extensible, thus allowing several internal components to be replaced and
easing integration with other systems.
In summary, Nimbus provides the following features: Linux-based
controller; EC2-compatible (SOAP) and WSRF interfaces; Xen and KVM
backend and a Pilot program to spawn VMs through an LRM; interface to the
Amazon EC2 public cloud; virtual networks; one-click virtual clusters.
OpenNebula. OpenNebula is one of the most feature-rich open-source VI
managers. It was initially conceived to manage local virtual infrastructure, but
has also included remote interfaces that make it viable to build public clouds.
Altogether, four programming APIs are available: XML-RPC and libvirt for
local interaction; a subset of EC2 (Query) APIs and the OpenNebula Cloud
API (OCA) for public access [7, 65].
In summary, OpenNebula provides the following features: Linux-based
controller; XML-RPC, CLI, EC2 Query, and OCA interfaces; Xen and KVM
backends; interface to public clouds (Amazon EC2, ElasticHosts); virtual
networks; dynamic resource allocation; advance reservation of capacity.
OpenPEX. OpenPEX (Open Provisioning and EXecution Environment) was
constructed around the notion of using advance reservations as the primary
method for allocating VM instances.
oVirt. oVirt is an open-source VI manager, sponsored by Red Hat‘s Emergent
Technology group. It provides most of the basic features of other VI managers,
including support for managing physical server pools, storage pools, user
accounts, and VMs. All features are accessible through a Web interface.
Platform ISF. Infrastructure Sharing Facility (ISF) is the VI manager offering
from Platform Computing [68]. The company, mainly through its LSF family
of products, has been serving the HPC market for several years.
ISF is built upon Platform‘s VM Orchestrator, which, as a standalone
product, aims at speeding up delivery of VMs to end users. It also provides high
availability by restarting VMs when hosts fail and duplicating the VM that
hosts the VMO controller.
VMware vSphere and vCloud. vSphere is VMware's suite of tools aimed at
transforming IT infrastructures into private clouds. It is distinguished from other
VI managers as one of the most feature-rich, due to the company's several
offerings at all levels of the architecture.
In the vSphere architecture, servers run on the ESXi platform. A separate
server runs vCenter Server, which centralizes control over the entire virtual
infrastructure. Through the vSphere Client software, administrators connect to
vCenter Server to perform various tasks.
In summary, vSphere provides the following features: VMware ESX, ESXi
backend; VMware vStorage VMFS storage virtualization; interface to external
clouds (VMware vCloud partners); virtual networks (VMware Distributed
Switch); dynamic resource allocation (VMware DRM); high availability; data
protection (VMware Consolidated Backup).
INFRASTRUCTURE AS A SERVICE PROVIDERS
Public Infrastructure as a Service providers commonly offer virtual servers
containing one or more CPUs, running several choices of operating systems
and a customized software stack. In addition, storage space and
communication facilities are often provided.
Features
In spite of being based on a common set of features, IaaS offerings can be
distinguished by the availability of specialized features that influence the
cost-benefit ratio experienced by user applications when moved to the cloud.
The most relevant features are: (i) geographic distribution of data centers;
(ii) variety of user interfaces and APIs to access the system; (iii) specialized
components and services that aid particular applications (e.g., load balancers,
firewalls); (iv) choice of virtualization platform and operating systems; and
(v) different billing methods and periods (e.g., prepaid vs. postpaid, hourly vs. monthly).
Geographic Presence. To improve availability and responsiveness, a provider
of worldwide services would typically build several data centers distributed
around the world. For example, Amazon Web Services presents the concept of
"availability zones" and "regions" for its EC2 service.
User Interfaces and Access to Servers. Ideally, a public IaaS provider must
provide multiple access means to its cloud, thus catering for various users and
their preferences. Different types of user interfaces (UI) provide different levels
of abstraction, the most common being graphical user interfaces (GUI),
command-line tools (CLI), and Web service (WS) APIs.
GUIs are preferred by end users who need to launch, customize, and
monitor a few virtual servers and do not necessarily need to repeat the process
several times. On the other hand, CLIs offer more flexibility and the possibility
of automating repetitive tasks via scripts.
Advance Reservation of Capacity. Advance reservations allow users to request
that an IaaS provider reserve resources for a specific time frame in the future,
thus ensuring that cloud resources will be available at that time. However, most
clouds only support best-effort requests; that is, user requests are served
whenever resources are available.
Automatic Scaling and Load Balancing. As mentioned earlier in this chapter,
elasticity is a key characteristic of the cloud computing model. Applications
often need to scale up and down to meet varying load conditions. Automatic
scaling is a highly desirable feature of IaaS clouds.
Service-Level Agreement. Service-level agreements (SLAs) are offered by IaaS
providers to express their commitment to delivery of a certain QoS. To
customers it serves as a warranty. An SLA usually includes availability and
performance guarantees. Additionally, metrics must be agreed upon by all
parties, as well as penalties for violating these expectations.
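The arithmetic behind an availability SLA is straightforward, as the sketch below shows for one billing month. The 99.95% target and the service-credit tiers are illustrative assumptions, not any particular provider's actual terms.

# Minimal sketch of checking an availability SLA and computing a credit.
def availability(total_minutes: int, downtime_minutes: int) -> float:
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

def service_credit_percent(measured: float, target: float = 99.95) -> int:
    if measured >= target:
        return 0                                  # SLA met, no penalty
    return 10 if measured >= 99.0 else 25         # hypothetical penalty tiers

if __name__ == "__main__":
    month_minutes = 30 * 24 * 60
    measured = availability(month_minutes, downtime_minutes=60)   # one hour down
    print(f"availability {measured:.3f}% -> credit {service_credit_percent(measured)}%")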
Hypervisor and Operating System Choice. Traditionally, IaaS offerings have
been based on heavily customized open-source Xen deployments. IaaS
providers needed expertise in Linux, networking, virtualization, metering,
resource management, and many other low-level aspects to successfully deploy
and maintain their cloud offerings.
Case Studies
In this section, we describe the main features of the most popular public IaaS
clouds. Only the most prominent and distinguishing features of each one are
discussed in detail. A detailed side-by-side feature comparison of IaaS offerings
is presented in Table 1.2.
Amazon Web Services. Amazon WS (AWS) is one of the major players in the
cloud computing market. It pioneered the introduction of IaaS clouds in
2006.
The Elastic Compute Cloud (EC2) offers Xen-based virtual servers (instances)
that can be instantiated from Amazon Machine Images (AMIs). Instances are
available in a variety of sizes, operating systems, architectures, and prices. The
CPU capacity of instances is measured in EC2 Compute Units and, although
fixed for each instance type, varies across types from 1 (small instance) to 20
(high-CPU instance).
In summary, Amazon EC2 provides the following features: multiple data
centers available in the United States (East and West) and Europe; CLI, Web
service (SOAP and Query), and Web-based console user interfaces; access to
instances mainly via SSH (Linux) and Remote Desktop (Windows); advance
reservation of capacity (aka reserved instances) that guarantees availability for
periods of 1 and 3 years; 99.95% availability SLA; per-hour pricing; Linux and
Windows operating systems; automatic scaling; load balancing.
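As an illustration of programmatic access to EC2's Web service API, the sketch below uses the boto3 Python SDK (a present-day SDK, shown here only for illustration). It assumes valid AWS credentials are configured and uses a hypothetical AMI ID and an assumed instance type; it is not tied to the instance sizes or prices discussed above.

# Minimal sketch: launching and terminating an EC2 instance via boto3.
# Assumes configured AWS credentials; the AMI ID is hypothetical.
import boto3

def launch_small_instance(region: str = "us-east-1") -> str:
    ec2 = boto3.client("ec2", region_name=region)
    result = ec2.run_instances(
        ImageId="ami-00000000",        # hypothetical Amazon Machine Image (AMI)
        InstanceType="t2.micro",       # assumed small instance type
        MinCount=1,
        MaxCount=1,
    )
    instance_id = result["Instances"][0]["InstanceId"]
    print("launched", instance_id)
    return instance_id

def terminate_instance(instance_id: str, region: str = "us-east-1") -> None:
    ec2 = boto3.client("ec2", region_name=region)
    ec2.terminate_instances(InstanceIds=[instance_id])   # stop per-hour billing

if __name__ == "__main__":
    iid = launch_small_instance()
    terminate_instance(iid)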
TABLE 1.2. Feature Comparison of Public Cloud Offerings (Infrastructure as a Service)
The table compares Amazon EC2, Flexiscale, GoGrid, Joyent Cloud, and Rackspace Cloud Servers along these dimensions: geographic presence; client UI, API, and language bindings; primary access to the server; advance reservation of capacity; SLA; smallest billing unit; hypervisor; guest operating systems; automated horizontal scaling; load balancing; runtime server resizing/vertical scaling; and instance hardware capacity (processor, memory, storage).
Amazon EC2: US East and Europe; CLI, WS, and portal interfaces; access via SSH (Linux) and Remote Desktop (Windows); advance reservation via Amazon reserved instances (available in 1- or 3-year terms, starting from reservation time); 99.95% uptime SLA; hourly billing; Xen; Linux and Windows guests; automated horizontal scaling available with Amazon CloudWatch; Elastic Load Balancing; no runtime server resizing; 1-20 EC2 compute units, 1.7-15 GB memory, 160-1690 GB storage (1 GB-1 TB per EBS volume).
Flexiscale: UK; Web console; SSH access; no advance reservation; 100% SLA; hourly billing; Xen; Linux and Windows guests; no automated horizontal scaling; Zeus software load balancing; runtime resizing of processors and memory; 1-4 CPUs, 0.5-16 GB memory, 20-270 GB storage.
GoGrid: REST, Java, PHP, Python, and Ruby bindings; SSH access; no advance reservation; 100% SLA; hourly billing; Xen; Linux and Windows guests; hardware (F5) load balancing; 1-6 CPUs, 0.5-8 GB memory, 30-480 GB storage.
Joyent Cloud: US (Emeryville, CA; San Diego, CA; Andover, MA; Dallas, TX); access via SSH and VirtualMin (Web-based system administration); no advance reservation; 100% SLA; monthly billing; OS-level virtualization (Solaris Containers); OpenSolaris guests; both hardware (F5 networks) and software (Zeus) load balancing; automatic vertical scaling via CPU bursting (up to 8 CPUs); 1/16-8 CPUs, 0.25-32 GB memory, 5-100 GB storage.
Rackspace Cloud Servers: US (Dallas, TX); portal, REST, and Python, PHP, Java, and C#/.NET bindings; SSH access; no advance reservation; 100% SLA; hourly billing; Xen; Linux guests; memory and disk resizing (requires reboot); automatic CPU bursting (up to 100% of the available CPU power of the physical host, with CPU power weighted proportionally to memory size); quad-core processors, 0.25-16 GB memory, 10-620 GB storage.
Flexiscale. Flexiscale is a UK-based provider offering services similar in
nature to Amazon Web Services. However, its virtual servers offer some
distinct features, most notably: persistent storage by default, fixed IP addresses,
dedicated VLAN, a wider range of server sizes, and runtime adjustment of CPU
capacity (aka CPU bursting/vertical scaling). Like the other IaaS clouds
discussed here, this service is also priced by the hour.
Joyent. Joyent's Public Cloud offers servers based on Solaris containers
virtualization technology. These servers, dubbed accelerators, allow deploying
various specialized software stacks based on a customized version of the
OpenSolaris operating system, which includes by default a Web-based
configuration tool and several pre-installed software packages, such as Apache,
MySQL, PHP, Ruby on Rails, and Java. Software load balancing is available
as an accelerator in addition to hardware load balancers.
In summary, the Joyent public cloud offers the following features: multiple
geographic locations in the United States; Web-based user interface; access to
virtual server via SSH and Web-based administration tool; 100% availability
SLA; per-month pricing; OS-level virtualization (Solaris containers);
OpenSolaris operating systems; automatic scaling (vertical).
GoGrid. GoGrid, like many other IaaS providers, allows its customers to
utilize a range of pre-made Windows and Linux images, in a range of fixed
instance sizes. GoGrid also offers ―value-added‖ stacks on top for applications
such as high-volume Web serving, e-Commerce, and database stores.
Rackspace Cloud Servers. Rackspace Cloud Servers is an IaaS solution that
provides fixed size instances in the cloud. Cloud Servers offers a range of
Linux-based pre-made images. A user can request different-sized images, where
the size is measured by requested RAM, not CPU.
PLATFORM AS A SERVICE PROVIDERS
Public Platform as a Service providers commonly offer a development and
deployment environment that allows users to create and run their applications
with little or no concern for low-level details of the platform. In addition,
specific programming languages and frameworks are made available in the
platform, as well as other services such as persistent data storage and
in-memory caches.
Features
Programming Models, Languages, and Frameworks. Programming models
made available by PaaS providers define how users can express their
applications using higher levels of abstraction and efficiently run them on the
cloud platform. Each model aims at efficiently solving a particular problem. In
the cloud computing domain, the most common activities that require
specialized models are processing of large datasets in clusters of computers
(the MapReduce model) and development of request-based Web services and
applications.
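To make the MapReduce model concrete, the sketch below expresses a word count as map and reduce functions and runs them sequentially on one machine; a PaaS or cluster runtime would execute the same two functions over partitioned data on many nodes.

# Illustrative sketch of the MapReduce programming model: word count.
from collections import defaultdict

def map_phase(document: str) -> list[tuple[str, int]]:
    """Emit an intermediate (word, 1) pair for every word in the input."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs: list[tuple[str, int]]) -> dict[str, int]:
    """Group intermediate pairs by key and sum the counts per word."""
    counts: dict[str, int] = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

if __name__ == "__main__":
    docs = ["clouds deliver computing as a utility", "utility computing in clouds"]
    intermediate = [pair for doc in docs for pair in map_phase(doc)]
    print(reduce_phase(intermediate))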
Persistence Options. A persistence layer is essential to allow applications to
record their state and recover it in case of crashes, as well as to store user data.
Traditionally, Web and enterprise application developers have chosen
relational databases as the preferred persistence method. These databases offer
fast and reliable structured data storage and transaction processing, but may
lack scalability to handle several petabytes of data stored in commodity
computers.
Case Studies
In this section, we describe the main features of some Platform as a Service
(PaaS) offerings. A more detailed side-by-side feature comparison of these
offerings is presented in Table 1.3.
Aneka. Aneka is a .NET-based service-oriented resource management and
development platform. Each server in an Aneka deployment (dubbed Aneka
cloud node) hosts the Aneka container, which provides the base infrastructure
that consists of services for persistence, security (authorization, authentication
and auditing), and communication (message handling and dispatching).
Several programming models are supported, including task models that enable
execution of legacy HPC applications and MapReduce, which enables a variety
of data-mining and search applications.
App Engine. Google App Engine lets you run your Python and Java Web
applications on elastic infrastructure supplied by Google. The App Engine
serving architecture is notable in that it allows real-time auto-scaling without
virtualization for many common types of Web applications.
TABLE 1.3. Feature Comparison of Platform-as-a-Service Cloud Offerings
Aneka: target use: .NET enterprise applications, HPC; programming language/frameworks: .NET; developer tools: standalone SDK; programming models: Threads, Task, MapReduce; persistence options: flat files, RDBMS, HDFS; automatic scaling: no; backend infrastructure providers: Amazon EC2.
App Engine: target use: Web applications; languages: Python, Java; developer tools: Eclipse-based IDE; programming model: request-based Web programming; persistence: BigTable; automatic scaling: yes; backend: own data centers.
Force.com: target use: enterprise applications (esp. CRM); language: Apex; developer tools: Eclipse-based IDE, Web-based wizard; programming models: workflow, Excel-like formula language, request-based Web programming; persistence: own object database; automatic scaling: unclear; backend: own data centers.
Microsoft Windows Azure: target use: enterprise and Web applications; language/framework: .NET; developer tools: Azure tools for Microsoft Visual Studio; programming model: unrestricted; persistence: Table/BLOB/queue storage, SQL services; automatic scaling: yes; backend: own data centers.
Heroku: target use: Web applications; framework: Ruby on Rails; developer tools: command-line tools; programming model: request-based Web programming; persistence: PostgreSQL, Amazon RDS; automatic scaling: yes; backend: Amazon EC2.
Amazon Elastic MapReduce: target use: data processing; languages/frameworks: Hive and Pig, Cascading, Java, Ruby, Perl, Python, PHP, R, C++; developer tools: Karmasphere Studio for Hadoop (NetBeans-based); programming model: MapReduce; persistence: Amazon S3; automatic scaling: no; backend: Amazon EC2.
However, such auto-scaling is dependent on the application developer using a
limited subset of the native APIs on each platform, and in some instances
specific Google APIs such as URLFetch, Datastore, and memcache must be
used in place of certain native API calls.
Microsoft Azure. Microsoft Azure Cloud Services offers developers a hosted
.NET stack (C#, VB.NET, ASP.NET). In addition, a Java and Ruby SDK for
.NET Services is also available. The Azure system consists of a number of
elements.
Force.com. In conjunction with the Salesforce.com service, the Force.com
PaaS allows developers to create add-on functionality that integrates into the
main Salesforce CRM SaaS application.
Heroku. Heroku is a platform for instant deployment of Ruby on Rails Web
applications. In the Heroku system, servers are invisibly managed by the
platform and are never exposed to users.
CHALLENGES AND RISKS
Despite the initial success and popularity of the cloud computing paradigm and
the extensive availability of providers and tools, a significant number of
challenges and risks are inherent to this new model of computing. Providers,
developers, and end users must consider these challenges and risks to take good
advantage of cloud computing.
Security, Privacy, and Trust
Armbrust et al. cite information security as a main issue: "current cloud
offerings are essentially public . . . exposing the system to more attacks." For
this reason there are potentially additional challenges to make cloud computing
environments as secure as in-house IT systems. At the same time, existing,
well-understood technologies can be leveraged, such as data encryption,
VLANs, and firewalls.
Data Lock-In and Standardization
A major concern of cloud computing users is having their data locked in
by a certain provider. Users may want to move data and applications out from
a provider that does not meet their requirements. However, in their current
form, cloud computing infrastructures and platforms do not employ standard
methods of storing user data and applications. Consequently, they do not
interoperate and user data are not portable.
Availability, Fault-Tolerance, and Disaster Recovery
It is expected that users will have certain expectations about the service level to
be provided once their applications are moved to the cloud. These expectations
include availability of the service, its overall performance, and what measures
are to be taken when something goes wrong in the system or its components. In
summary, users seek a warranty before they can comfortably move their
business to the cloud.
Resource Management and Energy-Efficiency
One important challenge faced by providers of cloud computing services is the
efficient management of virtualized resource pools. Physical resources such as
CPU cores, disk space, and network bandwidth must be sliced and shared
among virtual machines running potentially heterogeneous workloads.
Another challenge concerns the sheer amount of data to be managed
in various VM management activities. Such data amount is a result of
particular abilities of virtual machines, including the ability of traveling through
space (i.e., migration) and time (i.e., checkpointing and rewinding), operations
that may be required in load balancing, backup, and recovery scenarios. In
addition, dynamic provisioning of new VMs and replicating existing VMs
require efficient mechanisms to make VM block storage devices (e.g., image
files) quickly available at selected hosts.
2.2 MIGRATING INTO A CLOUD
The promise of cloud computing has raised the IT expectations of small and
medium enterprises beyond measure. Large companies are deeply debating it.
Cloud computing is a disruptive model of IT whose innovation is part
technology and part business model—in short a ―disruptive techno-commercial
model‖ of IT. This tutorial chapter focuses on the key issues and associated
dilemmas faced by decision makers, architects, and systems managers in trying
to understand and leverage cloud computing for their IT needs. Questions
asked and discussed in this chapter include: when and how to migrate one‘s
application into a cloud; what part or component of the IT application to
migrate into a cloud and what not to migrate into a cloud; what kind of
customers really benefit from migrating their IT into the cloud; and so on. We
describe the key factors underlying each of the above questions and share a
Seven-Step Model of Migration into the Cloud.
Several efforts have been made in the recent past to define the term "cloud
computing," and many have not been able to provide a comprehensive one. This
has been made more challenging by the scorching pace of technological
advances as well as the newer business model formulations for the cloud services
being offered.
The Promise of the Cloud
Most users of cloud computing services offered by some of the large-scale data
centers are hardly bothered about the complexities of the underlying systems or
their functioning, all the more so given the heterogeneity of the systems and of
the software running on them.
Cloudonomics:
• 'Pay per use' – lower cost barriers
• On-demand resources – autoscaling
• Capex vs. Opex – no capital expenses (CAPEX), only operational expenses (OPEX)
• SLA-driven operations – much lower TCO
• Attractive NFR support: availability, reliability
Technology:
• 'Infinite' elastic availability – compute/storage/bandwidth
• Automatic usage monitoring and metering
• Jobs/tasks virtualized and transparently 'movable'
• Integration and interoperability 'support' for hybrid ops
• Transparently encapsulated and abstracted IT features
FIGURE 2.1. The promise of the cloud computing services.
As shown in Figure 2.1, the promise of the cloud, both on the business front
(the attractive cloudonomics) and on the technology front, widely aided CxOs
to spin out several non-mission-critical IT needs from the ambit of their
captive traditional data centers to the appropriate cloud service. Invariably,
these IT needs had some common features: they were typically Web-oriented;
they represented seasonal IT demands; they were amenable to parallel batch
processing; and they were non-mission-critical and therefore did not have high
security demands.
The Cloud Service Offerings and Deployment Models
Cloud computing has been an attractive proposition both for the CFO and the
CTO of an enterprise primarily due its ease of usage. This has been achieved
by large data center service vendors or now better known as cloud service
vendors again primarily due to their scale of operations. Google, Amazon,
[Figure 2.2 summarizes the cloud service offerings and their typical consumers: IaaS offers abstract compute/storage/bandwidth resources (e.g., Amazon Web Services [10, 9] – EC2, S3, SDB, CDN, CloudWatch) and appeals to IT folks; PaaS offers an abstracted programming platform with encapsulated infrastructure (e.g., Google App Engine (Java/Python), Microsoft Azure, Aneka [13]) and appeals to programmers; SaaS offers applications with encapsulated infrastructure and platform (e.g., Salesforce.com, Gmail, Yahoo Mail, Facebook, Twitter) and targets architects and users. Cloud application deployment and consumption models span public, private, and hybrid clouds.]
FIGURE 2.2. The cloud computing service offering and deployment models.
Google, Amazon, Microsoft, and a few others have been the key players, apart
from open-source Hadoop built around the Apache ecosystem. As shown in
Figure 2.2, the cloud service offerings from these vendors can broadly be
classified into three major streams: Infrastructure as a Service (IaaS), Platform
as a Service (PaaS), and Software as a Service (SaaS). While IT managers and system
administrators preferred IaaS as offered by Amazon for many of their
virtualized IT needs, the programmers preferred PaaS offerings like Google
AppEngine (Java/Python programming) or Microsoft Azure (.Net
programming). Users of large-scale enterprise software invariably found that
if they had been using the cloud, it was because their usage of the specific
software package was available as a service—it was, in essence, a SaaS
offering. Salesforce.com was an exemplary SaaS offering on the Internet.
From a technology viewpoint, as of today, the IaaS type of cloud offerings
has been the most successful and widespread in usage. Invariably these
offerings hide the cloud underneath: storage is easily scalable, and most users
do not know on which system, or even where, their data is stored.
Challenges in the Cloud
While the cloud service offerings present a simplistic view of IT in the case of
IaaS, a simplistic view of programming in the case of PaaS, or a simplistic view
of resource usage in the case of SaaS, the underlying systems-level support
challenges are huge and highly complex. These stem from the need to offer a
uniformly consistent and robustly simplistic view of computing while the
underlying systems are highly failure-prone, heterogeneous, resource-hogging, and
exhibiting serious security shortcomings. As observed in Figure 2.3, the
promise of the cloud seems very similar to the typical distributed systems
properties that most would prefer to have.
[Figure 2.3 contrasts the classic distributed-system fallacies echoed in the promise of the cloud (full network reliability, zero network latency, infinite bandwidth, secure network, no topology changes, centralized administration, zero transport costs, homogeneous networks and systems) with the corresponding challenges in cloud technologies (security, performance monitoring, consistent and robust service abstractions, meta scheduling, energy-efficient load balancing, scale management, SLA and QoS architectures, interoperability and portability, green IT).]
FIGURE 2.3. 'Under the hood' challenges of the cloud computing services implementations.
Many of them are listed in Figure 2.3. Prime amongst these are the challenges
of security. The Cloud Security Alliance seeks to address many of these issues.
BROAD APPROACHES TO MIGRATING INTO THE CLOUD
Given that cloud computing is a "techno-business disruptive model" and is at
the top of Gartner's top 10 strategic technologies to watch for 2010, migrating
into the cloud is poised to become a large-scale effort in leveraging the cloud in
several enterprises. "Cloudonomics" deals with the economic rationale for
leveraging the cloud and is central to the success of cloud-based enterprise
usage.
Why Migrate?
There are economic and business reasons why an enterprise application can be
migrated into the cloud, and there are also a number of technological reasons.
Many of these efforts come up as initiatives in adoption of cloud technologies
in the enterprise, resulting in integration of enterprise applications running off
the captive data centers with the new ones that have been developed on the
cloud. Adoption of or integration with cloud computing services is a use case of
migration.
With due simplification, the migration of an enterprise application is best
captured by the following:
\[ P \;\rightarrow\; P'_C + P'_l \;\rightarrow\; P'_{OFC} + P'_l \]
where P is the application before migration, running in the captive data center;
P'_C is the application part migrated into a (hybrid) cloud; P'_l is the part of the
application that continues to run in the captive local data center; and P'_OFC is
the application part optimized for the cloud. If an enterprise application cannot
be migrated fully, some parts run in the captive local data center while the rest
are migrated into the cloud, essentially a case of hybrid cloud usage. However,
when the entire application is migrated onto the cloud, then P'_l is null. Indeed,
the migration of the enterprise application P can happen at five levels:
application, code, design, architecture, and usage. The P'_C migration can
happen at any of the five levels without any P'_l component. Compound this
with the kind of cloud computing service offering being applied, whether the
IaaS, PaaS, or SaaS model, and we have a variety of migration use cases that
need to be thought through thoroughly by the migration architects.
Cloudonomics. Invariably, migrating into the cloud is driven by economic
reasons of cost cutting in both the IT capital expenses (Capex) as well as
operational expenses (Opex). There are both the short-term benefits of
opportunistic migration to offset seasonal and highly variable IT loads as well
as the long-term benefits to leverage the cloud. For the long-term sustained
usage, as of 2009, several impediments and shortcomings of the cloud
computing services need to be addressed.
Deciding on the Cloud Migration
In fact, several proofs of concept and prototypes of the enterprise application
are experimented with on the cloud to help make a sound decision on
migrating into the cloud. Post migration, the ROI on the migration should be
positive for a broad range of pricing variability. Assume that among the M
classes of questions, the largest class has N questions. We can then model the
weightage-based decision making as an M × N weightage matrix:
\[ C_l \;\le\; \sum_{i=1}^{M} B_i \left( \sum_{j=1}^{N} A_{ij} X_{ij} \right) \;\le\; C_h \]
where C_l is the lower weightage threshold and C_h is the higher weightage
threshold, B_i is the relative weight assigned to the i-th class of questions, A_ij is
the specific constant assigned for a question, and X_ij is a fraction between 0 and 1
that represents the degree to which the answer to the question is relevant and
applicable.
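The sketch below evaluates this weighted decision matrix for hypothetical weights, constants, and answer fractions. The interpretation of the thresholds (below C_l: do not migrate; between C_l and C_h: migrate; above C_h: re-assess) is an assumption made for illustration, not a rule stated in the text.

# Minimal sketch of the weightage-based migration decision.
# B[i] are class weights, A[i][j] per-question constants, X[i][j] in [0, 1]
# the degree to which each answer applies. All numbers are hypothetical.
def migration_score(B: list[float], A: list[list[float]], X: list[list[float]]) -> float:
    return sum(
        B[i] * sum(A[i][j] * X[i][j] for j in range(len(A[i])))
        for i in range(len(B))
    )

def decide(score: float, c_low: float, c_high: float) -> str:
    if score < c_low:
        return "do not migrate"
    return "migrate" if score <= c_high else "re-assess (score above upper threshold)"

if __name__ == "__main__":
    B = [0.5, 0.3, 0.2]                       # class weights (technical, business, risk)
    A = [[4, 3], [5, 2], [3, 3]]              # per-question constants
    X = [[0.9, 0.6], [0.8, 0.4], [0.5, 0.7]]  # degree to which each answer applies
    score = migration_score(B, A, X)
    print(round(score, 2), "->", decide(score, c_low=3.0, c_high=9.0))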
THE SEVEN-STEP MODEL OF MIGRATION INTO A CLOUD
Typically migration initiatives into the cloud are implemented in phases or in
stages. A structured and process-oriented approach to migration into a cloud has
several advantages of capturing within itself the best practices of many migration
projects. While migration has been a difficult and vague subject—of not much
interest to the academics and left to the industry practitioners—not many efforts
across the industry have been put in to consolidate what has been found to be
both a top revenue earner and a long standing customer pain. After due study
and practice, we share the Seven-Step Model of Migration into the Cloud as part
of our efforts in understanding and leveraging the cloud computing service
offerings in the enterprise context. In a succinct way, Figure 2.4 captures the
essence of the steps in the model of migration into the cloud, while Figure 2.5
captures the iterative process of the seven-step migration into the cloud.
The first step of the iterative process of the seven-step model of migration is
basically at the assessment level. Proof of concepts or prototypes for various
approaches to the migration along with the leveraging of pricing
parameters enables one to make appropriate assessments.
1. Conduct Cloud Migration Assessments
2. Isolate the Dependencies
3. Map the Messaging & Environment
4. Re-architect & Implement the Lost Functionalities
5. Leverage Cloud Functionalities & Features
6. Test the Migration
7. Iterate and Optimize
FIGURE 2.4. The Seven-Step Model of Migration into the Cloud. (Source: Infosys
Research.)
[Figure 2.5 depicts the iterative cycle of the model, from START to END: Assess, Isolate, Map, Re-architect, Augment, Test, Optimize.]
FIGURE 2.5. The iterative Seven-Step Model of Migration into the Cloud. (Source: Infosys Research.)
Having done the augmentation, we validate and test the new form of the
enterprise application with an extensive test suite that comprises testing the
components of the enterprise application on the cloud as well. These test results
could be positive or mixed. In the latter case, we iterate and optimize as
appropriate. After several such optimizing iterations, the migration is deemed
successful. Our best practices indicate that it is best to iterate through this
Seven-Step Model process for optimizing and ensuring that the migration into
the cloud is both robust and comprehensive. Figure 2.6 captures the typical
components of the best practices accumulated in the practice of the Seven-Step
Model of Migration into the Cloud. Though not comprehensive in enumeration,
it is representative.
[Figure 2.6 lists typical activities for each step:
Assess: cloudonomics; migration costs; recurring costs; database data segmentation; database migration; functionality migration; NFR support.
Isolate: runtime environment; licensing; library dependencies; application dependencies; latency bottlenecks; performance bottlenecks; architectural dependencies.
Map: message mapping (marshalling and de-marshalling); mapping environments; mapping libraries and runtime approximations.
Re-architect: approximate lost functionality using cloud runtime support APIs; new use cases; analysis; design.
Augment: exploit additional cloud features; seek low-cost augmentations; autoscaling; storage; bandwidth; security.
Test: augment test cases and test automation; run proofs of concept; test migration strategy; test new test cases due to cloud augmentation; test for production loads.
Optimize: optimize, rework, and iterate; significantly satisfy the cloudonomics of migration; optimize compliance with standards and governance; deliver the best migration ROI; develop a roadmap for leveraging new cloud features.]
FIGURE 2.6. Some details of the iterative Seven-Step Model of Migration into the Cloud.
Compared with the typical approach to migration into Amazon AWS, our
Seven-Step Model is more generic, versatile, and comprehensive. The typical
migration into Amazon AWS is phased over several steps, about six, as
discussed in several white papers on the Amazon website, and is as follows:
The first phase is the cloud migration assessment phase, wherein dependencies
are isolated and strategies are worked out to handle these dependencies. The
next phase is trying out proofs of concept to build a reference migration
architecture. The third phase is the data migration phase, wherein database
data segmentation and cleansing are completed. This phase also tries to
leverage the various cloud storage options as best suited. The fourth phase
comprises the application migration, wherein a "forklift strategy" of migrating
the key enterprise application along with its dependencies (other applications)
into the cloud may be pursued.
Migration Risks and Mitigation
The biggest challenge to any cloud migration project is how effectively the
migration risks are identified and mitigated. In the Seven-Step Model of
Migration into the Cloud, the process step of testing and validating includes
efforts to identify the key migration risks. In the optimization step, we address
various approaches to mitigate the identified migration risks.
There are issues of consistent identity management as well. These and several
other issues are discussed in Section 2.1. The issues and challenges listed in
Figure 2.3 continue to be persistent research and engineering challenges in
coming up with appropriate cloud computing implementations.
2.3 ENRICHING THE 'INTEGRATION AS A SERVICE' PARADIGM FOR THE CLOUD ERA
AN INTRODUCTION
The trend-setting cloud paradigm actually represents the cool
conglomeration of a number of proven and promising Web and enterprise
technologies. Cloud Infrastructure providers are establishing cloud centers
to host a variety of ICT services and platforms of worldwide individuals,
innovators, and institutions. Cloud service providers (CSPs) are very
aggressive in experimenting with and embracing the cool cloud ideas, and today
every business and technical service is being hosted in clouds to be delivered to
global customers, clients, and consumers over the Internet communication
infrastructure. For example, security as a service (SaaS) is a prominent
cloud-hosted security service that can be subscribed to by a spectrum of users of
any connected device, and the users just pay for the exact amount or time of
usage. In a nutshell, on-premise and local applications are becoming online,
remote, hosted, on-demand, and off-premise applications.
Business-to-business (B2B). It is logical to take the integration middleware to
clouds to simplify and streamline enterprise-to-enterprise (E2E),
enterprise-to-cloud (E2C), and cloud-to-cloud (C2C) integration.
THE EVOLUTION OF SaaS
The SaaS paradigm is on a fast track due to its innate powers and potential.
Executives, entrepreneurs, and end users are ecstatic about the tactical as
well as strategic success of the emerging and evolving SaaS paradigm.
A number of positive and progressive developments started to grip this
model. Newer resources and activities are being consistently readied
to be delivered as a service. Experts and evangelists are in unison
that cloud is to rock the total IT community as the best possible
infrastructural solution for effective service delivery.
IT as a Service (ITaaS) is the most recent and efficient delivery
method in the decisive IT landscape. With the meteoric and
mesmerizing rise of the service orientation principles, every single IT
resource, activity and infrastructure is being viewed and visualized as a
service that sets the tone for the grand unfolding of the dreamt service
era. Integration as a Service (IaaS) is the budding and distinctive
capability of clouds in fulfilling business integration requirements.
Increasingly, business applications are deployed in clouds to reap the
business and technical benefits. On the other hand, there are still
innumerable applications and data sources locally stationed and
sustained, primarily for security reasons.
B2B systems are capable of driving this new on-demand integration
model because they are traditionally employed to automate business
processes between manufacturers and their trading partners. That
means they provide application-to-application connectivity along with
the functionality that is very crucial for linking internal and external
software securely.
The use of hub & spoke (H&S) architecture further simplifies the
implementation and avoids placing an excessive processing burden on
the customer side. The hub is installed at the SaaS provider's cloud
center to do the heavy lifting, such as reformatting files. The Web is the
largest digital information superhighway:
1. The Web is the largest repository of all kinds of resources such as
web pages, applications comprising enterprise components, business
services, beans, POJOs, blogs, corporate data, etc.
2. The Web is turning out to be the open, cost-effective, and generic
business execution platform (e-commerce, business, auctions, etc.
happen on the Web for global users) comprising a wide variety of
containers, adaptors, drivers, connectors, etc.
3. The Web is the global-scale communication infrastructure (VoIP,
video conferencing, IPTV, etc.).
4. The Web is the next-generation discovery, connectivity, and
integration middleware.
Thus the unprecedented absorption and adoption of the Internet is the
key driver for the continued success of cloud computing.
THE CHALLENGES OF SaaS PARADIGM
As with any new technology, SaaS and cloud concepts too suffer a
number of limitations. These technologies are being diligently examined
for specific situations and scenarios. The prickly and tricky issues in
different layers and levels are being looked into. The overall views are
listed below. The loss or lack of the following features deters the
massive adoption of clouds:
1. Controllability
2. Visibility and flexibility
3. Security and privacy
4. High performance and availability
5. Integration and composition
6. Standards
A number of approaches are being investigated for resolving the identified issues and flaws. Private, hybrid and, most recently, community clouds are being prescribed as the solution for most of these inefficiencies and deficiencies. As observers have rightly pointed out, there are still miles to go. Several companies are focusing on this issue. Boomi (http://www.dell.com/) is one among them. This company has published several well-written white papers elaborating the issues confronting enterprises thinking of and trying to embrace third-party public clouds for hosting their services and applications.
Integration Conundrum. While SaaS applications offer outstanding
value in terms of features and functionalities relative to cost, they have
introduced several challenges specific to integration.
APIs are Insufficient. Many SaaS providers have responded to the
integration challenge by developing application programming interfaces
(APIs). Unfortunately, accessing and managing data via an API requires
a significant amount of coding as well as maintenance due to frequent
API modifications and updates.
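To make that maintenance burden concrete, the following is a minimal sketch, in Python, of the kind of hand-written client code this implies; the endpoint, token handling, field names and paging parameters are all hypothetical, and every time the provider changes its API this code has to be revisited:

# Minimal sketch of hand-coded SaaS API access. The base URL, auth token,
# response shape and paging parameters are illustrative assumptions only.
import requests

BASE_URL = "https://api.example-saas.com/v2"   # hypothetical SaaS API
TOKEN = "replace-with-a-real-token"            # obtained out of band

def fetch_all_accounts():
    """Page through the provider's account records."""
    accounts, page = [], 1
    while True:
        resp = requests.get(
            f"{BASE_URL}/accounts",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={"page": page, "per_page": 200},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("records", [])
        if not batch:
            return accounts
        accounts.extend(batch)
        page += 1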
Data Transmission Security. SaaS providers go to great lengths to ensure that customer data is secure within the hosted environment. However, the need to transfer data between on-premise systems or applications behind the firewall and SaaS applications raises new security concerns.
For any relocated application to provide the promised value for
businesses and users, the minimum requirement is the interoperability
between SaaS applications and on-premise enterprise packages.
The Impacts of Clouds. On the infrastructural front, in the recent past,
the clouds have arrived onto the scene powerfully and have extended
the horizon and the boundary of business applications, events and data.
Thus there is a clarion call for adaptive integration engines that
seamlessly and spontaneously connect enterprise applications with
cloud applications. Integration is being stretched further to the level of
the expanding Internet and this is really a litmus test for system
architects and integrators.
The perpetual integration puzzle has to be solved meticulously for SaaS to succeed as originally envisioned.
APPROACHING THE SaaS INTEGRATION ENIGMA
Integration as a Service (IaaS) is all about the migration of the
functionality of a typical enterprise application integration (EAI) hub /
enterprise service bus (ESB) into the cloud for providing for smooth
data transport between any enterprise and SaaS applications. Users
subscribe to IaaS as they would do for any other SaaS application.
Cloud middleware is the next logical evolution of traditional
middleware solutions.
Service orchestration and choreography enable process integration. Service interaction through an ESB integrates loosely coupled systems, whereas complex event processing (CEP) connects decoupled systems.
With the unprecedented rise in cloud usage, all these integration software suites are bound to move to clouds. Amazon's Simple Queue Service (SQS), for example, does not promise in-order or exactly-once delivery. These simplifications let Amazon make SQS more scalable, but they also mean that developers must use SQS differently from an on-premise message queuing technology.
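Because delivery is at-least-once and ordering is not guaranteed, a consumer has to be written defensively. The sketch below assumes, as a convention of its own, that message bodies carry an "id" field, and shows one common pattern: deduplicate on that ID and tolerate out-of-order arrival.

# Defensive SQS consumer sketch: at-least-once delivery means duplicates can
# arrive, and ordering is not guaranteed, so we deduplicate on an ID carried
# in the message body (our own convention here, not an SQS feature).
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/demo-queue"  # hypothetical

seen_ids = set()   # in production this would be a durable store

def process(payload: dict) -> None:
    print("processing", payload)

def poll_once():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                               MaxNumberOfMessages=10,
                               WaitTimeSeconds=10)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        if body["id"] not in seen_ids:       # skip duplicates
            process(body)
            seen_ids.add(body["id"])
        # Delete only after successful handling, or the message reappears.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])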
As per one of David Linthicum's white papers, approaching SaaS-to-enterprise integration is really a matter of making informed and intelligent choices, chiefly about how to integrate remote cloud platforms with on-premise enterprise platforms.
Why SaaS Integration Is Hard. As indicated in the white paper, consider a mid-sized paper company that recently became a Salesforce.com CRM customer. The company currently leverages an on-premise custom system that uses an Oracle database to track inventory and sales. The use of the Salesforce.com system provides the company with significant value in terms of customer and sales management.
Having understood and defined the "to be" state, data synchronization technology is proposed as the best fit between the source, meaning Salesforce.com, and the target, meaning the existing legacy system that leverages Oracle. First of all, we need to gain insight into the special traits and tenets of SaaS applications in order to arrive at a suitable integration route. The constraining attributes of SaaS applications are:
● Dynamic nature of the SaaS interfaces that constantly change
● Dynamic nature of the metadata native to a SaaS provider such as
Salesforce.com
● Managing assets that exist outside of the firewall
● Massive amounts of information that need to move between
SaaS and on-premise systems daily and the need to maintain data
quality and integrity.
As SaaS applications are vigorously deployed on cloud infrastructures, we need to ponder the obstructions imposed by clouds and prescribe proven solutions. If we face difficulty with local integration, then cloud integration is bound to be more complicated. The most probable reasons are:
● New integration scenarios
● Access to the cloud may be limited
● Dynamic resources
● Performance
Limited Access. Access to cloud resources (SaaS, PaaS, and the infrastructures) is more limited than access to local applications. Accessing local applications is simpler and faster, and embedding integration points in local as well as custom applications is easier.
Dynamic Resources. Cloud resources are virtualized and service-oriented. That is, everything is expressed and exposed as a service. Due to the dynamism sweeping the whole cloud ecosystem, application versions and infrastructure are liable to change dynamically.
Performance. Clouds support application scalability and resource elasticity. However, the network distances between elements in the cloud are no longer under our control.
NEW INTEGRATION SCENARIOS
Before the cloud model, we had to stitch and tie local systems together. With the shift to the cloud model, we now have to connect local applications to the cloud, and we also have to connect cloud applications to each other, which adds new permutations to the complex integration channel matrix. All of this means integration must criss-cross firewalls somewhere.
Cloud Integration Scenarios. We have identified three major integration
scenarios as discussed below.
Within a Public Cloud (figure 3.1). Two different applications are
hosted in a cloud. The role of the cloud integration middleware (say
cloud-based ESB or internet service bus (ISB)) is to seamlessly enable
these applications to talk to each other. The possible sub-scenarios include that these applications can be owned by two different companies; they may live on a single physical server but run on different virtual machines.
FIGURE 3.1. Within a Public Cloud (App1 and App2 connected through an ISB).
FIGURE 3.2. Across Homogeneous Clouds (Cloud 1 and Cloud 2 connected through an ISB).
FIGURE 3.3. Across Heterogeneous Clouds (a public cloud and a private cloud connected through an ISB).
Homogeneous Clouds (figure 3.2). The applications to be integrated are posited in two geographically separated cloud infrastructures. The integration middleware can be in cloud 1, in cloud 2, or in a separate cloud. There is a need for data and protocol transformation, which is done by the ISB. The approach is more or less comparable to the enterprise application integration procedure.
Heterogeneous Clouds (figure 3.3). One application is in a public cloud and the other application is in a private cloud.
THE INTEGRATION METHODOLOGIES
Excluding custom integration through hand-coding, there are three types of cloud integration:
1. Traditional Enterprise Integration Tools can be empowered with
special connectors to access Cloud-located Applications—This is
the most likely approach for IT organizations, which have already
invested a lot in integration suite for their application integration
needs.
2. Traditional Enterprise Integration Tools are hosted in the
Cloud—This approach is similar to the first option except that
the integration software suite is now hosted in any third-party
cloud infrastructures so that the enterprise does not worry
about procuring and managing the hardware or installing the
integration software.
3. Integration-as-a-Service (IaaS) or On-Demand Integration Offerings—These are SaaS applications that are designed to deliver the integration service securely over the Internet and are able to integrate cloud applications with on-premise systems as well as cloud applications with one another.
In a nutshell, the integration requirements can be realised using
any one of the following methods and middleware products.
1. Hosted and extended ESB (Internet service bus / cloud integration bus)
2. Online message queues, brokers and hubs
3. Wizard and configuration-based integration platforms (niche integration solutions)
4. Integration service portfolio approach
5. Appliance-based integration (standalone or hosted)
With the emergence of the cloud space, the integration scope grows
further and hence people are looking out for robust and resilient
solutions and services that would speed up and simplify the whole
process of integration.
Characteristics of Integration Solutions and Products. The key attributes of integration platforms and backbones, gleaned from integration project experience, are connectivity, semantic mediation, data mediation, data migration, data security, data integrity, and governance:
● Connectivity refers to the ability of the integration engine to engage
with both the source and target systems using available native
interfaces.
● Semantic Mediation refers to the ability to account for the
differences between application semantics between two or more
systems.
● Data Mediation converts data from a source data format into the destination data format (a small sketch follows this list).
● Data Migration is the process of transferring data between storage
types, formats, or systems.
● Data Security means the ability to ensure that information extracted from the source systems is securely placed into the target systems.
● Data Integrity means data is complete and consistent. Thus, integrity
has to be guaranteed when data is getting mapped and maintained
during integration operations, such as data synchronization between
on-premise and SaaS-based systems.
● Governance refers to the processes and technologies that surround a
system or systems, which control how those systems are accessed
and leveraged.
These are the prominent qualities to analyze carefully and critically when selecting cloud / SaaS integration providers.
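As a concrete illustration of data mediation (referenced in the list above), the following sketch maps a record from a SaaS CRM format into an on-premise schema; the field names, the date normalisation and the currency scaling are made-up assumptions, not any vendor's actual schema.

# Data mediation sketch: convert a record from a (hypothetical) SaaS CRM
# format into a (hypothetical) on-premise target schema.
from datetime import datetime

def mediate(crm_record: dict) -> dict:
    """Map source fields to the target schema, normalising types on the way."""
    return {
        "CUSTOMER_ID": crm_record["AccountId"],
        "CUSTOMER_NAME": crm_record["Name"].strip().upper(),
        # Source stores ISO-8601 strings; the target expects date objects.
        "CREATED_ON": datetime.fromisoformat(crm_record["CreatedDate"]).date(),
        # Semantic mediation: annual revenue in USD maps to revenue in kUSD.
        "ANNUAL_REVENUE_KUSD": round(crm_record["AnnualRevenue"] / 1000.0, 1),
    }

print(mediate({"AccountId": "001A0",
               "Name": "  Acme Paper Co ",
               "CreatedDate": "2010-03-14T09:30:00",
               "AnnualRevenue": 2500000.0}))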
Data Integration Engineering Lifecycle. As business data are still stored and sustained on local and on-premise servers and storage machines, a lean data integration lifecycle is imperative. The pivotal phases, as per Mr. David Linthicum, a world-renowned integration expert, are understanding, definition, design, implementation, and testing.
1. Understanding the existing problem domain means defining the metadata that is native within the source system (say Salesforce.com) and the target system.
2. Definition refers to the process of taking the information culled during the previous step and defining it at a high level, including what the information represents, ownership, and physical attributes.
3. Design the integration solution around the movement of data from one point to another, accounting for the differences in semantics using the underlying data transformation and mediation layer by mapping one schema from the source to the schema of the target (a toy schema-mapping sketch follows this list).
4. Implementation refers to actually implementing the data integration solution within the selected technology.
5. Testing refers to assuring that the integration is properly designed and implemented and that the data synchronizes properly between the involved systems.
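As a toy illustration of the design step above, the sketch below declares a source-to-target field mapping once and applies it to records; the Salesforce-like and Oracle-like field names are invented for illustration.

# Toy schema-mapping sketch for the design step. Field names on both sides
# are hypothetical.
FIELD_MAP = {                      # source (SaaS CRM) -> target (legacy table)
    "Id":        "OPP_ID",
    "Amount":    "OPP_AMOUNT",
    "StageName": "SALES_STAGE",
    "CloseDate": "CLOSE_DT",
}

def map_record(source: dict) -> dict:
    """Apply the declared mapping, dropping fields the target does not know."""
    return {target: source[src] for src, target in FIELD_MAP.items() if src in source}

row = map_record({"Id": "006A1", "Amount": 1200.0,
                  "StageName": "Closed Won", "CloseDate": "2010-04-30"})
print(row)   # {'OPP_ID': '006A1', 'OPP_AMOUNT': 1200.0, ...}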
SaaS INTEGRATION PRODUCTS AND PLATFORMS
Cloud-centric integration solutions are being developed and
demonstrated for showcasing their capabilities for integrating enterprise
and cloud applications. The integration puzzle has been the toughest
assignment for long due to heterogeneity and multiplicity-induced
complexity.
Jitterbit
Force.com is a Platform as a Service (PaaS), enabling developers to
create and deliver any kind of on-demand business application.
FIGURE 3.4. The Smooth and Spontaneous Cloud Interaction via Open Clouds (Salesforce, Google, Microsoft, Zoho, Amazon, and Yahoo around the cloud).
Until now, integrating force.com applications with other on-demand
applications and systems within an enterprise has seemed like a
daunting and doughty task that required too much time, money, and
expertise.
Jitterbit is a fully graphical integration solution that provides users a versatile platform and a suite of productivity tools to reduce the integration effort sharply. Jitterbit comprises two major components:
● Jitterbit Integration Environment: An intuitive point-and-click graphical UI that enables users to quickly configure, test, deploy and manage integration projects on the Jitterbit server.
● Jitterbit Integration Server: A powerful and scalable run-time engine that processes all the integration operations, fully configurable and manageable from the Jitterbit application.
Jitterbit is making integration easier, faster, and more affordable
than ever before. Using Jitterbit, one can connect force.com with a wide variety of on-premise systems including ERP, databases, flat files and custom applications. Figure 3.5 vividly illustrates how Jitterbit links a number of functional and vertical enterprise systems (manufacturing, sales, R&D, consumer marketing) with on-demand applications.
FIGURE 3.5. Linkage of On-Premise with Online and On-Demand Applications.
Boomi Software
Boomi AtomSphere is an integration service that is completely on-demand and connects any combination of SaaS, PaaS, cloud, and on-premise applications without the burden of installing and maintaining software packages or appliances. Anyone can securely build, deploy and manage simple to complex integration processes using only a web browser, whether connecting SaaS applications found in various lines of business or integrating across geographic boundaries.
Bungee Connect
For professional developers, Bungee Connect enables cloud computing
by offering an application development and deployment platform
that enables highly interactive applications integrating multiple data
sources and facilitating instant deployment.
OpSource Connect
OpSource Connect expands on the OpSource Services Bus (OSB) by providing the infrastructure for two-way web services interactions, allowing customers to consume and publish applications across a common web services infrastructure.
The Platform Architecture. OpSource Connect is made up of key
features including:
● OpSource Services Bus
● OpSource Service Connectors
● OpSource Connect Certified Integrator Program
● OpSource Connect ServiceXchange
● OpSource Web Services Enablement Program
The OpSource Services Bus (OSB) is the foundation for OpSource's turnkey development and delivery environment for SaaS and web companies.
SnapLogic
SnapLogic is a capable, clean, and uncluttered solution for data integration that can be deployed in enterprise as well as in cloud landscapes. The free community edition can be used for the most common point-to-point data integration tasks, giving a huge productivity boost beyond custom code.
● Changing data sources. SaaS and on-premise applications, Web
APIs, and RSS feeds
● Changing deployment options. On-premise, hosted, private and
public cloud platforms
● Changing delivery needs. Databases, files, and data services
Transformation Engine and Repository. SnapLogic is a single data
integration platform designed to meet data integration needs. The
SnapLogic server is built on a core of connectivity and transformation
components, which can be used to solve even the most complex data
integration scenarios.
The SnapLogic designer provides an initial hint of the web principles
at work behind the scenes. The SnapLogic server is based on the web
architecture and exposes all its capabilities through web interfaces to the outside world.
The Pervasive DataCloud
The Pervasive DataCloud platform (Figure 3.6) is a unique multi-tenant platform. It provides dynamic "compute capacity in the sky" for deploying on-demand integration and other data-centric applications.
FIGURE 3.6. Pervasive Integrator Connects Different Resources (scheduled events, e-commerce users, a load balancer and message queues feeding engine/queue-listener workers in a scalable computing cluster, SaaS applications, and customers).
Pervasive DataCloud is the first multi-tenant platform for delivering the following:
1. Integration as a Service (IaaS) for both hosted and on-premises applications and data sources
2. Packaged turnkey integration
3. Integration that supports every integration scenario
4. Connectivity to hundreds of different applications and data sources
Pervasive DataCloud hosts Pervasive and its partners' data-centric applications. Pervasive uses Pervasive DataCloud as a platform for deploying on-demand integration via:
● The Pervasive DataSynch family of packaged integrations. These
are highly affordable, subscription-based, and packaged integration
solutions.
● Pervasive Data Integrator. This runs on the cloud or on-premises and is a design-once, deploy-anywhere solution supporting every integration scenario, including:
● Data migration, consolidation and conversion
● ETL / Data warehouse
● B2B / EDI integration
● Application integration (EAI)
● SaaS /Cloud integration
● SOA / ESB / Web Services
● Data Quality/Governance
● Hubs
Pervasive DataCloud provides multi-tenant, multi-application and multi-customer deployment. Pervasive DataCloud is a platform to deploy applications that are:
● Scalable—Its multi-tenant architecture can support multiple users and applications for delivery of diverse data-centric solutions such as data integration. The applications themselves scale to handle fluctuating data volumes.
● Flexible—Pervasive DataCloud supports SaaS-to-SaaS, SaaS-to-on-premise or on-premise-to-on-premise integration.
● Easy to Access and Configure—Customers can access, configure and run Pervasive DataCloud-based integration solutions via a browser.
● Robust—Provides automatic delivery of updates as well as monitoring activity by account, application or user, allowing effortless result tracking.
● Secure—Uses the best technologies in the market coupled with the best data centers and hosting services to ensure that the service remains secure and available.
● Affordable—The platform enables delivery of packaged solutions in a SaaS-friendly pay-as-you-go model.
Bluewolf
Bluewolf has announced its expanded "Integration-as-a-Service" solution, the first to offer ongoing support of integration projects, guaranteeing successful integration between diverse SaaS solutions, such as salesforce.com, BigMachines, eAutomate, OpenAir, and back office systems (e.g. Oracle, SAP, Great Plains, SQL Server and MySQL). Called the Integrator, the solution is the only one to include proactive monitoring and consulting services to ensure integration success. With remote monitoring of integration jobs via a dashboard included as part of the Integrator solution, Bluewolf proactively alerts its customers of any issues with integration and helps to solve them quickly.
Online MQ
Online MQ is an Internet-based queuing system. It is a complete and
secure online messaging solution for sending and receiving messages
over any network. It is a cloud messaging queuing service.
● Ease of Use. It is an easy way for programs that may each be
running on different platforms, in different systems and different
networks, to communicate with each other without having to write
any low-level communication code.
● No Maintenance. No need to install any queuing software/server
and no need to be concerned with MQ server uptime, upgrades and
maintenance.
● Load Balancing and High Availability. Load balancing can be
achieved on a busy system by arranging for more than one program
instance to service a queue. The performance and availability
features are being met through clustering. That is, if one system fails, then the second system can take care of users' requests without any delay.
● Easy Integration. Online MQ can be used as a web-service (SOAP)
and as a REST service. It is fully JMS-compatible and can hence
integrate easily with any Java EE application servers. Online MQ is
not limited to any specific platform, programming language or
communication protocol.
CloudMQ
CloudMQ leverages the power of the Amazon cloud to provide enterprise-grade message queuing capabilities on demand. Messaging allows us to reliably break up a single process into several parts which can then be executed asynchronously.
Linxter
Linxter is a cloud messaging framework for connecting all kinds of applications, devices, and systems. Linxter is a behind-the-scenes, message-oriented, cloud-based middleware technology that smoothly automates the complex tasks developers face when creating communication-based products and services.
Online MQ, CloudMQ and Linxter all accomplish message-based application and service integration. As these suites are hosted in clouds, messaging is provided as a service to hundreds of distributed and enterprise applications using the much-maligned multi-tenancy property. "Messaging middleware as a service (MMaaS)" is the grand derivative of the SaaS paradigm.
SaaS INTEGRATION SERVICES
We have seen the state-of-the-art cloud-based data integration
platforms for real-time data sharing among enterprise information
systems and cloud applications.
There are fresh endeavours to achieve service composition in the cloud ecosystem. Existing frameworks such as service component architecture (SCA) are being revitalised to make them fit for cloud environments. Composite applications, services, data, views and processes will become cloud-centric and hosted in order to support spatially separated and heterogeneous systems.
Informatica On-Demand
Informatica offers a set of innovative on-demand data integration
solutions called Informatica On-Demand Services. This is a cluster of
easy-to-use SaaS offerings, which facilitate integrating data in SaaS
applications, seamlessly and securely across the Internet with data in
on-premise applications. There are a few key benefits to leveraging this
maturing technology.
● Rapid development and deployment with zero maintenance of the
integration technology.
● Automatically upgraded and continuously enhanced by vendor.
● Proven SaaS integration solutions, such as integration with Salesforce.com, meaning that the connections and the metadata understanding are provided.
● Proven data transfer and translation technology, meaning that
core integration services such as connectivity and semantic
mediation are built into the technology.
Informatica On-Demand has taken the unique approach of moving
its industry leading PowerCenter Data Integration Platform to the
hosted model and then configuring it to be a true multi-tenant
solution.
Microsoft Internet Service Bus (ISB)
Azure is an upcoming cloud operating system from Microsoft. It makes developing, deploying, and delivering Web and Windows applications on cloud centers easier and more cost-effective. Microsoft .NET Services is a set of Microsoft-built and hosted cloud infrastructure services for building Internet-enabled applications, and the ISB acts as the cloud middleware providing diverse applications with a common infrastructure to name, discover, expose, secure and orchestrate web services. The following are the three broad areas.
.NET Service Bus. The .NET Service Bus (Figure 3.7) provides a hosted, secure, and broadly accessible infrastructure for pervasive communication, large-scale event distribution, naming, and service publishing.
FIGURE 3.7. The .NET Service Bus (console applications, end users, and applications on the Azure Services Platform, Google App Engine, and Windows Azure exposing and consuming web services via the Service Bus).
Services
can be exposed through the Service Bus Relay, providing connectivity
options for service endpoints that would otherwise be difficult or
impossible to reach.
.NET Access Control Service. The .NET Access Control Service is a
hosted, secure, standards-based infrastructure for multiparty, federated
authentication, rules-driven, and claims-based authorization.
.NET Workflow Service. The .NET Workflow Service provides a hosted environment for service orchestration based on the familiar Windows Workflow Foundation (WWF) development experience.
The most important part of Azure is actually the Service Bus, represented as a WCF architecture. The key capabilities of the Service Bus are:
● A federated namespace model that provides a shared, hierarchical
namespace into which services can be mapped.
● A service registry service that provides an opt-in model for
publishing service endpoints into a lightweight, hierarchical, and
RSS-based discovery mechanism.
● A lightweight and scalable publish/subscribe event bus.
● A relay and connectivity service with advanced NAT traversal and pull-mode message delivery capabilities, acting as a "perimeter network (also known as DMZ, demilitarized zone, or screened subnet) in the sky."
Relay Services. Often when we connect to a service, it is located behind the firewall and behind the load balancer. Its address is dynamic and can be resolved only on the local network. When the service calls back to the client, these connectivity constraints lead to scalability, availability and security issues. The solution to the Internet connectivity challenge is, instead of connecting the client directly to the service, to use a relay service, as pictorially represented in Figure 3.8.
FIGURE 3.8. The .NET Relay Service (Client, Relay Service, Service).
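To make the relay idea concrete, here is a minimal, purely hypothetical sketch (not the actual .NET Service Bus API): the service behind the firewall only makes outbound HTTP calls, polling a relay for pending client requests and posting responses back, so no inbound ports need to be opened.

# Hypothetical relay-pattern sketch. The relay endpoint, message format and
# service identifier are illustrative assumptions only.
import time
import requests

RELAY = "https://relay.example.com"      # hypothetical relay service
SERVICE_ID = "orders-service"            # identifies this on-premise service

def handle(request_payload: dict) -> dict:
    """Local business logic invoked on behalf of remote clients."""
    return {"status": "ok", "echo": request_payload}

def serve_forever():
    while True:
        # Outbound long-poll: ask the relay for any queued client requests.
        resp = requests.get(f"{RELAY}/poll/{SERVICE_ID}", timeout=30)
        if resp.status_code == 204:      # nothing pending
            continue
        msg = resp.json()                # e.g. {"request_id": ..., "payload": ...}
        result = handle(msg["payload"])
        # Push the response back out through the relay to the waiting client.
        requests.post(f"{RELAY}/reply/{msg['request_id']}", json=result, timeout=10)
        time.sleep(0.1)

if __name__ == "__main__":
    serve_forever()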
BUSINESS-TO-BUSINESS INTEGRATION (B2Bi) SERVICES
B2Bi has been a mainstream activity for connecting geographically
distributed businesses for purposeful and beneficial cooperation.
Product vendors have come out with competent B2B hubs and suites for enabling smooth data sharing in a standards-compliant manner among the participating enterprises.
Just as these abilities ensure smooth communication between
manufacturers and their external suppliers or customers, they also
enable reliable interchange between hosted and installed applications.
The IaaS model also leverages the adapter libraries developed by
B2Bi vendors to provide rapid integration with various business
systems.
Cloud-based Enterprise Mashup Integration Services for B2B Scenarios. There is a vast need for infrequent, situational and ad-hoc B2B applications desired by the mass of business end-users.
Especially in the area of applications to support B2B collaborations,
current offerings are characterized by a high richness but low reach,
like B2B hubs that focus on many features enabling electronic
collaboration, but lack availability for especially small organizations
or even individuals.
Enterprise Mashups, a kind of new-generation Web-based
applications,
seem to adequately fulfill the individual and
heterogeneous requirements of end-users and foster End User
Development (EUD).
Another challenge in B2B integration is the ownership of and responsibility for processes. In many inter-organizational settings, business processes are only sparsely structured and formalized, rather loosely coupled and/or based on ad-hoc cooperation. Inter-organizational collaborations tend to involve more and more participants, and the growing number of participants also brings a huge amount of differing requirements.
Now, in supporting supplier and partner co-innovation and customer co-creation, the focus is shifting to collaboration which has to embrace the participants, who are influenced yet restricted by multiple domains of control and disparate processes and practices.
Both electronic data interchange (EDI) translators and managed file transfer (MFT) have a longer history, while B2B gateways have only emerged during the last decade.
Enterprise Mashup Platforms and Tools.
Mashups are the adept combination of different and distributed
resources including content, data or application functionality. Resources
represent the core building blocks for mashups. Resources can be
accessed through APIs, which encapsulate the resources and describe
the interface through which they are made available. Widgets or gadgets
primarily put a face on the underlying resources by providing a
graphical representation for them and piping the data received from the
resources. Piping can include operators like aggregation, merging or
filtering. A Mashup platform is a Web-based tool that allows the creation of Mashups by piping resources into Gadgets and wiring Gadgets together.
The Mashup integration services are being implemented as a prototype in the FAST project. The layers of the prototype are illustrated in Figure 3.9, which describes how these services work together. The authors of this framework have given an outlook on the technical realization of the services using cloud infrastructures and services.
FIGURE 3.9. Cloud-based Enterprise Mashup Integration Platform Architecture. Browsers at Company A and Company B access enterprise mashup platforms (e.g., FAST and SAP Research RoofTop) over HTTP; the platforms communicate via REST with the Mashup integration service logic hosted on an integration services platform (e.g., Google App Engine), which comprises a routing engine, identity management, error handling and monitoring, organization, a translation engine, semantic persistent storage, and a message queue infrastructure, backed by cloud-based services such as Amazon SQS, Amazon S3, Mule onDemand, and OpenID/OAuth (Google).
To simplify this, a Gadget could be provided for the end-user. The
routing engine is also connected to a message queue via an API. Thus,
different message queue engines are attachable. The message queue is
responsible for storing and forwarding the messages controlled by the
routing engine. Beneath the message queue, a persistent storage, also
connected via an API to allow exchangeability, is available to store
large data. The error handling and monitoring service allows tracking
the message-flow to detect errors and to collect statistical data. The
Mashup integration service is hosted as a cloud-based service. Also,
there are cloud-based services available which provide the functionality
required by the integration service. In this way, the Mashup integration
service can reuse and leverage the existing cloud services to speed up
the implementation.
Message Queue. The message queue could be realized by using Amazon's Simple Queue Service (SQS). SQS is a web-service which
provides a queue for messages and stores them until they can be
processed. The Mashup integration services, especially the routing
engine, can put messages into the queue and recall them when they are
needed.
Persistent Storage. Amazon Simple Storage Service (S3) is also a web service. The routing engine can use this service to store large files.
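One way the two services could be combined, shown in the sketch below, is to store large payloads in S3 and pass only a small reference through SQS; the bucket name, queue URL and the pointer-message convention are illustrative assumptions, not part of the framework's specification.

# Sketch: large payloads go to S3, a pointer message travels through SQS.
import json
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")
BUCKET = "mashup-integration-payloads"     # hypothetical bucket
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/mashup-routing"  # hypothetical

def enqueue_large_message(local_path: str, key: str) -> None:
    """Store the large file in S3 and enqueue a small pointer message."""
    s3.upload_file(local_path, BUCKET, key)
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"s3_bucket": BUCKET, "s3_key": key}))

def dequeue_large_message(download_to: str) -> None:
    """Receive a pointer message and fetch the referenced file from S3."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=5)
    for msg in resp.get("Messages", []):
        ref = json.loads(msg["Body"])
        s3.download_file(ref["s3_bucket"], ref["s3_key"], download_to)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])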
Translation Engine. This is primarily focused on translating between the different protocols which the Mashup platforms it connects can understand, e.g. REST or SOAP web services. However, if the need to translate the transferred objects arises, this could also be attached to the translation engine.
Interaction between the Services. The diagram describes the process of
a message being delivered and handled by the Mashup Integration
Services Platform. The precondition for this process is that a user
already established a route to a recipient.
A FRAMEWORK OF SENSOR-CLOUD INTEGRATION
In the past few years, wireless sensor networks (WSNs) have been gaining significant attention because of their potential to enable novel and attractive solutions in areas such as industrial automation, environmental monitoring, transportation business, and health care.
With the faster adoption of micro and nano technologies, everyday
things are destined to become digitally empowered and smart in their
operations and offerings. Thus the goal is to link smart materials,
appliances, devices, federated messaging middleware, enterprise
information systems and packages, ubiquitous services, handhelds, and
sensors with one another smartly to build and sustain cool, charismatic
and catalytic situation-aware applications.
A virtual community consisting of a team of researchers has come together to solve a complex problem; they need data storage, compute capability, and security, and they need it all provided now. For example, this team is
working on an outbreak of a new virus strain moving through a population.
This requires more than a Wiki or other social organization tool. They
deploy bio-sensors on patient body to monitor patient condition
continuously and to use this data for large and multi-scale simulations to
track the spread of infection as well as the virus mutation and possible cures.
This may require computational resources and a platform for sharing data
and results that are not immediately available to the team.
A traditional HPC approach like the sensor-grid model can be used in this case, but setting up the infrastructure to deploy it so that it can scale out quickly is not easy in this environment. The cloud paradigm, however, is an excellent fit.
Here, the researchers need to register their interests to get various patients' state (blood pressure, temperature, pulse rate, etc.) from bio-sensors for large-scale parallel analysis and to share this information with each other to find a useful solution for the problem. So the sensor data needs to be aggregated, processed and disseminated based on subscriptions.
To integrate sensor networks with the cloud, the authors have proposed a content-based pub-sub model. In this framework, as in MQTT-S, all of the system complexities reside on the broker's side, but it differs from MQTT-S in that it uses a content-based pub-sub broker rather than a topic-based one, which is suitable for the application scenarios considered.
To deliver published sensor data or events to subscribers, an efficient
and scalable event matching algorithm is required by the pub-sub
broker.
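To illustrate the idea of content-based matching, here is a minimal Python sketch in which each subscription is a set of attribute constraints; the subscription table, field names and threshold values are invented for illustration and this is not the matching algorithm proposed in the framework.

# Minimal content-based subscription matching sketch.
import operator

OPS = {"==": operator.eq, "!=": operator.ne,
       ">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le}

# Hypothetical subscriptions keyed by subscriber (SaaS application) id.
subscriptions = {
    "epidemic-analysis": [("type", "==", "temperature"), ("value", ">", 38.0)],
    "doctor-portal":     [("patient_id", "==", "P-17")],
}

def matches(event: dict, constraints) -> bool:
    """An event matches when every constraint in the subscription holds."""
    return all(field in event and OPS[op](event[field], value)
               for field, op, value in constraints)

def route(event: dict):
    """Return the subscribers whose content filters match the event."""
    return [sub for sub, cons in subscriptions.items() if matches(event, cons)]

# Example: a bio-sensor reading published to the broker.
reading = {"type": "temperature", "patient_id": "P-17", "value": 39.2}
print(route(reading))   # ['epidemic-analysis', 'doctor-portal']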
Moreover, several SaaS applications may have an interest in the same sensor data but for different purposes. In this case, the SA nodes would need to manage and maintain communication with multiple applications in parallel. This might exceed the limited capabilities of the simple and low-cost SA devices. So a pub-sub broker is needed, and it is located on the cloud side because of its higher performance in terms of bandwidth and capabilities. It has four components, described as follows:
FIGURE 3.10. The Framework Architecture of Sensor-Cloud Integration. Wireless sensor networks (WSN 1, WSN 2) connect through gateways and actuators to the cloud provider (CLP), which hosts the pub/sub broker (stream monitoring and processing, registry, event analyzer, and disseminator components), a provisioning manager, monitoring and metering, servers, a service registry, a policy repository, a mediator, and a collaborator agent; application-specific SaaS services, such as a social network of doctors monitoring patient healthcare for virus infection, an environmental data analysis and sharing portal, an urban traffic prediction and analysis network, and other data analysis or social networks, consume the published sensor data.
Stream monitoring and processing component (SMPC). The sensor
stream comes in many different forms. In some cases, it is raw data that
must be captured, filtered and analyzed on the fly and in other cases, it is
stored or cached. The style of computation required depends on the nature of the streams. So the SMPC component running on the cloud monitors the event streams and invokes the correct analysis method. Depending on the data rates and the amount of processing that is required, the SMPC manages a parallel execution framework on the cloud.
Registry component (RC). Different SaaS applications register to pub-sub
broker for various sensor data required by the community user.
Analyzer component (AC). When sensor data or events come to the pub-sub broker, the analyzer component determines which applications they belong to and whether they need periodic or emergency delivery.
Disseminator component (DC). For each SaaS application, it disseminates sensor data or events to subscribed users using the event matching algorithm. It can utilize the cloud's parallel execution framework for fast event delivery. The pub-sub components' workflow in the framework is as follows:
Users register their information and subscriptions with various SaaS applications, which then transfer all this information to the pub/sub broker registry. When sensor data reaches the system from the gateways, the event/stream monitoring and processing component (SMPC) in the pub/sub broker determines whether it needs processing, or should just be stored for periodic dispatch or delivered immediately.
Mediator. The (resource) mediator is a policy-driven entity within a VO to
ensure that the participating entities are able to adapt to changing
circumstances and are able to achieve their objectives in a dynamic and
uncertain environment.
Policy Repository (PR). The PR virtualizes all of the policies within the
VO. It includes the mediator policies, VO creation policies along with any
policies for resources delegated to the VO as a result of a collaborating
arrangement.
Collaborating Agent (CA). The CA is a policy-driven resource discovery
module for VO creation and is used as a conduit by the mediator to
exchange policy and resource information with other CLPs.
SaaS INTEGRATION APPLIANCES
Appliances are a good fit for high-performance requirements. Clouds too have gone down the same path, and today there are cloud appliances (also termed "cloud in a box"). In this section, we look at an integration appliance.
Cast Iron Systems. This is quite different from the above-mentioned schemes. Appliances with the relevant software etched inside are being established as a high-performance and hardware-centric solution for several IT needs.
Cast Iron Systems (www.ibm.com) provides pre-configured solutions for each of today's leading enterprise and on-demand applications. These solutions, built using the Cast Iron product offerings, offer out-of-the-box connectivity to specific applications and template integration processes (TIPs) for the most common integration scenarios.
2.4 THE ENTERPRISE CLOUD COMPUTING
PARADIGM
Cloud computing is still in its early stages and constantly undergoing
changes as new vendors, offers, services appear in the cloud market.
Enterprises will place stringent requirements on cloud providers to pave the way for more widespread adoption of cloud computing, leading to what is known as the enterprise cloud computing paradigm.
Enterprise cloud computing is the alignment of a cloud computing
model with an organization‘s business objectives (profit, return on
investment, reduction of operations costs) and processes. This chapter
explores this paradigm with respect to its motivations, objectives,
strategies and methods.
Section 4.2 describes a selection of deployment models and strategies
for enterprise cloud computing, while Section 4.3 discusses the issues of
moving [traditional] enterprise applications to the cloud. Section 4.4
describes the technical and market evolution for enterprise cloud
computing,
describing some potential opportunities for multiple
stakeholders in the provision of enterprise cloud computing.
BACKGROUND
According to NIST [1], cloud computing is composed of five essential
characteristics: on-demand self-service, broad network access, resource
pooling, rapid elasticity, and measured service. The ways in which these
characteristics are manifested in an enterprise context vary according to the
deployment model employed.
Relevant Deployment Models for Enterprise Cloud Computing
There are some general cloud deployment models that are accepted by the majority of cloud stakeholders today, as suggested by the references [1] and discussed in the following:
● Public clouds are provided by a designated service provider for general
public under a utility based pay-per-use consumption model.
● Private clouds are built, operated, and managed by an organization for its
internal use only to support its business operations exclusively.
● Virtual private clouds are a derivative of the private cloud deployment model but are further characterized by an isolated and secure segment of resources, created as an overlay on top of public cloud infrastructure using advanced network virtualization capabilities.
● Community clouds are shared by several organizations and support a
specific community that has shared concerns (e.g., mission, security
requirements, policy, and compliance considerations).
● Managed clouds arise when the physical infrastructure is owned by and/or
physically located in the organization‘s data centers with an extension of
management and security control plane controlled by the managed service
provider .
● Hybrid clouds are a composition of two or more clouds (private,
community, or public) that remain unique entities but are bound
together by standardized or proprietary technology that enables data
and application portability (e.g., cloud bursting for load-balancing
between clouds).
Adoption and Consumption Strategies
The selection of strategies for enterprise cloud computing is critical for IT
capability as well as for the earnings and costs the organization experiences,
motivating efforts toward convergence of business strategies and IT. Some
critical questions toward this convergence in the enterprise cloud paradigm are
as follows:
● Will an enterprise cloud strategy increase overall business value?
● Are the effort and risks associated with transitioning to an enterprise
cloud strategy worth it?
● Which areas of business and IT capability should be considered for the
enterprise cloud?
● Which cloud offerings are relevant for the purposes of an organization?
● How can the process of transitioning to an enterprise cloud strategy be
piloted and systematically executed?
These questions are addressed from two strategic perspectives: (1) adoption
and (2) consumption. Figure 4.1 illustrates a framework for enterprise cloud
adoption strategies, where an organization makes a decision to adopt a
cloud computing model based on fundamental drivers for cloud computing—
scalability, availability, cost and convenience. The notion of a Cloud Data
Center (CDC) is used, where the CDC could be an external, internal or
federated provider of infrastructure, platform or software services.
An optimal adoption decision cannot be established for all cases, because the types of resources (infrastructure, storage, software) obtained from a CDC depend on the size of the organization, its understanding of the impact of IT on the business, the predictability of workloads, the flexibility of the existing IT landscape, and the available budget/resources for testing and piloting. The strategic decisions using these four basic drivers are described in the following, stating objectives, conditions and actions.
FIGURE 4.1. Enterprise cloud adoption strategies using fundamental cloud drivers. Adoption of a Cloud Data Center (CDC) is convenience-driven (use cloud resources so that there is no need to maintain local resources), availability-driven (use load-balanced and localised cloud resources to increase availability and reduce response time), market-driven (users and providers of cloud resources make decisions based on the potential saving and profit), or scalability-driven (use cloud resources to support additional load or as back-up).
1. Scalability-Driven Strategy. The objective is to support increasing workloads of the organization without investment and expenses exceeding returns.
2. Availability-Driven Strategy. Availability has close relations to scalability but is more concerned with the assurance that IT capabilities and functions are accessible, usable and acceptable by the standards of users.
3. Market-Driven Strategy. This strategy is more attractive and viable for small, agile organizations that do not have (or wish to have) massive investments in their IT infrastructure.
FIGURE 4.2. Enterprise cloud consumption strategies: (1) Software Provision: the cloud provides instances of software but data is maintained within the user's data center; (2) Storage Provision: the cloud provides data management and software accesses data remotely from the user's data center; (3) Solution Provision: software and storage are maintained in the cloud and the user does not maintain a data center; (4) Redundancy Services: the cloud is used as an alternative or extension of the user's data center for software and storage.
4. Convenience-Driven Strategy. The objective is to reduce the load and need for dedicated system administrators and to make access to IT capabilities by users easier, regardless of their location and connectivity (e.g., over the Internet).
There are four consumption strategies identified, where the differences in objectives, conditions and actions reflect the decision of an organization to trade off hosting costs, controllability and resource elasticity of IT resources for software and data. These are discussed in the following (a toy restatement of the rules follows the list).
1. Software Provision. This strategy is relevant when the elasticity requirement is high for software and low for data, the controllability concerns are low for software and high for data, and the cost reduction concerns for software are high, while cost reduction is not a priority for data, given the high controllability concerns for data, that is, data are highly sensitive.
2. Storage Provision. This strategy is relevant when the elasticity requirement is high for data and low for software, while the controllability of software is more critical than that of data. This can be the case for data-intensive applications, where the results from processing in the application are more critical and sensitive than the data itself.
3. Solution Provision. This strategy is relevant when the elasticity and cost reduction requirements are high for software and data, but the controllability requirements can be entrusted to the CDC.
4. Redundancy Services. This strategy can be considered as a hybrid enterprise cloud strategy, where the organization switches between traditional, software, storage or solution management based on changes in its operational conditions and business demands.
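As a toy restatement of the four consumption strategies above, the sketch below reduces each strategy to a rule over elasticity and controllability needs; collapsing each concern to a single boolean input is a deliberate simplification for illustration, not part of the framework itself.

# Toy decision helper restating the four consumption strategies as a rule.
def consumption_strategy(software_elasticity_high: bool,
                         data_elasticity_high: bool,
                         data_control_critical: bool,
                         software_control_critical: bool) -> str:
    if software_elasticity_high and data_elasticity_high and not (
            data_control_critical or software_control_critical):
        return "Solution Provision"      # software and data both entrusted to the CDC
    if software_elasticity_high and data_control_critical:
        return "Software Provision"      # software in the cloud, sensitive data stays local
    if data_elasticity_high and software_control_critical:
        return "Storage Provision"       # data in the cloud, software stays local
    return "Redundancy Services"         # switch between modes as conditions change

print(consumption_strategy(True, False, True, False))   # Software Provision
print(consumption_strategy(False, True, False, True))   # Storage Provision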
Even though an organization may find a strategy that appears to provide it
significant benefits, this does not mean that immediate adoption of the strategy
is advised or that the returns on investment will be observed immediately.
ISSUES FOR ENTERPRISE APPLICATIONS ON THE CLOUD
Enterprise Resource Planning (ERP) is the most comprehensive definition of an enterprise application today. For these reasons, ERP solutions have emerged as the core of successful information management and the enterprise backbone of nearly any organization. Organizations that have successfully implemented ERP systems are reaping the benefits of an integrated working environment, standardized processes, and other operational benefits.
One of the first issues is that of infrastructure availability. Al-Mashari and
Yasser argued that adequate IT infrastructure, hardware and networking are
crucial for an ERP system‘s success.
One of the ongoing discussions concerning future scenarios considers varying infrastructure requirements and constraints given different workloads and development phases. Recent surveys among companies in North America and Europe with enterprise-wide IT systems showed that nearly all kinds of workloads are seen as suitable for transfer to IaaS offerings.
Considering Transactional and Analytical Capabilities
Transactional type of applications or so-called OLTP (On-line Transaction
Processing) applications, refer to a class of systems that manage
transactionoriented applications, typically using relational databases. These
applications rely on strong ACID (atomicity, consistency, isolation,
durability) properties and are relatively write/update-intensive. Typical OLTPtype ERP components are sales and distributions (SD), banking and financials,
customer relationship management (CRM) and supply chain management
(SCM).
One can conclude that analytical applications will benefit more than their
transactional counterparts from the opportunities created by cloud computing,
especially on compute elasticity and efficiency.
2.4.1 TRANSITION CHALLENGES
The very concept of cloud represents a leap from the traditional approach by which IT delivers mission-critical services. With any leap comes a gap of risks and challenges to overcome. These challenges can be classified in five different categories, which are the five aspects of the enterprise cloud stages: build, develop, migrate, run, and consume (Figure 4.3).
The requirement for a company-wide cloud approach should then become
the number one priority of the CIO, especially when it comes to having a
coherent and cost effective development and migration of services on this
architecture.
FIGURE 4.3. Five stages of the cloud: build, develop, migrate, run, and consume.
A second challenge is migration of existing or "legacy" applications to "the cloud." The expected average lifetime of an ERP product is approximately 15 years, which means that companies will need to face this aspect sooner rather than later as they try to evolve toward the new IT paradigm.
The ownership of enterprise data, conjugated with the integration of other applications in and from outside the cloud, is one of the key challenges. Future enterprise application development frameworks will need to enable the separation of data management from ownership. From this, it can be extrapolated that SOA, as a style, underlies the architecture and, moreover, the operation of the enterprise cloud.
One of these has been notoriously hard to upgrade: the human factor;
bringing staff up to speed on the requirements of cloud computing with respect
to architecture, implementation, and operation has always been a tedious task.
Once the IT organization has either been upgraded to provide cloud or is
able to tap into cloud resource, they face the difficulty of maintaining the
services in the cloud. The first one will be to maintain interoperability between
in-house infrastructure and service and the CDC (Cloud Data Center).
Before leveraging such features, much more basic functionalities are
problematic: monitoring, troubleshooting, and comprehensive capacity
planning are actually missing in most offers. Without such features it becomes
very hard to gain visibility into the return on investment and the consumption
of cloud services.
Today there are two major cloud pricing models: allocation-based and usage-based. The first one is provided by the poster child of cloud computing, namely, Amazon. The principle relies on the allocation of resources for a fixed amount of time. As companies evaluate the offers, they also need to include hidden costs such as lost IP, risk, migration, delays and provider overheads. This combination can be compared to trying to choose a new mobile phone with a carrier plan. The market dynamics will hence evolve alongside the technology for the enterprise cloud computing paradigm.
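To show how the two pricing models can be compared in practice, here is a toy break-even calculation; the fixed monthly fee and hourly rate are hypothetical numbers chosen only for illustration, and real evaluations would also fold in the hidden costs mentioned above.

# Toy break-even comparison: allocation-based (fixed fee per month) versus
# usage-based (pay per hour actually used). Prices are hypothetical.
RESERVED_MONTHLY_FEE = 180.0   # hypothetical fixed allocation price per month
ON_DEMAND_HOURLY     = 0.40    # hypothetical usage-based price per hour

def monthly_cost_usage(hours_used: float) -> float:
    """Usage-based model: pay only for the hours actually consumed."""
    return hours_used * ON_DEMAND_HOURLY

def cheaper_model(hours_used: float) -> str:
    return "allocation-based" if RESERVED_MONTHLY_FEE < monthly_cost_usage(hours_used) \
        else "usage-based"

# Break-even point: hours above which the fixed allocation becomes cheaper.
break_even_hours = RESERVED_MONTHLY_FEE / ON_DEMAND_HOURLY
print(f"break-even at {break_even_hours:.0f} hours/month")   # 450 hours
print(cheaper_model(200))   # usage-based (low, spiky utilisation)
print(cheaper_model(700))   # allocation-based (steady, high utilisation)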
ENTERPRISE CLOUD TECHNOLOGY AND MARKET EVOLUTION
This section discusses the potential factors which will influence the evolution of cloud computing and today's enterprise landscapes toward the enterprise cloud computing paradigm, featuring the convergence of business and IT and an open, service-oriented marketplace.
Technology Drivers for Enterprise Cloud Computing Evolution
This will put pressure on cloud providers to build their offerings on open, interoperable standards in order to be considered as candidates by enterprises. There have been a number of initiatives emerging in this space, although Amazon, Google, and Microsoft currently do not actively participate in these efforts, so true interoperability across the board in the near future seems unlikely. However, if achieved, it could lead to the facilitation of advanced scenarios and thus drive mainstream adoption of the enterprise cloud computing paradigm.
Part of preserving investments is maintaining the assurance that cloud
resources and services powering the business operations perform according
to the business requirements. Underperforming resources or service disruptions
lead to business and financial loss, reduced business credibility, reputation,
and marginalized user productivity. Another important factor in this regard is the lack of insight into the performance and health of the resources and services deployed on the cloud, so this is another area where technology evolution will be pushed.
This would prove to be a critical capability empowering third-party
organizations to act as independent auditors especially with respect to SLA
compliance auditing and for mediating the SLA penalty related issues.
An emerging trend in the cloud application space is the divergence from the traditional RDBMS-based data store backend. Cloud computing has given rise to alternative data storage technologies (Amazon Dynamo, Facebook Cassandra, Google BigTable, etc.) based on key-value storage models, as compared to the relational model, which has been the mainstream choice for data storage for enterprise applications.
As these technologies evolve into maturity, the PaaS market will consolidate
into a smaller number of service providers. Moreover, big traditional software
vendors will also join this market which will potentially trigger this
consolidation through acquisitions and mergers. These views are along the
lines of the research published by Gartner. Gartner predicts that from 2011 to 2015 market competition and maturing developer practices will drive consolidation around a small group of industry-dominant cloud technology providers.
A recent report published by Gartner presents an interesting perspective on cloud evolution. The report argues that, as cloud services proliferate, services will become too complex to be handled directly by consumers. To cope with these scenarios, meta-services or cloud brokerage services will emerge. These brokerages will use several types of brokers and platforms to enhance service delivery and, ultimately, service value. According to Gartner, before these scenarios can be enabled, a brokerage business is needed to operate these brokers and platforms. According to Gartner, the following types of cloud service brokerages (CSB) are foreseen:
● Cloud Service Intermediation. An intermediation broker provides a service that directly enhances a given service delivered to one or more service consumers, essentially adding value on top of a given service to enhance a specific capability.
● Aggregation. An aggregation brokerage service combines multiple
services into one or more new services.
● Cloud Service Arbitrage. These services will provide flexibility and
opportunistic choices for the service aggregator.
The above shows that there is potential for various large, medium, and
small organizations to become players in the enterprise cloud marketplace.
The dynamics of such a marketplace are still to be explored as the enabling
technologies and standards continue to mature.
BUSINESS DRIVERS TOWARD A MARKETPLACE FOR
ENTERPRISE CLOUD COMPUTING
In order to create an overview of offerings and consuming players on the
market, it is important to understand the forces on the market and motivations
of each player.
The Porter model consists of five influencing factors/views (forces) on the
market (Figure 4.4). The intensity of rivalry on the market is traditionally influenced by industry-specific characteristics:
● Rivalry: The number of companies dealing with cloud and virtualization technology is quite high at the moment; this might be a sign of high rivalry. But the products and offers are also quite varied, so many niche products tend to become established.
FIGURE 4.4. Porter's five forces market model (adjusted for the cloud market): new market entrants (geographical factors, entrant strategy, routes to market), suppliers (level of quality, supplier's size, bidding processes/capabilities), the cloud market itself (cost structure, product/service ranges, differentiation and strategy, number/size of players), buyers/consumers (buyer size, number of buyers, product/service requirements), and technology development (substitutes, trends, legislative effects).
● Obviously, the cloud-virtualization market is presently booming and will
keep growing over the next few years. Therefore, the fight for customers and
the struggle for market share will begin once the market becomes saturated
and companies start offering comparable products.
● The initial costs for huge data centers are enormous. By building up
federations of computing and storing utilities, smaller companies can try
to make use of this scale effect as well.
● Low switching costs or high exit barriers influence rivalry. When a
customer can freely switch from one product to another, there is a greater
struggle to capture customers. From the opposite point of view, high exit
barriers discourage customers from buying into a new technology. The trends
towards standardization of formats and architectures try to address this
problem. Most current cloud providers only pay attention to standards
related to the interaction with the end user; standards for cloud
interoperability are still to be developed.
FIGURE 4.5. Dynamic business models (based on [49], extended by influence
factors identified by [50]): the business model is influenced by the market,
market regulations, technology, and the hype cycle phase.
THE CLOUD SUPPLY CHAIN
One indicator of what such a business model would look like is in the complexity
of deploying, securing, interconnecting and maintaining enterprise landscapes
and solutions such as ERP, as discussed in Section 4.3. The concept of a Cloud
Supply Chain (C-SC) and hence Cloud Supply Chain Management (C-SCM)
appear to be viable future business models for the enterprise cloud computing
paradigm. The idea of C-SCM represents the management of a network of
interconnected businesses involved in the end-to-end provision of product and
service packages required by customers. The established understanding of a
supply chain is two or more parties linked by a flow of goods, information,
and funds [55], [56]. A specific definition for a C-SC is hence: "two or more
parties linked by the provision of cloud services, related information,
and funds." Figure 4.6 represents a concept for the C-SC, showing the flow
of products along different organizations such as hardware suppliers, software
component suppliers, data center operators, distributors and the end customer.
Figure 4.6 also makes a distinction between innovative and functional
products in the C-SC. Fisher classifies products primarily on the basis of their
demand patterns into two categories: primarily functional or primarily
innovative [57]. Due to their stability, functional products favor competition,
which leads to low profit margins and, as a consequence of their properties, to
low inventory costs, low product variety, low stockout costs, and low
obsolescence [58], [57]. Innovative products are characterized by additional
reasons, beyond basic needs, that lead a customer to purchase; by
unpredictable demand (that is, high uncertainty, and demand that is difficult
to forecast and variable); and by short product life cycles (typically 3 months to 1
year).
FIGURE 4.6. Cloud supply chain (C-SC): cloud services, information, and funds
flow from hardware suppliers and component suppliers through data center
operators and distributors to the end customer, with potential closed-loop
cooperation; products along the chain may be functional or innovative.
Cloud services
should fulfill basic needs of customers and favor competition due to their
reproducibility. Table 4.1 presents a comparison of traditional supply chain
concepts, such as the efficient SC and the responsive SC, with a new concept
for the emerging ICT area of cloud computing, with cloud services as the
traded products.
TABLE 4.1. Comparison of Traditional and Emerging ICT Supply Chains (a)

Primary goal
  Efficient SC (traditional): supply demand at the lowest level of cost.
  Responsive SC (traditional): respond quickly to demand (changes).
  Cloud SC (emerging ICT): supply demand at the lowest level of costs and
  respond quickly to demand.

Product design strategy
  Efficient SC: maximize performance at the minimum product cost.
  Responsive SC: create modularity to allow postponement of product
  differentiation.
  Cloud SC: create modularity to allow individual setting while maximizing
  the performance of services.

Pricing strategy
  Efficient SC: lower margins, because price is a prime customer driver.
  Responsive SC: higher margins, because price is not a prime customer driver.
  Cloud SC: lower margins, as there is high competition and comparable
  products.

Manufacturing strategy
  Efficient SC: lower costs through high utilization.
  Responsive SC: maintain capacity flexibility to meet unexpected demand.
  Cloud SC: high utilization while reacting flexibly to demand.

Inventory strategy
  Efficient SC: minimize inventory to lower cost.
  Responsive SC: maintain buffer inventory to meet unexpected demand.
  Cloud SC: optimize buffers for unpredicted demand and best utilization.

Lead time strategy
  Efficient SC: reduce, but not at the expense of costs.
  Responsive SC: aggressively reduce, even if the costs are significant.
  Cloud SC: strong service-level agreements (SLAs) for ad hoc provision.

Supplier strategy
  Efficient SC: select based on cost and quality.
  Responsive SC: select based on speed, flexibility, and quantity.
  Cloud SC: select based on a complex optimum of speed, cost, and flexibility.

Transportation strategy
  Efficient SC: greater reliance on low-cost modes.
  Responsive SC: greater reliance on responsive modes.
  Cloud SC: implement highly responsive and low-cost modes.

(a) Based on references 54 and 57.
UNIT 3
VIRTUAL MACHINES PROVISIONING
AND MIGRATION SERVICES
Cloud computing is an emerging research infrastructure that builds on the
achievements of different research areas, such as service-oriented architecture
(SOA), grid computing, and virtualization technology. It offers infrastructure
as a service that is based on pay-as-you-use and on-demand computing models
to the end users (exactly the same as a public utility service like electricity,
water, gas, etc.). This service is referred to as Infrastructure as a Service
(IaaS). In this chapter, we shall focus on two core services that enable the
users to get the best out of the IaaS model in public and private cloud setups.
To make the concept clearer, consider this analogy for virtual machine
provisioning and its value: historically, when there was a need to install
a new server for a certain workload to provide a particular service for a
client, a lot of effort was exerted by the IT administrator, and much time was
spent installing and provisioning the new server, because the administrator had
to follow a specific checklist and procedures to perform this task. Now,
with the emergence of virtualization technology and the cloud computing IaaS
model, it is just a matter of minutes to achieve the same task.
Provisioning a new virtual machine is a matter of minutes, saving lots of time
and effort. Migrating a virtual machine is a matter of milliseconds, saving
time and effort, keeping the service alive for customers, and meeting the SLA/
SLO agreements and quality-of-service (QoS) specifications required.
An overview of the chapter's highlights and sections can be grasped from the
mind map shown in Figure 5.1.
BACKGROUND AND RELATED WORK
In this section, we will have a quick look at previous work, give an overview
of virtualization technology, public clouds, private clouds, standardization
efforts, high availability through migration, and the provisioning of virtual
machines, and shed some light on distributed management tools.
Virtualization Technology Overview
Virtualization has revolutionized data center technology through a set of
techniques and tools that facilitate the provisioning and management of
dynamic data center infrastructure.
FIGURE 5.1. VM provisioning and migration mind map: an overview of the
chapter's topics, including virtualization technology, public and private
IaaS, cloud and virtualization standardization efforts (OVF, OCCI and OGF),
VM provisioning and manageability, the VM life cycle, VM migration services
(live, regular/cold, and live storage migration), Amazon provisioning
services, infrastructure enabling technologies (Eucalyptus, OpenNebula,
Aneka), VM provisioning and migration in action, and future research
directions.
FIGURE 5.2. A layered virtualization technology architecture: multiple
virtual machines, each running its own workload on its own guest OS, sit on
top of a virtualization layer (VMM or hypervisor) over the physical server
layer.
As shown in Figure 5.2, the virtualization layer partitions the physical
resources of the underlying physical server into multiple virtual machines
with different workloads. The fascinating thing about this virtualization
layer is that it schedules and allocates the physical resources, and makes
each virtual machine think that it totally owns the underlying hardware's
physical resources (processor, disks, RAM, etc.).
Virtual machine technology makes it very flexible and easy to manage
resources in cloud computing environments, because it improves the
utilization of those resources by multiplexing many virtual machines on one
physical host (server consolidation), as shown in Figure 5.2. These machines
can be scaled up and down on demand with a high level of resource
abstraction. Virtualization enables highly reliable and agile deployment
mechanisms and management of services, providing on-demand cloning and live
migration services which improve reliability. Accordingly, having an
effective management suite for managing the virtual machine infrastructure
is critical for any cloud computing infrastructure as a service (IaaS)
vendor.
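As a hedged illustration of the server consolidation idea described above (not
part of the original text), the following minimal Python sketch uses the
libvirt bindings to list the virtual machines multiplexed on one physical host
together with their resource allocations; the connection URI assumes a local
KVM/QEMU host and the libvirt-python package.

    # A minimal sketch (not the chapter's own tooling): enumerate the VMs
    # consolidated on a single physical host through the libvirt API.
    # Assumes libvirt-python is installed and a local hypervisor (e.g., KVM)
    # is reachable at qemu:///system.
    import libvirt

    conn = libvirt.openReadOnly("qemu:///system")   # read-only hypervisor connection
    for dom in conn.listAllDomains():               # every defined/running VM
        state, max_mem_kib, mem_kib, vcpus, cpu_time_ns = dom.info()
        running = "running" if state == libvirt.VIR_DOMAIN_RUNNING else "not running"
        print(f"{dom.name()}: {running}, {vcpus} vCPU(s), {mem_kib // 1024} MiB")
    conn.close()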
Public Cloud and Infrastructure Services
Public cloud or external cloud describes cloud computing in the traditional
mainstream sense, whereby resources are dynamically provisioned via publicly
accessible Web applications/Web services (SOAP or RESTful interfaces)
from an off-site third-party provider, who shares resources and bills on a
fine-grained utility computing basis; the user pays only for the capacity of
the provisioned resources at a particular time.
There are many examples of vendors who publicly provide infrastructure as
a service. Amazon Elastic Compute Cloud (EC2) is the best-known example,
but the market now bristles with competitors such as GoGrid, Joyent
Accelerator, Rackspace, AppNexus, FlexiScale, and Manjrasoft Aneka.
Here, we will briefly cover and describe the Amazon EC2 offering. Amazon
Elastic Compute Cloud (EC2) is an IaaS service that provides elastic compute
capacity in the cloud. These services can be leveraged via Web services
(SOAP or REST), a Web-based AWS (Amazon Web Services) management
console, or the EC2 command-line tools. The Amazon service provides hundreds
of pre-made AMIs (Amazon Machine Images) with a variety of operating
systems (i.e., Linux, OpenSolaris, or Windows) and pre-loaded software.
It provides you with complete control of your computing resources and lets
you run on Amazon's computing and infrastructure environment easily.
Amazon EC2 reduces the time required to obtain and boot new server
instances to minutes, allowing quick scaling of capacity and resources,
up and down, as the computing requirements change. Amazon offers
different instance sizes according to (a) the resources needed (small, large,
and extra large), (b) high CPU needs (medium and extra large high-CPU
instances), and (c) high-memory needs (extra large, double extra large,
and quadruple extra large instances).
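As a hedged sketch of how such an EC2 instance can be provisioned
programmatically, the following Python fragment uses the boto3 AWS SDK (a
present-day successor to the SOAP/command-line tooling described above); the
AMI ID, key pair name, and region are placeholders rather than values from the
text.

    # Minimal sketch of launching and later terminating an EC2 instance with
    # boto3. The AMI ID, key pair, and region below are hypothetical
    # placeholders; credentials come from the environment/AWS configuration.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI (e.g., a Linux image)
        InstanceType="t2.micro",           # small instance size
        KeyName="my-keypair",              # placeholder SSH key pair name
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]
    print("Launched", instance_id)

    # Wait until the instance is running, then (for this sketch) terminate it.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    ec2.terminate_instances(InstanceIds=[instance_id])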
Private Cloud and Infrastructure Services
A private cloud aims at providing public cloud functionality, but on private
resources, while maintaining control over an organization's data and resources
to meet security and governance requirements. A private cloud is typically a
highly virtualized cloud data center located inside your organization's
firewall. It may also be a private space dedicated to your company within a
cloud vendor's data center designed to handle the organization's workloads.
Private clouds exhibit the following characteristics:
● Allow service provisioning and compute capability for an organization's
users in a self-service manner.
● Automate and provide well-managed virtualized environments.
● Optimize computing resources and server utilization.
● Support specific workloads.
There are many examples for vendors and frameworks that provide
infrastructure as a service in private setups. The best-known examples are
Eucalyptus and OpenNebula (which will be covered in more detail later on).
It is also important to highlight a third type of cloud setup named "hybrid
cloud," in which a combination of private/internal and external cloud resources
exists, enabling the outsourcing of noncritical services and functions to the
public cloud while keeping the critical ones internal. A hybrid cloud's main
function is to draw on resources from a public cloud to handle sudden
demand surges, which is called "cloud bursting."
Distributed Management of Virtualization
Virtualization's benefits bring their own challenges and complexities, which
present the need for powerful management capabilities. That is why many
commercial and open source products and research projects such as OpenNebula,
IBM Virtualization Manager, Joyent, and VMware DRS are being developed to
dynamically provision virtual machines, utilizing the physical infrastructure.
High Availability
High availability is a system design protocol and an associated implementation
that ensures a certain absolute degree of operational continuity during a given
measurement period. Availability refers to the ability of the user community to
access the system, whether for submitting new work, updating or altering
existing work, or collecting the results of previous work. If a user cannot
access the system, it is said to be unavailable.
Since the virtual environment makes up a large part of any organization,
management of these virtual resources within this environment becomes a
critical mission, and the migration services for these resources become a
cornerstone in achieving high availability for the services hosted by VMs.
Cloud and Virtualization Standardization Efforts
Standardization is important to ensure interoperability between virtualization
management vendors, the virtual machines produced by each of them, and
cloud computing. Here, we will have a look at the prevalent standards that
make cloud computing and virtualization possible. In the past few years,
virtualization standardization efforts led by the Distributed Management Task
Force (DMTF) have produced standards such as the Open Virtualization Format
(OVF).
OCCI and OGF
Another standardization effort has been initiated by Open Grid Forum (OGF)
through organizing an official new working group to deliver a standard API for
cloud IaaS, the Open Cloud Computing Interface Working Group
(OCCI-WG). The new API for interfacing "IaaS" cloud computing facilities
will allow:
● Consumers to interact with cloud computing infrastructure on an ad hoc
basis.
● Integrators to offer advanced management services.
● Aggregators to offer a single common interface to multiple providers.
● Providers to offer a standard interface that is compatible with the
available tools.
● Vendors of grids/clouds to offer standard interfaces for dynamically
scalable service delivery in their products.
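To give a flavor of what such a standard IaaS interface looks like on the
wire, the hedged Python sketch below issues an OCCI-style HTTP request
(text/occi rendering) to create a compute resource; the endpoint URL and
credentials are hypothetical, and the attribute set a given provider accepts
may differ.

    # Hedged sketch of creating a compute resource through an OCCI (text/occi)
    # rendering over HTTP. The endpoint and credentials are hypothetical; real
    # deployments may require different authentication and attributes.
    import requests

    OCCI_ENDPOINT = "https://occi.example.org:8787/compute/"   # placeholder URL

    headers = {
        "Content-Type": "text/occi",
        "Category": 'compute; scheme="http://schemas.ogf.org/occi/infrastructure#"; class="kind"',
        "X-OCCI-Attribute": 'occi.core.title="vm1", occi.compute.cores=2, occi.compute.memory=2.0',
    }

    resp = requests.post(OCCI_ENDPOINT, headers=headers, auth=("user", "password"))
    resp.raise_for_status()
    # Many OCCI servers return the new resource's URI in the Location header.
    print("Created compute resource at:", resp.headers.get("Location"))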
VIRTUAL MACHINES PROVISIONING AND MANAGEABILITY
In this section, we will have an overview of the typical life cycle of a VM
and its major possible states of operation, which make the management and
automation of VMs in virtual and cloud environments easier than in traditional
computing environments.
As shown in Figure 5.3, the cycle starts with a request delivered to the IT
department, stating the requirement for creating a new server for a particular
service. The IT administration processes this request by examining the
servers' resource pool, matching these resources with the requirements, and
starting to provision the needed virtual machine. Once it is provisioned and
started, it is ready to provide the required service according to an SLA for
an agreed time period, after which the virtual machine is released and its
resources are freed.
FIGURE 5.3. Virtual machine life cycle: IT service request (infrastructure
requirements analysis, IT request); VM provision (load OS and appliances,
customize and configure, start the server); VMs in operation (serving
requests, migration services, scaling compute resources on demand); and
release of VMs (end of service, compute resources deallocated to other VMs).
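The life-cycle states above map fairly directly onto hypervisor management
APIs. As a hedged illustration (not part of the original chapter), the Python
sketch below walks an already-defined libvirt domain through start, suspend,
resume, and shutdown; the domain name is a placeholder.

    # Hedged sketch: driving the basic VM life cycle (start, suspend, resume,
    # shut down) through libvirt for a domain already defined on the host.
    # "web-server-01" is a placeholder domain name.
    import time
    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("web-server-01")

    if not dom.isActive():
        dom.create()          # power on (provisioned -> in operation)
    time.sleep(5)

    dom.suspend()             # pause the guest (state kept in memory)
    dom.resume()              # continue serving requests

    dom.shutdown()            # graceful shutdown (end of service)
    conn.close()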
VM Provisioning Process
Provisioning a virtual machine or server can be explained and illustrated as in
Figure 5.4:
Steps to Provision VM. Here, we describe the common and normal steps of
provisioning a virtual server:
● Firstly, you need to select a server from a pool of available servers
(physical servers with enough capacity) along with the appropriate OS
template you need to provision the virtual machine.
● Secondly, you need to load the appropriate software (operating system
you selected in the previous step, device drivers, middleware, and the
needed applications for the service required).
● Thirdly, you need to customize and configure the machine (e.g., IP
address, gateway) and to configure the associated network and storage
resources.
● Finally, the virtual server is ready to start with its newly loaded software.
Typically, these are the tasks required or performed by an IT or data
center specialist to provision a particular virtual machine.
To summarize, server provisioning means defining a server's configuration
based on the organization's requirements and its hardware and software
components (processor, RAM, storage, networking, operating system,
applications, etc.). Normally,
virtual machines can be provisioned by manually installing an operating
system, by using a preconfigured VM template, by cloning an existing VM, or
by importing a physical server or a virtual server from another hosting
platform. Physical servers can also be virtualized and provisioned using P2V
(physical to virtual) tools and techniques (e.g., virt-p2v).
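As a hedged, API-level sketch of template-based provisioning (independent of
any particular vendor tool named above), the fragment below defines and starts
a new KVM guest from a minimal libvirt XML description; the VM name and disk
image path are placeholders, and a real template would normally carry more
devices (network interfaces, consoles, and so on).

    # Hedged sketch: provision a VM from a minimal XML "template" via libvirt.
    # The name and disk image path are placeholders; production templates
    # normally include network interfaces, graphics, and more devices.
    import libvirt

    TEMPLATE = """
    <domain type='kvm'>
      <name>provisioned-vm-01</name>
      <memory unit='MiB'>2048</memory>
      <vcpu>2</vcpu>
      <os><type arch='x86_64'>hvm</type></os>
      <devices>
        <disk type='file' device='disk'>
          <driver name='qemu' type='qcow2'/>
          <source file='/var/lib/libvirt/images/provisioned-vm-01.qcow2'/>
          <target dev='vda' bus='virtio'/>
        </disk>
      </devices>
    </domain>
    """

    conn = libvirt.open("qemu:///system")
    dom = conn.defineXML(TEMPLATE)   # register the VM (persistent definition)
    dom.create()                     # start it: the VM is now provisioned and running
    print(dom.name(), "is running:", bool(dom.isActive()))
    conn.close()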
After creating a virtual machine by virtualizing a physical server, or by
building a new virtual server in the virtual environment, a template can be
created out of it. Most virtualization management vendors (VMware, XenServer,
etc.) provide the data center's administration with the ability to do such
tasks in an easy way.
FIGURE 5.4. Virtual machine provision process: starting from the servers
pool, load the OS and appliances (from an appliances repository), customize
and configure, install patches, and start the server to obtain a running
provisioned VM.
Provisioning from a template is an invaluable feature, because it
reduces the time required to create a new virtual machine.
Administrators can create different templates for different purposes. For
example, you can create a Windows 2003 Server template for the finance
department, or a Red Hat Linux template for the engineering department. This
enables the administrator to quickly provision a correctly configured virtual
server on demand.
This ease and flexibility bring with them the problem of virtual machine
sprawl, where virtual machines are provisioned so rapidly that documenting
and managing the virtual machine life cycle becomes a challenge.
VIRTUAL MACHINE MIGRATION SERVICES
Migration service, in the context of virtual machines, is the process of moving a
virtual machine from one host server or storage location to another. There are
different techniques of VM migration: hot/live migration, cold/regular
migration, and live storage migration of a virtual machine [20]. In this process,
all key machine components, such as CPU, storage disks, networking, and
memory, are completely virtualized, thereby facilitating the entire state of a
virtual machine to be captured by a set of easily moved data files. We will
cover some of the migration techniques that most virtualization tools provide
as a feature.
Migrations Techniques
Live Migration and High Availability. Live migration (which is also called
hot or real-time migration) can be defined as the movement of a virtual
machine from one physical host to another while being powered on. When it is
properly carried out, this process takes place without any noticeable effect from
the end user's point of view (a matter of milliseconds). One of the most
significant advantages of live migration is the fact that it facilitates proactive
maintenance in case of failure, because the potential problem can be resolved
before the disruption of service occurs. Live migration can also be used for load
balancing in which work is shared among computers in order to optimize the
utilization of available CPU resources.
Live Migration Anatomy, Xen Hypervisor Algorithm. In this section we
will explain the live migration mechanism and how memory and virtual machine
states are transferred, over the network, from one host A to another
host B [21]; the Xen hypervisor is used as an example of this mechanism. The logical
steps that are executed when migrating an OS are summarized in Figure 5.5. In
this research, the migration process has been viewed as a transactional
interaction between the two hosts involved:
Stage 0: Pre-Migration. An active virtual machine exists on the physical
host A.
FIGURE 5.5. Live migration timeline [21]: Stage 0, Pre-Migration (VM running
normally, active on host A; an alternate physical host may be preselected for
migration, with block devices mirrored and free resources maintained); Stage
1, Reservation (initialize a container on the target host); Stage 2,
Iterative Pre-copy (enable shadow paging and copy dirty pages in successive
rounds; overhead due to copying); Stage 3, Stop-and-Copy (VM out of service:
suspend the VM on host A, generate ARP to redirect traffic to host B, and
synchronize all remaining VM state to host B; downtime); Stage 4, Commitment
(VM state on host A is released); Stage 5, Activation (VM starts on host B,
connects to local devices, and resumes normal operation; VM running normally
on host B).
Stage 1: Reservation. A request is issued to migrate an OS from host A to
host B (a precondition is that the necessary resources exist on B and on a
VM container of that size).
Stage 2: Iterative Pre-Copy. During the first iteration, all pages are
transferred from A to B. Subsequent iterations copy only those pages
dirtied during the previous transfer phase.
Stage 3: Stop-and-Copy. Running OS instance at A is suspended, and its
network traffic is redirected to B. As described in reference 21, CPU state
and any remaining inconsistent memory pages are then transferred. At
the end of this stage, there is a consistent suspended copy of the VM at
both A and B. The copy at A is considered primary and is resumed in case
of failure.
Stage 4: Commitment. Host B indicates to A that it has successfully received
a consistent OS image. Host A acknowledges this message as a
commitment of the migration transaction. Host A may now discard the
original VM, and host B becomes the primary host.
Stage 5: Activation. The migrated VM on B is now activated. Post-migration
code runs to reattach the device drivers to the new machine and
advertise moved IP addresses.
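The interplay between Stage 2 and Stage 3 is easiest to see with a small
numerical sketch. The Python fragment below is an illustrative simulation, not
Xen's actual implementation: it iterates pre-copy rounds until the set of
re-dirtied pages is small enough, then charges the remainder to the
stop-and-copy downtime. The memory size, dirty rate, and link bandwidth are
made-up parameters.

    # Illustrative simulation of iterative pre-copy (not Xen's real code).
    # Parameters (VM memory, page dirty rate, bandwidth, thresholds) are made up.
    PAGE_KB = 4
    vm_memory_kb = 800 * 1024          # e.g., an 800-MB guest
    dirty_rate_pages_s = 5_000         # pages dirtied per second while running
    link_kb_s = 100 * 1024             # migration bandwidth (~100 MB/s)
    stop_copy_threshold_pages = 2_000  # small enough to stop and copy
    max_rounds = 30

    to_send_pages = vm_memory_kb // PAGE_KB   # round 1 sends all memory
    for rnd in range(1, max_rounds + 1):
        transfer_s = to_send_pages * PAGE_KB / link_kb_s
        # Pages dirtied while this round was transferring must be resent next round.
        redirtied = int(dirty_rate_pages_s * transfer_s)
        print(f"round {rnd}: sent {to_send_pages} pages in {transfer_s:.2f}s, "
              f"{redirtied} pages re-dirtied")
        if redirtied <= stop_copy_threshold_pages:
            downtime_s = redirtied * PAGE_KB / link_kb_s
            print(f"stop-and-copy: {redirtied} pages, downtime ~{downtime_s*1000:.0f} ms")
            break
        to_send_pages = redirtied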
This approach to failure management ensures that at least one host has a
consistent VM image at all times during migration. It depends on the
assumption that the original host remains stable until the migration commits
and that the VM may be suspended and resumed on that host with no risk of
failure. Based on these assumptions, a migration request essentially attempts to
move the VM to a new host and on any sort of failure, execution is resumed
locally, aborting the migration.
Live Migration Effect on a Running Web Server. Clark et al. [21]
evaluated the above migration approach on an Apache 1.3 Web server serving
static content at a high rate, as illustrated in Figure 5.6. The throughput was
measured while continuously serving a single 512-kB file to a set of one hundred
concurrent clients. The Web server virtual machine had a memory allocation of
800 MB. At the start of the trace, the server achieves a consistent throughput
of approximately 870 Mbit/sec. Migration starts 27 sec into the trace, but is
initially rate-limited to 100 Mbit/sec (12% CPU), causing the server's
throughput to drop to 765 Mbit/sec. This initial low-rate pass transfers 776 MB
and lasts for 62 sec. At this point, the migration algorithm, described in
Section 5.4.1, increases its rate over several iterations and finally suspends the
VM after a further 9.8 sec. The final stop-and-copy phase then transfers the
remaining pages, and the Web server resumes at full rate after a 165-msec
outage.
This simple example demonstrates that a highly loaded server can be
migrated with both controlled impact on live services and a short downtime.
However, the working set of the server, in this case, is rather small. So, this
should be expected as a relatively easy case of live migration.
Live Migration Vendor Implementations Examples. There are lots of
VM management and provisioning tools that provide the live migration
of VM facility, two of which are VMware VMotion and Citrix XenServer
―XenMotion.‖
FIGURE 5.6. Results of migrating a running Web server VM [21]: throughput
(Mbit/sec) versus elapsed time (secs) while serving 512-kB files to 100
concurrent clients. The trace shows the first pre-copy pass (62 secs) at
870 Mbit/sec with throughput dropping to 765 Mbit/sec, the effect of further
iterations (9.8 secs) at 694 Mbit/sec, and a total downtime of 165 ms.
VMware VMotion. This allows users to (a) automatically optimize and
allocate an entire pool of resources for maximum hardware utilization,
flexibility, and availability and (b) perform hardware maintenance without
scheduled downtime, along with migrating virtual machines away from failing
or underperforming servers [22].
Citrix XenServer XenMotion. This is a nice feature of the Citrix XenServer
product, inherited from the Xen live migrate utility, which provides the IT
administrator with the facility to move a running VM from one XenServer to
another in the same pool without interrupting the service (hypothetically for
zero-downtime server maintenance, which actually takes minutes), making it a
highly available service. This can also be a good feature for balancing
workloads in the virtualized environment [23].
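Beyond VMotion and XenMotion, the same live migration capability is exposed by
open-source stacks through the libvirt API. The following hedged Python sketch
migrates a running domain from a source KVM host to a destination host; the
host URIs and domain name are placeholders, and shared storage for the guest's
disks is assumed, as discussed above.

    # Hedged sketch: live-migrate a running VM between two KVM hosts via libvirt.
    # Host URIs and the domain name are placeholders; the guest's disks are
    # assumed to live on shared storage visible to both hosts.
    import libvirt

    src = libvirt.open("qemu+ssh://source-host/system")
    dst = libvirt.open("qemu+ssh://destination-host/system")

    dom = src.lookupByName("web-server-01")

    flags = (libvirt.VIR_MIGRATE_LIVE               # keep the guest running while copying
             | libvirt.VIR_MIGRATE_PERSIST_DEST     # define the VM on the target
             | libvirt.VIR_MIGRATE_UNDEFINE_SOURCE) # drop the definition on the source

    new_dom = dom.migrate(dst, flags, None, None, 0)  # returns the domain on the target
    print("VM now running on destination:", bool(new_dom.isActive()))

    src.close()
    dst.close()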
Regular/Cold Migration. Cold migration is the migration of a powered-off
virtual machine. With cold migration, you have the option of moving the
associated disks from one data store to another, and the virtual machines are
not required to be on shared storage. It is important to highlight the two main
differences between live migration and cold migration: live migration needs
shared storage for virtual machines in the server pool, whereas cold migration
does not; also, in live migration of a virtual machine between two hosts,
certain CPU compatibility checks are applied, while in cold migration these
checks do not apply. The cold migration process is simple to implement (as is
the case for the VMware product), and it can be summarized as follows [24]:
● The configuration files, including the NVRAM file (BIOS settings), log
files, as well as the disks of the virtual machine, are moved from the source
host to the destination host's associated storage area.
● The virtual machine is registered with the new host.
● After the migration is completed, the old version of the virtual machine is
deleted from the source host.
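A hedged, simplified sketch of these cold migration steps, driven through
libvirt and an out-of-band file copy, is shown below; host names, paths, and
the domain name are placeholders, and VMware's actual implementation differs
in its details.

    # Hedged sketch of a cold (offline) migration: stop the VM, copy its disk
    # and configuration to the destination, register it there, and remove it
    # from the source. Names, URIs, and paths are placeholders; error handling
    # is omitted.
    import subprocess
    import libvirt

    DISK = "/var/lib/libvirt/images/web-server-01.qcow2"   # placeholder disk path

    src = libvirt.open("qemu+ssh://source-host/system")
    dst = libvirt.open("qemu+ssh://destination-host/system")

    dom = src.lookupByName("web-server-01")
    if dom.isActive():
        dom.destroy()                  # power off (cold migration needs a stopped VM)

    xml = dom.XMLDesc(0)               # capture the VM's configuration

    # Copy the (non-shared) disk image to the destination host.
    subprocess.run(["scp", DISK, f"destination-host:{DISK}"], check=True)

    dst.defineXML(xml)                 # register the VM with the new host
    dom.undefine()                     # remove the old definition from the source

    src.close()
    dst.close()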
Live Storage Migration of Virtual Machine. This kind of migration
constitutes moving the virtual disks or configuration file of a running virtual
machine to a new data store without any interruption in the availability of the
virtual machine's service. For more details about how this option works in
a VMware product, see reference 20.
VM Migration, SLA and On-Demand Computing
As we discussed, virtual machine migration plays an important role in data
centers by making it easy to adjust resource priorities to match resource
demand conditions.
This role goes directly in the direction of meeting SLAs: once it has been
detected that a particular VM is consuming more than its fair share of
resources at the expense of other VMs on the same host, this machine becomes
eligible either to be moved to another, underutilized host or to be assigned
more resources, in case the host machine still has some; this in turn largely
avoids violations of the SLA and also fulfills the requirements of on-demand
computing resources. In order to achieve such goals, there should be an
integration between the virtualization management tools (with their migration
and performance monitoring capabilities) and the SLA management tools, so that
resources are balanced by migrating and monitoring the workloads and,
accordingly, the SLA is met.
Migration of Virtual Machines to Alternate Platforms
One of the nicest advantages offered by data center technologies is the
ability to migrate virtual machines from one platform to another. There are
a number of ways of achieving this, depending on the source and target
virtualization platforms and on the vendor tools that manage this facility;
for example, the VMware Converter handles migrations between ESX hosts,
VMware Server, and VMware Workstation. The VMware Converter can also import
from other virtualization platforms, such as Microsoft Virtual Server
machines.
VM PROVISIONING AND MIGRATION IN ACTION
Now, it is time to get down to business with a real example of how we can
manage the life cycle of a virtual machine, provision it, and migrate it with
the help of one of the open source frameworks used to manage virtualized
infrastructure. Here, we will use ConVirt [25] (an open source framework for
the management of open source virtualization platforms like Xen [26] and KVM
[27], known previously as XenMan).
Deployment Scenario. A ConVirt deployment consists of at least one ConVirt
workstation, where ConVirt is installed and run, which provides the main
console for managing the VM life cycle, managing images, provisioning new
VMs, monitoring machine resources, and so on. There are two essential
deployment scenarios for ConVirt: (A) a basic configuration in which the Xen
or KVM virtualization platform is on the local machine, where ConVirt is
already installed; (B) an advanced configuration in which Xen or KVM is on
one or more remote servers. The scenario in use here is the advanced one. In
data centers, it is very common to install centralized management software
(ConVirt here) on a dedicated machine for use in managing remote servers
in the data center. In our example, we will use this dedicated machine
where ConVirt is installed and used to manage a pool of remote servers
(two machines). In order to use advanced features of ConVirt (e.g., live
migration), you should set up a shared storage for the server pool in use on
which the disks of the provisioned virtual machines are stored. Figure 5.7
illustrates the scenario.
Installation. The installation process involves the following:
● Installing ConVirt on at least one computer. See reference 28 for
installation details.
● Preparing each managed server to be managed by ConVirt. See reference 28
for managed server installation details. We have two managed servers
with the following IPs (managed server 1, IP 172.16.2.22; and managed
server 2, IP 172.16.2.25), as shown in the deployment diagram (Figure 5.7).
● Starting ConVirt and discovering the managed servers you have prepared.
Notes
● Try to follow the installation steps in reference 28 according to the
distribution of the operating system in use. In our experiment, we used
Ubuntu 8.10 in our setup.
● Make sure that the managed servers have the Xen or KVM hypervisor
installed.
● Make sure that you can access managed servers from your ConVirt
management console through SSH.
FIGURE 5.7. A deployment scenario network diagram: a management console
manages two servers (managed server 1, IP 172.16.2.22, and managed server 2,
IP 172.16.2.25), both connected to shared storage (iSCSI or NFS).
Environment, Software, and Hardware. ConVirt 1.1, Linux Ubuntu 8.10,
three machines with Dell Core 2 Duo processors and 4 GB RAM.
Adding Managed Servers and Provisioning VM. Once the installation is
done and you are ready to manage your virtual infrastructure, you can
start the ConVirt management console (see Figure 5.8):
● Select any of the existing server pools (QA Lab in our scenario) and, on its
context menu, select "Add Server."
● You will be faced with a message asking about the virtualization platform
you want to manage (Xen or KVM), as shown in Figure 5.9:
● Choose KVM, and then enter the managed server information and
credentials (IP, username, and password) as shown in Figure 5.10.
● Once the server is synchronized and authenticated with the management
console, it will appear in the left pane of the ConVirt console, as shown in
Figure 5.11.
● Select this server, and start provisioning your virtual machine as in
Figure 5.12:
● Fill in the virtual machine's information (name, storage, OS template, etc.;
Figure 5.13); then you will find it created on the managed server tree
powered-off.
Note: While provisioning your virtual machine, make sure that you create
disks on the shared storage (NFS or iSCSi). You can do so by selecting
the "provisioning" tab and changing the VM_DISKS_DIR to point to the
location of your shared NFS.
FIGURE 5.8. Adding a managed server on the data center's management console.
FIGURE 5.9. Select virtualization platform.
FIGURE 5.10. Managed server info and credentials.
FIGURE 5.11. Managed server has been added.
FIGURE 5.12. Provision a virtual machine.
FIGURE 5.13. Configuring the virtual machine.
● Start your VM (Figures 5.14 and 5.15), and make sure the installation
media of the operating system you need is placed in the drive, in order to use
it for booting the new VM and proceeding with the installation; then start
the installation process as shown in Figure 5.16.
● Once the installation finishes, you can access your provisioned virtual
machine from the console icon at the top of your ConVirt management
console.
● Reaching this step, you have created your first managed server and
provisioned virtual machine. You can repeat the same procedure to add
the second managed server in your pool to be ready for the next step of
migrating one virtual machine from one server to the other.
VM Life Cycle and VM Monitoring
While working with ConVirt, you can notice that you are able to manage the
whole life cycle of the virtual machine: start, stop, reboot, migrate, clone, and
so on. You can also notice how easy it is to monitor the resources of the managed
servers and the virtual machine guests, which helps you balance and control the
load on these managed servers when needed. In the next section, we are going to
discuss how easy it is to migrate a virtual machine from host to host.
FIGURE 5.14. Provisioned VM ready to be started.
FIGURE 5.15. Provisioned VM started.
FIGURE 5.16. VM booting from the installation CD to start the installation process.
Live Migration
The ConVirt tool allows running virtual machines to be migrated from one server
to another [29]. This feature makes it possible to organize the virtual-machine-
to-physical-machine relationship to balance the workload; for example, a VM
needing more CPU can be moved to a machine that has available CPU cycles, or,
in other cases, the host machine can be taken down for maintenance. For proper
VM migration, the following points must be considered [29]:
● Shared storage for all Guest OS disks (e.g., NFS, or iSCSI).
● Identical mount points on all servers (hosts).
● The kernel and ramdisk, when using para-virtualized virtual machines,
should also be shared. (This is not required if pygrub is used.)
● Centrally accessible installation media (iso).
● It is preferable to use identical machines with the same version of
virtualization platform.
● Migration needs to be done within the same subnet.
Migration Process in ConVirt
● To start the migration of a virtual machine from one host to the other,
select it and choose to migrate the virtual machine, as shown in Figure 5.17.
● You will have a window containing all the managed servers in your data
center (as shown in Figure 5.18). Choose one as a destination and start the
migration, or drag the VM and drop it onto another managed server to
initiate the migration.
FIGURE 5.17. VM migration.
FIGURE 5.18. Select the destination managed server candidate for migration.
● Once the virtual machine has been successfully placed and migrated to
the destination host, you can see it still living and working (as shown in
Figure 5.19).
FIGURE 5.19. VM started on the destination server after migration.
Final Thoughts about the Example
This is just a demonstrative example of how to provision and migrate virtual
machines; however, there are more tools and vendors that offer virtual
infrastructure management, such as Citrix XenServer, VMware vSphere, and so
on.
PROVISIONING IN THE CLOUD CONTEXT
In the cloud context, we shall discuss systems that provide virtual machine
provisioning and migration services. Amazon EC2 is a widely known example
of a vendor that provides public cloud services. Also, Eucalyptus and
OpenNebula are two complementary, enabling technologies for open source
cloud tools, which play an invaluable role in infrastructure as a service and in
building private, public, and hybrid cloud architectures.
Eucalyptus is a system for implementing on-premise private and hybrid clouds
using the hardware and software infrastructure that is in place, without
modification. The current interface to Eucalyptus is compatible with Amazon's
EC2, S3, and EBS interfaces, but the infrastructure is designed to support
multiple client-side interfaces. Eucalyptus is implemented using commonly
available Linux tools and basic Web service technologies [30]. Eucalyptus
adds capabilities such as end-user customization, self-service provisioning, and
legacy application support to data center virtualization features, making
IT customer service easier. On the other hand, OpenNebula is a virtual
infrastructure manager that orchestrates storage, network, and virtualization
technologies to enable the dynamic placement of multi-tier services on
distributed infrastructures, combining both data center resources and remote
cloud resources according to allocation policies. OpenNebula provides
internal cloud administration and user interfaces for the full management of
the cloud platform.
Amazon Elastic Compute Cloud
Amazon EC2 (Elastic Compute Cloud) is a Web service that allows
users to provision new machines into Amazon's virtualized infrastructure in a
matter of minutes; using a publicly available API (application programming
interface), it reduces the time required to obtain and boot a new server. Users
get full root access and can install almost any OS or application in their AMIs
(Amazon Machine Images). Web service APIs allow users to reboot their
instances remotely, scale capacity quickly when needed, and use on-demand
services by adding tens, or even hundreds, of machines. It is very important to
mention that there is no up-front hardware setup and there are no installation
costs, because Amazon charges only for the capacity you actually use.
An EC2 instance is typically a virtual machine with a certain amount of RAM,
CPU, and storage capacity.
Setting up an EC2 instance is quite easy. Once you create your AWS
(Amazon Web Services) account, you can use the online AWS console, or
simply download the offline command-line tools, to start provisioning your
instances.
Amazon EC2 provides its customers with three flexible purchasing models to
make cost optimization easy:
● On-Demand instances, which allow you to pay a fixed rate by the hour
with no commitment.
● Reserved instances, which allow you to pay a low, one-time fee and in turn
receive a significant discount on the hourly usage charge for that instance.
It ensures that any reserved instance you launch is guaranteed to succeed
(provided that you have booked it in advance). This means that users
of these instances should not be affected by any transient limitations in
EC2 capacity.
● Spot instances, which enable you to bid whatever price you want for
instance capacity, providing for even greater savings, if your applications
have flexible start and end times.
Amazon and Provisioning Services. Amazon provides an excellent set of
tools that help with provisioning services. Amazon Auto Scaling [30] is a set
of command-line tools that allows Amazon EC2 capacity to be scaled up or down
automatically according to the conditions the end user defines. This feature
ensures that the number of Amazon EC2 instances can scale up seamlessly
during demand spikes to maintain performance and can scale down
automatically when loads diminish and become less intensive to minimize the
costs. Auto Scaling service and CloudWatch [31] (a monitoring service for
AWS cloud resources and their utilization) help in exposing functionalities
required for provisioning application services on Amazon EC2.
Amazon Elastic Load Balancer [32] is another service that helps in building
fault-tolerant applications by automatically distributing incoming application
workload across available Amazon EC2 instances in multiple availability
zones.
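A hedged sketch of how these pieces fit together through the present-day boto3
SDK is shown below (the original chapter refers to the earlier command-line
tools); all names, the AMI ID, the availability zone, and the CPU threshold
are placeholders.

    # Hedged sketch: an Auto Scaling group plus a CloudWatch alarm that adds
    # one instance when average CPU exceeds 70%. All names and IDs are
    # placeholders.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    autoscaling.create_launch_configuration(
        LaunchConfigurationName="web-lc",
        ImageId="ami-0123456789abcdef0",       # placeholder AMI
        InstanceType="t2.micro",
    )
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchConfigurationName="web-lc",
        MinSize=1,
        MaxSize=4,
        AvailabilityZones=["us-east-1a"],
    )
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-asg",
        PolicyName="scale-out-on-cpu",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=1,
    )
    cloudwatch.put_metric_alarm(
        AlarmName="web-asg-high-cpu",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=2,
        Threshold=70.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],    # trigger the scale-out policy
    )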
Infrastructure Enabling Technology
Offering infrastructure as a service requires software and platforms that can
manage the infrastructure that is being shared and dynamically provisioned.
For this, there are three noteworthy technologies to be considered: Eucalyptus,
OpenNebula, and Aneka.
Eucalyptus
Eucalyptus is an open-source infrastructure for the implementation of cloud
computing on computer clusters. It is considered one of the earliest tools
developed for surge computing, in which a data center's private cloud can
augment its ability to handle workload spikes by sending overflow work to a
public cloud. Its name is an acronym for "elastic utility computing
architecture for linking your programs to useful systems." Here are some of
the Eucalyptus features:
● Interface compatibility with EC2, and S3 (both Web service and Query/
REST interfaces).
● Simple installation and deployment.
● Support for most Linux distributions (source and binary packages).
● Support for running VMs that run atop the Xen hypervisor or KVM.
Support for other kinds of VMs, such as VMware, is targeted for future
releases.
● Secure internal communication using SOAP with WS-Security.
● Cloud administrator tools for system management and user accounting.
● The ability to configure multiple clusters each with private internal
network addresses into a single cloud.
Eucalyptus aims at fostering research in models for service provisioning,
scheduling, SLA formulation, and hypervisor portability.
Eucalyptus Architecture. The Eucalyptus architecture, as illustrated in
Figure 5.20, constitutes each high-level system component as a stand-alone
Web service, with the following high-level components.
FIGURE 5.20. Eucalyptus high-level architecture: a client-side interface (via
the network), a client-side API translator, a cloud controller backed by a
database, the cluster controller, Walrus (S3), the storage controller (EBS),
and the node controller.
● Node controller (NC) controls the execution, inspection, and termination
of VM instances on the host where it runs.
● Cluster controller (CC) gathers information about and schedules VM
execution on specific node controllers, as well as manages virtual instance
network.
● Storage controller (SC) is a put/get storage service that implements
Amazon's S3 interface and provides a way for storing and accessing
VM images and user data.
● Cloud controller (CLC) is the entry point into the cloud for users and
administrators. It queries node managers for information about resources,
makes high-level scheduling decisions, and implements them by making
requests to cluster controllers.
● Walrus (W) is the controller component that manages access to the
storage services within Eucalyptus. Requests are communicated to Walrus
using the SOAP or REST-based interface.
Its design is an open and elegant one. It can be very beneficial for testing and
debugging purposes before deploying on a real cloud. For more details about
the Eucalyptus architecture and design, check reference 11.
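Because Eucalyptus keeps interface compatibility with EC2, EC2 client tooling
can, in principle, be pointed at a private Eucalyptus cloud simply by
overriding the service endpoint. The hedged Python sketch below does this with
boto3; the endpoint URL, credentials, and region label are placeholders, and
the exact endpoint path depends on the Eucalyptus version.

    # Hedged sketch: reuse EC2-compatible tooling against a private Eucalyptus
    # cloud by overriding the endpoint. URL, credentials, and region are
    # placeholders; the endpoint path varies across Eucalyptus versions.
    import boto3

    euca = boto3.client(
        "ec2",
        endpoint_url="http://eucalyptus.example.internal:8773/services/compute",
        aws_access_key_id="EUCA_ACCESS_KEY",
        aws_secret_access_key="EUCA_SECRET_KEY",
        region_name="eucalyptus",
    )

    # The same EC2-style calls now address the on-premise cloud.
    images = euca.describe_images(Owners=["self"])
    for image in images.get("Images", []):
        print(image["ImageId"], image.get("Name", ""))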
Ubuntu Enterprise Cloud and Eucalyptus. Ubuntu Enterprise Cloud
(UEC) [33] is a new initiative by Ubuntu to make it easier to provision, deploy,
configure, and use cloud infrastructures based on Eucalyptus. UEC brings
Amazon EC2-like infrastructure capabilities inside the firewall.
This is by far the simplest way to install and try Eucalyptus. Just download
the Ubuntu server version and install it wherever you want. UEC is also the
first open source project that lets you create cloud services in your local
environment easily and leverage the power of cloud computing.
VM Dynamic Management Using OpenNebula
OpenNebula is an open and flexible tool that fits into existing data center
environments to build any type of cloud deployment. OpenNebula can be
used primarily as a virtualization tool to manage your virtual infrastructure,
which is usually referred to as a private cloud. OpenNebula supports a hybrid
cloud to combine local infrastructure with public cloud-based infrastructure,
enabling highly scalable hosting environments. OpenNebula also supports
public clouds by providing cloud interfaces that expose its functionality for
virtual machine, storage, and network management. OpenNebula is one of the
technologies being enhanced in the Reservoir Project [14], a European research
initiative in virtualized infrastructures and cloud computing.
OpenNebula architecture is shown in Figure 5.21, which illustrates the
existence of public and private clouds and also the resources being managed by
its virtual manager.
OpenNebula is an open-source alternative to commercial tools for
the dynamic management of VMs on distributed resources. The tool supports
several research lines in advance reservation of capacity, probabilistic
admission control, placement optimization, resource models for the efficient
management of groups of virtual machines, elasticity support, and so on.
These research lines address the requirements of both types of clouds,
namely, private and public.
OpenNebula and Haizea. Haizea is an open-source virtual machine-based
lease management architecture developed by Sotomayor et al. [34]; it can be
used as a scheduling backend for OpenNebula. Haizea uses leases as a
fundamental resource provisioning abstraction and implements those leases as
virtual machines, taking into account the overhead of using virtual machines
when scheduling leases. Haizea also provides advanced functionality such as
[35]:
● Advance reservation of capacity.
● Best-effort scheduling with backfilling.
● Resource preemption (using VM suspend/resume/migrate).
● Policy engine, allowing developers to write pluggable scheduling policies
in Python.
FIGURE 5.21. OpenNebula high-level architecture [14]: cloud users and local
users/administrators interact through the cloud service and local interfaces
with the virtual infrastructure manager and its scheduler, which drive the
virtualization, storage, and network layers of the local infrastructure and
can also reach out to an external public cloud.
Aneka
Manjrasoft Aneka is a .NET-based platform and framework designed for
building and deploying distributed applications on clouds. It provides a set of
APIs for transparently exploiting distributed resources and expressing the
business logic of applications by using the preferred programming abstractions.
Aneka is also a market-oriented cloud platform since it allows users to build and
schedule applications, provision resources, and monitor results using pricing,
accounting, and QoS/SLA services in private and/or public cloud environments.
It allows end users to build an enterprise/private cloud setup by exploiting
the power of computing resources in the enterprise data centers, public clouds
such as Amazon EC2, and hybrid clouds by combining enterprise private
clouds managed by Aneka with resources from Amazon EC2 or other
enterprise clouds built and managed using technologies such as XenServer.
Aneka also provides support for deploying and managing clouds. By using
its Management Studio and a set of Web interfaces, it is possible to set up either
public or private clouds, monitor their status, update their configuration, and
perform the basic management operations.
Aneka Architecture. The Aneka platform architecture, as illustrated in Figure
5.22, consists of a collection of physical and virtualized resources
connected through a network. Each of these resources hosts an instance of the
Aneka container representing the runtime environment where the distributed
applications are executed. The container provides the basic management
features of the single node and leverages all the other operations on the services
that it is hosting. The services are broken up into fabric, foundation, and
execution services. Fabric services directly interact with the node through the
platform abstraction layer (PAL) and perform hardware profiling and dynamic
resource provisioning. Foundation services identify the core system of the
Aneka middleware, providing a set of basic features to enable Aneka containers
to perform specialized and specific sets of tasks. Execution services directly deal
with the scheduling and execution of applications in the cloud.
FUTURE RESEARCH DIRECTIONS
Virtual machine provisioning and migration services are active areas of
research aimed at getting the best out of their objectives. Here is a list of
potential candidate areas for research:
● Self-adaptive and dynamic data centers.
Data centers exist on the premises of any hosting provider or ISP that hosts
different Web sites and applications. These sites are accessed with different
timing patterns (morning hours, afternoon, etc.), so workloads against these
sites need to be tracked because they vary dynamically over time. The sizing
of host machines (the number of virtual machines that host these applications)
represents a challenge, and there is a potential research area here in studying
the performance impact and overhead due to the dynamic creation of virtual
machines hosted in these self-adaptive data centers, in order to manage Web
sites properly.
Studying performance in this dynamic environment will also tackle the
balance that should exist between rapid response times of individual
applications, the overall performance of the data center, and the high
availability of the applications and their services.
● Performance evaluation and workload characterization of virtual
workloads.
It is invaluable in any virtualized infrastructure to have a notion of
the workload provisioned in each VM, the performance impact of the
hypervisor layer, and the overhead due to consolidated workloads for such
systems; yet this is not a deterministic process. A single-workload benchmark
is useful in quantifying the virtualization overhead within a single VM, but not
in a whole virtualized environment with multiple isolated VMs running
varying workloads, which makes it unable to capture the system's overall
behavior. So, there is a great need for a common workload model and
methodology for virtualized systems, so that benchmark results can be
compared across different platforms. This will also help with dynamic workload
relocation and migration services.
FIGURE 5.22. Manjrasoft Aneka layered architecture: application/programming
models on top; foundation services (membership, reservation, storage, and
license and accounting services); fabric services (dynamic resource
provisioning services, hardware profile services); security and persistence as
cross-cutting concerns; and an infrastructure layer (.NET on Windows, Mono on
Linux) over physical and virtual machines spanning private clouds, data
centers, LAN networks, and public providers such as Amazon, Google,
Microsoft, and IBM.
● One of the potential areas worth studying and investigating is the
development of fundamental tools and techniques that facilitate
the integration and provisioning of distributed and hybrid clouds in a
federated way, which is critical for enabling the composition and
deployment of elastic application services [35, 36].
● High-performance data scaling in private and public cloud environments.
Organizations and enterprises that adopt cloud computing architectures
can face many challenges related to (a) the elastic provisioning of compute
clouds on their existing data center infrastructure and (b) the inability of the
data layer to scale at the same rate as the compute layer. So, there is a
persistent need to implement systems that are capable of scaling data at the
same pace as the infrastructure, or to integrate current elastic infrastructure
provisioning systems with existing systems that are designed to scale out the
application and data layers.
● Performance and high availability in clustered VMs through live
migration.
Clusters are very common in research centers and enterprises, and accordingly
in the cloud. For these clusters to work properly, two aspects are of
great importance, namely, high availability and high-performance service. These
can be achieved through clusters of virtual machines, in which highly available
applications are obtained through the live migration of virtual machines
to different locations in the cluster or in the cloud. So, the need exists to
(a) study the performance, (b) study the performance improvement
opportunities with regard to the migration of these virtual machines, and (c)
decide to which location a machine should be migrated.
● VM scheduling algorithms.
● Accelerating VMs live migration time.
● Cloud-wide VM migration and memory de-duplication.
Normal VM migration is done within the same physical site (campus, data
center, lab, etc.). However, migrating virtual machines between
different locations is an invaluable feature to add to any virtualization
management tool. For more details on memory status, storage relocation,
and so on, check the patent-pending technology on this topic [37].
Such a setup can enable faster and longer-distance VM migrations,
cross-site load balancing, power management, and de-duplicating memory
across multiple sites. It is a rich area for research.
● Live migration security.
Live migration security is a very important area of research, because several
security vulnerabilities exist; see reference 38 for an empirical exploitation
of live migration.
● Extending the migration algorithm to allow for priorities.
● The Cisco Unified Computing System (UCS) initiative and its role in dynamic,
just-in-time provisioning of virtual machines and increased business
agility [39].
CONCLUSION
Virtual machine provisioning and migration are critical tasks in today's
virtualized systems, data center technology, and, accordingly, cloud
computing services.
They have a huge impact on the continuity and availability of business. In a
few minutes, you can provision a complete server with all its appliances to
perform a particular function or to offer a service. In a few milliseconds,
you can migrate a virtual machine hosted on a physical server within a clustered
environment to a completely different server for maintenance, workload
needs, and so on. In this chapter, we covered VM provisioning and
migration services and techniques, as well as tools and concepts, and also shed
some light on potential areas for research.
REFERENCES
1. D. Chisnall, The Definitive Guide to the Xen Hypervisor, Prentice Hall, Upper Saddle River, NJ, 2008.
2. M. El-Refaey and M. Rizkaa, Virtual systems workload characterization: An
overview, in Proceedings of the 18th IEEE International Workshops on Enabling
Technologies: Infrastructures for Collaborative Enterprises, WETICE 2009,
Groningen, The Netherlands, 29 June—1 July 2009.
3. A. T. Velte, T. J. Velte, and R. Elsenpeter, Cloud Computing: A Practical Approach,
McGraw-Hill, New York, 2010.
4. Amazon Elastic Compute Cloud (Amazon EC2), http://aws.amazon.com/ec2/,
March 15, 2010.
5. Cloud Hosting, Cloud Computing, Hybrid Infrastructure from GoGrid, http://
www.gogrid.com/, March 19, 2010.
6. JoyentCloud Computing Companies: Domain, Application & Web Hosting
Services, http://www.joyent.com/, March 10, 2010.
7. Rackspace hosting, http://www.rackspace.com/index.php, March 10, 2010.
8. AppNexus—Home, http://www.appnexus.com/, March 9, 2010.
9. FlexiScale cloud computing and hosting: instant Windows and Linux cloud servers
on demand, http://www.flexiscale.com/, March 12, 2010.
10. C. Vecchiola, X. Chu, and R. Buyya, Aneka: A software platform for .NET-based cloud computing, in High Speed and Large Scale Scientific Computing (Advances in Parallel Computing), W. Gentzsch, L. Grandinetti, and G. Joubert (eds.), IOS Press, Amsterdam, The Netherlands, ISBN 978-1-60750-073-5, 2009, pp. 267–295.
11. D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, D.
Zagorodnov, The Eucalyptus Open-source Cloud-computing System, in
Proceedings of 9th IEEE International Symposium on Cluster Computing and the
Grid, Shanghai, China, pp. 124—131.
12. B. Sotomayor, R. Santiago Montero, I. Martín Llorente, and I. Foster, Capacity leasing in cloud systems using the OpenNebula engine (short paper), in Workshop on Cloud Computing and its Applications 2008 (CCA08), Chicago, Illinois, USA, October 22–23, 2008.
13. P. Gardfjäll, E. Elmroth, L. Johnsson, O. Mulmo, and T. Sandholm, Scalable
gridwide capacity allocation with the SweGrid Accounting System (SGAS),
Concurrency and Computation: Practice and Experience, 20(18): 2089—2122,
2008.
14. I. M. Llorente, Innovation for cloud infrastructure management in OpenNebula/
RESERVOIR, ETSI Workshop on Grids, Clouds & Service Infrastructures, Sophia
Antipolis, France, December 3, 2009.
15. B. Rochwerger, J. Caceres, R. S. Montero, D. Breitgand, E. Elmroth, A. Galis,
E. Levy, I. M. Llorente, K. Nagin, and Y. Wolfsthal, The RESERVOIR model
and architecture for open federated cloud computing, IBM Systems Journal,
Volume 53, Number 4, 2009.
16. F. Piedad and M. W. Hawkins, High Availability: Design, Techniques, and
Processes, Prentice Hall PTR, Upper Saddle River, NJ, 2000.
17. DMTF—VMAN, http://www.dmtf.org/standards/mgmt/vman, March 27, 2010.
18. OGF Open Cloud Computing Interface Working Group, http://www.occi-wg.org/
doku.php, March 27, 2010.
19. J. Arrasjid, K. Balachandran, D. Conde, G. Lamb, and S. Kaplan, Deploying the
VMware Infrastructure, The USENIX Association, August 10, 2008.
20. Live Storage Migration of virtual machine, http://www.vmware.com/technology/
virtual-storage/live-migration.html, August 19, 2009.
21. C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, Live migration of virtual machines, in 2nd USENIX Symposium on Networked Systems Design and Implementation (NSDI '05), May 2005.
22. VMware VMotion for Live migration of virtual machines, http://www.vmware
.com/products/vi/vc/vmotion.html, August 19, 2009.
23. Knowledge Center Home—Citrix Knowledge Center, Article ID: CTX115813
http://support.citrix.com, August 28, 2009.
24. Cold Migration, http://pubs.vmware.com/vsp40_e/admin/wwhelp/wwhimpl/common/html/wwhelp.htm#href=c_cold_migration.html#1_10_21_7_1&single=true,
August 20, 2009.
25. ConVirture: Enterprise—class management for open source virtualization, http://
www.convirture.com/, August 21, 2009.
26. S. Crosby, D. E. Williams, and J. Garcia, Virtualization with Xen: Including XenEnterprise, XenServer, and XenExpress, Syngress Media Inc., ISBN 1-59749-167-5, 2007.
27. I. Habib, Virtualization with KVM, Linux Journal, 2008(166):8, February 2008.
28. Installation—ConVirt, http://www.convirture.com/wiki/index.php?title=Installation, March 27, 2010.
29. VM Migration—ConVirt, http://www.convirture.com/wiki/index.php?title=VM_Migration, March 25, 2010.
30. Amazon Auto Scaling Service, http://aws.amazon.com/autoscaling/, March 23, 2010.
31. Amazon CloudWatch Service, http://aws.amazon.com/cloudwatch/, March 23, 2010.
32. Amazon Load Balancer Service, http://aws.amazon.com/elasticloadbalancing/, March 21, 2010.
33. S. Wardley, E. Goyer, and N. Barcet, Ubuntu Enterprise Cloud Architecture, http://www.ubuntu.com/cloud/private, March 23, 2010.
34. B. Sotomayor, K. Keahey, and I. Foster, Combining batch execution and leasing using virtual machines, in Proceedings of the 17th International Symposium on High Performance Distributed Computing, ACM, New York, 2008, pp. 87–96.
35. I. M. Llorente, The OpenNebula Open Source Toolkit to Build Cloud Infrastructures, LinuxWorld NL Seminars, Utrecht, The Netherlands, November 5, 2009.
36. R. Buyya, R. Ranjan, and R. N. Calheiros, InterCloud: Utility-oriented federation of cloud computing environments for scaling of application services, in Proceedings of the 10th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2010), Busan, South Korea, May 21–23, 2010.
37. K. Lawton, Virtualization 3.0: Cloud-wide VM migration and memory deduplication, http://www.trendcaller.com/2009/03/virtualization-30-vm-memorywan.html, August 25, 2009.
38. J. Oberheide, E. Cooke, and F. Jahanian, Empirical exploitation of live virtual machine migration, http://www.net-security.org/article.php?id=1120, August 29, 2009.
39. Cisco Unified Computing System, http://www.cisco.com/go/unifiedcomputing, August 30, 2009.
40. D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, The Eucalyptus open-source cloud-computing system, in Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2009), Shanghai, China, May 18–21, 2009.
CHAPTER 6
ON THE MANAGEMENT OF
VIRTUAL MACHINES FOR CLOUD
INFRASTRUCTURES
IGNACIO M. LLORENTE, RUBÉN S. MONTERO, BORJA SOTOMAYOR,
DAVID BREITGAND, ALESSANDRO MARASCHINI, ELIEZER LEVY, and
BENNY ROCHWERGER
In 2006, Amazon started offering virtual machines (VMs) to anyone with a
credit card for just $0.10/hour through its Elastic Compute Cloud (EC2)
service. Although not the first company to lease VMs, the programmer-friendly
EC2 Web services API and their pay-as-you-go pricing popularized the
"Infrastructure as a Service" (IaaS) paradigm, which is now closely related to
the notion of a "cloud." Following the success of Amazon EC2 [1], several
other IaaS cloud providers, or public clouds, have emerged, such as
ElasticHosts, GoGrid, and FlexiScale, that provide a publicly accessible
interface for purchasing and managing computing infrastructure that is
instantiated as VMs running on the provider's data center. There is also a
growing ecosystem of technologies and tools to build private clouds, where
in-house resources are virtualized and internal users can request and manage
these resources using interfaces similar or equal to those of public clouds, and
hybrid clouds, where an organization's private cloud can supplement its
capacity using a public cloud.
Thus, within the broader context of cloud computing, this chapter focuses
on the subject of IaaS clouds and, more specifically, on the efficient
management of virtual machines in this type of cloud. Section 6.1 starts by
discussing the characteristics of IaaS clouds and the challenges involved in
managing these clouds. The following sections elaborate on some of these
challenges, describing the solutions proposed within the virtual machine
management activity of
RESERVOIR (Resources and Services Virtualization without Barriers), a
European Union FP7-funded project. Section 6.2 starts by discussing the
problem of managing virtual infrastructures; Section 6.3 presents scheduling
techniques that can be used to provide advance reservation of capacity within
these infrastructures; Section 6.4 focuses on service-level agreements (or SLAs)
in IaaS clouds and discusses capacity management techniques supporting SLA
commitments. Finally, the chapter concludes with a discussion of remaining
challenges and future work in IaaS clouds.
6.1 THE ANATOMY OF CLOUD INFRASTRUCTURES
There are many commercial IaaS cloud providers in the market, such as
those cited earlier, and all of them share five characteristics: (i) They provide
on-demand provisioning of computational resources; (ii) they use virtualization
technologies to lease these resources; (iii) they provide public and simple remote
interfaces to manage those resources; (iv) they use a pay-as-you-go cost model,
typically charging by the hour; and (v) they operate data centers large
enough to provide a seemingly unlimited amount of resources to their clients
(usually touted as "infinite capacity" or "unlimited elasticity"). Private and
hybrid clouds share these same characteristics but, instead of selling capacity
over publicly accessible interfaces, focus on providing capacity to an
organization's internal users.
Virtualization technologies have been the key enabler of many of these
salient characteristics of IaaS clouds by giving providers a more flexible and
generic way of managing their resources. Thus, virtual infrastructure (VI)
management—the management of virtual machines distributed across a pool of
physical resources—becomes a key concern when building an IaaS cloud and
poses a number of challenges. Like traditional physical resources, virtual
machines require a fair amount of configuration, including preparation of
the machine's software environment and network configuration. However,
in a virtual infrastructure, this configuration must be done on-the-fly, with as
little time as possible between the time the VMs are requested and the time
they are available to the user. This is further complicated by the need to configure
groups of VMs that will provide a specific service (e.g., an application requiring
a Web server and a database server). Additionally, a virtual infrastructure
manager must be capable of allocating resources efficiently, taking into account
an organization‘s goals (such as minimizing power consumption and other
operational costs) and reacting to changes in the physical infrastructure.
Virtual infrastructure management in private clouds has to deal with an
additional problem: Unlike large IaaS cloud providers, such as Amazon,
private clouds typically do not have enough resources to provide the illusion
of "infinite capacity." The immediate provisioning scheme used in public
clouds, where resources are provisioned at the moment they are requested, is
ineffective in private clouds. Support for additional provisioning schemes, such
as best-effort provisioning and advance reservations to guarantee quality of
service (QoS) for applications that require resources at specific times (e.g.,
during known "spikes" in capacity requirements), is required. Thus, efficient
resource allocation algorithms and policies and the ability to combine both
private and public cloud resources, resulting in a hybrid approach, become even
more important.
Several VI management solutions have emerged over time, such as Platform
ISF and VMware vSphere, along with open-source initiatives such as the
Enomaly Computing Platform and oVirt. Many of these tools originated out
of the need to manage data centers efficiently using virtual machines, before the
cloud computing paradigm took off. However, managing virtual
infrastructures in a private/hybrid cloud is a different, albeit similar, problem
from managing a virtualized data center, and existing tools lack several features
that are required for building IaaS clouds. Most notably, they exhibit
monolithic and closed structures and can only operate, if at all, with some
preconfigured placement policies, which are generally simple (round robin,
first fit, etc.) and based only on CPU speed and utilization of a fixed and
predetermined number of resources, such as memory and network bandwidth.
This precludes extending their resource management strategies with custom
policies or integration with other cloud systems, or even adding cloud
interfaces.
Thus, there are still several gaps in existing VI solutions. Filling these gaps
will require addressing a number of research challenges over the next years,
across several areas, such as virtual machine management, resource scheduling,
SLAs, federation of resources, and security. In this chapter, we focus on three
problems addressed by the Virtual Machine Management Activity of
RESERVOIR: distributed management of virtual machines, reservation-based
provisioning of virtualized resources, and provisioning to meet SLA
commitments.
Distributed Management of Virtual Machines
The first problem is how to manage the virtual infrastructures themselves.
Although resource management has been extensively studied, particularly for
job management in high-performance computing, managing VMs poses
additional problems that do not arise when managing jobs, such as the need to
set up custom software environments for VMs, setting up and managing
networking for interrelated VMs, and reducing the various overheads involved
in using VMs. Thus, VI managers must be able to efficiently orchestrate all these
different tasks. The problem of efficiently selecting or scheduling computational
resources is well known. However, the state of the art in VM-based resource
scheduling follows a static approach, where resources are initially selected
using a greedy allocation strategy, with minimal or no support for other
placement policies. To efficiently schedule resources, VI managers must be
able to support flexible and complex scheduling policies and must leverage
the ability of VMs to
suspend, resume, and migrate.
This complex task is one of the core problems that the RESERVOIR project
tries to solve. In Section 6.2 we describe the problem of how to manage VMs
distributed across a pool of physical resources and describe OpenNebula, the
virtual infrastructure manager developed by the RESERVOIR project.
Reservation-Based Provisioning of Virtualized Resources
A particularly interesting problem when provisioning virtual infrastructures is
how to deal with situations where the demand for resources is known
beforehand—for example, when an experiment depending on some complex
piece of equipment is going to run from 2 pm to 4 pm, and computational
resources must be available at exactly that time to process the data produced by
the equipment. Commercial cloud providers, such as Amazon, have enough
resources to provide the illusion of infinite capacity, which means that this
situation is simply resolved by requesting the resources exactly when needed; if
capacity is ―infinite,‖ then there will be resources available at 2 pm.
On the other hand, when dealing with finite capacity, a different approach is
needed. However, the intuitively simple solution of reserving the resources
beforehand turns out to not be so simple, because it is known to cause
resources to be underutilized [10—13], due to the difficulty of scheduling other
requests around an inflexible reservation.
As we discuss in Section 6.3, VMs allow us to overcome the utilization
problems typically associated with advance reservations and we describe
Haizea, a VM-based lease manager supporting advance reservation along
with other provisioning models not supported in existing IaaS clouds, such
as best-effort provisioning.
Provisioning to Meet SLA Commitments
IaaS clouds can be used to deploy services that will be consumed by users other
than the one that deployed the services. For example, a company might depend
on an IaaS cloud provider to deploy three-tier applications (Web front-end,
application server, and database server) for its customers. In this case, there is a
distinction between the cloud consumer (i.e., the service owner; in this case, the
company that develops and manages the applications) and the end users of
the resources provisioned on the cloud (i.e., the service user; in this case, the
users that access the applications). Furthermore, service owners will enter into
service-level agreements (SLAs) with their end users, covering guarantees such
as the timeliness with which these services will respond.
However, cloud providers are typically not directly exposed to the service
semantics or the SLAs that service owners may contract with their end users.
The capacity requirements are, thus, less predictable and more elastic. The
use of reservations may be insufficient, and capacity planning and
optimizations are required instead. The cloud provider‘s task is, therefore, to
make sure that resource allocation requests are satisfied with specific
probability and
timeliness. These requirements are formalized in infrastructure SLAs between
the service owner and cloud provider, separate from the high-level SLAs
between the service owner and its end users.
In many cases, either the service owner is not resourceful enough to perform
an exact service sizing or service workloads are hard to anticipate in advance.
Therefore, to protect high-level SLAs, the cloud provider should cater for
elasticity on demand. We argue that scaling and de-scaling of an application is
best managed by the application itself. The reason is that in many cases,
resource allocation decisions are application-specific and are driven by
application-level metrics. These metrics typically do not have a universal
meaning and are not observable using black-box monitoring of the virtual
machines comprising the service.
RESERVOIR proposes a flexible framework where service owners may
register service-specific elasticity rules and monitoring probes, and these rules
are executed to match environment conditions. We argue that elasticity
of the application should be contracted and formalized as part of the capacity
availability SLA between the cloud provider and service owner. This poses
interesting research issues on the IaaS side, which can be grouped around two
main topics:
● SLA-oriented capacity planning that ensures there is enough capacity to
guarantee service elasticity with minimal over-provisioning.
● Continuous resource placement and scheduling optimization that lowers
operational costs and takes advantage of available capacity transparently
to the service while keeping the service SLAs.
We explore these two topics in further detail in Section 6.4, and we describe
how the RESERVOIR project addresses the research issues that arise therein.
6.2 DISTRIBUTED MANAGEMENT OF VIRTUAL INFRASTRUCTURES
Managing VMs in a pool of distributed physical resources is a key concern in
IaaS clouds, requiring the use of a virtual infrastructure manager. To address
some of the shortcomings in existing VI solutions, we have developed the
open-source OpenNebula virtual infrastructure engine
(http://www.opennebula.org). OpenNebula is capable of
managing groups of interconnected VMs—with support for the Xen, KVM,
and VMWare platforms—within data centers and private clouds that involve a
large amount of virtual and physical servers. OpenNebula can also be used to
build hybrid clouds by interfacing with remote cloud sites [14]. This section
describes how OpenNebula models and manages VMs in a virtual
infrastructure.
VM Model and Life Cycle
The primary target of OpenNebula is to manage VMs. Within OpenNebula, a
VM is modeled as having the following attributes:
● A capacity in terms of memory and CPU.
● A set of NICs attached to one or more virtual networks.
● A set of disk images. In general it might be necessary to transfer some of
these image files to/from the physical machine the VM will be running in.
● A state file (optional) or recovery file that contains the memory image of a
running VM plus some hypervisor-specific information.
The life cycle of a VM within OpenNebula follows several stages:
● Resource Selection. Once a VM is requested from OpenNebula, a feasible
placement plan for the VM must be made. OpenNebula's default
scheduler provides an implementation of a rank scheduling policy,
allowing site administrators to configure the scheduler to prioritize the
resources that are more suitable for the VM, using information from
the VMs and the physical hosts. As we will describe in Section 6.3,
OpenNebula can also use the Haizea lease manager to support more
complex scheduling policies.
● Resource Preparation. The disk images of the VM are transferred to the
target physical resource. During the boot process, the VM is
contextualized, a process where the disk images are specialized to work in
a given environment. For example, if the VM is part of a group of VMs
offering a service (a compute cluster, a DB-based application, etc.),
contextualization could involve setting up the network and the machine
hostname, or registering the new VM with a service (e.g., the head node
in a compute cluster). Different techniques are available to contextualize a
worker node, including use of an automatic installation system (for
instance, Puppet or Quattor), a context server (see reference 15), or access
to a disk image with the context data for the worker node (OVF
recommendation).
● VM Creation. The VM is booted by the resource hypervisor.
● VM Migration. The VM potentially gets migrated to a more suitable
resource (e.g., to optimize the power consumption of the physical resources).
● VM Termination. When the VM is going to shut down, OpenNebula can
transfer its disk images back to a known location. This way, changes in the
VM can be kept for future use.
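A minimal sketch of the VM model and life-cycle stages just listed, written as plain Python data structures (the class and field names are illustrative only and are not OpenNebula's actual API):

# Illustrative sketch (not OpenNebula's actual API) of the VM model and
# life-cycle stages described above.
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class LifeCycleStage(Enum):
    RESOURCE_SELECTION = "resource selection"      # rank scheduler picks a host
    RESOURCE_PREPARATION = "resource preparation"  # images transferred, VM contextualized
    CREATION = "creation"                          # booted by the resource hypervisor
    MIGRATION = "migration"                        # optionally moved to a better host
    TERMINATION = "termination"                    # images optionally saved back


@dataclass
class VirtualMachine:
    cpu: float                        # capacity in CPUs
    memory_mb: int                    # capacity in MB of memory
    nics: List[str]                   # virtual networks each NIC attaches to
    disk_images: List[str]            # disk image identifiers/paths
    state_file: Optional[str] = None  # optional memory image + hypervisor-specific data
    stage: LifeCycleStage = LifeCycleStage.RESOURCE_SELECTION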
VM Management
OpenNebula manages a VM's life cycle by orchestrating three different
management areas: virtualization by interfacing with a physical resource's
hypervisor, such as Xen, KVM, or VMWare, to control (e.g., boot, stop, or
shutdown) the VM; image management by transferring the VM images from
an image repository to the selected resource and by creating on-the-fly
temporary images; and networking by creating local area networks (LAN)
to interconnect the VMs and tracking the MAC addresses leased in each
network.
Virtualization. OpenNebula manages VMs by interfacing with the physical
resource virtualization technology (e.g., Xen or KVM) using a set of pluggable
drivers that decouple the managing process from the underlying technology.
Thus, whenever the core needs to manage a VM, it uses high-level commands
such as "start VM," "stop VM," and so on, which are translated by the drivers
into commands that the virtual machine manager can understand. By
decoupling the OpenNebula core from the virtualization technologies through
the use of a driver-based architecture, adding support for additional virtual
machine managers only requires writing a driver for it.
Image Management. VMs are supported by a set of virtual disks or images,
which contain the OS and any other additional software needed by the VM.
OpenNebula assumes that there is an image repository that can be any storage
medium or service, local or remote, that holds the base image of the VMs.
There are a number of different possible configurations depending on the user‘s
needs. For example, users may want all their images placed on a separate
repository with only HTTP access. Alternatively, images can be shared through
NFS between all the hosts. OpenNebula aims to be flexible enough to support
as many different image management configurations as possible.
OpenNebula uses the following concepts for its image management model
(Figure 6.1):
● Image Repositories refer to any storage medium, local or remote, that holds
the base images of the VMs. An image repository can be a dedicated file
server or a remote URL from an appliance provider, but it needs to be
accessible from the OpenNebula front-end.
● Virtual Machine Directory is a directory on the cluster node where a VM is
running. This directory holds all deployment files for the hypervisor to
boot the machine, checkpoints, and images being used or saved—all of
them specific to that VM. This directory should be shared for most
hypervisors to be able to perform live migrations. Any given VM image
goes through the following steps along its life cycle:
● Preparation implies all the necessary changes to be made to the
machine‘s image so it is prepared to offer the service to which it is
intended. OpenNebula assumes that the images that conform to a
particular VM are prepared and placed in the accessible image
repository.
FIGURE 6.1. Image management in OpenNebula (the front-end, running oned, holds the image repository under $ONE_LOCATION/var and shares it via a shared file system with the VM_DIR directories on the cluster nodes).
● Cloning the image means taking the image from the repository and
placing it in the VM's directory on the physical node where it is going to
be run, before the VM is actually booted. If a VM image is to be cloned,
the original image is not used directly; a copy is used instead.
There is a qualifier (clone) for the images that can mark them as
targets for cloning or not.
● Save/remove. If the save qualifier is disabled, once the VM has been
shut down, the images and any changes to them are discarded. However,
if the save qualifier is activated, the image will be saved for later use
(a minimal sketch of this clone/save logic is given below).
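A minimal sketch of the clone/save semantics just described, assuming simple file copies; this is an illustration of the described behavior, not OpenNebula's implementation, and the function names are hypothetical:

# Sketch of the clone/save qualifiers described above (illustrative only).
import os
import shutil

def stage_image(repo_path: str, vm_dir: str, clone: bool) -> str:
    """At boot: use a copy in the VM directory if 'clone' is set; otherwise the
    original repository image is used directly."""
    if clone:
        staged = os.path.join(vm_dir, os.path.basename(repo_path))
        shutil.copy(repo_path, staged)
        return staged
    return repo_path

def finish_image(staged_path: str, repo_path: str, save: bool) -> None:
    """At shutdown: keep the (possibly modified) image if 'save' is set;
    otherwise discard the changes."""
    if save:
        shutil.copy(staged_path, repo_path)   # keep changes for later use
    elif staged_path != repo_path:
        os.remove(staged_path)                # dispose of the cloned copy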
Networking. In general, services deployed on a cloud, from a computing
cluster to the classical three-tier business application, require several
interrelated VMs, with a virtual application network (VAN) being the primary
link between them. OpenNebula dynamically creates these VANs and
tracks the MAC addresses leased in the network to the service VMs. Note that
here we refer to layer 2 LANs; other TCP/IP services such as DNS, NIS, or
NFS are the responsibility of the service (i.e., the service VMs have to be
configured to provide such services).
The physical hosts that will co-form the fabric of our virtual infrastructures
will need to have some constraints in order to effectively deliver virtual
networks to our virtual machines. Therefore, from the point of view of
networking, we can define our physical cluster as a set of hosts with one or
more network interfaces, each of them connected to a different physical
network.
FIGURE 6.2. Networking model for OpenNebula (Hosts A and B bridge their VMs onto a private switched physical network carrying the ranged Red and Blue virtual LANs, e.g., 10.0.1.x and 10.0.2.x, and onto the public Internet through a fixed Public virtual LAN, e.g., 147.96.81.241/24; leased MAC addresses such as 02:01:0A:00:02:01 for 10.0.2.1 encode the corresponding IP addresses).
We can see in Figure 6.2 two physical hosts with two network interfaces
each; thus there are two different physical networks. There is one physical
network that connects the two hosts using a switch, and there is another
one that gives the hosts access to the public Internet. This is one possible
configuration for the physical cluster, and it is the one we recommend since
it can be used to make both private and public VANs for the virtual machines.
Moving up to the virtualization layer, we can distinguish three different VANs.
One is mapped on top of the public Internet network, and we can see a couple of
virtual machines taking advantage of it. Therefore, these two VMs will have
access to the Internet. The other two are mapped on top of the private physical
network: the Red and Blue VANs. Virtual machines connected to the same
private VAN will be able to communicate with each other; otherwise, they will
be isolated and won't be able to communicate.
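The MAC addresses shown in Figure 6.2 appear to encode the IP address leased to each VM (for example, 02:01:0A:00:02:01 pairs with 10.0.2.1, and 02:01:93:60:51:f1 with 147.96.81.241). A minimal sketch of that convention, assuming a fixed two-byte prefix taken from the figure:

# Sketch of the MAC/IP convention visible in Figure 6.2: the last four MAC
# bytes are the IPv4 octets in hex. The 02:01 prefix is read off the figure
# and is an assumption, not a documented specification.
def mac_for_ip(ip: str, prefix: str = "02:01") -> str:
    octets = [int(o) for o in ip.split(".")]
    return prefix + "".join(f":{o:02X}" for o in octets)

assert mac_for_ip("10.0.2.1") == "02:01:0A:00:02:01"
assert mac_for_ip("147.96.81.241") == "02:01:93:60:51:F1"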
Further Reading on OpenNebula
There are a number of scholarly publications that describe the design and
architecture of OpenNebula in more detail, including papers showing
performance results obtained when using OpenNebula to deploy and manage
the back-end nodes of a Sun Grid Engine compute cluster [14] and of a
NGINX Web server [16] on both local resources and an external cloud. The
OpenNebula virtual infrastructure engine is also available for download at
http://www.opennebula.org/, which provides abundant documentation not just
on how to install and use OpenNebula, but also on its internal architecture.
6.3 SCHEDULING TECHNIQUES FOR ADVANCE RESERVATION OF CAPACITY
While a VI manager like OpenNebula can handle all the minutiae of managing
VMs in a pool of physical resources, scheduling these VMs efficiently is a
different and complex matter. Commercial cloud providers, such as Amazon,
rely on an immediate provisioning model where VMs are provisioned right away,
since their data centers‘ capacity is assumed to be infinite. Thus, there is no need
for other provisioning models, such as best-effort provisioning where requests
have to be queued and prioritized or advance provisioning where resources are
pre-reserved so they will be guaranteed to be available at a given time period;
queuing and reservations are unnecessary when resources are always available to
satisfy incoming requests.
However, when managing a private cloud with limited resources, an
immediate provisioning model is insufficient. In this section we describe a
lease-based resource provisioning model used by the Haizea lease manager
(http://haizea.cs.uchicago.edu/),
which can be used as a scheduling back-end by OpenNebula to support
provisioning models not supported in other VI management solutions. We
focus, in particular, on advance reservation of capacity in IaaS clouds as a way
to guarantee availability of resources at a time specified by the user.
Existing Approaches to Capacity Reservation
Efficient reservation of resources in resource management systems has been
studied considerably, particularly in the context of job scheduling. In fact, most
modern job schedulers support advance reservation of resources, but their
implementation falls short in several aspects. First of all, they are constrained
by the job abstraction; when a user makes an advance reservation in a job-based
system, the user does not have direct and unfettered access to the resources, the
way cloud users can access the VMs they requested, but, rather, is only
allowed to submit jobs to them. For example, PBS Pro creates a new queue that
will be bound to the reserved resources, guaranteeing that jobs submitted to
that queue will be executed on them (assuming they have permission to do so).
Maui and Moab, on the other hand, simply allow users
to specify that a
submitted job should use the reserved resources (if the submitting user has
permission to do so). There are no mechanisms to directly login to the reserved
resources, other than through an interactive job, which does not provide
unfettered access to the resources.
Additionally, it is well known that advance reservations lead to utilization
problems [10—13], caused by the need to vacate resources before a reservation
can begin. Unlike future reservations made by backfilling algorithms, where
the start of the reservation is determined on a best-effort basis, advance
reservations introduce roadblocks in the resource schedule. Thus, traditional
job schedulers are unable to efficiently schedule workloads combining both
best-effort jobs and advance reservations.
However, advance reservations can be supported more efficiently by using a
scheduler capable of preempting running jobs at the start of the reservation and
resuming them at the end of the reservation. Preemption can also be used to run
large parallel jobs (which tend to have long queue times) earlier, and it is
especially relevant in the context of urgent computing, where resources have to
be provisioned on very short notice and the likelihood of having jobs already
assigned to resources is higher. While preemption can be accomplished trivially
by canceling a running job, the least disruptive form of preemption is
checkpointing, where the preempted job‘s entire state is saved to disk, allowing
it to resume its work from the last checkpoint. Additionally, some schedulers
also support job migration, allowing checkpointed jobs to restart on other
available resources, instead of having to wait until the preempting job or
reservation has completed.
However, although many modern schedulers support at least
checkpointing-based preemption, this requires the job's executable itself to be
checkpointable. An application can be made checkpointable by explicitly
adding that functionality to an application (application-level and library-level
checkpointing) or transparently by using OS-level checkpointing, where the
operating system (such as Cray, IRIX, and patched versions of Linux using
BLCR [17]) checkpoints a process, without rewriting the program or relinking
it with checkpointing libraries. However, this requires a checkpointing-capable
OS to be available.
Thus, a job scheduler capable of checkpointing-based preemption and
migration could be used to checkpoint jobs before the start of an advance
reservation, minimizing their impact on the schedule. However, the
application- and library-level checkpointing approaches burden the user with
having to modify their applications to make them checkpointable, imposing a
restriction on the software environment. OS-level checkpointing, on the other
hand, is a more appealing option, but still imposes certain software restrictions
on resource consumers. Systems like Cray and IRIX still require applications to
be compiled for their respective architectures, which would only allow a small
fraction of existing applications to be supported within leases, or would require
existing applications to be ported to these architectures. This is an excessive
restriction on users, given the large number of clusters and applications that
depend on the x86 architecture. Although the BLCR project does provide a
checkpointing x86 Linux kernel, this kernel still has several limitations, such as
not being able to properly checkpoint network traffic and not being able to
checkpoint MPI applications unless they are linked with BLCR-aware MPI
libraries.
An alternative approach to supporting advance reservations was proposed
by Nurmi et al. [18], which introduced "virtual advance reservations for
queues" (VARQ). This approach overlays advance reservations over
traditional job schedulers by first predicting the time a job would spend waiting
in a scheduler‘s queue and then submitting a job (representing the advance
reservation) at a time such that, based on the wait time prediction, the
probability that it will be running at the start of the reservation is maximized.
Since no actual reservations can be done, VARQ jobs can run on traditional
job schedulers, which will not distinguish between the regular best-effort jobs
and the VARQ jobs. Although this is an interesting approach that can be
realistically implemented in practice (since it does not require modifications to
existing schedulers), it still depends on the job abstraction.
Hovestadt et al. [19, 20] proposed a planning-based (as opposed to
queuing-based) approach to job scheduling, where job requests are immediately
planned by making a reservation (now or in the future), instead of waiting in a
queue. Thus, advance reservations are implicitly supported by a planning-based
system. Additionally, each time a new request is received, the entire schedule
is reevaluated to optimize resource usage. For example, a request for an
advance reservation can be accepted without using preemption, since the jobs
that were originally assigned to those resources can be assigned to different
resources (assuming the jobs were not already running).
Reservations with VMs
As we described earlier, virtualization technologies are a key enabler of many
features found in IaaS clouds. Virtual machines are also an appealing vehicle
for implementing efficient reservation of resources due to their ability to be
suspended, potentially migrated, and resumed without modifying any of
the applications running inside the VM. However, virtual machines also raise
additional challenges related to the overhead of using VMs:
Preparation Overhead. When using VMs to implement reservations, a VM
disk image must be either prepared on-the-fly or transferred to the physical
node where it is needed. Since a VM disk image can have a size in the
order of gigabytes, this preparation overhead can significantly delay the
starting time of leases. This delay may, in some cases, be unacceptable for
advance reservations that must start at a specific time.
Runtime Overhead. Once a VM is running, scheduling primitives such as
checkpointing and resuming can incur significant overhead, since a
VM's entire memory space must be saved to disk and then read from
disk. Migration involves transferring this saved memory along with the
VM disk image. Similar to deployment overhead, this overhead can result
in noticeable delays.
The Haizea project (http://haizea.cs.uchicago.edu/) was created to develop a
scheduler that can efficiently support advance reservations by using the
suspend/resume/migrate capability of VMs, while minimizing the overhead of
using VMs. The fundamental resource provisioning abstraction in Haizea is the
lease, with three types of lease currently supported:
● Advance reservation leases, where the resources must be available at a
specific time.
● Best-effort leases, where resources are provisioned as soon as possible and
requests are placed on a queue if necessary.
● Immediate leases, where resources are provisioned when requested or not
at all.
The Haizea lease manager can be used as a scheduling back-end for the
OpenNebula virtual infrastructure engine, allowing it to support these three
types of leases. The remainder of this section describes Haizea‘s leasing model
and the algorithms Haizea uses to schedule these leases.
Leasing Model
We define a lease as "a negotiated and renegotiable agreement between a
resource provider and a resource consumer, where the former agrees to make
a set of resources available to the latter, based on a set of lease terms presented
by the resource consumer." The terms must encompass the following: the
hardware resources required by the resource consumer, such as CPUs, memory,
and network bandwidth; a software environment required on the leased
resources; and an availability period during which a user requests that the
hardware and software resources be available. Since previous work and other
authors already explore lease terms for hardware resources and software
environments [21, 22], our focus has been on the availability dimension of a
lease and, in particular, on how to efficiently support advance reservations.
Thus, we consider the following availability terms:
● Start time may be unspecified (a best-effort lease) or specified (an advance
reservation lease). In the latter case, the user may specify either a specific
start time or a time period during which the lease start may occur.
● Maximum duration refers to the total maximum amount of time that the
leased resources will be available.
● Leases can be preemptable. A preemptable lease can be safely paused
without disrupting the computation that takes place inside the lease.
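The lease types and availability terms above can be combined into a single request structure. The following is an illustrative sketch only (the class and field names are assumptions, not Haizea's actual API):

# Illustrative sketch (not Haizea's actual API) of a lease request combining
# the lease types and availability terms described above.
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class LeaseType(Enum):
    ADVANCE_RESERVATION = "advance reservation"  # must start at a specified time
    BEST_EFFORT = "best effort"                  # queued, started as soon as possible
    IMMEDIATE = "immediate"                      # provisioned now or not at all


@dataclass
class LeaseRequest:
    lease_type: LeaseType
    num_vms: int
    cpus_per_vm: int
    memory_mb_per_vm: int
    disk_image: str
    max_duration: timedelta                # total maximum availability period
    start_time: Optional[datetime] = None  # unspecified for best-effort leases
    preemptable: bool = False              # can be safely paused if needed


# Example: an advance reservation for the 2 pm to 4 pm experiment mentioned
# earlier in this chapter (the date is arbitrary).
ar = LeaseRequest(LeaseType.ADVANCE_RESERVATION, num_vms=4, cpus_per_vm=2,
                  memory_mb_per_vm=2048, disk_image="experiment.img",
                  max_duration=timedelta(hours=2),
                  start_time=datetime(2010, 5, 2, 14, 0))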
Haizea's resource model considers that it manages W physical nodes capable
of running virtual machines. Each node i has a given number of CPUs, a given
amount of memory in megabytes (MB), and a given amount of local disk
storage in MB. We assume that all disk images required to run virtual machines
are available in a repository from which they can be transferred to nodes as
needed, and that all nodes are connected at a bandwidth of B MB/sec by a
switched network.
A lease is implemented as a set of N VMs, each allocated resources described
by a tuple (p, m, d, b), where p is number of CPUs, m is memory in MB, d is disk
space in MB, and b is network bandwidth in MB/sec. A disk image I with a size
of size(I) MB must be transferred from the repository to a node before the VM
can start. When transferring a disk image to multiple nodes, we use
multicasting and model the transfer time as size(I)/B. If a lease is preempted, it
is suspended by suspending its VMs, which may then be either resumed on the
same node or migrated to another node and resumed there. Suspending a VM
results in a memory state image file (of size m) that can be saved to either a local
filesystem or a global filesystem (f ∈ {local, global}). Resumption requires
reading that image back into memory and then discarding the file. Suspension
of a single VM is done at a rate of s megabytes of VM memory per second, and
we define r similarly for VM resumption.
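Under this model, the per-VM overheads are simple ratios of sizes to rates. A short worked sketch (the example numbers are illustrative only):

# Worked example using the model above: with image size size(I), network
# bandwidth B (MB/sec), suspension rate s, and resumption rate r (MB of VM
# memory per second), the per-VM overheads are simple ratios.
def transfer_time(image_size_mb: float, bandwidth_mb_s: float) -> float:
    """Multicast transfer time of a disk image, modeled as size(I)/B."""
    return image_size_mb / bandwidth_mb_s

def suspend_time(memory_mb: float, s_rate_mb_s: float) -> float:
    """Time to write a VM's memory state image of size m at rate s."""
    return memory_mb / s_rate_mb_s

def resume_time(memory_mb: float, r_rate_mb_s: float) -> float:
    """Time to read the memory state image back at rate r."""
    return memory_mb / r_rate_mb_s

# e.g., a 4096 MB image over a 100 MB/sec network takes about 41 seconds,
# and suspending a 1024 MB VM at 50 MB/sec takes about 20 seconds.
print(transfer_time(4096, 100), suspend_time(1024, 50))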
Lease Scheduling
Haizea is designed to process lease requests and determine how those requests
can be mapped to virtual machines, leveraging their suspend/resume/migrate
capability, in such a way that the leases‘ requirements are satisfied. The
scheduling component of Haizea uses classical backfilling algorithms [23],
extended to allow best-effort leases to be preempted if resources have to be
freed up for advance reservation requests. Additionally, to address the
preparation and runtime overheads mentioned earlier, the scheduler allocates
resources explicitly for the overhead activities (such as transferring disk images
or suspending VMs) instead of assuming they should be deducted from the
lease‘s allocation. Besides guaranteeing that certain operations complete on
time (e.g., an image transfer before the start of a lease), the scheduler also
attempts to minimize this overhead whenever possible, most notably by reusing
disk image transfers and caching disk images on the physical nodes.
Best-effort leases are scheduled using a queue. When a best-effort lease
is requested, the lease request is placed at the end of the queue, which is
periodically evaluated using a backfilling algorithm—both aggressive and
conservative backfilling strategies [23, 24] are supported—to determine if any
leases can be scheduled. The scheduler does this by first checking the earliest
possible starting time for the lease on each physical node, which will depend on
the required disk images. For example, if some physical nodes have cached the
required disk image, it will be possible to start the lease earlier on those nodes.
Once these earliest starting times have been determined, the scheduler chooses
the nodes that allow the lease to start soonest.
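A minimal sketch of the node-selection idea just described (this is an illustration of the idea, not Haizea's code; the function names are hypothetical):

# Sketch: compute the earliest possible start time per node, accounting for
# whether the required disk image is already cached there, then pick the
# nodes that allow the lease to start soonest.
def earliest_start(now: float, image_cached: bool, image_size_mb: float,
                   bandwidth_mb_s: float) -> float:
    # No transfer is needed if the image is cached on the node.
    return now if image_cached else now + image_size_mb / bandwidth_mb_s

def select_nodes(now, nodes, num_needed, image_size_mb, bandwidth_mb_s):
    """nodes: dict of node name -> True if the required image is cached."""
    ranked = sorted(nodes,
                    key=lambda n: earliest_start(now, nodes[n],
                                                 image_size_mb, bandwidth_mb_s))
    return ranked[:num_needed]

# Example: node2 and node3 cache the image, so they are chosen first.
print(select_nodes(0.0, {"node1": False, "node2": True, "node3": True},
                   num_needed=2, image_size_mb=4096, bandwidth_mb_s=100))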
The use of VM suspension/resumption allows the best-effort leases to be
scheduled even if there are not enough resources available for their full
requested duration. If there is a ―blocking‖ lease in the future, such as an
advance reservation lease that would prevent the best-effort lease to run to
completion before the blocking lease starts, the best-effort lease can still be
scheduled; the VMs in the best-effort lease will simply be suspended before a
blocking lease. The remainder of a suspended lease is placed in the queue,
according to its submission time, and is scheduled like a regular best-effort lease
(except a resumption operation, and potentially a migration operation, will
have to be scheduled too).
Advance reservations, on the other hand, do not go through a queue,
since they must start at either the requested time or not at all. Thus, scheduling
this type of lease is relatively simple, because it mostly involves checking
if there are enough resources available during the requested interval. However,
the scheduler must also check if any associated overheads can be scheduled
in such a way that the lease can still start on time. For preparation overhead,
the scheduler determines if the required images can be transferred on time.
These transfers are scheduled using an earliest deadline first (EDF) algorithm,
where the deadline for the image transfer is the start time of the advance
reservation lease. Since the start time of an advance reservation lease may occur
long after the lease request, we modify the basic EDF algorithm so that
transfers take place as close as possible to the deadline, preventing images from
unnecessarily consuming disk space before the lease starts. For runtime
overhead, the scheduler will attempt to schedule the lease without having to
preempt other leases; if preemption is unavoidable, the necessary suspension
operations are scheduled if they can be performed on time.
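The modified EDF idea (transfers ordered by deadline but placed as late as possible) can be sketched as follows, assuming a single transfer channel; this is an illustration, not Haizea's implementation, and it does not check whether a transfer would have to start in the past:

# Sketch of the modified EDF scheduling described above: the deadline of each
# image transfer is the start time of its advance reservation lease, and each
# transfer is scheduled as late as possible so images do not sit on disk
# unnecessarily before the lease starts.
def schedule_transfers_late(transfers):
    """transfers: list of (lease_id, duration, deadline); returns
    (lease_id, start, end) tuples, packed back-to-back from the latest
    deadline backwards on a single transfer channel."""
    plan = []
    next_free = float("inf")  # latest time at which the channel is still free
    for lease_id, duration, deadline in sorted(transfers, key=lambda t: -t[2]):
        end = min(deadline, next_free)
        start = end - duration
        plan.append((lease_id, start, end))
        next_free = start
    return list(reversed(plan))

# Two images due at t=100 and t=60; each transfer ends as close to its
# deadline as the later transfers allow.
print(schedule_transfers_late([("L1", 30, 100), ("L2", 20, 60)]))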
For both types of leases, Haizea supports pluggable policies, allowing system
administrators to write their own scheduling policies without having to modify
Haizea‘s source code. Currently, three policies are pluggable in Haizea:
determining whether a lease is accepted or not, the selection of physical nodes,
and determining whether a lease can preempt another lease.
Our main results so far [25, 26] have shown that, when using workloads
that combine best-effort and advance reservation lease requests, a VM-based
approach with suspend/resume/migrate can overcome the utilization problems
typically associated with the use of advance reservations. Even in the presence
of the runtime overhead resulting from using VMs, a VM-based approach
results in consistently better total execution time than a scheduler that does
not support task preemption, along with only slightly worse performance than a
scheduler that does support task preemption. Measuring the wait time and
slowdown of best-effort leases shows that, although the average values of these
metrics increase when using VMs, this effect is due to short leases not being
preferentially selected by Haizea's backfilling algorithm, which instead allows
best-effort leases to run as long as possible before a preempting AR lease
(and suspends them right before the start of the AR). In effect, a VM-based
approach does not favor leases of a particular length over others, unlike
systems that rely more heavily on backfilling. Our results have also shown
that, although supporting the deployment of multiple software environments, in
the form of multiple VM images, requires the transfer of potentially large disk
image files, this deployment overhead can be minimized through the use of
image transfer scheduling and caching strategies.
Further Reading on Lease-Based Resource Management
There are several scholarly publications [25—28] available for download at the
Haizea Web site (http://haizea.cs.uchicago.edu/) describing Haizea‘s design
and algorithms in greater detail and showing performance results obtained
when using Haizea‘s lease-based model.
6.4 CAPACITY MANAGEMENT TO MEET SLA COMMITMENTS
As was discussed in the previous section, when temporal behavior of services
with respect to resource demands is highly predictable (e.g., thanks to the
well-known business cycle of a service, or predictable job lengths in a
computational service), capacity can be efficiently scheduled using
reservations. In this section we focus on less predictable elastic workloads. For
these workloads, exact scheduling of capacity may not be possible. Instead,
capacity planning and optimizations are required.
IaaS providers perform two complementary management tasks: (1) capacity
planning to make sure that SLA obligations are met as contracted with the
service providers and (2) continuous optimization of resource utilization given
specific workload to make the most efficient use of the existing capacity. It is
worth emphasizing the rationale behind these two management processes.
The first task pertains to long-term capacity management aimed at
cost-efficient provisioning in accordance with contracted SLAs. To protect
SLAs with end users, elastic services scale up and down dynamically. This
requires an IaaS provider to guarantee elasticity for the service within some
contracted capacity ranges. Thus, the IaaS provider should plan capacity of the
cloud in such a way that when services change resource demands in response to
environment conditions, the resources will be indeed provided with the
contracted probability. At the same time, the IaaS cloud provider strives to
minimally over-provision capacity, thus minimizing the operational costs. We
observe that these goals can be harmonized thanks to statistical multiplexing of
elastic capacity demands. The key questions will be (a) in what form to provide
capacity guarantees (i.e., infrastructure SLAs) and (b) how to control the risks
inherent to over-subscribing. We treat these problems in Sections 6.4.1 and
6.4.2, respectively.
The second task pertains to short- and medium-term optimization of resource
allocation under the current workload. This optimization may be guided by
different management policies that support high-level business goals of an IaaS
provider. We discuss policy-driven continuous resource optimization in Section
6.4.3.
From an architectural viewpoint, we argue in favor of a resource
management framework that separates these two activities and allows
combining solutions for each process that are best adapted to the needs
of a specific IaaS provider.
6.4.1 Infrastructure SLAs
IaaS can be regarded as a giant virtual hardware store, where computational
resources such as virtual machines (VMs), virtual application networks (VANs),
and virtual disks (VDs) can be ordered on demand in a matter of minutes or
even seconds. Virtualization technology is sufficiently versatile to provide
virtual resources on an almost continuous granularity scale. Chandra et al.
[29] quantitatively study advantages of fine-grain resource allocation in a
shared hosting platform. As this research suggests, fine-grain temporal and
spatial resource allocation may lead to substantial improvements in capacity
utilization.
These advantages come at a cost of increased management, accounting, and
billing overhead. For this reason, in practice, resources are typically provided
on a more coarse discrete scale. For example, Amazon EC2 [1] offers small,
large, and extra large general-purpose VM instances and high-CPU medium
and extra large instances. It is possible that more instance types (e.g., I/O high,
memory high, storage high, etc.) will be added in the future should a demand
for them arise. Other IaaS providers, for example GoGrid and FlexiScale,
follow a similar strategy.
With some caution it may be predicted that this approach, being
considerably simpler management-wise, will remain prevalent in the short to
medium term in IaaS cloud offerings.
Thus, to deploy a service on a cloud, a service provider orders suitable virtual
hardware and installs its application software on it. From the IaaS provider's
point of view, a given service configuration is a virtual resource array of
black-box resources, where each element corresponds to the number of
instances of a resource type. For example, a
typical three-tier application may contain 10 general-purpose small instances
to run Web front-ends, three large instances to run an application server
cluster with load balancing and redundancy, and two large instances to run a
replicated database.
In the IaaS model, the service provider is expected to size the capacity
demands of its service. If resource demands are provided correctly and are
indeed satisfied upon request, then desired user experience of the service will be
guaranteed. A risk mitigation mechanism to protect user experience in the IaaS
model is offered by infrastructure SLAs (i.e., the SLAs formalizing capacity
availability) signed between service provider and IaaS provider.
There is no universal approach to infrastructure SLAs. As the IaaS field
matures and more experience is gained, some methodologies may become
more popular than others. Also, some methods may be more suitable for specific
workloads than others. There are three main approaches, as follows.
● No SLAs. This approach is based on two premises: (a) the cloud always has
spare capacity to provide on demand, and (b) services are not
QoS-sensitive and can withstand moderate performance degradation. This
methodology is best suited for best-effort workloads.
● Probabilistic SLAs. These SLAs allow us to trade capacity availability for
cost of consumption. Probabilistic SLAs specify clauses that determine
availability percentile for contracted resources computed over the SLA
evaluation period. The lower the availability percentile, the cheaper the
cost of resource consumption. This is justified by the fact that an IaaS
provider has less stringent commitments and can over-subscribe capacity
to maximize yield without exposing itself to excessive risk. This type of
SLA is suitable for small and medium businesses and for many enterprise-grade
applications.
● Deterministic SLAs. These are, in fact, probabilistic SLAs where resource
availability percentile is 100%. These SLAs are most stringent and
difficult to guarantee. From the provider‘s point of view, they do not
admit capacity multiplexing. Therefore this is the most costly option for
service providers, which may be applied for critical services.
We envision coexistence of all three methodologies above, where each SLA
type is most applicable to specific workload type. We will focus on probabilistic
SLAs, however, because they represent the more interesting and flexible option
and lay the foundation for the rest of the discussion on statistical multiplexing of
capacity in Section 6.4.2. But before we can proceed, we need to define one
more concept, elasticity rules.
Elasticity rules are scaling and de-scaling policies that guide the transition of the
service from one configuration to another to match changes in the environment.
The main motivation for defining these policies stems from the pay-as-you-go
billing model of IaaS clouds. The service owner is interested in paying only for
what is really required to satisfy workload demands, minimizing the over-provisioning overhead.
There are three types of elasticity rules (an illustrative sketch follows the list):
● Time-driven: These rules change the virtual resource array in response to
a timer event. These rules are useful for predictable workloads, for
example, for services with well-known business cycles.
● OS-level metrics-driven: These rules react to predicates defined in terms
of the OS parameters observable in black-box mode (see the Amazon
Auto-Scaling service). These auto-scaling policies are useful for
transparently scaling and de-scaling services. The problem, however, is
that in many cases this mechanism is not precise enough.
● Application metrics-driven: This is a unique RESERVOIR offering that
allows an application to supply application-specific policies that are
transparently executed by the IaaS middleware in reaction to the monitoring
information supplied by service-specific monitoring probes running
inside the VMs.
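The sketch below gives a flavor of how rules of all three kinds might be written down
for a single service; the field names, metrics, and thresholds are hypothetical and do
not reflect RESERVOIR's actual rule syntax.

    # Hypothetical elasticity rules of the three kinds described above.
    elasticity_rules = [
        # Time-driven: resize the web tier around a well-known business cycle.
        {"kind": "time", "at": "08:00", "action": {"resource": "small", "set_count": 20}},
        {"kind": "time", "at": "20:00", "action": {"resource": "small", "set_count": 10}},

        # OS-level metric-driven: black-box CPU utilization observed by the infrastructure.
        {"kind": "os_metric", "metric": "cpu_utilization", "window_minutes": 5,
         "condition": "> 0.80", "action": {"resource": "small", "add": 2}},

        # Application metric-driven: a service-specific probe inside the VMs reports latency.
        {"kind": "app_metric", "metric": "p95_latency_ms", "window_minutes": 5,
         "condition": "> 250", "action": {"resource": "large", "add": 1}},
    ]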
For a single service, elasticity rules of all three types can be defined, resulting
in a complex dynamic behavior of a service during runtime. To protect
elasticity rules of a service while increasing the multiplexing gain, RESERVOIR
proposes using probabilistic infrastructure availability SLAs.
Assuming that a business day is divided into a number of usage windows, the
generic template for probabilistic infrastructure SLAs is as follows.
For each usage window W_i and each resource type r_j from the virtual resource
array, a capacity range C = (r_j^min, r_j^max) is available for the service with
probability p_i.
Probabilistically guaranteeing capacity ranges allows service providers to
define their needs flexibly. For example, for a business-critical usage window,
the availability percentile may be higher than for regular or off-peak hours.
Similarly, capacity ranges may vary in size. From the provider's point of view,
defining capacity requirements this way allows yield maximization through
over-subscription. This creates a win-win situation for both the service provider
and the IaaS provider.
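Concretely, a probabilistic infrastructure SLA of this shape might be recorded as the
following structure; the usage windows, capacity ranges, and percentiles are invented
for illustration only.

    # Hypothetical probabilistic SLA: per usage window, each resource type is
    # guaranteed a (min, max) capacity range with the stated probability.
    probabilistic_sla = {
        "business_hours": {"probability": 0.99,
                           "ranges": {"small": (10, 20), "large": (3, 6)}},
        "off_peak":       {"probability": 0.95,
                           "ranges": {"small": (5, 10), "large": (2, 4)}},
    }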
Policy-Driven Probabilistic Admission Control
Benefits of statistical multiplexing are well known; this is an extensively
studied field, especially in computer networking [30-32]. In the context of
CPU and bandwidth allocation in shared hosting platforms, the problem was
recently studied by Urgaonkar et al. [33]. In that work the resources were
treated as contiguous, allowing infinitesimal capacity allocation. We generalize
this approach by treating each element of the virtual resource array (the number
of instances of a given resource type) as a random variable. The virtual resource
array is, therefore, a vector of random variables. Since we assume that the
capacity range for each resource type is finite, we may compute both the
average resource consumption rate and the variance in resource consumption for
each service, in terms of the capacity units corresponding to each resource type.
Inspired by the approach of Guerin et al. [30], we propose a simple
management lever termed acceptable risk level (ARL) to control over-subscription
of capacity. We define ARL as the probability of having insufficient capacity to
satisfy some capacity allocation requests on demand. The ARL value can be
derived from a business policy of the IaaS provider, reflecting a more aggressive
or a more conservative over-subscription stance.
In general, the optimal ARL value can be obtained by calculating the
residual benefit resulting from specific SLA violations. A more conservative,
suboptimal ARL value is simply the complement of the most stringent capacity
range availability percentile across the SLA portfolio.
An infrastructure SLA commitment for the new application service should
be made if and only if the potential effect does not cause the residual benefit to
fall below some predefined level, controlled by the site's business policy.
This decision process is referred to as BSM-aligned admission control.3
3. We will refer to it simply as admission control wherever no ambiguity arises.
Once a service application passes admission control successfully, optimal
placement should be found for the virtual resources comprising the service. We
treat this issue in Section 6.4.3.
The admission control algorithm calculates the equivalent capacity required to
satisfy the resource demands of the service applications for the given ARL. The
equivalent capacity is then matched against the actual available capacity to
verify whether it is safe to admit the new service.
In a federated environment (like that provided by RESERVOIR) there is
potentially an infinite pool of resources. However, these resources should fit
placement constraints that are posed by the service applications and should be
reserved using inter-cloud framework agreements. Thus, the BSM-aligned
admission control helps the capacity planning process to dimension capacity
requests from the partner clouds and fulfill physical capacity requests at the
local cloud.
The capacity demands of the deployed application services are continuously
monitored. For each application service, the mean capacity demand (in capacity
units) and the standard deviation of the capacity demand are calculated.
When a new service with unknown history arrives in the system, its mean
capacity demand and standard deviation are conservatively estimated from the
service elasticity rules and historic data known for other services. Then, an
equivalent capacity is approximated using Eq. (6.1). The equivalent capacity
is the physical capacity needed to host the new service and all previously
deployed services without increasing the probability of congestion (acceptable
risk level), ε.
Equivalent capacity is expressed in the form of a resource array, where each
element represents the number of instances of a resource of a specific type.4 To
verify that physical capacity is sufficient to support the needed equivalent
capacity, one may use either the efficient and scalable exact solution (via branch-and-bound
algorithms) to the multiple knapsack problem [38] or an efficient
bin-packing approximation algorithm such as First-Fit-Decreasing, which
guarantees an approximation ratio within 22% of the optimal algorithm. Using
multiple knapsacks is more appropriate when capacity augmentation is not an
option. Assuming that the value of the resources is proportional to their size,
solving the multiple knapsack problem provides a good estimate of the value
resulting from packing the virtual resources onto the given capacity. If capacity
can be augmented (for example, more physical capacity can be obtained from
a partner cloud provider or procured locally), then solving the bin packing
problem is more appropriate, since all items (i.e., resources comprising the
service) are always packed.
4. When calculating equivalent capacity, we do not know which service will use specific resource
instances, but we know that it is sufficient, say, to be able to allocate up to 100 small VM instances and
50 large instances to guarantee all resource requests resulting from application of the elasticity rules, so
that congestion in resource allocation will not happen with probability larger than ε.
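As a rough illustration of the bin-packing check, the sketch below packs equivalent-capacity
items onto physical hosts with First-Fit-Decreasing. Reducing each instance to a single
scalar size and giving every host the same capacity are simplifying assumptions made
only for this example.

    def first_fit_decreasing(item_sizes, bin_capacity):
        """Pack items (VM instances of the equivalent capacity, reduced to one
        capacity dimension) into equal-sized bins (physical hosts)."""
        bins, free = [], []              # contents and remaining capacity per bin
        for size in sorted(item_sizes, reverse=True):
            for i, slack in enumerate(free):
                if size <= slack:        # place into the first open bin it fits
                    bins[i].append(size)
                    free[i] -= size
                    break
            else:                        # no open bin fits: open a new host
                bins.append([size])
                free.append(bin_capacity - size)
        return bins

    # Example: 100 small instances (1 unit) and 50 large instances (4 units)
    # on hosts of 16 capacity units; all numbers are illustrative.
    items = [1] * 100 + [4] * 50
    hosts_needed = len(first_fit_decreasing(items, bin_capacity=16))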
Note that this is different from computing the actual placement of services,
since at the admission control stage we deal with "abstract" equivalent capacity.
Matching equivalent capacity against physical capacity, as above, guarantees
that a feasible placement for the actual services can be found with probability 1 − ε.
If the local and remote physical capacity that can be used by this site in a
guaranteed manner is sufficient to support the equivalent capacity calculated,
the new service is accepted. Otherwise, a number of possibilities exist, depending on the management policy:
● The service is rejected.
● The total capacity of the site is increased locally and/or remotely (through
federation) by the amount needed to satisfy the equivalent capacity
constraint and the service is admitted.
● The acceptable risk level is increased, and the service is accepted.
The equivalent capacity B_eq referred to in Eq. (6.1) is computed from the
per-service demand statistics as

    B_{eq} = m + \alpha \cdot \sigma                                            (6.1)

    m = \sum_{i=1}^{n} m_i                                                      (6.2)

    \sigma = \sqrt{ \sum_{i=1}^{n} \sigma_i^2 }                                 (6.3)

    \alpha = \sqrt{2} \, \mathrm{erfc}^{-1}(2\varepsilon)
           \approx \sqrt{ -2 \ln \varepsilon - \ln 2\pi - \ln(-2 \ln \varepsilon - \ln 2\pi) }    (6.4)

where m_i and σ_i² are the mean and variance of the capacity demand of service i
(in capacity units), n is the number of services, and ε is the acceptable risk level.
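A small numerical sketch of Eqs. (6.1)-(6.4), assuming that per-service demand statistics
are already expressed in capacity units of a single resource type; the demand figures and
the available capacity are invented for illustration.

    import math

    def equivalent_capacity(means, variances, epsilon):
        """Eqs. (6.1)-(6.4): capacity that keeps the probability of congestion
        (aggregate demand exceeding allocation) below epsilon."""
        m = sum(means)                                    # Eq. (6.2)
        sigma = math.sqrt(sum(variances))                 # Eq. (6.3)
        t = -2.0 * math.log(epsilon) - math.log(2.0 * math.pi)
        alpha = math.sqrt(t - math.log(t))                # Eq. (6.4), Gaussian-tail approximation
        return m + alpha * sigma                          # Eq. (6.1)

    # Admission check against the physical capacity guaranteed to the site.
    means, variances = [40.0, 25.0, 15.0], [36.0, 16.0, 9.0]
    b_eq = equivalent_capacity(means, variances, epsilon=0.01)   # about 2.3 sigma above the mean
    admit_new_service = b_eq <= 120.0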
Our approach initially overestimates the average capacity demand for the
new service. With the passage of time, however, as capacity usage statistics are
collected for the newly admitted application service, the mean and standard
deviation of the capacity demands (per resource type) are adjusted for this
service. This allows us to reduce the conservativeness when the next service
arrives.
Service providers may impose various placement restrictions on the VMs
comprising the service. For example, it may be required that VMs do not
share the same physical host (anti-affinity). As another example, consider a
heterogeneous physical infrastructure and placement constraints arising from
technological incompatibilities.
From the admission control algorithm's vantage point, the problem is that
during admission control it may not know which deployment restrictions
should be taken into account, since which restrictions are relevant depends on
the dynamic behavior of the services.
Thus, our proposed solution is best suited for services whose elements admit
full sharing of the infrastructure. Generalizing this approach to handle various
types of deployment restrictions is the focus of our current research efforts.
In general, to guarantee that a feasible placement for virtual resources will be
found with controllable probability in the presence of placement restrictions,
resource augmentation is required. The resource augmentation may be quite
significant (see references 34 and 35). It is, therefore, prudent for the IaaS
provider to segregate workloads that admit full sharing of the infrastructure
from those that do not, and to offer service provider-controlled deployment
restrictions as a premium service to recover capacity augmentation costs.
Policy-Driven Placement Optimization
The purpose of statistical admission control is to guarantee that there is enough
capacity to find a feasible placement with a given probability. Policy-driven
placement optimization complements capacity planning and management by
improving a given mapping of virtual resources (e.g., VMs) onto physical resources.
In the presence of deployment restrictions, efficient capacity planning with
guaranteed minimal over-provisioning is still an open research problem.
Part of the difficulty lies in the hardness of solving multiple knapsack problems
or their more general version, the generalized assignment problem. Both problems
are NP-hard in the strong sense (see the discussion in Section 6.4.5). In the
RESERVOIR model, where resource augmentation is possible through cloud
partnership, solutions that may require doubling of the existing local capacity in
the worst case [34] are applicable. An interesting line of research is to
approximate the capacity augmentation introduced by specific constraints, such as
bin-item and item-item conflicts. Based on the required augmentation, an IaaS
provider may either accept or reject the service.
As shown in reference 36, in the presence of placement constraints of the
bin-item type, the Bi-criteria Multiple Knapsack problem with Assignment
Restrictions (BMKAR), which maximizes the total profit of placed items (subject
to a lower bound) while minimizing the total number of containers (i.e., minimizing
utilized capacity), does not admit a polynomial algorithm that satisfies the lower
bound exactly unless P = NP. Two approximation algorithms with proven
performance ratios, one running in pseudo-polynomial time and one running in
polynomial time, were presented. These results are the best known today for
BMKAR, and the bounds are tight.
In our current prototype placement solution, we formulated the problem
as an Integer Linear Programming (ILP) problem and used a branch-and-bound
solver (COIN-CBC [37]) to solve it exactly. This serves as a performance
baseline for future research. As was shown by Pisinger [38], in the absence of
constraints, very large problem instances can be solved exactly in a very
efficient manner using a branch-and-bound algorithm. Obviously, as the scale
of the problem (in terms of constraints) increases, ILP becomes infeasible. This
leads us to focus on developing novel heuristic algorithms that extend the state
of the art, as discussed in Section 6.4.5.
A number of important aspects should be taken into account in efficient
placement optimization.
Penalization for Nonplacement. In BMKAR, as in all classical knapsack
problems, nonplacement of an item results in zero profit for that item. In the
VM placement with SLA protection problem, nonplacement of an item or
a group of items may result in an SLA violation and, thus, payment of a
penalty. The management policy to minimize nonplacements is factored
into the constraints and the objective function.
Selection Constraints. Selection constraints imply that only when a group of
VMs (items) collectively forming a service is placed does this meta-item yield
profit. Partial placement may even lead to a penalty, since the SLA of a
service may be violated. Thus, partial placement should be prevented. In
our formulation, this is factored into the constraints.
Repeated Solution. Since the placement problem is solved continuously, it is
important to minimize the cost of re-placement. In particular, we need to
minimize the cost of reassignments of VMs to hosts, because this entails
VM migrations. We factor a penalty term for migration into our
objective function.
Considering ICT-Level Management Policies. There are three policies
that we currently consider: power conservation (by minimizing the number
of physical hosts used for placement), load balancing (by spreading
load across the available physical machines), and migration minimization
(by introducing a penalty factor for machine migration). We discuss these
policies below. In general, RESERVOIR provides an open-ended engine
that allows different policies to be incorporated. Depending on the policy
chosen, the optimization problem is cast into a specific form. Currently,
we support two placement policies, "load balancing" and "power
conservation," with the number of migrations minimized in both cases. The
first policy is attained through solving GAP with conflicts, and the
second one is implemented via bin packing with conflicts.
Inspired by results by Santos et al. [39], who cast infrastructure-level
management policies as soft constraints, we factor the load balancing policy
into our model using the soft constraints approach.
Whereas the hard constraints take the form of

    f(\vec{x}) \le b                                                            (6.5)

where \vec{x} is the vector of decision variables, with the soft constraints approach
a constraint violation variable v is introduced into the hard constraint, as shown in
Eq. (6.6), and a penalty term P · v is added to the objective function to prevent
trivial solutions, because soft constraints can always be satisfied. If the penalty is
a sufficiently large number, the search for an optimal solution will try to minimize it.

    f(\vec{x}) \le b + v                                                        (6.6)
We exploit the idea that reducing the available capacity at each physical host
will force the search for an optimal solution to spread the VEEs over a larger
number of knapsacks, thus causing the load to be spread more evenly across the
site.
To address the power conservation objective as a management policy, we
formulate the problem as bin packing with conflicts.
Since the optimization policy for VEE placement is solved continuously, it is
critical to minimize VEE migrations in order to maintain cost-effectiveness.
To model this, we define a migration penalty term MP as shown in Eq. (6.7):

    MP = \sum_{i=1}^{m} \sum_{j=1}^{n} migr(j) \cdot \left| x_{i,j}^{t-1} - x_{i,j}^{t} \right|    (6.7)
Since the absolute value in MP is nonlinear, we cannot incorporate MP
into the objective function as is. To circumvent this problem, we linearize MP
by introducing additional variables, which is a widely used linearization
technique.
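A sketch of the standard linearization, written here with auxiliary variables y_{i,j}
introduced purely for illustration (the prototype's exact formulation may differ):

    y_{i,j} \ge x_{i,j}^{t-1} - x_{i,j}^{t}, \qquad
    y_{i,j} \ge x_{i,j}^{t}   - x_{i,j}^{t-1}, \qquad
    MP = \sum_{i=1}^{m} \sum_{j=1}^{n} migr(j)\, y_{i,j}

Because MP is added to a minimized objective with a positive penalty weight, each
y_{i,j} settles at |x_{i,j}^{t-1} - x_{i,j}^{t}| in any optimal solution, so the
linearized model is equivalent to Eq. (6.7).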
Management Policies and Management Goals. Policy-based management
is an overused term. Therefore, it is beneficial to define and differentiate our
approach to policy-driven admission control and placement optimization in
more precise terms.
Policy-driven management is a management approach based on
"if (condition), then (action)" rules defined to deal with the situations that are
likely to arise [40]. These policies serve as basic building blocks for
autonomic computing.
The overall optimality criteria of placement, however, are controlled by the
management policies, which are defined at a higher level of abstraction than
"if (condition), then (action)" rules. To avoid ambiguity, we term these policies
management goals. Management goals, such as "conserve power," "prefer local
resources over remote resources," "balance workload," "minimize VM
migrations," "minimize SLA noncompliance," and so forth, have complex
logical structures. They cannot be trivially expressed by "if (condition), then
(action)" rules, even though it is possible to create elementary rules that
will strive to satisfy global management preferences in a reactive or proactive
manner.
Regarding the management activity involved in VM placement
optimization, a two-phase approach can be used. In the first phase, a feasible
placement, that is, a placement that satisfies the hard constraints imposed by
the service manifest, can be obtained without concern for optimality and,
thus, with low effort. In the second phase, either a timer-based or a
threshold-based management policy can trigger a site-wide optimization
procedure that aligns capacity allocation with the management goals (e.g.,
the goal of using minimal capacity).
Management policies and management goals may be defined at different
levels of the management architecture, that is, at different levels of
abstraction. At the topmost level there are business management goals and
policies; we briefly discuss them in the next subsection. At the intermediate level
there are service-induced goals and policies. Finally, at the infrastructure
management level there are ICT management preferences and policies, which
are our primary focus in this activity. We discuss them in Section 6.4.4.
Business-Level Goals and Policies. Since business goals are defined at
such a high level of abstraction, a semantic gap exists between them and the
ICT-level management goals and policies. Bridging this gap is notoriously
difficult. In this work we aim at narrowing this gap and aligning the
high-level business management goals with the ICT-level management policies by
introducing the notion of an acceptable risk level (ARL) of capacity allocation
congestion.
Intuitively, we are interested in minimizing the cost of capacity
over-provisioning while controlling the risk associated with capacity overbooking.
To minimize the cost of capacity over-provisioning, we are interested in maximizing
the yield of the existing capacity. At some point, however, conflicts (congestions)
in capacity allocation may cause excessive SLA penalties that would offset the
advantages of yield maximization.
Accounting for the benefits of complying with SLAs and for the costs of
compliance and noncompliance due to congestion, we can compute the residual
benefit for the site. The target value of the residual benefit can be controlled by a
high-level business policy. To satisfy this business policy, we need to calculate
an appropriate congestion probability, the ARL. The ARL, in turn, helps us
calculate the equivalent capacity for the site so that it can take advantage of
statistical multiplexing in a safe manner.
To allow calculation of the residual benefit, capacity allocation behavior under
congestion should be deterministic. In particular, the policy under congestion may
be Max-Min Fair Share allocation [41] or higher-priority-first (HPF)
capacity allocation [39], where services with lower SLA classes are satisfied
only after all services with higher SLA classes are satisfied.
For the sake of discussion, let us assume that the HPF capacity allocation
policy is used.5
5. Whether a certain specific policy is being used is of minor importance. It is important, however,
that the policy be deterministic.
We use historical data of the capacity demand (in capacity
units corresponding to different resource types, as explained in Section 6.4.2)
per service; specifically, the α-percentile of the historic capacity demand per
application (where α equals the percentile of compliance required in the service
SLA). This is used to compute the expected capacity allocation per service under
capacity allocation congestion. Thus, we obtain the set of application services
whose SLAs may be violated.6 Using the penalty values defined for each affected
SLA, we obtain the residual benefit that would remain after penalties are enforced.
Using the management policy that puts a lower bound on the expected residual
benefit, we compute the acceptable risk value, ε, that satisfies this bound.
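The sketch below illustrates this calculation under the HPF assumption; the service
attributes (SLA class, percentile demand, benefit, penalty) and the candidate risk levels
are hypothetical, and the expected-benefit model is deliberately simplified.

    def residual_benefit_under_congestion(services, congested_capacity):
        """Benefit remaining if congestion occurs and capacity is allocated
        higher-priority-first; a lower sla_class means a higher priority."""
        benefit, remaining = 0.0, congested_capacity
        for s in sorted(services, key=lambda s: s["sla_class"]):
            if s["demand"] <= remaining:          # SLA honored
                benefit += s["benefit"]
                remaining -= s["demand"]
            else:                                 # SLA violated, penalty enforced
                benefit += s["benefit"] - s["penalty"]
        return benefit

    def choose_arl(services, congested_capacity, min_residual_benefit,
                   candidates=(0.001, 0.005, 0.01, 0.02, 0.05)):
        """Largest acceptable risk level whose expected residual benefit still
        satisfies the business-policy lower bound (None if no candidate does)."""
        full = sum(s["benefit"] for s in services)
        congested = residual_benefit_under_congestion(services, congested_capacity)
        feasible = [eps for eps in candidates
                    if (1 - eps) * full + eps * congested >= min_residual_benefit]
        return max(feasible) if feasible else None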
Infrastructure-Level Management Goals and Policies
In general, infrastructure-level management policies are derived from the
business-level management goals. For example, consider a sample business-level
management goal to "reduce energy expenses by 30% in the next quarter."
This broadly defined goal may imply, among other means of achieving it, that
we systematically improve consolidation of VMs on physical hosts by putting
excess capacity into a low-power consumption mode. Thus, a site-wide ICT-level
power conservation management policy may be formulated as: "minimize the
number of physical machines while protecting the capacity availability SLAs of
the application services."
As another example, consider the business-level management goal: "Improve
customer satisfaction by achieving more aggressive performance SLOs." One
possible policy toward satisfying this business-level goal may be formulated as:
"Balance load within the site in order to achieve a specific average load per
physical host." Another infrastructure-level management policy to improve
performance is: "Minimize the number of VM migrations." The rationale for
this policy is that performance degradation necessarily occurs during VM
migration.
State of the Art
Our approach to capacity management described in Section 6.4.2 is based on
the premise that service providers perform sizing of their services. A detailed
discussion of sizing methodologies is out of our scope, and we only briefly
mention results in this area. Capacity planning for Web services was studied by
Menascé and Almeida [42]. Doyle et al. [43] considered the problem of how to
map the requirements of a known media service workload into the corresponding
system resource requirements and to accurately size the required system. Based
on the past workload history, the capacity planner finds the 95th percentile of the
service demand (for various resources and on different usage windows) and asks
for the corresponding configuration. Urgaonkar et al. [44] studied model-based
sizing of three-tier commercial services. Recently, Chen et al. [45] studied a
similar problem and provided novel performance models for multi-tier services.
6. This is a conservative estimate.
Doyle et al. [43] presented new models for automating resource provisioning
for resources that may interact in complex ways. The premise of model-based
resource provisioning is that internal models capturing service workload and
behavior can enable prediction of the effects that changes to the service workload
and resource allotments have on service performance. For example, the model can
answer questions like: "How much memory is needed to reduce this service's
storage access rate by 20%?" The paper introduces simple performance models
for Web services and proposes a model-based resource allocator that utilizes
them and allocates appropriate resource slices to achieve the needed performance
versus capacity utilization. A slice may be mapped to a virtual machine or another
resource container providing performance isolation.
In cases where exact model-driven service sizing is not available, learning
desirable resource allocations from dynamic service behavior may be possible
using black-box monitoring of the service network activity, as was recently
shown by Ben-Yehuda et al. [46] for multi-tier services.
Benefits of capacity multiplexing (under the assumption of known resource
demands) in shared hosting platforms were quantitatively studied by Chandra
et al. [29].
An approach to capacity over-subscription that is conceptually similar to
ours was recently studied by Urgaonkar et al. [33]. In that work, provisioning of
CPU and network resources with probabilistic guarantees on a shared hosting
platform was considered. The main difference between our methodology
and that of Urgaonkar et al. is that we allocate capacity in integral discrete
quanta that encapsulate CPU, memory, network bandwidth, and storage, rather
than allowing independent, infinitesimally small resource allocations along each
of these capacity dimensions.
Advances in virtualization technologies and increased awareness of the
management and power costs of running under-utilized servers have spurred
interest in consolidating existing applications onto a smaller number of servers
in the data center. In most practical settings today, a static approach to
consolidation, where consolidation is performed as a point-in-time optimization
activity, is used [47, 48]. With the static approach, the costs of VM migration
are usually not accounted for, and relatively time-consuming computations are
tolerated. Gupta et al. [48] demonstrated that the static consolidation problem can
be modeled as a variant of the bin packing problem where the items to be packed
are the servers being consolidated and the bins are the target servers. The sizes of
the servers/items being packed are resource utilizations obtained from
performance trace data. The authors present a two-stage heuristic algorithm
for handling the "bin-item" assignment constraints that inherently restrict any
server consolidation problem. The model is able to solve extremely large
instances of the problem in a reasonable amount of time.
Autonomic and dynamic optimization of virtual machine placement in a
data center has recently received considerable attention, mainly in the research
community [49-59].
Bobroff et al. [54] introduce an empirical dynamic server migration and
consolidation algorithm based on predicting the capacity demand of virtual servers
using time series analysis.
Mehta and Neogi [49] presented Recon, a virtualized-server consolidation
planning tool that analyzes historical data collected from an existing
environment and computes the potential benefits of server consolidation,
especially in the dynamic setting.
Gmach et al. [50] considered consolidation of multiple virtualized servers
and their workloads subject to specific quality-of-service requirements
that need to be supported.
Wood et al. [52] presented Sandpiper, a system that automates the task of
monitoring and detecting hotspots, determining a new mapping of physical to
virtual resources, and initiating the necessary migrations to protect performance.
Singh et al. [53] presented a promising approach to the design of an agile
data center with integrated server and storage virtualization technologies.
Verma et al. [51] studied the design, implementation, and evaluation of a
power-aware application placement controller in the context of an environment
with heterogeneous virtualized server clusters.
Tang et al. [58] presented a performance model-driven approach to
application placement that can be extended to VM placement.
Wang et al. [55] defined a nonlinear constrained optimization model for
dynamic resource provisioning and presented a novel analytic solution.
Choi et al. [60] proposed a machine learning framework that autonomously
finds and adjusts utilization thresholds at runtime for different computing
requirements.
Kelly [59] studied the problem of allocating discrete resources according to
utility functions reported by potential recipients with application to resource
allocation in a Utility Data Center (UDC).
Knapsack-related optimization has been relentlessly studied over the last
30 years; the scientific literature on the subject is, therefore, abundant. For an
excellent treatment of knapsack problems, we recommend references 61 and
62. The Simple Multiple Knapsack Problem (MKP) is NP-hard in the strong
sense. Its generalization, called the Generalized Assignment Problem (GAP), is
APX-hard [63]. GAP (and therefore MKP) admits a 2-approximation using a
greedy algorithm [64]. A Fully Polynomial Time Approximation Scheme
(FPTAS) for this problem is unlikely unless P = NP [65]. For some time it
was not known whether simple MKP admits a Polynomial Time
Approximation Scheme (PTAS); Chekuri and Khanna [63] presented a PTAS
for MKP in 2000. Shachnai and Tamir showed that the Class-Constrained
Multiple Knapsack problem also admits a PTAS.
The running time of PTASs dramatically increases as ε decreases.7 Therefore,
heuristic algorithms optimized for specific private cases and scalable exact
solutions are important.
7. Here ε stands for the approximation parameter and should not be confused with the acceptable
risk level of Section 6.4.2, which was also denoted ε.
Pisinger [38] presented a scalable exact branch-and-bound algorithm for
solving multiple knapsack problems with hundreds of thousands of items and
high ratios of items to bins. This algorithm improves the branch-and-bound
algorithm by Martello and Toth [61].
Dawande et al. [34] studied single-criterion and bi-criteria multiple knapsack
problems with assignment restrictions. For the bi-criteria problem of
minimizing utilized capacity subject to a minimum requirement on assigned
weight, they give a (1/3, 2)-approximation algorithm, where the first value
refers to profit and the second one refers to capacity augmentation.
Gupta et al. [66] presented a two-stage heuristic for server consolidation that
handles item-bin and item-item conflicts. No bounds on this heuristic were
shown, however.
Epstein and Levin [35] studied the bin packing problem with item-item
conflicts. They present a 2.5-approximation algorithm for perfect conflict graphs
and a 1.75-approximation algorithm for bipartite conflict graphs.
An additional annotated bibliography and surveys on knapsack-related
problems can be found in references 67 and 68. For a survey of recent results
in multi-criteria combinatorial optimization, see reference 69.
An important question for studying scalability of the optimization
algorithms is how to produce meaningful benchmarks for the tests. Pisinger
[70] studied relative hardness characterization of the knapsack problems. This
study may serve as a basis for generating synthetic benchmarks to be used in
validating knapsack related solutions.
Business-driven resource provisioning was studied by Marques et al. [71].
This work proposes a business-oriented approach to designing IT infrastructure
in an e-commerce context subject to load surges.
Santos et al. [39] demonstrated that management policies can be effectively
and elegantly cast as soft constraints in an optimization problem.
From analyzing the state of the art in provisioning and placement
optimization, we observe that the mainstream approach is detection and
remediation. In a nutshell, the SLA compliance of the services is monitored,
and when noncompliance or a dangerous trend that may lead to noncompliance
is detected, corrective actions (e.g., VEE migrations) are attempted.
CONCLUSIONS AND FUTURE WORK
Virtualization is one of the cornerstones of Infrastructure-as-a-Service cloud
computing and, although virtual machines provide numerous benefits,
managing them efficiently in a cloud also poses a number of challenges. This
chapter has described some of these challenges, along with the ongoing work
within the RESERVOIR project to address them. In particular, we have
focused on the problems of distributed management of virtual infrastructures,
advance reservation of capacity in virtual infrastructures, and meeting SLA
commitments.
Managing virtual machines distributed across a pool of physical resources,
or virtual infrastructure management, is not a new problem. VM-based data
center management tools have been available since long before the emergence of
cloud computing. However, these tools specialized in long-running VMs and
exhibited monolithic architectures that were hard to extend, or were limited by
design to use one particular hypervisor. Cloud infrastructures need to support
pay-as-you-go and on-demand models where VMs have to be provisioned
immediately and fully configured for the user, which requires coordinating
storage, network, and virtualization technologies. To this end, we have
developed OpenNebula, a virtual infrastructure manager designed with the
requirements of cloud infrastructures in mind. OpenNebula is an actively
developed open source project, and future work will focus on managing groups
of VMs arranged in a service-like structure (e.g., a compute cluster), disk image
provision strategies to reduce image cloning times, and improving support for
external providers to enable a hybrid cloud model.
We have also developed Haizea, a resource lease manager that can act as a
scheduling back-end for OpenNebula, supporting provisioning models other
than the immediate provisioning model prevalent in existing cloud
providers. In particular, Haizea adds support for best-effort provisioning and
advance reservations, both of which become necessary when managing a finite
number of resources. Future work will focus on researching policies for lease
admission and lease preemption, particularly those based on economic models,
and will also focus on researching adaptive scheduling strategies for advance
reservations.
We developed an algorithmic approach to resource over-subscription with
probabilistically guaranteed risk of violating SLAs. Our future work in this
area will focus on (1) validation of this approach with synthetic and real data
through simulating a large-scale IaaS cloud environment, (2) complementing
admission control and capacity planning with heuristics for workload throttling,
particularly those that take advantage of opportunistic placement in a federated
environment, to handle the cases when stochastic properties of the underlying
system change abruptly and dramatically, and (3) policies to control the cost-effectiveness of resource allocation.
ACKNOWLEDGMENTS
Our work is supported by the European Union through the research grant
RESERVOIR Grant Number 215605.
REFERENCES
1. Amazon Inc., Amazon Elastic Compute Cloud (Amazon EC2), http://aws.amazon.com/ec2/.
2. ElasticHosts Ltd., ElasticHosts, http://www.elastichosts.com/.
3. ServePath LLC, GoGrid, http://www.gogrid.com/.
4. XCalibre Communications Ltd., FlexiScale, http://www.flexiscale.com/.
5. B. Rochwerger, J. Caceres, R. S. Montero, D. Breitgand, E. Elmroth, A. Galis, E. Levy, I. M. Llorente, K. Nagin, and Y. Wolfsthal, The RESERVOIR model and architecture for open federated cloud computing, IBM Systems Journal, 53(4):4:1-4:11, 2009.
6. Platform Computing Corporation, Platform ISF, http://www.platform.com/Products/platform-isf.
7. VMware Inc., VMware DRS, http://www.vmware.com/products/vi/vc/drs.html.
8. Enomaly Inc., Elastic Computing Platform, http://www.enomaly.com/.
9. Red Hat, oVirt, http://ovirt.org/.
10. I. Foster, C. Kesselman, C. Lee, R. Lindell, K. Nahrstedt, and A. Roy, A distributed resource management architecture that supports advance reservations and co-allocation, in Proceedings of the International Workshop on Quality of Service, 1999.
11. W. Smith, I. Foster, and V. Taylor, Scheduling with advanced reservations, in Proceedings of the 14th International Symposium on Parallel and Distributed Processing, IEEE Computer Society, 2000, p. 127.
12. Q. Snell, M. J. Clement, D. B. Jackson, and C. Gregory, The performance impact of advance reservation meta-scheduling, in Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, Springer-Verlag, London, 2000, pp. 137-153.
13. M. W. Margo, K. Yoshimoto, P. Kovatch, and P. Andrews, Impact of reservations on production job scheduling, in Proceedings of the 13th Workshop on Job Scheduling Strategies for Parallel Processing, 116-131, 2007.
14. I. M. Llorente, R. Moreno-Vozmediano, and R. S. Montero, Cloud computing for on-demand grid resource provisioning, in Advances in Parallel Computing, Volume 18, IOS Press, 2009, pp. 177-191.
15. T. Freeman and K. Keahey, Contextualization: Providing one-click virtual clusters, in Proceedings of the IEEE Fourth International Conference on eScience, 301-308, December 2008.
16. R. Moreno, R. S. Montero, and I. M. Llorente, Elastic management of cluster-based services in the cloud, in Proceedings of the First Workshop on Automated Control for Datacenters and Clouds (ACDC 2009), 19-24, June 2009.
17. P. H. Hargrove and J. C. Duell, Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters, Journal of Physics: Conference Series, 46:494-499, 2006.
18. D. C. Nurmi, R. Wolski, and J. Brevik, VARQ: Virtual advance reservations for queues, in Proceedings of the 17th International Symposium on High Performance Distributed Computing, ACM, New York, 2008, pp. 75-86.
19. M. Hovestadt, O. Kao, A. Keller, and A. Streit, Scheduling in HPC resource management systems: Queuing vs. planning, Lecture Notes in Computer Science 2862, Springer, Berlin, 2003, pp. 1-20.
20. F. Heine, M. Hovestadt, O. Kao, and A. Streit, On the impact of reservations from the grid on planning-based resource management, in Proceedings of the 5th International Conference on Computational Science (ICCS 2005), Volume 3516 of Lecture Notes in Computer Science (LNCS), Springer, Berlin, 2005, pp. 155-162.
21. T. Freeman, K. Keahey, I. T. Foster, A. Rana, B. Sotomayor, and F. Wuerthwein, Division of labor: Tools for growing and scaling grids, in Proceedings of the International Conference on Service Oriented Computing, 40-51, 2006.
22. K. Keahey and T. Freeman, Contextualization: Providing one-click virtual clusters, in Proceedings of the IEEE Fourth International Conference on eScience, 2008.
23. A. W. Mu'alem and D. G. Feitelson, Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling, IEEE Transactions on Parallel and Distributed Systems, 12(6):529-543, 2001.
24. D. A. Lifka, The ANL/IBM SP scheduling system, in Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, Springer-Verlag, London, 1995, pp. 295-303.
25. B. Sotomayor, K. Keahey, and I. Foster, Combining batch execution and leasing using virtual machines, in Proceedings of the 17th International Symposium on High Performance Distributed Computing, ACM, New York, 2008, pp. 87-96.
26. B. Sotomayor, R. S. Montero, I. M. Llorente, and I. Foster, Resource leasing and the art of suspending virtual machines, in Proceedings of the 11th IEEE International Conference on High Performance Computing and Communications (HPCC-09), 59-68, June 2009.
27. B. Sotomayor, A resource management model for VM-based virtual workspaces, Master's thesis, University of Chicago, February 2007.
28. B. Sotomayor, K. Keahey, I. Foster, and T. Freeman, Enabling cost-effective resource leases with virtual machines, in Hot Topics session of the ACM/IEEE International Symposium on High Performance Distributed Computing 2007 (HPDC 2007), 2007.
29. A. Chandra, P. Goyal, and P. Shenoy, Quantifying the benefits of resource multiplexing in on-demand data centers, in Proceedings of the First ACM Workshop on Algorithms and Architectures for Self-Managing Systems (Self-Manage 2003), January 2003.
30. R. Guerin, H. Ahmadi, and M. Nagshineh, Equivalent capacity and its application to bandwidth allocation in high speed networks, IEEE Journal on Selected Areas in Communications, 9(7):968-981, 1991.
31. Z.-L. Zhang, J. Kurose, J. D. Salehi, and D. Towsley, Smoothing, statistical multiplexing, and call admission control for stored video, IEEE Journal on Selected Areas in Communications, 15(6):1148-1166, 1997.
32. E. W. Knightly and N. B. Shroff, Admission control for statistical QoS: Theory and practice, IEEE Network, 13(2):20-29, 1999.
33. B. Urgaonkar, P. Shenoy, and T. Roscoe, Resource overbooking and application profiling in shared hosting platforms, in Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI '02), 2002, pp. 239-254.
34. M. Dawande, J. Kalagnanam, P. Keskinocak, R. Ravi, and F. S. Salman, Approximation algorithms for the Multiple Knapsack Problem with assignment restrictions, Journal of Combinatorial Optimization, 4:171-186, 2000. http://www.research.ibm.com/pdos/doc/papers/mkar.ps.
35. L. Epstein and A. Levin, On bin packing with conflicts, SIAM Journal on Optimization, 19(3):1270-1298, 2008.
36. M. Dawande and J. Kalagnanam, The Multiple Knapsack Problem with Color Constraints, Technical Report, IBM T. J. Watson Research, 1998.
37. J. Forrest and R. Lougee-Heimer, CBC User Guide, http://www.coin-or.org/Cbc/index.html, 2005.
38. D. Pisinger, An exact algorithm for large multiple knapsack problems, European Journal of Operational Research, 114:528-541, 1999.
39. C. A. Santos, A. Sahai, X. Zhu, D. Beyer, V. Machiraju, and S. Singhal, Policy-based resource assignment in utility computing environments, in Proceedings of the 15th IFIP/IEEE Distributed Systems: Operations and Management, Davis, CA, November 2004.
40. D. Verma, Simplifying network administration using policy-based management, IEEE Network, 16(2):20-26, July 2002.
41. S. Keshav, An Engineering Approach to Computer Networking, Addison-Wesley Professional Series, Addison-Wesley, Reading, MA, 1997.
42. D. A. Menascé and V. A. F. Almeida, Capacity Planning for Web Performance: Metrics, Models, and Methods, Prentice-Hall, 1998.
43. R. P. Doyle, J. S. Chase, O. M. Asad, W. Jin, and A. M. Vahdat, Model-based resource provisioning in a Web service utility, in Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS), p. 5, 2003.
44. B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi, An analytical model for multi-tier internet services and its applications, in Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, ACM, New York, 2005, pp. 291-302.
45. Y. Chen, S. Iyer, D. Milojicic, and A. Sahai, A systematic and practical approach to generating policies from service level objectives, in Proceedings of the 11th IFIP/IEEE International Symposium on Integrated Network Management, 89-96, 2009.
46. M. Ben-Yehuda, D. Breitgand, M. Factor, H. Kolodner, and V. Kravtsov, NAP: A building block for remediating performance bottlenecks via black box network analysis, in Proceedings of the 6th International Conference on Autonomic Computing and Communications (ICAC '09), Barcelona, Spain, 179-188, June 2009.
47. T. Yuyitung and A. Hillier, Virtualization Analysis for VMware, Technical Report, CiRBA, 2007.
48. R. Gupta, S. K. Bose, S. Sundarrajan, M. Chebiyam, and A. Chakrabarti, A two stage heuristic algorithm for solving the server consolidation problem with item-item and bin-item incompatibility constraints, in Proceedings of the IEEE International Conference on Services Computing (SCC '08), Vol. 2, Honolulu, HI, July 2008, pp. 39-46.
49. S. Mehta and A. Neogi, Recon: A tool to recommend dynamic server consolidation in multi-cluster data centers, in Proceedings of the IEEE Network Operations and Management Symposium (NOMS 2008), Salvador, Bahia, Brazil, April 2008, pp. 363-370.
50. D. Gmach, J. Rolia, L. Cherkasova, G. Belrose, T. Turicchi, and A. Kemper, An integrated approach to resource pool management: Policies, efficiency and quality metrics, in Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2008), 2008.
51. A. Verma, P. Ahuja, and A. Neogi, pMapper: Power and migration cost aware application placement in virtualized systems, in Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware, Springer-Verlag, New York, 2008, pp. 243-264.
52. T. Wood, P. Shenoy, A. Venkataramani, and M. Yousif, Black-box and gray-box strategies for virtual machine migration, in Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI '07), Cambridge, MA, April 2007.
53. A. Singh, M. Korupolu, and D. Mohapatra, Server-storage virtualization: Integration and load balancing in data centers, in Proceedings of the 7th International Symposium on Software Composition (SC 2008), Budapest, Hungary, Article No. 53, March 2008.
54. N. Bobroff, A. Kochut, and K. Beaty, Dynamic placement of virtual machines for managing SLA violations, in Proceedings of the 10th IFIP/IEEE International Symposium on Integrated Network Management (IM '07), 2007, pp. 119-128. Best Paper award, IM '07.
55. X. Wang, Z. Du, Y. Chen, S. Li, D. Lan, G. Wang, and Y. Chen, An autonomic provisioning framework for outsourcing data center based on virtual appliances, Cluster Computing, 11(3):229-245, 2008.
56. C. Hyser, B. McKee, R. Gardner, and J. Watson, Autonomic Virtual Machine Placement in the Data Center, Technical Report, HP Laboratories, February 2008.
57. L. Grit, D. Irwin, A. Yumerefendi, and J. Chase, Virtual machine hosting for networked clusters: Building the foundations for "autonomic" orchestration, in Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing, IEEE Computer Society, Washington, DC, 2006, p. 7.
58. C. Tang, M. Steinder, M. Spreitzer, and G. Pacifici, A scalable application placement controller for enterprise data centers, in Proceedings of the 16th International World Wide Web Conference (WWW 2007), Banff, Canada, 331-340, May 2007.
59. T. Kelly, Utility-directed allocation, in Proceedings of the First Workshop on Algorithms and Architectures for Self-Managing Systems, 2003.
60. H. W. Choi, H. Kwak, A. Sohn, and K. Chung, Autonomous learning for efficient resource utilization of dynamic VM migration, in Proceedings of the 22nd Annual International Conference on Supercomputing, ACM, New York, 2008, pp. 185-194.
61. S. Martello and P. Toth, Knapsack Problems: Algorithms and Computer Implementations, John Wiley & Sons, New York, 1990.
62. H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack Problems, Springer, Berlin, 2004.
63. C. Chekuri and S. Khanna, A PTAS for the multiple knapsack problem, in Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, 2000, pp. 213-222.
64. D. B. Shmoys and E. Tardos, An approximation algorithm for the generalized assignment problem, Mathematical Programming, 62:461-474, 1993.
65. M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, New York, 1979.
66. R. Gupta, S. K. Bose, S. Sundarrajan, M. Chebiyam, and A. Chakrabarti, A two stage heuristic algorithm for solving the server consolidation problem with item-item and bin-item incompatibility constraints, in Proceedings of the 2008 IEEE International Conference on Services Computing, IEEE Computer Society, Washington, DC, 2008, pp. 39-46.
67. E. Yu-Hsien Lin, A Bibliographical Survey on Some Well-Known Non-Standard Knapsack Problems, 1998.
68. A. Fréville, The multidimensional 0-1 knapsack problem: An overview, European Journal of Operational Research, 155(1):1-21, 2004.
69. M. Ehrgott and X. Gandibleux, A survey and annotated bibliography of multiobjective combinatorial optimization, OR Spectrum, 22(4):425-460, 2000.
70. D. Pisinger, Where are the hard knapsack problems? Computers & Operations Research, 32(9):2271-2284, 2005.
71. J. Marques, F. Sauve, and A. Moura, Business-oriented capacity planning of IT infrastructure to handle load surges, in Proceedings of the 10th IEEE/IFIP Network Operations and Management Symposium (NOMS 2006), Vancouver, Canada, April 2006.
72. X. Zhu, D. Young, B. J. Watson, Z. Wang, J. Rolia, S. Singhal, B. McKee, C. Hyser, D. Gmach, R. Gardner, T. Christian, and L. Cherkasova, 1000 Islands: Integrated capacity and workload management for the next generation data center, in Proceedings of the 5th IEEE International Autonomic Computing Conference (ICAC '08), Chicago, IL, June 2008, pp. 172-181.
CHAPTER 7
ENHANCING CLOUD COMPUTING
ENVIRONMENTS USING A CLUSTER
AS A SERVICE
MICHAEL BROCK and ANDRZEJ GOSCINSKI
INTRODUCTION
The emergence of cloud computing has caused a significant change in how IT
infrastructures are provided to research and business organizations. Instead of
paying for expensive hardware and incurring excessive maintenance costs, it is now
possible to rent the IT infrastructure of other organizations for a minimal fee.
While the existence of cloud computing is new, the elements used to create
clouds have been around for some time. Cloud computing systems have been
made possible through the use of large-scale clusters, service-oriented architecture (SOA), Web services, and virtualization.
While the idea of offering resources via Web services is commonplace in
cloud computing, little attention has been paid to the clients themselves,
specifically, human operators. Even though clouds host a variety of resources
that are in turn accessible to a variety of clients, support for human users is
minimal.
Proposed in this chapter is the Cluster as a Service (CaaS), a Web service for
exposing clusters via WSDL and for discovering and using them to run jobs.1
1. Jobs contain programs, data, and management scripts. A process is a program that is in execution.
When clients use a cluster, they submit jobs; the jobs are run by clusters, creating one or more processes.
Because the WSDL document is the most commonly exploited object of a Web service,
the inclusion of state and other information in the WSDL document makes the
internal activity of the Web services publishable. This chapter offers a higher
layer of cloud abstraction and support for users. From the virtualization point of
view, the CaaS is an interface for clusters that makes their discovery, selection,
and use easier.
The rest of this chapter is structured as follows. Section 7.2 discusses four
well-known clouds. Section 7.3 gives a brief explanation of the dynamic
attribute and Web service-based Resources Via Web Services (RVWS)
framework [1, 2], which forms a basis of the CaaS. Section 7.4 presents the
logical design of our CaaS solution. Section 7.5 presents a proof of concept
where a cluster is published, found, and used. Section 7.6 provides a
conclusion.
RELATED WORK
In this section, four major clouds are examined to learn what is offered to
clients in terms of higher layer abstraction and support for users, in particular,
service and resource publication, discovery, selection, and use. While the focus
of this chapter is to simplify the exposure of clusters as Web services, it is
important to learn what problems exist when attempting to expose any form of
resource via a Web service.
Depending on what services and resources are offered, clouds belong to one
of three basic cloud categories: Infrastructure as a Service (IaaS), Platform as a
Service (PaaS), and Software as a Service (SaaS). IaaS clouds make basic
computational resources (e.g., storage, servers) available as services over the
Internet. PaaS clouds offer environments for the easy development and deployment
of scalable applications. SaaS clouds allow complete end-user applications to be
deployed, managed, and delivered as a service, usually through a browser, over
the Internet. SaaS clouds support only the provider's applications on their
infrastructure.
Four well-known clouds (EC2, Azure, App Engine, and Salesforce [16])
represent these three basic cloud categories well.
Amazon Elastic Compute Cloud (EC2)
An IaaS cloud, EC2 offers "elastic" access to hardware resources that EC2
clients use to create virtual servers. Inside the virtual servers, clients either host
the applications they wish to run or host services of their own to access over the
Internet. As demand for the services inside the virtual machine rises, it is
possible to create a duplicate (instance) of the virtual machine and distribute
the load across the instances.
The first problem with EC2 is its low level of abstraction. Tutorials [6—8]
show that when using EC2, clients have to create a virtual machine, install
software into it, upload the virtual machine to EC2, and then use a command
line tool to start it. Even though EC2 has a set of pre-built virtual machines that
EC2 clients can use, it still falls to the clients to ensure that their own software
is installed and then configured correctly.
It was only recently that Amazon announced new scalability features,
specifically Auto-Scaling and Elastic Load Balancing. Before the
announcement of these services, it fell to EC2 clients to either modify their
announcement of these services, it fell to EC2 clients to either modify their
services running on EC2 or install additional management software into
their EC2 virtual servers. While the offering of Auto-Scaling and Elastic Load
Balancing reduces the modification needed for services hosted on EC2, both
services are difficult to use and require client involvement [11, 12]. In both cases,
the EC2 client is required to have a reserve of virtual servers and then to
configure Auto-Scaling and Elastic Load Balancing to make use of the virtual
servers based on demand.
Finally, EC2 does not provide any means for publishing services by other
providers, nor does it provide the discovery and selection of services within
EC2. An analysis of EC2 documentation shows that network multicasting (a
vital element of discovery) is not allowed, thus making discovery and
selection of services within EC2 difficult. After services are hosted inside
virtual machines on EC2, clients are required to manually publish their services
to a discovery service external to EC2.
Google App Engine
Google App Engine is a PaaS cloud that provides a complete Web service
environment: All required hardware, operating systems, and software are
provided to clients. Thus, clients only have to focus on the installation or
creation of their own services, while App Engine runs the services on Google's
servers.
However, App Engine is very restricted in which languages can be used to
build services. At the time of writing, App Engine only supports the Java and
Python programming languages. If one is not familiar with any of the supported
programming languages, the App Engine client has to learn the language before
building his or her own services. Furthermore, existing applications cannot
simply be placed on App Engine: only services written completely in Java or
Python are supported.
Finally, App Engine does not contain any support for publishing services created
by other service providers, nor does it provide discovery and selection services.
After creating and hosting their services, clients have to publish their services to
discovery services external to App Engine. At the time of writing, an
examination of the App Engine code pages [24] also found no matches when
the keyword "discovery" was used as a search string.
Microsoft Windows Azure
Another PaaS cloud, Microsoft's Azure allows clients to build services using
developer libraries which make use of communication, computational, and
storage services in Azure and then simply upload the completed services.
To ease service-based development, Azure also provides a discovery service
within the cloud itself, called the .NET Service Bus [14]. Services hosted in
Azure are published once and are locatable even if they are frequently moved.
When a service is created/started, it publishes itself to the Bus using a URI [15]
and then awaits requests from clients.
While it is interesting that the service can move and still be accessible as long
as the client uses the URI, how the client gets the URI is not addressed.
Furthermore, it appears that no other information such as state or quality of
service (QoS) can be published to the Bus, only the URI.
Salesforce
Salesforce [16] is a SaaS cloud that offers customer relations management
(CRM) software as a service. Instead of maintaining hardware and software
licenses, clients use the software hosted on Salesforce servers for a minimal fee.
Clients of Salesforce use the software as though it were their own and do not
have to worry about software maintenance costs. This includes the provision of
hardware, the installation of all required software, and the routine updates.
However, Salesforce is only applicable for clients who need existing
software. Salesforce only offers CRM software and does not allow the hosting
of custom services. So while it is the cloud with the greatest ease of use,
Salesforce has the least flexibility.
Cloud Summary
While there is much promise with the four major clouds presented in this
chapter, all have a problem when it comes to publishing and discovering required
services and resources. Put simply, discovery is close to nonexistent and some
clouds require significant involvement from their clients.
Of all the clouds examined, only Azure offers a discovery service. However,
the discovery service in Azure only addresses static attributes. The .NET
Service Bus only allows for the publication of unique identifiers.
Furthermore, current cloud providers assume that human users of clouds
are experienced programmers. There is no consideration for clients that are
specialists in other fields such as business analysis and engineering. Hence,
when interface tools are provided, they are primitive and only usable by
computing experts. Ease of use needs to be available to both experienced and
novice computing users.
What is needed is an approach to provide higher layer abstraction and
support for users through the provision of simple publication, discovery,
selection, and use of resources. In this chapter, the resource focused on is a
cluster. Clients should be able to easily place required files and executables on
the cluster and get the results back without knowing any cluster specifics. We
propose to exploit Web services to provide a higher level of abstraction and
offer these services.
7.3 RVWS DESIGN
While Web services have simplified resource access and management, it is not
possible to know if the resource(s) behind the Web service is (are) ready for
requests. Clients need to exchange numerous messages with required Web
services to learn the current activity of resources and thus face significant
overhead if most of the Web services prove ineffective. Furthermore, even
in ideal circumstances where all resources behind Web services are the best
choice, clients still have to locate the services themselves. Finally, the Web
services have to be stateful so that they are able to best reflect the current state
of their resources.
This was the motivation for creating the RVWS framework. The novelty of
RVWS is that it combines dynamic attributes, stateful Web services (aware
of their past activity), stateful and dynamic WSDL documents [1], and
brokering [17] into a single, effective, service-based framework. Regardless
of clients accessing services directly or discovering them via a broker, clients of
RVWS-based distributed systems spend less time learning of services.
Dynamic Attribute Exposure
There are two categories of dynamic attributes addressed in the RVWS
framework: state and characteristic. State attributes cover the current activity
of the service and its resources, thus indicating readiness. For example, a Web
service that exposes a cluster (itself a complex resource) would most likely have
a dynamic state attribute that indicates how many nodes in the cluster are busy
and how many are idle.
Characteristic attributes cover the operational features of the service, the
resources behind it, the quality of service (QoS), price and provider
information. Again with the cluster Web service example, a possible
characteristic is an array of support software within the cluster. This is
important information as cluster clients need to know what software libraries
exist on the cluster.
Figure 7.1 shows the steps on how to make Web services stateful and how
the dynamic attributes of resources are presented to clients via the WSDL
document.
To keep the stateful Web service current, a Connector is used to detect
changes in resources and then inform the Web service. The Connector has three
logical modules: Detection, Decision, and Notification. The Detection module
routinely queries the resource for attribute information (1—2). Any changes in
the attributes are passed to the Decision module (3) that decides if the attribute
change is large enough to warrant a notification. This prevents excessive
communication with the Web service. Updated attributes are passed on to
the Notification module (4), which informs the stateful Web service (5) that
updates its internal state. When a client requests the stateful WSDL document
(6), the Web service returns the WSDL document with the values of all
attributes (7) at the request time.
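As an illustration of this Detection-Decision-Notification cycle, the following Python sketch polls a resource and only notifies the Web service when an attribute moves by more than an assumed 5% threshold; the polling function, the threshold, and the in-memory "Web service state" are all invented for the sketch (the chapter does not tie the Connector to any particular language).

import time

CHANGE_THRESHOLD = 0.05  # assumption: only notify on changes larger than 5 points of 0-100

def detect(resource):
    """Detection module: query the resource for its current attributes (steps 1-2)."""
    # In RVWS this would poll the cluster (e.g., via Ganglia); here it is stubbed.
    return {"cpu-usage-percent": resource.get("cpu-usage-percent", 0.0)}

def significant_change(old, new, threshold=CHANGE_THRESHOLD):
    """Decision module (step 3): is the change large enough to warrant a notification?"""
    if old is None:
        return True
    return any(abs(new[k] - old.get(k, 0.0)) > threshold * 100 for k in new)

def notify(web_service_state, attributes):
    """Notification module (steps 4-5): push updated attributes to the stateful Web service."""
    web_service_state.update(attributes)

def connector_loop(resource, web_service_state, polls=3, interval=1.0):
    last = None
    for _ in range(polls):
        current = detect(resource)
        if significant_change(last, current):
            notify(web_service_state, current)
            last = current
        time.sleep(interval)

The threshold check is what prevents the excessive communication with the Web service mentioned above.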
FIGURE 7.1. Exposing resource attributes.
Stateful WSDL Document Creation
When exposing the dynamic attributes of resources, the RVWS framework
allows Web services to expose the dynamic attributes through the WSDL
documents of Web services. The Web Service Description Language (WSDL)
[18] governs a schema that describes a Web service and a document written in
the schema. In this chapter, the term WSDL refers to the stateless WSDL
document. Stateful WSDL document refers to the WSDL document created by
RVWS Web services.
All information of service resources is kept in a new WSDL section called
Resources. Figure 7.2 shows the structure of the Resources section with the rest
of the WSDL document. For each resource behind the Web service, a
ResourceInfo section exists.
Each ResourceInfo section has a resource-id attribute and two child
sections: state and characteristic. All resources behind the Web service have
unique identifiers. When the Connector learns of the resource for the first time,
it publishes the resource to the Web service.
Both the state and characteristics elements contain several description
elements, each with a name attribute and (if the provider wishes) one or
more attributes of the service. Attributes in RVWS use the {name: op value}
notation. An example attribute is {cost: <= $5}.
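A constraint written in this {name: op value} form can be evaluated mechanically. The Python sketch below is only an illustration of the idea; the operator table and the example values are assumptions, not part of the RVWS specification.

import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "=": operator.eq}

def satisfies(attribute_value, constraint):
    """Check a resource attribute against a {name: op value} style constraint.
    Example: constraint = ("<=", 5.0) stands for {cost: <= $5}."""
    op_symbol, required = constraint
    return OPS[op_symbol](float(attribute_value), float(required))

# A cluster node with 12.5% CPU usage against {cpu_usage_percent: >10}
print(satisfies("12.5", (">", 10)))   # True
print(satisfies("4.75", ("<=", 5)))   # {cost: <= $5} -> True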
The state of a resource could be very complex and cannot be described in just
one attribute. For example, variations in each node in the cluster all contribute
significantly to the state of the cluster. Thus the state in RVWS is described via
a collection of attributes, all making up the whole state.
The characteristics section describes near-static attributes of resources such
as their limitations and data parameters. For example, the type of CPU on a
node in a cluster is described in this section.
<definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
  <resources>
    <resource-info identifier="resourceID">
      <state>
        <description name="" attribute1="value1" … attributen="valuen">
          …Other description Elements…
        </description>
        …Other description Elements…
      </state>
      <characteristics>
        <description name="" />
        …Other description Elements…
      </characteristics>
    </resource-info>
    …Other resource-info elements…
  </resources>
  <types>...</types>
  <message name="MethodSoapIn">...</message>
  <message name="MethodSoapOut">...</message>
  <portType name="CounterServiceSoap">...</portType>
  <binding name="CounterServiceSoap" type="tns:CounterServiceSoap">...</binding>
  <service name="CounterService">...</service>
</definitions>
FIGURE 7.2. New WSDL section.
Publication in RVWS
While the stateful WSDL document eliminates the overhead incurred from
manually learning the attributes of the service and its resource(s), the issues
behind discovering services are still unresolved.
To help ease the publication and discovery of required services with stateful
WSDL documents, a Dynamic Broker was proposed (Figure 7.3) [17]. The goal
of the Dynamic Broker is to provide an effective publication and discovery
service based on service, resource, and provider dynamic attributes.
When publishing to the Broker (1), the provider sends attributes of the Web
service to the Dynamic Broker. The dynamic attributes indicate the
functionality, cost, QoS, and any other attributes the provider wishes to have
published about the service. Furthermore, the provider is able to publish
information about itself, such as the provider's contact details and reputation.
After publication (1), the Broker gets the stateful WSDL document from
the Web service (2). After getting the stateful WSDL document, the Dynamic
Broker extracts all resource dynamic attributes from the stateful WSDL
documents and stores the resource attributes in the resources
store.
FIGURE 7.3. Publication.
The Dynamic Broker then stores the (stateless) WSDL document and service
attributes from (1) in the service store. Finally, all attributes about the provider
are placed in the providers store.
As the Web service changes, it is able to send a notification to the Broker (3)
which then updates the relevant attribute in the relevant store. Had all
information about each service been kept in a single stateful WSDL document,
the Dynamic Broker would have spent a lot of time loading, editing, and
saving huge XML documents to the database.
Automatic Discovery and Selection
Automatic service discovery takes the dynamic attributes in stateful WSDL
documents into consideration when locating services (e.g., a cluster).
When discovering services, the client submits to the Dynamic Broker three
groups of requirements (1 in Figure 7.4): service, resource, and provider.
The Dynamic Broker compares each requirement group on the related data
store (2). Then, after getting matches, the Broker applies filtering (3). As the
client using the Broker could vary from human operators to other software
units, the resulting matches have to be filtered to suit the client. Finally, the
filtered results are returned to the client (4).
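The matching (2) and filtering (3) steps can be sketched as follows. This is not the Dynamic Broker's actual implementation; the store layout, the reuse of the satisfies() helper from the earlier attribute sketch, and the name-only filtering for human clients are assumptions made for illustration.

def match_store(store, requirements):
    """Step 2: return store entries whose attributes satisfy every requirement.
    requirements maps attribute name -> (op_symbol, value), as in the earlier sketch."""
    # assumes the satisfies() helper from the earlier attribute sketch is in scope
    matches = []
    for entry in store:                      # each entry: {"name": ..., "attributes": {...}}
        attrs = entry["attributes"]
        if all(name in attrs and satisfies(attrs[name], constraint)
               for name, constraint in requirements.items()):
            matches.append(entry)
    return matches

def discover(broker_stores, service_req, resource_req, provider_req, human_client=True):
    """Steps 1-4: match each requirement group on its own store, then filter."""
    results = {
        "services": match_store(broker_stores["services"], service_req),
        "resources": match_store(broker_stores["resources"], resource_req),
        "providers": match_store(broker_stores["providers"], provider_req),
    }
    if human_client:                         # step 3: trim details to suit the client
        results = {kind: [entry["name"] for entry in found]
                   for kind, found in results.items()}
    return results                           # step 4: returned to the client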
Automatic service selection, which also takes these dynamic attributes into
consideration, allows for both the selection of a single service (e.g., a
cluster) and an orchestration of services to satisfy workflow
requirements (Figure 7.5).
The SLA (service-level agreement) reached by the client and cloud service
provider specifies attributes of services that form the client's request or
workflow. This is followed by the process of service selection using Brokers.
Thus, selection is carried out automatically and transparently. In a system
comprising many clouds, the set of attributes is partitioned over many
distributed service databases, for autonomy, scalability, and performance.
7.3 RVWS DESIGN
201
The automatic selection of services is performed to optimize a function
reflecting client requirements. Time-critical and high-throughput tasks benefit
from executing a compute-intensive application on multiple clusters exposed as
services of one or many clouds.
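As a concrete (and entirely hypothetical) example of such an objective function, the sketch below scores candidate clusters by weighting idle nodes against cost; the attribute names and weights are invented for the example and are not prescribed by RVWS.

def score(cluster, weights):
    """Higher is better: reward idle nodes, penalize hourly cost (assumed attributes)."""
    return (weights["idle_nodes"] * cluster.get("idle_nodes", 0)
            - weights["cost_per_hour"] * cluster.get("cost_per_hour", 0.0))

def select_best(clusters, weights=None):
    weights = weights or {"idle_nodes": 1.0, "cost_per_hour": 10.0}
    return max(clusters, key=lambda c: score(c, weights))

candidates = [{"name": "A", "idle_nodes": 12, "cost_per_hour": 0.4},
              {"name": "B", "idle_nodes": 20, "cost_per_hour": 1.5}]
print(select_best(candidates)["name"])   # "A" under these example weights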
FIGURE 7.4. Matching parameters to attributes.
FIGURE 7.5. Dynamic discovery and selection (the client and the cloud provider negotiate an SLA, and the workflow = <Si1, Si2, Si3 ... Sin> is composed from services selected via Brokers across public clouds).
The dynamic attribute information only relates to clients that are aware
of it. Human clients know what the attributes are, owing to the section
being clearly named. Software clients designed before RVWS ignore the additional
information because they follow the parts of the WSDL schema that we have not changed.
7.4 CLUSTER AS A SERVICE: THE LOGICAL DESIGN
Simplification of the use of clusters can only be achieved through a higher layer of
abstraction, which is proposed here to be implemented using the service-based
Cluster as a Service (CaaS) Technology. The purpose of the CaaS Technology
is to ease the publication, discovery, selection, and use of existing
computational clusters.
CaaS Overview
The exposure of a cluster via a Web service is intricate and comprises several
services running on top of a physical cluster. Figure 7.6 shows the complete
CaaS technology.
A typical cluster is comprised of three elements: nodes, data storage, and
middleware. The middleware virtualizes the cluster into a single system image;
thus resources such as the CPU can be used without knowing the organization
of the cluster. Of interest to this chapter are the components that manage
the allocation of jobs to nodes (scheduler) and that monitor the activity of the
cluster (monitor). As time progresses, the amount of free memory, disk space,
and CPU usage of each cluster node changes. Information about how quickly
the scheduler can take a job and start it on the cluster also is vital in choosing a
cluster.
To make information about the cluster publishable, a Publisher Web service
and Connector were created using the RVWS framework. The purpose of the
publisher Web service was to expose the dynamic attributes of the cluster via
the stateful WSDL document. Furthermore, the Publisher service is published
to the Dynamic Broker so clients can easily discover the cluster.
To find clusters, the CaaS Service makes use of the Dynamic Broker. While
the Broker returns detailed dynamic attributes of matching services, these
results are too detailed for the CaaS Service. Thus
another role of the CaaS Service is to "summarize" the result data so that they
convey fewer details.
Ordinarily, clients could find required clusters but they still had to manually
transfer their files, invoke the scheduler, and get the results back. All three tasks
require knowledge of the cluster and are conducted using complex tools. The
role of the CaaS Service is to (i) provide easy and intuitive file transfer tools so
clients can upload jobs and download results and (ii) offer an easy to use
interface for clients to monitor their jobs.
FIGURE 7.6. Complete CaaS system.
The CaaS Service does this by
allowing clients to upload files as they would any Web page while carrying out
the required data transfer to the cluster transparently.
Because clients of the cluster cannot know how the data storage is managed,
the CaaS Service offers a simple transfer interface to clients while addressing the
transfer specifics. Finally, the CaaS Service communicates with the cluster‘s
scheduler, thus freeing the client from needing to know how the scheduler is
invoked when submitting and monitoring jobs.
Cluster Stateful WSDL Document
As stated in Section 7.4.1, the purpose of the Publisher Web service is to expose the
dynamic attributes of a cluster via a stateful WSDL document. Figure 7.7 shows
the resources section to be added to the WSDL of the Publisher Web service.
Inside the state and characteristic elements, an XML element for each cluster
node was created. The advantage of the XML structuring of our
cluster
<definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
<resources>
<resource-info resource-identifier="resourceId">
<state element-identifier="elementId">
<cluster-state element-identifier="cluster-state-root">
<cluster-node-name free-disk="" free-memory="" native-os-name=""
native-os-version="" processes-count=""
processes-running="" cpu-usage-percent=""
element-identifier="stateElementId"
memory-free-percent="" />
…Other Cluster Node State Elements…
</cluster-state>
</state>
<characteristics element-identifier="characteristicElementId">
<cluster-characteristics node-count=""
element-identifier="cluster-characteristics-root">
<cluster-node-name core-count="" core-speed="" core-speed-unit=""
hardware-architecture=""
total-disk=""
total-memory=""
total-disk-unit="" total-memory-unit=""
element-identifier="characteristicElementId" />
…Other Cluster Node Characteristic Elements…
</cluster-characteristics>
</characteristics>
</resource-info>
</resources>
<types>...
<message name="MethodSoapIn">...
<message name="MethodSoapOut">...
<portType name="CounterServiceSoap">...
<binding name="CounterServiceSoap" …>...
<wsdl:service name="CounterService">...
</wsdl:definitions>
FIGURE 7.7. Cluster WSDL.
attributes is that comparing client requirements to resource attributes only
requires using XPath queries.
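To make the XPath claim concrete, the sketch below pulls busy nodes out of a simplified, namespace-free fragment in the spirit of Figure 7.7; the element and attribute names are simplified assumptions, and Python's standard xml.etree module is used in place of a full XPath engine.

import xml.etree.ElementTree as ET

WSDL_FRAGMENT = """
<resources>
  <resource-info resource-identifier="deakin-cluster">
    <state>
      <cluster-state>
        <node name="west-03" cpu-usage-percent="12.5" free-memory="7805776"/>
        <node name="west-20" cpu-usage-percent="64.3" free-memory="12104"/>
      </cluster-state>
    </state>
  </resource-info>
</resources>
"""

root = ET.fromstring(WSDL_FRAGMENT)
# XPath-style selection of every node element, then a numeric filter in Python
# (ElementTree's XPath subset cannot compare numbers directly).
busy = [n.get("name") for n in root.findall(".//cluster-state/node")
        if float(n.get("cpu-usage-percent")) > 10]
print(busy)   # ['west-03', 'west-20']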
For the CaaS Service to properly support the role of cluster discovery,
detailed information about clusters and their nodes needs to be published to the
WSDL of the cluster and subsequently to the Broker (Table 7.1).
TABLE 7.1. Cluster Attributes

Type: Characteristics (source: cluster node, except where noted)
  core-count             Number of cores on a cluster node
  core-speed             Speed of each core
  core-speed-unit        Unit for the core speed (e.g., gigahertz)
  hardware-architecture  Hardware architecture of each cluster node (e.g., 32-bit Intel)
  total-disk             Total amount of physical storage space
  total-disk-unit        Storage amount unit (e.g., gigabytes)
  total-memory           Total amount of physical memory
  total-memory-unit      Memory amount measurement (e.g., gigabytes)
  software-name          Name of an installed piece of software
  software-version       Version of an installed piece of software
  software-architecture  Architecture of an installed piece of software
  node-count             Total number of nodes in the cluster; node count differs from
                         core-count because each node in a cluster can have many cores
                         (source: generated)

Type: State (source: cluster node, except where noted)
  free-disk              Amount of free disk space
  free-memory            Amount of free memory
  os-name                Name of the installed operating system
  os-version             Version of the running operating system
  processes-count        Number of processes
  processes-running      Number of processes running
  cpu-usage-percent      Overall percent of CPU used; as this metric is for the node
                         itself, the value is averaged over the cluster node's cores
  memory-free-percent    Amount of free memory on the cluster node (source: generated)

CaaS Service Design
The CaaS service can be described as having four main tasks: cluster discovery
and selection, result organization, job management, and file management.
Based on these tasks, the CaaS Service has been designed using
intercommunicating modules. Each module in the CaaS Service encapsulates
one of the tasks and is able to communicate with other modules to extend its
functionality.
Figure 7.8 presents the modules within the CaaS Service and illustrates the
dependencies between them. To improve the description, elements from
Figure 7.6 have been included to show what other entities are used by the
CaaS service.
The modules inside the CaaS Web service are only accessed through an
interface. The use of the interface means the Web service can be updated over
time without requiring clients to be updated or modified.
Invoking an operation on the CaaS Service Interface (discovery, etc.)
invokes operations on various modules. Thus, to best describe the role each
module plays, the following sections outline the various tasks that the CaaS
Service carries out.
Cluster Discovery. Before a client uses a cluster, the cluster must first be discovered
and selected. Figure 7.9 shows the workflow for finding a required cluster.
To start, clients submit cluster requirements in the form of attribute values to
the CaaS Service Interface (1). The requirements range from the number of
nodes in the cluster to the installed software (both operating systems and
software APIs). The CaaS Service Interface invokes the Cluster Finder module
(2) that communicates with the Dynamic Broker (3) and returns service
matches (if any).
To address the detailed results from the Broker, the Cluster Finder module
invokes the Results Organizer module (4) that takes the Broker results and
returns an organized version that is returned to the client (5-6). The organized
results inform the client which clusters satisfy the specified requirements.
After reviewing the results, the client chooses a cluster.
FIGURE 7.8. CaaS Service design.
Job Submission. After selecting a required cluster, all executables and data
files have to be transferred to the cluster and the job submitted to the scheduler
for execution. As clusters vary significantly in the software middleware used to
create them, it can be difficult to place jobs on the cluster. To do so requires
knowing how jobs are stored and how they are queued for execution on the
cluster. Figure 7.10 shows how the CaaS Service simplifies the use of a cluster to
the point where the client does not have to know about the underlying
middleware.
FIGURE 7.9. Cluster discovery.
FIGURE 7.10. Job submission.
All required data and parameters, such as estimated runtime, are uploaded to
the CaaS Service (1). Once the file upload is complete, the Job Manager is
invoked (2). It resolves the transfer of all files to the cluster by invoking the File
Manager (3) that makes a connection to the cluster storage and commences the
transfer of all files (4).
Upon completion of the transfer (4), the outcome is reported back to the Job
Manager (5). On failure, a report is sent and the client can decide on the
appropriate action to take. If the file transfer was successful, the Job Manager
invokes the scheduler on the cluster (6).
The same parameters the client gave to the CaaS Service Interface are
submitted to the scheduler; the only difference being that the Job Manager
also informs the scheduler where the job is kept so it can be started. If the
outcome of the scheduler (6) is successful, the client is then informed (7—8).
The outcome includes the response from the scheduler, the job identifier the
scheduler gave to the job, and any other information the scheduler provides.
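A rough sketch of steps 3-6 is shown below. The chapter's File and Job Managers were built with sharpSSH in .NET; the use of Python with the paramiko library, the path handling, and the qsub call against a Grid Engine style scheduler are assumptions made only to illustrate the flow.

import paramiko  # third-party SSH library, assumed available (pip install paramiko)

def submit_job(host, user, key_file, local_files, remote_dir, script_name):
    """Transfer job files to the cluster storage, then invoke the scheduler."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username=user, key_filename=key_file)
    try:
        sftp = ssh.open_sftp()                      # File Manager: steps 3-4
        for path in local_files:
            sftp.put(path, f"{remote_dir}/{path.rsplit('/', 1)[-1]}")
        sftp.close()
        # Job Manager: step 6 - tell the scheduler where the job is kept
        command = f"cd {remote_dir} && qsub {script_name}"
        _stdin, stdout, stderr = ssh.exec_command(command)
        return stdout.read().decode(), stderr.read().decode()  # e.g., the job identifier
    finally:
        ssh.close()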
Job Monitoring. During execution, clients should be able to view the
execution progress of their jobs. Even though the cluster is not owned by
the client, the job is. Thus, it is the right of the client to see how the job is
progressing and (if the client decides) terminate the job and remove it from the
cluster. Figure 7.11 outlines the workflow the client takes when querying about
job execution.
First, the client contacts the CaaS service interface (1) that invokes the Job
Manager module (2). No matter what the operation is (check, pause, or
terminate), the Job Manager only has to communicate with the scheduler (3)
and reports back a successful outcome to the client (4—5).
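The monitoring calls can be sketched in the same style, reusing a connected paramiko client from the submission sketch; qstat and qdel are the Grid Engine commands for checking and terminating a job, and their use here is an illustrative assumption rather than the chapter's implementation.

def monitor_job(ssh, job_id, action="check"):
    """Ask the scheduler about a job: check its state or terminate it (steps 2-3)."""
    commands = {"check": f"qstat -j {job_id}", "terminate": f"qdel {job_id}"}
    _stdin, stdout, stderr = ssh.exec_command(commands[action])
    return stdout.read().decode() or stderr.read().decode()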
FIGURE 7.11. Job monitoring.
FIGURE 7.12. Job result collection.
Result Collection. The final role of the CaaS Service is addressing
jobs that have terminated or completed their execution successfully. In both
cases, error or data files need to be transferred to the client. Figure 7.12 presents
the workflow and CaaS Service modules used to retrieve error or result files
from the cluster.
Clients start the error or result file transfer by contacting the CaaS Service
Interface (1) that then invokes the File Manager (2) to retrieve the files from the
cluster‘s data storage (3). If there is a transfer error, the File Manager attempts
to resolve the issue first before informing the client. If the transfer of files (3) is
successful, the files are returned to the CaaS Service Interface (4) and then the
client (5). When returning the files, a URL link or an FTP address is provided so
the client can retrieve the files.
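A matching sketch for the collection step, again assuming paramiko and a known result path on the cluster's data storage, with a simple retry loop standing in for the File Manager's attempt to resolve transfer errors before informing the client:

import time

def collect_results(ssh, remote_path, local_path, retries=3):
    """File Manager: copy the result file back, retrying on transfer errors."""
    for attempt in range(retries):
        try:
            sftp = ssh.open_sftp()
            sftp.get(remote_path, local_path)
            sftp.close()
            return local_path
        except IOError:
            time.sleep(2 ** attempt)   # back off before retrying
    raise RuntimeError(f"could not retrieve {remote_path}")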
User Interface: CaaS Web Pages
The CaaS Service has to support at least two forms of client: software clients
and human operator clients. Software clients could be other software
applications or services and thus are able to communicate with the CaaS
Service Interface directly.
For human operators to use the CaaS Service, a series of Web pages has been
designed. Each page in the series covers a step in the process of discovering,
selecting, and using a cluster. Figure 7.13 shows the Cluster Specification Web
page where clients can start the discovery of a required cluster.
In Section A the client is able to specify attributes about the required cluster.
Section B allows specifying any required software the cluster job needs.
Afterwards, the attributes are then given to the CaaS service that performs a
search for possible clusters and the results are displayed in a Select Cluster Web
page (Figure 7.14).
Next, the client goes to the job specification page, Figure 7.15. Section A
allows specifying the job. Section B allows the client to specify and upload all
data files and job executables. If the job is complex, Section B also allows
specifying a job script. Job scripts are script files that describe and manage
various stages of a large cluster job.
FIGURE 7.13. Web page for cluster specification.
FIGURE 7.14. Web page for showing matching clusters.
Section C allows specifying an estimated
time the job would take to complete.
Afterward, the CaaS Service attempts to submit the job; the outcome is
shown in the Job Monitoring page, Figure 7.16. Section A tells the client
whether the job is submitted successfully. Section B offers commands to allow
the client to take an appropriate action.
When the job is complete, the client is able to collect the results from the
Collect Results page (Figure 7.17). Section A shows the outcome of the job.
Section B allows the client to easily download the output file generated from the
completed/aborted job via HTTP or using an FTP client.
FIGURE 7.15. Web page for job specification.
FIGURE 7.16. Web page for monitoring job execution.
FIGURE 7.17. Web page for collecting result files.
7.5 PROOF OF CONCEPT
To demonstrate the RVWS framework and CaaS Technology, a proof of
concept was performed where an existing cluster was published, discovered,
selected, and used. It was expected that the existing cluster could be easily used
all through a Web browser and without any knowledge of the underlying
middleware.
CaaS Technology Implementation
The CaaS Service was implemented using Windows Communication
Foundation (WCF) of .NET 3.5, which uses Web services. An open source
library for building SSH clients in .NET (sharpSSH) [19] was used to build the
Job and File Managers. Because schedulers are mostly command driven, the
commands and outputs were wrapped into a Web service. Each module
outlined in Section 7.4.3 is implemented as its own Web service.
The experiments were carried out on a single cluster exposed via RVWS;
communication was carried out only through the CaaS Service. To manage all
the services and databases needed to expose and use clusters via Web services,
VMware virtual machines were used. Figure 7.18 shows the complete test
environment with the contents of each virtual machine. All virtual machines
have 512 MB of virtual memory and run Windows Server 2003. All virtual
machines run .NET 2.0; the CaaS virtual machine runs .NET 3.5.
FIGURE 7.18. Complete CaaS environment (a client system running a Web browser; the CaaS system VM with the CaaS Service, temporary file store, and database; the Dynamic Broker system VM; the Publisher Web Service system VM with the Connector and Publisher Web service; and the Deakin cluster).
The first virtual machine is the Publisher Web service system. It contains the
Connector, Publisher Web service [17], and all required software libraries.
The Dynamic Broker virtual machine contains the Broker and its database. The
final virtual machine is the CaaS virtual machine; it has the CaaS Service and a
temporary data store. To improve reliability, all file transfers between the
cluster and the client are cached. The client system is an Asus Notebook with 2
gigabytes of memory and an Intel Centrino Duo processor, and it runs the
Windows XP operating system.
Cluster Behind the CaaS
The cluster used in the proof of concept consists of 20 nodes plus two head
nodes (one running Linux and the other running Windows). Each node in the
cluster has two Intel Cloverton Quad Core CPUs running at 1.6 GHz,
8 gigabytes of memory, and 250 gigabytes of data storage, and all nodes are
connected via gigabit Ethernet and Infiniband. The head nodes are the same
except they have 1.2 terabytes of data storage.
In terms of middleware, the cluster was constructed using Sun GridEngine
[20], OpenMPI [21], and Ganglia [22]. GridEngine provided a high level of
abstraction where jobs were placed in a queue and then allocated to cluster
nodes based on policies. OpenMPI provided a common distributed application
API that hid the underlying communication system. Finally, Ganglia provided
easy access to current cluster node usage metrics.
Even though there is a rich set of software middleware, the use of the
middleware itself is complex and requires invocation from command line tools.
In this proof of concept, it is expected that all the listed middleware will be
abstracted so clients only see the cluster as a large supercomputer and do not
have to know about the middleware.
Experiments and Results
The first experiment was the publication of the cluster to the Publisher Web
service and the discovery of the cluster via the Dynamic Broker. For this
experiment, a gene folding application from UNAFold [23] was used. The
application was used because it had high CPU and memory demands. To keep
consistency between results from the publisher Web service and Dynamic
Broker, the cluster Connector was instructed to log all its actions to a text file for
later examination.
Figure 7.19 shows that after starting the Connector, the Connector was able
to learn of cluster node metrics from Ganglia, organize the captured Ganglia
metrics into attributes, and forward the attributes to the Publisher Web
service.
Figure 7.20 shows that the data from the Connector was also being presented
in the stateful WSDL document. As the Connector was detecting slight changes in
the cluster (created from the management services), the stateful WSDL of the
cluster Web service was requested and the same information was found in
the stateful WSDL document.
22/01/2009 1:51:52 PM-Connector[Update]:
Passing 23 attribute updates to the web service...
* Updating west-03.eit.deakin.edu.au-state in
free-memory to 7805776
* Updating west-03.eit.deakin.edu.au-state in
ready-queue-last-five-minutes to 0.00
...
Other attribute updates from various cluster nodes...
FIGURE 7.19. Connector output.
<rvwi:state rvwi:element-identifier="resource-state">
<cluster-state>
<west-03.eit.deakin.edu.au free-memory="7805776" />
...Other Cluster Node Entries...
</cluster-state>
...Rest of Stateful WSDL...
FIGURE 7.20. Updated WSDL element.
In the consistency stage, a computationally and memory-intensive job was
started on a randomly selected node and the stateful WSDL of the Publisher
Web service requested to see if the correct cluster node was updated. The
WSDL document indicated that node 20 was running the job (Figure 7.21).
This was confirmed when the output file of the Connector was examined. As the
cluster changed, both the Connector and the Publisher Web service were kept
current.
After publication, the Dynamic Broker was used to discover the newly
published Web service. A functional attribute of {main: = monitor} was
specified for the discovery. Figure 7.22 shows the Dynamic Broker discovery
results with the location of the Publisher Web service and its matching dynamic
attribute.
At this point, all the cluster nodes were being shown because no
requirements on the state or the characteristics of the cluster were specified.
The selection stage of this experiment was intended to ensure that, when given
client attribute values, the Dynamic Broker returned only matching
attributes.
For this stage, only loaded cluster nodes were required; thus a state attribute
value of {cpu_usage_percent: >10} was specified. Figure 7.23 shows the
Dynamic Broker results only indicating node 20 as a loaded cluster node.
<west-20.eit.deakin.edu.au
cpu-system-usage="1.5"
cpu-usage-percent="16.8"
free-memory="12104"
memory-free-percent="0.001489594" />
FIGURE 7.21. Loaded cluster node element.
<ArrayOfServiceMatch>
<ServiceMatch>
<Url>http://einstein/rvws/rvwi_cluster/ClusterMonitorService.asmx</Url>
<Wsdl>...Service Stateful WSDL...</Wsdl>
<Metadata>
<service-meta>
<Functionalty main="monitor" />
...Other Provider Attributes...
</service-meta>
</Metadata>
</ServiceMatch>
</ArrayOfServiceMatch>
FIGURE 7.22. Service match results from the Dynamic Broker.
<west-20.eit.deakin.edu.au
cpu-usage-percent="64.3" />
FIGURE 7.23. The only state element returned.
<west-03.eit.deakin.edu.au cpu-usage-percent="12.5" />
<west-20.eit.deakin.edu.au cpu-usage-percent="63" />
FIGURE 7.24. Cluster nodes returned from the broker.
The final test was to load yet another randomly selected cluster node. This
time, the cluster node was to be discovered using only the Dynamic Broker and
without looking at the Connector or the Publisher Web service. Once a job was
placed on a randomly selected cluster node, the Dynamic Broker was queried
with the same attribute values that generated Figure 7.23.
Figure 7.24 shows the Dynamic Broker results indicating node 3 as a loaded
cluster node. Figure 7.25 shows an excerpt from the Connector text file that
confirmed that node 3 had recently changed state.
Figure 7.26 shows the filled-in Web form from the browser. Figure 7.27
shows the outcome of our cluster discovery. This outcome is formatted like that
shown in Figure 7.14. As the cluster was now being successfully published, it
was possible to test the rest of the CaaS solution.
Because only the Deakin cluster was present, that cluster was
chosen to run our job. For our example job, we specified the script, data files,
and a desired return file.
Figure 7.28 shows the complete form. For this proof of concept, the cluster
job was simple: Run UNIX grep over a text file and return another text file with
lines that match our required pattern. While small, all the functionality of the
CaaS service is used: The script and data file had to be uploaded and then
submitted to the scheduler, and the result file had to be returned.
Once our job was specified, clicking the "Submit" button was expected to upload
the files to the CaaS virtual machine and then transfer the files to the cluster. Once
the page in Figure 7.29 was presented to us, we examined both the CaaS virtual
machine and cluster data store. In both cases, we found our script and data file.
After seeing the output of the Job Monitoring page, we contacted the cluster
and queried the scheduler to see if information on the page was correct. The job
listed on the page was given the ID of 3888, and we found the same job listed as
running with the scheduler.
One final test was seeing if the Job Monitoring Web page was able to check
the state of our job and (if finished) allow us to collect our result file. We got
confirmation that our job had completed, and we were able to proceed to the
Results Collection page.
22/01/2009 2:00:58 PM-Connector[Update]:
Passing 36 attribute updates to the web service...
* Updating west-03.eit.deakin.edu.au-state in
cpu-usage-percent to 12.5
FIGURE 7.25. Text file entry from the Connector.
FIGURE 7.26. Cluster specification.
FIGURE 7.27. Cluster selection.
FIGURE 7.28. Job specification.
FIGURE 7.29. Job monitoring.
FIGURE 7.30. Result collection.
The collection of result file(s) starts when the "Collect Results" button (shown
in Figure 7.16) is clicked. It was expected that by this time the result file would
have been copied to the CaaS virtual machine. Once the collection Web page was
displayed (Figure 7.30), we checked the virtual machine and found our results file.
FUTURE RESEARCH DIRECTIONS
In terms of future research for the RVWS framework and CaaS technology, the
fields of load management, security, and SLA negotiation are open. Load
management is a priority because loaded clusters should be able to offload their
jobs to other known clusters. In future work, we plan to expose another cluster
using the same CaaS technology and evaluate its performance with two
clusters.
At the time of writing, the Dynamic Broker within the RVWS framework
considers all published services and resources to be public: There is no support
for paid access or private services. In the future, the RVWS framework has to
be enhanced so that service providers have greater control over how services are
published and who accesses them.
SLA negotiation is also a field of interest. Currently, if the Dynamic Broker
cannot find matching services and resources, the Dynamic Broker returns no
results. To better support a service-based environment, the Dynamic Broker
needs to be enhanced to allow it to negotiate service attributes with service
providers. For example, the Dynamic Broker needs to be enhanced to try to
"barter" down the price of a possible service if it matches all other requirements.
CONCLUSION
While cloud computing has emerged as a new economical approach for
sourcing organization IT infrastructures, cloud computing is still in its infancy
and suffers from poor ease of use and a lack of service discovery. To improve
the use of clouds, we proposed the RVWS framework to improve publication,
discovery, selection, and use of cloud services and resources.
We have achieved the goal of this project through the development of a
technology for building a Cluster as a Service (CaaS) using the RVWS
framework. Through the combination of dynamic attributes, Web service
WSDL documents, and brokering, we successfully created a Web service that
quickly and easily published, discovered, and selected a cluster, allowed us to
specify and execute a job, and finally returned the result file.
The easy publication, discovery, selection, and use of the cluster are
significant outcomes because clusters are one of the most complex resources
in computing. Because we were able to simplify the use of a cluster, it is
possible to use the same approach to simplify any other form of resource from
databases to complete hardware systems. Furthermore, our proposed solution
provides a new higher level of abstraction for clouds that supports cloud users.
No matter the background of the user, all users are able to access clouds in the
same easy-to-use manner.
REFERENCES
1. M. Brock and A. Goscinski, State aware WSDL, in Sixth Australasian Symposium on Grid Computing and e-Research (AusGrid 2008), Wollongong, Australia, 82, January 2008, pp. 35-44.
2. M. Brock and A. Goscinski, Publishing dynamic state changes of resources through state aware WSDL, in International Conference on Web Services (ICWS) 2008, Beijing, September 23-26, 2008, pp. 449-456.
3. Amazon, Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2/, 1 August 2009.
4. Microsoft, Azure, http://www.microsoft.com/azure/default.mspx, 5 May 2009.
5. Google, App Engine, http://code.google.com/appengine/, 17 February 2009.
6. P. Chaganti, Cloud computing with Amazon Web services, Part 1: Introduction, updated 15 March 2009, http://www.ibm.com/developerworks/library/ar-cloudaws1/.
7. P. Chaganti, Cloud computing with Amazon Web services, Part 2: Storage in the cloud with Amazon simple storage service (S3), updated 15 March 2009, http://www.ibm.com/developerworks/library/ar-cloudaws2/.
8. P. Chaganti, Cloud computing with Amazon Web services, Part 3: Servers on demand with EC2, updated 15 March 2009, http://www.ibm.com/developerworks/library/ar-cloudaws3/.
9. Amazon, Amazon Machine Images, http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=171, 28 July 2009.
10. Amazon, Auto Scaling, http://aws.amazon.com/autoscaling/, 28 July 2009.
11. Amazon, Auto Scaling Developer Guide, updated 15 May 2009, http://docs.amazonwebservices.com/AutoScaling/latest/DeveloperGuide/, 28 July 2009.
12. Amazon, Elastic Load Balancing Developer Guide, updated 15 May 2009, http://docs.amazonwebservices.com/ElasticLoadBalancing/latest/DeveloperGuide/, 28 July 2009.
13. Amazon, Amazon EC2 Technical FAQ, http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1145, 15 May 2009.
14. A. Skonnard, A Developer's Guide to the Microsoft .NET Service Bus, December 2008.
15. M. Mealling and R. Denenberg, Uniform resource identifiers (URIs), URLs, and uniform resource names (URNs): Clarifications and recommendations, http://tools.ietf.org/html/rfc3305, 28 June 2009.
16. Salesforce.com, CRM - salesforce.com, http://www.salesforce.com/.
17. M. Brock and A. Goscinski, Supporting service oriented computing with distributed brokering and dynamic WSDL, Computing Series, Technical Report C08/05, Deakin University, 8 December 2008.
18. World Wide Web Consortium, Web Services Description Language (WSDL) Version 2.0, updated 23 May 2007, http://www.w3.org/TR/wsdl20-primer/, 21 June 2007.
19. T. Gal, sharpSSH - A secure shell (SSH) library for .NET, updated 30 October 2005, http://www.codeproject.com/KB/IP/sharpssh.aspx, 1 March 2009.
20. Sun Microsystems, GridEngine, http://gridengine.sunsource.net/, 9 March 2009.
21. Indiana University, Open MPI: Open source high performance computing, updated 14 July 2009, http://www.open-mpi.org/, 31 August 2009.
22. Ganglia, Ganglia, updated 9 September 2008, http://ganglia.info/, 3 November 2008.
23. M. Zuker and N. R. Markham, UNAFold, updated 18 January 2005, http://dinamelt.bioinfo.rpi.edu/unafold/, 1 April 2009.
24. Google, Developer's Guide - Google App Engine, http://code.google.com/appengine/docs/, 28 June 2009.
CHAPTER 8
SECURE DISTRIBUTED DATA
STORAGE IN CLOUD COMPUTING
YU CHEN, WEI-SHINN KU, JUN FENG, PU LIU, and ZHOU SU
8.1 INTRODUCTION
Cloud computing has gained great attention from both industry and academia
since 2007. With the goal of providing users more flexible services in a transparent
manner, all services are allocated in a "cloud" that actually is a collection of
devices and resources connected through the Internet. One of the core services
provided by cloud computing is data storage. This poses new challenges in
creating secure and reliable data storage and access facilities over remote service
providers in the cloud. The security of data storage is one of the necessary tasks to
be addressed before the blueprint for cloud computing is accepted.
In the past decades, data storage has been recognized as one of the main
concerns of information technology. The benefits of network-based applications
have led to the transition from server-attached storage to distributed storage.
Based on the fact that data security is the foundation of information security,
a great deal of effort has been made in the area of distributed storage
security [1-3]. However, research in cloud computing security is still in
its infancy.
One consideration is that the unique issues associated with cloud computing
security have not been recognized. Some researchers think that cloud
computing security will not be much different from existing security practices
and that the security aspects can be well-managed using existing techniques
such as digital signatures, encryption, firewalls, and/or the isolation of virtual
environments, and so on. For example, SSL (Secure Sockets Layer) is a
protocol that provides reliable secure communications on the Internet for things
such as Web browsing, e-mail, instant messaging, and other data transfers.
Another consideration is that the specific security requirements for cloud
computing have not been well-defined within the community. Cloud security is
an important area of research. Many consultants and security agencies have
issued warnings on the security threats in the cloud computing model. Besides,
potential users still wonder whether the cloud is secure. There are at least two
concerns when using the cloud. One concern is that the users do not want to
reveal their data to the cloud service provider. For example, the data could be
sensitive information like medical records. Another concern is that the users are
unsure about the integrity of the data they receive from the cloud. Therefore,
within the cloud, more than conventional security mechanisms will be required
for data security.
This chapter presents the recent research progress and some results of secure
distributed data storage in cloud computing. The rest of this chapter is
organized as follows. Section 8.2 examines the results of the migration from
traditional distributed data storage to the cloud-computing-based data storage
platform. Aside from discussing the advantages of the new technology, we also
illustrate a new vulnerability through analyzing three current commercial cloud
service platforms. Section 8.3 presents technologies for data security in cloud
computing from four different perspectives:
Database Outsourcing and Query Integrity Assurance
Data Integrity in Untrustworthy Storage
Web-Application-Based Security
Multimedia Data Security Storage
Section 8.4 discusses some open questions and existing challenges in this area
and outlines the potential directions for further research. Section 8.5 wraps up
this chapter with a brief summary.
8.2 CLOUD STORAGE: FROM LANs TO WANs
Cloud computing has been viewed as the future of the IT industry. It will be a
revolutionary change in computing services. Users will be allowed to purchase
CPU cycles, memory utilities, and information storage services conveniently,
just as they pay their monthly water and electricity bills. However, this
image will not become realistic until some challenges have been addressed. In
this section, we will briefly introduce the major differences brought by
distributed data storage in the cloud computing environment. Then, vulnerabilities
in today's cloud computing platforms are analyzed and illustrated.
Moving From LANs to WANs
Most designs of distributed storage take the form of either storage area
networks (SANs) or network-attached storage (NAS) on the LAN level, such
as the networks of an enterprise, a campus, or an organization. SANs are
constructed on top of block-addressed storage units connected through
dedicated high-speed networks. In contrast, NAS is implemented by attaching
specialized file servers to a TCP/IP network and providing a file-based interface
to client machines. For SANs and NAS, the distributed storage nodes are
managed by the same authority. The system administrator has control over
each node, and essentially the security level of data is under control. The
reliability of such systems is often achieved by redundancy, and the storage
security is highly dependent on the security of the system against the attacks
and intrusion from outsiders. The confidentiality and integrity of data are
mostly achieved using robust cryptographic schemes.
However, such a security system would not be robust enough to secure
the data in distributed storage applications at the level of wide area networks,
specifically in the cloud computing environment. The recent progress of
network technology enables global-scale collaboration over heterogeneous
networks under different authorities. For instance, in a peer-to-peer (P2P)
file sharing environment, or the distributed storage in a cloud computing
environment, the specific data storage strategy is transparent to the user.
Furthermore, there is no approach to guarantee that the data host nodes are
under robust security protection. In addition, the activity of the medium owner
is not controllable to the data owner. Theoretically speaking, an attacker can
do whatever she wants to the data stored in a storage node once the node is
compromised. Therefore, the confidentiality and the integrity of the data would
be violated when an adversary controls a node or the node administrator
becomes malicious.
Existing Commercial Cloud Services
As shown in Figure 8.1, data storage services on the platform of cloud computing
are fundamentally provided by applications/software based on the Internet.
Although the definition of cloud computing is not clear yet, several pioneer
commercial implementations have been constructed and opened to the public,
such as Amazon's Computer Cloud AWS (Amazon Web Services), the
Microsoft Azure Service Platform, and the Google App Engine (GAE).
In normal network-based applications, user authentication, data
confidentiality, and data integrity can be solved through IPSec proxy using
encryption and digital signature. The key exchanging issues can be solved by
SSL proxy. These methods have been applied to today's cloud computing to
secure the data on the cloud and also to secure the communication of data to
and from the cloud. The service providers claim that their services are secure.
This section describes three secure methods used in three commercial cloud
services and discusses their vulnerabilities.
Amazon’s Web Service. Amazon provides Infrastructure as a Service (IaaS)
with different terms, such as Elastic Compute Cloud (EC2), SimpleDB, Simple
Storage Service (S3), and so on. They are supposed to ensure the
confidentiality, integrity, and availability of the customers' applications and data.
FIGURE 8.1. Illustration of cloud computing principle.
FIGURE 8.2. AWS data processing procedure.
Figure 8.2 presents one of the data processing methods adopted in Amazon's AWS,
which is used to transfer large amounts of data between the AWS cloud and
portable storage devices.
When the user wants to upload the data, he/she stores some parameters such
as AccessKeyID, DeviceID, Destination, and so on, into an import metadata
file called the manifest file and then signs the manifest file and e-mails the signed
manifest file to Amazon. Another metadata file named the signature file is used
by AWS to describe the cipher algorithm that is adopted to encrypt the job
ID and the bytes in the manifest file. The signature file can uniquely identify
and authenticate the user request. The signature file is attached with the storage
device, which is shipped to Amazon for efficiency. On receiving the storage
device and the signature file, the service provider will validate the signature in
the device with the manifest file sent through the email. Then, Amazon will email management information back to the user including the number of bytes
saved, the MD5 of the bytes, the status of the load, and the location on the
Amazon S3 of the AWS Import—Export Log. This log contains details about
the data files that have been uploaded, including the key names, number of
bytes, and MD5 checksum values.
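The MD5 values reported back by Amazon can be checked locally. The sketch below recomputes the checksum of each uploaded file and compares it with the value taken from the e-mailed log; the log structure used here is an assumption, and only the use of MD5 checksums comes from the procedure above.

import hashlib

def md5_hex(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_against_log(log_entries):
    """log_entries: {key_name: md5_from_aws_log}; returns the names that mismatch."""
    return [name for name, expected in log_entries.items()
            if md5_hex(name) != expected]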
The downloading process is similar to the uploading process. The user
creates a manifest and signature file, e-mails the manifest file, and ships the
storage device attached with signature file. When Amazon receives these two
files, it will validate the two files, copy the data into the storage device, ship it
back, and e-mail the user with the status, including the MD5 checksum of the
data. Amazon claims that the maximum security is obtained via SSL endpoints.
Microsoft Windows Azure. The Windows Azure Platform (Azure) is an
Internet-scale cloud services platform hosted in Microsoft data centers, which
provides an operating system and a set of developer services that can be used
individually or together. The platform also provides a scalable storage service
with three basic data items: blobs (up to 50 GB each), tables, and queues
(messages under 8 KB). Based on the blob, table, and queue structures of Azure
Storage, Microsoft promises to achieve confidentiality of the users' data. The
procedure shown in Figure 8.3 provides security for data access to ensure that
the data will not be lost.
FIGURE 8.3. Security data access procedure. (The user creates an account, gets the secret key, and creates a signature; data is then PUT to and GET from the cloud storage together with its MD5 checksum.)
PUT http://jerry.blob.core.windows.net/movie/mov.avi?comp=block&blockid=BlockId1&timeout=30 HTTP/1.1
Content-Length: 2174344
Content-MD5: FJXZLUNMuI/KZ5KDcJPcOA==
Authorization: SharedKey jerry:F5a+dUDvef+PfMb4T8Rc2jHcwfK58KecSZY+l2naIao=
x-ms-date: Sun, 13 Sept 2009 22:30:25 GMT
x-ms-version: 2009-04-14

GET http://jerry.blob.core.windows.net/movies/mov.avi HTTP/1.1
Authorization: SharedKey jerry:ZF3lJMtkOMi4y/nedSk5Vn74IU6/fRMwiPsL+uYSDjY=
x-ms-date: Sun, 13 Sept 2009 22:40:34 GMT
x-ms-version: 2009-04-14

FIGURE 8.4. Example of a REST request.
To use the Windows Azure Storage service, a user needs to create a storage
account, which can be obtained from the Windows Azure portal web interface.
After creating an account, the user receives a 256-bit secret key. Each time
the user wants to send data to or fetch data from the cloud, the user has to use
this secret key to create an HMAC-SHA256 signature for each individual
request. The signature is passed with each request, and the server authenticates
the user request by verifying the HMAC signature.
The example in Figure 8.4 is a REST request for a PUT/GET block
operation. Content-MD5 checksums can be provided to guard against
network transfer errors and to check data integrity. The Content-MD5 checksum
in the PUT is the MD5 checksum of the data block in the request. The MD5
checksum is checked on the server; if it does not match, an error is returned.
The Content-Length header specifies the size of the data block contents. There is
also an Authorization header inside the HTTP request header, as shown in
Figure 8.4.
In addition, if the Content-MD5 request header was set when the blob
was uploaded, it will be returned in the response header, so the user can check
message content integrity. A secure HTTP connection is also used to protect
data integrity in transit.
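The shared-key scheme can be sketched as follows. The string-to-sign below is deliberately simplified; the real Azure SharedKey scheme canonicalizes many more headers and the resource path, so treat this only as an illustration of HMAC-SHA256 request signing with the 256-bit account key, using hypothetical values.

# Simplified sketch of shared-key request signing with the 256-bit account key.
# The real Azure string-to-sign canonicalizes more headers and the resource path.
import base64, hashlib, hmac
from datetime import datetime, timezone

account_key = b"\x01" * 32                 # stand-in for the secret key from the portal
block = b"...block contents..."

content_md5 = base64.b64encode(hashlib.md5(block).digest()).decode()
x_ms_date = datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")

# Fields the server can re-derive and verify against the signature.
string_to_sign = "\n".join(["PUT", content_md5, x_ms_date, "/jerry/movie/mov.avi"])
mac = hmac.new(account_key, string_to_sign.encode(), hashlib.sha256).digest()

headers = {
    "Content-MD5": content_md5,
    "x-ms-date": x_ms_date,
    "Authorization": "SharedKey jerry:" + base64.b64encode(mac).decode(),
}
print(headers)

The server repeats the same computation with its copy of the key; a mismatch in either the HMAC or the Content-MD5 causes the request to be rejected.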
Google App Engine (GAE). The Google App Engine (GAE) provides a
powerful distributed data storage service that features a query engine and
transactions. An independent third-party auditor issued Google Apps an
unqualified SAS 70 Type II certification, attesting that GAE can be secure under
the SAS 70 auditing industry standard.
FIGURE 8.5. Illustration of the Google SDC working flow. (A Secure Data Connector behind the corporate firewall, and optionally an additional firewall, opens an encrypted SDC tunnel across the Internet to Google's tunnel servers, which front Google Apps and the Web service/API servers.)
However, GAE's low-level on-line storage API documentation describes only
basic functions such as GET and PUT and does not address how the storage
service itself is secured. Based on the security methods adopted by the other
services, the security of data storage is assumed to be guaranteed by techniques
such as SSL links.
Figure 8.5 illustrates one of GAE's secure services, the Google Secure Data
Connector (SDC). The SDC constructs an encrypted connection between the
data source and Google Apps; the data source sits inside the corporate network,
and the SDC connects it to the Google tunnel protocol servers in the Google
Apps domain. When the user wants to get the data, he or she first sends an
authorized data request to Google Apps, which forwards the request to the
tunnel server. The tunnel servers validate the request identity. If the identity is
valid, the tunnel protocol allows the SDC to set up a connection, authenticate,
and encrypt the data that flows across the Internet. At the same time, the SDC
uses resource rules to validate whether a user is authorized to access a specified
resource. When the request is valid, the SDC performs a network request. The
server validates the signed request, checks the credentials, and returns the data
if the user is authorized.
The SDC and tunnel server act like a proxy to encrypt connectivity
between Google Apps and the internal network. Moreover, for additional security,
the SDC uses signed requests to add authentication information to requests
that are made through the SDC. In a signed request, the user has to submit
identification information including the owner_id, viewer_id, instance_id,
app_id, public_key, consumer_key, nonce, token, and signature within the
request to ensure the integrity, security, and privacy of the request.
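The signed-request idea can be illustrated with a short sketch. The parameter names follow the list above, but the HMAC-SHA1 base string and the nonce handling are simplifications of the OAuth-style signing actually used, so this is an assumption-laden illustration rather than the real SDC wire format.

# Sketch of verifying an HMAC-signed request carrying identity parameters.
# Simplified stand-in for the OAuth-style signing used by SDC signed requests.
import hashlib, hmac, time

consumer_secret = b"shared-consumer-secret"   # assumed shared secret
seen_nonces = set()                           # replay protection

def sign(params: dict) -> str:
    base = "&".join(f"{k}={params[k]}" for k in sorted(params) if k != "signature")
    return hmac.new(consumer_secret, base.encode(), hashlib.sha1).hexdigest()

def verify(params: dict) -> bool:
    if params["nonce"] in seen_nonces:        # reject replayed requests
        return False
    seen_nonces.add(params["nonce"])
    return hmac.compare_digest(params["signature"], sign(params))

request = {"owner_id": "alice", "viewer_id": "bob", "instance_id": "1",
           "app_id": "app-42", "consumer_key": "gadget-key",
           "nonce": str(time.time_ns()), "token": "t-123"}
request["signature"] = sign(request)
print("request accepted:", verify(request))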
Vulnerabilities in Current Cloud Services
The previous subsections describe the secure data storage schemes of three
different commercial cloud computing services. Storage services that accept a
large amount of data (>1 TB) normally adopt strategies that make physical
shipment more convenient, as Amazon AWS does. In contrast, services that only
accept a smaller amount of data (up to 50 GB) allow the data to be uploaded or
downloaded via the Internet, as the Azure Storage Service does. To provide
data integrity, the Azure Storage Service stores the MD5 checksum of the
uploaded data in its database and returns it to the user when the user retrieves
the data. Amazon AWS computes the MD5 checksum of the data and e-mails
it to the user for integrity checking. The SDC is GAE's attempt to strengthen
Internet authentication using signed requests. If these services are grouped
together, the following scheme can be derived.
As shown in Figure 8.6, when user 1 stores data in the cloud, she can ship or
send the data to the service provider with checksum MD5_1. If the data are
transferred through the Internet, a signed request can be used to ensure the
privacy, security, and integrity of the data. When the service provider receives
the data and the MD5 checksum, it stores the data with the corresponding
checksum (MD5_1). When the service provider gets a verified request to
retrieve the data from another user or the original user, it sends or ships the
data with an MD5 checksum to that user. On the Azure platform, the original
checksum MD5_1 is sent; in contrast, a re-computed checksum MD5_2 is sent
on Amazon's AWS.
The procedure is secure for each individual session. The integrity of the data
during transmission can be guaranteed by the SSL protocol. However, from the
perspective of cloud storage services, data integrity also depends on the security
of the operations performed while the data are in storage, in addition to the
security of the uploading and downloading sessions. The uploading session can
only ensure that the data received by the cloud storage are the data that the user
uploaded; the downloading session can only guarantee that the data the user
retrieved are the data the cloud storage recorded. Unfortunately, this procedure,
as applied in current cloud storage services, cannot guarantee data integrity.
To illustrate this, let's consider the following two scenarios. First, assume
that Alice, a company CFO, stores the company financial data at a cloud
storage service provided by Eve. Bob, the company administration chairman,
then downloads the data from the cloud. There are three important concerns
in this simple procedure:
FIGURE 8.6. Illustration of the potential integrity problem. (USER 1 uploads data to the cloud service with checksum MD5_1; the cloud service later returns the data to USER 2 with MD5_1 or a re-computed MD5_2.)
1. Confidentiality. Eve is considered an untrustworthy third party; Alice
and Bob do not want to reveal the data to Eve.
2. Integrity. As the administrator of the storage service, Eve has the
capability to tamper with the data in her possession. How can Bob be confident
that the data he fetched from Eve are the same as what was sent by Alice?
Are there any measures to guarantee that the data have not been
tampered with by Eve?
3. Repudiation. If Bob finds that the data have been tampered with, is there
any evidence for him to demonstrate that it is Eve who should be
responsible for the fault? Similarly, Eve also needs certain evidence to
prove her innocence.
Recently, a potential customer asked a question on a cloud mailing list
regarding data integrity and service reliability. The reply from the developer
was, "We won't lose your data—we have a robust backup and recovery strategy—
but we're not responsible for you losing your own data . . ." Obviously, such a
reply does little to make a potential customer confident in the service.
The repudiation issue opens a door for potential blackmail when the user
is malicious. Let's assume that Alice wants to blackmail Eve, a cloud storage
service provider who claims that data integrity is one of its key features. For
that purpose, Alice stores some data in the cloud and later downloads it. She
then reports that her data were corrupted and that it is the fault of the storage
provider, and claims compensation for her so-called loss. How can the service
provider demonstrate its innocence?
Confidentiality can be achieved by adopting robust encryption schemes.
However, the integrity and repudiation issues are not handled well on current
cloud service platforms. A one-way SSL session only guarantees one-way
integrity, and one critical link is missing between the uploading and downloading
sessions: there is no mechanism for the user or the service provider to check
whether the record has been modified while in cloud storage. This vulnerability
leads to the following questions:
● Upload-to-Download Integrity. Since integrity in the uploading and
downloading phases is handled separately, how can the user or provider
know that the data retrieved from the cloud are the same data that the user
uploaded previously?
● Repudiation Between Users and Service Providers. When data errors
happen without transmission errors in the uploading and downloading
sessions, how can the user and service provider prove their innocence?
Bridging the Missing Link
This section presents several simple ideas to bridge the missing link based on
digital signatures and authentication coding schemes. Depending on whether
there is a third authority certified (TAC) by both the user and the provider, and
on whether the user and provider use a secret key sharing technique (SKS),
there are four solutions for bridging the missing link of data integrity between
the uploading and downloading procedures. Other digital signature technologies
can also be adopted to fix this vulnerability with different approaches.
Neither TAC nor SKS.
Uploading Session
1. User: Sends data to service provider with MD5 checksum and MD5
Signature by User (MSU).
2. Service Provider: Verifies the data with MD5 checksum, if it is valid, the
service provider sends back the MD5 and MD5 Signature by Provider
(MSP) to user.
3. MSU is stored at the service provider's side, and MSP is stored at the user's
side.
Once the uploading operation has finished, both sides agree on the integrity of
the uploaded data, and each side owns the MD5 checksum and the MD5
signature generated by the opposite side.
Downloading Session
1. User: Sends request to service provider with authentication code.
2. Service Provider: Verifies the request identity, if it is valid, the service
provider sends back the data with MD5 checksum and MD5 Signature by
Provider (MSP) to user.
3. User verifies the data using the MD5 checksum.
When a dispute arises, the user or the service provider can check the
MD5 checksum and the signature of the MD5 checksum generated by the
opposite side to prove its innocence. However, some special cases exist.
When the service provider is trustworthy, only MSU is needed; when the user is
trustworthy, only MSP is needed; if each of them trusts the other side, neither
MSU nor MSP is needed. The last case is in fact the current method adopted in
cloud computing platforms. Essentially, this approach implies that trust is
established once the identity is authenticated.
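A brief sketch makes the exchange concrete. RSA PKCS#1 v1.5 signatures over the MD5 digest are an illustrative choice (any digital signature scheme would serve), and the third-party cryptography package is assumed to be available.

# Sketch of the "neither TAC nor SKS" exchange: each side signs the MD5 digest
# and hands its signature to the other side for later dispute resolution.
# Requires the 'cryptography' package; RSA is only an illustrative choice.
import hashlib
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

user_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
provider_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

data = b"dataset to be outsourced"
md5 = hashlib.md5(data).digest()

# Uploading session.
msu = user_key.sign(md5, padding.PKCS1v15(), hashes.SHA256())      # MD5 Signature by User
# ... the provider recomputes and checks the MD5 of the received data, then answers:
msp = provider_key.sign(md5, padding.PKCS1v15(), hashes.SHA256())  # MD5 Signature by Provider
# The provider keeps MSU; the user keeps MSP.

# Dispute resolution: the user presents MSP to show what the provider acknowledged.
provider_key.public_key().verify(msp, md5, padding.PKCS1v15(), hashes.SHA256())
print("MSP verifies: the provider acknowledged this MD5 checksum")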
With SKS but without TAC.
Uploading Session
1. User: Sends data to the service provider with an MD5 checksum.
2. Service Provider: Verifies the data with MD5 checksum, if it is valid, the
service provider sends back the MD5 checksum.
3. The service provider and the user share the MD5 checksum with SKS.
Then, both sides agree on the integrity of the uploaded data, and they share the
agreed MD5 checksum via SKS, which is used when a dispute arises.
Downloading Session
1. User: Sends request to the service provider with authentication code.
2. Service Provider: Verifies the request identity, if it is valid, the service
provider sends back the data with MD5 checksum.
3. User verifies the data through the MD5 checksum.
When a dispute arises, the user and the service provider can combine their
shares of the MD5 checksum, recover the agreed value, and prove their innocence.
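The chapter does not prescribe a particular SKS construction; the simplest possible illustration is a two-party XOR split of the agreed MD5 checksum, sketched below.

# Two-party XOR secret sharing of the agreed MD5 checksum (illustrative SKS).
import hashlib, secrets

md5 = hashlib.md5(b"uploaded dataset").digest()

share_user = secrets.token_bytes(len(md5))                      # random share kept by the user
share_provider = bytes(a ^ b for a, b in zip(md5, share_user))  # complementary share for the provider

# Neither share alone reveals the checksum; during a dispute the two shares
# are combined to recover the agreed value.
recovered = bytes(a ^ b for a, b in zip(share_user, share_provider))
assert recovered == md5
print("recovered MD5 checksum:", recovered.hex())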
With TAC but without SKS.
Uploading Session
1. User: Sends data to the service provider along with MD5 checksum and
MD5 Signature by User (MSU).
2. Service Provider: Verifies the data with MD5 checksum, if it is valid, the
service provider sends back the MD5 checksum and MD5 Signature by
Provider (MSP) to the user.
3. MSU and MSP are sent to TAC.
On finishing the uploading phase, both sides agree on the integrity of the
uploaded data, and TAC owns their agreed MD5 signature.
Downloading Session
1. User: Sends request to the service provider with authentication code.
2. Service Provider: Verifies the request identity; if it is valid, the service
provider sends back the data with the MD5 checksum.
3. User verifies the data through the MD5 checksum.
When a dispute arises, the user or the service provider can prove its
innocence by presenting the MSU and MSP stored at the TAC.
Similarly, there are some special cases. When the service provider is
trustworthy, only the MSU is needed; when the user is trustworthy, only the
MSP is needed; if each of them trusts the other, the TAC is not needed. Again,
the last case is the method adopted in current cloud computing platforms:
when the identity is authenticated, trust is established.
With Both TAC and SKS.
Uploading Session
1. User: Sends data to the service provider with MD5 checksum.
2. Service Provider: Verifies the data with the MD5 checksum.
3. Both the user and the service provider send the MD5 checksum to the TAC.
4. The TAC verifies the two MD5 checksum values. If they match, the TAC
distributes the MD5 checksum to the user and the service provider by SKS.
Both sides thus agree on the integrity of the uploaded data and share the same
MD5 checksum by SKS, and the TAC holds their agreed MD5 checksum.
Downloading Session
1. User: Sends request to the service provider with authentication code.
2. Service Provider: Verifies the request identity, if it is valid, the service
provider sends back the data with MD5 checksum.
3. User verifies the data through the MD5 checksum.
When a dispute arises, the user and the service provider can prove their
innocence by checking the shared MD5 checksum together. If the dispute
cannot be resolved, they can seek further help from the TAC, which holds the
MD5 checksum.
Again there are special cases. When the service provider is trustworthy, only
the user needs the MD5 checksum; when the user is trustworthy, only the
service provider needs the MD5 checksum; if both of them can be trusted, the
TAC is not needed. The last case is the method used in current cloud computing
platforms.
8.3 TECHNOLOGIES FOR DATA SECURITY IN CLOUD COMPUTING
This section presents several technologies for data security and privacy in cloud
computing. Focusing on the unique issues of the cloud data storage platform,
this section does not repeat the normal approaches that provide confidentiality,
integrity, and availability in distributed data storage applications. Instead, we
illustrate the unique requirements for cloud computing data security from a few
different perspectives:
● Database Outsourcing and Query Integrity Assurance. Researchers have
pointed out that storing data into and fetching data from devices
and machines behind a cloud are essentially a novel form of database
outsourcing. Section 8.3.1 introduces the technologies of database
outsourcing and query integrity assurance on the cloud computing
platform.
● Data Integrity in Untrustworthy Storage. One of the main challenges that
prevent end users from adopting cloud storage services is the fear of
losing data or having data corrupted. It is critical to relieve the users' fear
by providing technologies that enable users to check the integrity of their
data. Section 8.3.2 presents two approaches that allow users to detect
whether the data have been touched by unauthorized people.
● Web-Application-Based Security. Once the dataset is stored remotely, a
Web browser is one of the most convenient approaches that end users can
use to access their data on remote services. In the era of cloud computing,
Web security plays a more important role than ever. Section 8.3.3
discusses the most important concerns in Web security and analyzes a
couple of widely used attacks.
● Multimedia Data Security. With the development of high-speed network
technologies and large-bandwidth connections, more and more
multimedia data are being stored and shared in cyberspace. The security
requirements for video, audio, pictures, and images are different from
those of other applications. Section 8.3.4 introduces the requirements for
multimedia data security in the cloud.
8.3.1 Database Outsourcing and Query Integrity Assurance
In recent years, database outsourcing has become an important component of
cloud computing. Due to rapid advancements in network technology, the
cost of transmitting a terabyte of data over long distances has decreased
significantly in the past decade. In addition, the total cost of data management
is five to ten times higher than the initial acquisition cost. As a result, there is
growing interest in outsourcing database management tasks to third parties
that can perform them at a much lower cost due to economies of scale. This
new outsourcing model has the benefits of reducing the cost of running a
Database Management System (DBMS) independently and enabling
enterprises to concentrate on their main businesses. Figure 8.7 shows
the general architecture of a database outsourcing environment with clients.
The database owner outsources its data management tasks, and clients send
queries to the untrusted service provider. Let T denote the data to be outsourced.
The data T are preprocessed, encrypted, and stored at the service provider. To
evaluate queries, a user rewrites a set of queries Q against T into queries against
the encrypted database.
The outsourcing of databases to a third-party service provider was first
introduced by Hacigümüş et al. Generally, there are two security concerns
FIGURE 8.7. The system architecture of database outsourcing. (The database owner applies dataTransform(T) to its database DB and stores the transformed data at the service provider; clients apply queryRewrite(Q) to their queries and receive the query results from the service provider.)
in database outsourcing. These are data privacy and query integrity. The
related research is outlined below.
Data Privacy Protection. Hacigümüş et al. [37] proposed a method to execute
SQL queries over encrypted databases. Their strategy is to process as much of a
query as possible at the service provider, without having to decrypt the data.
Decryption and the remainder of the query processing are performed at the client
side. Agrawal et al. [14] proposed an order-preserving encryption scheme for
numeric values that allows any comparison operation to be applied directly on
encrypted data. Their technique can handle updates, and new values can be
added without requiring changes in the encryption of other values. Generally,
existing methods enable direct execution of encrypted queries on encrypted
datasets and allow users to ask identity queries over data encrypted under
different schemes. The ultimate goal of this research direction is to make queries
on encrypted databases as efficient as possible while preventing adversaries from
learning any useful knowledge about the data. However, research in this field
has not considered the problem of query integrity.
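As a rough illustration of evaluating queries without decryption (this is neither the bucketization of Hacigümüş et al. nor the order-preserving encryption of Agrawal et al.), a deterministic keyed token can be stored next to each encrypted value so that equality predicates are checked by the server on opaque values alone:

# Equality search over outsourced data via deterministic keyed tokens (sketch).
# Illustrates pushing predicate evaluation to the untrusted server; real schemes
# also store a ciphertext column produced by a proper encryption algorithm.
import hashlib, hmac

index_key = b"client-side indexing key"            # never leaves the data owner

def token(value: str) -> str:
    return hmac.new(index_key, value.encode(), hashlib.sha256).hexdigest()

# The server stores only the opaque tokens.
server_rows = [{"name_token": token(name)} for name in ["alice", "bob", "carol"]]

# The client rewrites SELECT ... WHERE name = 'bob' into a token lookup.
wanted = token("bob")
matches = [row for row in server_rows if row["name_token"] == wanted]
print("matching rows:", len(matches))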
Query Integrity Assurance. In addition to data privacy, an important
security concern in the database outsourcing paradigm is query integrity.
Query integrity examines the trustworthiness of the hosting environment.
When a client receives a query result from the service provider, it wants to be
assured that the result is both correct and complete, where correct means that
the result originates in the owner's data and has not been tampered with,
and complete means that the result includes all records satisfying the query.
Devanbu et al. [15] authenticate data records using the Merkle hash tree [16],
based on the idea of using a signature on the root of the Merkle hash
tree to generate a proof of correctness. Mykletun et al. [17] studied and
compared several signature methods that can be utilized in data authentication;
they identified the problem of completeness but did not provide a solution.
Pang et al. [18] utilized an aggregated signature to sign each record with
information from neighboring records, assuming that all the records are
sorted in a certain order. The method ensures the completeness of a selection
query by checking the aggregated signature, but it has difficulty handling
multipoint selection queries whose result tuples occupy a noncontinuous
region of the ordered sequence.
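A minimal Merkle hash tree makes the correctness idea concrete: the owner signs only the root, and the server returns each result record together with the sibling hashes needed to recompute that root. The sketch below omits the root signature and the completeness problem discussed next.

# Minimal Merkle hash tree: the owner publishes (or signs) only the root;
# a record plus its sibling hashes lets a client recompute and check that root.
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def build_tree(leaves):
    levels = [[h(x) for x in leaves]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:                        # duplicate last node on odd levels
            cur = cur + [cur[-1]]
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels

def proof(levels, idx):
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = idx ^ 1
        path.append((level[sibling], sibling < idx))   # (hash, sibling is on the left)
        idx //= 2
    return path

def verify(record, path, root):
    node = h(record)
    for sibling, is_left in path:
        node = h(sibling + node) if is_left else h(node + sibling)
    return node == root

records = [b"r1", b"r2", b"r3", b"r4", b"r5"]
tree = build_tree(records)
root = tree[-1][0]                               # the owner signs this value
print(verify(records[2], proof(tree, 2), root))  # True: r3 is authentic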
The work of Li et al. [19] utilizes Merkle hash tree-based methods to audit
the completeness of query results, but since these methods also rely on the
signature of the root Merkle tree node, a similar difficulty exists. Besides,
the network and CPU overhead on the client side can be prohibitively high for
some types of queries. In extreme cases, the overhead could be as high as
processing the queries locally, which undermines the benefits of database
outsourcing. Sion [20] proposed a mechanism called the challenge token, used
as a probabilistic proof that the server has executed the query over the
entire database. It can handle arbitrary types of queries, including joins, and
does not assume that the underlying data are ordered. However, the approach
does not apply to the adversary model in which an adversary first computes
the complete query result and then deletes the tuples specifically corresponding
to the challenge tokens [21]. Moreover, all the aforementioned methods must
modify the DBMS kernel in order to provide proof of integrity.
Recently, Wang et al. [22] proposed a solution named dual encryption to ensure
query integrity without requiring the database engine to perform any special
function beyond query processing. Dual encryption enables cross-examination of
the outsourced data, which consist of (a) the original data stored under a certain
encryption scheme and (b) a small percentage of the original data stored
under a different encryption scheme. Users generate queries against the additional
piece of data and analyze the results to obtain integrity assurance.
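A toy version of this cross-examination idea is sketched below: the owner keeps a secret random sample of the rows under a second key and checks that query results over the primary copy never miss rows known to be in the sample. Keyed tokens stand in for the two encryption schemes, and plaintext rows are kept alongside them only so the toy predicate can run; this is an illustration, not the scheme of [22].

# Toy cross-examination in the spirit of dual encryption: audit primary-copy
# query results against a secret sample kept under a second key.
import hashlib, hmac, random

k1, k2 = b"primary-key", b"sample-key"
tok = lambda key, row: hmac.new(key, row.encode(), hashlib.sha256).hexdigest()

rows = [f"user{i}:dept{i % 3}" for i in range(100)]
primary = {tok(k1, r): r for r in rows}                      # full outsourced copy
sample = {tok(k2, r): r for r in random.sample(rows, 10)}    # secret 10% replica

def server_select(store, predicate):
    # The untrusted server evaluates the query; a cheating server might drop rows.
    return {t for t, r in store.items() if predicate(r)}

pred = lambda r: r.endswith("dept1")
res_primary = server_select(primary, pred)
res_sample = server_select(sample, pred)

# Audit: every sampled row matching the predicate must appear in the primary
# result set once the client re-tokenizes it under k1.
missing = [sample[t] for t in res_sample if tok(k1, sample[t]) not in res_primary]
if missing:
    print("integrity violation, rows dropped by the server:", missing)
else:
    print("primary results consistent with the secret sample")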
For auditing spatial queries, Yang et al. [23] proposed the MR-tree, an
authenticated data structure suitable for verifying queries executed on
outsourced spatial databases. The authors also designed a caching technique
to reduce the information sent to the client for verification purposes. Four
spatial transformation mechanisms are presented by Yiu et al. [24] for
protecting the privacy of outsourced private spatial data. The data owner selects
transformation keys that are shared with trusted clients, and it is infeasible
to reconstruct the exact original data points from the transformed points
without the key. However, neither of these works considers data privacy
protection and query integrity auditing jointly in its design. The state-of-the-art
technique that can ensure both privacy and integrity for outsourced spatial data
was proposed by Ku et al. In particular, the solution first employs a one-way
spatial transformation method based on Hilbert curves, which encrypts the
spatial data before outsourcing and hence ensures its privacy. Next, by
probabilistically replicating a portion of the data and encrypting it with a
different encryption key, the authors devise a mechanism for the client to audit
the trustworthiness of the query results.
8.3.2 Data Integrity in Untrustworthy Storage
While the transparent cloud provides flexible utilization of network-based
resources, the fear of losing control over their data is one of the major concerns
that prevent end users from migrating to cloud storage services. There is indeed
a potential risk that storage infrastructure providers become self-interested,
untrustworthy, or even malicious. There are various motivations for a storage
service provider to become untrustworthy: for instance, to cover up the
consequences of an operational mistake, or to deny a vulnerability in the
system after the data have been stolen by an adversary. This section introduces
two technologies that enable data owners to verify data integrity while their
files are stored in remote untrustworthy storage services.
Even before the term "cloud computing" appeared, several remote data
storage checking protocols had been suggested [25, 26]. Later research
summarized that, in practice, a remote data
possession checking protocol has to satisfy the following five requirements [27].
Note that the verifier could be either the data owner or a trusted third party,
and the prover could be the storage service provider, the storage medium owner,
or a system administrator.
● Requirement #1. It should not be a precondition that the verifier possesses
a complete copy of the data to be checked. In practice, it does not make
sense for a verifier to keep a duplicate copy of the content to be verified;
storing a concise digest of the data at the verifier should be enough.
● Requirement #2. The protocol has to be robust against an untrustworthy
prover. A malicious prover is motivated to hide any violation of data
integrity; the protocol should be robust enough that such a prover fails to
convince the verifier.
● Requirement #3. The amount of information exchanged during the
verification operation should not lead to high communication overhead.
● Requirement #4. The protocol should be computationally efficient.
● Requirement #5. It ought to be possible to run the verification an
unlimited number of times.
A PDP-Based Integrity Checking Protocol. Ateniese et al. [28] proposed a
protocol based on provable data possession (PDP), which allows users to obtain
a probabilistic proof from the storage service provider. Such a proof is used as
evidence that their data have been stored there. One advantage of this protocol
is that the proof can be generated by the storage service provider by accessing
only a small portion of the whole dataset. At the same time, the amount of
metadata that end users are required to store is also small, that is, O(1).
Additionally, exchanging such a small amount of data lowers the overhead in
the communication channels.
Figure 8.8 presents the flowcharts of the protocol for provable data
possession [28]. The data owner, the client in the figure, executes the protocol
to verify that a dataset is stored in an outsourced storage machine as a
collection of n blocks. Before uploading the data into the remote storage, the
data owner pre-processes the dataset and generates a piece of metadata. The
metadata are stored at the data owner's side, and the dataset is transmitted
to the storage server. The cloud storage service stores the dataset and sends
data to the user in response to future queries from the data owner.
As part of the pre-processing procedure, the data owner (client) may conduct
operations on the data such as expanding the data or generating additional
metadata to be stored at the cloud server side. The data owner can execute
the PDP protocol before the local copy is deleted to ensure that the uploaded
copy has been stored at the server machines successfully. The data owner may
also encrypt the dataset before transferring it to the storage machines.
FIGURE 8.8. Protocol for provable data possession [28]. ((a) Pre-process and store: the client generates metadata m and a modified file F' from the input file F; F' is stored at the server with no server processing, while m is kept at the client. (b) Verify server possession: (1) the client generates a random challenge R; (2) the server computes a proof of possession P over the stored F'; (3) the client verifies the server's proof, outputting 0 or 1.)
While the data are stored in the cloud, the data owner can generate a
"challenge" and send it to the service provider to ensure that the storage server
has stored the dataset. The data owner requests that the storage server generate
metadata based on the stored data and send it back. Using the previously
stored local metadata, the owner verifies the response.
On the cloud service provider's side, the server may receive multiple
challenges from different users at the same time. For the sake of availability,
it is highly desirable to minimize not only the computational
overhead of each individual calculation, but also the number of data blocks to
be accessed. In addition, considering the pressure on the communication
networks, minimal bandwidth consumption also implies that only a limited
amount of metadata should be included in the response generated by the server.
In the protocol shown in Figure 8.8, the PDP scheme randomly accesses only a
subset of the data blocks when sampling the stored dataset [28]. Hence, the PDP
scheme probabilistically guarantees data integrity. Accessing the whole dataset
is mandatory if the user requires a deterministic guarantee.
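A much-simplified spot-checking scheme conveys the flavor of this probabilistic guarantee while avoiding the homomorphic RSA tags of the actual PDP construction [28]: the owner keeps only a secret key and stores keyed per-block tags with the data, then challenges random block indices.

# Simplified probabilistic spot-check (not the homomorphic-tag PDP of [28]):
# the client keeps only a secret key; per-block keyed tags travel with the data.
import hashlib, hmac, random

key = b"data-owner-secret"                      # the only client-side metadata, O(1)

def tag(i: int, block: bytes) -> bytes:
    return hmac.new(key, i.to_bytes(8, "big") + block, hashlib.sha256).digest()

blocks = [bytes([i]) * 4096 for i in range(64)]                # the outsourced file F
server_store = [(b, tag(i, b)) for i, b in enumerate(blocks)]  # server keeps blocks + tags

# Challenge: the verifier samples a few random block indices.
challenge = random.sample(range(len(blocks)), 5)

# Response: the server returns the requested blocks with their tags.
response = [(i, *server_store[i]) for i in challenge]

# Verification: recompute each tag with the secret key.
ok = all(hmac.compare_digest(t, tag(i, b)) for i, b, t in response)
print("probabilistic possession check passed:", ok)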
An Enhanced Data Possession Checking Protocol. Sebé et al. [27]
pointed out that the above PDP-based protocol does not satisfy Requirement
#2 with 100% probability. They proposed an enhanced protocol based on
the idea of the Diffie-Hellman scheme. It is claimed that this protocol satisfies
all five requirements and is computationally more efficient than the PDP-based
protocol [27]. The verification time can be shortened at the setup stage by
trading off the computation time required at the prover against the storage
required at the verifier. The setup stage fixes the following parameters:
p and q: two large primes chosen by the verifier;
N = pq: a public RSA modulus created by the verifier;
φ(N) = (p − 1)(q − 1): the private key of the verifier, a secret known only
to the verifier;
l: an integer chosen according to the trade-off between the computation
time required at the prover and the storage required at the verifier;
t: a security parameter;
PRNG: a pseudorandom number generator that generates t-bit integer
values.
The protocol is presented as follows. First, the verifier generates the digest of the data m:
1. Break the data m into n fragments of l bits each. Let m_1, m_2, . . . , m_n
(n = ⌈|m| / l⌉) be the integer values corresponding to the fragments of m.
2. For each fragment m_i, compute and store M_i = m_i mod φ(N).
The challenge-response verification protocol is as follows:
1. The verifier generates a random seed S and a random element α ∈ Z_N \ {1, N − 1},
and sends the challenge (α, S) to the prover.
2. Upon receiving the challenge, the prover:
   generates n pseudorandom values c_i ∈ [1, 2^t], for i = 1 to n, using the PRNG seeded by S;
   calculates r = Σ_{i=1}^{n} c_i · m_i and R = α^r mod N; and
   sends R to the verifier.
3. The verifier:
   regenerates the n pseudorandom values c_i ∈ [1, 2^t], for i = 1 to n, using the PRNG seeded by S;
   calculates r' = Σ_{i=1}^{n} c_i · M_i mod φ(N) and R' = α^(r') mod N; and
   checks whether R = R'.
Due to space constraints, this section only introduces the basic principles and
workflows of the protocols for data integrity checking in untrustworthy storage.
The proofs of correctness, the security analysis, and the performance analysis of
the protocols are left for interested readers to explore in the cited research
papers [25-28].
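A toy implementation of the challenge-response protocol above is given below for concreteness. The parameters are far too small to be secure, the PRNG is a hash-based stand-in, and Python's random module is not a cryptographic source; the sketch only demonstrates that R = R' holds when the prover really possesses the data.

# Toy implementation of the remote data possession check described above.
# Small parameters for readability; a real deployment needs a large RSA modulus,
# a cryptographic PRNG, and carefully chosen l and t.
import hashlib, os, random

# --- Setup (verifier side) --------------------------------------------------
p, q = 1000003, 1000033          # toy primes; must be large and secret in practice
N = p * q                        # public RSA modulus
phi = (p - 1) * (q - 1)          # verifier's private key
l_bits = 64                      # fragment size l
t_bits = 16                      # security parameter t

def fragments(data: bytes, l_bits: int):
    # Split data m into n integer fragments m_1..m_n of l_bits bits each.
    step = l_bits // 8
    return [int.from_bytes(data[i:i + step], "big")
            for i in range(0, len(data), step)]

def prng(seed: bytes, n: int, t_bits: int):
    # Derive n pseudorandom values c_i in [1, 2^t] from seed S (hash-based sketch).
    out = []
    for i in range(n):
        h = hashlib.sha256(seed + i.to_bytes(4, "big")).digest()
        out.append(int.from_bytes(h, "big") % (2 ** t_bits) + 1)
    return out

data = b"outsourced file contents ..." * 10
m = fragments(data, l_bits)
M = [mi % phi for mi in m]       # digest M_i = m_i mod phi(N), kept by the verifier

# --- Challenge (verifier) ---------------------------------------------------
S = os.urandom(16)               # random seed
alpha = random.randrange(2, N - 1)

# --- Response (prover, who holds only the data m) ---------------------------
c = prng(S, len(m), t_bits)
r = sum(ci * mi for ci, mi in zip(c, m))
R = pow(alpha, r, N)

# --- Verification (verifier, who holds only the digest M) -------------------
c2 = prng(S, len(M), t_bits)
r_prime = sum(ci * Mi for ci, Mi in zip(c2, M)) % phi
R_prime = pow(alpha, r_prime, N)
print("data possession verified:", R == R_prime)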
8.3.3 Web-Application-Based Security
In cloud computing environments, resources are provided as a service over the
Internet in a dynamic, virtualized, and scalable way [29, 30]. Through cloud
computing services, users access business applications on-line from a Web
browser, while the software and data are stored on the servers. Therefore, in the
era of cloud computing, Web security plays a more important role than ever.
The Web site server is the first gate that guards the vast cloud resources. Since
the cloud may operate continuously to process millions of dollars' worth of
daily on-line transactions, the impact of any Web security vulnerability is
amplified to the level of the whole cloud.
Web attack techniques are often referred to as classes of attack. When a
Web security vulnerability is identified, an attacker will employ these techniques
to take advantage of it. The types of attack can be categorized into
Authentication, Authorization, Client-Side Attacks, Command Execution,
Information Disclosure, and Logical Attacks [31]. Due to limited space, this
section introduces each of them only briefly; interested readers are encouraged
to explore the cited materials for more detailed information.
Authentication. Authentication is the process of verifying a claim that a
subject makes to act on behalf of a given principal. Authentication attacks target
a Web site's method of validating the identity of a user, service, or application;
they include Brute Force, Insufficient Authentication, and Weak Password
Recovery Validation. A Brute Force attack employs an automated process to
guess a person's username and password by trial and error. In the Insufficient
Authentication case, sensitive content or functionality is protected only by
"hiding" its location behind an obscure string while it remains directly accessible
through a specific URL. An attacker can discover such URLs through brute-force
probing of files and directories. Many Web sites provide a password recovery
service, which automatically recovers the username or password if the user can
answer some questions defined as part of the registration process. If the recovery
questions can be easily guessed or skipped, the Web site is considered to have
Weak Password Recovery Validation.
Authorization. Authorization is used to verify if an authenticated subject
can perform a certain operation. Authentication must precede authorization.
For example, only certain users are allowed to access specific content or
functionality.
Authorization attacks use various techniques to gain access to protected
areas beyond the attacker's privileges. One typical authorization attack is caused
by Insufficient Authorization. When a user is authenticated to a Web site, it does
not necessarily mean that she should have full access to all of its content and
functionality. Insufficient Authorization occurs when a Web site does not protect
sensitive content or functionality with proper access control restrictions. Other
authorization attacks involve sessions; these include Credential/Session
Prediction, Insufficient Session Expiration, and Session Fixation.
In many Web sites, after a user successfully authenticates for the first time,
the Web site creates a session and generates a unique "session ID" to identify it.
This session ID is attached to subsequent requests to the Web site as proof of the
authenticated session. A Credential/Session Prediction attack deduces or guesses
the unique value of a session in order to hijack or impersonate a user.
Insufficient Session Expiration occurs when an attacker is allowed to reuse
old session credentials or session IDs for authorization. For example, on a
shared computer, after a user accesses a Web site and then leaves, with
Insufficient Session Expiration an attacker can use the browser's back button
to access Web pages previously accessed by the victim.
Session Fixation forces a user's session ID to an arbitrary value, for example
via Cross-Site Scripting or by peppering the Web site with previously made
HTTP requests. Once the victim logs in, the attacker uses the predefined session
ID value to impersonate the victim's identity.
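These session weaknesses are commonly countered by generating session IDs from a cryptographically secure source, regenerating the ID at login to defeat fixation, and enforcing expiration. A minimal sketch, with an arbitrarily chosen timeout:

# Unpredictable session IDs with expiration and post-login regeneration (sketch).
import secrets, time

SESSION_TTL = 15 * 60                     # arbitrary 15-minute idle timeout
sessions = {}                             # session_id -> (user, last_seen)

def new_session(user: str) -> str:
    sid = secrets.token_urlsafe(32)       # 256 bits of randomness, not guessable
    sessions[sid] = (user, time.time())
    return sid

def on_login(old_sid: str, user: str) -> str:
    sessions.pop(old_sid, None)           # discard any pre-login ID (anti-fixation)
    return new_session(user)

def validate(sid: str):
    entry = sessions.get(sid)
    if not entry or time.time() - entry[1] > SESSION_TTL:
        sessions.pop(sid, None)           # expired or unknown: force re-authentication
        return None
    sessions[sid] = (entry[0], time.time())
    return entry[0]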
Client-Side Attacks. Client-Side Attacks lure victims into clicking a link in a
malicious Web page and then leverage the trust the victim places in the real
Web site. In Content Spoofing, the malicious Web page tricks a user into typing
a username and password and then uses this information to impersonate the user.
Cross-Site Scripting (XSS) launches attacker-supplied executable code in the
victim's browser. The code is usually written in browser-supported scripting
languages such as JavaScript, VBScript, ActiveX, Java, or Flash. Since the code
will run within the security context of the hosting Web site, the code has the
ability to read, modify, and transmit any sensitive data, such as cookies,
accessible by the browser.
Cross-Site Request Forgery (CSRF) is a severe security attack against a
vulnerable site that does not check HTTP/HTTPS requests for CSRF. Assuming
that the attacker knows the URLs of the vulnerable site that are not protected
by CSRF checking, and that the victim's browser stores credentials such as
cookies for the vulnerable site, then after luring the victim into clicking a link in
a malicious Web page, the attacker can forge the victim's identity and access the
vulnerable Web site on the victim's behalf.
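Typical client-side mitigations are output encoding against XSS and a per-session secret token that must accompany every state-changing request against CSRF. A framework-independent sketch:

# Output encoding against XSS and a per-session token against CSRF (sketch).
import html, secrets, hmac

def render_comment(user_supplied: str) -> str:
    # Encoding < > & " ' keeps attacker-supplied markup from executing.
    return "<p>" + html.escape(user_supplied) + "</p>"

csrf_tokens = {}                                   # session_id -> token

def issue_csrf_token(session_id: str) -> str:
    token = secrets.token_urlsafe(32)
    csrf_tokens[session_id] = token
    return token                                   # embedded in the form as a hidden field

def check_csrf(session_id: str, submitted: str) -> bool:
    expected = csrf_tokens.get(session_id, "")
    return hmac.compare_digest(expected, submitted)

print(render_comment('<script>alert("xss")</script>'))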
Command Execution. Command Execution attacks exploit server-side
vulnerabilities to execute remote commands on the Web site. Usually, users
supply input to the Web site to request services. If a Web application does not
properly sanitize user-supplied input before using it within application code, an
attacker can alter command execution on the server. For example, if the
length of the input is not checked before use, a buffer overflow could happen and
result in denial of service. If the Web application uses user input to construct
statements such as SQL, XPath, C/C++ format strings, OS system commands,
LDAP, or dynamic HTML, an attacker may inject arbitrary executable code
into the server if the input is not properly filtered.
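The input-sanitization point is easiest to see with SQL: the sketch below contrasts a string-built statement, whose logic the attacker rewrites, with a parameterized query that treats the input purely as data.

# SQL injection: string-built statements versus parameterized queries (sqlite3 sketch).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cr3t')")

user_input = "nobody' OR '1'='1"                  # malicious input

# Vulnerable: the input is pasted into the command, altering its logic.
leaked = conn.execute(
    "SELECT * FROM users WHERE name = '" + user_input + "'").fetchall()
print("string-built query leaks:", leaked)        # returns every row

# Safe: the driver passes the value separately from the SQL text.
safe = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)).fetchall()
print("parameterized query returns:", safe)       # returns nothing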
Information Disclosure. Information Disclosure attacks acquire sensitive
information about a Web site revealed by developer comments, error messages,
or well-known file name conventions. For example, a Web server may return a
list of files within a requested directory if the default file is not present; this
supplies an attacker with the information needed to launch further attacks
against the system. Other types of Information Disclosure include using
special paths such as "." and ".." for Path Traversal, or uncovering hidden
URLs via Predictable Resource Location.
Logical Attacks. Logical Attacks exploit a Web application's logic flow.
Usually, a user's action is completed in a multi-step process, and the procedural
workflow of the process is called the application logic. A common Logical
Attack is Denial of Service (DoS). DoS attacks attempt to consume all available
resources on the Web server, such as CPU, memory, and disk space, by abusing
the functionality provided by the Web site. When any system resource reaches a
utilization threshold, the Web site will no longer be responsive to normal users.
DoS attacks are often enabled by Insufficient Anti-automation, where an
attacker is permitted to automate a process repeatedly; an automated script can
be executed thousands of times a minute, causing potential loss of performance
or service.
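Insufficient Anti-automation is usually addressed with per-client rate limiting; a minimal token-bucket sketch follows, with arbitrarily chosen limits.

# Minimal token-bucket rate limiter against automated abuse (arbitrary limits).
import time

RATE = 5          # tokens added per second
BURST = 20        # maximum burst size
buckets = {}      # client_id -> (tokens, last_refill_time)

def allow(client_id: str) -> bool:
    tokens, last = buckets.get(client_id, (BURST, time.time()))
    now = time.time()
    tokens = min(BURST, tokens + (now - last) * RATE)   # refill
    if tokens < 1:
        buckets[client_id] = (tokens, now)
        return False                                    # throttle this request
    buckets[client_id] = (tokens - 1, now)
    return True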
8.3.4 Multimedia Data Security Storage
With the rapid development of multimedia technologies, more and more
multimedia content is being stored and delivered over many kinds of devices,
databases, and networks. Multimedia data security plays an important role in
protecting this content in storage. Recently, how stored multimedia content is
delivered by different providers and to different users has attracted much
attention and many applications. This section briefly reviews the most critical
topics in this area.
Protection from Unauthorized Replication. Content replication is required
to generate and keep multiple copies of certain multimedia content. For
example, content distribution networks (CDNs) manage content distribution
to large numbers of users by keeping replicas of the same content on a group
of geographically distributed surrogates [32, 33]. Although replication can
improve system performance, unauthorized replication causes problems such
as content copyright violation, wasted replication cost, and extra control
overhead.
Protection from Unauthorized Replacement. Because storage capacity is
limited, a replacement process must be carried out when the capacity is
exceeded: currently stored content [34] must be removed from the storage space
to make room for newly arriving content. Deciding which content should be
removed is therefore very important. If an unauthorized replacement happens,
content that the user does not want to delete will be removed, resulting in
accidental data loss. Furthermore, if important content such as system data is
removed by unauthorized replacement, the consequences are even more serious.
Protection from Unauthorized Pre-fetching. Pre-fetching is widely deployed
in multimedia storage network systems between server databases and end
users' storage disks [35]. That is, if a piece of content can be predicted to be
requested by the user in future requests, it is fetched from the server database
to the end user before the user requests it, in order to decrease the user's
response time. Although pre-fetching is efficient, unauthorized pre-fetching
should be avoided so that the system fetches only the necessary content.
8.4 OPEN QUESTIONS AND CHALLENGES
Almost all the current commercial cloud service providers claim that their
platforms are secure and robust. On one hand, they adopt robust cipher
algorithms to keep stored data confidential; on the other hand, they depend on
network communication security protocols such as SSL and IPSec to
protect data in transmission over the network. For service availability and
high performance, they adopt virtualization technologies and apply strong
authentication and authorization schemes in their cloud domains. However, for
a new infrastructure and platform leading to new application and service models
in the future IT industry, the security requirements of cloud computing differ
from those of traditional security problems. As pointed out by Dr. K. M. Khan:
Encryption, digital signatures, network security, firewalls, and the isolation of
virtual environments all are important for cloud computing security, but these
alone won't make cloud computing reliable for consumers.
Concerns at Different Levels
The cloud computing environment consists of three levels of abstraction:
1. The cloud infrastructure providers, at the back end, own and
manage the network infrastructure and resources, including hardware
devices and system software.
2. The cloud service providers offer services such as on-demand
computing, utility computing, data processing, software services, and
platforms for developing application software.
3. The cloud consumers, at the front end of the cloud computing
environment, consist of two major categories of users: (a) application
developers, who take advantage of the hardware infrastructure and the
software platforms to construct application software for ultimate end
users; and (b) end users, who carry out their daily work using
on-demand computing, software services, and utility services.
Regarding data and information security, users at different levels have
different expectations and concerns owing to the roles they play in the data's
life cycle.
From the perspective of cloud consumers, who are normally the data
owners, the concerns essentially arise from the loss of control once the
data are in the cloud. As the dataset is stored in unknown third-party
infrastructure, the owner loses not only the advantages of endpoint restrictions
and management but also fine-grained credential quality control. Uncertainty
about privacy and doubts about vulnerability also result from the disappearance
of physical and logical network boundaries [36].
The main security concerns of end users include confidentiality, loss of
control over data, and the undisclosed security profiles of the cloud service and
infrastructure providers. The users' data are transmitted between the local
machine and the cloud service provider for various operations, and they are also
persistently stored in the cloud infrastructure provider's facilities. During this
procedure, data might not be adequately protected while they are being moved
within the systems or across multiple sites owned by these providers. The data
owner also cannot check the security assurances before using a cloud service,
because the actual security capabilities of the providers are not visible to the
user or owner.
The problem becomes more complicated when the service and infrastructure
providers are not the same, which implies additional communication links in
the chain. Involving a third party in the services also introduces an additional
vector of attack. In practice there are even more challenging scenarios.
For instance, consider multiple end users with different sets of security
requirements using the same service offered by a single cloud service provider.
To handle this kind of complexity, a single set of security provisions cannot fit
all cases in cloud computing. Such scenarios also imply that the back-end
infrastructure and/or service providers must be capable of supporting multiple
levels of security requirements, similar to those guaranteed by the front-end
service provider.
From the perspective of the cloud service providers, the main concern with
regard to protecting users' data is the transfer of data from devices and servers
under the users' control to the provider's own devices and subsequently to those
of the cloud infrastructure, where the data are stored. The data are stored on the
cloud service provider's devices on multiple machines across the entire virtual
layer, and they are also hosted on devices that belong to the infrastructure
provider. The cloud service provider needs to assure users that the security of
their data is adequately addressed between the partners, that their virtual
environments are isolated with sufficient protection, and that the cleanup of
outdated images is suitably managed both at its own site and at the cloud
infrastructure provider's storage machines.
Undoubtedly, the cloud infrastructure providers' security concerns are no
less serious than those of end users or cloud service providers. The infrastructure
provider knows that a single point of failure in its security mechanisms would
allow hackers to extract thousands of bytes of data owned by its clients, and
most likely data owned by other enterprises as well. The cloud infrastructure
providers need to ask the following questions:
● How are the data stored in its physical devices protected?
● How does the cloud infrastructure manage the backup of data, and the
destruction of outdated data, at its site?
● How can the cloud infrastructure control access to its physical devices and
the images stored on those devices?
Technical and Nontechnical Challenges
The above analysis has shown that, besides technical challenges, cloud
computing platform (infrastructure and service) providers are also required to
address a number of nontechnical issues, for example, the lack of legal
requirements on data security imposed on service providers [36]. More
specifically, the following technical challenges need to be addressed in order to
make cloud computing acceptable to common consumers:
● Open security profiling of services that is available to end users and
verifiable automatically. Service providers need to disclose in detail the
levels of specific security properties rather than providing blanket
assurances of "secure" services.
● The cloud service/infrastructure providers are required to enable end users
to remotely control their virtual working platforms in the cloud and to
monitor others' access to their data. This includes the capability of
fine-grained access control on their own data, no matter where the data
files are stored and processed. In addition, it is desirable to be able to
restrict any unauthorized third party, including the cloud service provider
and the cloud infrastructure providers, from manipulating users' data.
● Security compliance with existing standards could be useful to enhance
cloud security. There must be consistency between the security
requirements and/or policies of service consumers and the security
assurances of cloud providers.
● It is mandatory for providers to ensure that their software is as secure as
they claim. These assurances may include certification of the security of
the systems in question. A certificate issued after rigorous testing
according to agreed criteria (e.g., ISO/IEC 15408) can attest to the degree
of reliability of software in different configurations and environments as
claimed by the cloud providers.
The above technical issues have been, and will continue to be, addressed by
the constant development of new technologies. However, special efforts are
needed to meet the nontechnical challenges. For instance, one of the most
difficult issues to solve in cloud computing is the users' fear of losing control
over their data. Because end users do not clearly know where and how their
data are handled, and because they realize that their data are processed,
transmitted, and stored by devices under the control of strangers, it is
reasonable for them to be concerned about what happens in the cloud. In a
traditional work environment, in order to keep a dataset secure, the operator
simply keeps it away from the threat. In cloud computing, however, datasets
seem to move closer to their threats: they are transmitted to, stored in, and
manipulated by remote devices controlled by third parties rather than by the
owner of the dataset. It is recognized that this is partly a psychological issue;
but until end users have enough information and insight to trust cloud
computing security and its dynamics, the fear is unlikely to go away.
End-user license agreements (EULAs) and vendor privacy policies are not
enough to solve this psychological issue. Service-level agreements (SLAs) need to
specify the security assurances preferred by consumers in detail. Proper business
models and risk assessments related to cloud computing security need to be
defined. In this new security-sensitive design paradigm, the ability to change one's
mind is crucial, because consumers are more security-aware than ever before.
They make their service-consumption decisions not only on cost and service; they
also want to see real, credible security measures from cloud providers.
SUMMARY
In this chapter we have presented the state-of-the-art research progress and
results on secure distributed data storage in cloud computing. Cloud computing
has attracted considerable attention from both industry and academia in recent
years. Among all the major building blocks of cloud computing, data storage
plays a very important role. Currently, there are several challenges in
implementing distributed storage in cloud computing environments, and these
challenges need to be addressed before users can enjoy the full advantages of
cloud computing. In addition, security is always a significant issue in any
computing system.
Consequently, we surveyed a number of topics related to the challenging issues
of securing distributed data storage, including database outsourcing and query
integrity assurance, data integrity in untrustworthy storage, Web-application-based
security, and multimedia data security. It is anticipated that the
technologies developed in the aforementioned research will contribute to
paving the way for securing distributed data storage environments within cloud
computing.
PART III
PLATFORM AND SOFTWARE AS A SERVICE (PAAS/IAAS)
CHAPTER 9
ANEKA—INTEGRATION OF PRIVATE
AND PUBLIC CLOUDS
CHRISTIAN VECCHIOLA, XINGCHEN CHU, MICHAEL MATTESS, and
RAJKUMAR BUYYA
9.1 INTRODUCTION
A growing interest in moving software applications, services, and even infrastructure resources from in-house premises to external providers has been witnessed recently. A survey conducted by F5 Networks between June and July 2009 (available at http://www.f5.com/pdf/reports/cloud-computing-survey-results-2009.pdf; it interviewed 250 IT companies with at least 2,500 employees worldwide and targeted managers, directors, vice presidents, and senior vice presidents) showed that this trend has now reached a critical mass, and that an increasing number of IT managers have already adopted, or are considering adopting, this approach to implement IT operations. This model of making IT resources available, known as Cloud Computing [1], opens new opportunities to small, medium-sized, and large companies. It is no longer necessary to bear considerable costs for maintaining the IT infrastructure or to plan for peak demand; instead, infrastructure and applications can scale elastically according to business needs at a reasonable price. The possibility of instantly reacting to customer demand without long-term planning is one of the most appealing features of cloud computing, and it has been a key factor in making this trend popular among technology and business practitioners.
As a result of this growing interest, the major players in the IT industry, such as Google, Amazon, Microsoft, Sun, and Yahoo, have started offering cloud-computing-based solutions that cover the entire IT computing stack, from hardware to applications and services. These offerings have quickly become
popular and led to the establishment of the concept of the "Public Cloud," which represents a publicly accessible distributed system hosting the execution of
applications and providing services billed on a pay-per-use basis. After an
initial enthusiasm for this new trend, it soon became evident that a solution
built on outsourcing the entire IT infrastructure to third parties would not be
applicable in many cases, especially when there are critical operations to be
performed and security concerns to consider. Moreover, with the public cloud
distributed anywhere on the planet, legal issues arise and they simply make it
difficult to rely on a virtual public infrastructure for any IT operation. As an
example, data location and confidentiality are two of the major issues that make stakeholders wary of moving into the cloud: data that might be secure in one country may not be secure in another. In many cases, though, users of cloud services do not know where their information is held, and different jurisdictions can apply. It could be stored in a data center located in (a) Europe, where the European Union favors very strict protection of privacy, or (b) the United States, where laws such as the U.S. Patriot Act give government and other agencies virtually limitless powers to access information, including information belonging to companies. In addition, enterprises already have their own IT infrastructures. In spite of this, the distinctive features of cloud computing remain appealing, and the possibility of replicating in-house (on their own IT infrastructure) the resource and service provisioning model proposed by cloud computing led to the development of the "Private Cloud" concept.

Note: The U.S. Patriot Act is a statute enacted by the United States government that increases the ability of law enforcement agencies to search telephone and e-mail communications and medical, financial, and other records; it eases restrictions on foreign intelligence gathering within the United States. The full text of the act is available from the Library of Congress at http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.3162.ENR (accessed December 5, 2009).
Private clouds are virtual distributed systems that rely on a private infrastructure and provide internal users with dynamic provisioning of computing resources. Unlike public clouds, instead of a pay-as-you-go model there could be other schemes in place, which take into account the usage of the cloud and proportionally bill the different departments or sections of the enterprise. Private clouds have the advantage of keeping the core business operations in-house by relying on the existing IT infrastructure and reducing the burden of maintaining it once the cloud has been set up. In this scenario, security concerns are less critical, since sensitive information does not flow out of the private infrastructure. Moreover, existing IT resources can be better utilized, since the private cloud becomes accessible to all the divisions of the enterprise. Another interesting opportunity that comes with private clouds is the possibility of testing applications and systems at a comparatively lower cost than on public clouds before deploying them on the public virtual infrastructure. In April 2009, a Forrester report on the benefits of delivering in-house cloud computing solutions for enterprises highlighted some of the key advantages of using a private cloud computing infrastructure:
● Customer Information Protection. Despite assurances by the public cloud
leaders about security, few provide satisfactory disclosure or have long
enough histories with their cloud offerings to provide warranties about the
specific level of security put in place in their system. Security in-house is
easier to maintain and to rely on.
● Infrastructure Ensuring Service Level Agreements (SLAs). Quality of
service implies that specific operations such as appropriate clustering
and failover, data replication, system monitoring and maintenance,
disaster recovery, and other uptime services can be commensurate with the application needs. While public cloud vendors provide some of these features, not all of them are available as needed.
● Compliance with Standard Procedures and Operations. If organizations are subject to third-party compliance standards, specific procedures have to be put in place when deploying and executing applications. This might not be possible in the case of a virtual public infrastructure.
In spite of these advantages, private clouds cannot easily scale out in the case of
peak demand, and the integration with public clouds could be a solution to the
increased load. Hence, hybrid clouds, which are the result of a private cloud
growing and provisioning resources from a public cloud, are likely to be the best option for the future in many cases. Hybrid clouds allow enterprises to exploit existing IT infrastructures, to maintain sensitive information within the premises, and to naturally grow and shrink by provisioning external resources and releasing them when no longer needed. Security concerns are then limited to the public portion of the cloud, which can be used to perform operations with less stringent constraints but that are still part of the system workload.
Platform as a Service (PaaS) solutions offer the right tools to implement and
deploy hybrid clouds. They provide enterprises with a platform for creating,
deploying, and managing distributed applications on top of existing
infrastructures. They are in charge of monitoring and managing the
infrastructure and acquiring new nodes, and they rely on virtualization
technologies in order to scale applications on demand. There are different
implementations of the PaaS model; in this chapter we will introduce
Manjrasoft Aneka, and we will discuss how to build and deploy hybrid clouds
based on this technology. Aneka is a programming and management platform
for building and deploying cloud computing applications. The core value of
Aneka is its service-oriented architecture that creates an extensible system able
to address different application scenarios and deployments such as public,
private, and heterogeneous clouds. On top of these, applications that can be
expressed by means of different programming models can transparently
execute under the desired service-level agreement.
The remainder of this chapter is organized as follows: In the next section we briefly review the technologies and tools for cloud computing by presenting both the commercial solutions and the research projects currently available. We then introduce Aneka in Section 9.3 and provide an overview of the architecture of the system. In Section 9.4 we detail the resource provisioning service, which represents the core feature for building hybrid clouds. Its architecture and implementation are described in Section 9.5, together with a discussion of the desired features that a software platform supporting hybrid clouds should offer. Some thoughts and future directions for practitioners follow, before the conclusions.
9.2 TECHNOLOGIES AND TOOLS FOR CLOUD COMPUTING
Cloud computing covers the entire computing stack from hardware
infrastructure to end-user software applications. Hence, there are heterogeneous
offerings addressing different niches of the market. In this section we will
concentrate mostly on the Infrastructure as a Service (IaaS) and Platform
as a Service (PaaS) implementations of the cloud computing model by first
presenting a subset of the most representative commercial solutions and then discussing a few research projects and platforms that have attracted considerable attention.
Amazon is probably the major player as far as Infrastructure-as-a-Service solutions for public clouds are concerned. Amazon Web Services deliver a set of services that, when composed together, form a reliable, scalable, and economically accessible cloud. Within the wide range of services offered, it is worth noting that Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3) allow users to quickly obtain virtual compute resources and storage space, respectively. GoGrid provides customers with a similar offering: it allows users to deploy their own distributed system on top of its virtual infrastructure. By using the GoGrid Web interface, users can create their custom virtual images, deploy database and application servers, and mount new storage volumes for their applications. Both GoGrid and Amazon EC2 charge their customers on a pay-as-you-go basis, and resources are priced per hour of usage. 3Tera AppLogic lies at the foundation of many public clouds; it provides a grid operating system that includes workload distribution, metering, and management of applications. These are described in a platform-independent manner, and AppLogic takes care of deploying and scaling them on demand. Together with AppLogic, which can also be used to manage and deploy private clouds, 3Tera also provides cloud hosting solutions and, because of its grid operating system, makes the transition from the private to the public virtual infrastructure simple and completely transparent. Solutions that are completely based on a PaaS approach for public clouds are Microsoft Azure and Google AppEngine. Azure allows developing scalable applications for the cloud. It is a cloud services operating system that serves as the development,
runtime, and control environment for the Azure Services Platform. By using the
Microsoft Azure SDK, developers can create services that leverage the .NET
framework. These services are then uploaded to the Microsoft Azure
portal and executed on top of Windows Azure. Additional services such as
workflow management and execution, web services orchestration, and
SQL data storage are provided to empower the hosted applications. Azure customers are billed on a pay-per-use basis, taking into account the different services used: compute, storage, bandwidth, and storage transactions.
Google AppEngine is a development platform and a runtime environment focusing primarily on web applications that are run on top of Google's server infrastructure. It provides a set of APIs and an application model that allow developers to take advantage of additional services provided by Google, such as Mail, Datastore, Memcache, and others. Developers can create applications in Java, Python, and JRuby. These applications run within a sandbox, and AppEngine takes care of automatically scaling them when needed. Google provides a free but limited service and utilizes daily and per-minute quotas to meter and price applications requiring professional service.
Different options are available for deploying and managing private clouds. At the lowest level, virtual machine technologies such as Xen, KVM, and VMware can help build the foundations of a virtual infrastructure. On top of this, virtual machine managers such as VMware vCloud [14] and Eucalyptus [15] allow managing a virtual infrastructure and turning a cluster or a desktop grid into a private cloud. Eucalyptus provides full compatibility with the Amazon Web Services interfaces and supports different virtual machine technologies such as Xen, VMware, and KVM. By using Eucalyptus, users can test and deploy their cloud applications on the private premises and naturally move to the public virtual infrastructure provided by Amazon EC2 and S3 in a completely transparent manner. VMware vCloud is the solution proposed by VMware for deploying virtual infrastructure as either public or private clouds. It is built on top of the VMware virtual machine technology and provides an easy way to migrate from the private premises to a public infrastructure that leverages VMware for infrastructure virtualization. As far as Platform-as-a-Service solutions are concerned, we can mention DataSynapse, Elastra, Zimory Pools, and the already mentioned AppLogic. DataSynapse [16] is a global provider of application virtualization software. By relying on the VMware virtualization technology, it provides a flexible environment that converts a data center into a private cloud. Elastra [17] Cloud Server is a platform for easily configuring and deploying distributed application infrastructures on clouds: by using a simple control panel, administrators can visually describe the distributed application in terms of components and connections and then deploy them on one or more cloud providers such as Amazon EC2 or VMware ESX. Cloud Server can provision resources from either private or public clouds, thus deploying applications on hybrid infrastructures. Zimory [18], a spinoff company from Deutsche Telekom, provides a software infrastructure layer that automates the use of resource pools based on Xen, KVM, and VMware virtualization technologies. It allows creating an internal cloud composed of sparse private and public resources that host the Zimory software agent, and provides facilities for quickly migrating applications from one data center to another and for making the best use of the existing infrastructure.
The wide range of commercial offerings for deploying and managing private
and public clouds mostly rely on a few key virtualization technologies, on
top of which additional services and features are provided. In this sense, an
interesting research project combining public and private clouds and adding
advanced services such as resource reservation is represented by the
coordinated use of OpenNebula [19] and Haizea [20]. OpenNebula is a virtual
infrastructure manager that can be used to deploy and manage virtual machines
on local resources or on external public clouds, automating the setup of the
virtual machines regardless of the underlying virtualization layer (Xen, KVM,
or VMWare are currently supported) or external cloud such as Amazon EC2. A
key feature of OpenNebula's architecture is its highly modular design, which
facilitates integration with any virtualization platform and third-party
component in the cloud ecosystem, such as cloud toolkits, virtual image
managers, service managers, and VM schedulers such as Haizea. Haizea is a
resource lease manager providing leasing capabilities not found in other cloud
systems, such as advance reservations and resource preemption. Integrated
together, OpenNebula and Haizea constitute a virtual management
infrastructure providing flexible and advanced capabilities for resource
management in hybrid clouds. A similar set of capabilities is provided by
OpenPEX [21], which allows users to provision resources ahead of time
through advance reservations. It also incorporates a bilateral negotiation
protocol that allows users and providers to come to an agreement by
exchanging offers and counter offers. OpenPEX natively supports Xen as a
virtual machine manager (VMM), but additional plug-ins can be integrated into
the system to support other VMMs. Nimbus [22], formerly known as Globus
Workspaces, is another framework that provides a wide range of extensibility
points. It is essentially a framework that allows turning a cluster into an
Infrastructure-as-a-Service cloud. What makes it interesting from the
perspective of hybrid clouds is an extremely modular architecture that allows
the customization of many tasks: resource scheduling, network leases,
accounting, propagation (intra VM file transfer), and fine control VM
management.
All of the previous research platforms are mostly IaaS implementations of the cloud computing model: they provide a virtual infrastructure management layer that is enriched with advanced features for resource provisioning and scheduling. Aneka, which is both a commercial solution and a research platform, positions itself as a Platform-as-a-Service implementation. Aneka provides not only a software infrastructure for scaling applications, but also a wide range of APIs that help developers design and implement applications that can transparently run on a distributed infrastructure, whether this is the local cluster or the cloud. Aneka, like OpenNebula and Nimbus, is characterized by a modular architecture that allows a high level of customization and integration with existing technologies, especially with respect to resource provisioning. Like Zimory, the core feature of Aneka is a configurable software agent that can be transparently deployed on both physical and virtual resources and constitutes the runtime environment for the cloud. This feature, together with the resource provisioning infrastructure, is at the heart of Aneka-based hybrid clouds. In the next sections we will introduce the key features of Aneka and describe in detail the architecture of the resource provisioning service, which is responsible for integrating cloud resources into the existing infrastructure.
9.3 ANEKA CLOUD PLATFORM
Aneka is a software platform and a framework for developing distributed
applications on the cloud. It harnesses the computing resources of a
heterogeneous network of workstations and servers or data centers on
demand. Aneka provides developers with a rich set of APIs for transparently
exploiting these resources by expressing the application logic with a variety
of programming abstractions. System administrators can leverage a
collection of tools to monitor and control the deployed infrastructure. This
can be a public cloud available to anyone through the Internet, a private
cloud constituted by a
set of nodes with restricted access within an
enterprise, or a hybrid cloud where external resources are integrated on
demand, thus allowing applications to scale.
Figure 9.1 provides a layered view of the framework. Aneka is essentially an
implementation of the PaaS model, and it provides a runtime environment for
executing applications by leveraging the underlying infrastructure of the cloud.
Developers can express distributed applications by using the API contained in
the Software Development Kit (SDK) or by porting existing legacy
applications to the cloud. Such applications are executed on the Aneka cloud,
represented by a collection of nodes connected through the network hosting
the Aneka container. The container is the building block of the middleware and
represents the runtime environment for executing applications; it contains the
core functionalities of the system and is built up from an extensible collection of
services that allow administrators to customize the Aneka cloud. There are
three classes of services that characterize the container:
● Execution Services. They are responsible for scheduling and executing
applications. Each of the programming models supported by Aneka
defines specialized implementations of these services for managing the
execution of a unit of work defined in the model.
● Foundation Services. These are the core management services of the Aneka container. They are in charge of metering applications, allocating resources for execution, managing the collection of available nodes, and keeping the services registry updated.
● Fabric Services. They constitute the lowest level of the services stack of Aneka and provide access to the resources managed by the cloud. An important service in this layer is the Resource Provisioning Service, which enables horizontal scaling in the cloud (horizontal scaling is the process of adding more computing nodes to a system, as opposed to vertical scaling, which increases the computing capability of a single computer resource). Resource provisioning makes Aneka elastic and allows it to grow or to shrink dynamically to meet the QoS requirements of applications. (A minimal sketch of the container and its three service classes follows the figure.)

[FIGURE 9.1. Aneka framework architecture. The layered view comprises: application development and management tools (Management Studio, SDK with APIs, tutorials and samples, administration portal); Web services for SLA and management, plus security; the middleware container with execution services (Task, Thread, and MapReduce models), foundation services (membership, storage, accounting, licensing, resource reservation, persistence), and fabric services (hardware profiling, dynamic resource provisioning); the Platform Abstraction Layer (PAL) over ECMA 334/335 runtimes (.NET or Mono on Windows, Linux, and Mac); and physical and virtualized resources spanning a private cloud (LAN) and public providers such as Amazon, Microsoft, and IBM.]
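To make the container's structure more concrete, here is a minimal sketch of an extensible service container grouping services into the three classes described above. This is illustrative only: Aneka itself is a .NET framework, and all class and method names below are hypothetical rather than part of its API.

from abc import ABC, abstractmethod


class Service(ABC):
    """Base class for every service hosted by the container (hypothetical)."""

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def stop(self) -> None: ...


class Container:
    """Runtime environment built from an extensible collection of services."""

    def __init__(self):
        # Services are grouped by class: execution, foundation, or fabric.
        self._services = {"execution": [], "foundation": [], "fabric": []}

    def register(self, service_class, service):
        # Administrators customize the cloud by registering extra services.
        self._services[service_class].append(service)

    def boot(self):
        # Start fabric services first (access to resources), then foundation
        # services (membership, accounting, ...), then execution services.
        for cls in ("fabric", "foundation", "execution"):
            for svc in self._services[cls]:
                svc.start()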
The container relies on a platform abstraction layer that interfaces it with the
underlying host, whether this is a physical or a virtualized resource. This makes
the container portable over different runtime environments that feature an
implementation of the ECMA 334 [23] and ECMA 335 [24] specifications (such
as the .NET framework or Mono).
Aneka also provides a tool for managing the cloud, allowing administrators to easily start, stop, and deploy instances of the Aneka container on new resources and then reconfigure them dynamically to alter the behavior of the cloud.
9.4 ANEKA RESOURCE PROVISIONING SERVICE
The most significant benefit of cloud computing is the elasticity of resources,
services, and applications, which is the ability to automatically scale out based
on demand and users' quality of service requests. Aneka as a PaaS not only
features multiple programming models allowing developers to easily build their
distributed applications, but also provides resource provisioning facilities in a
seamless and dynamic fashion. Applications managed by the Aneka container
can be dynamically mapped to heterogeneous resources, which can grow or
shrink according to the application's needs. This elasticity is achieved by means
of the resource provisioning framework, which is composed primarily of
services built into the Aneka fabric layer.
Figure 9.2 provides an overview of Aneka resource provisioning over private
and public clouds. This is a typical scenario that a medium or large enterprise
may encounter; it combines privately owned resources with public rented
resources to dynamically increase the resource capacity to a larger scale.
Private resources identify computing and storage elements kept on the premises that share similar internal security and administrative policies. Aneka identifies two types of private resources: static and dynamic resources. Static resources consist of existing physical workstations and servers that may be idle for a certain period of time. Their membership in the Aneka cloud is manually configured by administrators and does not change over time.
Dynamic resources are mostly represented by virtual instances that join and
leave the Aneka cloud and are controlled by resource pool managers that
provision and release them when needed.
[FIGURE 9.2. Aneka resource provisioning over private and public clouds. The Aneka private cloud (intranet) combines statically deployed physical desktops and servers with dynamically provisioned virtual machines, and bursts over the Internet to dynamically provision further resources from a public cloud.]
Public resources reside outside the boundaries of the enterprise and
are provisioned by establishing a service-level agreement with the external
provider. Even in this case we can identify two classes: on-demand and
reserved resources. On-demand resources are dynamically provisioned by resource pools for a fixed amount of time (for example, an hour), with no long-term commitment and on a pay-as-you-go basis. Reserved resources are provisioned in advance by paying a low, one-time fee and are mostly suited for long-term usage. These resources are actually the same as static resources, and no automation is needed in the resource provisioning service to manage them.
Despite the specific classification previously introduced, resources are managed uniformly once they have joined the Aneka cloud, and all the standard operations that are performed on statically configured nodes can be transparently applied to dynamic virtual instances. Moreover, specific operations pertaining to dynamic resources, such as join and leave, are seen as connection and disconnection of nodes and are transparently handled. This is mostly due to the indirection layer provided by the Aneka container, which abstracts the specific nature of the hosting machine.
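The classification described above (static and dynamic private resources, on-demand and reserved public resources) can be summarized in a small data model. The following sketch is illustrative only; the names are hypothetical and do not come from the Aneka codebase.

from dataclasses import dataclass
from enum import Enum, auto


class Ownership(Enum):
    PRIVATE = auto()  # kept on the premises
    PUBLIC = auto()   # provisioned from an external provider


class Lifetime(Enum):
    STATIC = auto()   # manually configured (includes reserved public resources)
    DYNAMIC = auto()  # provisioned and released by a resource pool


@dataclass
class Node:
    hostname: str
    ownership: Ownership
    lifetime: Lifetime

    def needs_pool_management(self) -> bool:
        # Only dynamic resources are driven by the provisioning service;
        # static and reserved resources behave like ordinary nodes.
        return self.lifetime is Lifetime.DYNAMIC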
9.4.1 Resource Provisioning Scenario
Figure 9.3 illustrates a possible scenario in which the resource provisioning service becomes important. A private enterprise maintains a private cloud, which consists of (a) five physical dedicated desktops from its engineering department and (b) a small data center managed by the Xen hypervisor providing virtual machines with a maximum capacity of 12 VMs. In most cases, this setting is able to address the computing needs of the enterprise. In the case of peak computing demand, additional resources can be provisioned by leveraging the virtual public infrastructure. For example, a mission-critical application could require at least 30 resources to complete within an hour, and the customer is willing to spend a maximum of 5 dollars to achieve this goal. In this case, the Aneka Resource Provisioning Service becomes a fundamental infrastructure component to address this scenario.
Once the client has submitted the application, the Aneka scheduling engine detects that the current capacity in terms of resources (5 dedicated nodes) is not enough to satisfy the user's QoS requirement and to complete the application on time. An additional 25 resources must be provisioned. It is the responsibility of the Aneka Resource Provisioning Service to acquire these resources from both the private data center managed by the Xen hypervisor and the Amazon public cloud. The provisioning service is configured by default with a cost-effective strategy, which privileges the use of local resources over dynamically provisioned and chargeable ones. The computing needs of the application require the full utilization of the local data center, which provides the Aneka cloud with 12 virtual machines. Such capacity is still not enough to complete the mission-critical application in time, and the
remaining 13 resources are rented from Amazon for a minimum of one hour, which incurs a cost of a few dollars. (At the time of writing, October 2010, a small Linux-based instance in Amazon EC2 costs 8.5 cents per hour, so the total cost borne by the customer in this case is 1.105 USD; this price is expected to decrease further in the next years.)

[FIGURE 9.3. Use case of resource provisioning under Aneka. The client submits Request(30, $5) to the Aneka provisioning service in the enterprise cloud, which joins the 5 dedicated desktops (Capacity(5)) at no cost, provisions 12 VMs from the private data center (Capacity(12 VMs)) at no cost, and provisions the remaining 13 resources from the public cloud for $1.105.]
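As a back-of-the-envelope check of this scenario, the cost-effective strategy can be sketched in a few lines of Python. This is not the Aneka scheduler; it simply reproduces the allocation and the cost figures quoted above (small EC2 instances assumed at 8.5 cents per hour).

# Sketch of the cost-effective strategy used in the scenario above:
# exhaust free local capacity first, then rent the remainder from EC2.
def plan_allocation(required, dedicated, local_vms, ec2_price_per_hour):
    use_dedicated = min(required, dedicated)
    use_local_vms = min(required - use_dedicated, local_vms)
    rent_from_ec2 = required - use_dedicated - use_local_vms
    cost = round(rent_from_ec2 * ec2_price_per_hour, 3)  # one-hour time blocks
    return use_dedicated, use_local_vms, rent_from_ec2, cost


# Scenario from the text: 30 resources needed, 5 dedicated desktops,
# 12 local Xen VMs, and small EC2 instances at $0.085 per hour.
print(plan_allocation(30, 5, 12, 0.085))  # -> (5, 12, 13, 1.105)

The 13 rented instances at 8.5 cents each account for the $1.105 shown in Figure 9.3, well within the $5 budget.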
This is not the only scenario that Aneka can support, and different provisioning patterns can be implemented. Another simple strategy for provisioning resources could be to minimize the execution time and let the application finish as early as possible; this requires Aneka to request more powerful resources from the Amazon public cloud. For example, in the previous case, instead of provisioning 13 small instances from Amazon, a larger number of resources, or more powerful resources, could be rented by spending the entire budget available for the application. The resource provisioning infrastructure can also serve broader purposes, such as keeping the length of the system queue, or the average waiting time of a job in the queue, under a specified value. In these cases, specific policies can be implemented to ensure that the throughput of the system is kept at a reasonable level.
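As an illustration of such a policy, and only as a sketch rather than Aneka's actual implementation, a scheduler could request extra nodes whenever the number of queued tasks per available node exceeds a configured bound:

import math


def extra_nodes_needed(queued_tasks, available_nodes, max_tasks_per_node=5):
    # Number of nodes that keeps the queue-per-node ratio under the bound.
    target_nodes = math.ceil(queued_tasks / max_tasks_per_node)
    return max(0, target_nodes - available_nodes)


print(extra_nodes_needed(120, 10))  # -> 14 extra nodes for a bound of 5 tasks/node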
9.5 HYBRID CLOUD IMPLEMENTATION
Currently, there is no widely accepted standard for provisioning virtual
infrastructure from Infrastructure as a Service (IaaS) providers, but each
provider exposes its own interfaces and protocols. Hence, it is not possible
to seamlessly integrate different providers into one single infrastructure.
The resource provisioning service implemented in Aneka addresses these issues and abstracts away the differences in the providers' implementations.
In this section we will briefly review the desired features of a hybrid cloud implementation and then take a closer look at the solution implemented in Aneka, together with a practical application of the infrastructure developed.
9.5.1 Design and Implementation Guidelines
The particular nature of hybrid clouds demands additional and specific
functionalities that software engineers have to consider while designing
software systems supporting the execution of applications in hybrid and
dynamic environments. These features, together with some guidelines on how
to implement them, are presented in the following:
● Support for Heterogeneity. Hybrid clouds are produced by heterogeneous resources such as clusters, public or private virtual infrastructures, and workstations. In particular, as far as the virtual machine manager is concerned, it must be possible to integrate additional cloud service providers (mostly IaaS providers) without major changes to the entire system design and codebase. Hence, the specific code related to a particular cloud resource provider should be kept isolated behind interfaces and within pluggable components.
● Support for Dynamic and Open Systems. Hybrid clouds change their composition and topology over time. They form as a result of dynamic conditions such as peak demands or specific Service Level Agreements attached to the applications currently in execution. An open and extensible architecture that allows easily plugging in new components and rapidly integrating new features is of great value in this case. Specific enterprise architectural patterns can be considered while designing such software systems. In particular, inversion of control and, more precisely, dependency injection in component-based systems is really helpful (see the note and the sketch following this list).
● Support for Basic VM Operation Management. Hybrid clouds integrate virtual infrastructures with existing physical systems, and virtual infrastructures are made up of virtual instances. Hence, software frameworks that support hypervisor-based execution should implement a minimum set of operations. These include requesting a virtual instance, controlling its status, terminating its execution, and keeping track of all the instances that have been requested.
● Support for Flexible Scheduling Policies. The heterogeneity of the resources that constitute a hybrid infrastructure naturally demands flexible scheduling policies. Public and private resources can be utilized differently, and the workload should be dynamically partitioned into different streams according to their security and quality of service (QoS) requirements. There is then the need to transparently change scheduling policies over time, with minimum impact on the existing infrastructure and almost no downtime. Configurable scheduling policies are therefore an important feature.
● Support for Workload Monitoring. Workload monitoring becomes even more important in the case of hybrid clouds, where a subset of resources is leased and resources can be dismissed if they are no longer necessary. Workload monitoring is an important feature for any distributed middleware; in the case of hybrid clouds, it is necessary to integrate this feature with scheduling policies that either directly or indirectly govern the management of virtual instances and their leases.
Note on dependency injection: Dependency injection is a technique that allows configuring and connecting components within a software container (such as a Web or an application server) without hard-coding their relations, but by providing an abstract specification instead, for example a configuration file that specifies which components to instantiate and how to connect them together. A detailed description of this programming pattern can be found at http://martinfowler.com/articles/injection.html (accessed December 2009).
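To illustrate the idea, the following is a generic sketch of configuration-driven dependency injection in Python; it is unrelated to Aneka's actual configuration format, and the module and class names in the specification are hypothetical.

import importlib


def build_component(spec):
    # The concrete class is named in the specification, not in the code,
    # so providers can be swapped without touching the container.
    module = importlib.import_module(spec["module"])
    cls = getattr(module, spec["class"])
    return cls(**spec.get("args", {}))


# Hypothetical specification (it could equally come from an XML or JSON file).
config = {"module": "mypools.ec2", "class": "EC2ResourcePool",
          "args": {"region": "us-east-1"}}
# pool = build_component(config)  # would import and instantiate the pool class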
Those presented are, according to the authors, the most relevant features for successfully supporting the deployment and the management of hybrid clouds. In this list we did not extensively mention security, which is transversal to all the features listed. A basic recommendation for implementing a security infrastructure for any runtime environment is to use a defense-in-depth security model whenever possible (defense in depth is an information assurance strategy in which multiple layers of defense are placed throughout an information technology system; more information is available at http://www.nsa.gov/ia/_files/support/defenseindepth.pdf, accessed December 2009). This principle is even more important in heterogeneous systems such as hybrid clouds, where both applications and resources can represent threats to each other.
9.5.2 Aneka Hybrid Cloud Architecture
The Resource Provisioning Framework represents the foundation on top of
which Aneka-based hybrid clouds are implemented. In this section we will
introduce the components that compose this framework and briefly describe
their interactions.
The basic idea behind the Resource Provisioning Framework is depicted in
Figure 9.4. The resource provisioning infrastructure is represented by a collection of resource pools that provide access to resource providers, whether they are external or internal, and are managed uniformly through a specific component called a resource pool manager.

[FIGURE 9.4. System architecture of the Aneka Resource Provisioning Framework. Inside the Aneka container, the Membership Catalogue (join/leave) and the Scheduling Service interact with the Resource Provisioning Service (provision/release), which wraps the Resource Pool Manager and its resource pools (list, provision, and release operations).]

A detailed description of the components follows:
● Resource Provisioning Service. This is an Aneka-specific service that
implements the service interface and wraps the resource pool manager,
thus allowing its integration within the Aneka container.
● Resource Pool Manager. This manages all the registered resource pools and decides how to allocate resources from those pools. The resource pool manager provides a uniform interface for requesting additional resources from any private or public provider and hides the complexity of managing multiple pools from the Resource Provisioning Service.
● Resource Pool. This is a container of virtual resources that mostly come
from the same resource provider. A resource pool is in charge of
managing the virtual resources it contains and eventually releasing them
when they are no longer in use. Since each vendor exposes its own
specific interfaces, the resource pool (a) encapsulates the specific
implementation of the communication protocol required to interact with it
and (b) provides the pool manager with a unified interface for acquiring,
terminating, and monitoring virtual resources.
The request for additional resources is generally triggered by a scheduler that detects that the current capacity is not sufficient to satisfy the expected quality of service ensured for specific applications. In this case, a provisioning request is made to the Resource Provisioning Service. According to specific policies, the pool manager determines the pool instance(s) that will be used to provision resources and forwards the request to the selected pools. Each resource pool translates the forwarded request by using the specific protocols required by the external provider and provisions the resources. Once the requests are successfully processed, the requested number of virtual resources join the Aneka cloud by registering themselves with the Membership Catalogue Service, which keeps track of all the nodes currently connected to the cloud. Once they have joined the cloud, the provisioned resources are managed like any other node.
A release request is triggered by the scheduling service when provisioned resources are no longer in use. Such a request is then forwarded to the interested resource pool (with a process similar to the one described in the previous paragraph), which will take care of terminating the resources when appropriate. A general guideline for pool implementation is to keep provisioned resources active in a local pool until their lease time expires. By doing this, if a new request arrives within this interval, it can be served without leasing additional resources from the public infrastructure. Once a virtual instance is terminated, the Membership Catalogue Service will detect a disconnection of the corresponding node and update its registry accordingly.
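The interaction just described can be summarized in a small sketch. The names below are hypothetical and the code is illustrative only; the real Aneka components are .NET services that communicate through messages, but the division of responsibilities is the same: the pool manager selects a pool according to a policy and relays provision and release calls to it.

from abc import ABC, abstractmethod


class ResourcePool(ABC):
    """Wraps one provider's protocol behind a uniform interface."""

    @abstractmethod
    def provision(self, count):
        """Acquire `count` virtual resources; return their node identifiers."""

    @abstractmethod
    def release(self, node_ids):
        """Terminate (or park for reuse) the given virtual resources."""


class ResourcePoolManager:
    """Hides the set of pools behind a single provision/release interface."""

    def __init__(self, pools, policy):
        self._pools = pools    # e.g. {"local-xen": ..., "ec2": ...}
        self._policy = policy  # callable picking a pool name for a request

    def provision(self, count):
        pool_name = self._policy(count, list(self._pools))
        nodes = self._pools[pool_name].provision(count)
        # In Aneka, the new nodes then register with the Membership
        # Catalogue Service and are managed like any other node.
        return pool_name, nodes

    def release(self, pool_name, node_ids):
        self._pools[pool_name].release(node_ids)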
It can be noticed that the interaction flow previously described is completely independent of the specific resource provider that will be integrated into the system. In order to satisfy such a requirement, modularity and well-designed interfaces between components are very important. The current design, implemented in Aneka, keeps the specific implementation details within the ResourcePool implementation, and resource pools can be dynamically configured and added by using the dependency injection techniques already used for configuring the services hosted in the container. The current implementation of Aneka allows customizing the resource provisioning infrastructure by specifying the following elements:
● Resource Provisioning Service. The default implementation provides a lightweight component that generally forwards the requests to the Resource Pool Manager. A possible extension of the system is the implementation of a distributed resource provisioning service that can operate at this level or at the Resource Pool Manager level.
● Resource Pool Manager. The default implementation provides the basic management features required for resource management and provisioning-request forwarding.
● Resource Pools. The Resource Pool Manager exposes a collection of resource pools that can be used. It is possible to add any implementation that is compliant with the interface contract exposed by the Aneka provisioning API, thus adding a heterogeneous, open-ended set of external providers to the cloud.
● Provisioning Policy. Scheduling services can be customized with resource-provisioning-aware algorithms that can schedule applications by taking into account the required QoS.
The architecture of the Resource Provisioning Framework shares some features with other IaaS implementations featuring configurable software containers, such as OpenNebula [19] and Nimbus [22]. OpenNebula uses the concept of cloud drivers in order to abstract the external resource providers and provides a pluggable scheduling engine that supports the integration with advanced schedulers such as Haizea [20] and others. Nimbus provides a plethora of extension points in its programming API, among which there are hooks for scheduling and resource management and the remote management (RM) API. The former control when and where a virtual machine will run, while the RM API acts as a unified interface to Infrastructure-as-a-Service implementations such as Amazon EC2 and OpenNebula. By providing a specific implementation of the RM API, it is possible to integrate other cloud providers.
In the next section, we detail the implementation of the Amazon EC2 resource pool to provide a practical example of a resource pool implementation.
9.5.3 Use Case—The Amazon EC2 Resource Pool
Amazon EC2 is one of the most popular cloud resource providers. At the time of writing, it is listed among the top 10 companies providing cloud computing services (source: http://www.networkworld.com/supp/2009/ndc3/051809-cloud-companies-to-watch.html, accessed December 2009; a more recent review still ranked Amazon in the top ten: http://searchcloudcomputing.techtarget.com/generic/0,295582,sid201_gci1381115,00.html#slideshow). It provides a Web service interface for accessing, managing, and controlling virtual machine instances. The Web-service-based interface simplifies the integration of Amazon EC2 with any application. This is the case for Aneka, for which a simple Web service client has been developed to allow interaction with EC2. In order to interact with Amazon EC2, several parameters are required:
● User Identity. This represents the account information used to authenticate with Amazon EC2. The identity is constituted by a pair of keys, the access key and the secret key. These keys can be obtained from the Amazon Web Services portal once the user has signed in, and they are required to perform any operation that involves Web service access.
● Resource Identity. The resource identity is the identifier of a public or a private Amazon Machine Image (AMI) that is used as a template from which to create virtual machine instances.
● Resource Capacity. This specifies the type of instance that will be deployed by Amazon EC2. Instance types vary according to the number of cores, the amount of memory, and other settings that affect the performance of the virtual machine instance. Several instance types are available; those commonly used are small, medium, and large. The capacity of each type of resource has been predefined by Amazon and is charged differently.
This information is maintained in the EC2ResourcePoolConfiguration class and needs to be provided by the administrator in order to configure the pool. The EC2ResourcePool implementation then forwards the requests of the pool manager to EC2 by using the Web service client and the configuration information previously described. It then stores the metadata of each active virtual instance for further use.
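As an illustration of how these parameters map onto an actual EC2 request, here is a sketch that uses the boto3 library as a stand-in for the Web service client described in the chapter. It is not the Aneka implementation; EC2PoolConfig is a hypothetical counterpart of EC2ResourcePoolConfiguration.

import boto3
from dataclasses import dataclass


@dataclass
class EC2PoolConfig:                  # hypothetical counterpart of the
    access_key: str                   # EC2ResourcePoolConfiguration class
    secret_key: str
    image_id: str                     # public or private AMI identifier
    instance_type: str = "m1.small"   # capacity: small, medium, large, ...
    region: str = "us-east-1"


def provision(cfg, count):
    # User identity -> credentials; resource identity -> ImageId;
    # resource capacity -> InstanceType.
    ec2 = boto3.client("ec2", region_name=cfg.region,
                       aws_access_key_id=cfg.access_key,
                       aws_secret_access_key=cfg.secret_key)
    resp = ec2.run_instances(ImageId=cfg.image_id,
                             InstanceType=cfg.instance_type,
                             MinCount=count, MaxCount=count)
    # The pool would store this metadata for later release or reuse.
    return [i["InstanceId"] for i in resp["Instances"]]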
In order to make the best use of the virtual machine instances provisioned from EC2, the pool implements a cost-effective optimization strategy. According to the current business model of Amazon, a virtual machine instance is charged in one-hour time blocks. This means that if a virtual machine instance is used for 30 minutes, the customer is still charged for one hour of usage. In order to provide good service to applications with a smaller granularity in terms of execution times, the EC2ResourcePool class implements a local cache that keeps track of the released instances whose time block has not yet expired. These instances are reused instead of activating new instances from Amazon. With the cost-effective optimization strategy, the pool is able to minimize the cost of provisioning resources from the Amazon cloud and, at the same time, achieve high utilization of each provisioned resource.
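The caching strategy can be sketched as follows. The code is illustrative only and the names do not come from the Aneka codebase: released instances are parked until their paid hour runs out and are handed back out before any new instance is started.

import time
from typing import Optional

BLOCK_SECONDS = 3600  # Amazon bills EC2 instances in one-hour time blocks


class ReleasedInstanceCache:
    """Parks released instances until their paid hour runs out."""

    def __init__(self):
        self._parked = {}  # instance_id -> start time of the current block

    def park(self, instance_id, block_start):
        self._parked[instance_id] = block_start

    def acquire(self) -> Optional[str]:
        """Return a still-paid-for instance, or None if a new one is needed."""
        now = time.time()
        for instance_id, started in list(self._parked.items()):
            del self._parked[instance_id]
            if now - started < BLOCK_SECONDS:
                return instance_id  # reuse at no extra charge within the block
            # Block expired: the real pool would terminate this instance here.
        return None

A new request therefore first asks the cache for a parked instance and only calls EC2 when none is available, which is what keeps both the cost low and the utilization of each paid instance high.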
9.5.4 Implementation Steps for the Aneka Resource Provisioning Service
The resource provisioning service is a customized service that is used to enable cloud bursting by Aneka at runtime. Figure 9.5 illustrates one of the application scenarios that utilize resource provisioning to dynamically provision virtual machines from the Amazon EC2 cloud.

[FIGURE 9.5. Aneka resource provisioning (cloud bursting) over Amazon EC2. Tasks submitted to the master machine are scheduled on local private-cloud resources; when extra resources are requested, the provisioning service starts VMs from the Aneka AMI on Amazon EC2, the new Aneka worker containers join the network by registering with the membership service, and tasks are then dispatched to the workers for execution.]

The general steps of resource provisioning on demand in Aneka are the following (a short sketch putting these steps together appears after the list):
● The application submits its tasks to the scheduling service, which, in turn, adds the tasks to the scheduling queue.
● The scheduling algorithm finds an appropriate match between a task and a resource. If the algorithm cannot find enough resources to serve all the tasks, it requests extra resources from the scheduling service.
● The scheduling service sends a ResourceProvisionMessage to the provisioning service and asks it to acquire X resources, as determined by the scheduling algorithm.
● Upon receiving the provision message, the provisioning service delegates the provision request to a component called the resource pool manager, which is responsible for managing various resource pools. A resource pool is a logical view of a cloud resource provider, where virtual machines can be provisioned at runtime. Aneka resource provisioning supports multiple resource pools, such as the Amazon EC2 pool and the Citrix Xen Server pool.
● The resource pool manager knows how to communicate with each pool and provisions the requested resources on demand. Based on the requests from the provisioning service, the pool manager starts X virtual machines by utilizing the predefined virtual machine template already configured to run Aneka containers.
● A worker instance of Aneka is configured and running once a virtual resource has started. All the worker instances then connect to the Aneka master machine and register themselves with the Aneka membership service.
● The scheduling algorithm is notified by the membership service once those worker instances join the network, and it starts allocating pending tasks to them immediately.
● Once the application is completed, all the provisioned resources are released by the provisioning service to reduce the cost of renting the virtual machines.
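Putting the steps together, the scheduler-side logic amounts to a simple loop. The sketch below uses hypothetical names and assumes the provisioning service returns the identifiers of the newly joined workers; in Aneka the same flow is driven by messages such as ResourceProvisionMessage.

import math


def schedule(tasks, workers, provisioning_service, tasks_per_worker=1):
    # Steps 1-2: estimate the capacity needed for the queued tasks.
    needed = math.ceil(len(tasks) / tasks_per_worker)
    shortfall = max(0, needed - len(workers))
    if shortfall:
        # Steps 3-8: ask the provisioning service for extra nodes; the new
        # workers are assumed to have joined the network when this returns.
        workers = workers + provisioning_service.provision(shortfall)
    # Step 9: dispatch tasks round-robin to the (now larger) worker set.
    assignments = {w: [] for w in workers}
    for i, task in enumerate(tasks):
        assignments[workers[i % len(workers)]].append(task)
    return assignments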
9.6 VISIONARY THOUGHTS FOR PRACTITIONERS
The research on the integration of public and private clouds is still at an early stage, even though the adoption of cloud computing technologies keeps growing and delivering IT services via the cloud will be the norm in the future. The key areas of interest that need to be explored include security standardization; pricing models; and management and scheduling policies for heterogeneous environments. At the time of writing, only limited research has been carried out in these fields.
As briefly addressed in the introduction, security is one of the major concerns in hybrid clouds. While private clouds significantly reduce security risks by retaining sensitive information within corporate boundaries, in the case of hybrid clouds the workload that is delegated to the public portion of the infrastructure is subject to the same security risks that are prevalent in public clouds. In this sense, workload partitioning and classification can help reduce the security risks for sensitive data. Keeping sensitive operations within the boundaries of the private part of the infrastructure and ensuring that the information flow in the cloud is kept under control is a naïve and probably often limited solution. The major issues that need to be addressed are the following: security of virtual execution environments (either hypervisors or managed runtime environments for PaaS implementations), data retention, the possibility of massive outages, provider trust, and also jurisdiction issues that can break the confidentiality of data. These issues become even more crucial in the case of hybrid clouds because of the dynamic nature of the way in which public resources are integrated into the system. Currently, the security measures and tools adopted for traditional distributed systems are used. Cloud computing brings not only challenges for security, but also advantages. Cloud service providers can make considerable investments in security infrastructure and provide more secure environments than those provided by small enterprises. Moreover, a cloud's virtual dynamic infrastructure makes it possible to achieve better fault tolerance and reliability, greater resiliency to failure, rapid reconstruction of services, and a low-cost approach to disaster recovery. The lack of standardization is another important area that has to be covered.
Currently, each vendor publishes their own interfaces, and there is no common
agreement on a standard for exposing such services. This condition limits the
adoption of inter-cloud services on a global scale. As discussed in this chapter,
in order to integrate IaaS solutions from different vendors it is necessary to
implement ad hoc connectors. The lack of standardization covers not only the
programming and management interface, but also the use of abstract
representations for virtual images and active instances. An effort in this
direction is the Open Virtualization Format (OVF) [25], an open standard for
packaging and distributing virtual appliances or more generally software to be
run in virtual machines. However, even if endorsed by the major representative
companies in the field (Microsoft, IBM, Dell, HP, VMWare, and XenSource)
and released as a preliminary standard by the Distributed Management Task
Force, the OVF specification only captures the static representation of a virtual
instance; it is mostly used as a canonical way of distributing virtual machine
images. Many vendors and implementations simply use OVF as an import
format and convert it into their specific runtime format when running the image.
Additional effort has to be spent on defining a common method to represent live
instances of applications and in providing a standard approach to customizing
these instances during startup. Research in this area will be
necessary to completely eliminate vendor lock-in.8 In addition, when building a
hybrid cloud based on legacy hardware and virtual public infrastructure,
additional compatibility issues arise due to the heterogeneity of the runtime
environments: almost all the hypervisors support the x86 machine model, which
could constitute a technology barrier in the seamless transition from private
environments to public ones. Finally, as discussed by Keahey et al. [26], there is
a need for providing (a) a standardized way for describing and comparing the
quality of service (QoS) offerings of different cloud services providers and (b) a
standardized approach to benchmark those services. These are all areas that
have to be explored in order to take advantage of heterogeneous clouds, which,
due to their dynamic nature, require automatic methods for optimizing and
monitoring the publicly provisioned services. An important step in providing a
standardization path and to foster the adoption of cloud computing is the Open
Cloud Manifesto, which provides a starting point for the promotion of open
clouds characterized by interoperability between providers and true scalability
for applications.
Since the integration of external resources comes with a price, it is interesting
to study how to optimize the usage of such resources. Currently, resources are
priced in time blocks, and often their granularity does not meet the needs of
enterprises. Virtual resource pooling, as provided by Aneka, is an initial step in
closing this gap, but new strategies for optimizing the usage of external
provisioned resources can be devised. For example, intelligent policies that can
predict when to release a resource by relying on the statistics of the workload
can be investigated. Other policies could identify the optimal number of
resources to provision according to the application needs, the budget allocated
for the execution of the application, and the workload. Research in this direction
will become even more relevant as different pricing models are introduced by cloud providers. In this future scenario, the introduction of a
market place for brokering cloud resources and services will definitely give
more opportunities to fully realize the vision of cloud computing. Each vendor
will be able to advertise their services and customers will have more options to
choose from, eventually by relying on meta-brokering services. Once realized,
these opportunities will make the accessibility of cloud computing technology
more natural and fairly priced, thus simplifying its integration with existing on-premises computing infrastructure.
We believe that one of the major areas of interest in the next few years, concerning the implementation and deployment of hybrid clouds, will be
the scheduling of applications and the provisioning of resources for these
applications. In particular, due to the heterogeneous nature of hybrid clouds,
additional coordination between the private and the public service management
becomes fundamental. Hence, cloud schedulers will necessarily be integrated
with different aspects such as federated policy management tools, seamless
hybrid integration, federated security, information asset management,
coordinated provisioning control, and unified monitoring.
COMETCLOUD: AN AUTONOMIC CLOUD ENGINE
10.1 INTRODUCTION
Clouds typically have highly dynamic demands for resources with highly
heterogeneous and dynamic workloads. For example, the workloads associated
with the application can be quite dynamic, in terms of both the number of tasks
processed and the computation requirements of each task. Furthermore,
different applications may have very different and dynamic quality of service
(QoS) requirements; for example, one application may require high throughput
while another may be constrained by a budget, and a third may have to balance
both throughput and budget. The performance of a cloud service can also vary
based on these varying loads as well as failures, network conditions, and so on,
resulting in a different "QoS" delivered to the application.
Combining public cloud platforms and integrating them with existing grids
and data centers can support on-demand scale-up, scale-down, and scale-out.
Users may want to use resources in their private cloud (or data center or grid)
first before scaling out onto a public cloud, and they may have a preference for
a particular cloud or may want to combine multiple clouds. However, such
integration and interoperability is currently nontrivial. Furthermore, integrating
these public cloud platforms with existing computational grids provides opportunities for on-demand scale-up and scale-down, that is, cloudbursts.
In this chapter, we present the CometCloud autonomic cloud engine. The
overarching goal of CometCloud is to realize a virtual computational cloud
with resizable computing capability, which integrates local computational
environments and public cloud services on demand, and provides abstractions and mechanisms to support a range of programming paradigms and
application requirements. Specifically, CometCloud enables policy-based
autonomic cloudbridging and cloudbursting. Autonomic cloudbridging enables
on-the-fly integration of local computational environments (data centers, grids)
and public cloud services (such as Amazon EC2 and Eucalyptus [20]), and
autonomic cloudbursting enables dynamic application scale-out to address
dynamic workloads, spikes in demands, and other extreme requirements.
CometCloud is based on a decentralized coordination substrate, and it
supports highly heterogeneous and dynamic cloud/grid infrastructures,
integration of public/private clouds, and cloudbursts. The coordination
substrate is also used to support a decentralized and scalable task space that
coordinates the scheduling of tasks submitted by a dynamic set of users onto
sets of dynamically provisioned workers on available private and/or public
cloud resources based on their QoS constraints such as cost or performance.
These QoS constraints along with policies, performance history, and the state
of resources are used to determine the appropriate size and mix of the public
and private clouds that should be allocated to a specific application request.
This chapter also demonstrates the ability of CometCloud to support the
dynamic requirements of real applications (and multiple application groups)
with varied computational requirements and QoS constraints. Specifically, this
chapter describes two applications enabled by CometCloud, a computationally
intensive value at risk (VaR) application and a high-throughput medical image
registration. VaR is a market standard risk measure used by senior managers
and regulators to quantify the risk level of a firm's holdings. A VaR calculation should be completed within a limited time, and the computational
requirements for the calculation can change significantly. Image registration is
the process of determining the linear/nonlinear mapping between two images
of the same object or similar objects. In image registration, a set of image
registration methods are used by different (geographically distributed) research
groups to process their locally stored data. The images will typically be acquired at different times or from different perspectives, and will be in
different coordinate systems. It is therefore critical to align those images into
the same coordinate system before applying any image analysis.
The rest of this chapter is organized as follows. We present the CometCloud
architecture in Section 10.2. Section 10.3 elaborates on policy-driven autonomic cloudbursts—specifically, autonomic cloudbursts for real-world applications,
autonomic cloudbridging over a virtual cloud, and runtime behavior of
CometCloud. Section 10.4 gives an overview of the VaR and image registration applications. We evaluate the autonomic behavior of CometCloud in Section 10.5 and conclude this chapter in Section 10.6.
10.2 COMETCLOUD ARCHITECTURE
CometCloud is an autonomic computing engine for cloud and grid
environments. It is based on the Comet [1] decentralized coordination
substrate, and it
supports highly heterogeneous and dynamic cloud/grid infrastructures,
integration of public/private clouds, and autonomic cloudbursts. CometCloud is
based on a peer-to-peer substrate that can span enterprise data centers, grids,
and clouds. Resources can be assimilated on-demand and on-the-fly into
its peer-to-peer overlay to provide services to applications. Conceptually,
CometCloud is composed of a programming layer, a service layer, and an
infrastructure layer; these layers are described in more detail in the following
section. CometCloud (and Comet) adapts the Squid information discovery
scheme to deterministically map the information space onto the dynamic set of
peer nodes. The resulting structure is a locality preserving semantic distributed
hash table (DHT) on top of a self-organizing structured overlay. It maintains
content locality and guarantees that content-based queries, using flexible
content descriptors in the form of keywords, partial keywords, and wildcards,
are delivered with bounded costs. Comet builds a tuple-based coordination
space abstraction using Squid, which can be associatively accessed by all
system peers without requiring the location information of tuples and host
identifiers. CometCloud also provides transient spaces that enable applications
to explicitly exploit context locality.
CometCloud Layered Abstractions
A schematic overview of the CometCloud architecture is presented in Figure 10.1. The infrastructure layer uses the Chord self-organizing overlay, and the
Squid information discovery and content-based routing substrate built on top
of Chord.
FIGURE 10.1. The CometCloud architecture for autonomic cloudbursts: a programming layer (master/worker/BOT, workflow, MapReduce/Hadoop, with scheduling, monitoring, and task consistency), a service layer (coordination, publish/subscribe, discovery, event messaging, replication, load balancing, content-based routing, content security, and clustering/anomaly detection), and a self-organizing infrastructure layer over data center/grid/cloud resources.
The routing engine supports flexible content-based
routing and complex querying using partial keywords, wildcards, or ranges. It
also guarantees that all peer nodes with data elements that match a query/
message will be located. Nodes providing resources in the overlay have different
roles and, accordingly, different access privileges based on their credentials and
capabilities. This layer also provides replication and load balancing services,
and it handles dynamic joins and leaves of nodes as well as node failures. Every
node keeps the replica of its successor node‘s state, and it reflects changes to this
replica whenever its successor notifies it of changes. It also notifies its
predecessor of any changes to its state. If a node fails, the predecessor node
merges the replica into its state and then makes a replica of its new successor. If
a new node joins, the joining node‘s predecessor updates its replica to reflect the
joining node‘s state, and the successor gives its state information to the joining
node. To maintain load balancing, load should be redistributed among the
nodes whenever a node joins or leaves.
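The join/leave and failure handling just described can be pictured with a minimal sketch of successor-state replication in a ring overlay (a simplified illustrative model, not CometCloud's actual code; class and method names are assumptions):

class OverlayNode:
    """Simplified model of successor-state replication in a ring overlay."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.local_space = {}    # this node's own tuples/state
        self.replica_space = {}  # copy of the successor's local space
        self.predecessor = None
        self.successor = None

    def notify_predecessor(self):
        # Whenever the local state changes, push a fresh copy to the predecessor.
        if self.predecessor is not None:
            self.predecessor.replica_space = dict(self.local_space)

    def put(self, key, value):
        self.local_space[key] = value
        self.notify_predecessor()

    def handle_successor_failure(self):
        # The predecessor of the failed node merges the replica into its own
        # state, then starts replicating the state of its new successor.
        self.local_space.update(self.replica_space)
        self.successor = self.successor.successor
        self.replica_space = dict(self.successor.local_space)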
The service layer provides a range of services to support autonomics at the
programming and application level. This layer supports the Linda-like tuple
space coordination model, and it provides a virtual shared-space abstraction as
well as associative access primitives. The basic coordination primitives are
listed below:
● out(ts, t): a nonblocking operation that inserts tuple t into space ts.
● in(ts, t'): a blocking operation that removes a tuple t matching template t' from the space ts and returns it.
● rd(ts, t'): a blocking operation that returns a tuple t matching template t' from the space ts. The tuple is not removed from the space.
The out operator inserts a tuple into the space, while in and rd read a tuple from it: in removes the tuple after reading it, whereas rd only reads it. Range queries are supported, so the wildcard "*" can be used to match all tuples. These uniform operators do not distinguish between local and remote spaces; consequently, Comet is naturally suited to context-transparent applications. However, this abstraction does not maintain geographic locality between peer nodes, which can be detrimental to applications, such as mobile applications, that require context locality to be maintained in addition to content locality; that is, they impose requirements
for context-awareness. To address this issue, CometCloud supports
dynamically constructed transient spaces that have a specific scope
definition (e.g., within the same geographical region or the same physical
subnet). The global space is accessible to all peer nodes and acts as the
default coordination platform. Membership and authentication mechanisms
are adopted to restrict access to the transient spaces. The structure of the
transient space is exactly the same as the global space. An application can
switch between spaces at runtime and can simultaneously use multiple
spaces. This layer also provides asynchronous
(publish/subscribe) messaging and event services. Finally, online clustering
services support autonomic management and enable self-monitoring and
control. Events describing the status or behavior of system components are
clustered, and the clustering is used to detect anomalous behaviors.
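As an illustration of this Linda-like coordination model, the following is a minimal single-process sketch of a tuple space with out/in/rd semantics (a toy model for exposition only, not the distributed Comet implementation; all names are illustrative):

import threading

class TupleSpace:
    """Toy Linda-like space: out inserts, rd reads, in reads and removes."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, t):
        # Nonblocking insert.
        with self._cond:
            self._tuples.append(t)
            self._cond.notify_all()

    def _match(self, t, template):
        # A template matches a tuple field by field; "*" acts as a wildcard.
        return len(t) == len(template) and all(
            f == "*" or f == v for f, v in zip(template, t))

    def rd(self, template):
        # Blocking read without removal.
        with self._cond:
            while True:
                for t in self._tuples:
                    if self._match(t, template):
                        return t
                self._cond.wait()

    def in_(self, template):
        # Blocking read with removal ("in" is a Python keyword, hence in_).
        with self._cond:
            while True:
                for t in self._tuples:
                    if self._match(t, template):
                        self._tuples.remove(t)
                        return t
                self._cond.wait()

For example, a master could insert a task with space.out(("task", 42)) and a worker could consume it with space.in_(("task", "*")).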
The programming layer provides the basic framework for application
development and management. It supports a range of paradigms including
the master/worker/BOT. Masters generate tasks and workers consume them.
Masters and workers can communicate via virtual shared space or using a
direct connection. Scheduling and monitoring of tasks are supported by the
application framework. The task consistency service handles lost tasks. Even
though replication is provided by the infrastructure layer, a task may be lost
due to network congestion. In this case, since there is no failure, infrastructure
level replication may not be able to handle it. This can be handled by the
master, for example, by waiting for the result of each task for a predefined time
interval and, if it does not receive the result back, regenerating the lost task. If
the master receives duplicate results for a task, it selects the first one and ignores
other subsequent results. Other supported paradigms include workflow-based
applications as well as MapReduce and Hadoop.
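The task-consistency behavior just described (timeout-based regeneration of lost tasks and suppression of duplicate results) can be sketched as follows; the timeout value, the nonblocking try_in helper, and the tuple layout are assumptions made for illustration:

import time

def run_master(space, tasks, timeout=60.0, poll_interval=1.0):
    """Insert tasks, collect results, and regenerate tasks that time out.

    `space` is assumed to offer out(tuple) and a nonblocking try_in(template)
    that returns a matching tuple or None (assumed helpers, for illustration).
    `tasks` is a list of (task_id, payload) pairs.
    """
    payload_of = dict(tasks)
    pending = {tid: time.time() for tid, _ in tasks}
    results = {}
    for tid, payload in tasks:
        space.out(("task", tid, payload))

    while pending:
        hit = space.try_in(("result", "*", "*"))
        if hit is not None:
            _, tid, value = hit
            if tid not in results:          # keep the first result, drop duplicates
                results[tid] = value
            pending.pop(tid, None)
        else:
            time.sleep(poll_interval)
        now = time.time()
        for tid, t0 in list(pending.items()):
            if now - t0 > timeout:          # assume the task was lost; regenerate it
                space.out(("task", tid, payload_of[tid]))
                pending[tid] = now
    return results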
Comet Space
In Comet, a tuple is a simple XML string, where the first element is the tuple's tag and is followed by an ordered list of elements containing the tuple's fields.
Each field has a name followed by its value. The tag, field names, and values
must be actual data for a tuple and can contain wildcards ("*") for a template
tuple. This lightweight format is flexible enough to represent the information
for a wide range of applications and can support rich matching relationships. Furthermore, the cross-platform nature of XML makes this format suitable
for information exchange in distributed heterogeneous environments. A tuple
in Comet can be retrieved if it exactly or approximately matches a template
tuple. Exact matching requires the tag and field names of the template tuple to
be specified without any wildcard, as in Linda. However, this strict
matching pattern must be relaxed in highly dynamic environments, since
applications (e.g., service discovery) may not know exact tuple structures.
Comet supports tuple retrievals with incomplete structure information using
approximate matching, which only requires the tag of the template tuple be
specified using a keyword or a partial keyword. Examples are shown in
Figure 10.2. In this figure, tuple (a) tagged "contact" has fields "name, phone, email, dep" with values "Smith, 7324451000, smith@gmail.com, ece" and can
be retrieved using tuple template (b) or (c).
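To make these matching rules concrete, here is a minimal sketch of exact and approximate matching over a tuple like (a) in Figure 10.2, with templates written in the spirit of (b) and (c); tuples are simplified to a tag plus (name, value) fields, and the helper names are illustrative rather than Comet's actual API:

import fnmatch

def field_matches(template_field, tuple_field):
    # Each field is a (name, value) pair; names and values may contain "*".
    tname, tvalue = template_field
    name, value = tuple_field
    return fnmatch.fnmatchcase(name, tname) and fnmatch.fnmatchcase(value, tvalue)

def tuple_matches(template, tup):
    """template and tup are (tag, [(name, value), ...]); '*' acts as a wildcard."""
    ttag, tfields = template
    tag, fields = tup
    if not fnmatch.fnmatchcase(tag, ttag):
        return False
    # Every template field must match some tuple field; approximate matching
    # lets the template specify only part of the structure.
    return all(any(field_matches(tf, f) for f in fields) for tf in tfields)

contact = ("contact", [("name", "Smith"), ("phone", "7324451000"),
                       ("email", "smith@gmail.com"), ("dep", "ece")])
template_b = ("contact", [("name", "Smith"), ("phone", "7324451000"),
                          ("email", "*"), ("dep", "*")])   # exact structure, wildcard values
template_c = ("contact", [("na*", "Smith"), ("*", "*")])   # partial keyword and wildcard field
assert tuple_matches(template_b, contact)
assert tuple_matches(template_c, contact)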
FIGURE 10.2. Example of tuples in CometCloud: (a) the tuple <contact> <name>Smith</name> <phone>7324451000</phone> <email>smith@gmail.com</email> <dep>ece</dep> </contact>; (b) an exact-matching template for it whose field values may be wildcards (e.g., <email>*</email>, <dep>*</dep>); (c) an approximate-matching template using a partial keyword and wildcard fields (e.g., <na*>Smith</na*>, <*>).
Comet adapts the Squid information discovery scheme and employs the Hilbert space-filling curve (SFC) to map tuples from a semantic information space to a linear node index. The semantic information space, consisting of base-10 numbers and English words, is defined by application users. For example, a computational storage resource may belong to the 3D storage space with
coordinates "space," "bandwidth," and "cost." Each tuple is associated with k keywords selected from its tag and field names, which are the keys of a tuple. For example, the keys of tuple (a) in Figure 10.2 can be "name, phone" in a 2D student information space. Tuples are local in the information space if their keys are lexicographically close, or if they have common keywords. The selection of keys can be specified by the applications.
A Hilbert SFC is a locality-preserving continuous mapping from a k-dimensional (kD) space to a 1D space. It is locality preserving in that points that are close on the curve are mapped from close points in the kD space. The Hilbert curve readily extends to any number of dimensions. Its locality-preserving property enables the tuple space to maintain content locality in the index space. In Comet, the peer nodes form a one-dimensional overlay, which is indexed by a Hilbert SFC. Applying the Hilbert mapping, tuples are mapped from the multi-dimensional information space to the linear peer index space. As a result, Comet uses the Hilbert SFC to construct the distributed hash table (DHT) for tuple distribution and lookup. If the keys of a
tuple only include complete keywords, the tuple is mapped as a point in the
information space and located on at most one node. If its keys consist of partial
keywords, wildcards, or ranges, the tuple identifies a region in the information
space. This region is mapped to a collection of segments on the SFC and
corresponds to a set of points in the index space. Each node stores the keys that
map to the segment of the curve between itself and the predecessor node. For
example, as shown in Figure 10.3, five nodes (with ids shown as solid circles) are indexed using an SFC from 0 to 63; the tuple defined as the point (2, 1) is mapped to index 7 on the SFC and corresponds to node 13, and the tuple defined as the region (2-3, 1-5) is mapped to two segments on the SFC and corresponds to nodes 13 and 32.
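The 2D-to-1D mapping can be illustrated with the standard iterative Hilbert-index computation (a generic textbook routine shown for exposition, not CometCloud's code); with this orientation convention, the point (2, 1) in the 8 x 8 space of Figure 10.3 maps to index 7, matching the example above:

def hilbert_index(n, x, y):
    """Map a point (x, y) in an n x n grid (n a power of two) to its
    1D index along the Hilbert space-filling curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so the curve orientation is preserved.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

print(hilbert_index(8, 2, 1))  # -> 7, i.e., index 7 in the 0-63 index space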
FIGURE 10.3. Examples of mapping tuples from 2D information space to 1D index space [1].
10.3 AUTONOMIC BEHAVIOR OF COMETCLOUD
Autonomic Cloudbursting
The goal of autonomic cloudbursts is to seamlessly and securely integrate
private enterprise clouds and data centers with public utility clouds on-demand,
to provide the abstraction of resizable computing capacity. It enables the
dynamic deployment of application components (which typically run on internal
organizational compute resources) onto a public cloud to address dynamic
workloads, spikes in demands, and other extreme requirements. Furthermore,
given the increasing application and infrastructure scales, as well as their
cooling, operation, and management costs, typical over-provisioning strategies
are no longer feasible. Autonomic cloudbursts can leverage utility clouds
to provide on-demand scale-out and scale-in capabilities based on a range
of metrics.
The overall approach for supporting autonomic cloudbursts in CometCloud
is presented in Figure 10.4. CometCloud considers three types of clouds based
on perceived security/trust and assigns capabilities accordingly. The first is a
highly trusted, robust, and secure cloud, usually composed of trusted/secure
nodes within an enterprise, which is typically used to host masters and other
key (management, scheduling, monitoring) roles. These nodes are also used to
store states. In most applications, the privacy and integrity of critical data must
be maintained; as a result, tasks involving critical data should be limited to
cloud nodes that have required credentials. The second type of cloud is one
composed of nodes with such credentials—that is, the cloud of secure workers.
A privileged Comet space may span these two clouds and may contain critical
data, tasks, and other aspects of the application-logic/workflow. The final type
of cloud consists of casual workers. These workers are not part of the space but
can access the space through the proxy and a request handler to obtain
(possibly encrypted) work units as long as they present required credentials.
Nodes can be added to or deleted from any of these clouds as needed. If the space needs to scale up to store a dynamically growing workload and more computing capability is also required, then autonomic cloudbursts target secure workers for the scale-up. If only more computing capability is required, then unsecured workers are added.
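A minimal reading of this scale-up rule as code (the function and argument names are assumptions made for illustration):

def cloudburst_target(space_needs_to_grow, more_compute_needed):
    # Secure workers can share the Comet space, so they are used when the
    # growing workload must also be stored; unsecured workers only compute.
    if space_needs_to_grow and more_compute_needed:
        return "secure_workers"
    if more_compute_needed:
        return "unsecured_workers"
    return "no_scale_up"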
FIGURE 10.4. Autonomic cloudbursts using CometCloud: robust/secure masters (with management, scheduling, and monitoring) and secure workers share the Comet space, while unsecured workers on data centers, grids, and clouds request tasks through a proxy and request handler and send their results directly to the master.
Key motivations for autonomic cloudbursts include:
● Load Dynamics. Application workloads can vary significantly. This
includes the number of application tasks as well as the computational
requirements of a task. The computational environment must dynamically
grow (or shrink) in response to these dynamics while still maintaining
strict deadlines.
● Accuracy of the Analytics. The required accuracy of risk analytics depends
on a number of highly dynamic market parameters and has a direct impact
on the computational demand—for example the number of scenarios in
the Monte Carlo VaR formulation. The computational environment must
be able to dynamically adapt to satisfy the accuracy requirements while
still maintaining strict deadlines.
● Collaboration of Different Groups. Different groups can run the same application on different datasets under different policies. Here, a policy means the user's SLA, bounded by conditions such as time frame, budget, and economic model. As collaborating groups join or leave the work, the computational environment must grow or shrink to satisfy their SLAs.
● Economics. Application tasks can have very heterogeneous and dynamic
priorities and must be assigned resources and scheduled accordingly.
Budgets and economic models can be used to dynamically provision
computational resources based on the priority and criticality of the
application task. For example, application tasks can be assigned budgets
and can be assigned resources based on this budget. The computational
environment must be able to handle heterogeneous and dynamic
provisioning and scheduling requirements.
● Failures. Due to the strict deadlines involved, failures can be disastrous.
The computation must be able to manage failures without impacting
application quality of service, including deadlines and accuracies.
Autonomic Cloudbridging
Autonomic cloudbridging is meant to connect CometCloud to a virtual cloud, which consists of a public cloud, a data center, and a grid, according to the dynamic needs of the application. The clouds in the virtual cloud are heterogeneous and have different types of resources and cost policies; moreover, the performance of each cloud can change over time with the number of current users. Hence, the types of clouds used, the number of nodes in each cloud, and the resource types of those nodes should be decided according to the changing environment of the clouds and the application's resource requirements.
Figure 10.5 shows an overview of the operation of CometCloud-based autonomic cloudbridging.
FIGURE 10.5. Overview of the operation of autonomic cloudbridging: research sites (each with a scheduling agent and policy) are bridged through CometCloud to a virtually integrated working cloud made up of a public cloud, a data center, and a grid.
The scheduling agent manages autonomic
cloudbursts over the virtual cloud, and there can be one or more scheduling
agents. A scheduling agent is located at a robust/secure master site. If multiple collaborating research groups work together, and each group needs to generate tasks with its own data and manage the virtual cloud under its own policy, then each group can have a separate scheduling agent at its master site. The
requests for tasks generated by the different sites are logged in the CometCloud
virtual shared space that spans master nodes at each of the sites. These tasks are
then consumed by workers, which may run on local computational nodes at the
site, a shared data center, and a grid or on a public cloud infrastructure.
A scheduling agent manages QoS constraints and autonomic cloudbursts of its
site according to the defined policy. The workers can access the space using
appropriate credentials, access authorized tasks, and return results back to the
appropriate master indicated in the task itself.
A scheduling agent manages autonomic cloudbridging and guarantees QoS within user policies. An autonomic cloudburst is realized by changing resource provisioning so that the defined policy is not violated. We define three types of policies; a minimal decision sketch follows the list below.
● Deadline-Based. When an application needs to be completed as soon as
possible, assuming an adequate budget, the maximum required workers
are allocated for the job.
● Budget-Based. When a budget is enforced on the application, the number
of workers allocated must ensure that the budget is not violated.
● Workload-Based. When the application workload changes, the number of
workers explicitly defined by the application is allocated or released.
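The following sketch illustrates how the three policy types could drive worker allocation (an illustration only, not CometCloud's scheduler; all names and the simple cost/throughput model are assumptions):

def workers_to_allocate(policy, state):
    """Decide how many workers to run for the next scheduling interval.

    `state` is a dict with keys such as: max_workers, current_workers,
    remaining_budget, cost_per_worker_hour, tasks_remaining,
    tasks_per_worker_hour, workload_delta, workers_per_step.
    """
    if policy == "deadline":
        # Finish as soon as possible: allocate the maximum required workers.
        return state["max_workers"]
    if policy == "budget":
        # Never exceed the budget: bound workers by what another hour would cost.
        affordable = int(state["remaining_budget"] // state["cost_per_worker_hour"])
        needed = -(-state["tasks_remaining"] // state["tasks_per_worker_hour"])  # ceiling
        return max(0, min(affordable, needed, state["max_workers"]))
    if policy == "workload":
        # Add or release a fixed number of workers as the workload changes.
        step = state["workers_per_step"]
        if state["workload_delta"] > 0:
            return min(state["current_workers"] + step, state["max_workers"])
        if state["workload_delta"] < 0:
            return max(state["current_workers"] - step, 0)
    return state["current_workers"]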
Other Autonomic Behaviors
Fault-Tolerance. Supporting fault-tolerance during runtime is critical to meeting the application's deadline. We support fault-tolerance in two ways: in the infrastructure layer and in the programming layer. The replication
substrate in the infrastructure layer provides a mechanism to keep the same
state as that of its successor‘s state, specifically coordination space and overlay
information. Figure 10.6 shows the overview of replication in the overlay.
Every node has a local space in the service layer and a replica space in the
infrastructure layer. When a tuple is inserted or extracted from the local space,
the node notifies this update to its predecessor and the predecessor updates the
replica space. Hence every node keeps the same replica of its successor‘s local
space. When a node fails, another node in the overlay detects the failure and notifies the predecessor of the failed node. The predecessor then merges the replica space into its local space, which recovers all the tuples from the failed node, and makes a new replica of the local space of its new successor. We also support fault-tolerance in the programming layer. Even though a replica of each node is maintained, some tasks can be lost during runtime because of network
congestion or task generation during a failure. To address this issue, the master checks the space periodically and regenerates lost tasks.
FIGURE 10.6. Replication overview in the CometCloud overlay: each master or worker node holds a local space and a replica space mirroring its successor's local space.
Load Balancing. In a cloud environment, executing application requests
on underlying grid resources consists of two key steps. The first, which we
call VM Provisioning, consists of creating VM instances to host each
application request, matching the specific characteristics and requirements
of the request. The second step is mapping and scheduling these requests
onto distributed physical resources (Resource Provisioning). Most virtualized
data centers currently provide a set of general-purpose VM classes with
generic resource configurations, which quickly become insufficient to support the highly varied and interleaved workloads. Furthermore, clients can easily under- or overestimate their needs because of a lack of understanding of application requirements due to application complexity and/or uncertainty, and this often results in over-provisioning due to a tendency to be conservative.
The decentralized clustering approach specifically addresses the distributed
nature of enterprise grids and clouds. The approach builds on a decentralized
messaging and data analysis infrastructure that provides monitoring and
density-based clustering capabilities. By clustering workload requests across
data center job queues, the characterization of different resource classes can be
accomplished to provide autonomic VM provisioning. This approach has
several advantages, including the capability of analyzing jobs across a dynamic
set of distributed queues, the nondependency on a priori knowledge of the
number of clustering classes, and its amenability to online application and timely
adaptation to changing workloads and resources. Furthermore, the robust
nature of the approach allows it to handle changes (joins/leaves) in the job
queue servers as well as their failures while maximizing the quality and
efficiency of the clustering.
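As a rough illustration of grouping workload requests into resource classes, the following greedy sketch clusters (cpu, memory) requests around running centroids (a simplified stand-in for the decentralized density-based clustering described above; names and the radius threshold are assumptions):

def cluster_requests(requests, radius=0.5):
    """Greedy clustering of (cpu, memory) requests: each request joins the
    first cluster whose centroid is within `radius`, otherwise it starts a
    new cluster. The resulting centroids could seed VM classes."""
    clusters = []  # each cluster: {"centroid": (cpu, mem), "members": [...]}
    for cpu, mem in requests:
        for c in clusters:
            ccpu, cmem = c["centroid"]
            if abs(cpu - ccpu) <= radius and abs(mem - cmem) <= radius:
                c["members"].append((cpu, mem))
                n = len(c["members"])
                # Incrementally update the running mean of the centroid.
                c["centroid"] = (ccpu + (cpu - ccpu) / n, cmem + (mem - cmem) / n)
                break
        else:
            clusters.append({"centroid": (cpu, mem), "members": [(cpu, mem)]})
    return clusters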
10.4 OVERVIEW OF COMETCLOUD-BASED APPLICATIONS
In this section, we describe two types of applications: VaR, for measuring the risk level of a firm's holdings, and image registration, for medical informatics. A VaR calculation should be completed within a limited time, and its computational requirements can change significantly; moreover, the need for additional computation arises irregularly. Hence, for VaR we will focus on how autonomic cloudbursts work for dynamically changing workloads. Image registration is the process of determining the linear/nonlinear mapping T between two images of the same object or similar objects that are acquired at different times or from different perspectives. In addition, because a set of image registration methods is used by different (geographically distributed) research groups to process their locally stored data, jobs can be injected from multiple sites. Another notable difference between the two applications is that the data size for image registration is much larger than that for VaR; for a 3D image, the image size is usually a few tens of megabytes. Hence, image data are kept separate from the task tuple: they reside on a separate storage server, and their location is indicated in the task tuple. Because image registration usually needs to be completed as soon as possible within a budget limit, we will focus on how CometCloud works using the budget-based policy.
Value at Risk (VaR)
Monte Carlo VaR is a very powerful measure used to judge the risk of
portfolios of financial instruments. The complexity of the VaR calculation
stems from simulating portfolio returns. To accomplish this, Monte Carlo
methods are used to "guess" what the future state of the world may look like. Guessing a large number of times allows the technique to encompass the complex distributions and correlations of the different factors that drive portfolio returns into a discrete set of scenarios.
scenarios contains a state of the world comprehensive enough to value all
instruments in the portfolio, thereby allowing us to calculate a return for the
portfolio under that scenario.
The process of generating Monte Carlo scenarios begins by selecting
primitive instruments or invariants. To simplify simulation modeling, invariants
are chosen such that they exhibit returns that can be modeled using a stationary
normal probability distribution. In practice, these invariants are returns on
stock prices, interest rates, foreign exchange rates, and so on. The universe of
invariants must be selected such that portfolio returns are driven only by
changes to the invariants.
To properly capture the nonlinear pricing of portfolios containing options,
we use Monte Carlo techniques to simulate many realizations of the invariants.
Each realization is referred to as a scenario. Under each of these scenarios, each
option is priced using the invariants and the portfolio is valued. As outlined
above, the portfolio returns for scenarios are ordered from worst loss to best
gain, and a VaR number is calculated.
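The last step can be expressed compactly; the sketch below assumes a hypothetical simulate_portfolio_return callable that prices the portfolio under one sampled scenario and returns its profit or loss:

def monte_carlo_var(simulate_portfolio_return, n_scenarios=10000, confidence=0.99):
    """Estimate VaR from Monte Carlo scenarios: simulate portfolio returns,
    order them from worst loss to best gain, and read off the loss at the
    chosen confidence level."""
    returns = sorted(simulate_portfolio_return() for _ in range(n_scenarios))
    cutoff = int((1.0 - confidence) * n_scenarios)
    return -returns[cutoff]   # VaR reported as a positive loss figure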
Image Registration
Nonlinear image registration is the computationally expensive process of determining the mapping T between two images of the same object or similar objects acquired at different times, in different positions, or with different acquisition parameters or modalities. Both intensity/area-based and landmark-based methods have been reported to be effective in handling various registration tasks. Hybrid methods that integrate both techniques have demonstrated advantages in the literature [13-15].
An alternative landmark point detection and matching method was developed as part of a hybrid image registration algorithm for both 2D and 3D images [16]. The algorithm starts with automatic detection of a set of landmarks in
both fixed and moving images, followed by a coarse to fine estimation of the
nonlinear mapping using the landmarks. Intensity template matching is further
used to obtain the point correspondence between landmarks in the fixed and
moving images. Because there is a large portion of outliers in the initial
landmark correspondence, a robust estimator, RANSAC [17], is applied to
reject outliers. The final refined inliers are used to robustly estimate a thin-plate spline (TPS) transform [18] to complete the final nonlinear registration.
10.5 IMPLEMENTATION AND EVALUATION
In this section, we first evaluate basic CometCloud operations, then describe how the applications were implemented using CometCloud and compare application runtimes while varying the number of nodes. We then evaluate VaR using the workload-based policy and image registration using the budget-based policy, and we also evaluate CometCloud with and without a scheduling agent. Because the deadline-based policy has no budget limit and allocates as many workers as possible, we used it only to compare results with and without the scheduling agent for the budget-based policy.
Evaluation of CometCloud
Basic CometCloud Operations. In this experiment we evaluated the costs of
basic tuple insertion and exact retrieval operations on the Rutgers cloud. Each
machine was a peer node in the CometCloud overlay and the machines formed
a single CometCloud peer group. The size of the tuple in the experiment was
fixed at 200 bytes. A ping-pong-like process was used in the experiment, in
which an application process inserted a tuple into the space using the out
operator, read the same tuple using the rd operator, and deleted it using the
in operator. In the experiment, the out and exact matching in/rd operators used
a three-dimensional information space. For an out operation, the measured
time corresponded to the time interval between when the tuple was posted into
the space and when the response from the destination was received. For an in or
rd operation, the measured time was the time interval between when the
template was posted into the space and when the matching tuple was returned
to the application, assuming that a matching tuple existed in the space. This
time included the time for routing the template, matching tuples in the
repository, and returning the matching tuple. The average performances were
measured for different system sizes.
Figure 10.7a plots the average measured performance and shows that the
system scales well with increasing number of peer nodes. When the number
of peer nodes increases 32 times (i.e., from 2 to 64), the average round-trip
time increases only about 1.5 times, due to the logarithmic complexity of the
routing algorithm of the Chord overlay.
FIGURE 10.7. Evaluation of CometCloud primitives on the Rutgers cloud. (a) Average time for out, in, and rd operators for increasing system sizes. (b) Average time for in and rd operations with increasing number of tuples; system size fixed at 4 nodes.
TABLE 10.1. The Overlay Join Overhead on Amazon EC2
Number of Nodes    Time (msec)
10                  353
20                  633
40                 1405
80                 3051
100                3604
rd and in operations exhibit similar
performance, as shown in Figure 10.7a. To further study the in/rd operator, the
average time for in/rd was measured using an increasing number of tuples.
Figure 10.7b shows that the performance of in/rd is largely independent of the
number of tuples in the system: The average time is approximately 105 ms as
the number of tuples is increased from 2000 to 12,000.
Overlay Join Overhead. To share the Comet space, a node should join the
CometCloud overlay and each node should manage a finger table to keep track
of changing neighbors. When a node joins the overlay, it first connects to a
predefined bootstrap node and sends its information such as IP address to the
bootstrap. Then the bootstrap node makes a finger table for the node and sends
it back to the node. Hence, the more nodes that join the overlay at the same time, the larger the join overhead. Table 10.1 shows the join overhead for different numbers of nodes joining at the same time. We evaluated it on Amazon EC2, and the table shows that the join overhead is less than 4 seconds even when 100 nodes join the overlay at the same time.
Application Runtime
All tasks generated by the master are inserted into the Comet space, and each task is described by XML tags that differ according to the application. The data to be computed can be included inside a task or kept outside it, for example on a file server. To show both cases, VaR tasks include the data inside the tuple, while image registration tasks keep the data outside the tuple, because image data are much larger than VaR data. A typical out task for VaR is described as shown below.
<VarAppTask>
   <TaskId> taskid </TaskId>
   <DataBlock> data_blocks </DataBlock>
   <MasterNetName> master_name </MasterNetName>
</VarAppTask>
In image registration, each worker processes a whole image; hence the number of images to be processed equals the number of tasks. Moreover, because the image size is too large to be carried in a task, when the master generates tasks it includes only the data location as a tag in each task. After a worker takes a task from the Comet space, it connects to the data location and retrieves the data. A typical out task for image registration is described as shown below.
<ImageRegAppTask>
   <TaskId> taskid </TaskId>
   <ImageLocation> image_location </ImageLocation>
   <MasterNetName> master_name </MasterNetName>
</ImageRegAppTask>
Figure 10.8 shows the total application runtime of CometCloud-based (a)
VaR and (b) image registration on Amazon EC2 for different numbers of
scenarios. In this experiment, we ran a master on the Rutgers cloud and up to
80 workers on EC2 instances. Each worker ran on a different instance. We
assumed that all workers were unsecured and did not share the Comet space.
As shown in Figure 10.8a, and as expected, the application runtime of VaR decreases as the number of EC2 workers increases, up to a point. However, beyond that point the application runtime increases again (see 40 and 80 workers). This is because of the communication overhead of workers requesting tasks from the proxy. Note that the proxy is the access point for unsecured workers, even though a request handler sends the task to the worker after the proxy forwards the request to it. If each task carried more data and took longer to complete, workers would access the proxy less often and the proxy's communication overhead would decrease. Figure 10.8b shows the performance improvement of image registration when the number of workers increases. As with VaR, when the number of workers increases, the application runtime decreases. In this application, one image takes around 1 minute to process; hence the communication overhead is not visible in the graph.
FIGURE 10.8. Evaluation of CometCloud-based applications on Amazon EC2. (a) VaR, for 1000 and 3000 scenarios. (b) Image registration, for 100 images.
Autonomic Cloudburst Behaviors
VaR Using Workload-Based Policy. In this experiment, autonomic cloudbursting is reflected in the changing number of workers. When the application workload increases (or decreases), a predefined number of workers is added (or released), based on the application workload. Specifically, we defined workload-specific and workload-bounded policies. In the workload-specific policy, a user specifies the workload levels at which nodes are allocated or released. In the workload-bounded policy, whenever the workload increases by more than a specified threshold, a predefined number of workers is added; similarly, if the workload decreases by more than the specified threshold, the predefined number of workers is released.
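A minimal sketch of the workload-bounded rule, using parameters of the kind used in this experiment (a change threshold of a few hundred simulations and 3 workers added or released at a time); the function name and defaults are illustrative:

def workload_bounded_adjust(workers, last_workload, new_workload,
                            increase_threshold=200, decrease_threshold=400,
                            step=3):
    """Return the new worker count after a workload change."""
    delta = new_workload - last_workload
    if delta >= increase_threshold:
        return workers + step          # scale out by a fixed step
    if -delta >= decrease_threshold:
        return max(workers - step, 0)  # scale in, never below zero
    return workers

# Example: the workload grows from 1000 to 1200 simulations.
print(workload_bounded_adjust(8, 1000, 1200))  # -> 11 workers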
Figure 10.9 demonstrates autonomic cloudbursts in CometCloud based on
two of the above polices—that is, workload-specific and workload-bounded.
The figure plots the changes in the number of workers as the workload changes.
For the workload-specific policy, the initial workload is set to 1000 simulations
and the initial number of workers is set to 8. The workload is then increased or
decreased by 200 simulations at a time, and the number of workers added or released is set to 3. For the workload-bounded policy, the number of workers is
initially 8 and the workload is 1000 simulations. In this experiment, the
workload is increased by 200 and decreased by 400 simulations, and 3 workers
are added or released at a time. The plots in Figure 10.9 clearly demonstrate the
cloudburst behavior. Note that the policy used as well as the thresholds can be
changed on-the-fly.
Image Registration Using Budget-Based Policy. The virtual cloud
environment used for the experiments consisted of two research sites located at
Rutgers University and the University of Medicine and Dentistry of New Jersey:
one public cloud (i.e., Amazon Web Services (AWS) EC2) and one private data
center at Rutgers (i.e., TW). The two research sites hosted their own image servers
and job queues, and workers running on EC2 or TW access these image servers to
get the image described in the task assigned to them (see Figure 10.5).
FIGURE 10.9. Policy-based autonomic cloudbursts using CometCloud. (a) Workload-specific policy. (b) Workload-bounded policy.
Each image
server has 250 images resulting in a total of 500 tasks. Each image is
two-dimensional, and its size is between 17 kB and 65 kB. The costs associated
with running tasks on EC2 and TW nodes were computed based on costing
models presented in references 10 and 9, respectively. On EC2, we used
standard small instances with a computing cost of $0.10/hour, data transfer costs
of $0.10/GB for inward transfers, and $0.17/GB for outward transfers. Because
the computing cost is charged on an hourly basis, users pay for the full hour even if they use only a few minutes. However, in this experiment we calculated the cost per second because the total runtime is less than an hour.
Costs for the TW data center included hardware investment, software,
electricity, and so on, and were estimated based on the discussion in , which
says that a data center costs $120K/life cycle per rack and has a life cycle of 10
years. Hence, we set the cost for TW to $1.37/hour per rack. In the
experiments we set the maximum number of available nodes to 25 for TW
and 100 for EC2. Note that TW nodes outperform EC2 nodes, but are more
expensive. We used budget-based policy for scheduling where the scheduling
agent tries to complete tasks as soon as possible without violating the budget.
We set the maximum available budget in the experiments to $3 to complete all
tasks. The motivation for this choice is as follows. If the available budget were sufficiently high, then all the available TW nodes would be allocated and tasks would be assigned until all the tasks were completed. If the budget were too small, the scheduling agent would not be able to complete all the tasks within the budget. Hence, we set the budget to a value in between.
Finally, the monitoring component of the scheduling agent evaluated the
performance every 1 minute.
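The TW rate quoted above follows directly from the cited life-cycle figure; a quick check of the arithmetic (assuming a 10-year life cycle of 365-day years):

rack_life_cycle_cost = 120_000.0          # dollars per rack over its life cycle
life_cycle_hours = 10 * 365 * 24          # 10 years expressed in hours
print(round(rack_life_cycle_cost / life_cycle_hours, 2))  # -> 1.37 dollars/hour per rack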
Evaluation of the CometCloud-Based Image Registration Application with the Scheduling Agent Enabled. The results from the experiments are plotted in Figure 10.10.
Note that since the scheduling interval is 1 min, the x axis corresponds to both
time (in minutes) and the scheduling iteration number. Initially, the CometCloud
scheduling agent does not know the cost of completing a task. Hence, it
initially allocated 10 nodes each from TW and EC2. Figure 10.10a shows the
scheduled number of workers on TW and EC2 and Figure 10.10b shows costs
per task for TW and EC2. In the beginning, since the budget is sufficient, the
scheduling agent tries to allocate TW nodes even though they cost more than
EC2 node. In the second scheduling iteration, there are 460 tasks still remaining,
and the agent attempts to allocate 180 TW nodes and 280 EC2 nodes to finish all
tasks as soon as possible within the available budget. If TW and EC2 could
provide the requested nodes, all the tasks would be completed by next iteration.
However, since the maximum available number of TW nodes is only 25, it
allocates these 25 TW nodes and estimates a completion time of 7.2
iterations. The agent then decides on the number of EC2 workers to be used
based on the estimated rounds.
FIGURE 10.10. Experimental evaluation of medical image registration using CometCloud. Results were obtained using the scheduling agent. (a) Scheduled number of nodes. (b) Calculated cost per task. (c) Cumulative budget usage over time.
In the case of EC2, it takes around 1 minute to launch a node (from the start of the virtual machine to the ready state for consuming tasks); as a result, by the 4th
iteration the cost per task for EC2 increases. At this point, the scheduling agent
decides to decrease the number of TW nodes, which are expensive, and instead
it decides to increase the number of EC2 nodes using the available budget. By
the 9th iteration, 22 tasks are still remaining. The scheduling agent now decides
to release 78 EC2 nodes because they will not have jobs to execute. The reason
why the remaining jobs have not completed at the 10th iteration (i.e., 10
minutes) even though 22 nodes are still working is that the performance of EC2
decreased for some reason in our experiments. Figure 10.10c shows the used
budget over time. It shows that all the tasks were completed within the budget
and took around 13 minutes.
Comparison of Execution Time and Used Budget with/without Scheduling
Agent. Figure 10.11 shows a comparison of execution time and used budget
with/without the CometCloud scheduling agent. In the case where only EC2
nodes are used, when the number of EC2 nodes is decreased from 100 to 50 and
25, the execution time increases and the used budget decreases as shown in
Figures 10.11a and 10.11b. Comparing the same number of EC2 and TW nodes
(25 EC2 and 25 TW), the execution time for 25 TW nodes is approximately half
that for 25 EC2 nodes; however, the cost for 25 TW nodes is significantly more
than that for 25 EC2 nodes. When the CometCloud autonomic scheduling
agent is used, the execution time is close to that obtained using 25 TW nodes,
but the cost is much smaller and the tasks are completed within the budget. An
interesting observation from the plots is that if you don‘t have any limits on the
number of EC2 nodes used, then a better solution is to allocate as many EC2
nodes as you can. However, if you only have a limited number of nodes to use
and want a guarantee that your job is completed within a limited budget, then
the autonomic scheduling approach achieves an acceptable trade-off.
FIGURE 10.11. Experimental evaluation of medical image registration using CometCloud: comparison of performance and costs with/without autonomic scheduling. (a) Execution time varying the number of nodes of EC2 and TW. (b) Used budget over time varying the number of nodes for EC2 and TW.
Note that launching EC2 nodes at runtime impacts application performance because it
takes about a minute: A node launched at time t minutes only starts working
at time t + 1 minutes. Since different cloud services will have different
performance and cost profiles, the scheduling agent will have to use historical
data and more complex models to compute schedules, as we extend
CometCloud to include other service providers.
T-SYSTEMS' CLOUD-BASED SOLUTIONS FOR BUSINESS APPLICATIONS
INTRODUCTION
Thanks to the widespread acceptance of the Internet, cloud computing has
become firmly established in the private sphere. And now enterprises appear
poised to adopt this technology on a large scale. This is a further example of
the consumerization of IT—with technology in the consumer world driving
developments in the business world.
T-Systems is one of Europe‘s largest ICT service providers. It offers a wide
range of IT, telecommunications, and integrated ICT services, and it boasts
extensive experience in managing complex outsourcing projects. The company
offers hosting and other services from its 75 data centers with over 50,000
servers and over 125,000 MIPS—in Europe, Asia, the Americas, and Africa. In
addition, it is a major provider of desktop and network services. T-Systems
approaches cloud computing from the viewpoint of an organization with an
established portfolio of dynamic, scalable services delivered via networks. The
service provider creates end-to-end offerings that integrate all elements, in
collaboration with established hardware and software vendors.
Cloud computing is an opportunity for T-Systems to leverage its established
concept for services delivered from data centers. Cloud computing entails the
industrialization of IT production, enabling customers to use services and
resources on demand. Business, however, cannot adopt wholesale the
principles of cloud computing from the consumer world. Instead, T-Systems
aligns cloud computing with the specific requirements of large enterprises.
This can mean rejecting cloud principles where these conflict with statutory
requirements or security imperatives [1].
WHAT ENTERPRISES DEMAND OF CLOUD COMPUTING
Whether operated in-house or by an external provider, ICT is driven by two
key factors (Figure 11.1): cost pressure and market pressure. Both of these call
for increases in productivity.
Changing Markets
Today‘s markets are increasingly dynamic. Products and skills rapidly become
obsolete, eroding competitiveness. So incumbents need to find and implement
new ideas at an ever faster pace. Also, new businesses are entering the market
more rapidly, and they are extending their portfolios by forging alliances with
other players.
The Internet offers the opportunity to implement new business models and
integrate new stakeholders into processes—at speeds that were previously
unimaginable.
FIGURE 11.1. The route to cloud computing—industrialization of IT. Drivers (cost pressure: convert fixed costs into variable costs, reduce IT administration costs, increase liquidity; market pressure: meet new competition, new markets, new business models, consolidation) translate into a requirement for increased productivity (speed and ease of use, collaboration, new technologies) and into demands on ICT (speed, flexibility, scalability, availability, quality of service, security, cost benefits, transparency), which lead to cloud computing.
One excellent example is the automotive industry, which has brought together OEMs, suppliers, dealers, and customers on shared Internet
platforms. In line with Web 2.0 principles, customers can influence vehicle
development. This and other examples demonstrate the revolutionary potential
of cloud computing.
Markets and market participants are changing at an unprecedented pace.
New competitors are constantly entering the ring, and established enterprises
are undergoing transformation. Value grids are increasing the number of joint
ventures. This often leads to acquisitions, mergers, and divestments and gives
rise to new enterprises and business models.
At the same time, markets have become more flexible. This not only enables
enterprises to move into new lines of business with greater ease and speed, it
also changes prevailing market conditions. Customers respond faster to
changes in the supply of goods and services, market shares shift, some supply-and-demand relationships vanish completely, and individual markets shrink or
disappear. These phenomena have, for instance, radically transformed the retail
industry in recent years.
Against this background, companies not only need to scale up, but also to
scale down—for example, if demand falls, or if they take a strategic decision to
abandon a line of business or territory.
There is a need to respond to all these factors. Pressure is rising not only on
management, but also on ICT—because business processes supported by ICT
have to be rapidly modified to meet new imperatives. While the focus was on
saving money, ICT outsourcing was the obvious answer. But traditional
outsourcing cannot deliver the speed and agility markets now demand.
Today‘s legacy ICT infrastructures have evolved over many years and lack
flexibility. Moreover, few organizations can afford the capital investment
required to keep their technology up to date. At the same time, ICT resources
need to be quickly scaled up and down in line with changing requirements.
Intriguingly, ICT triggered this trend toward faster, more flexible businesses.
Now, this has come full circle—with more dynamic businesses calling for more
dynamic ICT.
Increased Productivity
Today, enterprise ICT and business processes are closely interwoven—so that
the line between processes and technology is becoming blurred. As a result, ICT
is now a critical success factor: It significantly influences competitiveness and
value creation. The impact of fluctuations in the quality of ICT services (for
example, availability) is felt immediately. The nonavailability of ERP
(enterprise resource planning) and e-mail systems brings processes grinding to
a halt and makes collaboration impossible. And the resulting time-to-market
delays mean serious competitive disadvantage.
The demands are also increasing when it comes to teamwork and
collaboration. Solutions not only have to deliver speed plus ease of use, they
also have to support simultaneous work on the same documents, conduct team
meet ings with participants on different continents, and provide the
necessary
infrastructure (anywhere access, avoidance of data redundancy, etc.). That is
no easy task in today‘s environment.
Rising Cost Pressure
Globalization opens up new markets. But it also means exposure to greater
competition. Prices for goods and services are falling at the same time that the
costs for power, staff, and raw materials are rising. The financial crisis has
aggravated the situation, with market growth slowing or stagnating. To master
these challenges, companies have to improve their cost structures.
This generally means cutting costs. Staff downsizing and the divestment of
loss-making units are often the preferred options. However, replacing fixed
costs with variable costs can also contribute significantly—without resorting to
sensitive measures such as layoffs. This improves liquidity. Money otherwise
tied up in capital investment can be put to good use elsewhere. In extreme cases,
this can even avert insolvency; most commonly, the resulting liquidity is used to
increase equity, mitigating financial risk.
A radical increase in the flexibility of the ICT landscapes can deliver
significant long-term benefits. It fundamentally transforms cost structures,
since ICT-related expenses are a significant cost factor. ICT spending (for
example, administration and energy costs) offers considerable potential for
savings.
However, those savings must not be allowed to impact the quality of ICT
services. The goal must be standardized, automated (i.e., industrialized), and
streamlined ICT production. The high quality of the resulting ICT services
increases efficiency and effectiveness and enhances reliability, thereby cutting
costs and improving competitiveness.
In other words, today's businesses expect a great deal from their ICT. It not
only has to open up market opportunities, it also has to be secure and reliable.
This means that ICT and associated services have to deliver speed, flexibility,
scalability, security, cost-effectiveness, and transparency. And cloud computing
promises to meet all these expectations.
DYNAMIC ICT SERVICES
Expectations differ considerably, depending on company size and industry. For
example, a pharmaceuticals multinational, a traditional midsize retailer, and a
startup will all have very different ICT requirements, particularly when it comes
to certification.
However, they all face the same challenges: the need to penetrate new
markets, to launch new services, to supply sales models, or to make joint
offerings with partners. This is where dynamic ICT delivers tangible benefits.
At first sight, it may seem paradoxical to claim that standardization can create
flexibility. But industrialized production within the scope of outsourcing
is not restrictive. In fact, quite the opposite: Industrialization provides the basis
for ICT services that are dynamic, fast, in line with real-world requirements,
and secure and reliable. ICT services of this kind are the foundation of a cloud
that provides services on demand. Only by industrializing ICT is it possible to
create the conditions for the flexible delivery of individual ICT services, and for
combining them in advantageous ways.
Standardized production also enables ICT providers to achieve greater
economies of scale. However, this calls for highly effective ICT management—
on the part of both the service provider and the customer. Proven concepts and
methodologies from the manufacturing industry can be applied to ICT. The
following are particularly worth mentioning:
● Standardization
● Automation
● Modularization
● Integrated creation of ICT services
Steps Toward Industrialized ICT
Standardization and automation greatly reduce production costs and increase
the efficiency and flexibility of ICT. However, they come at a price: There is
less scope for customization. This is something that everyone with a personal
e-mail account from one of the big providers has encountered. Services of
this kind fulfill their purpose, but offer only very stripped-down functionality
and are usually free of charge. More sophisticated e-mail solutions are
available only via fee-based "premium" offerings. In other words, lower
costs and simpler processes go hand in hand. And this is why companies have
to streamline their processes. When it comes to standardization, ICT service
providers focus on the technology while businesses focus on services and
processes.
The growing popularity of standard software reflects this. In the ERP space,
this trend has been evident for years, with homegrown solutions being replaced
by standard packages. A similar shift can be observed in CRM, with a growing
number of slimmed-down offerings available as software as a service (SaaS)
from the cloud.
At the same time, standards-based modularization enables new forms of
customization. However, greater customization of the solutions delivered to
businesses reduces efficiency for providers, thereby pushing up prices. In the
world of ICT, there is a clear conflict between customization and cost.
Standardization has the appeal (particularly for service providers) of cutting
ICT production costs. This means that ICT providers have to take these
arguments in favor of standardization seriously and adapt their production
accordingly. For enterprise customers, security and compliance are also key
considerations, alongside transparent service delivery, data storage, and
transfer. These parameters must be clearly defined in contracts and service-level
agreements (SLAs).
Customization through Modularization
Modular production enables ICT to be tailored to customers' specific
requirements—in conjunction with standardization. Modularization allows
providers to pool resources as the basis for delivering the relevant services.
Modularization essentially means a set of standardized individual modules that
can be combined. The resulting combinations give rise to sophisticated
applications tailored to the needs of the specific company. Standardized
interfaces (e.g., APIs) between individual modules play a pivotal role. And
one of the great strengths of modules is their reusability.
The more easily and flexibly such modules can be combined, the greater the
potential benefits. Providers have to keep the number of modules as low as
possible while meeting as many of their customers' requirements as possible,
and this is far from easy.
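To make the idea of combining standardized modules more concrete, the sketch below shows how a small set of reusable modules with a common interface might be composed into customer-specific services. It is only an illustration: the module names and prices are invented and do not come from any real provider catalog.

# A minimal sketch of modular ICT service composition (hypothetical module names).
from dataclasses import dataclass

@dataclass(frozen=True)
class Module:
    name: str          # standardized building block, e.g. "web-frontend"
    monthly_cost: int  # illustrative price per month

# A small, standardized catalog that the provider maintains and reuses.
CATALOG = {
    "web-frontend": Module("web-frontend", 200),
    "database": Module("database", 400),
    "archiving": Module("archiving", 150),
    "e-mail": Module("e-mail", 100),
}

def compose_service(module_names):
    """Combine standardized, reusable modules into a customer-specific service."""
    modules = [CATALOG[n] for n in module_names]   # only cataloged modules are allowed
    return {
        "modules": [m.name for m in modules],
        "monthly_cost": sum(m.monthly_cost for m in modules),
    }

# Two customers get tailored services from the same reusable modules.
print(compose_service(["web-frontend", "database"]))
print(compose_service(["web-frontend", "database", "archiving", "e-mail"]))

Keeping the catalog small while still covering most customer requirements is exactly the balancing act described above.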
One example of modularization in a different context is combining Web
services from various sources (mashups). In the cloud era, providers of modules
of this kind claim that they enable users with no programming skills to support
processes with ICT. However, experience shows that where such skills are
lacking, a specialist integrator is generally called in as an implementation
partner.
The benefit of modular services is that they can be flexibly combined,
allowing standard offerings to be tailored to specific requirements. At the same
time, they prevent customized solutions from straying too far from the
standard, which would significantly drive up the costs of later modifications.
Integrated Creation of ICT Services
Each of the elements outlined above can have significant advantages. But only
an integrated approach to creating ICT services—combining standardization,
automation and modularization—can deliver the entire range of benefits. This
gives the provider standardized, automated production processes and enables
the desired services to be delivered to the customer quickly and flexibly.
In the context of outsourcing, this form of industrialization yields its full
potential when providers and users have a close, two-way relationship with
corresponding connectivity. This enables businesses to play an active part in
production (ICT supply chain), tailoring ICT services to their changing needs.
However, the technology that supports this relationship must be based on
standards. Cloud computing promises to make switching to a different provider
quick and easy, but that is only possible if users are careful to avoid provider
lock-in.
IMPORTANCE OF QUALITY AND SECURITY IN CLOUDS
Quality (End-to-End SLAs)
If consumers' Internet or ICT services are unavailable, or data access is slow,
the consequences are rarely serious. But in business, the nonavailability of a
service can have a grave knock-on effect on entire mission-critical processes—
bringing production to a standstill, or preventing orders from being processed.
In such instances, quality is of the essence. The user is aware of the
performance of systems as a whole, including network connectivity. In complex
software applications, comprising multiple services and technical components,
each individual element poses a potential risk to the smooth running of
processes. Cloud-service providers therefore have to offer end-to-end availability, backed by clearly defined SLAs.
The specific quality requirements are determined by weighing up risk against
cost. The importance of a particular process and the corresponding IT solution
are assessed. The findings are then compared with the service levels on offer. As
a rule, higher service levels come at a higher price. Where a process is not
critical, businesses are often willing to accept relatively low availability to
minimize costs. But if a process is critical, they will opt for a higher service level,
with a corresponding price tag. So the quality question is not about combining
the highest service levels, but about selecting the right levels for each service.
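The trade-off between service level and price can be illustrated with a short calculation. The sketch below uses invented figures, not values from any real SLA; it shows why end-to-end availability matters (a chain of components is only as available as the product of its parts) and how the "right" service level can be chosen by weighing avoided downtime cost against the SLA price.

# Illustrative end-to-end availability and SLA-selection calculation (all numbers hypothetical).
from functools import reduce

def end_to_end_availability(component_availabilities):
    """A chain of components is only as available as the product of its parts."""
    return reduce(lambda a, b: a * b, component_availabilities, 1.0)

components = [0.999, 0.995, 0.998]            # e.g., network, application, storage
overall = end_to_end_availability(components)
print(f"End-to-end availability: {overall:.4f}")   # ~0.9920, below every single component

def worth_upgrading(downtime_cost_per_hour, hours_saved_per_year, extra_sla_price):
    """Pick the higher service level only if the avoided downtime outweighs its price."""
    return downtime_cost_per_hour * hours_saved_per_year > extra_sla_price

print(worth_upgrading(downtime_cost_per_hour=2000, hours_saved_per_year=20,
                      extra_sla_price=60000))      # False: this process is not critical enough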
Compliance and Security
Compliance and security are increasingly important for cloud-computing
providers. Security has been the subject of extensive media coverage and
debate. And surveys consistently pinpoint it as the greatest obstacle to cloud
computing. In a 2008 cio.com study, IT decision-makers cited security and loss
of control over data as the key drawbacks of cloud computing.
However, for businesses looking to deploy a form of cloud computing, legal
issues (e.g., privacy and liability) are considerably more important. And this is
why cloud providers have to find ways of enabling customers to meet statutory
requirements.
Consumer Cloud Versus Enterprise Cloud. The Internet has given rise to
new forms of behavior, even when concluding contracts on-line. When
presented with general terms and conditions, many consumers simply check
the relevant box and click "OK," often not realizing that they are entering into
a legally binding agreement. Standard contracts are now commonly used for
consumer services offered from the cloud. However, this does not meet the
demands of businesses.
Cloud computing raises no new legal issues, but it makes existing ones more
complex. This increased complexity is due to two factors. On the one hand,
cloud computing means that data no longer have to reside in a single location.
On the other hand, business scenarios involving multiple partners are now
conceivable. It is therefore often impossible to say exactly where data are stored
and what national legislation applies. And where data are handled by multiple
providers from different countries (sometimes on the basis of poorly structured
contracts), the issue of liability becomes correspondingly complex.
Cloud Computing from an Enterprise Perspective. With this in mind,
businesses should insist on comprehensive, watertight contracts that include
provisions for the recovery and return of their data, even in the event of
provider bankruptcy. Moreover, they should establish the country where
servers and storage systems are located. Cloud principles notwithstanding,
services still have to be performed and data stored at specific physical locations.
Where data are located determines whose law applies and also determines
which government agencies can access them. In addition to these "hard" factors,
enterprises have to consider that data-privacy cultures differ from country to
country.
Having the legal basis for liability claims is one thing; successfully
prosecuting them is quite another. This is why it is important to know the
contractually agreed legal venue. Moreover, it is useful to have a single end-to-end service level agreement defining availability across all services.
Even stricter statutory requirements apply where data are of a personal
nature (e.g., employee details in an HR system). Financial data are also subject
to stringent restrictions. In many parts of Europe, personal data enjoys special
protection. But even encryption cannot guarantee total security. Solutions that
process and store data in encrypted form go a long way toward meeting
statutory data-protection requirements. However, they are prohibited in some
countries. As a result, there are limits to secure data encryption in the cloud.
Companies listed on U.S. stock exchanges are subject to the Sarbanes-Oxley
Act (SOX), requiring complete data transparency and audit trails. This
poses particular challenges for cloud providers. To comply with SOX 404,
CEOs, CFOs, and external auditors have to report annually on the adequacy of
internal control systems for financial reporting. ICT service providers are
responsible for demonstrating the transparency of financial transactions.
However, providing this evidence is especially difficult, if not impossible, in a
cloud environment. This is a challenge that cloud providers must master—if
necessary, by departing from cloud principles.
Service providers also have to ensure that data are not lost and do not fall
into the wrong hands. The EU has data-security regulations that apply for all
European companies. For example, personal details may only be disclosed to
third parties with the consent of the individual involved. Moreover,
public-sector organizations generally insist on having sensitive data processed
in their home country. This is a particularly thorny issue when it comes to
patents, since attitudes to intellectual property differ greatly around the
world.
Moreover, some industries and markets have their own statutory
requirements. It is therefore essential that customers discuss their specific
needs with
the provider. And the provider should be familiar with industry-specific
practices and acquire appropriate certification.
Providers also have to safeguard data against loss, and businesses that use
cloud services should seek a detailed breakdown of disaster-recovery and
business-continuity plans.
Other legal issues may arise directly from the technology behind cloud
computing. On the one hand, conventional software licensing (based on CPUs)
can run counter to cloud-computing business models. On the other hand,
licenses are sometimes subject to geographical restrictions, making it difficult to
deploy them across borders.
What Enterprises Need. Cloud computing and applicable ICT legislation are
based on diametrically opposed principles. The former is founded on liberalism
and unfettered development—in this case, of technical opportunities. The latter
imposes tight constraints on the handling of data and services, as well as on the
relationship between customers and providers. And it seems unlikely that these
two perspectives will be reconciled in the near future.
Cloud providers have to meet the requirements of the law and of customers
alike. As a rule, this leads them to abandon some principles of "pure" cloud
computing—and to adopt only those elements that can be aligned with
applicable legislation and without risk. However, deployment scenarios
involving services from a public cloud are not inconceivable. So providers
have to critically adapt cloud principles.
Furthermore, providers working for major corporations have to be
dependable in the long term, particularly where they deliver made-to-measure
solutions for particular business processes. This is true whether the process is
critical or not. If a provider goes out of business, companies can expect to be
without the service for a long time. So before selecting a cloud provider,
customers should take a long hard look at candidates' services, ability to deliver
on promises, and, above all, how well SLAs meet their needs.
11.5 DYNAMIC DATA CENTER—PRODUCING BUSINESS-READY, DYNAMIC ICT SERVICES
Flexibility Across All Modules
Agility at the infrastructure level alone is not enough to provide fast, flexible
ICT services. Other dynamic levels and layers are also required (Figure 11.2).
Ultimately, what matters to the user is the flexibility of the system or service as
a whole. So service quality is determined by the slowest component.
Adaptable processing and storage resources at the computing level must be
supported by agile LAN and WAN infrastructures. Flexibility is also important
when it comes to application delivery, scalability, and extensibility via
functional modules.
Management processes must allow for manual intervention, where necessary,
and automatically link the various layers. These factors enable the creation of
end-to-end SLAs across all components.
[Figure 11.2 illustrates flexibility at all levels: business processes; application management services and staff; applications such as SAP, Lotus, Oracle, Exchange, desktop, Web services, archiving, and voice; processing power (computing, data, storage, archive); and networks (LAN, WAN), all of which are standardized, consolidated, virtualized, and automated under end-to-end SLAs.]
FIGURE 11.2. Flexibility at all levels is a basic requirement for cloud computing.
Every dynamic ICT service is based on a resource pool from which
computing, data, and storage services can be delivered as required. Dynamic
network and application services are also available. Moreover, the (business)
applications are optimized for deployment with pooled resources.
When customers opt for a dynamic service, they require an SLA that covers
not only individual components, but also the service as a whole, including any
WAN elements.
Toward Dynamic, Flexible ICT Services. The first step is to standardize the
customer's existing environment. IT systems running different software releases
have to be migrated, often to a single operating system. Hardware also has to
be standardized—for example, by bringing systems onto a specific processor
generation (such as x86). Eliminating disparate operating systems and
hardware platforms at this stage makes it considerably easier to automate
further down the line.
The second step is technical consolidation. This not only reduces the number
of physical servers, but also slims down data storage. Identical backup and
restore mechanisms are introduced at this stage; and small, uneconomical data
centers are closed.
The third step involves separating the logical from the physical.
Virtualization means that services no longer depend on specific hardware. This
has particular benefits in terms of maintenance. Moreover, virtualization
enables server resources to be subdivided and allocated to different tasks.
Process automation is more than just another component—it is key to
meeting the rising demand for IT services. What's more, it slashes costs,
improves efficiency (for example, by preventing errors), and accelerates
standard procedures. Providers' ability to offer cloud-computing services will
largely depend on whether they can implement mechanisms for automatic
management, allocation, and invoicing of resources.
In the business world, automation must also support seamless integration of
financial, accounting, and ordering systems.
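A very small sketch of what such automation might look like is given below: metered resource usage is turned into an invoice without manual intervention. All names and rates are invented for illustration only; a real provider would feed this from its monitoring systems and integrate it with financial, accounting, and ordering systems.

# Hypothetical pay-per-use invoicing from automatically metered resource usage.
RATES = {"server_hours": 0.40, "storage_gb_days": 0.002, "backup_gb": 0.01}

def invoice(customer, usage):
    """Turn metered usage records into invoice line items automatically."""
    lines = {item: round(qty * RATES[item], 2) for item, qty in usage.items()}
    return {"customer": customer, "lines": lines, "total": round(sum(lines.values()), 2)}

# Usage figures would normally come from the provider's metering systems.
print(invoice("example-customer",
              {"server_hours": 1440, "storage_gb_days": 30000, "backup_gb": 500}))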
T-Systems' Core Cloud Modules: Computing, Storage
Computing. The computing pool is based on server farms located in different
data centers. Logical server systems are created automatically at these farms.
The server systems comply with predefined standards. They are equipped with
the network interface cards required for communications and integration with
storage systems. No internal hard drives or direct-attached storage systems are
deployed.
The configuration management database (CMDB) plays a key role in
computing resource pools (Figure 11.3). This selects and configures the
required physical server (1). Once a server has been selected from the pool,
virtualization technology is selected in line with the relevant application and the
demands it has to meet (2). At the same time, the configuration requirements
are sent to the network configuration management system (3) and to the
storage configuration management system (4). Once all the necessary elements
are in place, the storage systems are mounted on the servers, after which the
operating-system images are booted (5).
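The numbered steps above can be read as an orchestration sequence. The following self-contained sketch reproduces that flow; the data structures and names are placeholders for whatever configuration-management, network, and storage tooling a provider actually uses.

# A sketch of the CMDB-driven provisioning sequence described above (steps 1-5).
# All names and data are invented; a real CMDB would drive physical systems.
CMDB = {
    "free_servers": ["srv-017", "srv-018"],
    "virtualization": {"sap": "VMware", "db": "IBM POWER5 LPAR"},
    "os_images": {"sap": "linux-std-image", "db": "unix-std-image"},
}

def provision_logical_server(app_profile):
    server = CMDB["free_servers"].pop(0)                    # (1) select a physical server
    hypervisor = CMDB["virtualization"][app_profile]        # (2) choose the virtualization technology
    network = f"vlan-for-{app_profile}"                     # (3) network configuration request
    volumes = [f"{server}-vol-{i}" for i in range(2)]       # (4) storage configuration request
    boot = {"server": server, "hypervisor": hypervisor,     # (5) mount the storage and boot the
            "network": network, "volumes": volumes,         #     read-only operating-system image
            "image": CMDB["os_images"][app_profile]}
    CMDB.setdefault("configured", []).append(boot)          # keep the CMDB up to date
    return boot

print(provision_logical_server("sap"))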
[Figure 11.3 depicts this provisioning sequence across two data centers (DC1 and DC2, linked via DWDM): (1) configuration management selects a physical server, (2) a virtualization technology is chosen (e.g., VMware, Solaris LDOM, Integrity VM, IBM Power 5 LPAR), (3) the network and (4) the storage and data systems are configured, and (5) the storage is mounted and the operating-system image is booted.]
FIGURE 11.3. Provision of computing resources.
Cloud computing enables a customer's application to be switched from
server to server within a defined group at virtually any interval (from minutes to
hours or days). This means that the configuration database must be updated
automatically to accurately reflect the current state of systems and
configurations at all times.
The CMDB also supports other tasks that are not required in conventional
ICT environments. These include enhanced monitoring and reporting, quality
management, and corresponding resource planning. Moreover, an ongoing
inventory of systems and their configurations is essential for rapid
troubleshooting.
Operating systems are provided in the form of images stored on a central
storage system. These are in read-only mode to ensure rapid startup. To limit
the number of operating systems and releases—and minimize related
administrative effort—only one version of each operating system is
maintained. This is employed to configure and boot the servers. This high
degree of standardization significantly reduces administration overhead.
Applications Are Also Virtualized. Speed is of the essence for cloud-computing
providers. Decoupling operating systems from applications plays a key role
here, because it reduces both initial and subsequent application-provisioning
time (following a failure, for example). Making applications available is simply
a matter of mounting them. This approach has other advantages: Applications
can quickly be moved from one server to another, and updates can be managed
independently of operating systems.
However, the full benefits can only be realized if there is a high degree of
automation and standardization in the IT infrastructure and the applications
themselves.
Storage. The necessary storage is provided and configured in much the same
way as the computing resources. IP-based storage systems are deployed. To
reduce hardware-configuration effort, the computing systems use neither SAN
nor direct-attached storage.
Using fiber-channel (FC) cards in the servers and deploying an FC network
would increase overall system complexity substantially. Instead, the IP storage
systems are linked via Gbit Ethernet. Storage is automatically allocated to the server
systems that require it.
Storage resources are located in different fire zones as well as in different
data centers, preventing data loss in the event of a disaster. The storage system
handles replication of data between data centers and fire zones, so computing
resources are not needed for this purpose (Figure 11.4).
Backup-Integrated Storage. In addition to storage resources, backups are
necessary to safeguard against data loss. For this reason, and in the interests of
automation, the Dynamic Data Center model directly couples backup to
storage; in other words, backup-integrated storage (BIS) is provided, along
with full management functionality.
[Figure 11.4 shows the storage setup across two data centers (DC 1 and DC 2, linked via DWDM): backup-integrated storage holding application and OS data, snapshot-based backups within the storage system, mirroring between the data centers, and a separate archive, all under common data, backup, and configuration management.]
FIGURE 11.4. Storage resources: backup-integrated, read-only, and archive storage.
To accelerate backup and reduce the volume of data transferred, data are
backed up on hard disks within the storage system by means of snapshotting.
This simplifies the structure of the computing systems (as no backup LAN is
necessary) and minimizes the potential for temporal bottlenecks.
Storage systems normally provide for a 35-day storage period. Usually, the
last three days are accessible on-line, with the rest being accessible from a
remote site.
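As a rough illustration of such a retention policy, the sketch below classifies snapshots by age using the 35-day window and 3-day on-line period mentioned above; everything else (dates, tier names) is invented for the example.

# Illustrative classification of snapshots under a 35-day retention policy,
# with the last 3 days kept on-line and older snapshots accessible from a remote site.
from datetime import date, timedelta

ONLINE_DAYS, RETENTION_DAYS = 3, 35

def classify_snapshots(snapshot_dates, today):
    tiers = {"online": [], "remote": [], "expired": []}
    for d in snapshot_dates:
        age = (today - d).days
        if age < ONLINE_DAYS:
            tiers["online"].append(d)
        elif age < RETENTION_DAYS:
            tiers["remote"].append(d)
        else:
            tiers["expired"].append(d)   # outside the retention window, eligible for deletion
    return tiers

today = date(2010, 5, 2)
snaps = [today - timedelta(days=n) for n in (0, 1, 2, 5, 20, 40)]
print(classify_snapshots(snaps, today))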
Archive and Other Storage. Archive systems are also available for long-term
data storage. Like BIS, these are hard-disk-based and linked via IP to the
respective systems. Data for archiving is replicated within the archive system
and in a separate fire zone, as well as at a remote data center. Replication is
handled by the archive system itself.
Archive storage can be managed in two ways. Archiving can be initiated
either from the applications themselves, which then handle administration of all
data, or via a document management system.
Some systems require a hard-disk cache. This is not worth backing up via
BIS, since data in a cache change rapidly, and the original data are stored and
backed up elsewhere in the system.
Communications. The computing and storage modules are integrated via an
automatically configured LAN or corresponding virtual private networks (VPNs). The
servers deployed in the computing module are equipped with multiple network
cards as standard. Depending on requirements, these are grouped to form the
necessary networks.
Networks are segregated from each other by means of VPN technology.
Backup-integrated storage eliminates the need for a separate backup network.
Customer Network. Access for customers is provided via Internet/VPN
connections. Services are assigned to companies by means of unique IP
addresses. As standard, access to Dynamic Data Centers is protected via
redundant, clustered firewalls. Various versions are available to cater to a
range of different customer and application requirements. Virtual firewalls are
configured automatically. Due to the high level of standardization, access is
entirely IP-based.
Storage and Administration Network. A separate storage network is provided
for accessing operating-system images, applications, and customer and archive
data. Configuration is handled automatically. An additional network,
segregated from the others, is available for managing IT components. Used
purely for systems configuration and other administration tasks, this network
has no access to the customer‘s data or content.
Dynamic Services—A Brief Overview
The Dynamic Data Center concept underlies all T-Systems Dynamic Services.
All the resources required by a given service are automatically provided by the
data center. This lays the foundations for a portfolio of solutions aimed at
business customers.
Dynamic Applications for Enterprises. Enterprises require applications
that support specific processes. This applies both to traditional outsourcing
and to business relationships in the cloud. T-Systems has tailored its portfolio
to fulfill these requirements.
● Communications and Collaboration. These are key components for any
company. Work on projects often entails frequent changes in user
numbers. As a result, enterprises need flexible means of handling
communications and collaboration. T-Systems offers the two leading e-mail
systems, Microsoft Exchange and IBM Lotus Domino, via Dynamic Services,
ensuring their rapid integration into existing environments.
● ERP and CRM. Dynamic systems are available to support core ERP and
CRM processes. T-Systems offers SAP and Navision solutions in this
space.
● Development and Testing. Software developers often need access—at short
notice and for limited periods of time—to server systems running a variety
of operating system versions and releases. Dynamic Services offer the
flexibility needed to meet these demands. Configured systems that are not
currently required can be locked and mothballed. So when computing
resources are no longer needed, no further costs are incurred. That is the
advantage of Dynamic Services for developers.
● Middleware. When it comes to middleware, Dynamic Services can lay the
foundation for further (more complex) services. In addition, businesses
can deploy them directly and integrate them into their own infrastructure.
The common term for offerings of this type is platform-as-a-service
(PaaS). T-Systems' middleware portfolio includes dynamic databases,
Web servers, portals, and archiving components.
● Front-Ends and Devices. Not only business applications, but also users' PC
systems, can be provided via the cloud. These systems, including office
applications, can be made available to users via Dynamic Desktop
Services.
Introducing New Services in a Dynamic Data Center. Cloud computing is
developing at a rapid pace. This means that providers have to continuously
review and extend their offerings. Here, too, a standardized approach is key to
ensuring that the services delivered meet business customers' requirements.
First, automatic mechanisms have to be developed for standardizing the
installation of typical combinations of operating system, database, and
application software. These mechanisms must also support automated
procedures for starting and stopping applications. The software components
and their automatic management functions are subject to release and patch
management procedures agreed with the vendors.
Deploying the combination of version and patches authorized by the vendor
enables a provider to assume end-to-end responsibility for a service. Automatic
monitoring and monthly reports are put in place for each service. An operating
manual is developed and its recommendations tested in a pilot installation
before the production environment goes live. The operating manual includes
automatic data backup procedures.
Next, a variety of quality options are developed. These can include
redundant resources across multiple fire zones. A concept for segregating
applications from each other is also created. This must include provisions for
selectively enabling communications with other applications via defined
interfaces, implemented in line with customer wishes. Only after EU legislation
(particularly regarding liability) has been reviewed is the application rolled out
to data centers worldwide.
Dynamic Data Centers Across the Globe
T-Systems delivers Dynamic Services from multiple data centers around the
world (Figure 11.5). These are mostly designed as twin-core facilities; in other
words, each location has two identical data centers several kilometers apart.
[Figure 11.5 charts the development of Dynamic Services from 2004 (first Dynamic Data Center in Frankfurt) to 2010, including the first customer on the dynamic platform, the establishment of the first international Dynamic Data Center in Jacksonville, USA, further DDCs in Munich, Sao Paulo, and Shanghai, growing SAP hosting customer numbers, an SAP Pinnacle Award, cost monitoring with cockpit functionality and daily updates, fulfillment of German, EU, and US legal requirements for systems validation, and the Deutsche Telekom/DKK project, in which one of the world's largest SAP systems (a 9 TB database) was migrated to a dynamic platform.]
FIGURE 11.5. The development of Dynamic Services.
All Dynamic Data Centers are based on the original concept used for the
first data center of this kind in Frankfurt. There are currently facilities in the
United States, Brazil, Germany, Singapore, and Malaysia.
11.6 CASE STUDIES
The industrialization of outsourcing not only impacts costs, it also affects
customers' business processes and sourcing strategies. In terms of sourcing, the
effects depend on the company's current ICT and on the size of the enterprise.
This is particularly obvious in the case of startups or businesses with an existing
ICT infrastructure.
Dynamic ICT Services for Startups. Startups often have no ICT infrastructure
and lack the knowledge required to establish and operate one. However, to put
their business concepts into practice, companies of this kind need rapid access
to reliable, fully functional ICT services. And because it is difficult to predict
how a startup will grow, its ICT has to offer maximum scalability and
flexibility, thereby enabling the company to meet market requirements.
Moreover, few venture capitalists are prepared to invest in inflexible hardware
and software that has to be depreciated over a number of years.
By deploying dynamic ICT services, a startup can find its feet in those early,
uncertain stages—without maintaining in-house ICT. And if demand falls, the
company does not have to foot the bill for unneeded resources. Instead, it can
quickly and easily scale down—and invest more capital in core tasks.
Dynamic ICT Services at Companies with Existing ICT Infrastructures. In
comparison to startups, most large companies already have established
ICT
departments, with the skills needed to deliver the desired services. These
in-house units often focus on ICT security rather than flexibility. After all, a
company's knowledge and expertise reside to a large extent in its data. These
data must be secure and available at all times—because they are often a
business-critical asset, the loss of which could jeopardize the company's
future.
In addition to cost savings, companies that use Dynamic Services benefit
from greater transparency—it is clear at all times which resources are available
and which are currently being used.
The opportunity to source ICT as a service opens up a wide range of new
options. For example, an international player with a complex legacy
environment can introduce Dynamic Services for SAP Solutions for a specific
business segment by adding resources to its German company and then rolling
these out to its other national subsidiaries. This kind of dynamic ICT
provisioning also enables fast, flexible penetration of new markets, without
the need for advance planning and long-term operation of additional ICT
resources. And seasonal fluctuations in business can be dealt with even more
easily.
Figure 11.6 shows the flexible and dynamic provisioning of resources.
Provisioning starts during the implementation phase, with a development
and test environment over a one-month period. This is followed by go-live of
the production environment. Additional development and training resources
can be accessed rapidly, if and when required.
[Figure 11.6 illustrates flexible ICT provisioning for dynamic markets over 2006-2008: a development system and a three-month test system during implementation, performance upgrades for quarterly reports, a two-week training system, and resources available on a day-by-day basis.]
FIGURE 11.6. Flexible ICT provisioning for dynamic markets.
Example: Dynamic Infrastructure Services
A mid-sized furniture manufacturer with over 800 employees leverages dynamic
infrastructure services. Within the scope of make-to-order manufacturing, the
company produces couches and armchairs in line with customers' specific needs.
On average, it makes some 1500 couches and armchairs daily. During the
summer months, this figure is almost halved—and use of the company's in-house
IT falls accordingly. In June 2005, the IT department outsourced data backup
and provisioning of mainframe resources to T-Systems. The service provider
now provides these as services, on a pay-per-use basis. As a result, the furniture
manufacturer no longer has to maintain in-house IT resources sized for peak
loads. Instead, its IT infrastructure is provided as a service [infrastructure as a
service (IaaS)].
If the data volume or number of users suddenly rises or falls, the company
can scale its resources up or down—and costs increase or decrease accordingly.
At the same time, it benefits from a solution that is always at the leading edge of
technology, without having to invest in that technology itself. Through regular
reporting, the customer also gains new transparency into the services it uses.
Around-the-clock monitoring provides maximum protection against system
failure and downtime. And the service provider backs up data from production
planning, on-line sales, transactions, e-mails, and the ERP system at one of its
data centers.
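A back-of-the-envelope comparison shows why this matters for a business with strong seasonal swings. Only the rough halving of summer demand comes from the example above; the cost figures in the sketch below are invented for illustration.

# Hypothetical comparison: in-house capacity sized for peak load vs. pay-per-use IaaS.
monthly_units = [1500] * 5 + [800] * 3 + [1500] * 4   # daily output per month; summer roughly halved
PEAK_COST_PER_MONTH = 10000                           # fixed cost of peak-sized in-house IT
PAY_PER_USE_RATE = 10000 / 1500                       # same cost at full load, but metered

in_house = PEAK_COST_PER_MONTH * 12
pay_per_use = sum(PAY_PER_USE_RATE * units for units in monthly_units)
print(f"Peak-sized in-house IT: {in_house:.0f}")
print(f"Pay-per-use IaaS:       {pay_per_use:.0f}")   # lower, because summer usage is billed at ~half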
Example: Dynamic Services for SAP
Infrastructure services like these enable the delivery of more complex services.
In this context, T-Systems specializes in business-critical applications supported
by SAP. So far, about 100 European companies use Dynamic Services
from T-Systems. Among them are Shell, Philips, Linde, and MAN.
In this case, a global group with a workforce of almost 500,000 in 60
countries operates in various business segments. However, its core business is
direct sales: The enterprise sells its products via sales partners, who process
110,000 orders each week using the central SAP system. If these orders are not
processed, payment will not be received; as a result, system failure could
significantly impact the company. Furthermore, around one million calls (in
Germany alone) are handled each year in the CRM module, as are tasks
ranging from a simple change of address to changes in financing arrangements.
The group‘s IT strategy is therefore focused on ensuring efficient, effective IT
support for its international direct sales.
Due to weekly commissions for sales employees and the unpredictable
nature of call-center activities, system-sizing estimates can vary by up to
500%. In addition, the rapid development of the company's SAP R/3 solution,
in conjunction with an international rollout, has significantly increased IT
resource requirements. Because it was virtually impossible to quantify these
factors in advance, the company decided to migrate to a dynamic platform
for future delivery of its SAP services. The entire application was transferred to
T-Systems‘ data center, where it has been operated using a Dynamic Services
model since January 2006.
With the move, the group has implemented a standardization strategy
that enables flexible adaptation of business processes and makes for more
straightforward and transparent group-wide reporting. With a conventional
infrastructure sized for peak loads, SAP R/3 operating costs would have been
twice as high as with the current dynamic solution. Furthermore, the company
now has the opportunity to scale its resources up or down by 50% within a
single day.
DKK: Europe’s Largest SAP Installation Is Run in a
Private Cloud
Many simple applications and small-scale, non-core systems already run in the
cloud. And now, some enterprises are having larger-scale, mission-critical apps
delivered in this way, or via their own secure clouds. For example, Deutsche
Telekom currently utilizes ICT services from a private cloud for some of its
business-critical processes.
This move was motivated by the desire to establish a highly scalable,
on-demand system for processing invoicing and payments and for managing
customer accounts and receivables. Deutsche Telekom's revenue management
system, DKK, handles more than 1.5 million payments a day from
approximately 30 million customers, making it one of the largest SAP
installations in the world (Figure 11.5).
T-Systems migrated the legacy server environment, comprising two
monolithic systems with a capacity of some 50,000 SAPS, to a highly
standardized, rapidly scalable solution based on Dynamic Services for SAP.
Performance improved by more than 20%, while costs sank by 30%.
The customer can freely scale ICT resources up or down. Furthermore, a
disaster recovery solution was established at a second, remote data center for
failure protection. The system currently handles nine terabytes of data.
The significant cost reductions are the result of vendor-independent
standardization of hardware with clustered deployment of commodity
components, backup-integrated storage, and extensively standardized
processes and procedures.
Quantifiable improvements, in technical terms, include a 45% drop in server
response times and a 40% reduction in batch-job processing times. Even client
response times have shrunk by close to 10%. This means that the new platform
significantly exceeds the targeted 20% improvement in overall system
performance. The dynamic cloud solution has proved more cost-effective, and
delivers better performance, than an environment operated on traditional lines.
The transition to the new platform did not involve modifications to the
custom-developed SAP ABAP programs. Returning to a conventional
environment would be even more straightforward, since no changes to the
operating system would be required, and the application's business logic would
not be affected.
Migrating Globally Distributed SAP Systems to a
Dynamic Platform
Even experienced ICT providers with a successful track record in
transformation projects have to perform risk analysis, including fallback
scenarios. To
reduce the risk of migrations (in both directions), cloud providers that serve
large enterprises require skills in both conventional operations and cloud
computing.
In one transformation engagement, when the contract was signed, the
customer was operating 232 SAP systems worldwide, with a total capacity of
1.2 million SAPS. Initially, T-Systems assumed responsibility for the systems
within the scope of a conventional outsourcing agreement, without changing
the mode of operation. The original environment was then gradually replaced
by a commercial cloud solution (managed private cloud). This approach has
since become established practice for T-Systems. Within the agreed
timeframe of 18 months, 80% of the systems were migrated. This major project
involved not only SAP software, but also non-SAP systems, which were
brought onto the new platform via dedicated interfaces.
Projects on this scale have a lasting influence on a service provider's data-center
infrastructure, and they drive IT industrialization. In this particular
engagement, the most compelling arguments for the customer were (a) the
security and reliability of the provider‘s data centers and (b) the smooth
interaction between the SAP interfaces. Transparency throughout the entire
systems landscape, lower costs, and greater responsiveness to changing
requirements were the key customer benefits.
11.7 SUMMARY: CLOUD COMPUTING OFFERS MUCH
MORE THAN TRADITIONAL OUTSOURCING
Cloud computing is an established concept from the private world that is
gaining ground in the business world. This trend can help large corporations
master some of their current challenges—for example, cost and market
pressures that call for increased productivity. While conventional outsourcing
can help enterprises cut costs, it cannot deliver the flexibility they need.
And greater flexibility brings even greater savings. Cloud computing poses a
challenge to traditional outsourcing models. If the paradigm shift becomes
a reality, IT users will have even more choice when it comes to selecting a
provider—and cloud computing will become a further alternative to existing
sourcing options.
Cloud computing makes for a more straightforward and flexible relationship
between providers and their customers. Contracts can be concluded more
rapidly, and resources are available on demand. What's more, users benefit
from end-to-end services delivered dynamically in line with their specific
business requirements. And companies only pay for the services they actually
use, significantly lowering IT investment. In a nutshell, cloud computing means
that IT services are available as and when they are needed—helping pare back
costs.
When it comes to selecting a sourcing model, cost and flexibility are only two
of the many factors that have to be taken into account. Further important
aspects are data privacy, security, compliance with applicable legislation, and
quality of service. The public cloud cannot offer a solution to these issues,
which is why private clouds are well worth considering.
Providers of cloud computing for large corporations need to be able to
intelligently combine their offerings with customer-specific IT systems and
services. In some cases, they can also leverage resources and services from the
public cloud.
But first, companies must consider which services and resources can be
outsourced to the cloud, and they must also define how important each one is
for the organization. Services that are not mission critical do not require robust
service levels and can be delivered via the public cloud. But business-critical IT
processes call for clearly defined SLAs, which, in turn, pushes up costs. Private
clouds are an effective way of meeting these requirements.
In both cloud-computing models, services are delivered on a standardized
basis. This reflects a general trend toward the industrialization of IT. Provision
of services via a private cloud requires higher standards of quality than via the
public cloud. By means of industrialization, cloud-computing providers enable
more efficient use of their IT infrastructures, thereby increasing productivity.
This not only cuts production costs, it also reduces the environmental footprint
of businesses' IT.
Case studies show that the general principles of cloud computing have
already been successfully adapted and employed for business-critical
applications hosted in a private cloud. However, enterprises must carefully
weigh up the pros and cons of each model and decide which resources can be
provided via the public cloud and which require a private cloud.
CHAPTER 12
WORKFLOW ENGINE FOR CLOUDS
SURAJ PANDEY, DILEBAN KARUNAMOORTHY,
and RAJKUMAR BUYYA
12.1 INTRODUCTION
A workflow models a process as a series of steps, which simplifies the
complexity of executing and managing applications. Scientific workflows
in domains such as high-energy physics and life sciences utilize distributed
resources in order to access, manage, and process a large amount of data from a
higher level. Processing and managing such large amounts of data require the
use of a distributed collection of computation and storage facilities. These
resources are often limited in supply and are shared among many competing
users. The recent progress in virtualization technologies and the rapid growth
of cloud computing services have opened a new paradigm in distributed
computing for utilizing existing (and often cheaper) resource pools for
on-demand and scalable scientific computing. Scientific Workflow Management
Systems (WfMS) need to adapt to this new paradigm in order to leverage the
benefits of cloud services.
Cloud services vary in the levels of abstraction and hence the type of service
they present to application users. Infrastructure virtualization enables providers
such as Amazon1 to offer virtual hardware for use in compute- and data-intensive
workflow applications. Platform-as-a-Service (PaaS) clouds expose a higher-level
development and runtime environment for building and deploying workflow
applications on cloud infrastructures. Such services may also expose
domain-specific concepts for rapid application development. Further up in the
cloud stack are Software-as-a-Service providers, who provide end users with
standardized software solutions that could be integrated into existing
workflows.
1 http://aws.amazon.com
This chapter presents workflow engines and their integration with the cloud
computing paradigm. We start by reviewing existing solutions for workflow
applications and their limitations with respect to scalability and on-demand
access. We then discuss some of the key benefits that cloud services offer workflow
applications, compared to traditional grid environments. Next, we give a brief
introduction to workflow management systems in order to highlight components
that will become an essential part of the discussions in this chapter. We discuss
strategies for utilizing cloud resources in workflow applications next, along with
architectural changes, useful tools, and services. We then present a case study on
the use of cloud services for a scientific workflow application and finally end the
chapter with a discussion on visionary thoughts and the key challenges to realize
them. In order to aid our discussions, we refer to the workflow management
system and cloud middleware developed at CLOUDS Lab, University of
Melbourne. These tools, henceforth referred to as the Cloudbus toolkit [1], are
mature platforms arising from years of research and development.
12.2 BACKGROUND
Over the recent past, a considerable body of work has been done on the use of
workflow systems for scientific applications. Yu and Buyya provide a
comprehensive taxonomy of workflow management systems based on
workflow design, workflow scheduling, fault management, and data
movement. They characterize and classify different approaches for building
and executing workflows on Grids. They also study existing grid workflow
systems highlighting key features and differences.
Some of the popular workflow systems for scientific applications include
DAGMan (Directed Acyclic Graph MANager) [3, 4], Pegasus, Kepler, and the
Taverna workbench. DAGMan is a workflow engine under the Pegasus
workflow management system. Pegasus uses DAGMan to run the executable
workflow. Kepler provides support for Web-service-based workflows. It uses
an actor-oriented design approach for composing and executing scientific
application workflows. The computational components are called actors, and
they are linked together to form a workflow. The Taverna workbench enables
the automation of experimental methods through the integration of various
services, including WSDL-based single operation Web services, into workflows.
For a detailed description of these systems, we refer you to Yu and Buyya.
Scientific workflows are commonly executed on shared infrastructure such as
TeraGrid, Open Science Grid, and dedicated clusters. Existing workflow
systems tend to utilize these global Grid resources that are made available
through prior agreements and typically at no cost. The notion of leveraging
virtualized resources was new, and the idea of using resources as a utility [9, 10]
was limited to academic papers and was not implemented in practice. With the
advent of cloud computing paradigm, economy-based utility computing is
gaining widespread adoption in the industry.
Deelman et al. presented a simulation-based study on the costs involved
when executing scientific application workflows using cloud services. They
studied the cost performance trade-offs of different execution and resource
provisioning plans, and they also studied the storage and communication fees
of Amazon S3 in the context of an astronomy application known as Montage
[5, 10]. They conclude that cloud computing is a cost-effective solution for
data-intensive applications.
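The kind of cost/performance trade-off studied by Deelman et al. can be sketched with a simple model. The rates and workload figures below are placeholders, not actual Amazon prices or Montage measurements; the point is only that compute, storage, and transfer fees pull in different directions as the provisioning plan changes.

# Toy cost model for running a workflow on a cloud (all rates are hypothetical).
def workflow_cost(cpu_hours, storage_gb_months, transfer_gb,
                  cpu_rate=0.10, storage_rate=0.15, transfer_rate=0.12):
    return (cpu_hours * cpu_rate
            + storage_gb_months * storage_rate
            + transfer_gb * transfer_rate)

# Two provisioning plans for the same workflow: keep intermediate data in cloud
# storage (more storage, less transfer) or ship it back after every stage.
keep_in_cloud = workflow_cost(cpu_hours=500, storage_gb_months=200, transfer_gb=50)
ship_data_back = workflow_cost(cpu_hours=500, storage_gb_months=20, transfer_gb=400)
print(f"Keep data in cloud: ${keep_in_cloud:.2f}")
print(f"Ship data back:     ${ship_data_back:.2f}")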
The Cloudbus toolkit [1] is our initiative toward providing viable solutions
for using cloud infrastructures. We propose a wider vision that incorporates an
inter-cloud architecture and a market-oriented utility computing model. The
Cloudbus workflow engine , presented in the sections to follow, is a step
toward scaling workflow applications on clouds using market-oriented
computing.
12.3 WORKFLOW MANAGEMENT SYSTEMS AND CLOUDS
The primary benefit of moving to clouds is application scalability. Unlike grids,
scalability of cloud resources allows real-time provisioning of resources to meet
application requirements at runtime or prior to execution. The elastic nature of
clouds allows resource quantities and characteristics to vary at
runtime, thus dynamically scaling up when there is a greater need for additional
resources and scaling down when the demand is low. This enables workflow
management systems to readily meet quality-of-service (QoS) requirements of
applications, as opposed to the traditional approach that required advance
reservation of resources in global multi-user grid environments. With most
cloud computing services coming from large commercial organizations,
service-level agreements (SLAs) have been an important concern to both
service providers and consumers. Due to competition among emerging service
providers, greater care is being taken in designing SLAs that seek to offer (a)
better QoS guarantees to customers and (b) clear terms for compensation in the
event of violation. This allows workflow management systems to provide better
end-to-end guarantees when meeting the service requirements of users by
mapping them to service providers based on characteristics of SLAs.
Economically motivated, commercial cloud providers strive to provide better
service guarantees compared to grid service providers. Cloud providers also
take advantage of economies of scale, providing compute, storage, and
bandwidth resources at substantially lower costs. Thus utilizing public cloud
services could be an economical, cheaper alternative (or add-on) to the more
expensive dedicated resources. One of the benefits of using virtualized
resources for workflow execution, as opposed to having direct access to the
physical machine, is the reduced need for securing the physical resource from
malicious code using techniques such as sandboxing. However, the long-term
effect of using virtualized resources in clouds that effectively share a "slice" of
the physical machine, as opposed to using dedicated resources for
high-performance applications, is an interesting research question.
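The SLA-based mapping of user requirements to service providers mentioned above can be illustrated with a small selection routine. The provider names and SLA figures below are invented; a real WfMS would obtain them from negotiated SLAs.

# Hypothetical SLA-driven provider selection for a workflow task.
OFFERS = [
    {"provider": "cloud-a", "availability": 0.999,  "price": 0.12, "penalty": 0.10},
    {"provider": "cloud-b", "availability": 0.995,  "price": 0.08, "penalty": 0.00},
    {"provider": "cloud-c", "availability": 0.9999, "price": 0.20, "penalty": 0.25},
]

def select_provider(min_availability, require_compensation=False):
    """Filter offers by SLA guarantees, then pick the cheapest remaining one."""
    candidates = [o for o in OFFERS
                  if o["availability"] >= min_availability
                  and (o["penalty"] > 0 or not require_compensation)]
    return min(candidates, key=lambda o: o["price"]) if candidates else None

print(select_provider(min_availability=0.999))                              # cloud-a
print(select_provider(min_availability=0.9995, require_compensation=True))  # cloud-c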
12.3.1 Architectural Overview
Figure 12.1 presents a high-level architectural view of a Workflow Management
System (WfMS) utilizing cloud resources to drive the execution of a scientific
workflow application. The workflow system comprises the workflow engine, a
resource broker, and plug-ins for communicating with various technological
platforms, such as Aneka [14] and Amazon EC2. A detailed architecture
describing the components of a WfMS is given in Section 12.4.
[Figure 12.1 depicts the workflow engine in the cloud. The Workflow Management System schedules the jobs in a workflow to remote resources based on user-specified QoS requirements and SLA-based negotiation with resources capable of meeting those demands; a storage service such as FTP or Amazon S3 provides temporary storage for application components (executables and data files) and output (result) files. The workflow engine, scheduler, resource broker, and persistence components dispatch jobs either to the Aneka enterprise cloud platform via its Web services (REST) or, via a plug-in, to Amazon EC2 instances that augment a local cluster of workstations with a fixed number of resources.]
FIGURE 12.1. Workflow engine in the cloud.
User applications could use cloud services alone, or use clouds together with
existing grid/cluster-based solutions. Figure 12.1 depicts two scenarios, one
where the Aneka platform is used in its entirety to complete the workflow, and
the other where Amazon EC2 is used to supplement a local cluster when there
are insufficient resources to meet the QoS requirements of the application.
Aneka, described in further detail in Section 12.5, is a PaaS cloud and can be run
on a corporate network or a dedicated cluster or can be hosted entirely on an
IaaS cloud. Given limited resources in local networks, Aneka is capable of
transparently provisioning additional resources by acquiring new resources in
third-party cloud services such as Amazon EC2 to meet application demands.
This relieves the WfMS of the responsibility of managing and allocating
resources directly; instead, it simply negotiates the required resources with Aneka.
Aneka also provides a set of Web services for service negotiation, job
submission, and job monitoring. The WfMS would orchestrate the workflow
execution by scheduling jobs in the right sequence to the Aneka Web Services.
The typical flow of events when executing an application workflow on Aneka
would begin with the WfMS staging in all required data for each job onto a
remote storage resource, such as Amazon S3 or an FTP server. In this case,
the data would take the form of a set of files, including the application
binaries. These data can be uploaded by the user prior to execution, and they
can be stored in storage facilities offered by cloud services for future use. The
WfMS then forwards workflow tasks to Aneka's scheduler via the Web service
interface. These tasks are subsequently examined for required files, and the
storage service is instructed to stage them in from the remote storage server, so
that they are accessible by the internal network of execution nodes. The
execution begins by scheduling tasks to available execution nodes (also known
as worker nodes). The workers download any required files for each task they
execute from the storage server, execute the application, and upload all output
files as a result of the execution back to the storage server. These files are then
staged out to the remote storage server so that they are accessible by other tasks
in the workflow managed by the WfMS. This process continues until the
workflow application is complete.
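The flow of events just described can be summarized in a short sketch. It is purely illustrative: the helpers below are hypothetical stand-ins, not the real Aneka Web service interface or the Amazon S3 API.

# Illustrative outline of the staging and execution flow described above.
class RemoteStorage:                  # stands in for a remote storage service (S3/FTP)
    def __init__(self):
        self.files = set()
    def upload(self, names):
        self.files.update(names)      # stage files in so workers can fetch them

class AnekaService:                   # stands in for Aneka's Web service interface
    def submit_and_wait(self, job, storage):
        storage.upload([f"{job['name']}.out"])   # workers fetch inputs, execute, upload outputs
        return f"{job['name']}.out"

def run_workflow(jobs, storage, aneka):
    for job in jobs:                              # stage in binaries and input data
        storage.upload(job["inputs"])
    results = []
    for job in jobs:                              # jobs assumed to be in dependency order
        results.append(aneka.submit_and_wait(job, storage))
    return results

jobs = [{"name": "jobA", "inputs": ["app.bin", "a.dat"]},
        {"name": "jobB1", "inputs": ["jobA.out"]}]
print(run_workflow(jobs, RemoteStorage(), AnekaService()))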
The second scenario describes a situation in which the WfMS has greater
control over the compute resources and provisioning policies for executing
workflow applications. Based on user-specified QoS requirements, the WfMS
schedules workflow tasks to resources that are located at the local cluster
and in the cloud. Typical parameters that drive the scheduling decisions in
such a scenario include deadline (time) and budget (cost) [15, 16]. For instance,
a policy for scheduling an application workflow at minimum execution
cost would utilize local resources and then augment them with cheaper
cloud resources, if needed, rather than using high-end but more expensive
cloud resources. In contrast, a policy that schedules workflows to achieve
minimum execution time would always use high-end cluster and cloud
resources, irrespective of costs. The resource provisioning policy determines
the extent of additional resources to be provisioned on the public clouds. In this
second scenario, the WfMS interacts directly with the resources provisioned.
When using Aneka, however, all interaction takes place via the Web service
interface.
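A simplified sketch of such a policy decision is shown below; the two policies and the resource attributes (hourly cost, relative speed) are illustrative assumptions, not the actual Cloudbus scheduling algorithms.

def choose_resources(tasks, local_nodes, cloud_nodes, policy="min_cost"):
    # Minimum-cost policy: use free local nodes first, then the cheapest cloud nodes.
    if policy == "min_cost":
        chosen = list(local_nodes)
        extra_needed = max(0, len(tasks) - len(local_nodes))
        chosen += sorted(cloud_nodes, key=lambda n: n["cost_per_hour"])[:extra_needed]
    # Minimum-time policy: ignore cost and pick the fastest nodes, local or cloud.
    else:
        pool = local_nodes + cloud_nodes
        chosen = sorted(pool, key=lambda n: n["speed"], reverse=True)[:len(tasks)]
    return chosen

# Example with made-up resource descriptions.
local = [{"name": "cluster-1", "cost_per_hour": 0.0, "speed": 1.0}]
cloud = [{"name": "ec2-small", "cost_per_hour": 0.1, "speed": 1.0},
         {"name": "ec2-large", "cost_per_hour": 0.4, "speed": 4.0}]
print(choose_resources(["t1", "t2"], local, cloud, policy="min_cost"))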
The following sections focus on the integration of workflow management
systems and clouds and describes in detail practical issues involved in using
clouds for scientific workflow applications.
12.4 ARCHITECTURE OF WORKFLOW MANAGEMENT SYSTEMS
Scientific applications are typically modeled as workflows, consisting of tasks,
data elements, control sequences and data dependencies. Workflow
management systems are responsible for managing and executing these
workflows. According to Raicu et al. [17], scientific workflow management
systems are engaged and applied to the following aspects of scientific
computations: (1) describing complex scientific procedures (using GUI tools,
workflow specific languages), (2) automating data derivation processes (data
transfer components), (3) high-performance computing (HPC) to improve
throughput and performance (distributed resources and their coordination), and
(4) provenance management and query (persistence components). The
Cloudbus Workflow Management System consists of components that are
responsible for handling tasks, data, and resources, taking into account users'
QoS requirements. Its architecture is depicted in Figure 12.2. The architecture
consists of three major parts: (a) the user interface, (b) the core, and (c) plug-ins.
The user interface allows end users to work with workflow composition,
workflow execution planning, submission, and monitoring. These features are
delivered through a Web portal or through a stand-alone application that is
installed at the user's end. Workflow composition is done using an XML-based Workflow Language (xWFL). Users define task properties and link them
based on their data dependencies. Multiple tasks can be constructed using
copy-paste functions present in most GUIs.
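To give a flavor of such a description, the sketch below parses a small XML workflow with Python's standard library; the element and attribute names are invented for illustration and do not reproduce the actual xWFL schema.

import xml.etree.ElementTree as ET

# Invented element/attribute names; only the idea of tasks linked by data dependencies is real.
workflow_xml = """
<workflow name="example">
  <task id="A" executable="prepare.exe">
    <output file="population.dat"/>
  </task>
  <task id="B" executable="merge.exe">
    <input file="population.dat" from="A"/>
  </task>
</workflow>
"""

root = ET.fromstring(workflow_xml)
for task in root.findall("task"):
    deps = [i.get("from") for i in task.findall("input")]
    print(task.get("id"), "depends on", deps if deps else "nothing")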
The components within the core are responsible for managing the execution
of workflows. They facilitate the translation of high-level workflow
descriptions (defined at the user interface using XML) to task and data objects.
These objects are then used by the execution subsystem. The scheduling
component applies user-selected scheduling policies and plans to the
workflows at various stages in their execution. The tasks and data dispatchers
interact with the resource interface plug-ins to continuously submit and monitor
tasks in the workflow. These components form the core part of the workflow
engine.
The plug-ins support workflow executions on different environments and
platforms. Our system has plug-ins for querying task and data characteristics
FIGURE 12.2. Architecture of the Workflow Management System. (The figure shows the user interface, with a Web portal, workflow/application composition and planning, and a workflow submission handler; the workflow engine core, with a workflow language parser (xWFL, BPEL), workflow scheduler and coordinator, task managers and dispatchers, data provenance manager, event service, and monitoring interface; and plug-in components for resource discovery (catalogs, MDS), data movement (FTP, GridFTP, HTTP), storage and replication (replica catalog, storage broker, replication service), resource middleware (Gridbus Broker, Globus, Web services, Aneka, cloud, InterCloud, Grid Market Directory, market maker, scalable application manager), and measurements such as energy consumption and resource utilization.)
(e.g., querying metadata services, reading from trace files), transferring data to
and from resources (e.g., transfer protocol implementations, and storage and
replication services), monitoring the execution status of tasks and applications
(e.g., real-time monitoring GUIs, logs of execution, and the scheduled retrieval
of task status), and measuring energy consumption.
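Conceptually, each plug-in adapts one external service or middleware to a small interface that the core calls; the sketch below is a hypothetical rendering of such an interface, not the actual Cloudbus plug-in API.

from abc import ABC, abstractmethod

class ResourcePlugin(ABC):
    # Hypothetical interface: each concrete plug-in adapts one middleware or
    # service (e.g., Aneka, Globus, an FTP server) to the workflow engine core.

    @abstractmethod
    def query_metadata(self, task_id):
        """Return task and data characteristics, e.g., from a metadata service or trace file."""

    @abstractmethod
    def transfer(self, source, destination):
        """Move a file between the storage/replication services and a resource."""

    @abstractmethod
    def status(self, handle):
        """Return the execution status of a previously dispatched task."""

    @abstractmethod
    def measure_energy(self, node):
        """Return an energy-consumption measurement for a node, if supported."""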
The resources are at the bottom layer of the architecture and include
clusters, global grids, and clouds. The WfMS has plug-in components for
interacting with various resource management systems present at the front end
of distributed resources. Currently, the Cloudbus WfMS supports Aneka, PBS,
Globus, and fork-based middleware. The resource managers may communicate
with the market maker, scalable application manager, and InterCloud services
for global resource management [18].
12.5 UTILIZING CLOUDS FOR WORKFLOW EXECUTION
Taking the leap to utilizing cloud services for scientific workflow applications
requires an understanding of the types of cloud services available, the required
component changes in workflow systems for interacting with cloud services, the
set of tools available to support development and deployment efforts, the steps
involved in deploying workflow systems and services on the cloud, and an
appreciation of the key benefits and challenges involved. In the sections to
follow, we take a closer look at some of these issues. We begin by introducing
the reader to the Aneka Enterprise Cloud service. We do this for two reasons.
First, Aneka serves as a useful tool for utilizing clouds, providing platform
abstraction and dynamic provisioning. Second, we describe later in the chapter
a case study detailing the use of Aneka to execute a scientific workflow
application on clouds.
Aneka
Aneka is a distributed middleware for deploying platform-as-a-service (PaaS)
offerings (Figure 12.3). Developed at CLOUDS Lab, University of Melbourne,
Aneka is the result of years of research on cluster, grid, and cloud computing
for high-performance computing (HPC) applications. Aneka, which is both a
development and runtime environment, is available for public use (for a cost),4
can be installed on corporate networks, or dedicated clusters, or can be hosted
on infrastructure clouds like Amazon EC2. In comparison, similar PaaS
services such as Google AppEngine [19] and Windows Azure [20] are in-house
platforms hosted on infrastructures owned by the respective companies. Aneka
was developed on Microsoft's .NET Framework 2.0 and is compatible with
other implementations of the ECMA 335 standard [21], such as Mono. Aneka
FIGURE 12.3. A deployment of Aneka Enterprise Cloud. (The figure shows client applications submitting work units over the Internet to a scheduler, which dispatches them to executor and storage containers running on the worker infrastructure; each node hosts an Aneka container configured as a scheduler, executor, or storage service.)
can run on popular platforms such as Microsoft Windows, Linux, and Mac OS
X, harnessing the collective computing power of a heterogeneous network.
The runtime environment consists of a collection of Aneka containers
running on physical or virtualized nodes. Each of these containers can be
configured to play a specific role such as scheduling or execution. The Aneka
distribution also provides a set of tools for administering the cloud,
reconfiguring nodes, managing users, and monitoring the execution of
applications. The Aneka service stack provides services for infrastructure
management, application execution management, accounting, licensing, and
security. For more information we refer you to Vecchiola et al. [14].
Aneka‘s Dynamic Resource Provisioning service enables horizontal scaling
depending on the overall load in the cloud. The platform is thus elastic in
nature and can provision additional resources on-demand from external
physical or virtualized resource pools, in order to meet the QoS requirements
of applications. In a typical scenario, Aneka would acquire new virtualized
resources from external clouds such as Amazon EC2 in order to minimize the
waiting time of applications submitted to Aneka. Such a scenario would arise
when the current load in the cloud is high and there are not enough available
resources to process all jobs in a timely manner.
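The provisioning decision can be viewed as a simple feedback rule on queue length. The sketch below illustrates the idea; the threshold, AMI ID, and instance type are placeholder assumptions, and the real Aneka provisioning service is not shown, although the boto3 EC2 call itself is a real API.

import boto3

def provision_if_needed(queued_jobs, workers, jobs_per_worker=4,
                        image_id="ami-00000000", instance_type="m1.large"):
    # If the queue exceeds what the current workers can absorb, rent extra EC2 instances.
    capacity = len(workers) * jobs_per_worker
    shortfall = len(queued_jobs) - capacity
    if shortfall <= 0:
        return []
    needed = -(-shortfall // jobs_per_worker)  # ceiling division
    ec2 = boto3.client("ec2")
    reply = ec2.run_instances(ImageId=image_id, InstanceType=instance_type,
                              MinCount=needed, MaxCount=needed)
    return [i["InstanceId"] for i in reply["Instances"]]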
The development environment provides a rich set of APIs for developing
applications that can utilize free resources of the underlying infrastructure.
These APIs expose different programming abstractions, such as the task model,
thread model, and MapReduce [22]. The task programming model is of particular
importance to the current discussion. It models "independent bag of tasks"
(BoT) applications, which are composed of a collection of work units that are
independent of each other and may be executed in any order. One of the benefits of
the task programming model is its simplicity, making it easy to run legacy
applications on the cloud. An application using the task model composes one or
more task instances and forwards them as work units to the scheduler. The
scheduling service currently supports the First-In-First-Out, First-In-First-Out
with Backfilling, Clock-Rate Priority, and Preemption-Based Priority Queue
scheduling algorithms. The runtime environment also provides two specialized
services to support this model: the task scheduling service and the task execution
service.
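The following toy sketch captures the essence of the task model: independent work units drawn from a FIFO queue by a pool of workers. It is meant only to illustrate the model, not Aneka's scheduling and execution services.

from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty

def run_bag_of_tasks(work_units, num_workers=4):
    queue = Queue()
    for unit in work_units:           # FIFO: submission order is preserved
        queue.put(unit)

    def worker():
        results = []
        while True:
            try:
                unit = queue.get_nowait()
            except Empty:
                break
            results.append(unit())    # each work unit is just a callable here
        return results

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = [pool.submit(worker) for _ in range(num_workers)]
        return [r for part in parts for r in part.result()]

# Example: ten independent work units executed in any order.
print(run_bag_of_tasks([lambda i=i: i * i for i in range(10)]))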
The storage service provides a temporary repository for application files—
that is, input files that are required for task execution, and output files that are
the result of the execution. Prior to dispatching work units, any files required
are staged-in to the storage service from the remote location. This remote
location can be either the client machine, a remote FTP server, or a cloud
storage service such as Amazon S3. The work units are then dispatched to
executors, which download the files before execution. Any output files
produced as a result of the execution are uploaded back to the storage service.
From here they are staged-out to the remote storage location.
Aneka Web Services
Aneka exposes three SOAP Web services for service negotiation, reservation,
and task submission, as depicted in Figure 12.4. The negotiation and
reservation services work in concert, and they provide interfaces for
negotiating
FIGURE 12.4. Aneka Web services interface. (The figure shows SOAP requests and responses exchanged with the three Aneka Web services: the task submission service, the negotiation service, and the reservation service, all exposed by the Aneka platform as a service.)
resource use and reserving them in Aneka for predetermined timeslots. As such,
these services are only useful when Aneka has limited resources to work with
and no opportunities for provisioning additional resources. The task Web
service provides a SOAP interface for executing jobs on Aneka. Based on the
task programming model, this service allows remote clients to submit jobs,
monitor their status, and abort jobs.
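From a client's perspective, these are ordinary SOAP endpoints. The sketch below uses the Python zeep library to illustrate a submission; the WSDL location and operation names are hypothetical, since the actual Aneka service contract is not reproduced here.

from zeep import Client

# Hypothetical WSDL URL and operation names; consult the Aneka documentation
# for the actual service contract.
client = Client("http://aneka-master.example.org/TaskService?wsdl")

job_id = client.service.SubmitJob(executable="emo.exe", arguments="topology1.bin")
status = client.service.QueryJobStatus(job_id)
if status == "Failed":
    client.service.AbortJob(job_id)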
General Approach
Traditional WfMSs were designed with a centralized architecture and were thus
tied to a single machine. Moving workflow engines to clouds requires (a)
architectural changes and (b) integration of cloud management tools.
Architectural Changes. Most components of a WfMS can be separated from
the core engine so that they can be executed on different cloud services. Each
separated component could communicate with a centralized or replicated
workflow engine using events. The manager is responsible for coordinating
the distribution of load to its subcomponents, such as the Web server,
persistence, monitoring units, and so forth.
In our WfMS, we have separated the components that form the architecture
into the following: user interface, core, and plug-ins. The user interface can now
be coupled with a Web server running on a "large" cloud instance that can
handle an increasing number of users. Web requests from users accessing the
WfMS via a portal are thus offloaded to a different set of resources.
Similarly, the core and plug-in components can be hosted on different types
of instances separately. Depending on the size of the workload from users, these
components could be migrated or replicated to other resources, or reinforced
with additional resources to satisfy the increased load. Thus, employing
distributed modules of the WfMS on the basis of application requirements
helps scale the architecture.
Integration of Cloud Management Tools. As the WfMS is broken down
into components to be hosted across multiple cloud resources, we need a
mechanism to (a) access, transfer, and store data and (b) enable and monitor
executions that can utilize this approach of scalable distribution of components.
The cloud service provider may provide APIs and tools for discovering the
VM instances that are associated with a user's account. Because various types of
instances can be dynamically created, their characteristics, such as CPU
capacity and amount of available memory, are part of the cloud service
provider's specifications. Similarly, for data storage and access, a cloud may
provide data sharing, data movement, and access rights management
capabilities to users' applications. Cloud measurement tools may be in place to
account for the amount of data and computing power used, so that users are
charged on a pay-per-use basis. A WfMS now needs to access these tools
to discover and characterize the resources available in the cloud. It also needs to
interpret the access rights (e.g., access control lists provided by Amazon),
use the data movement APIs and data-sharing mechanisms between VMs to fully
utilize the benefits of moving to clouds. In other words, traditional catalog
services such as the Globus Monitoring and Discovery Service (MDS) [23],
Replica Location Services, Storage Resource Brokers, Network Weather
Service [24], and so on could be easily replaced by more user-friendly and
scalable tools and APIs associated with a cloud service provider. We describe
some of these tools in the following section.
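For example, with Amazon EC2 the discovery step can be as simple as the boto3 snippet below, which lists the running instances associated with the account together with their instance types; the filter shown is just one possibility.

import boto3

# Requires AWS credentials to be configured for boto3.
ec2 = boto3.client("ec2")
reply = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}])

for reservation in reply["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["InstanceType"],
              instance.get("PublicDnsName", ""))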
Tools for Utilizing Clouds in WfMS
The range of tools and services offered by cloud providers plays an important
role in integrating WfMSs with clouds (Figure 12.5). Such services can facilitate
the deployment, scaling, execution, and monitoring of workflow systems.
This section discusses some of the tools and services offered by various service
providers that can complement and support WfMSs.
A WfMS manages dynamic provisioning of compute and storage resources
in the cloud with the help of tools and APIs provided by service providers. The
provisioning is required to dynamically scale up/down according to application
requirements. For instance, data-intensive workflow applications may require
FIGURE 12.5. A workflow utilizing multiple cloud services. (The figure shows the workflow engine and Web application handling Web requests, backed by the Aneka Enterprise Cloud and other Aneka client applications; storage services such as Amazon S3, the Azure Storage Service, and BigTable; IaaS providers such as Amazon EC2 and GoGrid for compute resources; SaaS offerings such as Salesforce.com; and usage data, usage accounts, and resource monitoring feeding resource monitoring and management.)
large amounts of disk space for storage. A WfMS could provision dynamic
volumes of large capacity that could be shared across all instances of VMs
(similar to snapshots and volumes provided by Amazon). Similarly, for
compute-intensive tasks in a workflow, a WfMS could provision specific
instances that would help accelerate the execution of these compute-intensive
tasks.
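On Amazon EC2, such dynamic volume provisioning might look like the following boto3 sketch; the volume size, availability zone, device name, and instance ID are placeholder values.

import boto3

ec2 = boto3.client("ec2")

# Create a large EBS volume for a data-intensive task (500 GiB is a placeholder).
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=500)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Attach it to a running worker instance (placeholder instance ID and device).
ec2.attach_volume(VolumeId=volume["VolumeId"],
                  InstanceId="i-0123456789abcdef0",
                  Device="/dev/sdf")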
A WfMS implements scheduling policies to assign tasks to resources based
on applications' objectives. This task-resource mapping is dependent on several
factors: compute resource capacity, application requirements, the user's QoS, and
so forth. Based on these objectives, a WfMS could also direct a VM
provisioning system to consolidate data center loads by migrating VMs so
that it could make scheduling decisions based on locality of data and compute
resources.
A persistence mechanism is often important in workflow management
systems and for managing metadata such as available resources, job queues,
job status, and user data including large input and output files. Technologies
such as Amazon S3, Google's BigTable, and the Windows Azure Storage
Services can support most storage requirements for workflow systems, while
also being scalable, reliable, and secure. If large quantities of user data are
being dealt with, such as a large number of brain images used in functional
magnetic resonance imaging (fMRI) studies, transferring them online can be
both expensive and time-consuming. In such cases, traditional post (shipping
physical storage media) can prove to be cheaper and faster. Amazon's AWS
Import/Export5 is one such service
that aims to speed up data movement by transferring large amounts of data in
portable storage devices. The data are shipped to/from Amazon and offloaded
into/from S3 buckets using Amazon's high-speed internal network. The cost
savings can be significant when transferring data on the order of terabytes.
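As a rough, illustrative calculation (the link speed is an assumption): moving 1 TB over a sustained 10 Mbps connection takes about 8 x 10^12 bits / 10^7 bits per second, or roughly 8 x 10^5 seconds, which is more than nine days; a shipped disk that is offloaded over Amazon's internal network within a few days can therefore be both faster and cheaper.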
Most cloud providers also offer services and APIs for tracking resource
usage and the costs incurred. This can complement workflow systems that
support budget-based scheduling by utilizing real-time data on the resources
used, the duration, and the expenditure. This information can be used both for
making scheduling decisions on subsequent jobs and for billing the user at the
completion of the workflow application.6
Cloud services such as Google App Engine and Windows Azure provide
platforms for building scalable interactive Web applications. This makes it
relatively easy to port the graphical components of a workflow management
system to such platforms while benefiting from their inherent scalability and
reduced administration. For instance, such components deployed on Google
App Engine can utilize the same scalable systems that drive Google
applications, including technologies such as BigTable [25] and GFS [26].
5. http://aws.amazon.com/importexport/
6. http://aws.amazon.com/devpay/
12.6 CASE STUDY: EVOLUTIONARY MULTIOBJECTIVE OPTIMIZATIONS
This section presents a scientific application workflow based on an iterative
technique for optimizing multiple search objectives, known as evolutionary
multiobjective optimization (EMO) [27]. EMO is a technique based on genetic
algorithms. Genetic algorithms are search algorithms used for finding optimal
solutions in a large space where deterministic or functional approaches are not
viable. Genetic algorithms use heuristics to find an optimal solution that is
acceptable within a reasonable amount of time. In the presence of many
variables and complex heuristic functions, the time consumed in finding even an
acceptable solution can be too large. However, when multiple instances are run
in parallel in a distributed setting using different variables, the required time for
computation can be drastically reduced.
Objectives
The following are the objectives for modeling and executing an EMO workflow
on clouds:
● Design an execution model for EMO, expressed in the form of a workflow,
such that multiple distributed resources can be utilized.
● Parallelize the execution of EMO tasks for reducing the total completion
time.
● Dynamically provision compute resources needed for timely completion
of the application when the number of tasks increases.
● Repeatedly carry out similar experiments as and when required.
● Manage application execution, handle faults, and store the final results for
analysis.
Workflow Solution
In order to parallelize the execution of EMO, we construct a workflow model
for systematically executing the tasks. A typical workflow structure is depicted
in Figure 12.6.
In our case study, the EMO application consists of five different topologies,
upon which the iteration is done. These topologies are defined in five different
binary files. Each file becomes the input file for the top-level tasks (A0emo1,
A0emo, . . . ). We create a separate branch for each topology file. In Figure 12.6,
there are two branches, which are merged at level 6. The tasks at the root level
operate on the topologies to create a new population, which is then merged
by the task named "emomerge." In Figure 12.6, we see two "emomerge" tasks
at the 2nd level, one task at the 6th level that merges two branches and then
splits the population into two branches again, two tasks on the 8th and 10th
FIGURE 12.6. EMO workflow structure (boxes represent tasks, arrows represent data dependencies between tasks). (The figure shows two branches, Topology 1 and Topology 2, each with levels of A0e.../A1e... and B0e.../B1e... tasks and Aem.../Bem... merge tasks, spanning Iteration 1 and Iteration 2.)
levels, and the final task on the 12th level. In the example figure, each topology
is iterated two times in a branch before getting merged. The merged population
is then split. This split is done two times in the figure. The tasks labeled B0e and
B1e (depicted in a darker shade in Figure 12.6) mark the start of the second iteration.
Deployment and Results
EMO Application. We use ZDT2 [27] as a test function for the objective
f