Universitatea POLITEHNICA București
Facultatea de Automatică și Calculatoare
Departamentul de Calculatoare
TEZĂ DE DOCTORAT
Managementul Resurselor pentru Optimizarea
Costului în Sistemele de Stocare Cloud
PHD THESIS
Resource Management for Cost Optimization in
Cloud Storage Systems
Scientific Adviser (Conducător Științific)
Prof. Dr. Ing. Valentin CRISTEA
-2016-
Author (Autor)
Ing. Cătălin NEGRU
Contents
Acknowledgements
Summary
1 Introduction
  1.1 Context and Motivation
  1.2 Thesis Objectives
  1.3 Thesis Outline
2 Cost Issues of Cloud Storage Systems
  2.1 General Characteristics of Cloud Storage Systems
  2.2 Cost Models in Cloud Computing
    2.2.1 Cost Factors in Cloud Computing
    2.2.2 Cost Models in Cloud Computing
  2.3 Requirements for Cloud Storage Resource Management Systems
    2.3.1 Cloud Storage Resource Management Systems Functions and Requirements
    2.3.2 Energy Efficient Cloud Storage Service: Key Issues and Challenges
    2.3.3 Impact of Virtual Machine Heterogeneity on Datacenter Power Consumption
    2.3.4 Cost Reduction Strategies
  2.4 Conclusions and Open Issues
3 Resource Allocation Methods for Cost-Aware Data Management
  3.1 Cost Analysis of Distributed File Systems
  3.2 Efficient Re-scheduling Based on Cost-Aware Data Storage
    3.2.1 Re-Scheduling Service
    3.2.2 Resource Management Tool
    3.2.3 Experimental Results and Evaluation
  3.3 Task Migration for Cost-Effective Data Processing
    3.3.1 Proposed Hybrid Model
    3.3.2 Implementation Details and Experimental Results
    3.3.3 Real Environment Integration
  3.4 Conclusions and Open Issues
4 Budget Reduction in Cloud Storage Systems
  4.1 Cost of Storage Services in Commercial Clouds
    4.1.1 Pricing Schemes in Public Cloud Storage Services
  4.2 Budget-Aware Storage Service Selection
    4.2.1 Storage Service Selection
    4.2.2 Cost Optimization Model using Linear Binary Programming
    4.2.3 Experimental Results
  4.3 Budget Constrained Storage Service Selection
    4.3.1 Optimal Storage Service Selection Problem
    4.3.2 Problem Modeling
    4.3.3 Linear Programming Algorithms
    4.3.4 Optimization Results
  4.4 Cost-aware cloud storage service allocation for distributed data gathering
    4.4.1 Continuous Linear Programming Solver
    4.4.2 Data Gathering Model
    4.4.3 Results
  4.5 Conclusions and Open Issues
5 Storage Systems for Cyber-Physical Infrastructures: Natural Resources Management
  5.1 Analysis of Data Requirements for CyberWater: from Heterogeneity to Uniform and Unified Approach
    5.1.1 Big Data, Heterogeneous Data
    5.1.2 Unified Approach of Big Data Modeling
    5.1.3 CyberWater Case Study
  5.2 Proposed Cloud Storage Solution for CyberWater
    5.2.1 CyberWater Storage Architecture
    5.2.2 Cost Efficient Cloud-based Service Oriented Architecture for Water Pollution Prediction
  5.3 Conclusions and Open Issues
6 Conclusions
  6.1 Contributions of PhD Thesis
  6.2 Future Work
    6.2.1 Resource-Aware Reduction Techniques for Big Data
Bibliography
Acknowledgements
This thesis is the result of my collaboration with so many wonderful people that I cannot name them all here, but I am grateful to all of them.
First and foremost, I would like to thank Professor Valentin Cristea for all the help and guidance he has offered during my PhD period, and I wish to have the same collaboration in the future. I would also like to express my special appreciation and thanks to my co-advisor, Associate Professor Florin Pop, who has been a mentor to me, encouraged my research, and allowed me to grow as a research scientist.
Secondly, I would also like to thank Professor Mariana Mocanu and Ciprian Dobre, who gave me the opportunity to work in research projects and offered support and precious advice throughout the PhD thesis. I have enjoyed working with them, and I hope to be able to continue to do so in the future.
Furthermore, I had a successful and enjoyable collaboration inside the CyberWater project with Professor Radu Drobot and Aurelian Drăghia from the Faculty of Hydrotechnics, Technical University of Civil Engineering of Bucharest, and with Professor Lucia Vacariu and Anca Hangan from the Automation and Computer Science Faculty, Technical University of Cluj-Napoca. I would like to thank them and I hope for future collaborations.
Special thanks go to all colleagues and students from the Distributed Systems Laboratory with whom I have worked and published research papers. I mention here Radu-Ioan Ciobanu, Cristian Chilipirea, Elena Apostol, Catalin Leordeanu, Sorin Ciolofan, Laura Vasiliu, Maria-Alexandra Lovin and Ovidiu Marcu. I also thank Prof. Nik Bessis, Dr. Stelios Sotiriadis and Prof. Jing Li for a nice and successful collaboration.
Last but not least, I would like to express all my love and gratitude to my family, who have supported me in all the important moments of my life, especially my father Mircea (his memory will be with me always), my mom Mihaiela, my wife Iuliana-Valentina and my brother Cristian.
Summary
Nowadays, the Cloud computing model provides everything as a service, in a functional, usable and extremely powerful manner, allowing software and hardware resources to be used in a pay-per-use fashion. Users can employ Cloud services for many purposes and in different ways, such as storing data for processing, backing up data, synchronizing all devices to use data stored in the Cloud, or archiving data. With the adoption of Cloud storage, the cost optimization problem arises. First, the Cloud storage provider must optimize the resource management cost, calculate the total cost of ownership and price the services adequately in order to make a profit and amortize the investment. Second, a Cloud user wants to optimize the available budget, calculate the total cost of storing data in the Cloud and minimize it as much as possible. For Cloud providers, the most relevant aspects of resource management systems for cloud storage that help to reduce costs are cost-efficient scheduling algorithms for data processing tasks. Another important method to optimize the costs in cloud storage systems is to reduce the volume of data through intelligent reduction techniques. At the data center level, costs are concentrated in servers, power infrastructure and networking. Low utilization and failures of these resources lead to very low efficiency and business loss. Techniques such as re-scheduling and task migration can achieve cost reduction in data processing tasks through better utilization of resources, reduced fault propagation and improved QoS. For Cloud users, bi-linear programming techniques can be used successfully to reduce costs and to obtain an optimal selection of service providers in multi-Cloud environments.
1 | Introduction
1.1 Context and Motivation
Data used in today's applications are very diverse and are generated by different sources, and data volumes are continuously increasing. For instance, 90% of the world's data has been created in the last two years. Large datasets are produced at an exponential rate, as shown in Figure 1.1, by multiple and diverse sources, such as sensor networks for environmental data monitoring, water resources management, scientific experiments, high-throughput instruments, and the WWW (e.g. social media sites, digital pictures and videos). Every day, 2,220 Petabytes of data are created; essentially, we face a data deluge, as data size has surpassed the capabilities of computation [48]. This means that traditional ways of processing data are no longer efficient.
Figure 1.1. Data deluge: the increase of data size has surpassed the capabilities of computation
All these large and complex datasets need to be stored, processed and analyzed by data-intensive computing applications, and therefore demand high performance in terms of response time, scalability, fault tolerance, reliability, and elasticity from the infrastructures that support them. This cycle of operations (storage, processing, and analysis) on large datasets is categorized as Big Data in the scientific literature.
In general, Big Data refers to the processing and analysis, in a reasonable time, of Petabytes and Exabytes of data [183]. Furthermore, Big Data refers not only to the datasets but also to the tools that are needed to store and process them in the background or in real time, and to platforms that ensure (in addition to storage, processing and analysis) privacy, security and recovery of the information in case of failures. Big Data currently represents a research frontier, having an impact in many areas, such as business, scientific research, public administration, and so on. For instance, in the case of fraud detection applications, Big Data platforms can analyze claims and transactions in real time, identifying large-scale patterns across many transactions or detecting anomalous behavior from an individual user. For IT log analytics, with a Big Data solution in place, logs and trace data can be put to good use in order to identify large-scale patterns that help in diagnosing and preventing problems. In the case of call center analytics applications, Big Data can help to identify recurring problems or customer and staff behavior patterns on the fly, not only by making sense
of time/quality resolution metrics, but also by capturing and processing call content itself. With the help of Big Data, social media analysis applications can harvest and analyze social media activity and provide real-time insights into how the market is responding to products and campaigns. Decision support applications for large-scale cyber-infrastructure systems, such as water resources management, which include a large base of geographically distributed, heterogeneous data sources, can use Big Data to analyze data collected from different sources in order to offer real-time support in the decision-making process. In this thesis we refer to and study the last type of applications.
Challenges arise at every step, from manipulating vast volumes of complex information (structured, raw, semi-structured or unstructured data) at high speed, to extracting the relevant information, analyzing it or storing it. Moreover, at the business level, companies need to store and access their data, and make part of it available to their customers. Cloud computing services represent a possible solution for these challenges.
The storage system plays a key role in preserving data for further use in knowledge extraction. Cloud computing services represent a good candidate for the storage, processing and analysis of these large datasets. Cloud storage services are attractive since they deliver a virtualized storage infrastructure on demand, over an overlay network. Moreover, a service level agreement (SLA) guarantees a minimum level of performance for the offered service. Users can buy storage capacity in a pay-per-use pricing scheme. This means that customers pay only for the services used (by GB*month or other criteria) in a certain amount of time. In a global vision of smart cities, public Cloud storage services become a fundamental part of the platform architecture for such systems, as data must be available at any time, in any location, from any device.
The most evident benefit of switching to Big Data is the cost reduction obtained through the usage of technologies such as Hadoop, Spark, and Storm in a Cloud-based infrastructure. For instance, the company UPS stores over 16 Petabytes of data, tracking 16 million packages per day for 9 million customers [36]. Thanks to an innovative optimization of road navigation, it saved 8.4 million gallons of fuel in 2011. A bank reduced its costs by an order of magnitude by buying a Hadoop cluster with 50 servers and 800 processors instead of a traditional data warehouse. The second motivation for companies to use Big Data is time reduction. Macy's was able to reduce the processing time for recalculating the prices of 73 million items on sale from 27 hours to 1 hour [37]. In addition, companies can run far more analytics models than before (on the order of 100,000, compared with under 100). Time reduction also allows real-time reaction to customer habits. The third motivation is to create new Big Data specific offerings. LinkedIn launched a set of new services and features such as "Groups you may like", "Jobs you may be interested in", and "Who's viewed my profile". Google developed Google Plus and Google Apps. Verizon, Sprint and T-Mobile deliver services based on location data provided by mobile phones. The fourth advantage offered by Big Data to the business model is the support in making decisions, considering that a lot of the data coming from customer interactions is unstructured or semi-structured (web site clicks, voice recordings from call centers, notes, video, emails, web logs, etc.). Natural language processing tools can translate voice into text in order to analyze calls.
“Analytics 3.0” represents a new approach to data analytics. It describes the steps that well-established big companies had to take in order to integrate Big Data infrastructure into their existing IT infrastructure (for example, Hadoop clusters that have to coexist with IBM mainframes). The "diversity" aspect of Big Data, meaning the ability to analyze new types of data, is the main concern for companies, rather than the large data volumes.
There are many challenges in Big Data handling, from data capture to data visualization. Regarding the process of data capture, the sources of data are heterogeneous, geographically distributed, and unreliable, being susceptible to errors. Current real-world storage solutions, such as databases, are consequently populated with inconsistent, incomplete and noisy data. Therefore, several data preprocessing techniques, such as data reduction, data cleaning, data integration, and data transformation, must be applied to remove noise, correct inconsistencies and help in the decision-making process [66].
The authors of [126] identify three categories of promises related to Big Data analytics:
• Cost reduction – technologies like Hadoop and Cloud-based analytics can provide significant cost advantages;
• Faster, better decision-making – this is obtained with the aid of frameworks such as Apache Storm and Apache Spark, designed to run programs up to 100x faster than Hadoop. The possibility to analyze new sources of data can also help in the decision-making process. For example, health-care companies try to use natural language processing tools to better understand customer satisfaction;
• New products and services – this is possible due to new sources of data, such as mobile phones.
The cost of the provided services plays a key role in Cloud computing and becomes a fundamental building block. Cloud providers offer a very wide portfolio of services, while Cloud clients access them against some financial arrangement. There is a fundamental tradeoff between the cost of what a Cloud provider can offer in terms of services (i.e., infrastructure, platform, software, storage, network) and what Cloud clients are willing to pay. Therefore, the cost-efficiency of management operations is mandatory, and Cloud providers need cost-efficient resource management systems and methods for Big Data storage and processing.
Energy awareness represents a big challenge for Cloud computing infrastructures and, together with the increasing costs of energy, calls for energy-aware methods and techniques. The storage system represents an important factor in the energy consumption of a datacenter. According to [180], global datacenter energy consumption grew by 19% in 2012. Moreover, the cost of powering and cooling accounts for 53% of the total operational expenditure of datacenters. Facebook data centers used 532 million kilowatt-hours of energy and emitted 285,000 metric tons of CO2 equivalents in 2011 [100]. In comparison, Google, which released similar data, consumed 2 billion kilowatt-hours of energy in 2010.
1.2 Thesis Objectives
The main objective of my PhD thesis is to optimize the cost in Cloud storage systems. This objective is focused on resource management methods and techniques aimed at optimizing the storage cost, the processing cost and the analysis cost of diverse and heterogeneous data.
Another important objective of this thesis is to analyze how different cost models and pricing strategies influence the resource management process in Cloud computing infrastructures in general and Cloud storage in particular. We also aim to find the key factors that help to quantify the impact of cost on Cloud computing services.
The main role of resource management systems is to coordinate the resources in Cloud computing infrastructures as a response to management actions implemented by Cloud providers and Cloud users. Another important objective of this thesis is to identify the most relevant aspects of resource management systems for Cloud storage that help to reduce costs.
In Cloud computing environments, the processing time of a task represents a major part of the execution cost, along with the data transfer cost and the data storage cost. In the pay-per-use model, users normally would like to pay the lowest possible price for a certain service. Reducing the failure rate in a Cloud environment might reduce the cost of resource usage and increase user trust. Thus, an important objective of this thesis is to provide an efficient re-scheduling service for cost-aware Cloud storage.
Cloud computing services are used to perform data-intensive tasks on large datasets. This type of job falls under the category of many-task computing, a paradigm that brings together high-throughput computing and high-performance computing. In this context, another important objective of this thesis is to provide a cost-efficient hybrid scheduling algorithm for Many-Task Computing that addresses Big Data processing, taking deadlines into consideration and satisfying a data-dependent task model.
An important method to optimize the costs in Cloud storage systems is to reduce the volume of data through intelligent reduction techniques. Thus, another important objective of this thesis is to analyze the impact of Big Data reduction on the storage cost.
There are plenty of Cloud storage providers that offer extendable Cloud storage services and solutions. There are two sides to the cost optimization problem: first, the Cloud storage provider must calculate the total cost of ownership and price the services adequately in order to make a profit and amortize the investment; second, a Cloud user must calculate the total cost of storing data in the Cloud and minimize it as much as possible. In a multi-Cloud storage model, which assumes an economical distribution of data among the Clouds available on the market, the goal is to provide customers with data availability as well as secure storage. Based on these assumptions, another objective of this thesis is to propose a cost-efficient method for budget-aware storage service selection. We also propose a method for the selection of storage services in the presence of budget constraints.
The last objective of this PhD thesis is to apply the proposed resource management methods for cost reduction within a real-case scenario. We show how these methods can be applied in the CyberWater project, a research project which aims to create a prototype platform using advanced computational and communication technology for the implementation of new frameworks for managing water and land resources in a sustainable and integrative manner.
1.3 Thesis Outline
In Chapter 1, we start with a short introduction of the Cloud computing paradigm and its relevance in the domain of distributed systems. Then we present the context and motivation of our work, describing the key factors that call for cost optimization in Cloud storage systems. We continue with the presentation of the principal objectives of this PhD thesis. The chapter ends with the outline of the proposed work.
In Chapter 2, we analyze the problem of cost optimization in Cloud storage systems. In the first part we present the general characteristics of Cloud storage systems. In the second part we analyze the relevant cost models that apply to Cloud computing in general and Cloud storage systems in particular, and the most important cost reduction strategies. In the last part of the chapter we identify the main requirements for resource management in Cloud storage systems in order to reduce costs. We also present an analysis of power consumption in heterogeneous virtual machine environments in the case of data-intensive applications.
In Chapter 3 we propose specific resource management methods and techniques for cost optimization in Cloud storage systems. First, we propose a cost-efficient re-scheduling heuristic based on cost-aware data storage. The second method proposed for cost optimization refers to a task migration heuristic for performing cost-effective data processing. The chapter ends with conclusions and future work.
In Chapter 4 we optimize the cost of cloud storage services from the cloud user's perspective. In consequence, we propose two methods for budget reduction in cloud storage systems. In the first part, we propose a bilinear programming method for budget-aware selection of storage services from many cloud storage providers. In the second part, we propose a method for storage service selection in the presence of budget constraints. We end this chapter with conclusions and open issues.
In Chapter 5, we present and analyze the case study of the CyberWater project. First, we present the problem of Big Data modeling, integration and reduction (e.g. different resource-aware techniques for data reduction) in the context of cyber-infrastructure systems. In the second part of the chapter, we present the proposed Cloud storage solution for the CyberWater project. Finally, we present the conclusions and possible further work. The motivation of this chapter comes from the new age of applications (social media, smart cities, cyber-infrastructures, environment monitoring and control, healthcare, etc.), which produce big data and many new mechanisms for data creation rather than new mechanisms for data storage. We cover subjects such as data manipulation, analytics and Big Data reduction techniques, considering descriptive analytics, predictive analytics, and prescriptive analytics.
In Chapter 6 we present our main contributions, the possible future work, and further research
subjects.
2 | Cost Issues of Cloud Storage Systems
The Cloud computing model provides everything as a service, in a functional, usable and extremely powerful manner, allowing software and hardware resources to be used in a pay-per-use manner. In this way, the user has great flexibility to adapt to changes in different aspects of the business level, such as an increase in demand. In this chapter we aim to present the cost issues related to Cloud storage systems. First, we present the general characteristics of Cloud storage services, such as reduced total cost of ownership, unlimited scalability, elasticity, and the pay-per-use model. Then, we analyze the main characteristics of Cloud storage systems (e.g. manageability, performance, access method, multi-tenancy, scalability, data availability, control, efficiency and cost) in relation to the layered architecture of the Cloud storage model and the cost implications they produce.
In the second part of the chapter we analyze the cost models for the different types of services offered in Cloud computing. We also review the cost factors for Cloud computing services and present how these are distributed between the Cloud provider and the Cloud user, depending on the type of service (e.g. IaaS, PaaS and SaaS). The important metrics related to the cost models are also presented. Here, we identify energy consumption as the main factor in the cost structure for a certain service, having a major contribution to the total cost. The research results presented in this section were published in [128].
In the third part of this chapter we analyze the main requirements of Cloud storage resource management systems for optimizing the cost of the offered service. Further, we focus on energy efficiency and present the challenges for an efficient Cloud storage service: we analyze a Cloud storage service and show the key issues and challenges from an energy-aware perspective. The research results presented in this section were published in [132].
In addition, we evaluate the impact of virtual machine heterogeneity on power consumption in Cloud computing infrastructures in the case of data-intensive applications. We published these research results in [129] and [130].
Furthermore, we present several cost optimization strategies, classified into two main categories: architecture optimizations and workload optimizations. We end the chapter with conclusions and present the open issues.
2.1 General Characteristics of Cloud Storage Systems
In this section, we present the general characteristics of Cloud storage services in relation to the layered architecture of the Cloud storage model. Characteristics such as manageability, performance, access method, multi-tenancy, scalability, data availability, control and efficiency represent important driving factors in buying cloud storage services. Moreover, each characteristic has a major impact on the service cost.
Cloud storage represents a new model of storing data, in which data is stored in virtualized storage systems inside a datacenter and is remotely accessed through a web-based interface or an application programming interface (API). It permits users to keep their files (e.g. photos, videos, textual documents, data stores, databases, etc.) in a Cloud storage platform, paying a certain amount of money. All files stored in Cloud storage can be accessed from any device that has a functional Internet connection. In order to operate on the files, a user can buy cloud processing services or download the files, modify them and upload them back to the Cloud storage platform. Users can use Cloud services for many purposes and in different ways, such as storing data for processing, backing up data, synchronizing all devices to use data stored in the Cloud, or archiving data. The necessity of a data storage system that offers scalability, reliability, performance, availability, affordability and manageability has become a strong requirement for high-level applications with multiple user interactions.
Cloud storage services provide on-demand, shared and networked access to data, and need to address the demand for storage performance, capacity, availability, reliability, and security. Compared with traditional storage systems, Cloud storage has the following benefits: reduced cost, obtained especially from the pay-as-you-go model and economies of scale, easy access to data, speed and agility. By 2020, almost 40% of all digital information will be processed in the Cloud, according to the IDC report [175].
There are three types of Cloud storage solutions offered by different service providers. The first category is Cloud storage for personal users, who mainly use Cloud storage services to store documents, photos and videos; we will refer to it as file storage. The second category of Cloud storage solutions is offered to application developers that deploy applications in the Cloud and have storage needs. The third category is represented by Cloud storage solutions that can be used by small and medium enterprises, research institutes and universities that need to deploy a private Cloud storage solution, or to adopt a hybrid Cloud storage solution, in order to satisfy their storage needs.
Characteristics of Cloud storage services such as total cost of ownership (TCO), unlimited scalability, elasticity, pay-per-use, access from anywhere and multi-tenancy are important from the Cloud user perspective and directly influence the service cost. These characteristics also become important factors in the decision process of buying Cloud storage services.
TCO represents the sum of capital expenditures (CAPEX), such as equipment and real estate, and operational expenditures (OPEX), such as electricity for powering the datacenter and salaries of employees. A reduced TCO is the main benefit that Cloud storage offers to its users. TCO reduction must be achieved without sacrificing the performance, availability, or durability of the offered services. Different resource management methods and techniques, such as scheduling, re-scheduling, virtual machine migration, data locality and data reduction techniques, can be used to minimize the OPEX component of the TCO.
The elasticity characteristic of cloud storage systems permits decoupling the physical disks from the storage pool, achieving in this way a less costly, unified and elastic storage that is not affected by variations in the physical environment, using virtualization technologies. Reducing the errors produced by the physical environment and the possibility to scale the capacity of the system out and in can reduce the management cost of the infrastructure. The pay-per-use model leads to significant cost savings for the Cloud user, who can obtain a reduced TCO by paying only for the services used in a time interval; it also permits the use of linear programming techniques for cost optimization. "Anywhere access" is one of the main benefits of Clouds compared to traditional storage, offered through a REST API over the HTTP protocol (GET, PUT, POST, and DELETE). Cloud services are very flexible and have no limitation regarding the location from which they are accessed (on
condition users have Internet access).
The generic architecture of the Cloud storage model is presented in Figure 2.1. It is composed of five layers: network and infrastructure (e.g. the physical components in the datacenter), storage management (e.g. functions such as storage virtualization, data management, backup and recovery), metadata management (e.g. various attributes of directories and files, such as ownership, permissions, quotas, and replication factor), storage overlay and the access interface layer. When talking about architectures, we have to put them in relation with the facilities they offer, such as performance, remote access, cost, etc.
Figure 2.1. Generic architecture of Cloud storage model
The network and storage infrastructure layer includes the physical components in the datacenter, such as servers, network equipment, storage servers, electrical equipment, cooling and real estate. The costs encountered at this level are fixed costs, representing the CAPEX component of the TCO. The operational costs related to this level are represented by energy consumption, business premises and salaries for the technicians and administrative staff, and form a major part of the OPEX component of the TCO. The resource management methods and techniques that we propose in this thesis try to optimize the consumption of energy by improving the utilization of the hardware layer.
The storage management layer provides functions such as storage virtualization, data management, backup, recovery, etc. The manageability characteristic of this layer (e.g. the ability to manage a system using minimal resources) has a major impact on the cost. Cloud providers need to manage Cloud storage systems in a cost-efficient way in order to offer a service with lower prices than the competition. The management cost is not visible in the short term, but in the long term it is an important component of the overall cost. It is mandatory for a large-scale Cloud storage system to be self-managed. This means that the system supports the introduction of new storage disks, self-configures to accommodate them, and self-heals in the presence of errors. Furthermore, this layer has to offer scalability (e.g. the ability to increase storage capacity and network bandwidth on demand), which increases the complexity of the Cloud storage architecture, imposing the adoption of different replication and distribution schemes or migration actions (see Figure 2.2). Also, this layer has to provide availability of the service (e.g. user data must be provided by the Cloud storage anytime the user places a request), which is hard to accomplish in a reliable and deterministic way, given the network congestion, the latency of networks and the occurring errors.
The metadata management layer contains the structure of stored data, by keeping various
attributes of directories and files, such as ownership, permissions, quotas and replication factor.
Figure 2.2. Scalability of Cloud storage systems
The storage overlay layer has to ensure the performance characteristic, as the main challenge for Cloud storage services is the ability to move data between the Cloud storage system location and the user location. The transport protocol used to carry data between sites is another problem that has to be considered. For instance, the Internet uses the TCP protocol, suitable for moving small chunks of data, files, e-mails, etc. Aspera1 developed a High-speed File Transfer software that uses a new protocol called the Fast and Secure Protocol (FASP). This protocol uses UDP, the partner transport protocol to TCP. The application layer of FASP resolves the congestion problem left open by the UDP protocol, as can be seen in Figure 2.3.
The access interface layer provides the access method, which represents the major difference between Cloud storage and traditional storage. The most common access method is web service APIs, although there are many other methods. Developers use REST principles to implement some of the APIs, with HTTP as the message transport protocol. REST APIs are stateless and therefore simple and efficient to manage. Web service APIs must be integrated with an application in order to take advantage of storage in the Cloud. File-based protocols such as NFS are also used, along with block-based protocols such as iSCSI or FCoE [51]. In addition, this layer provides multi-tenancy, which is at the same time an advantage and a potential limitation. The advantage comes from centralized management, which reduces costs. The limitation comes from security, which is not a real concern thanks to strong authentication, access controls, and various encryption options, but remains a perceived issue. Furthermore, the access interface provides the control characteristic (i.e. the customer's ability to control and manage how data are stored). Amazon Reduced Redundancy Storage (RRS) is a storage option within Amazon S3 that enables customers to reduce their costs by storing non-critical, reproducible data at lower levels of redundancy than Amazon S3 standard storage. It provides a cost-effective, highly available solution for distributing or sharing content that is durably stored elsewhere, or for storing thumbnails, transcoded media, or other processed data that can be easily reproduced. Amazon S3's standard and reduced redundancy options both
store data in multiple facilities and devices, but with RRS, data is replicated fewer times, so the cost is smaller [35].
1 http://asperasoft.com/
Figure 2.3. Fast and Secure Protocol from Aspera Software
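As a toy illustration of the stateless REST-over-HTTP access method described above (this is our own sketch, not tied to any particular provider; the endpoint URL and token below are made up), a client could manipulate stored objects with plain PUT/GET/DELETE requests:

```python
import requests  # third-party HTTP client: pip install requests

BASE = "https://storage.example-cloud.com/v1/mybucket"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}            # placeholder credential

def put_object(key: str, data: bytes) -> None:
    # PUT creates or replaces an object; each request is self-contained (stateless).
    requests.put(f"{BASE}/{key}", data=data, headers=HEADERS).raise_for_status()

def get_object(key: str) -> bytes:
    # GET retrieves the object body.
    resp = requests.get(f"{BASE}/{key}", headers=HEADERS)
    resp.raise_for_status()
    return resp.content

def delete_object(key: str) -> None:
    # DELETE removes the object.
    requests.delete(f"{BASE}/{key}", headers=HEADERS).raise_for_status()
```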
For example, the Information Dispersal Algorithm (IDA) is used by Cleversafe2 to offer greater availability of data in the face of physical failures and network outages. IDA permits data slicing with Reed-Solomon codes for data reconstruction in the case of missing data. It also allows configuration of the number of data slices: a given data object could be carved into four slices with one tolerated failure, or into 20 slices with eight tolerated failures [109]. Similar to RAID, it permits the reconstruction of data from a subset of the original data, with some amount of overhead for error codes (dependent on the number of tolerated failures). With the ability to slice data along with Cauchy Reed-Solomon correction codes, the slices can then be distributed to geographically disparate sites for storage. For a number of slices p and a number of tolerated failures m, the resulting overhead is p / (p − m). So, in the case of Figure 2.4, the overhead to the storage system for p = 4 and m = 1 is 33%. The downside of IDA is that it is CPU-intensive without hardware acceleration. Replication is another useful technique and is implemented by a variety of Cloud storage providers. Although replication introduces a large amount of overhead (100%), it is simple and efficient to provide.
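To make the overhead figures above concrete, the short sketch below (an illustrative calculation of ours, not code from the thesis) computes the dispersal overhead p / (p − m) for the two slice configurations mentioned and compares it with plain replication.

```python
def ida_overhead(p: int, m: int) -> float:
    """Storage blow-up of dispersing an object into p slices while
    tolerating m slice failures: stored/original = p / (p - m)."""
    if m >= p:
        raise ValueError("tolerated failures must be smaller than the number of slices")
    return p / (p - m)

def replication_overhead(copies: int) -> float:
    """Plain replication stores the whole object 'copies' times."""
    return float(copies)

if __name__ == "__main__":
    for p, m in [(4, 1), (20, 8)]:
        extra = ida_overhead(p, m) - 1.0  # extra space beyond the original data
        print(f"IDA p={p}, m={m}: overhead {extra:.0%}")
    # A second full copy doubles the footprint, i.e. 100% overhead.
    print(f"2-way replication: overhead {replication_overhead(2) - 1.0:.0%}")
```

For p = 4 and m = 1 this prints the 33% overhead quoted in the text, while two-way replication costs 100%.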
Efficiency means enabling a storage system to store more data. Data reduction techniques represent the standard approach, whereby the data are reduced in order to require less physical space. Even if these methods are useful, they introduce some computational overhead, with implications on the total cost. For instance, compression requires re-encoding the data, while deduplication supposes calculating data signatures when searching for duplicates.
Cost is a key characteristic of Cloud computing. There are many types of costs, such as the cost of hardware storage, the cost of electrical equipment, the cost of maintenance, as well as the cost of managing the storage facility. When we analyze Cloud storage systems with respect to cost (including SLAs and increasing efficiency), the cost metrics play a crucial role. In order to calculate the profitability of the investment and to put a price on the services, the Cloud provider must perform a financial analysis with the aid of a few parameters such as total cost of ownership (TCO), Cost/GB, net present value (NPV), internal rate of return (IRR) and return on investment
(ROI) [171]. Furthermore, cloud users need some of these metrics in order to decide whether or not to move to the Cloud.
2 https://www.cleversafe.com/
Figure 2.4. Cleversafe's approach to extreme data availability
Net present value (NPV) is defined as the difference between the present value of cash inflows
and the present value of cash outflows. NPV is used in capital budgeting to analyze the profitability
of an investment or project. The formula for NPV [156] is presented below:
\[ NPV(i, N) = \sum_{t=1}^{N} \frac{R_t}{(1+i)^t}, \qquad (2.1) \]
where:
t is the time of the cash flow;
i is the discount rate (the rate of return that could be earned on an investment in the financial markets with similar risk), i.e. the opportunity cost of capital;
R_t is the net cash flow (the amount of cash, inflow minus outflow) at time t.
The internal rate of return (IRR) is a parameter for measuring the financial evaluation. It is the discount rate for which a project's benefits exactly equal its costs, that is, the rate at which the project's net present value is zero. The evolution of the internal rate of return parameter is shown in Figure 2.5.
Figure 2.5. Usual Internal Rate of Return
The equation below describes the IRR parameter:
\[ \sum_{t=1}^{T} \frac{R_t}{(1+i)^t} - C_0 = 0, \qquad (2.2) \]
where:
C_0 is the initial cost of the investment.
Return on investment (ROI) represents an important indicator that helps in deciding whether or not to move to the Cloud, by measuring the efficiency of an investment. When calculating ROI, a large number of factors involved in a business are taken into consideration. For calculating ROI, the following information is needed: the initial cost of the project, the investment made, and the cost savings achieved owing to the new investment. The ROI formula for Cloud computing, with a time frame of a month or a year, is the following [117]:
\[ ROI = \frac{Costs\_saved - Investment}{Investment} = \frac{(Initial\_cost - Final\_cost) - Investment}{Investment} \qquad (2.3) \]
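As an illustration of how these three metrics fit together, the sketch below is our own example (not code from the thesis, and the cash-flow figures are made up): it evaluates NPV, approximates IRR by bisection under the assumption that the rate lies between 0% and 100% and that the yearly cash flows are positive, and computes ROI following equations (2.1)-(2.3).

```python
def npv(rate: float, cash_flows: list[float]) -> float:
    """Equation (2.1): sum of R_t / (1 + i)^t for t = 1..N."""
    return sum(r / (1.0 + rate) ** t for t, r in enumerate(cash_flows, start=1))

def irr(initial_cost: float, cash_flows: list[float],
        lo: float = 0.0, hi: float = 1.0, iters: int = 100) -> float:
    """Equation (2.2): bisection search for the rate where NPV - C0 = 0."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if npv(mid, cash_flows) - initial_cost > 0:
            lo = mid  # NPV still exceeds the initial cost, so the rate can be higher
        else:
            hi = mid
    return (lo + hi) / 2.0

def roi(costs_saved: float, investment: float) -> float:
    """Equation (2.3): (Costs_saved - Investment) / Investment."""
    return (costs_saved - investment) / investment

if __name__ == "__main__":
    flows = [4000.0, 4000.0, 4000.0]  # hypothetical yearly net cash flows
    print(f"NPV at 10%: {npv(0.10, flows):.2f}")
    print(f"IRR for a 10000 investment: {irr(10000.0, flows):.2%}")
    print(f"ROI: {roi(costs_saved=13000.0, investment=10000.0):.0%}")
```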
2.2 Cost Models in Cloud Computing
Cost models are essential levers for Cloud computing. Service providers offer a very wide portfolio of services, whilst Cloud clients access them against some financial arrangement. There is a fundamental trade-off between what a Cloud provider can offer in terms of services (e.g. storage, databases, and frameworks in the case of PaaS) and what Cloud clients have to pay. This depends on the kind of service provided, which can belong to one of five main categories: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), Software-as-a-Service (SaaS), Storage-as-a-Service (STaaS), and Network-as-a-Service (NaaS) [106].
Infrastructure-as-a-Service offers virtual machines, storage and networks. The main characteristics are:
• provisioning of fundamental computing resources: CPU, storage, network;
• resources are distributed and support dynamic scaling;
• it is based on a utility pricing model and variable cost;
• the user can deploy and run arbitrary software, including middleware and operating systems;
• the user has control over operating systems, storage, and deployed applications, but does not manage or control the underlying Cloud infrastructure.
Platform-as-a-Service delivers a computing platform that can be used by application developers to build and run their applications (e.g. web applications, email). It has the following characteristics:
• the capability to deploy applications onto the Cloud infrastructure;
• programming languages, tools and libraries are supported by the provider;
• the user has control over the deployed applications and possibly over configuration settings for the application-hosting environment.
Software-as-a-Service offers applications over the Internet in a pay-per-use manner. The users no longer manage the infrastructure or the platform on which the application is running:
• applications are supplied by the provider;
• applications are accessible from various client devices through either a thin client interface, such as a web browser (e.g. web-based email), or, more rarely, a program interface.
Storage-as-a-Service offers storage capacity. Clients can store and access their data at any time and from anywhere. STaaS offers high availability, reliability, performance, replication, data consistency and redundancy. Another advantage of using STaaS is that Cloud applications are able to scale according to the clients' needs. When choosing a storage-as-a-service solution, one must also consider the cost of transferring all the data into and out of the STaaS.
Network-as-a-Service includes the services for "network transport connectivity" [42]. It involves the optimization of resource scheduling by taking into account the "available network and computing resources as a whole" [42]. One reason for using NaaS is to deploy new custom routing protocols. NaaS includes network virtualization with the help of protocols such as OpenFlow [115]. NaaS provides network visibility, custom forwarding and processing of packets [42].
2.2.1 Cost Factors in Cloud Computing
Cost structure refers to how the costs associated with the production of a good or service are distributed through the process [112]. There are two sides to the same problem: the Cloud provider's and the Cloud user's perspectives. The Cloud provider must calculate the total cost of ownership and price the services adequately in order to make a profit and amortize the investment. On the other side, the Cloud user has to calculate the budget needed for buying services in the Cloud. So, there are cost factors that regard one side or the other. For instance, a consumer of Cloud IaaS services must take into consideration the following costs in order to calculate the total cost of running an application: the usage cost for servers (CPU hours), the cost of incoming data transfer to the Cloud platform and of the outgoing data transfer, measured in Gigabytes (GB), the cost of storage capacity, measured in Gigabytes (GB), and the cost of executing different requests in the Cloud platform, such as GET, PUT, COPY and POST.
Next we present the most important factors included in the cost structure of the three main services offered in Cloud computing: IaaS, PaaS and SaaS.
There are five groups of cost factors that mainly regard Cloud providers and, partially, Cloud users: electricity, hardware, software, labor and business premises [83]. Table 2.1 presents these five groups and the cost factors of each group for the customer and the service provider. For example, in the case of IaaS, the service provider must consider cost factors such as real estate, electrical equipment, utilities, hardware, maintenance, servers, storage, network, OS licensing and support. The customer must take into consideration cost factors such as OS management, application licensing, support, middleware management, business logic and business process. The same logic applies to PaaS and SaaS.
Table 2.1 (Total cost of ownership perspective) groups the cost factors into the five groups above (software, electricity, labor, hardware and business premises) and shows, for each service type (IaaS, PaaS and SaaS), which factors are borne by the customer and which by the service provider. The cost factors listed are: Business Process, Business Logic, Middleware Management, Application Licensing/Support, OS Management, OS Licensing/Support, Server/Storage/Networks, HW/Maintenance, Utilities, ME Equipment and Real Estate. Moving from IaaS to PaaS to SaaS, the boundary between customer and service provider responsibility moves up this stack, with the provider taking over ever more of the cost factors.
Table 2.1. Total cost of ownership perspective
The electricity cost factor group refers to the energy consumption for the cooling infrastructure, for powering servers, storage and networking devices, and for other electronic devices. The energy E is
measured in kilowatt-hours (kWh); the daily consumption is calculated with the following formula (E is equal to the power P in watts (W) times the number of usage hours per day t, divided by 1000 watts per kilowatt):
\[ E_{(kWh/day)} = \frac{P_{(W)} \times t_{(h/day)}}{1000\,(W/kW)} \qquad (2.4) \]
We must also consider two values of power consumption: the value when the system is idle and the value when the system is heavily used. The cost of electricity per day is calculated with the formula below (E in kWh per day times the energy cost of 1 kWh in cents/kWh, divided by 100 cents per dollar):
\[ Cost_{(\$/day)} = \frac{E_{(kWh/day)} \times Cost_{(cent/kWh)}}{100\,(cent/\$)} \qquad (2.5) \]
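A minimal sketch of equations (2.4) and (2.5) is given below (our own illustration; the wattage and tariff values are hypothetical), computing the daily energy and electricity cost of a server at idle and under heavy load.

```python
def daily_energy_kwh(power_w: float, hours_per_day: float) -> float:
    """Equation (2.4): E (kWh/day) = P (W) * t (h/day) / 1000 (W/kW)."""
    return power_w * hours_per_day / 1000.0

def daily_cost_usd(energy_kwh: float, price_cents_per_kwh: float) -> float:
    """Equation (2.5): cost ($/day) = E (kWh/day) * price (cent/kWh) / 100 (cent/$)."""
    return energy_kwh * price_cents_per_kwh / 100.0

if __name__ == "__main__":
    tariff = 12.0  # hypothetical electricity price, in cents per kWh
    for label, watts in [("idle", 150.0), ("heavily used", 400.0)]:
        e = daily_energy_kwh(watts, hours_per_day=24.0)
        print(f"{label}: {e:.2f} kWh/day, {daily_cost_usd(e, tariff):.2f} $/day")
```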
In this thesis we propose solutions to reduce Cloud storage cost through optimization of
energy consumption using different resource management methods and techniques such as virtual
machine placement, task re-scheduling, task migration, optimal data placement and so on.
The hardware cost factor refers to the acquisition of the servers and networking equipment needed in-house. Because all the resources have an optimal lifetime for exploitation, a depreciation parameter must be taken into consideration in order to evaluate the amortization of the investment.
The software cost refers to the price that the Cloud provider has to pay in order to purchase software licenses for the operating systems that run on the servers, for middleware, and for application software in the case of SaaS services.
The labor cost factor includes salaries for technicians who work on the maintenance of the datacenter and in the support area. This cost varies depending on the location of the datacenter.
The business premises cost factor includes collateral costs such as the price of the datacenter facility, the price of all non-electric instruments, the price of cabling and so on.
2.2.2 Cost Models in Cloud Computing
The Cloud cost models used today differ from one service group to another. Customers and Cloud providers need to optimize and forecast the cost over a time period, and this requires building cost models that are accurate and error-free. In this section, we survey and analyze different cost models in Cloud computing and discuss open issues related to the subject.
In the case of IaaS, we use the following three general cost models: the model for data storage, the model for a computational machine and the model for data transfer. These models are offered by the cloud providers in their pricing specifications [174].
The model for data storage has the following formula:
\[ M_s = size \times t_{sub} \times unit\_cost, \qquad (2.6) \]
where:
size is the storage capacity in GB;
t_sub is the subscription time, expressed as a number of months;
unit_cost is the price of 1 GB/month of data, expressed in dollars or euros.
The model for a computational machine has the following formula:
\[ M_{cm} = machine\_cost \times t_{hr}, \qquad (2.7) \]
where:
machine_cost is the virtual machine price, expressed in $/hour;
t_hr is the machine usage time, expressed in hours.
The price of a virtual machine (e.g. an instance) can vary and depends on the characteristics of the virtual machine, such as CPU frequency, RAM capacity, available bandwidth and disk storage. Each Cloud provider offers a set of virtual machines with predefined characteristics. For example, Amazon offers the following instance types: General Purpose instances, which provide a baseline level of CPU performance with the ability to burst above the baseline; Compute Optimized instances, which offer the highest performing processors and the lowest price/compute performance; Memory Optimized instances, which are optimized for large-scale, enterprise-class, in-memory applications and have the lowest price per GiB of RAM; GPU instances, which are intended for graphics and GPU compute applications; Storage Optimized (I2, High I/O) instances, which provide fast SSD instance storage optimized for random I/O performance; and Dense-storage instances, which provide high disk throughput at the lowest price per unit of disk throughput performance.
The model for data transfer has the following formula:
\[ M_{dt} = size(data\_in) \times unit\_price + size(data\_out) \times unit\_price, \qquad (2.8) \]
where:
size(data_in) is the amount of data transferred into the Cloud platform;
size(data_out) is the amount of data transferred out of the Cloud platform;
unit_price is the price per GB of data transferred, expressed in $/GB or EUR/GB.
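To show how a Cloud user could combine models (2.6)-(2.8) into a monthly bill estimate, the sketch below is our own illustration; the unit prices and usage figures are hypothetical and not taken from any provider's price list.

```python
def storage_cost(size_gb: float, months: int, unit_cost: float) -> float:
    """Equation (2.6): M_s = size * t_sub * unit_cost (unit_cost in $/GB/month)."""
    return size_gb * months * unit_cost

def machine_cost(price_per_hour: float, hours: float) -> float:
    """Equation (2.7): M_cm = machine_cost * t_hr."""
    return price_per_hour * hours

def transfer_cost(data_in_gb: float, data_out_gb: float, unit_price: float) -> float:
    """Equation (2.8): M_dt = size(data_in) * unit_price + size(data_out) * unit_price."""
    return (data_in_gb + data_out_gb) * unit_price

if __name__ == "__main__":
    # Hypothetical one-month scenario for a small data-processing deployment.
    total = (storage_cost(size_gb=500.0, months=1, unit_cost=0.03)
             + machine_cost(price_per_hour=0.10, hours=720.0)
             + transfer_cost(data_in_gb=200.0, data_out_gb=50.0, unit_price=0.09))
    print(f"Estimated monthly cost: ${total:.2f}")
```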
Based on the previous cost models, Table 2.2 presents more complex cost models that refer to sequential or multi-threaded programs and to parallel or MPI programs running on multiple machines. With the help of these cost models and of monitoring data, the authors of [207] developed a monitoring, analysis, and cost estimation service for scientific applications that run in Cloud environments. The shortcoming of this solution is that it is applicable only to scientific applications. The model does not take into consideration the workload of the application (how many users utilize the service or application at one time) and therefore cannot be applied to other applications or services, for example a mail service or a web site.
Also, for the IaaS service model, virtualization technology imposes supplementary challenges for providers to estimate the cost and for SaaS providers to bill. Customers and providers therefore need models for cost distribution in a virtualized environment. Two cost models based on server usage are presented in [61]: the server-usage model and the server-burst model.
The server-usage model takes into account only the resource consumption of the workloads w running on a server s with cost C_s. The cost is a combination of CAPEX components (e.g. the fraction of the acquisition costs corresponding to the length of the considered interval) and OPEX components (e.g. the costs for power associated with the server). The server-usage model is defined as follows:
Cserver_usage (s, w) = Cs ∗ P
ds,w
ds,w0
(2.9)
w0 =1
Server-burst model divides the burst portion of cost for a server in a manner that is weighted
by the burstiness of each workload on the server. To take into account burstiness and unallocated
resources server-burst model partition server cost Cs based on utilization to get cost associated
with the average utilization of the server. In a second step, unallocated resources of the server are
Resource Management for Cost Optimization in Cloud Storage Systems (PhD Thesis) - Ing. Cătălin
NEGRU
2. Cost Issues of Cloud Storage Systems
Model
Mds
Msm
Mse
Activities
Single data transfer without
the cost for machines
performing the transfer
Sequential/multi-threaded
program or single data
transfer with the cost
for machines performing
the transfer(cost monitoring)
Sequential or multi-threaded
program(cost estimation)
17
Cost
size(in) ∗ Mdf i + size(out) ∗ Mdf o
tc ∗ Mcm + size(out) ∗ Mdf o + size(in) ∗ Mdf i
fpi ∗ Mcm + size(out) ∗ Mdf o + size(in) ∗ Mdf i ,
where fp i is an estimated
performance improvement function - provided by
performance prediction tools or scientists.
Mpm
Mpe
Parallel/MPI programs on
multiple machines
(cost monitoring)
Parallel/MPI programs on
multiple machines
(cost estimation)
n ∗ Mcm ∗ tc + size(out) ∗ Mdf o + size(in) ∗ Mdf i
n ∗ Mcm ∗ fpi + size(out) ∗ Mdf o + size(in) ∗ Mdf i
where fpi is an estimated improvement
function when n processes are used.
Table 2.2. Cost models
Cs is the cost for server s;
ds,w is the average utilization of server s by workload w.
apportioned based on the bursty costs. The server burst model is defined as:
ds,w
ε + bs,w
Cserver_burst_temp (s, w) = Csd ∗ PW
+ Csb ∗ PW
0
0
w0 =1 ds,w
w0 =1 (ε + bs,w )
Cserver_burst_temp (s, w)
Cserver_burst (s, w) = Cserver_burst_temp (s, w) + Csa ∗ PW
w0 =1 Cserver_burst_temp (s, w)
(2.10)
(2.11)
bs,w is the difference between peak utilization of a server s by workload w and its average
utilization ds,w ;
parameter ensures that the denominator does not evaluate to zero when difference between
peak and mean resource usage is zero;
Csd is cost associated with the average utilization of the server;
Csb is cost associated with the difference between peak utilization of the resource and 100%;
Csa is cost associated with the difference between average utilization of the resource and 100%.
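To illustrate how the server-usage and server-burst models apportion a server's cost among its workloads, the following Python sketch implements Equations 2.9-2.11; the cost components and the utilization and burstiness figures are assumed, illustrative values.

```python
# Sketch of the server-usage and server-burst cost apportionment (Equations 2.9-2.11).
# All cost components and workload statistics below are assumed, illustrative values.

def server_usage_cost(c_s, d, w):
    """Equation 2.9: share of server cost C_s proportional to workload w's average utilization."""
    return c_s * d[w] / sum(d.values())

def server_burst_cost(c_sd, c_sb, c_sa, d, b, w, eps=1e-6):
    """Equations 2.10-2.11: average-utilization, burst and unallocated cost components."""
    def temp(wk):
        return (c_sd * d[wk] / sum(d.values())
                + c_sb * (eps + b[wk]) / sum(eps + b[x] for x in b))
    t = temp(w)
    return t + c_sa * t / sum(temp(x) for x in d)

# Hypothetical server with two workloads: average utilization d and burstiness b = peak - average.
d = {"w1": 0.30, "w2": 0.50}
b = {"w1": 0.40, "w2": 0.10}
print(server_usage_cost(100.0, d, "w1"))                # usage-based share of a 100-unit server cost
print(server_burst_cost(60.0, 25.0, 15.0, d, b, "w1"))  # burst-aware share
```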
The cost models presented in [61] take into consideration only the server usage cost. These models are useful for Cloud providers in order to estimate their costs and put prices on the offered services. The cost of the data communication of the application is not taken into consideration, so this is not a complete cost model. For the hybrid Cloud model we need cost models that take into consideration the cost of data transfer and the cost of the services used from the public Cloud provider.
The authors of [83] proposed a cost model for the hybrid Cloud (i.e. the combination of a private datacenter (private Cloud) and the public Cloud). They present a conceptual model that assumes an organization which comprises the execution of N applications and M services. Cloud users buy services to construct their applications. For the cost of data communication, a directed weighted graph is used, which is shown in Figure 3, where edges show data communication and vertices represent services. Based on the graph, a distance matrix is constructed, which represents the data-transfer-related cost factors ai,j between services i and j. The authors propose a cost formula which is the sum of two cost types: a fixed cost based on the cost factors presented earlier and a variable cost based on the services used from the Cloud provider.
The cost models presented above clearly raise the necessity for a general cost model that can accurately predict the cost of running an application in a Cloud environment.
2.3 Requirements for Cloud Storage Resource Management Systems
Cloud storage systems store data on multiple virtual servers. The resource management system virtualizes the resources in the background, according to the requirements of the user, and exposes them as storage pools. Physically, the resources span multiple servers, and furthermore the servers are distributed geographically. Moreover, Cloud storage systems offer services for a large number of users that have different necessities.
Resource management systems provide several key functions, through which the energy consumed by a Cloud storage system and therefore, the cost can be optimized. In this subsection, we
highlight the role of resource management methods and techniques in the process of cost reduction
for Cloud storage systems. First, we analyze the most important functions of cloud storage management systems and identify the ones that have a major impact on energy consumption reduction
and therefore on the cost of the storage system. Second, we present the key issues and challenges
for an energy efficient cloud storage system. Third, we evaluate the impact of virtual machine heterogeneity on datacenter power consumption for the case of data-intensive applications. Fourth,
we present several cost reduction strategies based on resource management methods.
The significant growth of data volumes drives the need for cost reduction. For example, in
2013, the engineers from CERN announced that in the last 20 years "CERN Data Centre has
recorded over 100 Petabytes of physics data". They claim that the collisions in the Large Hadron
Collider (LHC) have generated almost 75 Petabytes of data between 2010 and 2013. The amounts
of data are so large that CERN Datacenter sends them to other data centers from around the world
in order to be processed and analyzed. Engineers must find the relevant collisions by processing
almost 15 Petabytes of data generated every year. For these large datasets storage and fast access
to the data are required. CERN Data Centre is able to process almost one Petabyte of data every
day. CERN engineers say that 6000 changes are performed in the database every
second and more than one million jobs run daily. Furthermore, at peak rates, 10 gigabytes of data
are transferred every second from the data center to other locations [3].
Building a trustworthy and efficient system is one of the main challenges of developers who are
concerned with dependability issues. This is a vast domain that is in constant change, regarding the complexity of systems, the increasing number of users, or the nature of faults and failures. In
such a dynamic environment, the need to deal with many types of threats is increasing.
Resource management systems have to solve problems related to heterogeneity of systems,
compatibility constraints between virtual machines and underlying hardware, islands of resources
created due to storage and network connectivity and limited scale of storage resources [89]. So,
different resource management strategies are needed, depending on the provided services (e.g.
IaaS, PaaS, and SaaS). In all three types of services, resource management systems are faced with
variable workloads.
2.3.1 Cloud Storage Resource Management Systems Functions and Requirements
The scheduling function plays a key role in controlling the system performance and the overall cost of running the Cloud storage infrastructure. The platform schedulers have to cope with new asymptotic scales, with resource heterogeneity, with the dynamic environment, with the size and the number of tasks, and with the high number of constraints. There is no universal scheduler that can be efficient on all platforms and in all environments. Moreover, the scheduling problem for Big Data is an NP-hard problem, since there is no reasonable time to explore all the possibilities in order to make good scheduling decisions. The scheduling algorithms used for processing large data sets have important constraints such as data locality, meeting deadlines, minimizing the makespan, good utilization of the resources, a good load balancing mechanism and so on. The
main requirement for the platform schedulers and scheduling algorithms is to improve the system
performance by minimizing the computation and communication time. Therefore, our objective will
be to develop scheduling algorithms and heuristics that satisfy the constraints of the applications
and minimize the energy consumption of the running infrastructure.
Another important function of Cloud storage management systems is storage provisioning.
It is done in steps, in a specific order, to deliver optimum system performance. Storage needs in
a Cloud computing infrastructure highly vary between different users and workloads [103], [169].
Moreover, computing platforms used for data-intensive applications have to manage heterogeneous
resources and workloads. So, datacenter management tools have to deal with high complexity when they need to provision storage resources for different workloads, while meeting the SLAs. For instance, there are cases when we can predict a peak value and the resources can be provisioned in advance. A known example in the case of web services is "Black Friday" [136]. In the case of unplanned large workloads, the situation gets more complex. The main requirements for the provisioning function are to minimize the provisioning time, to not overprovision resources and to not accept workloads that would break the QoS or the load balancing.
Monitoring function is constantly needed in order to manage and operate efficiently large scale
Cloud storage infrastructures for service delivery over the Internet. With the help of monitoring
function, we can monitor key performance metrics for cloud services, can set the level of monitoring
and can customize the monitoring displays. The continuous monitoring of the Cloud and of its SLAs
(for example, in terms of availability, delay, etc.) supplies both the providers and the consumers
with information such as the workload generated, the performance and QoS offered through the
Cloud, also allowing the implementation of mechanisms to prevent or recover from violations (for both the provider and the consumers). In more general terms, Cloud computing involves many activities for
which monitoring represents an essential task.
Migration actions refer to moving data and tasks between different data centers. Migration is very closely related to storage scheduling, as the decision of migrating a specific amount of data can be scheduled in order to improve the overall performance and cost of the storage system. Through migration actions, resource management systems can accommodate demand spikes by temporarily taking resources away from underutilized applications, and can also save energy by shutting down idle resources. Yet, these migration actions can have a negative impact on application performance. So, the main requirement is that they cannot be used if they produce performance degradation for applications. Savings in energy and costs cannot come with a degraded user experience and a violation of the SLA.
Another important function of cloud storage management systems is backup. It can be
incremental (i.e. selective backup) or full. Usually a full backup is done first and subsequent
backups are incremental. In the incremental backup, only the files that have changed on the
drive since the last backup will be backed up. Also, synchronization represents a key function that management systems have to offer. There are issues related to the protocols used and to data placement. The performance of the synchronization process can be influenced by the number of files, file sizes and file types, and data placement, and therefore requires the use of protocols that offer good performance.
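As a simple illustration of the incremental backup policy described above, the following Python sketch selects and copies only the files modified since the previous backup; the paths and the timestamp handling are illustrative assumptions, not part of a specific backup product.

```python
# Illustrative sketch of incremental backup selection: only files changed since the
# last backup timestamp are copied. Paths and timestamps are hypothetical.
import os
import shutil
import time

def incremental_backup(source_dir: str, backup_dir: str, last_backup_time: float) -> list:
    """Copy files modified after last_backup_time and return the list of backed-up files."""
    backed_up = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            src = os.path.join(root, name)
            if os.path.getmtime(src) > last_backup_time:
                rel = os.path.relpath(src, source_dir)
                dst = os.path.join(backup_dir, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)
                backed_up.append(rel)
    return backed_up

# Example: back up everything changed in the last 24 hours (hypothetical paths).
# changed = incremental_backup("/data/project", "/backup/project", time.time() - 24 * 3600)
```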
All the previously presented functions of Cloud storage resource management need to address the demand for energy efficiency, storage performance and efficiency in terms of capacity, availability, reliability, security, fault-tolerance, scalability and performance.
Storage efficiency requirement refers to storage capacity; how much data we can store in a
Cloud storage datacenter. Storage efficiency can be achieved through data reduction techniques
such as deduplication, compression, snapshots, thin provisioning and so on. Data deduplication or
intelligent compression reduces storage space by elimination of redundant data. Basically we store
only a unique instance of data as redundant data is replaced with a pointer to the single instance
of the data. Although both methods are useful, compression involves processing (re-encoding the
data into and out of the infrastructure) and de-duplication involves calculating signatures of data
to search for duplicates. To manage the capacity of a Cloud storage system and to improve its
efficiency, the capacity utilization must be accurately measured across the entire infrastructure. Typical efficiency metrics are slot utilization (i.e. the ratio of storage frame slots populated with drives to the total available storage frame slots) and overall storage efficiency (i.e. the ratio of stored data to the raw storage capacity). Furthermore, capacity management can be expressed by two important indicators: the used percentage of the Cloud storage capacity and the remaining percentage of the Cloud storage capacity.
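The efficiency metrics above can be computed directly from inventory data; the following Python sketch illustrates slot utilization, overall storage efficiency and the two capacity indicators, using assumed figures.

```python
# Sketch of the capacity efficiency metrics discussed above; all input figures are assumed.

def slot_utilization(populated_slots: int, total_slots: int) -> float:
    """Ratio of storage frame slots populated with drives to the total available slots."""
    return populated_slots / total_slots

def overall_storage_efficiency(stored_data_tb: float, raw_capacity_tb: float) -> float:
    """Ratio of stored data to the raw storage capacity."""
    return stored_data_tb / raw_capacity_tb

def capacity_indicators(used_tb: float, total_tb: float) -> tuple:
    """Used and remaining percentage of the Cloud storage capacity."""
    used_pct = 100.0 * used_tb / total_tb
    return used_pct, 100.0 - used_pct

print(slot_utilization(480, 600))                # e.g. 0.8
print(overall_storage_efficiency(320.0, 500.0))  # e.g. 0.64
print(capacity_indicators(320.0, 500.0))         # e.g. (64.0, 36.0)
```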
Energy efficiency is one of the most important requirements of a Cloud storage system from the perspective of reducing the total cost, because every operation in the data center consumes energy. The design of a datacenter architecture for energy-aware storage services should meet the following objectives [17]: scalability, resiliency, uniform high storage capacity and backward compatibility.
2.3.2 Energy Efficient Cloud Storage Service: Key Issues and Challenges
Cloud computing data centers consume large amounts of energy. According to [118], in 2013 data centers in the U.S. consumed an estimated 91 billion kilowatt-hours of electricity, and it is forecasted that they will consume approximately 140 billion kilowatt-hours by 2020. Furthermore, most of the energy is used inefficiently, and computational resources such as CPU, storage, and network consume a lot of power. A good balance between the computing resources is mandatory. Given the heterogeneous nature of resources, workloads and usage patterns, power consumption reduction remains a challenging problem.
The storage system in a large datacenter consumes up to 40% of the total energy [67]. There are plenty of methods to reduce power consumption in storage systems: some are based on powering off storage devices, others are centered on placing the data optimally to allow powering off storage devices, and others are based on delaying the access to some of the storage devices. In order to have an energy efficient Cloud storage service, we must primarily understand the power consumption behavior of the components in the storage system, to be able to estimate trade-offs between power and performance.
Different approaches and strategies for energy consumption reduction have been presented
in literature, such as energy efficient hardware architectures (enables slowing down CPU speeds,
turning off partial hardware components), energy-aware job scheduling, energy-efficient network
protocols and infrastructures [195].
Massive Arrays of Idle Disks (MAID) is a technique that powers off unused disks. It is based on a disk array without redundancy, but replicates recently used data on reserved cache disks. The cache disks always remain spun up. In this way the regular disks remain idle for a longer period and the energy savings increase. The written data is also stored in the buffer cache memory of the drives [39]. The main drawback of this approach is the random access pattern: the risk is that the recently used data stored on the reserved cache disks is never accessed.
In [82] the authors propose a hybrid copy-on-write storage system that combines solid-state
disks and hard disk drives for consolidated environments. This solution takes advantage of both
devices (SSDs and HDDs) by proposing a scheme that places a read-only template disk image on a
solid-state disk, while write operations are isolated to the hard disk drive. So, in this architecture
the disk I/O performance benefits from the fast read access of the solid-state disk, especially for random reads, while preventing write operations from degrading the flash memory performance. The increased cost represents the main drawback of this approach, because of the expensive SSD drives.
Another approach is called Fractional Replication for Energy Proportionality (FREP). It
is used for energy management in large datacenters. FREP is based on a replication strategy
and on basic functions (to enable flexible energy management) for load distribution and update
consistency [86]. However, the authors of this solution do not present the impact of the replication degree on the overall cost of the storage system. The same approach is adopted in [179]. The authors
investigate the problem of creating an energy proportional storage system through power-aware
dynamic storage consolidation. Sample-Replicate Consolidate Mapping (SRCMap) is a storage
virtualization layer optimization that enables energy proportionality for dynamic I/O workloads.
This layer consolidates the workload on a subset of physical volumes proportional to the I/O
workload intensity. The advantage of this approach is that instead of migrating data across physical
volumes dynamically or replicating entire volumes (both are prohibitively expensive), SRCMap
samples a subset of blocks from each data volume that constitutes its working set and replicates
these on other physical volumes. During a given interval, SRCMap activates a minimal set of
physical volumes to serve the workload and spins down the remaining volumes, redirecting their
workload to replicas on active volumes. The data placement modification represents the main
drawback of this solution. It can decrease the performance of the system in full power mode and
especially for random workloads.
The authors of [84] propose a variant of HDFS called GreenHDFS for managing intensive data processing on commodity Hadoop clusters. GreenHDFS divides the datacenter servers into cold zones and hot zones. Through the data-classification-driven data placement scheme, GreenHDFS can group the most used data in hot zones and the rarely used data in cold zones. In this way it can scale down by guaranteeing substantially long periods (several days) of idleness in a subset of servers in the datacenter designated as the Cold Zone. These servers are put in a high-energy-saving mode. According to the obtained results, GreenHDFS was capable of achieving 26% savings in the energy costs of a Hadoop cluster in a three-month simulation run. Also, the simulation results extrapolate savings of $2.4 million annually when the GreenHDFS technique is applied across all Hadoop clusters (amounting to 38000 servers) at Yahoo.
Analysis of Power Consumption in a Cloud Storage Service
The authors of [23] analyze the energy consumption of a storage service. They consider a file
storage and backup service, where all processing and computation are performed on the user’s
computer but user data are stored in the Cloud. Files are downloaded from the Cloud for viewing
and editing and then uploaded back to the Cloud for storage. The per-user power consumption of
the storage service Pst , calculated as a function of downloads per file per hour, is:
Pst = (Bd · D / 3600) · (ET + 3 Pst,SR / (2 Cst,SR)) + 2 Bd · 3 PSD / (2 BSD),  (2.12)
where:
Bd (bits) is the average size of a file;
D is the number of downloads per hour;
Pst,SR is the power consumption of each content server;
Cst,SR (bits per second) is the capacity of each content server;
PSD is the power consumption of hard disk arrays;
BSD (bits) is the capacity of hard disk arrays;
ET is the energy consumption of transmission and switching.
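The per-user power model of [23] can be evaluated directly from these parameters; the following Python sketch implements Equation 2.12 with hypothetical, uncalibrated server and disk-array figures.

```python
# Sketch of the per-user storage service power model (Equation 2.12).
# The server and disk-array parameters below are hypothetical, for illustration only.

def storage_service_power(b_d_bits, downloads_per_hour, e_t,
                          p_server, c_server, p_disk_array, b_disk_array):
    """P_st = (B_d*D/3600)*(E_T + 3*P_st,SR/(2*C_st,SR)) + 2*B_d*3*P_SD/(2*B_SD)."""
    serving = (b_d_bits * downloads_per_hour / 3600.0) * (
        e_t + 3.0 * p_server / (2.0 * c_server))
    storing = 2.0 * b_d_bits * 3.0 * p_disk_array / (2.0 * b_disk_array)
    return serving + storing

# Example: a 10 MB file downloaded twice per hour, with assumed hardware characteristics.
p_user = storage_service_power(
    b_d_bits=10 * 8e6, downloads_per_hour=2, e_t=1e-6,
    p_server=300.0, c_server=10e9, p_disk_array=500.0, b_disk_array=200e12)
print(f"Per-user storage power: {p_user:.6f} W")
```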
The disadvantage of this model is that it assumes that only files that are regularly accessed consume energy when stored. It is not sufficient to model a real Cloud storage service which has millions of users. It is hard for resource management systems to predict the data that will be processed at a certain moment, in order to store it on a smaller number of disks.
So, it is possible to optimize the power consumption of a storage service if it is possible to predict the user behavior (to forecast the download rate). For instance, if we can predict the data that will be downloaded next, then we can group this data on a smaller number of disks and shut down the rest of the disks. Also, if we can predict the user behavior and there is a reasonable time frame between downloads, we can shut down the rest of the disks and save energy.
Energy awareness key issues and challenges
The key research issues related to the Cloud are specific to the different services (e.g. IaaS, PaaS and SaaS), to Cloud management and to the applications and platforms that run in the Cloud. Important challenges are the creation of middleware services to build, deploy, integrate and manage applications in a scalable, elastic and multi-tenant environment. These challenges are also applicable to Cloud storage services oriented on energy-awareness.
The main issue for energy-aware Cloud storage services is to keep the dedicated servers in the Cloud well utilized, which means that the idle power costs are efficiently reduced [70]. The key challenges are performance degradation (because it increases the energy per unit of work, and minimizing the number of servers may not necessarily minimize energy) and power variation (the allocation of the data objects will affect the power consumed) [10]. It is also important to achieve a good trade-off between energy savings and application performance.
Other key challenges identified in [125] are: architectural management, efficiency, reliability, response time, quality of service employing software development methodologies, addressing collaboration issues between service provider and consumer while maintaining trust in Cloud computing, and heterogeneity of the systems.
These challenges impose the adoption of different solutions such as automated service provisioning that is required due to the dynamic nature of these applications, virtual machine migration,
server consolidation, energy management, traffic management and analysis, data security, storage
technologies and data management in the context of Big Data and novel Cloud architectures [5].
For instance, server virtualization represents one step further in power consumption optimization
in Cloud computing datacenters, permitting effective and efficient energy management.
Resource provisioning represents another challenge as we encounter a great diversity of workloads (e.g. computationally-intensive, data-intensive, and hybrid), usage patterns (e.g. static,
periodic, once-in-a-lifetime, unpredictable, and continuously changing), and virtual machine heterogeneity. For instance, in the case of data-intensive applications a significant portion of energy
is used just to keep virtual machines alive or to move data around without performing useful computation. Reducing power consumption at the data center level has serious implications on the usage cost.
Heterogeneity of resources has serious implications on the energy consumption and cost of Cloud services and applications. Further, we analyze the impact of virtual machine heterogeneity on power consumption for data-intensive applications.
2.3.3 Impact of Virtual Machine Heterogeneity on Datacenter Power Consumption
In the context of data-intensive applications, a significant portion of energy is consumed just to
keep virtual machines alive or to move data around without performing useful computation. Power
consumption optimization requires identification of the inefficiencies in the management system.
Based on the relation between server load and energy consumption, in this subsection, we study
the energy efficiency, and the penalties in terms of power consumption that are introduced by
different degrees of heterogeneity for a cluster of heterogeneous virtual machines.
VM Heterogeneity Problem
In order to support the increased usage of online services (e.g. banking, e-commerce, social networking, education, etc.) by different users, Cloud computing datacenters rely on virtualization
technology. It permits the independence of applications and servers. Moreover, virtualization offers
a new way to improve the data center energy efficiency by assignment of multiple virtual machines
(VMs) on a single physical server. Furthermore, most of the energy is used inefficiently because
of low utilization of virtual machines. Resources, such as CPU, memory, storage, and network,
consume energy even when they are in idle state [8].
A few important questions arise when talking about power efficiency related to workload type and usage patterns. What happens when we have to deal with workloads that are computationally intensive or data-intensive? What is better from an energy-consumption perspective: to use virtual machines with low resource characteristics and process the tasks in a longer time, or to use virtual machines with high resource characteristics and finish the tasks in a shorter time? Also, what are the implications of virtual machine heterogeneity on the energy consumption? So, in order to study the optimization of power consumption, we first need to identify the inefficiencies in the underlying system. The heterogeneity of virtual machines has a great deal of importance in reducing the power consumption.
Data-intensive applications (e.g. smart cities and cyber-infrastructures) are I/O bound. They dedicate a significant part of the execution time to the data movement process, and consequently need a high-bandwidth data access rate. If the available bandwidth is less than required, the CPU is held idle until the data sets are available. So, a virtual machine that is idle for a certain amount of time will consume much more energy than necessary. Additionally, a heterogeneous environment contributes to the waste of energy, especially in the case of data-intensive applications. This happens
because of the diversity of the resource characteristics such as CPU, bandwidth and RAM memory
in virtual machines.
Also, Big Data applications can help to uncover the fine interactions between data. In this way
we are allowed to manipulate hidden, often-counterintuitive, levels that directly impact different
domains and activities. Moreover, these applications can bring up new opportunities for business
and consumer through modern marketing and networking technologies using an inclusive social and
technological environment [119], [124]. Also, there are tools that support managers in identifying forthcoming disruptive technologies and provide them with tailored strategic options [94]. Fields
such as environmental research, disaster management and information in relief operations, decision
support systems, crowd-sourcing, citizen sensing and sensor web technologies, need to make use of
new and innovative tools and methods for Big Data, in order to be more efficient. So, we must
be able to analyze all data in order to get the promoted benefits. For instance, a decision support
system can give better and accurate indications in a crisis situation.
As shown by the authors of [185], the energy consumption of a virtual machine can be measured by measuring the usage of the machine. So, there is a direct relation between server load and energy consumption. Based on this relation, we try to evaluate the penalties in terms of power consumption introduced by the heterogeneity of virtual machines.
Energy-Efficient Cloud computing
Over the past decade computational power has risen to new heights, but unfortunately so have the associated costs. Furthermore, maximum usage of power cannot always be guaranteed and therefore problems occur. As processing power got grouped into clusters, a new interest arose in the efficient use of power. Although clusters now serve Cloud architectures, that fact alone does not guarantee maximum performance at a minimum cost.
The problem of efficient power consumption of virtual machines in Cloud computing infrastructures has been intensively studied. Surveying the literature, we can distinguish a few important research directions for power-efficient Cloud computing.
One research direction refers to methods and technologies for operational efficiency at the hardware level, meaning the computer and network infrastructure. Technologies such as SpeedStep [75], PowerNow [152], Cool'nQuiet [40] or Demand-Based Switching [143] have been developed. Also, techniques like dynamic voltage scaling [147] have been applied in different provisioning and scheduling algorithms and workload consolidation techniques to minimize the power consumption. Moreover, frameworks for reducing power consumption in computer networks and network-wide optimization algorithms have been proposed. The authors of [137] propose a two-level control framework providing local control mechanisms implemented at the network device level and network-wide control strategies implemented at the central control level.
Another research direction that we have identified in the scientific literature refers to the virtual machine placement problem, for which different methods and algorithms have been proposed. For instance, in [62] the authors propose an algorithm that uses methods like dynamic programming and local search in order to place previously created copies of virtual machines on the physical servers in an energy-efficient way, while meeting the QoS. An algorithm for virtual machine placement, designed to increase environmental sustainability in the context of distributed data centers with different carbon footprints and power utilization efficiencies, is presented in [85]. The results obtained from the simulation show a reduction of CO2 emissions and power consumption, while maintaining the same level of quality of service. Furthermore, a multi-objective ant colony system algorithm for the minimization of total resource wastage and power consumption is proposed in [55]. The authors compare the algorithm with an existing multi-objective genetic algorithm and two single-objective algorithms.
Energy-efficient scheduling algorithms that assign virtual machines to physical machines represent another research direction that we have identified. For example, an algorithm that aims to minimize the total power consumption of the physical machines in the data center by efficiently assigning the virtual machines is presented in [161]. The results obtained show 24.9% power savings and nearly 1.2% performance degradation. Also, an algorithm called Dynamic Round-Robin for energy-aware virtual machine scheduling and consolidation is proposed in [100]. Compared with other strategies such as Greedy, Round-Robin and PowerSave implemented in Eucalyptus, it reduces a significant amount of power. In [116] the authors propose and implement a virtual machine scheduling heuristic that takes into consideration load balancing and temperature balancing, with the aim of reducing the energy consumption in a Cloud datacenter.
Energy-efficient, data-aware scheduling is also a major research direction. In Cloud computing, it poses additional challenges, as data is stored and accessed on a large scale from distributed servers. In this situation the reduction of energy consumption represents the principal scheduling objective. In [92], the authors deal with the problem of independent batch scheduling in a grid environment as a bi-objective minimization problem with makespan and energy consumption as the scheduling criteria. Also, in [93] two implementations of classical genetic-based data-aware schedulers of independent tasks submitted to the grid environment are presented. In [13], the energy efficiency of message exchanging for service distribution in interoperable infrastructures is optimized. The authors consider two use cases: first, a requester sends messages to all interconnected nodes and gets messages only from the resources available to execute the request; second, the requester sends one message for all of the jobs in its local pool and gets a response from the available nodes, and then the obtainable resources are ranked and hierarchically categorized based on a performance criterion, e.g. latency competency. Also, in [14] the authors propose a novel message-exchanging optimization model to reduce energy consumption in distributed systems. They aim to optimize the energy consumption for communication and to improve the overall system performance. The inter-cloud concept [167] encompasses the interconnectivity of nodes, and the authors of [13], [14] developed the inter-cloud meta-scheduling simulation framework [166] to evaluate the energy efficiency of message exchanging.
Workload consolidation also represents an important research direction, as it permits placing the workload on fewer physical machines, taking the machine load into consideration as the principal parameter. In this way the reduction of power consumption is achieved. Usually, the workload placement problem is modeled as a multi-dimensional bin-packing problem, as expressed in [153], [144], [95]. Moreover, meta-heuristics such as Ant Colony Optimization [45], [52], [56] and Genetic Algorithms [91], [159], [150], [90] are used for power consumption optimization.
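To illustrate the bin-packing formulation of workload consolidation mentioned above, the following Python sketch applies a simple first-fit-decreasing heuristic over two resource dimensions (CPU and memory); the VM demands and the server capacity are assumed values, and this is only one possible heuristic, not the method of the cited works.

```python
# First-fit-decreasing sketch for two-dimensional workload consolidation.
# VM demands and server capacity are illustrative assumptions.

def consolidate(vms, capacity):
    """Place (cpu, mem) demands on as few servers of the given capacity as possible."""
    servers = []  # each server holds its remaining cpu/mem and the VMs placed on it
    # Sort by the larger normalized dimension, descending (first-fit decreasing).
    order = sorted(vms, key=lambda v: max(v[0] / capacity[0], v[1] / capacity[1]), reverse=True)
    for cpu, mem in order:
        for srv in servers:
            if srv["cpu"] >= cpu and srv["mem"] >= mem:
                srv["cpu"] -= cpu
                srv["mem"] -= mem
                srv["vms"].append((cpu, mem))
                break
        else:  # no existing server fits, so a new one must be powered on
            servers.append({"cpu": capacity[0] - cpu, "mem": capacity[1] - mem,
                            "vms": [(cpu, mem)]})
    return servers

demands = [(2, 4), (1, 8), (4, 8), (2, 2), (1, 1), (3, 6)]  # (vCPUs, GB of RAM)
placement = consolidate(demands, capacity=(8, 16))
print(f"Servers needed: {len(placement)}")
```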
More recent research directions are in the domain of security. One example is the prevention of energy-oriented distributed denial of service (e-DDoS) attacks. These attacks are characterized by the fact that they do not produce direct damage or block the activity of the targeted infrastructure, but instead generate an anomalous and sustained power consumption on the target side. In this way, IT equipment and facilities such as air conditioning, heating and ventilation are affected and their lifetime is reduced significantly. Also, this type of attack drastically increases energy bills [53].
The previous related works do not specifically take into consideration and do not evaluate the heterogeneity degree of virtual machines when proposing different resource management methods such as resource allocation, job scheduling and workload consolidation techniques. Quantifying the penalties introduced by the different degrees of heterogeneity can be used further as an input parameter in scheduling and provisioning algorithms for energy consumption optimization. Based on the previous related works, we built a simple taxonomy, presented in Figure 2.6.
Figure 2.6. Energy efficient power consumption taxonomy
Energy Efficient Processing in Big Data Platforms
Big Data can represent a meme and also a business term, but more importantly it can represent a buzzword for the advancing trends in technology, which describes a new approach to understanding the process of decision-making. Nowadays, data grow at a rate of 50% per year, and new streams of data such as digital sensors, industrial equipment, automobiles and electrical meters (measuring and communicating all sorts of parameters like location, temperature and movement) are being generated [136]. Big Data is characterized by seven properties:
• Volume refers to the large amounts of generated data. These impose data processing challenges such as storage, analysis and processing, in the process of obtaining valuable information.
All of this is possible with the help of Cloud storage services, where data can be stored in
different locations and processed all together. Another important aspect is represented by the
fact that the scheduling algorithms have to be able to migrate tasks near to datasets in a cost
efficient way. The scheduling phase makes a big difference in getting fast valuable content
from the datasets. Even the smallest optimization to the scheduling algorithm represents a
step ahead in getting the meaningful data in time and at a low cost;
• Variety refers to the diversity of data, imposing new challenges. Big Data systems must
be able to handle complex data, starting with the traditional one, like relational data, and
continuing with the nontraditional one, like raw data, semi-structured or unstructured data.
Almost 80% of the world’s data is now unorganized and traditional databases cannot be used
anymore for storing and managing it [111]. In the scheduling phase, the variety of data plays
an important role since different types of data must be processed and mapped to dedicated
resources;
• Velocity refers to the speed of generating data. This is defined as the speed at which the
data is flowing. A good example can be the social media messages that are spread in seconds.
Since the time of delivery services is highly important, for example the financial markets,
also the speed of manipulating and analyzing complex data in real time plays a key role. As
the ones from IBM say, one must "perform analytics against the volume and variety of data
while it is still in motion, not just after it is at rest" [74]. The velocity of data is highly
important for the scheduling part of processing Big Data because the scheduler must be able
to adjust and deal with the high loads of data at peak times;
• Veracity refers to the trustworthiness of the data. As the volume and complexity of data
are increasing, its quality and accuracy are becoming less controllable [22];
• Value refers to the value that can be extracted from datasets. This means that it does
not matter how big the data volume is or how complex it is unless we are able to extract
the meaningful information. Moreover, storing and processing meaningless data represents
a waste of money, time, business and obtaining the relevant information becomes harder.
Furthermore, scheduling jobs that analyze or process useless data is inefficient in terms of
costs (occupied resources, consumed energy) and introduces delays for getting the relevant
information;
• Volatility refers to the time for which the data is valid and should be stored. The scheduling for
volatile data must be implemented for real time systems. The scheduler has to take into
consideration the deadlines of the submitted jobs so that relevant data could be processed.
In the case of volatile data, the datasets must be processed and analyzed in real time otherwise
one cannot obtain meaningful information from it;
• Vicissitude refers to the challenge of scaling complex Big Data workflows. This property signifies a combination between the large volume of data and the complexity of the processing workflow, which prevents gathering useful insights from the data [60].
Big Data requires a pipeline of processing operations in order to accomplish efficient analytics.
The overall scope is to offer support in the decision-making process. The Big Data stack is
composed of storage, infrastructure (e.g. Hadoop), data (e.g. human genome), applications, views
(e.g. Hive), and visualization. The majority of big companies already have in place a warehouse
and analytics solution which needs to be integrated now with the Big Data solution. An important
challenge, besides the large volumes and high-speed production rates (e.g. velocity), is raised by
the high data heterogeneity (e.g. variety).
Finding the best method for a particular processing request behind a particular use remains
a significant challenge. We can see the Big Data processing as a big "batch" process that runs on
an HPC cluster by splitting a job into smaller tasks and distributing the work to the cluster nodes.
The new types of applications, like social networking, graph analytics, complex business workflows,
require data movement and data storage. A general view of a four-layer big-data processing stack
[178] is presented in Figure 2.7. Storage Engine provides storage solutions (hardware/software) for
big data applications: HDFS, S3, Lustre, NFS, etc. Execution Engine has to provide reliable and
efficient use of computational resources for execution. This layer aggregates YARN-based processing solutions. Programming Model offers support for application development and deployment. High-Level Language allows modelling of queries and general data-processing tasks in easy and flexible languages (especially for non-experts).
The processing models must be aware of data locality and fairness when deciding to move data to the computation node or to create new computation nodes near the data. The workload optimization strategies are the key to a guaranteed profit for resource providers, by using the resources at maximum capacity. For applications that are both computationally and data intensive, the processing models combine different techniques like in-memory Big Data or CPU+GPU processing.
Figure 2.7. Big Data Processing Stack.
Figure 2.8 describes a general stack used to define a Big Data processing platform.
A general Big Data processing architecture basically consists of two parts: a job manager
that coordinates processing nodes and a storage manager that coordinates storage nodes [104].
Apache Hadoop is a set of open source applications that are used together in order to provide a
Big Data solution. The two main components mentioned above in Hadoop are HDFS and YARN.
HDFS - Hadoop Distributed File System is organized in clusters where each cluster consists of a
name node and several storage nodes. A large file is split into blocks and name node takes care
of persisting the parts on data nodes. The name node maintains metadata about the files and
commits updates to a file from a temporary cache to the permanent data node. The data node
does not have knowledge about the full logical HDFS file; it handles locally each block as a separate
file. Fault tolerance is achieved through replication; optimizing the communication by considering
the location of the data nodes (the ones located on the same rack are preferred). A high degree of
reliability is realized using "heartbeat" technique (for monitoring), snapshots, metadata replication,
checksums (for data integrity), re-balancing (for performance).
YARN is a resource manager for MapReduce v2.0. It implements a master/slave execution of
processes with a Job Tracker master node and a pool of Task Trackers which do the work. The two
main responsibilities of the Job Tracker, management of resources and job scheduling/monitoring
are split. There is a global resource manager (RM) and per application Application Master (AM).
The slave node has an entity named Node Manager (NM) which is doing the computations. The
AM negotiates with the RM for resources and monitors task progress.
Other components that need to be added on top of Hadoop in order to create a complete Big Data ecosystem are: configuration management (Zookeeper [73]), columnar organization (HBase [79]), data warehouse querying (Hive [71]), easier development of MapReduce programs (Pig [160]) and machine learning algorithms (Mahout [141]).
In Big Data platforms, each layer of the processing stack (e.g. operating systems, databases and environments, computing solutions, data operations and analytics, and data sources) imposes power issues, as can be seen in Figure 2.9. First, the usage of virtual machine components (e.g. CPU, storage, RAM and network) has a direct influence on the power consumption of the hardware components. Operating systems, databases and environments have a direct impact on the hardware usage and thus on power consumption. Computing solutions produce the system-behavior power consumption factor. Data operations and analytics affect data movement and processing. Finally, data sources produce effects in the memory and storage components.
Figure 2.8. Big Data Platforms Stack: an extended view.
Figure 2.9. Energy efficiency issues in Big Data processing
So, in order to optimize the energy consumption of the underlying hardware, every processing
step should be optimized. For instance, this could mean optimized reads and writes from memory
and storage, efficient data movement and processing, improved system behavior.
Virtual Machine Power Metering
In order to optimize power consumption, we have to measure it on a per-virtual-machine basis in an accurate and efficient way. The virtual machine power models for consumption metering proposed in the literature can be classified into two categories: utilization-based models [107], [18], [11] and performance-monitor-counter models [99], [15], [12]. The first category of models assumes that the energy consumption of server resources (e.g. CPU, memory, disk) is linear with their utilization [120]:

Pserver = Pstatic + Σ_{j∈J} (kj · Uj),  (2.13)

where:
Pstatic represents the fixed power consumption when there is no workload;
Uj is the utilization of physical component j;
kj is the dynamic power coefficient;
J = {CPU, RAM, Disk, I/O} is the set of power consuming components.
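As a concrete reading of the utilization-based server model, the following Python sketch evaluates Equation 2.13 with assumed power coefficients; the static power and the coefficients kj are illustrative values, not measured ones.

```python
# Sketch of the utilization-based server power model (Equation 2.13).
# The static power and the dynamic coefficients k_j are assumed, illustrative values.

STATIC_POWER_W = 70.0                        # P_static: power drawn with no workload
DYNAMIC_COEFF = {"cpu": 120.0, "ram": 25.0,  # k_j: added watts at 100% utilization of component j
                 "disk": 15.0, "io": 10.0}

def server_power(utilization: dict) -> float:
    """P_server = P_static + sum_j k_j * U_j, with each U_j in [0, 1]."""
    return STATIC_POWER_W + sum(DYNAMIC_COEFF[j] * u for j, u in utilization.items())

# Example: a moderately loaded server.
print(server_power({"cpu": 0.6, "ram": 0.4, "disk": 0.2, "io": 0.1}))  # about 156 W
```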
Starting from Equation 2.13, the most used virtual machine power model can be obtained:

Pvm_i = (Pstatic / M) · Σ_{j∈J} (kj · Uj),  (2.14)

where:
Wi is the utilization of the virtual machine;
M is the number of active VMs on a server.
The second category, the performance-monitor-counter models, is based on software components called counters that monitor the performance of the physical server, offering a real-time method for power consumption monitoring. These counters are supported by all modern processor architectures. The power model for a virtual machine using performance monitor counters can be expressed as follows:

Pvm_i(t1, t2) = Σ_{j∈J} Pvm_{i,j}(t1, t2),  (2.15)

where:
Pvm_{i,j}(t1, t2) is the power consumed by physical component j in the time interval [t1, t2].
The authors of [7], in order to formulate the power consumption minimization problem, use the following objective function:

P(π) = Σ_{i∈[1,m]: Ai ≠ ∅} ( µ · Σ_{dj∈Ai} l(dj)^α + b ),  (2.16)

where:
π = A1, . . . , Am is the set of virtual machines;
l(dj) is the load of virtual machine dj;
µ is the dynamic power coefficient;
b is the static power consumption.
Then the power consumption function for a set of virtual machines can be expressed as follows:

P(π) = Σ_{i=1}^{m} f(l(Ai)).  (2.17)

Further, we used Equation (2.16) to quantify the power consumption for our considered set of virtual machines.
Impact Evaluation on Power Consumption of Virtual Machines Heterogeneity
In order to evaluate the impact of virtual machine heterogeneity on power consumption at the data center level, we performed a few experiments with different degrees of heterogeneity. To achieve our goal, we used multiple sets of virtual machines with different heterogeneity degrees, each set being composed of four instances. Further, we gradually increased the degree of heterogeneity and calculated the power consumption for each degree of heterogeneity.
We used the following formula in order to calculate the power consumption. This is based on Equation 2.16:

P(4) = Σ_{i=1}^{4} (l(i)^3 + 0.1),  (2.18)

where:
i represents a virtual machine instance;
α = 3;
µ = 1;
b = 0.1.
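The following Python sketch shows how Equation 2.18 can be evaluated over the load samples collected every 15 seconds; the load values below are hypothetical placeholders, not the measured data from the experiments.

```python
# Sketch of the power metric used in the experiments (Equation 2.18):
# for each sampling interval, sum (load^3 + 0.1) over the four instances.
# The load samples are hypothetical placeholders, not the measured values.

def interval_power(loads, alpha=3, mu=1.0, b=0.1):
    """Power metric for one sampling interval, following Equation 2.16 with the chosen constants."""
    return sum(mu * load ** alpha + b for load in loads)

# One row per 15-second sampling interval, one column per virtual machine instance.
samples = [
    [0.8, 0.9, 1.1, 1.0],
    [1.2, 0.7, 0.9, 1.3],
    [0.5, 0.4, 0.6, 0.5],
]
total = sum(interval_power(row) for row in samples)
print(f"Aggregated power metric over {len(samples)} intervals: {total:.2f}")
```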
Regarding the workload type, we chose to perform a data-intensive job: all machines send and receive simultaneously a file of 200 MB. Every 15 seconds we collect the server load on every machine until the job is done.
In the first experiment we started four identical machines and performed the data transfer job. In the second experiment we used three identical instances and the fourth instance was different. For the third experiment we used two machines of one type and two of another type. In the fourth experiment we used three different sets of virtual machines. In the last experiment, all machines were different, having the highest degree of heterogeneity.
Experimental Setup
We have two different experimental setups. First, in order to perform a data-intensive job, we interconnected all machines in a full mesh logical topology, as shown in Figure 2.10. Each virtual machine sends and receives data. We used four types of virtual machines from the Microsoft Azure Cloud with different CPU and RAM characteristics, presented in Table 2.3: the Basic A0, Basic A1, Basic A2 and Basic A3 instance types.
Table 2.3. Azure instance types used in the experiments

VM type  | Cores | RAM (GB) | Disk (GB)
Basic A0 | 0.25  | 0.75     | 30
Basic A1 | 1     | 1.75     | 30
Basic A2 | 2     | 3.00     | 30
Basic A3 | 4     | 7.00     | 30
Figure 2.10. Full mesh logical topology
For the second experimental setup, in order to evaluate power consumption in a Hadoop MapReduce cluster, we used two different Hadoop clusters with the same computing power. Cluster 1 has six worker nodes (each with 1 virtual CPU and 3.75 GB of RAM) and one master node (with 2 virtual CPUs and 7 GB of RAM). Cluster 2 has three worker nodes and one master node, all virtual machines having the same configuration: 2 virtual CPUs and 7 GB of RAM.
In all the performed experiments we obtained the average load of the virtual machines using the "uptime" command. The load average on Linux systems represents a measure of the computational work that the system is performing. Moreover, Linux counts processes waiting for resources other than the CPU, such as processes that wait to read from or write to the disk.
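For reference, the load average reported by "uptime" can also be read programmatically; the following Python sketch polls the system load every 15 seconds, mirroring the measurement procedure (the sampling duration and output handling are illustrative choices).

```python
# Sketch of the load sampling procedure: read the 1-minute load average every 15 seconds.
# os.getloadavg() returns the same values reported by the "uptime" command on Linux.
import os
import time

def sample_load(duration_s: int = 120, interval_s: int = 15):
    """Collect (timestamp, 1-minute load average) pairs for the given duration."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        one_min, _five_min, _fifteen_min = os.getloadavg()
        samples.append((time.time(), one_min))
        time.sleep(interval_s)
    return samples

# Example (blocks for two minutes on a real machine):
# for ts, load in sample_load():
#     print(ts, load)
```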
Experimental Results
Evolution of server load
Figure 2.11. Experiment 1: Homogeneous environment
Figure 2.12. Experiment 2: Heterogeneous environment
In order to understand the effect of heterogeneity on the machines, we present the evolution of the server load in homogeneous and heterogeneous environments. In the first experiment we considered a homogeneous environment, all virtual machines being Basic A0 instance types. The evolution of the server load is presented in Figure 2.11. As can be seen, the load is similar on all machines and the transfer finishes simultaneously on all machines.
In the second experiment, we considered a heterogeneous environment with three identical virtual machines of the "Basic A0" instance type and one different machine of the "Basic A1" instance type. The evolution of the server load is presented in Figure 2.12. We can observe a more random evolution pattern with large fluctuations of the load. The "Basic A1" instance finishes the transfer first and has to wait for the rest of the instances to finish the data transfer.
In the third experiment we increased the degree of heterogeneity. We interconnected two sets of virtual machines with different characteristics: the first set has "Basic A0" instance types and the second one has "Basic A1". Figure 2.13 presents the evolution of the server load. As we can observe, the sets of identical machines have the same evolution pattern. Moreover, the virtual machines with higher performance characteristics (e.g. Basic A1) finish the transfer before the other instance types and have to wait for the slower machines to finish the transfer.
Figure 2.13. Experiment 3: Heterogeneous environment
Figure 2.14. Experiment 4: Heterogeneous environment
For the fourth experiment we further increased the degree of heterogeneity and interconnected three different sets of instances (e.g. Basic A0, A1 and A2). The evolution of the server load is presented in Figure 2.14. We can see an irregular pattern of the load evolution, and the range of load values increased. Also, the "Basic A2" instance type finishes the transfer first, followed by the "Basic A1" and "Basic A0" types.
In the fifth experiment, we considered an environment with the highest degree of heterogeneity. In this setup we interconnected four different types of virtual machines. Figure 2.15 presents the evolution of the server load in this type of environment. The powerful instances A2 and A3 finish the job much faster, approximately after 100 seconds, and have to wait for the other, less powerful, virtual machines.
Figure 2.15. Experiment 5: Heterogeneous environment
In the second part of our experiments we measured the evolution of the load on two different Hadoop clusters while performing a TeraSort benchmark. TeraSort is one of Hadoop's most widely used benchmarks. Hadoop's distribution contains both the input generator and the sorting implementation: TeraGen generates the input, TeraSort performs the sorting and TeraValidate validates the sorted data. We ran the benchmark for a file size of 10 GB. The load evolution for the first cluster is presented in Figure 2.16 and for the second cluster in Figure 2.17.
Comparing the load evolution of the two clusters, we can observe that the first cluster needs more time (approximately five minutes more) to generate, sort and validate the data than the second cluster. We can also observe that the cluster nodes are not equally balanced.
Figure 2.16. Hadoop cluster 1 load evolution
Figure 2.17. Hadoop cluster 2 load evolution
Power Consumption
Figure 2.18 presents the evolution of the energy consumption in an aggregated way. As can be seen, the environment with the highest degree of heterogeneity (Experiment 5) consumes twice as much power as the homogeneous environment (Experiment 1). Furthermore, we observe that in Experiment 5 an important quantity of energy is consumed in the first 15 seconds, which is the time needed by the powerful machines to finish the transfer. In the remaining time these machines stay in an idle state, waiting for the slower machines to finish the transfer, and thus consume energy while not doing any useful work.
Figure 2.18. Power Consumption
So, we can conclude that heterogeneous environments consume more power because the virtual machines with higher resource characteristics finish the transfer much faster than the slower machines and have to wait in an idle state for the other virtual machines to finish the transfer, consuming power without performing useful computation. The key issue is to reduce the idle time of the virtual machines.
Figure 2.19 presents the power consumption for each experiment performed. A fully heterogeneous environment consumes approximately twice the power of a homogeneous environment.
The power consumption of the nodes of Hadoop cluster 1 is presented in Figure 2.20 and of Hadoop cluster 2 in Figure 2.21. For the first cluster, the power consumption is unevenly distributed across nodes; only two nodes (cluster-1-w-3 and cluster-1-w-5) consume the same amount of power.
Figure 2.19. Power consumption for each experiment
Figure 2.20. Hadoop cluster 1 power consumption
Comparing the power consumed by the two clusters, we observe that cluster 1 consumes approximately twice the power of cluster 2. We conclude that, in order to optimize power consumption in Big Data processing platforms, we must find the best combination between the configuration of computing resources and the workload type.
Results show that power consumption is proportional to the degree of heterogeneity. This happens because the powerful machines finish the transfer much more quickly than the less powerful ones and then merely wait to receive the data.
As we showed, the degree of heterogeneity has a large impact on the power consumption of a set of virtual machines performing data-intensive tasks. This also impacts cost, as the cost of energy represents an important component of the cost of services and resources.
These results can also be applied in the scheduling process. Platform schedulers should take the heterogeneity of resources into consideration and schedule tasks on homogeneous sets of virtual machines in order to achieve energy efficiency.
Figure 2.21. Hadoop cluster 2 power consumption
Furthermore, based on these results we built a job scheduling algorithm that takes into consideration the degree of heterogeneity of the set of instances that must execute the scheduled job, respects the deadlines, and is aware of data locality, in order to optimize the power consumption at data center level.
2.3.4 Cost Reduction Strategies
Based on the results presented in Sections 2.3.2 and 2.3.3, the efforts to optimize the cost of a Cloud storage system can be split into two major categories: architecture optimizations (i.e. network overlay and distributed cache) and workload optimizations, such as workflow optimization, identification of common tasks, keeping data in the workflow, and data movement and data processing optimizations.
One of the most important cost reduction strategies is the reduction of energy consumption, which can be achieved at the hardware or middleware level through different energy-efficient methods and techniques. Energy also represents a major component in the cost structure of a service in Cloud environments.
Another method is proposed in [190], which studies the dynamic control of electricity cost in order to provide low unpredictability in power demand and to cut power peaks. The proposed solution minimizes electricity costs, provides low variation in power demand by penalizing changes in workload, and reduces power peaks by tracking the available power budget. Demand variation represents another cost reduction lever for Cloud providers: depending on the load of their servers, providers lower their prices in order to obtain a good distribution of the workload across servers.
Data geo-location represents another method for cost reduction. For example, the modern electric power grid in North America permits dynamic pricing for electricity: the price varies with the region, the time of day, and the power demand [146].
Cost-aware resource management strategies, such as allocation methods, can achieve important cost reductions in Cloud storage systems by optimizing the processing time and performance of data operations, or by adopting cost-aware data strategies that imply a minimum degree of redundancy.
In this PhD thesis we propose three main strategies for reducing the cost of Cloud storage systems. The first refers to the design of an efficient re-scheduling heuristic for data processing, based
on cost-aware data storage. Our purpose is to minimize the execution time of task workflows through efficient re-scheduling, in order to achieve cost reduction, while improving service quality and respecting the user demands as much as possible.
The second strategy for cost reduction implies a task migration heuristic for cost-aware processing and is based on a data diffusion approach. Big Data means large data volumes and large numbers of tasks. As data transfer cost and running time represent major components of the total cost, we show that it is more cost efficient to schedule computation close to the data. Cost optimization is achieved by exploiting data locality and the data access patterns of applications. Resources can be acquired on demand, based on increases in requests, which permits a faster response time for subsequent calls for the same data. Once the request rate drops, the acquired resources can be released.
The third approach for reducing the cost is based on data reduction techniques. As the volume and complexity of Big Data increase, it is essential to reduce these large amounts of data in order to obtain greater insight and accuracy for making good decisions. Moreover, it is very important to reduce the high volumes into meaningful data. The analytics for data reduction can be divided into three categories3:
• Descriptive: mines the data and uses business intelligence in order to provide trending information. It computes descriptive statistics that summarize certain groupings or filtered versions of the data, based on standard aggregate functions in databases. Scenarios for using this type of analytics can be found in management reporting, such as marketing;
• Predictive: forecasts events based on statistical models. It provides various future scenarios for events or situations, based on historical and current facts, using statistical models and data mining. This helps users make better decisions based on relevant data. A good example for this model is a company that wants to predict customer behavior based on customer data;
• Prescriptive: makes use of optimization and simulation to suggest actions. This model analyzes possible actions and provides options based on the descriptive and predictive analyses done previously. The suggested solutions consist of a reliable path to the optimal solution, together with explanations.
2.4 Conclusions and Open Issues
At the data center level, costs are concentrated in servers, power infrastructure, and networking. Low utilization of these resources leads to very low efficiency and business loss. Power consumption also plays a very important role in data center efficiency and cost reduction. Geo-diversifying the location of data centers can improve performance, increase reliability in the event of site failures, and also reduce costs [63].
The main problem with today's cost models is that there is no general, optimized model for estimating costs. An important aspect is the proper estimation of the major cost factors, since any error in their estimates has a large impact on the accuracy of the overall cost estimation. A cost model that takes into consideration the characteristics of the application or service, such as the data pattern, the data transfer, and the average and peak utilization, would be more realistic and would help estimate the cost of running in the Cloud more accurately.
3 http://docs.caba.org/documents/IS/IS-2014-49.pdf
A good estimation of the variable cost factors is also needed; otherwise the cost model will give poor results and estimations.
The main open research issues related to Cloud computing concern PaaS and IaaS Cloud management and Cloud-enabled applications and platforms. Other important challenges are the creation of middleware services to build, deploy, integrate, and manage applications in a scalable, elastic, and multi-tenant environment. These challenges also apply to energy-aware Cloud storage services.
In this chapter we presented a short analysis of Cloud storage services, a model to compute the user cost, and the main challenges for storage services. The main challenge in building Cloud-enabled applications and platforms is to take advantage of the scalability, agility, and reliability of the Cloud. We also presented an approach for evaluating the impact of the heterogeneity degree on the power consumption of a set of virtual machines executing data-intensive jobs in a Cloud computing environment.
A good balance between workloads, usage patterns, and virtual machine computing power is mandatory in order to achieve power efficiency. If a virtual machine has low utilization but keeps running, more power is consumed. As a consequence, virtual machines should be dynamically adjusted to match the characteristics of the other virtual machines performing the job. In this way the degree of heterogeneity decreases and the virtual machines can finish the data transfer simultaneously, reducing the energy consumed. The key issue is to reduce the idle time of the used resources.
Results show that power consumption is proportional to the degree of heterogeneity. This happens because the powerful machines finish the transfer much more quickly than the less powerful ones and then merely wait to receive the data. As we showed, the degree of heterogeneity has a large impact on the power consumption of a set of virtual machines performing data-intensive tasks. This also has an impact on cost, because energy consumption represents an important cost factor.
3 | Resource Allocation Methods for Cost-Aware Data Management
In this chapter we present cost-efficient scheduling solutions at datacenter level for processing data-intensive applications in Cloud environments. We take into consideration the distributed nature of these applications and techniques for failure management in order to achieve cost-efficient resource allocation.
Distributed file systems such as the Google File System (GFS), the Hadoop Distributed File System (HDFS), Amazon S3, Lustre, BlobSeer, etc. are the main components in the I/O stack of Cloud storage systems and use a general solution to serve I/O requests for all types of files generated by applications. From the cost optimization perspective there is a fundamental tradeoff between file size, data access time, and replication degree. In Section 3.1 we present a cost analysis of I/O requests for different file sizes. This analysis represents the basis of the cost-aware solutions for resource management in Cloud storage systems. The proposed solutions can be applied in the CyberWater project presented in Chapter 5. In this project we collect water quality data from various geographically distributed sources (e.g. sensors, water treatment plants, third-party institutions). All these data have to be analyzed and stored in a central location. So, in order to optimize the cost of transfers and processing operations, we need to know important parameters such as the optimal file size and the access time to data from the data locations. These parameters help us decide whether to transfer data to a central location or to perform computations close to the data.
Distributed systems are prone to errors. Reducing the failure rate in a Cloud environment can reduce the cost of resource usage. Consequently, re-scheduling strategies are needed in order to minimize the cost of running data-intensive computing applications. In Section 3.2 we aim to minimize the execution time of workflow tasks through an efficient re-scheduling service. In this way we improve service quality and respect the user demands as much as possible.
Data locality and data access patterns are important factors that can be exploited for cost optimization of data processing operations. Big Data environments deal with very large data volumes and numbers of tasks. In Section 3.3 we present a task migration method for cost-effective data processing. Considering the data transfer cost and time, we show that it is more cost efficient to schedule computation close to the data when dealing with large data volumes.
We end this chapter with the conclusions and present the open issues. The contributions of
this chapter were published in [133] and [151].
3.1 Cost Analysis of Distributed File Systems
In this section we aim to evaluate the cost of I/O requests for different file sizes by benchmarking the storage system. Distributed file systems include three main components: a metadata server, object storage servers, and clients. These servers run on commodity Linux machines. Furthermore,
it is easy to run both an object storage server and a client on the same machine, as long as machine
resources permit.
Figure 3.1. General architecture of parallel file systems
Figure 3.1 presents the general architecture of distributed file systems. This architecture is used by different distributed file systems such as the Google File System (GFS), Amazon S3, the Hadoop Distributed File System (HDFS), Lustre, BlobSeer, etc. The particular model presented in Figure 3.1 is GFS [59], a proprietary distributed file system developed by Google and specially designed to provide efficient, reliable access to data using large clusters of commodity servers. These file systems achieve reliability by replicating the data across multiple servers.
Files are divided into fixed-size data blocks, and every block has a unique identifier (handle) assigned by the metadata server at the time of block creation. Object storage servers store data blocks as Linux files on their local disks and read or write block data. For instance, a typical file system block size under Linux is 4 KB. By default, for reliability, each block has three replicas on multiple object storage servers. Both the block size and the replication factor are configurable by the management system. For sensitive and important data the replication factor can be higher in order to improve reliability, but this comes with increased costs. File size is important for cost reduction and for improving the efficiency of the storage system with respect to I/O requests.
The metadata server keeps the file system metadata, sends instructions, and checks the health of the storage servers. File system metadata refers to the namespace, access control information, the mapping from files to blocks, and the locations of blocks. Furthermore, the metadata server manages different tasks such as data block management, garbage collection of orphaned blocks, and block migration between object storage servers. It communicates periodically with every object storage server through heartbeat messages to give it instructions and collect its state. The metadata server namespace is a hierarchy of files and directories. Files and directories are represented by inodes, which record attributes like permissions, modification and access times, namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes, but user-selectable file by file) and each block of the file is independently replicated at multiple storage nodes (typically three, but user-selectable file by file). The metadata server maintains the namespace tree and the mapping of file blocks to DataNodes (the physical location of file data).
For instance, HDFS keeps the entire namespace in RAM. The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image. The persistent record of the image stored in the local host's native file system is called a checkpoint. The metadata server also stores the modification log of the image, called the journal, in the local host's native file system. For improved durability, redundant copies of the checkpoint and journal
can be made at other servers. During restarts the metadata server restores the namespace by reading the checkpoint and replaying the journal. The locations of block replicas may change over time and are not part of the persistent checkpoint.
In order to access a file, a client has to execute three steps. In the first step the client sends an I/O request for the locations of the data blocks comprising the file to the metadata server; in the second step the metadata server sends the block handles and block locations to the client; and in the third step the client contacts the closest object storage server, sends the block handle and block range, and receives the block data. When writing data, the client requests the metadata server to nominate a suite of three storage nodes to host the block replicas. The client then writes data to the storage nodes in a pipeline fashion. Usually there is one metadata server per cluster, which can represent a single point of failure. A cluster can have thousands of storage nodes and tens of thousands of clients, as each storage node may execute multiple application tasks concurrently.
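As an illustration of this three-step read path, the sketch below models the interaction between a client, the metadata server, and the object storage servers; the interfaces (lookup, closest, read_block) are hypothetical names introduced only for this example, not the API of any specific file system.

```python
def read_file(client, metadata_server, path):
    """Illustrative three-step read: (1) ask the metadata server for the block
    handles and replica locations of `path`, (2) receive them, (3) fetch every
    block from the closest object storage server."""
    blocks = metadata_server.lookup(path)        # steps 1-2: [(handle, [locations]), ...]
    data = bytearray()
    for handle, locations in blocks:
        server = client.closest(locations)       # pick the nearest replica holder
        data += server.read_block(handle)        # step 3: send handle + range, receive block data
    return bytes(data)
```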
Several distributed file systems have, or are exploring, truly distributed implementations of the namespace. Ceph has a cluster of namespace servers and uses a dynamic subtree partitioning algorithm in order to map the namespace tree evenly onto metadata servers. GFS is also evolving into a distributed namespace implementation with hundreds of namespace servers (masters) and 100 million files per master. A file is assigned to a particular metadata server using a hash function applied to the file name.
File system clients are software modules, linked into applications, that implement the file system API and communicate with the metadata server for metadata operations and with the object storage servers for read or write operations.
The distributed architecture introduces a few overheads for clients accessing metadata:
• Access operations need one extra network round trip because the metadata is stored on a separate server. For small files this overhead is amplified, since the gains from accessing the file objects in parallel are limited;
• The distribution of files introduces overheads for looking up files that are stored in directories with a deep tree structure. The metadata server may need more round trips to locate such files because it has to traverse the tree structure;
• Concurrent metadata operations in the case of small file accesses turn the metadata servers into hot spots, especially with metadata-intensive workloads, because of the many I/O accesses to the metadata server and object storage servers. As a consequence, this significantly increases the cost of accessing metadata and degrades I/O performance.
Data I/O parallelism is the main benefit of distributed file systems (i.e. the I/O requests for files are executed in parallel on different object storage servers). For small files, the I/O operations are not efficient, since the total data amount for each request is smaller than for large file I/O. Moreover, the extra round trips before the parallel I/O process are too expensive and unnecessary when only one Input Output System (IOS) is involved in the I/O process, which should be the common case in practice [32]. In the case of large files, there is a risk of transferring more data than is needed.
Recent studies on scientific workloads exhibit significant changes compared with previous results, such as the increase of small files in workloads [182]:
• 58% of the 12 million files on a file system were smaller than 64 KB;
• 30% of the files in the Lustre file system of a cluster located at LLNL, spread over thirty-two file servers, were smaller than 1 MB during the science-run phase;
• up to 30 million files averaging 190 KB were generated by sequencing the human genome.
Also, many computing systems support applications from different fields (such as home,
enterprise, and scratch storage) rather than scientific-only ones. Hence, distributed file systems are expected to experience more diverse I/O access patterns. The authors of [34] studied the files in scratch storage under two parallel file systems and observed that over 55% of files were smaller than 1 megabyte in one system, while this proportion was over 83% in the other.
The small files are split into even smaller data files on different IOSs, and accessing these small data files further degrades the performance of each IOS. The total time T_{total} of an I/O request can be described as follows [193]:

T_{total} = T_{meta} + \max(T_{io,1}, T_{io,2}, \ldots, T_{io,n})   (3.1)

where T_{meta} is the time of accessing metadata, composed of one network round trip and the cost of reading the metadata on the metadata server, and T_{io,i} is the time of accessing the i-th IOS, given by (3.2):

T_{io,i} = N_{round} \times T_{net-round} + T_{datafile-io}   (3.2)

where N_{round} is the number of round trips, T_{net-round} is the cost of a network round trip, and T_{datafile-io} is the I/O cost on the IOS node.
The case of small files differs from that of large files: the parallelism cannot substantially improve the throughput of small file I/O and may even degrade performance. The main drawback of large files, from a cost optimization perspective, is the risk of transferring more data than is needed, wasting resources and money. The following equation gives the total cost T_{small-total} of accessing a small file:

T_{small-total} = T_{meta} + \max(T_{io,1}, T_{io,2}, \ldots, T_{io,n})   (3.3)
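A minimal sketch of this cost model is given below; the numerical values are purely illustrative and are not measurements from our testbed.

```python
def io_request_time(t_meta, rounds, t_net_round, t_datafile_io):
    """Total time of one I/O request following Equations (3.1)-(3.2).
    The three per-IOS lists must have the same length (one entry per IOS)."""
    t_io = [n * t_net + t_data                      # Eq. (3.2) for each IOS i
            for n, t_net, t_data in zip(rounds, t_net_round, t_datafile_io)]
    return t_meta + max(t_io)                       # Eq. (3.1)

# Illustrative comparison: a small file served by one IOS versus a large file
# striped over four IOS nodes (times in milliseconds, chosen arbitrarily).
t_small = io_request_time(2.0, [1], [1.0], [0.5])
t_large = io_request_time(2.0, [2, 2, 2, 2], [1.0] * 4, [8.0, 7.5, 8.2, 7.9])
print(t_small, t_large)
```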
We performed an experiment to measure the throughput obtained by running the file read test for different file sizes. We used a test platform with the Lustre file system, having the topology presented in Figure 3.2. We used three machines with the following characteristics: Intel Pentium 4 Dual Core CPU at 3.2 GHz and 2 GB of RAM. The protocol was TCP/IP and the link speed was 10 GB/s.
In order to test the performance of the read operation for an existing file, we used the IOzone benchmark. IOzone is a file system benchmark tool that generates and measures a variety of file operations such as read, write, etc., and has been ported to many machines and operating systems [139]. We ran the read test for different file sizes in order to evaluate the throughput of the file system. The read test creates a file and then reads it back from the file system.
Based on the results of our analysis, we conclude that in order to improve application performance and to reduce costs we must perform computation close to the data, minimizing in this way the response time of I/O requests. Furthermore, it is more cost efficient to minimize the data transfer cost between the storage system and the computing system and to lower the response errors, which can be very expensive in the case of data-intensive applications. Another factor that improves performance is minimizing the number of I/O requests.
Figure 3.2. General architecture of parallel file systems
Figure 3.3. Throughput for read operation
In the case of the CyberWater project, which collects data from geographically distributed sensors, it is important to transfer an optimal amount of data. Instead of performing inefficient small 4 KB data transfers, we can buffer sensor data locally until an optimal amount of data can be transferred.
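A minimal sketch of such local buffering is shown below, assuming a hypothetical transfer callback and an arbitrarily chosen 4 MB batch threshold.

```python
class SensorBuffer:
    """Accumulate sensor readings locally and ship them in one large transfer
    once an optimal batch size is reached, instead of many tiny transfers."""

    def __init__(self, transfer, optimal_bytes=4 * 1024 * 1024):
        self.transfer = transfer          # callable that sends a batch to the central site
        self.optimal_bytes = optimal_bytes
        self.pending, self.size = [], 0

    def append(self, record: bytes) -> None:
        self.pending.append(record)
        self.size += len(record)
        if self.size >= self.optimal_bytes:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.transfer(b"".join(self.pending))   # one large transfer
            self.pending, self.size = [], 0
```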
3.2 Efficient Re-scheduling Based on Cost-Aware Data Storage
Distributed systems support data-intensive computing on large distributed data sets. The long execution times cause a high failure rate, which leads to resource wastage (hardware, human resources, and electrical power) and increased costs, because computation tasks that have already completed may be lost.
In this section we propose a re-scheduling service for Cloud computing environments at the datacenter level that aims to reduce cost and resource wastage by eliminating fault propagation and re-scheduling task execution in the presence of errors. Other benefits of the proposed solution come from the transparency of the fault tolerance mechanisms and the increased reliability. Furthermore, we analyze the performance of several scheduling algorithms.
Both the cloud user and the cloud provider are interested in reducing the cost of workflow task execution. On one hand, clients that buy IaaS services and execute workflows want to achieve a short execution time and to avoid error propagation, paying in this way less money. On the other hand, cloud providers want to offer workflow execution services that are efficient and error-free, improving in this way the service quality and respecting the user demands as much as possible.
The workflow execution cost consists of three parts: CPU execution time, data transfer (bandwidth) cost, and data storage cost. We therefore aim to minimize the execution time of workflow tasks in the presence of errors through an efficient re-scheduling service based on cost-aware data storage.
From the start, we need to accept that errors will always occur in spite of all efforts to eliminate the faults that might cause them. Cloud computing systems provide redundancy (for data and applications), which is at the core of all fault-tolerance techniques; however, redundancy comes with higher costs. Fault tolerance is an important property of distributed systems because the reliability of resources cannot be guaranteed.
Possible solutions to ensure fault tolerance are replication, migration, and recovery. In the case of replication, data are sent for processing to multiple resources and the results are then compared, but this approach is expensive. In the case of migration, tasks or virtual machines are moved to a new resource; the idea is to prevent faults by taking preventive migration actions. Migration relies on accurate prediction of the location, time, and type of the failure that will occur and should be used together with other techniques such as checkpointing or restart. When migration is combined with other methods, it is difficult to decide, for instance, how frequently the application should be checkpointed or when to restart it from scratch. The commonly used solution is checkpointing the application state periodically during execution. However, checkpointing introduces additional costs when writing the state of a process to persistent storage and, later, when the application is recovered from the checkpoint files.
Our proposed solution represents a cost-aware approach because we neither replicate the workflow execution nor save checkpoint files to the storage system. Our task re-scheduling service detects tasks that are in an error state and re-schedules them, achieving in this way fault transparency. The service therefore has an error detection component that monitors the applications and the hardware resources. To complete the workflow and meet the user requirements, re-scheduling achieves fault transparency and adapts to dynamic situations such as changes in resource availability due to errors. The most important errors, and also the most likely to happen in a Cloud environment, are interaction errors, life-cycle errors, timing errors, omission errors, and physical errors. We discuss them in the description of the error detection component.
3.2.1 Re-Scheduling Service
The main idea of the re-scheduling service is to recalculate and adjust the order in which tasks are executed. Therefore, the re-scheduling mechanism is considered not only for the occurrence of errors, but also to increase performance and reduce overall costs. The costs are higher for dependent tasks than for independent tasks, because data dependences spread more easily an error that is not discovered in time [145].
The proposed re-scheduling service can be used with a wide variety of scheduling algorithms, which are chosen in advance depending on the system structure. Re-scheduling mechanisms can be divided into two categories: regular (periodic) or event-triggered. In our case, the re-scheduling component is called by a monitor that interrogates the system periodically.
Figure 3.4 presents the architecture of the re-scheduling service. First, the user submits a task workflow, a single task, or a bag of tasks. A bag of tasks is specified as a path to the directory containing the configuration files, and a workflow represents a bag of tasks with dependencies. These files contain the description of a scheduling request that specifies the task requirements along with additional information such as the task ID, the path to the executable, the arguments, the input, output, and error files, and the arrival time (the expected time when the task should start). The requirements specified for each request refer to resources (e.g. CPU power, free memory, free swap), restrictions (e.g. processing time, deadline), the number of executions that may occur, and priorities.
Figure 3.4. Re-scheduling Service Architecture
The input consists of a workflow file with a format similar to DAGMan, but simplified. This file is parsed by the Workflow Analysis component, and a directed acyclic graph file is created to represent the tasks. The common approach is to model the workflow as a directed acyclic graph (DAG), where the nodes represent the tasks and the edges represent the dependencies between tasks. DAG scheduling is a combination of two steps, namely task assignment (mapping tasks to available resources) and scheduling (deciding the order of the assigned tasks on each machine). This involves partitioning the application into interacting tasks, taking data dependencies into account.
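A possible in-memory representation of such a scheduling request and workflow is sketched below; the field names are illustrative and do not reproduce the exact configuration file format used by the service.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TaskRequest:
    task_id: str
    executable: str            # path to the executable
    arguments: List[str]
    input_file: str
    output_file: str
    error_file: str
    arrival_time: float        # expected time when the task should start
    cpu_power: float           # resource requirements
    free_memory_mb: int
    free_swap_mb: int
    deadline: float            # restriction: deadline time
    max_executions: int = 3    # number of executions that may occur
    priority: int = 0

@dataclass
class Workflow:
    """A bag of tasks plus dependencies, i.e. a DAG stored as an adjacency map."""
    tasks: Dict[str, TaskRequest]
    predecessors: Dict[str, List[str]] = field(default_factory=dict)  # task -> tasks it depends on
```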
When all the information is set, the schedule is computed by the Scheduler component. First, the selected scheduling algorithm provides the node allocation and the scheduled start time. Then the Execution Service component sets a timer for each task; when the timer expires, it establishes the necessary context and launches the task execution. Every time a task is sent for execution, it is added to the Monitor Service list.
Until the Execution Service has successfully completed the entire graph, the tasks are continuously monitored. When an error occurs in one of the scheduled tasks, the Re-Scheduler component tries to reschedule it. If the task in the error state belongs to a directed acyclic graph (DAG) with
dependencies between nodes, the Re-Scheduler analyzes whether the node where the error occurred can be re-scheduled alone or whether it should be re-scheduled together with all the other nodes that form a dependency sub-graph rooted at the current node.
The implementation of the error detection module inside the Monitor Service component is very important for determining and analyzing the types of errors that may occur. Fault detectors are basic components of fault-tolerant systems. A simple error detection algorithm, often used in practice, consists in exchanging heartbeat messages between processes. It can be summarized as follows: at regular intervals process p sends messages to process q; if the timeout expires before process q receives a new message from p, then q begins to suspect p.
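A minimal sketch of such a heartbeat-based detector follows; the timeout value and the class interface are illustrative, not the implementation used in our service.

```python
import threading
import time
from typing import Dict

class HeartbeatDetector:
    """Process q's side of the heartbeat protocol: record the arrival time of
    every heartbeat and suspect a process whose heartbeat is overdue."""

    def __init__(self, timeout: float = 3.0):
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}
        self.lock = threading.Lock()

    def heartbeat(self, process_id: str) -> None:
        # Called whenever a heartbeat message from `process_id` is received.
        with self.lock:
            self.last_seen[process_id] = time.monotonic()

    def suspected(self, process_id: str) -> bool:
        # q begins to suspect p if no heartbeat arrived within the timeout.
        with self.lock:
            last = self.last_seen.get(process_id)
        return last is None or (time.monotonic() - last) > self.timeout
```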
We consider the following types of errors: timing errors, omission errors, physical errors, and interaction errors. Timing errors are divided into two types, depending on the moment of their appearance: at the beginning of the connection or during communication. The first category is due to the inability to establish a connection, while the second, considered a performance error, occurs when the response time exceeds the time within which the caller expects to receive a reply. Omission errors refer to messages that are delayed or lost. Physical errors include critical conditions of physical resources such as CPU, memory, and storage errors; in this case the resource is declared nonfunctional. Interaction errors are caused by incompatibilities at the level of the communication protocol stack, security, workflows, or timing. Because these conditions occur while the application is running and the context cannot be reproduced in the testing phase, this type of error is the most common.
During execution, a task can be in various states. When the user submits the request to execute a specific workflow of tasks, all the tasks are assigned the "initial" state. The service schedules the specified tasks and sends them to the processors on which they will be executed; at this moment the state of these tasks changes to "submitted". When a particular task is effectively executing on one of the available processors, it is in the "running" state. The last state is "finished", reached when a task completes its job. At any time a failure can occur, which moves the specific task into the "error" state.
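The state machine can be summarized as in the sketch below; the transition of a failed task back to the initial state (so that it can be re-scheduled) is our reading of the description above, not a literal quote of the implementation.

```python
from enum import Enum

class TaskState(Enum):
    INITIAL = "initial"       # workflow submitted, task not yet scheduled
    SUBMITTED = "submitted"   # sent to the processor chosen by the scheduler
    RUNNING = "running"       # effectively executing on an available processor
    FINISHED = "finished"     # the task completed its job
    ERROR = "error"           # a failure occurred

# Transitions observed by the Monitor Service; ERROR -> INITIAL models re-scheduling.
ALLOWED_TRANSITIONS = {
    TaskState.INITIAL: {TaskState.SUBMITTED},
    TaskState.SUBMITTED: {TaskState.RUNNING, TaskState.ERROR},
    TaskState.RUNNING: {TaskState.FINISHED, TaskState.ERROR},
    TaskState.ERROR: {TaskState.INITIAL},
    TaskState.FINISHED: set(),
}
```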
Figure 3.5 presents the behavior of the re-scheduling algorithm in the case of failure detection. The figure shows the initial directed acyclic graph and several stages of its execution, from the initial state until all tasks are completed. This example illustrates how the algorithm works on a graph of 12 tasks with dependencies, executed on two available resources. According to the schedule, each task is submitted at a certain moment in time. In the second graph, tasks C and D are already running and tasks B and E were just submitted. The third graph shows that an error has occurred: at this moment, all tasks that are in the "running" or "submitted" state are canceled and rescheduled. The re-scheduling service constructs a new graph with all the unfinished tasks and sends it to the scheduler. The unfinished tasks are then executed according to the new schedule.
To easily integrate the re-scheduling fault tolerance mechanism into any type of system, we designed a re-scheduling algorithm that can be used in combination with a wide variety of scheduling algorithms, chosen in advance depending on the system structure (here we may consider factors like the number of existing processors and the structure of the task graph that we want to schedule) in order to achieve optimal results. The pseudo-code of the re-scheduling algorithm is presented in Algorithm 3.1.
The algorithm takes as input one independent task or, in the case of dependent tasks, a sub-graph of tasks, together with the scheduling heuristic and the current set of available resources. In the first step the algorithm initializes Scurrent with an initial schedule of the DAG. While the DAG is unfinished, if an error is detected the list R of available resources is updated.
Figure 3.5. Example of execution steps of the re-scheduling service using CCF
Then the remaining section of the DAG is rescheduled using the heuristic H specified as input. After re-scheduling, the task execution order is changed and new associations (task, processor, start time) are built and sent for execution in the same way as the initial schedule. The schedule procedure calls one of the scheduling algorithms (we used HLFET, CCF, ETF, and HybridRemapper) to reschedule the graph section affected by the failure, as indicated in Algorithm 3.1.
CCF is a dynamic list scheduling algorithm that provides good load balancing. The graph is visited in topological order, and tasks are submitted as soon as scheduling decisions are taken. When a task is submitted for execution it is inserted into the Running-Queue (maintained by the Scheduler); when a task is extracted from the Running-Queue, all its successors are inserted into the Children-Queue. These queues can be priority queues, where the priority is assigned based on the t-level (the length of the longest path from the entry node to the node) and the b-level (the length of the longest path from the node to an end node) parameters [149].
ETF is an algorithm that aims at keeping the processors as busy as possible using a simple greedy strategy. It schedules each task to the available processor that can start it as early as possible. In this way ETF hides communication behind computation, which underpins its performance. Regarding the applicability of ETF, the authors of [17] show that it performs well if messages are short and the links are fast, and poorly otherwise.
HLFET is an algorithm based on both list and level scheduling heuristics. It assigns each node a level, or priority, for scheduling. The level of a node is defined as the largest sum of execution times along any directed path from the node N to an end node of the graph, over all end nodes of the graph. The list scheduling algorithm is then invoked using these priorities. HLFET demonstrates near-optimal performance when communication costs are not included [162].

Algorithm 3.1 Generic Re-Scheduling Algorithm
Require: H - heuristic used for re-scheduling, S - schedule, R - available resources
  initialize Scurrent with an initial schedule of the DAG
  while DAG unfinished do
    if error detected then
      update R
      S = schedule(Scurrent, H)
      if Scurrent.task_assoc_to_res != S.task_assoc_to_res then
        Scurrent = S
        execute Scurrent
      end if
    end if
  end while
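A Python rendering of Algorithm 3.1 is sketched below; the dag, monitor, and executor interfaces are assumed abstractions introduced only for illustration, not the actual service components.

```python
def run_with_rescheduling(dag, resources, heuristic, monitor, executor):
    """Generic re-scheduling loop of Algorithm 3.1 (interfaces are hypothetical).
    `heuristic(dag, resources)` returns a mapping {task: (resource, start_time)}."""
    current = heuristic(dag, resources)                  # initial schedule of the DAG
    executor.execute(current)
    while not dag.finished():
        if monitor.error_detected():
            resources = monitor.available_resources()    # update R
            pending = dag.unfinished_subgraph()          # failed task plus its dependent sub-graph
            candidate = heuristic(pending, resources)
            if candidate != current:                     # task-to-resource associations changed
                current = candidate
                executor.cancel(pending)
                executor.execute(current)                # launch the new (task, processor, start) triples
    return current
```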
HybridRemapper is a dynamic algorithm specially designed for heterogeneous environments.
It is based on a centralized policy and improves a statically obtained initial schedule by remapping
to reduce the overall execution time. During application execution, the HybridRemapper uses runtime values for the subtask completion times and machine availability times whenever possible. The
potential of this algorithm to improve the performance of initial static schedules was demonstrated
using simulation studies [108].
3.2.2 Resource Management Tool
This section presents an analysis of two important existing scheduling tools, Condor and PBS, both of which are resource management tools.
Condor is a specialized resource management system developed for compute-intensive tasks, which provides a task queuing mechanism, a scheduling policy, a priority scheme, resource monitoring, and resource management. Usage is as simple as possible: users submit their tasks to Condor, which places them into a queue and chooses when and where to run them based on a policy. It also monitors their progress and finally informs the user upon completion. Condor can be used to manage a cluster of dedicated compute nodes.
PBS offers Torque, an open-source resource manager that provides control over batch jobs and distributed compute nodes. It is a centralized system in which a controller is responsible for the decision-making process and for estimating the state of the system. The controller has the following functions: mediating access to resources, optimally mapping tasks to resources, deploying and monitoring task execution, accessing data during task execution, and presenting results.
We chose Torque instead of Condor because the former may be freely used, modified, and distributed under the constraints of the included license. Moreover, Condor offers many already implemented services, and it is harder to add a new layer that has to override some of these facilities.
A Torque1 cluster consists of one head node and a large number of compute nodes. The head node runs the pbs_server (batch server) daemon and a scheduler daemon. The scheduler interacts with pbs_server to make local policy decisions for resource usage and to allocate nodes to jobs. When pbs_server receives a new job, it informs the scheduler; when the scheduler finds nodes for the job, it sends pbs_server instructions to run the job together with the node list. Then pbs_server sends the new job to the first node in the node list and instructs it to launch the job. The compute nodes run the pbs_mom daemon.
1 http://docs.adaptivecomputing.com/torque/6-0-1/torqueAdminGuide-6.0.1.pdf
Users submit jobs to pbs_server using the qsub command. The commands for submitting and managing jobs can be installed on any host (including hosts not running pbs_server or pbs_mom). For our re-scheduling service, we add a new layer on top of the resource management system, as shown in Figure 3.6.
Figure 3.6. Global Re-scheduling Architecture
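For reference, submitting one task through Torque looks roughly like the sketch below; the job name, resource requests, and executable path are placeholders, and only standard Torque/PBS directives are assumed.

```python
import subprocess
import textwrap

# Hypothetical job script using standard Torque/PBS directives.
job_script = textwrap.dedent("""\
    #!/bin/bash
    #PBS -N demo_task
    #PBS -l nodes=1:ppn=1
    #PBS -l walltime=00:10:00
    /path/to/task_executable input.dat output.dat
""")

with open("demo_task.pbs", "w") as f:
    f.write(job_script)

# qsub prints the job identifier; qstat (polled by the Monitor Service) reports its state.
result = subprocess.run(["qsub", "demo_task.pbs"], capture_output=True, text=True, check=True)
print("submitted job", result.stdout.strip())
```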
3.2.3 Experimental Results and Evaluation
Example of an Execution Flow
We are interested in analyzing the cost efficiency of the re-scheduling service in a realistic scenario. For this reason, the experiment considered two available resources, directly connected to each other, and a set of tasks with dependencies. We also performed several functional test scenarios, divided into the following categories:
• Test that contains a bag of independent jobs. These are useful in order to test the basic functionality and interoperability of all the project components: resource analyzer, scheduler, monitor, and re-scheduler;
• Test that contains workflows with jobs that have interdependencies, but where the output files of one task are not required by another job. The purpose is to test the scheduling algorithm and the correct launch of tasks on the nodes;
• Test that contains a workflow with jobs that have interdependencies also in the form of input-output files; in other words, the output file of one task is the input file of another task. This test differs from the previous one because it targets sandbox synchronization: the output and input files must be in the same place for every node, and the sandbox should not suffer from synchronization issues, for example when an output file is ready on node x, where a task has finished, but is not yet present on node y, where another task needs it, because of the sshfs delay;
• Test that contains workflows with jobs that have interdependencies also in the form of input-output files, where one task fails once. This is a basic test for the monitor and re-scheduler components: we must be sure that the monitor identifies the faulty job correctly and in time and calls the re-scheduler. It is also important which sub-graph is sent to the re-scheduler;
• Test that repeats the setting of the previous test, except that the error occurs with a frequency of 1-5%. This test aims to be as close as possible to a real-life scenario;
• A comparison test that repeats the third test using different scheduling algorithms. The purpose is to compare the performance of each algorithm on different workflows and to decide which one is best for particular inputs;
• Another set of tests that tries to determine several limitations of the solution: the minimal duration of a task and the maximal file size that can be safely transferred without causing overhead and synchronization problems between the sandboxes.
Cost Analysis of Scheduling Algorithms
For the experiments we chose a DAG of 12 tasks that contains a classic master-slave scenario, with sequences of linear dependencies between nodes. This graph is presented in Figure 3.7.
Figure 3.7. Test graph
A node represents a task, identified by an ID and its execution time, and the edges indicate the dependencies between tasks. Each edge has a cost that represents the communication time between virtual machines if the tasks are executed on different resources. Under the assumed conditions, the four experiments generated the scheduling decisions (task assignments) presented in Figure 3.8.
The CCF algorithm offers the best load balancing and the minimum running time. By minimizing the running time and keeping a good load on all machines, the cost of running the resources is also minimized. HLFET produces a schedule that has the same execution time as CCF.
Figure 3.8. Schedule decisions for all implemented algorithms
The ETF and HybridRemapper algorithms generate schedules that need more time for completion. The HLFET scheduling algorithm, as shown in the experimental results, has good time complexity and finishes the scheduling operation in a very short time. The ETF algorithm uses a more complex approach and is therefore more time-consuming in allocating all resources to the corresponding nodes. In spite of that, ETF has sometimes proven more efficient than HLFET.
We also evaluated the performance of our re-scheduling algorithm in the presence of errors, considering two scenarios. The first assumes that an error occurs in a task with a short execution time, which is the best-case scenario. The second assumes that the error occurs in the task with the longest execution time, which is the worst-case scenario.
Figure 3.9 presents the results obtained for the best-case scenario in the presence of errors. Comparing the re-scheduling cost with the re-execution cost, we observe that our approach obtains better results for all considered algorithms. Regarding the scheduling algorithm that should be used with our re-scheduling service, CCF and HLFET obtain the best results. For the re-execution of the DAG, HLFET obtains the best results.
Figure 3.10 presents the results obtained for the worst-case scenario. The smallest re-scheduling cost is obtained, as in the previous scenario, by CCF and HLFET. Comparing the difference between re-scheduling and re-execution costs for the two scenarios, the difference is larger for the second scenario, in which the error occurs on the node with the longest execution time.
Figure 3.9. Execution cost for best case scenario in presence of errors
Figure 3.10. Execution cost for worst case scenario
In order to evaluate the efficiency of the re-scheduling service we defined the following metric:

E = \frac{C_{ree} - C_{res}}{C}   (3.4)

where C_{ree} is the cost of re-executing the DAG, C_{res} is the cost of re-scheduling the DAG, and C is the cost of scheduling the DAG with no errors.
Figure 3.11 presents the efficiency results for the two considered scenarios. For the best-case scenario the efficiency is not as good as for the worst-case scenario, which achieved an efficiency of 58.33% for the ETF algorithm. In the best-case scenario the highest efficiency is obtained by CCF, followed by HybridRemapper and HLFET. In the second scenario the best efficiency is obtained by ETF, followed by HybridRemapper and CCF. We can conclude that our re-scheduling service is more efficient in the case of tasks that have long execution times.
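As a purely hypothetical numerical illustration of this metric (the values below are not the measured costs), a DAG whose error-free schedule costs C = 60 time units, whose full re-execution costs C_{ree} = 100, and whose re-scheduling costs C_{res} = 65 would yield:

E = \frac{C_{ree} - C_{res}}{C} = \frac{100 - 65}{60} \approx 0.5833 = 58.33\%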
The main contribution of this section is the design and implementation of a re-scheduling service on top of the PBS resource manager, in its open-source form, Torque. The purpose of this service is to improve service quality and to reduce the cost of scheduling in the presence of errors.
Another important contribution is the critical analysis of several scheduling algorithms for independent and dependent task scheduling. This analysis was the basis for selecting and implementing the heuristics of the proposed re-scheduling algorithm. A further contribution is the validation of the proposed re-scheduling service by integrating and testing it in a real environment.
Our solution presents some limitations.
Figure 3.11. Re-scheduling efficiency
First of all, if multiple workflows are submitted, they are scheduled one after the other: resources are allocated for the first workflow, and only after its completion is the second workflow submitted. Second, we have only one manager that performs resource allocation, which can also represent a single point of failure. Furthermore, tasks must be accurately profiled in order to obtain an efficient schedule. Another thing worth mentioning is that tasks should have a duration of at least tens of seconds. This is important because of the way the monitor works: it issues a qstat command and parses the output, determining which tasks have finished, with a frequency of one check per second. Thus, a very short task with a duration of just a few seconds could be missed.
3.3 Task Migration for Cost-Effective Data Processing
In this section we present an efficient scheduling algorithm for Many Task Computing that addresses Big Data processing. Considering data transfer costs, efficient utilization of hardware, data management, I/O management, and internode communication performance in Cloud environments, we argue that it is more cost efficient to schedule computation close to the data through task migration.
Figure 3.12 presents our processing model. We have a set of heterogeneous, geographically distributed data sources. A layer of local processing datacenters near the data sources stores and processes these data. A central processing datacenter needs to perform different tasks on these data; therefore, instead of transferring the data to the central processing center, we process the data close to where they reside by migrating tasks to the local processing datacenters.
The scheduling problem that we want to address is stated as follows: we are given a finite set of resources and an application with an infinite number of tasks. Each task must be executed on a specific machine and has specific computational and data requirements. A machine can process one task at a time and preemption is not allowed, which means that once a task starts its execution it can no longer be interrupted. An additional constraint is represented by the deadline of each task, which must be taken into consideration when scheduling. We have to schedule and send for execution all the tasks with minimum penalties and high throughput. Moreover, regarding the model, we consider the case of data-dependent tasks.
Figure 3.12. Data processing model for task migration
However, at some point a data-dependent model may enforce execution dependencies. As the workload represents a large amount of complex information received at high speed, in this section we propose a hybrid scheduler designed for Big Data environments that deals with these requirements. Regarding the available resources, each machine is responsible for the execution of an infinite number of tasks; by infinite we mean the repetitive and non-deterministic execution of tasks.
The solution combines two heuristics in order to provide a scheduling algorithm that takes the deadlines into consideration and satisfies a data-dependent task model, with the goal of minimizing the usage cost of Cloud computing services and resources.
A concrete application of our scheduling model is the CyberWater project. The purpose of this project is to use advanced computational and communications technology in order to implement a new framework for managing water and land resources in a sustainable and integrative manner. The main focus is on acquiring diverse structured and unstructured datasets from various sources, such as sensor networks, the web, and regulatory institutions, into a common digital platform that is subsequently used for storage, processing, and analysis, in order to support routine decision making in normal conditions and to provide assistance in critical situations related to water and the environment in general, such as accidental pollution or flooding.
Hybrid scheduler for Many Task Computing
In this section we introduce the proposed model and the algorithms that were combined in order to obtain our hybrid scheduling algorithm for Big Data environments.
The Many Task Computing (MTC) paradigm bridges the high-throughput computing and high-performance computing paradigms. It uses many computing resources for large numbers of short computational tasks (either independent or dependent). There are four main problem types with respect to data size and number of tasks (see Figure 3.9): tightly coupled MPI applications; analytics such as data mining (MapReduce); loosely coupled applications involving many tasks; and data-intensive many-task
computing, with many tasks and large datasets. Typical many-task computing tasks are small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. Furthermore, "the set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely coupled or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large." The suitable applications for many-task computing are the loosely coupled, communication-intensive ones. Moreover, many-task computing focuses on data-intensive applications, as nowadays there is a large gap between the available processing power and the storage performance. Many-task computations include multiple distinct activities, coupled via files, shared memory, or message passing [154].
In our model we consider the case of low virtual machine heterogeneity and high task heterogeneity. Due to the heterogeneity of tasks and workloads and the data dependencies between tasks, the algorithm consists of two phases: the task selection phase and the machine selection phase for that task.
In order to build our hybrid scheduling algorithm we combined two heuristics, min-min and min-max. For these heuristics we compute the estimated execution time of all tasks on all available resources. We denote by E_{i,j} the estimated execution time of task T_j, j = 1, ..., n, on resource R_i, i = 1, ..., m, and by W_i, i = 1, ..., m, the previous workload on resource R_i. The time needed for R_i to finish the execution of all allocated tasks is given by Equation 3.5:

\sum_{j=1}^{n} E_{i,j} + W_i, \quad i = 1, 2, \ldots, m   (3.5)
The min-min and min-max heuristics use two metrics to evaluate the performance of the scheduling algorithm: the makespan and the flowtime. The makespan is the time when the latest task finishes. The flowtime is defined as the sum of the finalization times of all tasks. The formulas for computing the makespan and the flowtime are given in Equations 3.6 and 3.7, respectively, as described in [77].
makespan = \max_{i=1,\ldots,m} \left( \sum_{j=1}^{n} E_{i,j} + W_i \right) \qquad (3.6)

flowtime = \sum_{i=1}^{m} E_{i,j}, \quad j = 1, 2, \ldots, n \qquad (3.7)
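To make the two metrics concrete, the following minimal Python sketch computes the makespan and the flowtime of a fixed task-to-resource assignment, following the textual definitions above; the execution-time matrix E and the assignment vector are hypothetical inputs, not data from our experiments.

# Minimal sketch: makespan and flowtime for a fixed task-to-resource assignment.
# E[i][j] is the estimated execution time of task j on resource i (illustrative values).
E = [
    [4.0, 2.0, 6.0],   # resource R1
    [3.0, 5.0, 1.0],   # resource R2
]
assignment = [0, 1, 1]  # task j -> index of the resource it was mapped to

def makespan_and_flowtime(E, assignment):
    m = len(E)                       # number of resources
    finish = [0.0] * m               # accumulated workload W_i per resource
    flowtime = 0.0
    for j, i in enumerate(assignment):
        finish[i] += E[i][j]         # task j finishes after the previous workload on R_i
        flowtime += finish[i]        # flowtime sums the finalization times of all tasks
    return max(finish), flowtime     # makespan = time when the latest task finishes

print(makespan_and_flowtime(E, assignment))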
Min-min is a heuristic that uses the minimum completion time as its metric: the task that can be completed earliest is executed first. Let U be the set of unmapped tasks that have to be scheduled. Based on these tasks, the set of minimum completion times M = min(completion_time(T_i, M_j)), with 1 ≤ i ≤ n, 1 ≤ j ≤ m, is computed [77]. Each entry of M corresponds to one unmapped task. The next step selects the task with the overall minimum completion time in M. The selected task is assigned for execution on the corresponding resource and removed from the set of unmapped tasks, and the workload of the selected resource is updated. This procedure repeats until no unmapped tasks are left [77]. This heuristic minimizes the flowtime. The pseudocode of this heuristic is shown in Algorithm 3.2.
Algorithm 3.2 Min-Min Heuristic
U = set of unmapped tasks
while U ≠ ∅ do
    Z ← ∅
    for each T_j ∈ U do
        for each R_i, i = 1, 2, ..., m do
            C_ij = W_i + E_ij
        end for
        C_xj = min_{i=1,2,...,m} {C_ij}
        Z ← Z ∪ {C_xj}
    end for
    Select C_qp = min_{C_xy ∈ Z} {C_xy}
    Allocate task T_p to resource R_q
    W_q = W_q + E_qp
    U ← U − {T_p}
end while
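A compact Python sketch of the min-min procedure from Algorithm 3.2 is given below; the execution-time matrix E and the initial workloads W are placeholder inputs used only for illustration.

# Sketch of the min-min heuristic (Algorithm 3.2): repeatedly map the task with the
# overall minimum completion time to the resource that provides that minimum.
# E[i][j]: estimated execution time of task j on resource i; W[i]: current workload of R_i.
def min_min(E, W):
    m, n = len(E), len(E[0])
    unmapped = set(range(n))
    schedule = []                          # (task, resource) pairs in mapping order
    while unmapped:
        best = None                        # (completion_time, task, resource)
        for j in unmapped:
            # resource giving the minimum completion time C_ij = W_i + E_ij for task j
            i = min(range(m), key=lambda i: W[i] + E[i][j])
            c = W[i] + E[i][j]
            if best is None or c < best[0]:
                best = (c, j, i)
        c, j, i = best
        schedule.append((j, i))
        W[i] += E[i][j]                    # update the workload of the selected resource
        unmapped.remove(j)                 # remove the mapped task
    return schedule

print(min_min([[4, 2, 6], [3, 5, 1]], [0.0, 0.0]))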
The min-max heuristic uses two metrics when assigning each task: the minimum completion time and the minimum execution time. The heuristic has two steps and starts from the set of unmapped tasks U. In the first step, the set of minimum completion times over all available machines, M = min(completion_time(T_i, M_j)) with 1 ≤ i ≤ n, 1 ≤ j ≤ m, is computed. In the second step, the task with the maximum ratio between its execution time on the machine that gives the minimum completion time and its minimum execution time is selected for scheduling. The min-max heuristic minimizes the makespan. Algorithm 3.3 presents the pseudocode of this heuristic.
Algorithm 3.3 Min-Max Heuristic
U = set of unmapped tasks
while U ≠ ∅ do
    Z ← ∅
    for each T_j ∈ U do
        for each R_i, i = 1, 2, ..., m do
            C_ij = W_i + E_ij
        end for
        C_xj = min_{i=1,2,...,m} {C_ij}
        E_hj = min_{i=1,2,...,m} {E_ij}
        K_xj = E_xj / E_hj
        Z ← Z ∪ {K_xj}
    end for
    Select K_qp = max_{K_xy ∈ Z} {K_xy}
    Allocate task T_p to resource R_q
    W_q = W_q + E_qp
    U ← U − {T_p}
end while
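For comparison, a similarly minimal Python sketch of the min-max procedure from Algorithm 3.3 follows; again, E and W are placeholder inputs.

# Sketch of the min-max heuristic (Algorithm 3.3): for each task, find the resource x with
# the minimum completion time and the minimum execution time over all resources, then pick
# the task maximizing K = E_xj / min_i E_ij and map it to resource x.
def min_max(E, W):
    m, n = len(E), len(E[0])
    unmapped = set(range(n))
    schedule = []
    while unmapped:
        best = None                                  # (K, task, resource)
        for j in unmapped:
            x = min(range(m), key=lambda i: W[i] + E[i][j])   # min completion-time resource
            e_min = min(E[i][j] for i in range(m))            # minimum execution time
            k = E[x][j] / e_min
            if best is None or k > best[0]:
                best = (k, j, x)
        _, j, x = best
        schedule.append((j, x))
        W[x] += E[x][j]
        unmapped.remove(j)
    return schedule

print(min_max([[4, 2, 6], [3, 5, 1]], [0.0, 0.0]))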
General Model
Our model considers a heterogeneous computing environment; therefore, both the workload and the capabilities of the machine resources are very diverse.
We address the following problem: a finite set R of resource machines and a finite set T of submitted tasks are given. We assume that the tasks have inter-task data dependencies and that preemption is not allowed. When scheduling the tasks, we also take into consideration the deadline of every task.
T = \{T_1, T_2, \ldots, T_n\} \qquad (3.8)
Each machine maintains a task queue with the ready tasks already submitted by the scheduler. On each machine, the tasks in the queue are scheduled in First Come, First Served order.

R = \{R_1, R_2, \ldots, R_m\} \qquad (3.9)
We assume that the sets of tasks and resources are known from the beginning, along with the capabilities of each resource. A resource is defined by the following parameters:

R_i = \{P_i, D_i, Q_i\}, \quad i = 1, \ldots, M, \qquad (3.10)
where:
Pi represents the computing power of resource Ri , in MFLOPS;
Di represents the available disk on resource Ri , in megabytes;
Q_i is the queue of tasks of resource R_i, which are scheduled locally in First Come, First Served order. This queue is empty at the beginning and is filled by the scheduler with tasks.
When tasks are received for scheduling, they arrive with several requirements and parameters. We assume that each task knows its computational and data requirements for execution, and that the deadline of each task is known. A task is defined by the following characteristics:

T_j = \{p_j, d_j, arrivalTime_j, startTime_j, availableData_j\}, \quad j = 1, \ldots, N \qquad (3.11)
where:
p_j represents the processing units needed by T_j to be executed;
d_j represents the disk space required for executing T_j;
arrivalTime_j is the time at which T_j arrives for scheduling;
startTime_j represents the latest time at which the task can be sent for execution, i.e., the deadline of T_j;
availableData_j indicates whether or not the task has its data available so that it can start executing.
The cost function for executing a task T_j on resource R_i is defined as:

C = P_i \cdot p_j + D_i \cdot d_j, \quad i = 1, \ldots, M, \; j = 1, \ldots, N \qquad (3.12)
For the scheduling algorithm, we need to be able to estimate the time required by a task to run on a machine. To do this, we need the processing (P_i) and data (D_i) capabilities of a machine and the processing (p_j) and data (d_j) requirements of a task T_j. With these, the estimated time to compute a task is:

ETC(T_j, R_i) = \frac{p_j}{P_i} + \frac{d_j}{D_i} \qquad (3.13)
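The following Python sketch illustrates the resource and task parameters of Equations 3.10-3.13 and the corresponding cost and ETC computations; the concrete values are hypothetical and serve only as an example.

# Sketch of the task/resource model from Equations 3.10-3.13 (illustrative values only).
from dataclasses import dataclass

@dataclass
class Resource:
    power: float       # P_i, computing power in MFLOPS
    disk: float        # D_i, available disk in megabytes

@dataclass
class Task:
    p: float           # p_j, processing units required
    d: float           # d_j, disk required
    arrival_time: float
    start_time: float  # latest time the task can be sent for execution (deadline)
    data_available: bool

def cost(r: Resource, t: Task) -> float:
    # Equation 3.12: C = P_i * p_j + D_i * d_j
    return r.power * t.p + r.disk * t.d

def etc(r: Resource, t: Task) -> float:
    # Equation 3.13: ETC(T_j, R_i) = p_j / P_i + d_j / D_i
    return t.p / r.power + t.d / r.disk

r = Resource(power=2000.0, disk=4096.0)
t = Task(p=500.0, d=1024.0, arrival_time=0.0, start_time=30.0, data_available=True)
print(cost(r, t), etc(r, t))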
The proposed scheduling model is a hybrid algorithm that combines several heuristics with the non-preemptive version of earliest deadline first.
3.3.1 Proposed Hybrid Model
Our hybrid scheduling algorithm is a combination of the min-min and min-max heuristics. At each moment, there are two sets of tasks: the waiting set and the ready set. A task moves from the waiting set to the ready set once its data becomes available.
The proposed scheduling algorithm uses the following metrics for assigning each task: minimum completion time, minimum execution time, and minimum deadline. At the beginning, all tasks are in the waiting list. When the data dependencies of a task are resolved, it passes from the waiting list to the ready list. We begin by considering the set of ready tasks U. Next, the set of minimum completion times and the set of second minimum completion times over all available machines are computed.
The parameters are calculated as M = min(completion_time(T_i, M_j)), with 1 ≤ i ≤ n, 1 ≤ j ≤ m. From the task with the minimum completion time and the task with the second minimum completion time, the one with the maximum execution time is selected. In the last step, among the tasks selected in the previous step, the task with the earliest deadline is scheduled first. The steps described above repeat until the set of unmapped tasks becomes empty. The pseudocode of the proposed hybrid scheduler is given in Algorithm 3.4.
Algorithm 3.4 Proposed Heuristic
WaitingTask = set of waiting tasks
ReadyTask = set of ready tasks
while WaitingTask ≠ ∅ do
    for each T_j ∈ WaitingTask do
        if T_j has data ready then
            ReadyTask ← ReadyTask ∪ {T_j}
            WaitingTask ← WaitingTask − {T_j}
        end if
    end for
end while
U ← ReadyTask
while U ≠ ∅ do
    Z ← ∅
    for each T_j ∈ U do
        for each R_i, i = 1, 2, ..., m do
            C_ij = W_i + E_ij
        end for
        C_xj = min_{i=1,2,...,m} {C_ij}
        C_yj = min_{i=1,2,...,m; i≠x} {C_ij}
        Z ← Z ∪ {max{E_xj, E_yj}}
    end for
    Select K_qp = min_{K_xy ∈ Z} {K_xy}
    Allocate task T_p to resource R_q
    W_q = W_q + E_qp
    U ← U − {T_p}
end while
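A minimal Python sketch of the proposed heuristic, written according to its three-phase textual description (best and second-best completion-time resources, smaller execution time between the two, earliest deadline first), is shown below; E, W, the deadlines and the ready set are placeholder inputs, and the sketch omits the waiting-list management for brevity.

# Sketch of the hybrid heuristic, following its three-phase description:
# (1) per ready task, find the two resources with the smallest completion time C_ij = W_i + E_ij,
# (2) keep the one of the two with the smaller execution time,
# (3) among the ready tasks, schedule the one with the earliest deadline first.
def hybrid_schedule(E, W, deadlines, ready):
    schedule = []
    unmapped = set(ready)
    m = len(E)
    while unmapped:
        candidates = {}                                # task -> chosen resource
        for j in unmapped:
            order = sorted(range(m), key=lambda i: W[i] + E[i][j])
            first, second = order[0], order[1] if m > 1 else order[0]
            # phase 2: between the two best-completion resources, prefer the smaller E_ij
            candidates[j] = first if E[first][j] <= E[second][j] else second
        # phase 3: earliest deadline first among the ready tasks
        j = min(unmapped, key=lambda j: deadlines[j])
        i = candidates[j]
        schedule.append((j, i))
        W[i] += E[i][j]
        unmapped.remove(j)
    return schedule

print(hybrid_schedule([[4, 2, 6], [3, 5, 1]], [0.0, 0.0], [10, 3, 7], [0, 1, 2]))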
3.3.2 Implementation Details and Experimental Results
To test the proposed hybrid scheduling algorithm, we built a task scheduling simulator. The simulator supports multiple scheduling algorithms, among which first come first served, the min-min heuristic, and the min-max heuristic.
The task scheduling simulator keeps two lists of tasks: the waiting tasks and the ready ones. The waiting tasks are those for which the data dependencies have not yet been resolved, so they are not ready for scheduling. The ready list consists of tasks whose required data is available and which can be scheduled at any time from the moment they entered the ready list. To simulate this behavior, we considered that every 100 milliseconds a batch of between 5 and 5000 tasks, depending on the scenario, passes from the waiting list to the ready list. Regarding the deadlines, every second the deadline of all tasks in both lists, waiting and ready, is decreased by one. The task scheduling simulator reads the configuration of the tasks and the properties of the resource machines from an input file. In this way we can simulate a heterogeneous task and machine environment. The simulator is composed of the Task and Resource entities, which encapsulate the properties defined in the configuration file. Each resource runs in a separate thread and places the tasks received for scheduling in a queue. The tasks in the queue of each resource are executed in First Come, First Served order. To simulate the execution, a sleep time equal to the estimated time to compute of the task on that resource is introduced for that thread. Each supported algorithm runs in a different thread until no tasks are left in the ready and waiting lists. The algorithms consider for scheduling only the tasks in the ready queue. Each algorithm places the selected task in the queue of the chosen resource.
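A minimal sketch of the per-resource execution loop of such a simulator, under the assumptions described above (one thread per resource, FCFS queue, sleep time equal to the ETC of the task), could look as follows; the task identifiers and ETC values are placeholders.

import queue
import threading
import time

def resource_worker(name, task_queue):
    # Each resource thread serves its queue in FCFS order and simulates
    # execution by sleeping for the estimated time to compute (ETC).
    while True:
        task_id, etc_seconds = task_queue.get()
        if task_id is None:          # sentinel: no more tasks for this resource
            break
        time.sleep(etc_seconds)      # simulated execution
        print(f"{name} finished {task_id}")

q = queue.Queue()
worker = threading.Thread(target=resource_worker, args=("R1", q))
worker.start()
q.put(("T1", 0.05))                  # the scheduler places selected tasks in the queue
q.put(("T2", 0.02))
q.put((None, 0.0))                   # stop the resource thread
worker.join()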
To evaluate the proposed hybrid algorithm and to compare its performance with the other scheduling algorithms, we considered the makespan, the flowtime, the number of deadlines met, and the QoS. As QoS measure we used the following formula:

QoS = \frac{makespan}{1 + nmd} \qquad (3.14)

where nmd is the number of missed deadlines.
In our simulation we used a setup with low-heterogeneity workloads and low-heterogeneity resources. The task execution times varied from 25 milliseconds up to 110 milliseconds across the different scenarios.
Along with varying the number of tasks and the number of machines, we also varied the frequency at which tasks became ready for scheduling. Moreover, we varied the deadlines of the tasks according to their number. For example, in the scenario with 500,000 tasks we used deadlines of 500 seconds. Depending on the number of tasks and resources, we also adjusted the deadlines so that they were neither too tight nor too loose. To accommodate the tasks in a reasonable time, we simulated a setup with 100 machines. Moreover, data dependencies were resolved at a rate of 10,000 tasks every 100 milliseconds, so 10,000 tasks became ready for scheduling every 100 milliseconds.
With respect to makespan (see Figure 3.13), the proposed algorithm performs worse than the other two algorithms but, in terms of flowtime (see Figure 3.14), its performance is close to theirs.
The results show that the proposed hybrid scheduler behaves very well in terms of meeting deadlines (see Figure 3.15) in comparison with the other two heuristics. It meets all the deadlines even for 500,000 tasks, whereas the others reach almost 40,000 missed deadlines.
Regarding the QoS measure (see Figure 3.16), our proposed hybrid scheduling
algorithm obtains the best results of the three heuristics. This shows that, for our problem, the hybrid scheduler performs well.
3.3.3 Real Environment Integration
This section describes several possible integrations of the proposed hybrid scheduler for Many Task Computing into Big Data platforms such as Hadoop and OpenStack. Before describing the possibilities and the modifications needed to integrate the algorithm, we first highlight the three main phases of the proposed hybrid scheduling algorithm:
Figure 3.13. Comparison of makespan results (milliseconds)
Figure 3.14. Comparison of flowtime results (milliseconds)
• For each task, select the first and second best resources that minimize the cost function (the sum of the execution time and the previous workload on that machine);
• Between the two resources selected in the first step, choose the one that minimizes the execution time;
• Select the task with the earliest deadline for execution.
All the above steps repeat until no tasks are left unmapped. Only the tasks whose data is ready are considered for scheduling.
Figure 3.15. Comparison of the number of missed deadlines
Figure 3.16. Comparison of QoS results
Integration with Hadoop
Hadoop is the open-source Apache implementation of MapReduce for the distributed processing of huge amounts of data across clusters of computers. It can be installed on a commodity Linux cluster to permit large-scale distributed data analysis. No hardware modification is needed other than possible changes to meet the minimum recommended RAM, disk space, etc. per node. Hadoop2 offers scalability, reliability, fault tolerance, and distributed computing. It includes several modules: Hadoop Common (e.g., libraries that support the other Hadoop modules), the Hadoop Distributed File System (a distributed file system that provides high-throughput access to application data), and Hadoop MapReduce (a programming model for processing and generating large data sets).
Let us consider a scenario with multiple jobs submitted for execution. The jobs have different execution constraints (they need a variable number of slots for mappers and reducers) and different deadlines. When a job is submitted, a schedule test is performed in order to obtain the number of map and reduce slots required for its execution. The three steps described at the beginning of this section can be customized for the Hadoop system as follows:
• Select the two jobs that found the smallest number of available slots for execution;
• Between the two jobs, select the one with the shortest execution time;
• Select the job with the earliest deadline.
Integration with OpenStack
OpenStack3 is an open-source infrastructure-as-a-service platform for public and private Clouds. Basically, OpenStack is an open-source framework designed to create a standards-based, secure, and scalable Cloud computing environment, based on a collection of technology products. The computational component is OpenStack Compute, and the storage component is OpenStack Object Storage, known as Swift, which is a highly available, distributed, masterless software stack. OpenStack is a Cloud operating system that controls compute, storage, and networking resources inside a data center. Through OpenStack, enterprises and service providers can offer computing resources on demand, by provisioning and managing large networks of virtual machines. Developers that deploy Cloud applications can access the resources through APIs, while administrators and users rely on a web-based dashboard. The architecture is flexible, since hardware and software are not proprietary, and it can be integrated with third-party technologies or with legacy systems. It scales horizontally on standard hardware and provides the possibility of managing and automating pools of compute resources. Furthermore, it can be used with a wide variety of virtualization technologies, as well as bare metal or HPC. Among its features, we can mention the distributed and asynchronous architecture, the management of virtualized commodity server resources, live VM management, and VM image caching on compute nodes.
Before customizing the proposed hybrid scheduler for the OpenStack platform, let us consider the following scenario. Suppose a user wants to start more than 100 VMs at a time, and each VM has different resource requirements and a different deadline. In this case, we refer to the deadline as the latest time by which the VM should be started. To implement the proposed hybrid scheduler, we have to adapt the algorithm to the architecture and workflow of OpenStack. By translating the three steps of the proposed scheduling algorithm described above into the OpenStack architecture, we should design our own filter that does the following:
• For each VM, choose the first and second hosts that best match the resource constraints of the selected VM;
• Between the two hosts selected for deploying the VM, choose the one with the least workload;
• Choose for scheduling the VM with the earliest deadline.
2 http://hortonworks.com/hadoop/yarn/
3 http://www.openstack.org/software/openstack-compute/
As described in Section 2.4.4, the OpenStack VM instance scheduler works in two phases: the filtering phase and the weighing phase. To create a new customized filter, one has to inherit from the BaseHostFilter class and implement the host_passes method, which returns true if the filter accepts the host. As parameters, the host_passes method receives the state of the host and the filter properties. Multiple filters can be used simultaneously. There are also several predefined filters that can be extended or combined to offer the functionality required in the first step, among them the ComputeCapabilitiesFilter, the ImagePropertiesFilter, the RamFilter, and the DiskFilter. The ComputeCapabilitiesFilter4 verifies whether the capabilities of a host match the requirements of the VM instance. The ImagePropertiesFilter is used to check whether a host can satisfy the VM image properties. The RamFilter filters the hosts based on their available RAM, and the DiskFilter based on the disk allocation, such that only hosts with enough disk space are considered.
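A hedged sketch of such a custom filter is shown below; it assumes the legacy BaseHostFilter interface with a host_passes(host_state, filter_properties) method and the flavor information carried in filter_properties, and it only illustrates the resource-matching part of the first step.

# Sketch of a custom Nova scheduler filter along the lines described above
# (assumed legacy filter_properties layout; not the thesis's actual implementation).
from nova.scheduler import filters

class DeadlineAwareFilter(filters.BaseHostFilter):
    """Accept only hosts that can satisfy the VM's resource constraints."""

    def host_passes(self, host_state, filter_properties):
        flavor = filter_properties.get('instance_type') or {}
        # Reject hosts that cannot satisfy the requested RAM of the VM.
        if host_state.free_ram_mb < flavor.get('memory_mb', 0):
            return False
        # Reject hosts that cannot satisfy the requested disk of the VM.
        if host_state.free_disk_mb < flavor.get('root_gb', 0) * 1024:
            return False
        return True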
Integration with BlueMix
BlueMix is the platform-as-a-service solution provided by IBM, available in beta version since February 2014. Built on top of IBM's Open Cloud Architecture, it offers a diverse set of services and runtime frameworks enabling developers to "rapidly build, deploy and manage their Cloud applications" [87]. It is stated that Cloud applications built on BlueMix:
• improve the time needed for application or infrastructure provisioning;
• offer flexible capacity;
• address the scarcity of technical resources;
• reduce the Total Cost of Ownership;
• facilitate the exploration of new workloads such as social, mobile or big data.
BlueMix offers a solution not only for developers, but also for the business side and end users. The BlueMix goal is to improve the "exploration of Cloud application capabilities" [6], such as social, mobile or big data challenges, and to leverage the development of future Cloud applications and services. BlueMix offers a large variety of services from different categories, such as [5]:
• runtimes: Java, Node.js, Ruby;
• Web and application: Data Cache, Session Cache, Elastic MQ, Rules, Single Sign On, Travel Boundary, Validate Address, Reverse Geocoding, Geocoding, Redis, RabbitMQ, RapidApps, Cloud Integration, CloudAMQP, RedisLabs, SendGrid, Application AutoScaling, Log Analysis, Twilio;
• Mobile: Push, Internet of Things Cloud, Mobile Data, Mobile Application Security, Mobile Quality Assurance, Twilio, Square;
• Data Management: SQL Database, Cloudant NoSQL, ClearDB, ElephantSQL, MongoDB, PostgreSQL, MySQL;
• Big Data: Analytics Warehouse, Analytics for Hadoop, Time Series Database;
• DevOps: Monitoring and Analytics, Mobile Quality Assurance, Git Hosting, Web IDE, Continuous Integration, Continuous Pipeline, Agile Planning and Tracking, BlazeMeter, Load Impact.
4 http://docs.openstack.org/developer/nova/devref/filter_scheduler.html
The integration with this platform requires building a dedicated service inside the BlueMix PaaS. The service should receive as input the set of tasks and the set of resources, with their capabilities, requirements, and data dependencies. As a result, the service should output the task execution order and the assignment of tasks to the received resources.
In this section we achieved the goal of designing and benchmarking a hybrid scheduling algorithm for Many Task Computing that matches our problem description. The problem we addressed was scheduling tasks in a heterogeneous Big Data environment, taking into account the data dependencies between tasks and the task deadlines. Moreover, in the scheduling phase we considered the requirements of the tasks and the capabilities of the available resources.
We described how this scheduling algorithm can be integrated with various Big Data platforms, such as Hadoop, OpenStack (infrastructure as a service), and BlueMix (platform as a service). With the help of our task scheduling simulator, we compared the results of the proposed algorithm with the min-min and min-max heuristics. In this way we showed that our hybrid algorithm meets the deadlines better than the other two scheduling algorithms and obtains good QoS.
While developing the hybrid scheduling algorithm, we designed and built a task scheduling simulator in order to test the performance of the proposed algorithm. The simulator is built in such a way that adding and testing a new scheduling algorithm is very simple.
The scheduling algorithm addresses a current problem in today's platforms and Big Data environments: the scheduling of many-task workloads with deadlines that arrive at high velocity.
A future improvement to the algorithm could be to load balance the work between the resources. For that, when taking the scheduling decisions we could consider the task queue of each resource; by task queue we mean not only the length of the queue, but also the length of the tasks in the queue. Another future enhancement could be to schedule chunks of k tasks at a time for the cases when there are not many tasks in a time slice.
As future work, the algorithm should be deployed and tested in real environments such as Hadoop, OpenStack, or BlueMix.
3.4 Conclusions and Open Issues
In Section 3.2 we presented an efficient re-scheduling heuristic based on cost-aware data. One of the major disadvantages of distributed systems is the occurrence of errors and resource failures: resources enter and leave the system all the time, and failures can be caused either by a single component's error or by the interaction between components. One of the main fault tolerance methods is re-scheduling. The re-scheduling algorithm proposed in this thesis has an important characteristic: it is generic, because it can be used with a large variety of scheduling heuristics. The proposed evaluation model is based on a series of metrics. This model led to a classification of the scheduling algorithms used together with the proposed re-scheduling procedure, based on their performance in the presence of errors and in the case of re-scheduling. The assessments concluded with the observation that the algorithms with the best performance when no errors occur also tend to achieve the best scores when re-scheduling is needed.
The proposed re-scheduling service can be used with several re-scheduling strategies, whose classification will be developed according to the major types of graphs that need to be re-scheduled. By defining and implementing these classification methods, it becomes possible to analyze which combinations of scheduling algorithms and re-scheduling strategies are the most appropriate to use, depending
on the type of graph.
In Section 3.3 we achieved the goal of designing and benchmarking a hybrid scheduling algorithm for Many Task Computing for cost-effective data processing. The problem we addressed was scheduling tasks in a heterogeneous Big Data environment, taking into account the data dependencies between tasks and the task deadlines. Moreover, in the scheduling phase we considered the requirements of the tasks and the capabilities of the available resources. We described how this scheduling algorithm can be integrated with various Big Data platforms, such as Hadoop, OpenStack (infrastructure as a service), and BlueMix (platform as a service). With the help of our task scheduling simulator, we compared the results of the proposed algorithm with the min-min and min-max heuristics. In this way we showed that our hybrid algorithm meets the deadlines better than the other two scheduling algorithms and obtains good QoS.
A future improvement to the algorithm could be to load balance the work between the resources. For that, when taking the scheduling decisions we could consider the task queue of each resource; by task queue we mean not only the length of the queue, but also the length of the tasks in the queue. Another future enhancement could be to schedule chunks of k tasks at a time for the cases when there are not many tasks in a time slice.
4 | Budget Reduction in Cloud Storage Systems
Working with big volumes of data collected by many applications in multiple storage locations is both challenging and rewarding. Large-scale cyber-infrastructure systems include a large base of heterogeneous, geographically distributed data sources. Cloud storage services can be an ideal candidate for storing all these data, but it is difficult to find the optimal way to select a set of storage providers with respect to the multiple objectives of these types of systems, such as cost optimization, budget constraints, Quality of Service (QoS), and load balancing.
In the context of a multi-Cloud environment, the storage service selection problem arises, which is particularly challenging for cyber-infrastructures. The storage service selection problem is very closely related to the data placement problem. Nowadays, both involve a multi-Cloud environment and lie at the intersection of different research problems such as cost optimization, QoS, and data placement. We have identified a series of challenges, presented next.
The first challenge arises because of the relations that exist between different groups of geographically distributed sensors. These relations are built on the measured values of different parameters. For instance, in a river pollution monitoring platform, in the case of a pollution event, the monitored values of water parameters from sensors must be analyzed, correlated, and validated. In these circumstances, the access time to this data must be minimized as much as possible, so it is mandatory to store the data close to the data sources.
Another challenge arises when there are multiple objectives for the data placement problem, which are sometimes contradictory and cannot be satisfied simultaneously. For instance, to achieve low latency, data must be stored close to the sensors; for analysis purposes, data must be stored in the same datacenter; and for cost efficiency, the Cloud provider or set of Cloud providers offering the best prices in each region must be selected.
A third challenge is represented by the master-slave paradigm, as data stored at different service providers must be gathered in one place for analysis purposes. Read and write latencies are influenced by the location of the master and of the slaves, respectively.
Budget constraints add extra complexity to the service selection problem. All these challenges turn it into a multi-objective optimization problem for which it is hard to find an optimal solution [80].
Techniques such as linear programming have been used to optimize cost in Cloud computing environments [176], [181], [27], [135]. These are general techniques that are well suited to this kind of optimization problem. Moreover, nonlinear optimization problems have been reduced to multiple linear optimization problems.
In this chapter we present two methods for budget reduction in Cloud storage systems. In
the first part of the chapter we present an analysis of the cost for storage services in commercial
clouds. In the second part we present a budget-aware method for the optimal selection of a storage service. In the third part we present a cost-efficient method for the selection of a storage service in the presence of budget constraints. Finally, we end this chapter with conclusions and open issues.
The research results of this chapter were published in [131] and [134].
4.1 Cost of Storage Services in Commercial Clouds
In the context of cyber-infrastructure systems, Cloud storage services can be the ideal candidate for storing large volumes of data, due to characteristics such as geographical spread, elasticity, availability, on-demand capabilities, cost efficiency, the pay-per-use model, and SLAs.
There are plenty of Cloud providers, such as Amazon S3 [142], Google [46], Microsoft Azure [24], Dropbox [47], and so on, that offer Cloud services (e.g., Storage as a Service, Infrastructure as a Service, Software as a Service, etc.). Clients that pay for Cloud storage services can store their data and retrieve it via standard access methods (e.g., PUTs, GETs, REST, SOAP, etc.). In this way, customers do not have to deal with the complexities associated with setting up and managing a data center for the underlying storage infrastructure.
Several criteria are considered when comparing Cloud storage providers: price for storage capacity, upload/download speed, reliability, and support and features (24/7 support, phone, chat, email, tutorials, etc.). Another important aspect is the platform: PC/Mac compatibility, mobile access, support for iOS/Android applications, etc. From the Cloud provider point of view, the most representative commercial products are Amazon EC2 (Elastic Block Store, Amazon Simple Storage Service (S3), Amazon SimpleDB), Windows Azure (Azure storage service and SQL Data Services), and Google App Engine (BigTable and MegaStore).
Cloud storage providers offer services for different data storage models. Every type of data model is built to meet certain requirements for a specific use case, offering the necessary functionality. For instance, file storage is mainly designed for storing files such as pictures, video, and text files. Data models in Cloud storage systems can be classified into the following seven categories: block storage, object storage, instance storage, file service storage, relational database storage, key-value data storage, and semi-structured data storage.
Block storage services offer a very high degree of flexibility, as block devices are attached as conventional disks to a VM, and applications do not have to be modified in order to run in a Cloud environment. A block of data can be mapped either to a locally attached hard drive or to a logical volume from a Storage Area Network [101]. Due to its sharing capabilities, in a multi-tenant environment such as a Cloud storage system, this type of storage introduces fluctuations in performance. If more applications share a disk, they naturally compete for I/O accesses, which leads to different interferences in disk operations [65]; for example, reads conflict with writes [50].
Object storage services, also called object-based storage, organize data in objects. Every object is formed of three parts: data, metadata, and a unique identifier. The data part stores any type of content, such as files, while the metadata contains contextual information about the stored data. This type of storage is very handy when it comes to dealing with exponential data growth. Another issue addressed by object storage is data provisioning management, due to the fact that object-based storage architectures can be scaled up and down by adding more nodes. An object storage service can be accessed via a REST interface to store and access data, and offers cost-effectiveness and ease of use. The trade-offs are related to data consistency, and object storage is not well suited for high performance and availability. An example of a Cloud object store is Walnut [29], which represents a low-level storage layer for different Cloud data
management systems such as Hadoop, MObStor, and PNUTS. The key performance issue is to meet latency and throughput requirements over a wide range of workloads.
Instance storage services represent a type of temporary block-level storage for use with a virtual machine instance. In terms of size, it can vary from 900 MB up to 48 TB. The volumes of this type of data store can be used only with a single virtual machine instance, during its lifetime, and cannot be detached and then attached to another instance.
File service storage refers to services, offered by many providers, that let users manage their files through a web interface or a dedicated client. Data is stored in a hierarchical structure, as on a personal disk, and is synchronized with the data stored locally on the user's personal computer whenever the device is connected to the Internet.
Relational database storage services represent a model where data is stored in a relational database that runs in a Cloud environment and can be set up, managed, and scaled up and down through a web interface or a dedicated client. Relational database management systems are an important component of any IT infrastructure, and in Cloud computing they are offered as a service, called database-as-a-service (DBaaS). The benefits of using DBaaS come from economies of scale and from the pricing model, which is pay-per-use: the user pays only for the resources used in a certain amount of time.
The key-value data model is offered through NoSQL databases, key-value stores, document stores, and tabular stores. In this model, a value corresponds to a key. It has the following characteristics: a simple structure, query speed higher than in relational databases, support for mass storage and high concurrency, and good support for query and modify operations through the primary key. Key-value pair data is the performance driver of the map-reduce programming model. The model has a single key-value index for all data, being similar to a memcached distributed in-memory cache. This type of data is stored in key-value stores, which in general provide a persistence mechanism and additional functionality as well: replication, versioning, locking, transactions, sorting, and/or other features [25].
The main advantages of NoSQL databases are fast reading and writing of data, support for mass storage, ease of expansion, and low cost. The design of NoSQL database systems has several advantages for data handling compared with relational database systems [93]. First, the data storage part, also called key-value storage, focuses on scalability and high performance. Second, the management part provides a low-level access mechanism, which makes it possible to implement data management tasks in the application layer; in contrast, relational databases spread management logic across SQL and stored procedures. Thus, NoSQL systems provide flexible mechanisms for data handling, and application developments and deployments can be easily updated. Furthermore, the schema-free design of NoSQL databases permits modifying the structure of data in applications; moreover, the management layer has policies to enforce data integration and validation [148].
Amazon is one of the most important Cloud computing providers. In terms of Cloud storage, Amazon offers the following types of services: Amazon Simple Storage Service (S3), Amazon Elastic Block Store (EBS), Amazon SimpleDB, and Amazon DynamoDB.
Amazon Simple Storage Service (S3) is an object storage Cloud solution. Data is stored in containers called "buckets". A bucket can store any number of objects, on which write, read, and delete operations can be performed. The maximum size of an object is five terabytes. The user has the following management functions: controlling access to the bucket, viewing access logs, and choosing the AWS region where a bucket is stored. With this last feature, the user can optimize data transfers in terms of latency and cost. The formula for calculating the price
for storage is presented in Equation 4.1:

Cost(m) = \sum_{t=1}^{m} \left[ C_{storage}(t) + C_{reduced\_redundancy}(t) + C_{request}(t) + C_{data\_transfer}(t) \right], \quad t = 1, 2, \ldots, m \qquad (4.1)

where:
m is the number of months;
C_{storage} is the cost for the amount of storage per month ($/GB);
C_{reduced\_redundancy} is the cost for the amount of storage with reduced durability per month ($/GB);
C_{request} = \sum_{i=1}^{n} C_{PUT}(i) + C_{GET}(i) is the cost for the number of PUT and GET requests in a month;
C_{data\_transfer} = \sum_{i=1}^{n} C_{inter-region}(i) + C_{transfer\_out}(i) + C_{transfer\_in}(i) is the cost for data transfer, where C_{inter-region} is the cost for data transfer between two datacenters in different regions, C_{transfer\_out} is the cost for data transfer out of the Cloud datacenter, and C_{transfer\_in} is the cost for data transfer into the datacenter.
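As an illustration of how Equation 4.1 is applied, the following Python sketch sums the cost components for one month; all prices and usage figures are placeholders, not actual Amazon S3 rates.

# Illustrative sketch of the monthly cost structure in Equation 4.1 (placeholder prices).
def s3_monthly_cost(stored_gb, reduced_redundancy_gb, put_requests, get_requests,
                    transfer_out_gb, transfer_in_gb, inter_region_gb, prices):
    storage = stored_gb * prices["storage_per_gb"]
    reduced = reduced_redundancy_gb * prices["reduced_redundancy_per_gb"]
    requests = put_requests * prices["per_put"] + get_requests * prices["per_get"]
    transfer = (transfer_out_gb * prices["transfer_out_per_gb"]
                + transfer_in_gb * prices["transfer_in_per_gb"]
                + inter_region_gb * prices["inter_region_per_gb"])
    return storage + reduced + requests + transfer

prices = {                           # hypothetical price list ($)
    "storage_per_gb": 0.03, "reduced_redundancy_per_gb": 0.024,
    "per_put": 0.005 / 1000, "per_get": 0.004 / 10000,
    "transfer_out_per_gb": 0.09, "transfer_in_per_gb": 0.0, "inter_region_per_gb": 0.02,
}
print(s3_monthly_cost(500, 100, 2_000_000, 10_000_000, 50, 200, 20, prices))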
The Amazon Elastic Block Store (EBS) service offers highly available and reliable block-level storage volumes, primarily designed to be attached to EC2 virtual machine instances running in the same Availability Zone. The volumes persist independently from the life of the instance. The pricing equation for EBS is presented in Equation 4.2:
Cost(m) = \sum_{t=1}^{m} \left[ C_{ssd}(t) + C_{ssd\_iops}(t) + C_{magnetic}(t) + C_{snapshot}(t) \right], \quad t = 1, 2, \ldots, m \qquad (4.2)

where:
m is the number of months;
C_{ssd} is the cost for General Purpose (SSD) volumes ($/GB);
C_{ssd\_iops} is the cost for provisioned SSD volumes;
C_{magnetic} is the cost for magnetic volumes;
C_{snapshot} is the cost for snapshots to Amazon S3.
The Amazon SimpleDB storage service is a non-relational data store that permits users to query data items through web service requests. The advantages of this solution are availability, scalability, and flexibility. The pricing formula for SimpleDB is presented in Equation 4.3:
Cost(m) = \sum_{t=1}^{m} \left[ C_{structured\_data}(t) + C_{machine\_usage}(t) + C_{data\_transfer}(t) \right], \quad t = 1, 2, \ldots, m \qquad (4.3)

where:
• C_{structured\_data} = \sum_{i=1}^{n} C_{nr\_items}(i) + C_{nr\_atr}(i) + C_{size\_atr}(i) is the cost for structured data, where C_{nr\_items} is the cost for the number of items stored, C_{nr\_atr} is the cost for the average number of attributes, and C_{size\_atr} is the cost for the total size of attribute values;
• C_{machine\_usage} = \sum_{i=1}^{n} C_{nr\_BatchPuts}(i) + C_{avg\_NrItemsBatch}(i) + C_{nr\_Gets}(i) + C_{nr\_SimpleSelects}(i), where C_{nr\_BatchPuts} is the cost for the number of BatchPut requests, C_{avg\_NrItemsBatch} is the cost for the average number of items per BatchPut, C_{nr\_Gets} is the cost for the number of Get requests, and C_{nr\_SimpleSelects} is the cost for the number of simple Select requests;
• C_{data\_transfer} = \sum_{i=1}^{n} C_{data\_out}(i) + C_{data\_in}(i) is the cost for data transferred out and in.
Amazon DynamoDB is a service that offers a fully managed NoSQL database. This solution is designed for applications that need consistency, very low (single-digit millisecond) latency, and scalability. It supports two types of data models, key-value and documents, and is suitable for mobile, web, and gaming applications. The price of the service is expressed in Equation 4.4:
Cost(m) = \sum_{t=1}^{m} \left[ C_{dataset\_size}(t) + C_{provisioned\_throughput\_cap}(t) + C_{data\_transfer}(t) \right], \quad t = 1, 2, \ldots, m \qquad (4.4)

where:
• C_{dataset\_size} is the cost for indexed data storage in $/GB;
• C_{provisioned\_throughput\_cap} = \sum_{i=1}^{n} C_{item\_size}(i) + C_{nr\_items\_read}(i) + C_{type\_of\_read\_consistency}(i) + C_{nr\_items\_written\_sec}(i) is the cost for provisioned throughput capacity, where C_{item\_size} is the cost for item size, C_{nr\_items\_read} is the cost for the number of items read per second, C_{type\_of\_read\_consistency} depends on the type of read consistency, and C_{nr\_items\_written\_sec} depends on the number of items written per second;
• C_{data\_transfer} = \sum_{i=1}^{n} C_{data\_out}(i) + C_{data\_in}(i) is the cost for data transferred out and in.
Regarding the pricing scheme, the services provided by Amazon are based on a pay-per-use pricing model, as follows: pay as you go, pay less when you reserve, pay even less per unit by using more, pay even less as AWS grows, and custom pricing.
Google is another Cloud computing service provider. Google offers the following storage services to its users:
Object storage service - an object store that can hold objects up to terabytes in size, in a scalable and highly available environment. The cost formula for the service is given in Equation 4.5:
Cost(m) = \sum_{t=1}^{m} \left[ C_{storage}(t) + C_{reduced\_avail}(t) + C_{a\_ops}(t) + C_{b\_ops}(t) + C_{data\_transfer}(t) \right], \quad t = 1, 2, \ldots, m \qquad (4.5)

where:
• C_{storage} is the cost for standard storage ($/GB);
• C_{reduced\_avail} is the cost for the amount of reduced-availability, lower-cost storage, used for archive data;
• C_{a\_ops} is the cost for class A operations (PUT, POST, GET);
• C_{b\_ops} is the cost for class B operations (GET object, HEAD requests);
• C_{data\_transfer} is the cost for outbound data transfer between different regions.
CloudSQL is a MySQL database that runs in the Cloud, with more or less all the functionality and capabilities of a standard MySQL database. The pricing formula is presented in Equation 4.6:
Cost(m) = \sum_{t=1}^{m} \left[ C_{usage\_type}(t) + C_{instance\_size}(t) + C_{max\_storage}(t) + C_{io}(t) + C_{avg\_hr}(t) + C_{avg\_day\_week}(t) \right], \quad t = 1, 2, \ldots, m \qquad (4.6)

where:
• C_{usage\_type} is the cost for the type of usage;
• C_{instance\_size} is the cost for the instance type;
• C_{max\_storage} is the cost for the maximum amount of storage;
• C_{io} is the cost for I/O operations;
• C_{avg\_hr} is the cost for the average hours/day the server is running;
• C_{avg\_day\_week} is the cost for the average days/week the server is running.
Cloud Datastore is a fully managed, schemaless NoSQL data storage service designed to store non-relational data. It is a scalable, highly available, and consistent solution; moreover, it offers strong consistency for reads and ancestor queries. The pricing formula is given in Equation 4.7:
Cost(m) = \sum_{t=1}^{m} \left[ C_{storage}(t) + C_{write\_ops}(t) + C_{read\_ops}(t) \right], \quad t = 1, 2, \ldots, m \qquad (4.7)

where:
• C_{storage} is the cost for the amount of storage in GB/month;
• C_{write\_ops} is the cost for write operations per month;
• C_{read\_ops} is the cost for read operations per month.
Google's pricing scheme is, in principle, based on the pay-per-use model, but there are also some variations, such as on-demand pricing and sustained-use discounts. The latter scheme applies mainly to instances and assumes a certain degree of predictability in resource usage: it offers automatically calculated discounts for instances that are used more than 25% of a month and, moreover, 30% savings over the on-demand prices if a VM is used for an entire month.
Microsoft is another major player in Cloud computing services. Microsoft offers the following storage services:
• Windows Azure Blob storage is designed to store file data. In this type of storage, a blob can contain text or binary data, such as documents, media files, or application installers;
• Windows Azure Table storage is a NoSQL key-attribute data store used to store structured datasets. This type of storage can hold large quantities of data and provides fast access to it;
• Windows SQL Azure is a storage service that provides a relational database system.
The pricing formula for the Microsoft Azure Cloud storage services is expressed in Equation 4.8:

Cost(m) = \sum_{t=1}^{m} \left[ C_{storage}(t) + C_{file\_prev}(t) + C_{trans}(t) \right], \quad t = 1, 2, \ldots, m \qquad (4.8)

where:
• C_{storage} is the cost for the amount of standard storage;
• C_{file\_prev} is the cost for file preview, i.e., files shared between running virtual machines;
• C_{trans} is the cost for storage operation transactions.
It is also important to note that Microsoft Azure storage services are charged differently based on the type of redundancy.
HP Helion Public Cloud offers storage solutions with distinct capabilities, providing performance, durability, availability, and so on. The Cloud storage solution offered is based on OpenStack technology. The storage overlay is distributed across a global network of servers, minimizing latency and, moreover, ensuring that any user can access a copy of the data from a nearby server. HP offers three types of services:
• block storage;
• content delivery network;
• object storage.
The figures below present the prices charged by different Cloud storage providers for Cloud object storage, in terms of storage capacity and data transfer (Figure 4.1) and PUT and GET requests (Figure 4.2) per month.
Figure 4.1. Cloud storage capacity and transfer price
Figure 4.2. Cloud storage cost for PUT and GET requests
The total cost of a complete storage solution for different Cloud storage providers is presented in Figure 4.3.
Figure 4.3. Total cost for Cloud storage
As we can see, there are many Cloud providers that offer Cloud storage services. Although
they can all achieve good performance1 in terms of scalability and reliability, when it comes to prices the problem gets a bit complicated. They all offer different pricing schemes based on the degree of resource utilization, advance reservation, and on-demand requests. Prices are also differentiated based on the location of the datacenter that offers the storage service. Moreover, the time dimension must be added to the picture: for example, Amazon AWS2 has lowered its prices 47 times in the last six years, Google3 has also cut prices, and so has Microsoft Azure [41].
The problem of cost optimization when buying Cloud services in general, and Cloud storage services in particular, thus becomes a multi-criteria, multi-objective optimization problem.
4.1.1 Pricing Schemes in Public Cloud Storage Services
Cloud storage providers offer a very wide portfolio of services, while clients access them under some financial arrangement. There are plenty of pricing schemes that Cloud storage providers offer. Table 4.1 presents a price comparison between different Cloud storage providers. It is important to understand these pricing schemes in order to minimize the cost when buying Cloud storage services, and also to understand how the bill will be issued under each model. We identified the following cost models:
• Consumption-based pricing - the client pays for the resources that are used (such as storage capacity and bandwidth);
• Subscription-based pricing - the client pays for the use of the service on a monthly basis;
• Advertising-based pricing - the client receives a small amount of resources free of charge, but in exchange receives a lot of advertising;
• Market-based pricing - the price of resources is set by the market, and the client can buy a resource and use it right away.
To calculate the total cost of storing data in the Cloud, the following parameters must be taken into consideration: the storage capacity used on a monthly basis, the cost of outbound data transfer (data transferred outside the Cloud), and the cost of additional features offered by Cloud providers, such as increased redundancy, backup, archiving, and so on.
The big problem when calculating the total cost is that not all Cloud storage providers offer a transparent scheme for cost calculation, offering instead "packages" (such as personal plans, pro plans, or server plans) at a given cost.
Another important aspect is represented by the pricing mechanisms. There are three categories of pricing mechanisms: fixed, differential, and market pricing. The fixed pricing category includes the following billing options: pay-per-use, subscription, and list price. Differential pricing includes service-feature-dependent, customer-characteristic-dependent, volume-dependent, and value-based pricing. The market pricing category includes three types of mechanisms: bargaining, auction, and dynamic market. For example, Google uses a combined approach for different services: the client can choose pay-per-use for the IaaS service model (the user pays per CPU/hour), while for SaaS Google offers a subscription model. The volume-dependent pricing mechanism is used especially in Cloud storage, where the client can obtain a better price for storing a larger amount of data [138].
1 http://www6.nasuni.com/rs/nasuni/images/Nasuni-2015-State-of-Cloud-Storage-Report.pdf
2 http://aws.amazon.com/pricing/?nc2=h_ql_1
3 http://www.rightscale.com/blog/Cloud-cost-analysis/google-has-cut-Cloud-prices-38-percent-2014-google-vsaws-pricing
Table 4.1. Cloud storage provider's price comparison

Cloud Storage Provider | Storage Price ($ per GB per month) | Data Transfer Out ($ per GB) | PUT & GET Requests ($ per 1 million)
Dropbox                | 0.09   | NA   | NA
Amazon S3              | 0.085  | 0.12 | 0.005
Google Cloud Storage   | 0.0449 | 0.12 | 10
Azure                  | 0.5    | 0.12 | 0.1
LiveDrive              | 0.0037 | NA   | NA
Carbonite              | 1.9    | NA   | NA
Copy                   | 0.039  | NA   | NA
Just Cloud             | 0.039  | NA   | NA
Rackspace              | 0.10   | 0.12 | 0.10
HP Cloud Services      | 0.10   | 0.12 |
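As an example of how the figures in Table 4.1 can be used, the following Python sketch estimates the monthly bill of several providers for a given workload; "NA" entries are treated as zero here purely for illustration, and the workload values are hypothetical.

# Compare monthly bills using the rates from Table 4.1 (subset of providers).
providers = {
    # name: (storage $/GB/month, transfer-out $/GB, PUT/GET $ per million requests)
    "Amazon S3":            (0.085,  0.12, 0.005),
    "Google Cloud Storage": (0.0449, 0.12, 10.0),
    "Azure":                (0.5,    0.12, 0.1),
    "Rackspace":            (0.10,   0.12, 0.10),
}

def monthly_bill(storage_gb, transfer_out_gb, requests_millions, rates):
    storage_rate, transfer_rate, request_rate = rates
    return (storage_gb * storage_rate
            + transfer_out_gb * transfer_rate
            + requests_millions * request_rate)

for name, rates in providers.items():
    print(name, round(monthly_bill(1000, 200, 5, rates), 2))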
4.2 Budget-Aware Storage Service Selection
With the adoption of Cloud storage, there are two sides to the cost optimization problem. First, the Cloud storage provider must calculate its total cost of ownership and price its services adequately in order to make a profit and amortize its investment. Second, a Cloud user must calculate the total cost of storing data in the Cloud and minimize it as much as possible. The different cost components and their owners for Storage as a Service are presented in Table 4.2 [128].
Table 4.2. Total cost of ownership perspective

Component                                  | Storage as a Service
Business Process                           | Customer
Business Logic                             | Customer
Middleware Management                      | Customer
Application Licensing / Support            | Customer
OS Management                              | Customer
OS Licensing / Support                     | Customer
Server/Storage/Networks HW / Maintenance   | Cloud Storage Provider
Domestic Utilities                         | Cloud Storage Provider
Maintenance Equipment                      | Cloud Storage Provider
Real Estate                                | Cloud Storage Provider
A major problem with Cloud storage services is the "vendor lock-in" problem, which refers to the dependence on a particular Cloud storage provider. Switching from one provider to another can be expensive, as Cloud storage providers charge for inbound and outbound bandwidth and for data requests. A client who wants to move from one provider to another must pay twice for bandwidth, in addition to the actual cost of online storage. The authors of [2] propose the application of RAID-like techniques at the Cloud storage level, meaning the striping of user data across multiple providers.
In [164], a secure, cost-effective multi-Cloud storage model is proposed, which distributes data economically among the services available in the market, providing customers with data availability as well as secure storage.
We propose a model based on binary linear programming, which uses real information and real scenarios, aiming to offer the best storage scheme with minimum cost.
4.2.1 Storage Service Selection
The Cloud service selection problem is a current and interesting research topic that has been studied in many papers. A first step in solving this problem is the formalization of Cloud service selection by proposing a general methodology for multi-criteria Cloud service selection, as Cloud services can be characterized using multiple criteria such as cost, pricing policy, performance, QoS, and SLA [155]. Another approach is a service selection algorithm based on an indexing technique for managing the information of a large number of Cloud service providers [170]. A model of Cloud service selection that aggregates information from user feedback and objective performance analysis from a trusted third party can be another solution.
Another interesting approach for Cloud storage service selection is presented in [49, 184]. The authors propose the following three contributions: fuzzy sets theory for expressing vagueness in the subjective preferences of the customers, distributed application of fuzzy inference, and a game-theoretic approach for promoting truth-telling among service providers. These approaches are used to overcome the uncertainty in the expression of the customers' subjective preferences and the untrustworthy indications concerning the quality-of-service levels and the prices associated with the offers. However, the main drawback of this approach is its empirical nature; the solution is also validated only with empirical evidence.
The data placement problem in multi-Cloud environments has also been intensely studied. The authors of [81] study the data placement problem of socially aware services in multi-Cloud environments. They aim at a multi-objective optimization for placing users' data over multiple Clouds in the context of socially aware services. A model framework that can accommodate a range of different objectives is proposed; in this framework, the original problem is decomposed into two simpler sub-problems that are solved alternately over multiple rounds. The proposed solution is evaluated on a large group of geographically distributed users with realistic interactions. The obtained placement scheme is compared with random placement and greedy placement, and the gain is quantified in terms of distance to data, inter-Cloud traffic, carbon footprint, workload, convergence, and scalability. However, it must be noted that a globally optimal solution is hard to find, and near-optimal solutions are obtained by relaxing different constraints. An algorithm for the selection of an optimal provider subset for data placement in a multi-Cloud storage architecture, based on an Information Dispersal Algorithm, is presented in [191]. The aim is to achieve a good trade-off between the cost of storage, the algorithm cost, the vendor lock-in problem, transmission performance, and data availability.
The authors of [158] propose an automated Cloud storage service selection based on a machine-readable description of the capabilities, such as features, performance, and cost, of each storage system. These capabilities are processed together with the user's specified requirements. Different use cases are presented in order to evaluate public storage Clouds as well as local Clouds: choosing Cloud storage services for a new application, estimating the cost savings of switching storage services, estimating the evolution of cost and performance over time, and providing information for an Amazon EC2 to Eucalyptus migration. However, the presented approach, analysis, and validation are empirical.
A similar fuzzy-based approach for Cloud storage service selection is presented in [49], with the same three contributions: fuzzy sets theory for expressing vagueness in the subjective preferences of the customers, distributed application of fuzzy inference, and a game-theoretic approach for promoting truth-telling among service providers. As for [184], the main drawback of this approach is its empirical nature; furthermore, the solution is validated only with empirical evidence.
Cost optimization in Cloud environments is closely related to the data placement and service selection problems, and also represents a hot research topic nowadays. In [64], the authors study the cost minimization problem via a joint optimization of task assignment, data placement, and data movements for big data services in geo-distributed data centers. The cost optimization problem is modeled as a mixed-integer nonlinear program, and an efficient solution to linearize it is proposed.
We propose a cost minimization model based on binary linear programming, aiming to find a cost-efficient storage scheme for many heterogeneous data blocks using multiple public Cloud storage providers.
4.2.2 Cost Optimization Model using Linear Binary Programming
We deal with the following problem: a finite set of N heterogeneous data blocks (with different sizes DB_d, 1 ≤ d ≤ N) and a finite set of M Cloud storage providers are given. A specific provider CP_p, 1 ≤ p ≤ M, with a defined cost Cost_p, stores one or more blocks of data. Each provider p has a maximum storage capacity C_p. Each link between a specific location loc and a Cloud provider p has a different transfer cost TC_{loc,p}. Let us consider L locations that can act as data sources (1 ≤ loc ≤ L).
Let x_{d,p} be the binary variable that denotes whether Cloud provider p is selected for data block d. The cost optimization problem can thus be formulated as the following linear binary optimization problem:
\[
\text{TotalStorageCost} = \sum_{d=1}^{N} \sum_{p=1}^{M} x_{d,p} \cdot Cost_p \cdot DB_d
\]
TotalStorageCost is the first function that is subject to minimization.

\[
\sum_{p=1}^{M} x_{d,p} = 1 \quad (1 \le d \le N) \qquad \text{and} \qquad \sum_{d=1}^{N} \sum_{p=1}^{M} x_{d,p} = N
\]
Each data block is stored on exactly one Cloud provider and all data blocks are stored.

\[
\sum_{d=1}^{N} x_{d,p} \cdot DB_d \le C_p \quad (1 \le p \le M)
\]
The storage scheme respects the maximum capacity of each Cloud provider p.

\[
\text{TotalTransferCost} = \sum_{loc=1}^{L} \sum_{p=1}^{M} TC_{loc,p} \cdot DB_{loc}
\]
TotalTransferCost is the second function that is subject to minimization; DB_{loc} is the total data produced in a specific location.
We use a binary linear programming technique to model the cost optimization problem. Linear programming is a method to achieve the best outcome (such as maximum profit or lowest cost) in a mathematical model whose requirements are represented by linear relationships. It is one of the optimization methods most widely used in industry, saving companies significant amounts of money every year.
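As an illustration only (not the thesis implementation), the binary formulation above can be passed to the glpk solver bundled with Octave, which is also used later in this chapter. In the sketch below, the block sizes and storage prices are taken from Table 4.3 and Scenario 1, while the capacities are assumed values chosen large enough that no block has to be split (in the scenarios, splitting is handled by introducing the split parts as separate blocks); only the storage-cost objective is shown, the transfer-cost objective being built analogously.

% Binary linear program of Section 4.2.2 (storage cost only), solved with glpk.
% x is an N*M vector in column-major order: x((p-1)*N + d) = 1
% iff data block d is stored at Cloud provider p.
N = 4;  M = 5;                                    % four data blocks, five providers
DB   = [200 500 450 550];                         % block sizes in GB (Scenario 1)
Cost = [0.085 0.0449 0.050 0.100 0.100];          % storage price per GB (Table 4.3)
Cap  = [1000 1000 1000 1000 1000];                % assumed capacities (no split needed)

c  = reshape(DB(:) * Cost, [], 1);                % c((p-1)*N+d) = DB(d) * Cost(p)
A1 = kron(ones(1, M), eye(N));                    % each block on exactly one provider
A2 = kron(eye(M), DB);                            % capacity constraint of each provider
A  = [A1; A2];
b  = [ones(N, 1); Cap(:)];
ctype   = [repmat("S", 1, N), repmat("U", 1, M)]; % "S" = equality, "U" = upper bound
vartype = repmat("I", 1, N * M);                  % integer variables with 0/1 bounds
lb = zeros(N * M, 1);  ub = ones(N * M, 1);

[xopt, total_storage_cost] = glpk(c, A, b, lb, ub, ctype, vartype, 1);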
The CyberWater scenario serves as a proof of concept for our proposed model. CyberWater (a cyber-infrastructure for natural resource management) [121] aims to create a repository for data concerning polluted water management. All these data must be stored and managed in a cost-effective way. The sources of data are heterogeneous (e.g. sensors, data from public institutions), come in various formats (e.g. xml, xls, csv, pdf) and are geographically distributed [122]. There are two possible solutions to deal with this situation: the first is to store all data in a private data centre and bear all the costs for storage and management; the second is to store data in different locations, at different Cloud storage providers that are near the data sources. The question that arises here is how to achieve the optimal cost for storing and managing the data if it is stored at different Cloud providers. We impose the requirement that all data blocks be stored, so we may face a data splitting problem. This splitting is necessary for the optimization process, but it is transparent to the end user.
We chose five Cloud providers. Table 4.3 presents the storage cost per 1 GB of data for one month and the cost of transfer from the data sites to each storage provider ($/GB). We also consider four geographical areas of Romania where data are produced. The data in Table 4.3 are taken from the Cloud providers' public web sites.
Table 4.3. Data storage price (per GB per month) and cost of data transfer from data sites to storage providers ($/GB)

Provider     Storage price ($/GB/mo)   Romania S-E   Romania S-W   Romania N-E   Romania N-W
Amazon       $0,085                    $0,08         $0,11         $0,12         $0,12
Google       $0,0449                   $0,08         $0,11         $0,12         $0,12
Azure        $0,050                    $0,08         $0,11         $0,12         $0,12
Rackspace    $0,100                    $0,07         $0,10         $0,12         $0,12
HP           $0,100                    $0,08         $0,11         $0,12         $0,12

4.2.3 Experimental Results
We present three scenarios for the CyberWater project, aiming to show that it is possible to optimize the initial storage scheme and reduce the final costs. In all scenarios we considered five Cloud storage providers (Amazon, Google, Azure, Rackspace and HP), four geographical areas of Romania (S-E, S-W, N-E, N-W) and different data blocks (DB).
Scenario 1: We set up the initial capacity of all Cloud providers with specified values. For the considered data blocks, the proposed optimization scheme distributes the data to all providers and splits the data for DB4 into two different blocks. The final capacity is equal to the initial capacity, so there is no optimization.
Table 4.4. Scenario 1

Amount of data to store (GB)       Amazon    Google    Azure     Rackspace   HP        Total (= DB)
Romania S-E                        0         0         0         0           200       200
Romania S-W                        0         0         0         400         100       500
Romania N-E                        0         250       200       0           0         450
Romania N-W                        200       350       0         0           0         550
Total Data Stored                  200       600       200       400         300

                                   Amazon    Google    Azure     Rackspace   HP        TOTAL
Initial Capacity                   200       600       200       400         300
Total Cost for Initial Capacity    $17,00    $26,94    $10,00    $40,00      $30,00    $123,94
Final Capacity                     200       600       200       400         300
Total Cost for Final Capacity      $17,00    $26,94    $10,00    $40,00      $30,00    $123,94
Cost for data storage (Final)      $17,00    $26,94    $10,00    $40,00      $30,00    $123,94
Cost of data transfer              $24,00    $72,00    $24,00    $40,00      $27,00    $187,00
Final Gain: 0%                                                           TOTAL COST: $310,94
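As a check on the figures in Table 4.4, the total cost for the initial capacity is simply each provider's capacity multiplied by its unit storage price from Table 4.3:

\[
200 \times 0.085 + 600 \times 0.0449 + 200 \times 0.050 + 400 \times 0.100 + 300 \times 0.100 = 17.00 + 26.94 + 10.00 + 40.00 + 30.00 = \$123.94.
\]

The per-provider transfer costs are obtained in the same way from the $/GB transfer prices in Table 4.3 and the amounts of data moved from each region.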
Scenario 2: We set up the initial capacity of all Cloud providers with a high value. For the considered data blocks, the proposed optimization scheme splits DB1 and DB4 into two different blocks. The final capacity is less than the initial capacity, so there is a 64% cost optimization for storage. We can observe that the Google and Azure Cloud providers are used at their maximum capacity, having the lowest prices.
Scenario 3: We set up the initial capacity of the Google Cloud provider with a high value, considering that it has the lowest price. There are no split blocks and all of them are stored on Google. The optimization is 80% with respect to the initial scheme.
We can observe that for all considered scenarios the cost of data transfer is higher than the storage cost, so this is an important cost factor that must be considered in any optimization model.
Table 4.5. Scenario 2

Amount of data to store (GB)       Amazon    Google    Azure     Rackspace   HP        Total (= DB)
Romania S-E                        0         450       550       0           0         1000
Romania S-W                        0         500       0         0           0         500
Romania N-E                        0         0         450       0           0         450
Romania N-W                        500       50        0         0           0         550
Total Data Stored                  500       1000      1000      0           0

                                   Amazon    Google    Azure     Rackspace   HP        TOTAL
Initial Capacity                   1000      1000      1000      1000        1000
Total Cost for Initial Capacity    $85,00    $44,90    $50,00    $100,00     $100,00   $379,90
Final Capacity                     500       1000      1000      0           0
Total Cost for Final Capacity      $42,50    $44,90    $50,00    $0,00       $0,00     $137,40
Cost for data storage (Final)      $42,50    $44,90    $50,00    $0,00       $0,00     $137,40
Cost of data transfer              $60,00    $97,00    $98,00    $0,00       $0,00     $255,00
Final Gain: 64%                                                          TOTAL COST: $392,40
Table 4.6. Scenario 3

Amount of data to store (GB)       Amazon    Google    Azure     Rackspace   HP        Total (= DB)
Romania S-E                        0         1000      0         0           0         1000
Romania S-W                        0         500       0         0           0         500
Romania N-E                        0         450       0         0           0         450
Romania N-W                        0         550       0         0           0         550
Total Data Stored                  0         2500      0         0           0

                                   Amazon    Google    Azure     Rackspace   HP        TOTAL
Initial Capacity                   1000      5000      1000      1000        1000
Total Cost for Initial Capacity    $85,00    $224,50   $50,00    $100,00     $100,00   $559,50
Final Capacity                     0         2500      0         0           0
Total Cost for Final Capacity      $0,00     $112,25   $0,00     $0,00       $0,00     $112,25
Cost for data storage (Final)      $0,00     $112,25   $0,00     $0,00       $0,00     $112,25
Cost of data transfer              $0,00     $255,00   $0,00     $0,00       $0,00     $255,00
Final Gain: 80%                                                          TOTAL COST: $367,25
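The reported gains are consistent with computing the gain as the relative reduction of the storage cost between the initial and the final capacity schemes; for Scenarios 2 and 3:

\[
1 - \frac{137.40}{379.90} \approx 0.64 = 64\%, \qquad 1 - \frac{112.25}{559.50} \approx 0.80 = 80\%.
\]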
We can conclude that the final gain depends on the initial data demand; however, in some cases it is surprisingly high (e.g. 80%), so this method can be successfully applied to any scenario involving public Clouds.
4.3 Budget Constrained Storage Service Selection
In this section we study the multi-objective optimization problem of storage service selection with budget constraints. We start from a real-world scenario and build our mathematical model for the optimization problem. Then we propose a linear programming technique to find a near-optimal solution for the service selection problem.
4.3.1 Optimal Storage Service Selection Problem
A realistic view of the proposed model for advanced processing of data collected and aggregated from different heterogeneous sources (e.g. wireless sensor networks, mobile phones, the world wide web, third-party institutions) in datacenters is presented in Figure 4.6. The figure expresses the following data processing model: data collected from different geographically distributed heterogeneous sources are primarily stored at various Cloud storage providers and, for analysis and processing purposes, are brought into a central datacenter.
In order to achieve cost optimization and a good level of QoS in terms of latency, the Cloud user must make the optimal selection between various Cloud storage providers. In order to get an insight into the latency of different services, we performed some measurement tests and then compared them, as can be seen in Figure 4.4. It can be observed that, as expected, the latency varies strongly depending on the location and on the type of service offered. We can conclude that the geographical location of a particular storage service is crucial in order to achieve a good level of QoS.
Figure 4.4. Latency comparison.
Another important metric that must be taken into consideration, especially in budget optimization, is the price per GB of the different service providers. In order to evaluate the price of the storage service, we compared multiple providers in terms of the price of storage capacity and the price of data transfer outside the Cloud computing platform. Figure 4.5 presents the price comparison between different Cloud storage providers. As we can see, Microsoft Azure Object Storage us-central has the lowest price, but if we correlate it with latency this service provider does not offer the optimal solution.
Taking into consideration only the two metrics presented above, we can state that the optimal storage service selection problem becomes a multi-objective optimization problem. In the next section we present the mathematical modeling of our problem.
Figure 4.5. Price comparison between service providers.
4.3.2 Problem Modeling
We consider the following model: n data locations describe a set of geographically distributed sources of data (e.g. WSN, mobile data, web data), each source L_i having Data(L_i) amount of data, i = 1 ... n; m Cloud storage providers that can be accessed from any data location, each provider C_j, j = 1 ... m, being able to store a virtually unlimited amount of data,

\[
DataCapacity(C_j) \ge \sum_{i} Data(L_i), \quad \text{for any } C_j; \tag{4.9}
\]

and, at any moment, a processing datacenter D that has the capability to process any amount of data collected from the Clouds.
We introduce the following cost functions:
• cost(L_i, C_j) = c_{ij} represents the transfer cost of data from location L_i to a Cloud storage provider C_j (for example latency in ms, or transfer price in EUR/GB from the EU);
• cost(C_j, D) = c_{jD} represents the transfer cost of data stored in a Cloud location C_j to the datacenter where the data will be processed (costs similar to c_{ij});
• cost(C_j) = b_j represents the budget needed to store data at a specific Cloud storage provider C_j (represented by the price in EUR/GB).
To describe the optimization problem for budget-constrained selection of Cloud storage services, we introduce the binary assignment matrix A with the following meaning: A(i, j) = 1 if the entire amount of data from location L_i, denoted by Data(L_i), is stored on Cloud provider C_j. Based on this definition, we have the following properties:

1. A(i, j) ∈ {0, 1};

2. \sum_{j=1}^{m} A(i, j) = 1, for any i = 1 \dots n, which means that the entire amount of data from a location L_i is stored on a single Cloud storage provider;

3. 0 \le \sum_{i=1}^{n} A(i, j) \le n, for any j = 1 \dots m, which means that we may not use a specific Cloud storage provider at all (if \sum_{i=1}^{n} A(i, j) = 0), or we may use a single Cloud storage provider for all locations (if \sum_{i=1}^{n} A(i, j) = n).
Let us denote by Data(C_j) the total amount of data gathered from one or more geographical locations by a Cloud storage provider C_j after a feasible assignment is computed. So, we have:

\[
Data(C_j) = \sum_{i=1}^{n} A(i, j) \times Data(L_i). \tag{4.10}
\]
Now, the optimization problem can be described, under a total budget constraint B and a request D (D is an m-element binary array that specifies a data processing request; if D_j = 1 then all data from Cloud provider C_j will be transferred to the datacenter), as follows:

\[
\min_{j} \left\{ \sum_{i=1}^{n} A(i, j) \times c_{ij} \times Data(L_i) \right\}, \quad j = 1 \dots m;
\qquad
\min \left\{ \sum_{j=1}^{m} D_j \times c_{jD} \times Data(C_j) \right\};
\qquad
\sum_{j=1}^{m} b_j \times Data(C_j) \le B. \tag{4.11}
\]
with the following bounds:

\[
Data_{min} \le Data(L_i) \le Data_{max}, \quad i = 1 \dots n;
\qquad
\sum_{j=1}^{m} D_j \times Data(C_j) \le DataCapacity(D). \tag{4.12}
\]
We can introduce a relaxation factor ε < 1 in the last formula of Equation 4.11, as follows:

\[
\sum_{j=1}^{m} b_j \times Data(C_j) \le B(1 + \varepsilon). \tag{4.13}
\]

In this way, we can control errors such as no primal/dual feasible solution, no convergence, or the iterations/time limit being exhausted, which can appear while solving this optimization problem.
4.3.3 Linear Programming Algorithms
The linear programming technique is defined as the problem of finding a vector x in order to minimize a linear function f^T x subject to linear constraints: \min_x f^T x such that one or more of the following hold:

\[
A \cdot x \le b, \qquad A_{eq} \cdot x = b_{eq}, \qquad l \le x \le u.
\]
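As a small, purely illustrative example of how this general form maps onto the glpk solver used throughout this chapter (the numeric values below are arbitrary and not taken from the thesis experiments), the inequality and equality constraints are stacked into a single matrix and distinguished through the constraint-type string:

% General LP form solved with glpk: min f'x  s.t.  A*x <= b,  Aeq*x = beq,  l <= x <= u.
f   = [1; 2];                        % objective coefficients (arbitrary)
A   = [1 1];    b   = 10;            % inequality:  x1 + x2 <= 10
Aeq = [1 -1];   beq = 2;             % equality:    x1 - x2  = 2
l   = [0; 0];   u   = [8; 8];        % bounds

Afull = [A; Aeq];  bfull = [b; beq];
ctype   = "US";                      % row 1: upper-bound inequality, row 2: equality
vartype = "CC";                      % two continuous variables
[x, fval] = glpk(f, Afull, bfull, l, u, ctype, vartype, 1);   % sense 1 = minimization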
Figure 4.6. The architecture of advanced processing in a datacenter for multiple distributed data sources.
There are many linear programming algorithms, such as the interior-point linprog algorithm, the interior-point-legacy linprog algorithm, the active-set linprog algorithm, the linprog simplex algorithm and the dual-simplex algorithm. Next, we describe the interior-point linprog algorithm.
Interior-Point linprog Algorithm
The linprog interior-point algorithm shares many features with the linprog interior-point-legacy
algorithm. These algorithms have the same general outline:
• presolve - simplification and conversion of the problem to a standard form;
• generate an initial point - the choice of an initial point is important for running the algorithm efficiently; this step can be time-consuming;
• predictor-corrector - the phase of iterations for solving the Karush-Kuhn-Tucker (KKT) equations.
Presolve
The presolve phase is the starting point of the algorithm. The problem is simplified by removing redundancies and simplifying constraints. The tasks performed during this phase are the following:
• check if any variables have equal upper and lower bounds. If so, check for feasibility, and
then fix and remove the variables.
• check if any linear inequality constraint involves just one variable. If so, check for feasibility,
and change the linear constraint to a bound.
• check if any linear equality constraint involves just one variable. If so, check for feasibility,
and then fix and remove the variable.
• check if any linear constraint matrix has zero rows. If so, check for feasibility, and delete the
rows.
• check if the bounds and linear constraints are consistent.
• check if any variables appear only as linear terms in the objective function and do not appear
in any linear constraint. If so, check for feasibility and boundedness, and fix the variables at
their appropriate bounds.
• change any linear inequality constraints to linear equality constraints by adding slack variables.
If the algorithm detects an infeasible or unbounded problem, it halts and issues an appropriate exit message. The algorithm might arrive at a single feasible point, which represents the solution. If the algorithm does not detect an infeasible or unbounded problem in the presolve step, it continues, if necessary, with the other steps. At the end, the algorithm reconstructs the original problem, undoing any presolve transformations; this final step is the postsolve step. For simplicity, the algorithm shifts all lower bounds to zero. While these preprocessing steps can do much to speed up the iterative part of the algorithm, if the Lagrange multipliers are required, the preprocessing steps must be undone, since the multipliers calculated during the algorithm correspond to the transformed problem and not to the original one. Thus, if the multipliers are not requested, this back-transformation is not computed, which might save some computation time.
Generate Initial Point
To set the initial point, x0, the algorithm does the following:
• initialize x0 to ones(n,1), where n is the number of elements of the objective function vector
f.
• convert all bounded components to have a lower bound of 0. If component i has a finite upper bound u(i), then x0(i) = u(i)/2.
• for components that have only one bound, modify the component if necessary to lie strictly
inside the bound.
• to put x0 close to the central path, take one predictor-corrector step, and then modify the
resulting position and slack variables to lie well within any bounds.
Predictor-Corrector
The algorithm tries to find a point where the KKT conditions hold. To describe these equations, consider the standard form of the linear programming problem after preprocessing:

\[
\min_x f^T x \quad \text{subject to} \quad \bar{A}\,x = \bar{b}, \qquad x + t = u, \qquad x, t \ge 0,
\]

where:
• Ā is the extended linear matrix that includes both linear inequalities and linear equalities. b̄
is the corresponding linear equality vector;
• t is the vector of slacks that convert upper bounds to equalities.
Stopping conditions
The predictor-corrector algorithm iterates until it reaches a point that is feasible (satisfies
the constraints to within tolerances) and where the relative step sizes are small.
4.3.4 Optimization Results
We start from Equations 4.11-4.13 and observe that the last two expressions have no explicit A(i, j) variable, so we can restate the optimization problem considering Data(C_j), for j = 1 ... m, as the decision variables.
We can re-write the formula with the following notations:
• c = {c_j}_{j=1...m}, c_j = D_j × c_{jD}, is a column array containing the objective function coefficients;
• x = {x_j}_{j=1...m}, x_j = Data(C_j), is the decision variable (a column array);
• b = {b_j}_{j=1...m} is a column array containing the constraint coefficients (the storage budget for each Cloud provider).
Now, the problem is:

\[
\min\; c^T x \quad \text{subject to} \quad b^T x \le B(1 + \varepsilon), \qquad 0 \le x_j \le DataCapacity(C_j). \tag{4.14}
\]
We used Octave to solve this Linear Programming problem by using the glpk function. We
used the values presented in Figure 4.5 as follows:
• c = [0, 0.1097, 0.0182, 0, 0.0791, 0.05, 0.2292]^T, all values measured in EUR/GB;
• b = [0.022, 0.0237, 0.0274, 0.0366, 0.0517, 0.079, 0.0914]^T, all values measured in EUR/GB;
• DataCapacity(C_j) ≤ 1000 GB;
• ε = 0 ... 1;
• B = 100 EUR;
• all elements of D are equal to 1.
The experimental results are presented in Figure 4.7 and Figure 4.8. We can observe that some Cloud providers are used at their maximum upper bound while others are not used at all. For four of the Cloud providers, the used capacity increases from the minimum to the maximum as the budget is relaxed. In this problem we considered that all data from all Cloud providers are moved to the processing datacenter.
Figure 4.7. The optimizer (the value of the decision variables at the optimum), which means the value
for Data(Cj ) for all 7 investigated Cloud providers.
Now, according to Equation 4.10, we can formulate a new optimization problem from Equation 4.11, as follows:

\[
\sum_{i=1}^{n} A(i, j) \times Data(L_i) = Data(C_j), \quad \forall j = 1 \dots m;
\qquad
\min_{j} \left\{ \sum_{i=1}^{n} A(i, j) \times c_{ij} \times Data(L_i) \right\}, \tag{4.15}
\]
Figure 4.8. The optimum value of the objective function (cT xopt ).
and we need to solve n linear optimization problems to find all the elements of the assignment matrix.
4.4 Cost-Aware Cloud Storage Service Allocation for Distributed Data Gathering

4.4.1 Continuous Linear Programming Solver
The solution for problem 4.15 is to compute the global allocation in n iterations, solving one optimization problem at each step. We have the following parameters: c is a matrix containing the objective function coefficients (each column contains the transfer costs from a specific data source to all Cloud providers - see Table 4.7);
Table 4.7. Transfer Cost Matrix (c).

Cloud Provider   ds1      ds2      ds3      ds4
cp1              0.0000   0.2292   0.0500   0.0791
cp2              0.1097   0.0000   0.2292   0.0500
cp3              0.0182   0.1097   0.0000   0.2292
cp4              0.0000   0.0182   0.1097   0.0000
cp5              0.0791   0.0000   0.0182   0.1097
cp6              0.0500   0.0791   0.0000   0.0182
cp7              0.2292   0.0500   0.0791   0.0000
a is an array containing the constraint coefficients (storage costs); B is a number containing the right-hand side value for each iteration (constant for all iterations); lb is an array containing the lower bound on each of the variables (the default lower bound is zero); ub is an array containing the upper bound on each of the variables (we consider the values from Table 4.8); CTYPE represents the sense of the constraint ("U" means an inequality constraint with an upper bound); VARTYPE encodes a continuous variable; itlim is the simplex iteration limit (it is decreased by one each time a simplex iteration is performed, and reaching zero signals the solver to stop the search); msglev specifies that error and warning messages can be displayed during the solver run. Finally, if SENSE is 1 the problem is a minimization, and if SENSE is -1 the problem is a maximization (we want to maximize the amount of stored data for a given budget).
Table 4.8. Cloud Providers Capacity (Data(Cj) - ub).

Cloud Provider   Capacity (GB)
cp1              5000
cp2              5000
cp3              5000
cp4              3000
cp5              2000
cp6              1000
cp7              1000
The proposed solver is described in the following listing. We used the GNU Linear Programming Kit package [110].
% n - number of data locations
% m - number of Cloud Storage Providers
[m, n] = size(c);
% setting the parameters of the optimization problem
ctype = "U";               % one budget constraint, with an upper bound
vartype = "CCCCCCC";       % m = 7 continuous decision variables
sense = -1;                % maximize the amount of stored data
param.msglev = 1;
param.itlim = 1000;
for j = 1 : n
  % one linear program per data source j
  [xopt(:,j), fopt(j), status, extra] = glpk (c(:,j), a, b, lb, ub, ...
                                              ctype, vartype, sense, param);
end
In this listing, xopt is the optimizer (the value of the decision variables at the optimum) and fopt is the optimum value of the objective function.
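For completeness, the inputs of the listing can be assembled from Tables 4.7-4.9 as sketched below (the variable names match the listing; we put the tables' values together here only for illustration):

% Inputs for the iterative solver above, assembled from Tables 4.7-4.9.
% Transfer cost matrix (Table 4.7): rows = providers cp1..cp7, columns = sources ds1..ds4.
c = [0.0000 0.2292 0.0500 0.0791;
     0.1097 0.0000 0.2292 0.0500;
     0.0182 0.1097 0.0000 0.2292;
     0.0000 0.0182 0.1097 0.0000;
     0.0791 0.0000 0.0182 0.1097;
     0.0500 0.0791 0.0000 0.0182;
     0.2292 0.0500 0.0791 0.0000];
a  = [0.022 0.0237 0.0274 0.0366 0.0517 0.079 0.0914];  % storage costs (Table 4.9, EUR/GB)
b  = 100;                                               % available budget B (EUR)
lb = zeros(7, 1);                                       % default lower bounds
ub = [5000 5000 5000 3000 2000 1000 1000]';             % capacities (Table 4.8, GB)

With these inputs, the loop in the listing solves one linear program per data source and collects the corresponding allocations in xopt.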
4.4.2 Data Gathering Model
Figure 4.9 represents the vision of our model: we have multiple geographically distributed heterogeneous data sources that are connected with a set of Cloud storage providers. Furthermore, the Cloud providers are linked with different processing datacenters and periodically send data to be processed. Additionally, we know the cost of transfer from each data source to each Cloud provider. So we aim to select the best Cloud providers to which to send the data, such that the cost is minimized and the budget is respected.
Table 4.9 presents the storage costs for different Cloud providers, and Table 4.7 contains the transfer costs. All of these numerical values can be obtained using the following API-based cost model: every Cloud storage provider publicly offers the prices of its available services, so in order to obtain prices we can implement a web service that uses the APIs offered by the Cloud providers. A method for price comparison is also offered by Cloudorado, which finds the best solution for specific requirements (https://www.cloudorado.com).

Figure 4.9. The model used for data processing: data sources (on the top), seven public Cloud Storage providers and several datacenters.
Table 4.9. Storage Costs for Cloud Providers.

Cloud Provider   Name of Cloud Provider                         Cost (EUR/GB/month)
cp1              Microsoft Azure Object Storage us-central      0.0220
cp2              Google Cloud Storage eu                        0.0237
cp3              Amazon S3 ap-northeast-1                       0.0274
cp4              SoftLayer Object Storage AMS                   0.0366
cp5              Exoscale Object Storage CH-GV2                 0.0517
cp6              Aruba Cloud Storage R1-CZ                      0.0790
cp7              Liquid Web Storm Object Storage us-central     0.0914

4.4.3 Results
We ran the optimization solver for n iterations considering two values for the available budget. In the first case (B = 100), only the first three Cloud providers were selected, all of them having the lowest prices (see Figure 4.10). The second case (B = 500) selects all Cloud providers, even though cp4, ..., cp7 are more expensive (see Figure 4.11). These are selected because cp1, cp2 and cp3 reached their maximum capacity (see Table 4.8). As a conclusion, the optimum value of the objective function only doubles even though the total available budget was increased five times.
4.5 Conclusions and Open Issues
The service selection problem in multi-Cloud environments is a difficult and current research problem, with cost reduction as its main expected benefit.
Figure 4.10. xopt - Cloud Provider Allocation for B = 100. Only the cheapest Cloud Providers are used; sum(fopt) = 3308.3.
Figure 4.11. xopt - Cloud Provider Allocation for B = 500. All Cloud Providers are used; sum(fopt) = 6770.0.
In Section 4.2 of this chapter we proposed a binary linear optimization method for cost optimization when buying Cloud storage capacity from public Cloud providers. We can conclude that the final gain depends on the initial data demand; however, in some cases it is surprisingly high (e.g. 80%), so this method can be successfully applied to any scenario involving public Clouds.
In Section 4.3, we presented the problem of optimal service selection in Cloud storage systems in the presence of budget constraints. We modeled it as a multi-objective optimization problem. In order to optimize the cost of storage services, while meeting the constraints on budget and on QoS parameters such as latency, we proposed a greedy algorithm. Because this type of problem is a multi-objective optimization problem, it is very difficult to find the optimal solution while respecting all the constraints. Therefore, we were able to find near-optimal solutions by relaxing some constraints. The results show that the allocation of Cloud providers depends on the transfer costs and on the available budget. The proposed method satisfies the data demand.
5 | Storage Systems for Cyber-Physical Infrastructures: Natural Resources Management
Water resource monitoring implies a huge amount of information with different levels of heterogeneity (e.g. spatial data, sensor data, and multimedia data), availability (e.g. data must have a minimum degree of redundancy) and accessibility (e.g. methods for data access such as REST, SOAP, WSDL) [163]. It is very important to acquire, store and transmit data in order to respond in real time to possible threats or accidents. Equally important is to have access to historical data for the calibration and validation of the models. All these requirements put great pressure on the storage system, which has to be efficient and to have reduced management costs.
Cloud-based cyberinfrastructures represent a combination of data resources, network protocols, computing platforms, and computational services, used to acquire, store, manage, integrate and visualize data [187], for applications like satellite image processing, distributed video surveillance, intelligent building management, or asynchronous mobile systems with many components. In order to benefit from the “pay-as-you-go” model, applications and services that run in the Cloud must be optimized.
Prototype Cyber Infrastructure-based System for Decision-Making Support in Water Resources Management (CyberWater) [36] is a national project that aims to create a prototype platform using advanced computational and communications technology for the implementation of new frameworks for managing water and land resources in a sustainable and integrative manner. The main focus of the CyberWater effort is on acquiring diverse data from various disciplines in a common digital platform that is subsequently used for routine decision making in normal conditions and for providing assistance in critical situations related to water, such as accidental pollution or flooding. This generates a large amount of data that must be stored and processed in an efficient and cost-effective way.
In the first part of the chapter we analyze the problem of Big Data modeling and integration
in the context of cyber-infrastructure systems. In the second part of the chapter we present the
storage architecture of CyberWater. In the third part we present a cost efficient cloud-based service
oriented architecture for water pollution prediction. The chapter ends with conclusions and open
issues.
5.1 Analysis of Data Requirements for CyberWater: from Heterogeneity to Uniform and Unified Approach
Nowadays, information represents an essential factor in the process of supporting decision-making, which is why heterogeneous data have to be processed in order to provide a unique view of the information for any type of application. Extracting valuable information from data combines qualitative and quantitative analysis techniques.
This section addresses the problem of modeling and integrating heterogeneous data that comes from multiple heterogeneous sources in the context of cyber-infrastructure systems and Big Data platforms. The first part analyses existing data models for water resource monitoring and management, considering computational methods (e.g. neural networks) for handling environmental data, the use of software engineering to deliver the computational solutions to the end-user, and the development of methods for continuous environmental monitoring based on sensor networks as part of the Internet of Things. In the second part, based on the previous analysis, we address the problem of data modeling considering the huge volume, high velocity and variety of data collected from the environment (another Big Data problem) and processed in real time to be used for water quality support. The third part presents the unified approach for water resources data models, considering the context of observation and the possibility of using it to enhance the organization, publication, and analysis of world-distributed point observation data while retaining a simple relational format. This will enable the development of new applications and services for water management that are increasingly aware of, and adapt to, their changing contexts in any dynamic environment. A context-aware approach requires an appropriate model to aggregate, semantically organize and access large amounts of data, in various formats, collected from sensors, users, and public open data sources.
The CyberWater case study refers to the modeling, integration and operation of these data in order to provide a unified approach and a unique view. The main purpose is to offer support for different processes inside the CyberWater platform, such as monitoring, analysis, and control of natural water resources, with the aim of preserving water quality.
Many scientific fields such as Cyber-infrastructures, Smart Cities, e-Health, Social Media, Web
3.0 etc., try to extract valuable information from huge amounts of data, generated on a daily basis.
Moreover, these data are unorganized or semi-organized, having a high level of heterogeneity [78].
In the case of Smart Cities, Big Data relates to urban data, basically referring to the space and time perspective, gathered mostly from different sensors. Furthermore, the growth of Big Data changes planning strategies from long-term thinking to short-term thinking, as the management of the city can be made more efficient [88]. Moreover, these data open the possibility of real-time analysis of city life and of new modes of city administration, and also offer the possibility of a more efficient, sustainable, competitive, productive, open and transparent city [89].
Healthcare systems are also transformed by the Big Data paradigm. Data are generated from different sources such as electronic medical records systems, mobilized health records, personal health records, mobile healthcare monitors and predictive analytics, as well as a large array of biomedical sensors and smart devices. These data can rise up to 1000 Petabytes [102].
Regarding Social Media and Web 3.0, people and the interactions between them produce massive quantities of information. A huge effort is made to understand social media interactions, online communities, human communication and culture [20].
The motivation of this section comes from the necessity of a unified approach to data processing in large-scale cyber-infrastructure systems, because the characteristics of large-scale cyber-physical systems (e.g. data sources, communication and computing) exhibit significant heterogeneity. We analyze different heterogeneous sources and formats of data collected from sensors, users, and the web, as well as from public open data sources such as regulatory institutions.
5.1.1 Big Data, Heterogeneous Data
It is important to provide an integrated and interoperable framework for environmental monitoring and data processing that includes the development of sound solutions for sustainable crisis management in order to support human life in case of disasters. For instance, the authors of [72] propose an approach to integrate heterogeneous data with uncertainty in emergency systems.
Based on information and communications technology support and Big Data technologies, the response to natural disasters needs to be both rapid and carefully coordinated. Saving lives and providing shelter for those displaced are two key priorities for first responders. Helping citizens post-disaster is a key part of the mandate of damage assessors. This is realized by measuring and quantifying a disaster's impact on the community, and then providing assistance to individuals. Mobile and Cloud-based GIS offer many potential benefits for improving disaster management.
Furthermore, we have to combine mobile applications, existing electronic services and data repositories in an architecture based on Cloud solutions and existing Big Data approaches. We also have to contribute to the development of INSPIRE-compliant solutions [177], based on the ISO 19156 standard on Observations and Measurements (O&M) and on SOS (Sensor Observation Service), SES (Sensor Event Service) and SAS (Sensor Alert Service), as a main contribution to standardization.
It is essential to have a scalable environment with flexible information access, easy communication and real-time collaboration from all types of computing devices, including mobile handheld devices (e.g. smart phones, PDAs and tablets such as iPads). Also, the system must be accessible, scalable, and transparent from the location, migration and resource perspectives.
Data models
Data models represent the building blocks of Big Data applications, having a major impact on the
performance and capabilities of those applications. Moreover, the different tools that are used for
data processing impose these data models. Next, we will discuss the different data models that
exist in the context of Big Data: structured data, text file data, semi-structured data, key-value
pair data [16].
Structured data models refer to data that is contained in databases or spreadsheet files. The sources of structured data in a cyber-infrastructure are water quality sensors (e.g. water parameters such as conductivity, salinity, total dissolved solids, resistivity, density, dissolved oxygen, pH, temperature and so on), mobile phone sensors (e.g. location data), geographical information systems (e.g. geo-databases that contain spatial data such as points, polygons, rasters, annotations and so on) and click-stream sources (e.g. data generated by human intervention in the case of reporting an incident related to a pollution event). Figure 5.1 presents a database table that contains information about a spatial dataset with different attributes. Data can be aggregated and queried across the entire database. Things get more complicated when we want to aggregate data from many tables, because the problem becomes exponentially more complex; the reason behind this complexity is that each query requires reading the entire dataset.
Spatial and temporal database systems are in close relation with other research areas of information technology. These systems integrate with other disciplines such as medicine, CAD/CAM, GIS, environmental science, molecular biology or genomics/bioinformatics. Also, they use large, real databases to store and handle large amounts of data [157]. Spatial databases are designed to store and process spatial information efficiently. Temporal databases represent attributes of objects which change with respect to time. There are different models for spatial/temporal and spatio-temporal systems [28]:
• Snapshot Model - temporal aspects of data time-stamped layers;
Figure 5.1. Geographical data example
• Space-Time composite - every line in space and time is projected onto a spatial plane and
intersected with each other;
• Simple Time Stamping - each object consists of a pair of time stamps representing creation
and deletion time of the object;
• Event-Oriented model - a log of events and changes made to the objects are logged into a
transaction log;
• History Graph Model - each object version is identified by two timestamps describing time
interval;
• Object-Relationship (O-R) Model - a conceptual level representation of spatio-temporal
database;
• Spatio-temporal object-oriented data model - is based on object oriented technology;
• Moving Object Data Model - objects are viewed as 3D elements.
The systems designed considering these models manage both space and time information for
several classes of applications, like: tracking of moving people or objects, management of wireless
communication networks, GIS applications, traffic jam preventions, weather prediction, electronic
services (e.g. e-commerce), etc.
Text data models, on the other hand, are at the opposite end from structured data, as this type of data has no well-defined structure and meaning. The sources of this data are represented by different documents related to the diverse regulations released by regulatory institutions in cyber-infrastructure domains such as water management [16].
Semi-structured data represents data that has a structure but is not relational. Instrumentation equipment such as sensors generates this kind of data. In order to be stored in a relational database, the data must be transformed. A major advantage of semi-structured data is that it can be loaded directly into a Hadoop HDFS file system and processed there in raw form [54].
The key-value data model assumes that a value corresponds to a key. This model has the following characteristics: a simple structure; query speed higher than in a relational database; support for mass storage and high concurrency; and good support for query and modify operations on data through the primary key. Key-value pair data is the driver of performance in the map-reduce programming model. This model has a single key-value index for all data, being similar to the memcached distributed in-memory cache. This type of data is stored in key-value stores, which in general provide a persistence mechanism and additional functionality as well: replication, versioning, locking, transactions, sorting, and/or other features [25].
Key-value stores are schema-less NoSQL data stores, where values are associated with keys represented by character strings. There are four basic operations when dealing with this type of data store (a minimal usage sketch follows the list):
1. Put(key, value) - associates a value with the corresponding key;
2. Delete(key) - removes all the values associated with the supplied key;
3. Get(key) - returns the values for the provided key;
4. MultiGet(key1, key2, ..., keyn) - returns the list of values for the provided keys.
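A minimal sketch of these four operations, using Octave's containers.Map as a stand-in for a key-value store (purely illustrative; real key-value stores add persistence, replication, versioning and distribution):

% In-memory key-value store sketch based on containers.Map (character-string keys).
store = containers.Map();                         % empty store
store('sensor:ph')   = 7.2;                       % Put(key, value)
store('sensor:temp') = 18.5;                      % Put(key, value)
ph   = store('sensor:ph');                        % Get(key)             -> 7.2
vals = cellfun(@(k) store(k), ...
               {'sensor:ph', 'sensor:temp'});     % MultiGet(key1, key2)
remove(store, 'sensor:temp');                     % Delete(key)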
Due to the simplicity of the representation, a large amount of data values can be stored in this type of data store; the values are indexed and can be appended to the same key. Moreover, tables can be distributed across storage nodes [105].
Although key-value stores can be used efficiently to store data resulting from algorithms such as phrase counts, they have some drawbacks. First, it is hard to maintain unique values as keys, as it becomes more and more difficult to generate unique character strings for the keys. Second, this model does not provide capabilities such as consistency for multiple transactions executed simultaneously; these must be provided at the application level.
Semi-structured data storage is represented by storage systems that store semi-structured data. This type of data has a self-describing structure, containing tags in order to separate elements and establish hierarchies of records and fields of the data. In general this type of data is stored in special types of databases, HBase being one of the best known examples. Document stores are similar to key-value stores, the main difference being that values are represented by "documents" that have some structure and encoding models (e.g. XML, JSON, BSON, etc.) for the managed data. Document stores also usually provide an API for retrieving the data. Tabular stores are stores for the management of structured data based on tables, descending from the Bigtable design from Google. HBase from Hadoop is such a type of NoSQL data management system. In this type of data store, data is stored in a three-dimensional table that is indexed by a row key (used in a fashion that is similar to the key-value and document stores), a column key that indicates the specific attribute for which a data value is stored, and a timestamp that may refer to the time at which the row's column value was stored.
Data gathering
Environmental data gathering and analysis, resulting from continuous monitoring, require the design and analysis of methods for workflows and for the interaction with data sets. A big part of these data is made available by public institutions, private companies and scientists [23]. Furthermore, data collected from social media sites, such as geotagged photos, video clips and social interactions, may contain information related to environmental conditions. Moreover, this process of data gathering must be energy efficient and cost aware in the context of Cloud computing services. New methods for gathering and integrating these types of data for the further processing stages must be designed.
The geospatial sensor web [140] has been widely used for environmental monitoring. Different from a sensor network, in a sensor web infrastructure the device layers, the network communication details, and the heterogeneous sensor hardware are hidden [21]. The Sensor Web Enablement (SWE) of the Open Geospatial Consortium (OGC) defines a sensor web as an infrastructure that enables access to sensor networks and archived sensor data that can be discovered and accessed using standard protocols and interfaces [19]. In [31] the authors propose a sensor web heterogeneous node meta-model, discussing the development of five basic metadata components and the design of a nine-tuple node information description structure, as shown in Figure 5.2.
Figure 5.2. Sensor web meta-model

A service-oriented multi-purpose SOS framework can be developed in order to access data through a single-method approach, integrating the sensor observation service with other OGC services. The solution includes a few components, such as an extensible sensor data adapter, an OGC-compliant geospatial SOS, a geospatial catalogue service, a transactional web feature service and a transactional web coverage service for the SOS, and a geospatial sensor client. Data from live sensors, models and simulations is stored and managed with the aid of the extensible sensor data adapter [30].
There is an important need for data gathering and content analysis techniques for social media streams, such as Twitter and Facebook. These have become essential real-time information resources with a wide range of users and applications [57]. Gathering valuable data from social media represents a good opportunity for environmental monitoring. Social media can also play an important role in pollution accidents or natural disasters as an information propagator.
5.1.2 Unified Approach of Big Data Modeling
In this part we present the unified approach for water resources data models considering the context
of observation and the possibility of usage to enhance the organization, publication, and analysis
of world-distributed point observations data while retaining a simple relational format. This will
enable development of new applications and services for water management that are increasingly
aware of and adapt to their changing contexts in any dynamic environment.
According to the National Oceanic and Atmospheric Administration1 (NOAA), which collects, manages, and disseminates a wide range of climate, weather, ecosystem and other environmental data, there are nine principles for effective environmental data management:
• Environmental data should be archived and made accessible;
• Data-generating activities should include adequate resources to support end-to-end data management;
• Environmental data management activities should recognize user needs;
• Effective interagency and international partnerships are essential;
• Metadata are essential for data management;
• Data and metadata require expert stewardship;
• A formal, on-going process, with broad community input, is needed to decide what data to
archive and what data not to archive;
• An effective data archive should provide for discovery, access, and integration;
1 http://www.noaa.gov/
• Effective data management requires a formal, on-going planning process.
A matrix represents the basic structure of an environmental dataset, where usually the rows correspond to the individual objects (measurements, time units or measuring spots) and the columns contain the series of readings for the corresponding variable. The units in the columns may be logical characters (true=1 or false=0), ordered or unordered categories, integers (count data) or reals (measurements); they may also contain time information or a coding of the measurement spots. The coding of missing data and of censored data (for extreme values) has to be fixed. Of course, some describing or classifying text may be contained in the rows as well.
In a geographical information system (GIS) there are four basic data structures: vector, raster, triangulated irregular network and tabular information (table of attributes). For example, a virtual representation of the Earth mostly contains data values that are observed within the physical Earth system. On the other hand, data models are required to allow the integration of data across the silos of the various Earth and environmental science domains. Creating a mapping between the well-defined terminologies of these silos is a challenging problem. A generalized ontology for use within Web 3.0 services, which builds on European Commission spatial data infrastructure models and acknowledges that there are many complexities in the description of the environmental properties that can be observed within the physical Earth system, is presented in [97]. The ontology is shown to be flexible and robust enough to describe concepts drawn from a range of Earth science disciplines, including ecology, geochemistry, hydrology and oceanography.
Another important aspect is represented by the INSPIRE (Infrastructure for Spatial Information in Europe) directive, which aims to harmonize the creation, representation and access to geographic data, and their sharing between EU Member States. The main goal of this initiative is to improve environmental policy-making at the European level and to ensure compatibility between the SDIs (spatial data infrastructures) of the Member States, by adopting a set of common implementation rules in several data areas:
• metadata;
• data specifications;
• network services (search, visualization, transformation, download and invocation);
• sharing of data and services;
• monitoring and reporting.
One of the main obligations imposed by INSPIRE is the sharing and harmonization of geographic datasets. This sharing, also called interoperability, is seen by the Commission as "the possibility of combining data sets or services, without repetitive manual intervention, so that the result is coherent and surplus value is obtained from it" [44]. In this context, harmonization means transforming data from existing models or schemas into the consistent data model called INSPIRE (mainly based on the ISO 19100 series of standards). The proposed solution is that existing systems can still be used, but they become the source for the above transformation, which is performed through network services. This eliminates the ad-hoc integration of data from heterogeneous sources, obtaining data interoperability at the European level.
From the perspective of the public organizations that fall under the themes specified by INSPIRE, the steps to publish data in accordance with the requirements of the directive are:
• creation of metadata, including for existing data;
• provision of metadata-based search and visualization services;
• services for download and conversion.
Regarding the presented aspects of the INSPIRE directive, the CyberWater project falls under the specified INSPIRE themes because it operates mainly with spatial data, so it needs to follow the three steps presented above in order to be in concordance with the directive. The project must create and store metadata about the spatial and monitored data, offering end users and third-party services the possibility to search and visualize data and metadata. Also, the third layer of the CyberWater architecture deals with the creation of services and applications for clients, for download and conversion purposes.
Unified Data Representation and Aggregation
When it comes to Big Data representation and aggregation, the most important question that has to be answered is how to represent and aggregate relational and non-relational data in the same storage engine. Moreover, these data must be queried in an efficient way in order to offer relevant results across all data types.
The integration of heterogeneous observations in different applications is difficult because
these differ in spatial-temporal coverage and resolution. An approach for spatial-temporal aggregation in the Sensor Web using the Geoprocessing Web is to define a tailored observation model
for different aggregation levels, a process model for aggregation processes and a Spatial-Temporal
Aggregation Service [168].
Context-awareness also represents a core function for the development of modern ubiquitous systems. It offers the capacity to gather and deliver to the next level any relevant information that can characterize the service-provisioning environment, such as computing resources/capabilities, physical device location, user preferences, time constraints and so on [9].
There are lots of applications that need to query multiple databases with heterogeneous schemas. The semantic integration approaches focus on heterogeneous schemas with homogeneous data sources. A new query type, called decomposition aggregate query, integrates heterogeneous data source domains. This approach is based on a 3-role structure. Decomposing compounds into components and translating non-aggregate queries over compounds into aggregate queries answerable by other data sources is achieved by a type of data source called dnodes. It is worth mentioning that this is a solution mainly designed for database management systems [186].
Data Access and Real-Time Processing
A monitoring platform for water data management needs to access distributed data sources (e.g. sensor networks, mobile systems, data repositories, the social web, and so on). Next, these data have to be processed in real time in order to prevent natural disasters such as water pollution and, more importantly, to alert the people possibly affected.
Apache Kafka is a solution that proposes a unified approach to offline and online processing, by providing a mechanism for parallel load into Hadoop systems as well as the ability to partition real-time consumption over a cluster of machines. It also provides a real-time publish-subscribe solution, which overcomes the challenge of consuming real-time data volumes that may grow an order of magnitude larger than the actual data [58]. Figure 5.3 presents the Big Data aggregation-and-analysis scenario supported by the Apache Kafka messaging system.
There are two main categories of data: data that has value at a given moment in time, such as prediction data, and data whose value remains forever, such as the maximum possible values for pollutants or sensor data representing historical data. Mining the instantaneously valued data requires a real-time platform. In [123] the authors propose a method of dynamic pattern identification for logically clustering log data. The method is a real-time and generalized solution to the process of log file management and analysis.
Computing frameworks such as MapReduce [189] or Dryad [76] are used for large-scale data processing. In this paradigm, users write parallel computations with the aid of high-level operators, without paying attention to data distribution or fault tolerance. The main drawback of these systems is that they are batch-processing systems and are not designed for real-time processing.
Storm [173] and Spark [194] represent possible solutions for real-time data stream processing. Storm is currently used at Twitter for real-time distributed processing of stream data. The most important properties of Storm are scalability, resiliency, extensibility, efficiency and ease of administration [173].
5.1.3 CyberWater Case Study
In this section we present the case study of the CyberWater project, a prototype cyber-infrastructure-based system for decision-making support in water resources management. The main reason we present this case study is to highlight the need for heterogeneous data modelling in the Big Data era, especially in the context of cyber-infrastructure-based systems.
Figure 5.5 presents the layered architecture of the CyberWater monitoring platform. It is organized on three levels: the data level, the storage level, and the access, management and data processing level. Between the data and storage levels lies the interface for monitoring, analysis and processing rules. Further, between the storage and management levels is placed the data access interface.
The data level consists of various heterogeneous data sources used by the platform, such as the sensor network, data suppliers (e.g. GIS, water treatment plants), and third-party services (e.g. ANAR, ApaNova, and other Romanian institutions). A sensor gathers the following parameters from water resources: temperature, pH, specific conductivity, turbidity and dissolved oxygen. Data obtained from third-party suppliers is received in heterogeneous formats, as files such as PDF, CSV, TXT, etc. For example, data obtained from ApaNova comes as a PDF file, called an analysis bulletin, containing information about: taste, color, odor, turbidity, pH, conductivity, free residual chlorine, ammonium, nitrites, nitrates, iron, oxidability, total hardness and aluminum. Figure 5.4 presents an XML document modeled from the PDF bulletin gathered from ApaNova (a parsing sketch follows the list below). From this format, three main parts can be identified:
• Limitations specification - in a fixed format;
• Measured values;
• Semantics - which refers to the explanation of the measures.
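As an illustration of how such a bulletin could be split into the three parts above, the sketch below parses a hypothetical XML layout with <limits>, <measurements> and <semantics> elements; the element and attribute names are assumptions, since the exact schema of the modeled document is the one shown in Figure 5.4, not reproduced here.

```python
# Illustrative parser for a bulletin modeled as XML (hypothetical schema).
import xml.etree.ElementTree as ET

def parse_bulletin(path):
    root = ET.parse(path).getroot()
    # Limitations specification - fixed format, one entry per parameter.
    limits = {p.get("name"): float(p.get("max")) for p in root.find("limits")}
    # Measured values reported in the bulletin.
    values = {p.get("name"): float(p.text) for p in root.find("measurements")}
    # Semantics: textual explanation attached to each measure.
    semantics = {p.get("name"): p.text for p in root.find("semantics")}
    return limits, values, semantics

# Example use: flag parameters (pH, turbidity, ammonium, ...) above their limit.
# limits, values, semantics = parse_bulletin("apanova_bulletin.xml")
# exceeded = [n for n, v in values.items() if v > limits.get(n, float("inf"))]
```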
Figure 5.4. Modeled XML data
At the storage level, data collected from the different data sources is stored in order to offer aggregation services, such as workflow execution, pollution propagation modelling, pollution prediction methods, and platform configuration. The aggregation service is also connected with a Knowledge Management service and a Model Parameter Estimator service, which support the offered services.
The top layer of the architecture is the level of access, management and data processing. At this level there are services such as spatial and temporal query services, customized views of data, data validation and specific preferences, providing functionality for applications like decision support systems, data analysis, real-time alerts, video, mobile access and online support.
The interface for monitoring, analysis and processing rules has the role of storing data from the data level (e.g. sensor network, data suppliers, external events and third-party services) in a flexible and efficient manner.
The data access interface ensures access to data for CyberWater services and applications, and also for third-party services that need access to data, offering access methods such as REST, SOAP
and WSDL.
Figure 5.5. CyberWater layered architecture
5.2 Proposed Cloud Storage Solution for CyberWater
The continuous growth of cyber-infrastructure systems in the scope and scale of the provided applications and data sources has led to the concept of the "data lake". From a practical point of view, this concept is characterized by three key attributes:
• Collect everything. A data lake contains all data, both raw sources kept over extended periods of time and any processed data;
• Dive in anywhere. A data lake enables users across multiple business units to refine, explore and enrich data on their own terms;
• Flexible access. A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, on-line, search, in-memory and other processing engines.
As a result, a data lake delivers maximum scale and insight with the lowest possible friction and cost. As data continues to grow exponentially, Enterprise Hadoop and EDW investments can provide a strategy for both efficiency in a modern data architecture and opportunity in an enterprise data lake.
The following characteristics are important to take into consideration when designing and analyzing the storage system:
• File size distribution - important for I/O optimization; the file size in workloads depends on the programming styles of the specific applications;
• Data commonality/data compressibility - determines the data similarity of files;
• Data lifetime - can help in choosing an optimum storage device; some data (e.g., a workflow's intermediate files and checkpoint images) are only temporary;
• Data open pattern - represents the access modes for a file, specified through the parameters of the file open call;
• Data request frequency - the frequency of access to the file;
• Directory pattern - the directory structure of the file;
• Data locality - some applications exhibit high access locality, that is, the working sets of multiple application instances running on different nodes significantly overlap, while in other cases application instances running on different nodes have disjoint working sets;
• Data consistency requirements.
These characteristics can be exploited through different optimization techniques to enhance storage system performance and to reduce the operation cost, with regard to different metrics such as latency, throughput, disk utilization and CPU load. For example, buffering can dramatically enhance throughput for write-only workloads. Furthermore, deduplication can save considerable storage space for highly compressible workloads, but consumes more resources when the data needs to be accessed.
It is important to note that different optimization techniques can also have a negative impact on the operational cost. For example, consistency mechanisms might have a negative impact on write throughput. Consequently, these optimizations do not generally coexist on the same data pipeline.
5.2.1 CyberWater Storage Architecture
Figure 5.6 presents the architecture of the storage system for the CyberWater project. The data sources layer consists of various geographically distributed heterogeneous data sources such as the sensor network, spatial data, data suppliers (e.g. GIS, water treatment plants), and third-party services data (e.g. ANAR, ApaNova, and other Romanian institutions). In order to process data, we first store it at different Cloud storage providers, close to the data sources. Second, depending on the processing needs, these data are transferred to a private datacenter. Moreover, we can also buy cloud processing services if, in a certain use case, it is more cost efficient to process data with a cloud service or if the processing capabilities of our private datacenter are exceeded. In this case we can use the solution proposed in Section 3.3 by migrating processing tasks into the Cloud.

Figure 5.6. CyberWater Cloud storage architecture
The ingestion layer deals with the process of obtaining and importing data for storage or, in the case of the real-time alerts service, for immediate use. In the first use case, data can be ingested in batches (e.g. discrete chunks at periodic intervals of time). In the second use case, ingestion happens in real time, data being ingested as it is emitted by the source. The data ingestion process involves two phases: data import and data routing to the correct storage engine. For instance, if we deal with sensor data, we route it to the Oracle spatio-temporal database, as sketched below.
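The following minimal sketch illustrates the routing phase; the record "kind" field and the handler functions are placeholders for the platform's real storage adapters, not an actual API.

```python
# Sketch of the routing phase of ingestion (handler bodies intentionally left
# empty; they stand in for the platform's real storage adapters).
def store_spatiotemporal(record):   # Oracle spatio-temporal database
    ...

def store_document(record):         # document / NoSQL store for semi-structured data
    ...

def store_raw_file(record):         # HDFS for bulk and multimedia content
    ...

ROUTES = {
    "sensor": store_spatiotemporal,   # time-stamped, geo-referenced readings
    "bulletin": store_document,       # third-party reports (e.g. analysis bulletins)
    "media": store_raw_file,          # images/videos attached to reported events
}

def ingest(record):
    """Route one ingested record to the storage engine that matches its type."""
    handler = ROUTES.get(record.get("kind"), store_raw_file)
    handler(record)
```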
The data storage layer consists of the Hadoop Distributed File System. Hadoop provides HDFS and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers. Moreover, MapReduce and HDFS may run on the same set of nodes, which means that the compute nodes and the storage nodes are hosted on the same machines. The framework achieves higher bandwidth across the cluster by scheduling tasks on the same nodes where the data is stored. The Map tasks process input records and write a set of intermediate records to the local disk.
HDFS stores file system metadata and application data separately. As in other distributed
file systems, like PVFS, Lustre and GFS, HDFS stores metadata on a dedicated server, called
the NameNode. Application data are stored on other servers called DataNodes. All servers are
fully connected and communicate with each other using TCP-based protocols. Unlike Lustre and
PVFS, the DataNodes in HDFS do not use data protection mechanisms such as RAID to make the
data durable. Instead, like GFS, the file content is replicated on multiple DataNodes for reliability.
While ensuring data durability, this strategy has the added advantage that data transfer bandwidth
is multiplied, and there are more opportunities for locating computation near the needed data.
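As a small illustration of how ingested files could land in HDFS, the sketch below writes a day's worth of sensor readings through WebHDFS. It assumes the NameNode exposes WebHDFS (here on port 9870) and uses the third-party `hdfs` Python client, so host names, paths and the replication factor are illustrative only.

```python
# Sketch: write a batch of sensor readings into HDFS over WebHDFS
# (assumes the `hdfs` PyPI client and an accessible NameNode).
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="cyberwater")

csv_lines = (
    "sensor_id,timestamp,ph,turbidity\n"
    "dambovita-07,2016-05-01T10:00,7.4,3.1\n"
)
client.write(
    "/cyberwater/raw/sensors/2016-05-01.csv",
    data=csv_lines.encode("utf-8"),
    overwrite=True,
    replication=3,   # rely on HDFS replication for durability, as described above
)
```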
On top of HDFS we place different storage engines, such as an Oracle spatio-temporal database for sensor and GIS data, a NoSQL database for key-value data, and a document and tabular store for semi-structured data. Relational databases can handle various types of data, for example sensor data or GIS data. Every query involves the same kinds of data – location, water parameters, map and so on – which are all stored in a table with one column for each piece of data. On the other side, we have multimedia files, such as images or videos, that can be attached to an event reporting action.
Multimedia files cannot be represented in the same way, as a series of columns. What can be stored in a relational way is the data about the files, which is in fact metadata. Alongside multimedia files, we have social media objects such as blogs, tweets, and emails, which also fall in the category of non-relational data.
Related to uniform data management, a context-aware approach requires an appropriate model to aggregate, semantically organize and access large amounts of data, in various formats, collected from sensors, users and public open data sources. The most important question that has to be answered is how to handle relational and non-relational data in the same storage engine.
Data handling methods have been applied in several areas, including water network data analysis and modelling, water quality analysis, energy-water relationships, efficiency modelling of regional water resources under disaster constraints, etc.
One approach is to modify existing relational database management systems so that they can store non-relational data. For example, multimedia files can be stored in a Blob data type, and in this way the data can be retrieved. The drawback is that data stored in this way cannot be processed; for example, an image cannot be scanned in order to find useful information.
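The sketch below illustrates this Blob-column approach, with SQLite standing in for any relational engine: the image bytes are stored and retrieved as opaque data, but nothing inside the database can interpret them. The table layout and file name are hypothetical.

```python
# Sketch of the "Blob column" approach: media content is stored as opaque bytes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_media (event_id INTEGER, filename TEXT, content BLOB)"
)

photo_bytes = b"\xff\xd8\xff\xe0..."   # placeholder for real JPEG content
conn.execute(
    "INSERT INTO event_media VALUES (?, ?, ?)",
    (42, "pollution_photo.jpg", sqlite3.Binary(photo_bytes)),
)
conn.commit()

# The bytes come back unchanged, but the database cannot analyze the image itself.
stored = conn.execute("SELECT content FROM event_media").fetchone()[0]
```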
Another approach is to design new database engines that can handle all these Big Data challenges; for example, the new engine can have functions for parsing multimedia files in order to find information about a pollution event. The authors of [1] propose a hybrid system called HadoopDB, which combines parallel database systems with MapReduce-based systems in order to benefit from the performance and efficiency of the former and the scalability, fault tolerance and flexibility of the latter. The main idea of this solution is to connect multiple single-node database systems, using Hadoop as the task coordinator and the network layer for communication. In this way, queries can be parallelized with MapReduce. The drawback of this approach is that HadoopDB does not match the performance of parallel database systems. The definition of a declarative language able to map an ontology with precision into queries for a set of data sources could represent another approach. This method is mainly designed for the integration of multiple heterogeneous relational database systems [98]. The authors introduce the concepts of Semantic Identifier and Semantic Join: the first represents a solution to the problem of entity resolution, and the second is designed to help with the problem of record linkage. Although it is an interesting approach, it would have to be modified in order to be used for multiple heterogeneous data sources in the context of Big Data. Still, we cannot know for sure how accurate the mapping phase would be when dealing with these types of data sources.
Teradata2 also proposes an interesting approach, which involves the design of a management system composed of three elements: a storage engine, a processing layer and a function library [183]. In the storage engine, relational data is stored in database tables and non-relational data is stored as deserialized objects that are similar to Blobs. The processing layer is an extended SQL engine that includes MapReduce functions; in this way relational data is queried with SQL and non-relational data is queried with MapReduce functions. The function library layer is the core element: in this layer, functions written by users permit the manipulation and querying of non-relational data. These functions are stored in a library, and the results of the functions are stored in database tables, due to the closure principle of the relational database model. Basically, this principle states that any query against a table or tables of data must return the answer in the form of a table, thus permitting the chaining of queries.
Our approach, in order to overcome data heterogeneity in Big Data platforms and to provide a unified and unique view of heterogeneous data, is to add, on top of the different data management systems, a layer with aggregation and integration functions.
2 http://www.teradata.com/
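A conceptual sketch of such an aggregation layer is shown below: a thin facade fans a query out to the registered engines and merges the results into one view with provenance information. The per-engine search interface is an assumption; real adapters would wrap the Oracle spatio-temporal database, the NoSQL store and HDFS.

```python
# Conceptual sketch of the proposed aggregation layer (backend interface assumed).
class AggregationLayer:
    def __init__(self, backends):
        # backends: mapping such as {"relational": ..., "document": ..., "files": ...}
        self.backends = backends

    def query(self, criteria):
        """Return a single merged result list across all registered engines."""
        merged = []
        for name, backend in self.backends.items():
            for item in backend.search(criteria):
                item["_source"] = name   # keep provenance for the unified view
                merged.append(item)
        return merged
```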
Based on the presented architecture, we modeled a few conceptual workflows for the decision support module, the real-time alerts and the visualization of the maps and the information presented on them. The applications offered by the platform need a unique view of and an integrated approach to data.
Figure 5.7 presents the visualization workflow. A user accesses the application and can perform one of the following actions: zoom in, click on a sensor or apply a filter. In the case of the first action, a detailed portion of a map is visualized and the data is brought from the storage system by the aggregation service. In this module the following components can be identified: the analytics component, the data filtering component, and the access to distributed databases. The analytics component refers to all the activities in the workflow, such as zooming in, clicking on a sensor or applying a filter, and it is important because it is the most accessed and utilized part of the platform, all users having access to this component.
Figure 5.7. Visualisation Module
The data filtering component comes into play when users want to find specific data about an event: in order to get the data, the user must apply a series of filters, which are used as input to query the storage system. Access to distributed storage assumes that data gathered from sensors is stored at the nearest storage site and that all the data must be synchronized and processed.
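A minimal sketch of how user-selected filters could be turned into a storage-level query is given below; the filter names and the query format are illustrative, not the platform's actual API.

```python
# Sketch of the filtering component: the user's filter selections become the
# query sent to the storage system (field names are illustrative only).
def build_query(filters):
    """Map UI filters onto a storage-level query description."""
    query = {}
    if "river" in filters:
        query["river"] = filters["river"]
    if "parameter" in filters:                      # e.g. "ph", "turbidity"
        query["parameter"] = filters["parameter"]
    if "from" in filters and "to" in filters:       # time window of interest
        query["timestamp"] = {"$gte": filters["from"], "$lte": filters["to"]}
    return query

# build_query({"river": "Dambovita", "parameter": "ph",
#              "from": "2016-05-01", "to": "2016-05-07"})
```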
Figure 5.8 presents the decision support service. First, the client accesses the decision support module from the web application, making one of two requests: data validation or temporal and spatial queries. Next, the aggregation service interrogates the storage and service repositories in order to calculate the propagation model or the prediction methods. There are two important components: the analytics component and the automatic control component. The analytics component can be identified in the sub-workflow responsible for temporal and spatial queries. Data is aggregated from the storage system in order to provide the application with historical data. In this case the cost of data aggregation is very important for optimizing the overall cost of the entire system.
Another important service offered by the CyberWater project is Real-Time Alerts. Figure 5.9 presents the workflow of this service. The clients or administrators set preferences for the platform configuration and the service performs spatial and temporal queries, aggregating data from the storage and service repositories. After this step, the data is validated against third-party data suppliers and passed to the alert generation module, which classifies the alert and takes the necessary steps for sending it.
Figure 5.8. Decision Support Module

Figure 5.9. Real-Time Alerts Module

All the workflows presented above have been encapsulated in a web mapping application, presented in Figure 5.10. Users are informed about the water quality of a certain river at a certain point by clicking on a sensor marked on the river course. The watercourse is also colored based on water quality (blue for good quality and red for bad quality of the water). The evolution of a certain parameter can be viewed in a graph displayed on top of the map. Our system also offers a prediction and alert service for water quality in case of a pollution event. Users can report an incident by clicking on the map and filling in a web form that describes the incident; they can also upload multimedia files related to the incident and share the reported event on social networks.

Figure 5.10. CyberWater front-end
Cloud Storage Systems Performance Evaluation
To conduct our experiments, we zipped different chunks of data received from a sensor, from 1 KB up to 128 MB, then uploaded them to and downloaded them from Google Drive, SkyDrive, Amazon S3, Dropbox and a private HDFS. We performed each set of uploads and downloads 8 times and took the average, conducting our testing over the course of five business days (this aspect is not important for the HDFS cluster).
Each test was performed using the latest version of the Safari browser with a medium speed
Ethernet connection to a national ISP, which typically averages 57.2 Mbps down and 14.8 Mbps
up on Speedtest.net.
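The measurement procedure can be summarized by the following sketch: each transfer is repeated 8 times and the mean wall-clock time is kept. The size list, the zipping step and the provider-specific upload/download calls are placeholders for the SDKs actually used (e.g. boto for Amazon S3, the Drive/SkyDrive/Dropbox APIs, `hdfs dfs -put` for the local cluster).

```python
# Sketch of the measurement loop: every transfer is timed RUNS times and averaged.
import time
import statistics

SIZES_KB = [1, 8, 64, 512, 1024, 8192, 65536, 131072]   # illustrative: 1 KB .. 128 MB
RUNS = 8

def measure(transfer, payload):
    """Return the average wall-clock duration of transfer(payload) over RUNS runs."""
    samples = []
    for _ in range(RUNS):
        start = time.perf_counter()
        transfer(payload)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

# for size_kb in SIZES_KB:
#     payload = zip_sensor_chunk(size_kb)          # placeholder for the zipping step
#     print(size_kb, measure(provider.upload, payload),
#           measure(provider.download, payload))
```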
Figure 5.11. Average Upload/Download Speed (MB/s) for different Cloud Storage Providers.
Figure 5.11 presents the average upload/download speed (MB/s) for different Cloud storage
providers. HDFS has a higher speed (both up and down) because it is installed on a local cluster.
Amazon S3 and Google Drive have similar performance, while SkyDrive and Dropbox show noticeably different upload and download speeds.
Figure 5.12 presents the total task time, i.e., the total transfer time for the performed experiments. We present these results in ascending order. The results confirm again that a local HDFS cluster has better overall performance. On the other hand, the total overhead is 25%, which can be acceptable for low-speed processes, like those performed on a farm or in a greenhouse.
Figure 5.13 shows the comparison between the uploading and downloading speed for files of
different sizes, where we can observe that copying files from local to HDFS is much faster than
vice-versa.
Figure 5.12. Total transfer time for performed experiments for different Cloud Storage Providers.
Figure 5.13. Average Upload/Download Time (s) for different filesize (KB) for HDFS.
Figure 5.14. Average putObject/getObject Time (s) for different filesize (KB) for Amazon S3.
The graph in Figure 5.14 illustrates a comparison between the transfer times (measured in seconds) obtained on the client data set generated for Amazon S3. Based on the obtained results, it can be seen that the time for read operations is significantly lower than that for write operations as the size of the transferred object increases.
Figure 5.15. Average Upload/Download Time (s) for different filesize (KB) for Google Drive.
Figure 5.15 presents the average upload/download time (s) for different file sizes (KB) for Google Drive. It can be seen that after 512 KB the upload time grows proportionally with the size, with only small deviations from this rule. The download time grows slowly for files up to 1024 KB and after that it approximately doubles with each file-size increase. It can be observed that uploads are approximately five times slower than downloads. The medium-speed connection on which we also tested the program is rated at 20 Mbps, and on it the speeds improved. The times are even less than double because this connection seems to be more reliable and achieves higher speeds when connecting to the Google servers. On this connection download speeds are also faster than upload speeds, although they are similar to the times obtained when manually uploading to Drive; uploading is approximately three times slower than downloading.
Figure 5.16 presents the average upload/download time (s) for different file sizes (KB) for SkyDrive. For larger files the speeds are much lower, both for download and especially for upload. Values for files smaller than 8 KB may be excluded from the results, as they were inconclusive.
Figure 5.16. Average Upload/Download Time (s) for different filesize (KB) for SkyDrive.
5.2.2 Cost Efficient Cloud-based Service Oriented Architecture for Water Pollution Prediction
River water quality nowadays represents a major concern. Performing water monitoring with wireless sensor networks in order to detect pollutants is not enough. Furthermore, the solution of placing more sensors for monitoring is not cost efficient, as this type of sensor is very expensive. In the case of a pollution accident on a river, it is mandatory to alert people and, more importantly, to predict the evolution of the pollutant concentration downstream. It is also essential to minimize the time frame for sending the alert to the possibly affected people. In this section, we propose a cost-efficient Cloud-based service-oriented architecture for a water pollution prediction and alert system. The cost efficiency of our approach comes from three main directions. The first is the use of fewer water-monitoring-specific sensors, made possible by the use of complex hydraulic models. The second is the construction of a knowledge base with pre-run scenarios of pollution propagation events. The third is the use of Cloud computing services, which are proven to be cost effective. The novelty of our approach comes from the integration of different Cloud computing platforms and services in order to achieve scalability and real-time provisioning of resources, to simplify the deployment and management of resources and applications, and to obtain a better cost/performance ratio.
Water is one of the most important natural resources, being the principal source that sustains life for humans, animals and plants. Among the sources of water pollution we can enumerate incidents such as chemical discharges, agricultural chemicals from farms, petroleum leaks and spills, etc. All these pollution factors make the process of water monitoring and decision support a difficult task. Preserving water quality in rivers is a mandatory requirement in order to support water-related activities. Moreover, important pieces of EU environmental legislation, such as the Bathing Water Directive 2006/7/EC (EC, 2006a), the Shellfish Waters Directive 2006/113/EC (EC, 2006b) and the Water Framework Directive 2000/60/EC (EC, 2000), have been adopted to reinforce water quality preservation.
The traditional methods for assessing water quality involve the manual collection of water samples, which are afterwards analyzed in a laboratory. These methods do not provide real-time results for the collected data and, more importantly, do not offer the results within a useful period of time in case of a pollution event.
Nowadays, modern methods make use of technologies such as wireless sensor networks to monitor water resources. Complex sensors that can sense different water parameters, such as temperature, pH, conductivity, dissolved oxygen, nitrogen, phosphorous and turbidity, are used by decision support systems in order to provide real-time assistance in the decision-making process. Environmental data comes from multiple sensors distributed in different locations. This is possible due to the wireless sensor network (WSN) concept, which is based on the fact that inexpensive tiny sensors capture environmental parameters such as humidity, daylight, pressure, etc., and collaborate to form a wireless sensor network. WSNs are considered to be the basic infrastructure of many smart environmental monitoring applications (e.g., air pollution, water contamination, volcanoes, twisters, floods, fires). Collected data is sent to a centralized node, called the sink node, in order to be analyzed and to allow the necessary actions to be taken.
Water monitoring represents a special case of environmental monitoring, as the sensors used to collect water-related parameters are expensive. Besides that, a sensor node is characterized by many limitations, such as scarce energy sources, a small memory footprint and limited processing capabilities. Furthermore, sensors are deployed in large numbers. Consequently, there are many other issues with WSNs, such as scalability, data reliability, security and key management,
data analysis, and efficient multi-hop routing.
Our proposed approach takes these problems and limitations into consideration and overcomes them through a cost-efficient Cloud-based service-oriented architecture, which uses fewer sensors and makes use of data generated by predefined water pollution scenarios in order to provide the results in a shorter time and to improve the QoS of the offered service.
Situated at the intersection of cyber-physical and Big Data platforms, Cloud systems support efficient and effective workflows. With the increased computing power available for executing workflows, the need for the efficient use of that computation power also appears. The experimental approach considers several workflows, such as decision support, real-time alerts and visualization, using computing, storage and network resources in Cloud platforms. Cost-effective applications running on Cloud computing platforms (and not only there) have become a necessity nowadays.
The advantages of Cloud services and technologies as a core component for cyber-infrastructure and Big Data platforms have been extensively studied over time [180], [192], [126]. These studies identified the following benefits: scalability, provisioning of resources in real time, simplified deployment and management of resources and applications, a better cost/performance ratio, etc. Our approach is to deploy our platform in a Cloud environment in order to benefit from all these advantages.
Water-related problems are intensely studied in the scientific literature. Among the treated research issues we can enumerate decision support systems for water management, flood forecasting systems, water quality prediction improvement, suitability for recreational activities, conservation, etc.
The authors of [114] propose a decision support system for the management of the Elbe river basin. The system is built on three conceptual models: a model for nutrient discharges, a model for waste water pathways and a model for aquatic fate assessment, coupled with a GIS-based discrete digitized river network. The system is mainly built for long-term predictions such as erosion with phosphate.
In [165] a flood forecasting system is presented which includes a hydrological forecast model and a 1D/2D hydraulic model. The system is designed for the Songhua River Basin in north-eastern China and is based on three subsystems:
• hydrological and hydraulic forecast models;
• a real-time GIS-based flood management and forecasting system;
• a web-based decision support system that integrates the aforementioned components.
The system can run in two modes: real-time and pre-cooked scenarios. In the first mode, the system makes use of MIKE FLOOD WATCH, which is a forecast framework. In the case of pre-cooked scenarios, spatially detailed, two-dimensional simulation results are used in real-time mode, while the real-time scenarios are based on one-dimensional forecast models.
In [113] the authors discuss the improvement of water quality prediction for the Cauca River in Colombia by integrating prediction models. In this way they aim to improve a pollution early warning system called EWS-Centinela, and they also show the impact of integrating prediction models into the existing system. The system uses as its main data source a database from the Milán Station located in the Navarro sector and the intake of the Puerto Mallarino Treatment Plant. The proposed integration solution shows effects in terms of the quality and quantity of discharge, and it is also a tool for decision-making in case of pollution. Overall, this system is not scalable and cannot be applied to other rivers, as the main data sources are water treatment plants.
The authors of [172] simulate and analyze the effect of pollutants on water quality in case of accidents with chemical substances. They study water pollution risk simulation and prediction in the main canal of the South-to-North Water Transfer Project. In order to calculate the hydraulic characteristics of the river channel, the authors used the MIKE11 HD tool. The hydraulic model includes six types of hydraulic structures (e.g. inverted siphons, gates, highway bridges, culverts and tunnels). The obtained results showed that the computed values agreed well with the measured values. Three scenarios with different discharge rates (12-17 m3/s, 40 m3/s and 60 m3/s) and three pollution loading concentration levels (5 t, 10 t and 20 t) are considered for several substances (i.e., phosphate fertilizer, cyanide, oil and chromium solution), and based on the obtained results, emergency measures were proposed.
In [6] the authors monitor the Bertam River, in the Cameron Highlands, in order to assess its suitability for recreational activities and conservation. The authors selected seven sampling points on the river and its tributaries; dissolved oxygen, biochemical and chemical oxygen demands (BOD and COD), total suspended solids (TSS), ammoniacal nitrogen (NH3–N) and pH were measured, and the water quality index (WQI) was computed during high and average water flow. In addition, water quality surface data were generated from the sampling points using the interpolation techniques of a geographic information system to predict values at unknown locations.
In the CyberWater project, mathematical modeling of hydraulic and advection-dispersion processes plays an important role: it uses measured and forecasted data for certain parameters of the system and, after the numerical simulation, provides information for every grid node and every time step within the boundary of the system, for every analyzed parameter. In addition, hypothetical scenarios may be carried out in order to obtain knowledge about further possible evolutions of the system and their consequences.
The study area is located on the Dambovita River downstream of Bucharest, between the Glina waste water treatment plant (WWTP) and the Budesti hydrometric station, as shown in Figure 5.17.

Figure 5.17. Modeled reach between Glina and Budesti [33]
The hydraulic model was created in Mike 11 for steady and unsteady state. As upstream boundary conditions the hydrographs of the Dambovita River and its tributaries were used, while the rating curve from the Budesti station was used for the downstream boundary.

Figure 5.18. CyberWater decision support system
At the beginning of heavy rains, the pollutant concentration increases dramatically. Due to
high discharges the waste water treatment capacity is exceeded and only 9 m3/s can be processed,
while the difference is bypassed directly downstream in the river. Accidental pollution could also
occur in these situations, mainly due to oil spills or detergents [33].
The CyberWater decision support module is the main component of the water pollution decision support system; it provides assistance to the user in the decision-making process. The architecture of the system is presented in Figure 5.18. The system relies on two main components of the CyberWater platform, the storage system and the decision support module, plus an external component, the MIKE11 pollution propagation service. Regarding the Cloud integration of this architecture with its three main components, two approaches are possible: the first is to place all three components at one Cloud provider, and the second is to adopt a hybrid Cloud model. In our setup we adopt the second approach: the Mike 11 propagation service and the CyberWater Pollution Propagation Module are hosted in a private Cloud facility, while the Storage System component resides on a public Cloud computing platform.
From this module, the user has the possibility, through a user interface, to run a new pollution propagation scenario in case of a pollution event and to visualize the results. In the first step, the user sets the necessary parameters. In step two, a file is updated by entering time series data on the normal flow of the river, as well as the amount of pollutant discharged into the river and the concentration of that pollutant. In step three, the simulation file is updated, modifying the start and end dates of the model run. In the final step, the model execution is launched. This user interface is presented in Figure 5.19.

Figure 5.19. Web user interface
The cost effectiveness of our approach comes from three major components:
• first, due to the use of MIKE11, which enables the building of complex and accurate hydraulic models, we can use fewer of the very expensive water monitoring sensors;
• second, we plan to run multiple pollution scenarios that will form a knowledge base for the system, which will improve with each run of the MIKE11 service; this knowledge base will be the first place to look for answers in case of a pollution event. In this way we will improve the QoS, since in distributed systems failures can happen at any time [96], and will reduce the cost of running the pollution propagation service;
• third, the cost efficiency comes from the use of Cloud computing services, which are proven to be cost effective.
The platform is designed to function in real time and is hosted in a Cloud computing environment. It aims to provide the following services: permanent monitoring, alerting and a decision support system (DSS) in case of accidental pollution or flooding events. In order to use the facilities of mathematical modeling in real time, we need software that can be interfaced with other programs. Mike 11 is one such piece of software, and this feature allows us to integrate it into the CyberWater platform.
In order to explain the way in which the CyberWater platform interacts with the Mike 11 models, we created the workflow diagram presented in Figure 5.20. Explanations based on the diagram are provided below.

Figure 5.20. Interaction between CyberWater platform and Mike11
Let us suppose that the model was already created and updated and that a simulation has been performed. New flow data will be received in the sections with hydrometric stations. The next step is to query the system to find out whether a pollution event happened in the last time interval. In case of an affirmative response, the characteristics of the event will be read: 1) the moment when the accident occurred; 2) the location on the river where the pollutant was discharged; 3) the type (name) of the chemical compound; 4) the liquid discharge and 5) the concentration of the chemical compound in the liquid. Further, the model setup will be updated and run. After the numerical simulation, information about the evolution in time and space of state variables like water discharge, water level and pollutant concentration will be available to users.
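The steps above can be rendered schematically as follows; the object and method names are placeholders, and in the real platform the model update and run are performed through the DHI API over the binary MIKE 11 setup files, as described in the next paragraph.

```python
# Schematic rendering of the interaction workflow (placeholder names only).
def simulation_step(cyberwater, mike11):
    flows = cyberwater.read_new_hydrometric_data()    # new flow data per station
    mike11.append_time_series(flows)                  # update the model time series

    event = cyberwater.query_pollution_event()        # event in the last interval?
    if event is not None:
        mike11.set_pollution_source(
            moment=event.moment,                      # 1) when the accident occurred
            location=event.river_km,                  # 2) discharge location on the river
            compound=event.compound,                  # 3) chemical compound (type/name)
            discharge=event.liquid_discharge,         # 4) liquid discharge
            concentration=event.concentration,        # 5) pollutant concentration
        )

    mike11.run()                                      # numerical simulation
    cyberwater.publish(mike11.results())              # discharge, level, concentration
```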
Let us suppose, again, that the model was already created and updated and that a simulation has been performed up to the present moment, meaning up to step i-1. New flow data will be received in the sections with hydrometric stations (see element 2 in the diagram). The next step is the update of the time series (element 3), by adding the new values at the end of the registered values. This operation will be done by using an API (Application Programming Interface) provided by the company that produces Mike (DHI). The file to be updated (edited) is part of the model and is binary serialized.

Figure 5.21. Upstream boundary pollutograph
At this stage, the CyberWater platform has access to all the essential information of the mathematical model: input data, the parameters that define the model and output data. By using a GUI (Graphical User Interface), the user can display on the screen plenty of graphics, tables and other elements that can be useful for understanding the behavior of the natural system and for helping in the decision-making process.
A number of 18 scenarios were examined, considering 6 variants of constant flow (6, 10, 20, ..., 50 m3/s) and, for each discharge, 3 values of concentration (1, 10 and 100 mg/l). The total polluted inflow was considered equal to the capacity of a common tank (30,000 l). As an example, the pollutograph for one of the examined scenarios is presented in Figure 5.21.
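The scenario matrix for the knowledge base can be enumerated as in the sketch below; the intermediate discharge values (30 and 40 m3/s) are inferred from the "6, 10, 20, ..., 50" pattern and should be taken as an assumption.

```python
# Sketch of enumerating the 18 pre-run scenarios (6 constant discharges x
# 3 pollutant concentrations, each with a 30,000 l polluted inflow).
from itertools import product

DISCHARGES_M3S = [6, 10, 20, 30, 40, 50]   # 30 and 40 assumed from the series
CONCENTRATIONS_MGL = [1, 10, 100]
TANK_VOLUME_L = 30_000

scenarios = [
    {"flow_m3s": q, "conc_mgl": c, "spill_l": TANK_VOLUME_L}
    for q, c in product(DISCHARGES_M3S, CONCENTRATIONS_MGL)
]
assert len(scenarios) == 18   # matches the number of examined scenarios
```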
The simulations provided, among other results, the moment at which the pollutant reached each section, an example for a concentration of 100 mg/l being presented in Figure 5.22.
It can be noticed that the greater the discharge, the shorter the propagation time. For instance, at 5000 m upstream of the confluence with the Arges River, the propagation time is 8 hours for 6 m3/s, while for 50 m3/s it is only 4 h and 40 minutes. The explanation is related to both advection and hydrodynamic dispersion (even though the water velocity is much greater for high discharges than in the case of low flow). For smaller concentrations, the same graph shows that the pollutant propagation time increases by 12% for 6 m3/s, while for 50 m3/s the increase is only 8%.
The results of pollutant transport are visualized in different formats on the platform, the most relevant being the PDF map, on which the concentration at a certain moment is represented on a color scale from blue to red. An example of a pollution map along the Dambovita River, 5 hours after the tank accident, is presented in Figure 5.23. The pollutant was transferred downstream (the maximum concentration represented in red), while immediately downstream of the truck accident the concentration decreased, being represented in yellow.
Figure 5.22. Delay of pollutant transfer along Dambovita River (c = 100 mg/l)

This graph will be used by the decision makers to anticipate the maximum concentration and the arrival time downstream of a possible pollutant discharge in the river and to take all the necessary measures (like interrupting water abstraction from the river during the event) in order to diminish the pollution consequences.
5.3 Conclusions and Open Issues
In Section 5.1 we analyzed the problem of modeling and integrating heterogeneous data sources in the context of cyber-infrastructure systems and Big Data, as it is a significant research issue. In this type of system, information represents a key factor in the decision-making process; hence the urgent need to integrate and operate on heterogeneous data in order to provide a unique and unified view of the information. Benefits such as cost reduction, faster and better decision-making, and the development of new products and services can thus be obtained.
The unified approach for Big Data in environmental monitoring relies on a separation of heterogeneous information into data, metadata and semantics. In terms of data management, operations such as integration, reduction, querying, indexing, analysis and mining must be performed. Every operation corresponds to a specific layer and represents an important stage in the processing pipeline. Data integrity checking and validation also play an important role, as there are many errors and much missing data in environmental datasets. Visualization tools play a central role, as the front-end interface can offer a unified and uniform view of the entire system, permitting users to get useful insights from the data and take the necessary actions.
A key research issue nowadays is also to find an approach that combines relational database management systems and NoSQL database systems in order to benefit from both paradigms. Equally important is to have new methods for query optimization.
Figure 5.23. Example of generated map of pollution at a certain moment along Dambovita River (5 hours after the tank accident)

The CyberWater case study presented in this chapter highlights the need for methods and tools that integrate heterogeneous data sources. These need to be integrated in order to build a robust and resilient system that offers support in the decision-making process for water resource management. Although there are many tools for integrating various heterogeneous data sources, they do not provide the performance needed by a real-time system. New scalable and resilient methods and tools that handle data source heterogeneity and filter out uncorrelated data need to be developed.
In Section 5.2 we presented a system for water pollution propagation. This approach is cost efficient, is based on Cloud services and is service oriented. The analyzed scenarios represent a first phase in developing the database; more complex scenarios will follow, considering unsteady flows on the Dambovita River and its tributaries for a large set of pollution events.
The issues addressed in this section were also presented in the literature in various forms:
methodologies, applications, etc. The authors of [43] review state-of-the-art research issues in
the field of analytics over Big Data. Among the open problems and research trends in Big Data
analytics identified by the authors are the following:
• Data source heterogeneity and incongruence;
• Filtering-out uncorrelated data;
• Strongly unstructured nature of data sources;
• High scalability;
• Combining the benefits of RDBMS and NoSQL database systems;
• Query optimization.
6 | Conclusions
This chapter presents an overview and the conclusions of our thesis, as well as a summary of our contributions. It also proposes possible future work.
In Chapter 1, we presented the Cloud computing model in the context of Big Data applications and cost optimization challenges. The context and motivation section described the key factors that call for cost optimization in cloud storage systems.
In Chapter 2, we presented the cost optimization problem in cloud storage systems in relation with the characteristics of these systems. We also presented an analysis of Cloud storage services, a model to compute the user cost and the main challenges for cost-efficient storage services. Thus, the main challenges in building cost-efficient Cloud-enabled applications and platforms are to take advantage of the scalability, agility and reliability of the Cloud. We also presented an analysis of the impact of virtual machine heterogeneity on datacenter power consumption in the case of data-intensive jobs.
In Chapter 3 we proposed two solutions for cost optimization in cloud storage systems. The first solution is a cost-efficient re-scheduling heuristic based on cost-aware data storage. The proposed algorithm is generic and can be used with a large variety of scheduling heuristics. The proposed evaluation model is based on a series of metrics such as execution time, makespan and load balancing. This model led to a classification of the scheduling algorithms used together with the proposed re-scheduling procedure, based on their performance in case of errors and re-scheduling. The assessments concluded with the observation that the algorithms with the best performance when no errors occur tend to achieve the best scores also when re-scheduling is needed. The second proposed solution is a task migration heuristic for cost-effective data processing. The problem that we tried to solve was scheduling tasks in a heterogeneous Big Data environment, taking into account the data dependencies between tasks and the deadlines of the tasks. Moreover, in the scheduling phase we considered the requirements of the tasks and the capabilities of the available resources. We described how this scheduling algorithm can be integrated with various Big Data platforms such as Hadoop, the OpenStack infrastructure as a service and the BlueMix platform as a service. With the help of our built-in task scheduling simulator, we compared the results of the proposed algorithm with the min-min and min-max heuristics. This way we proved that our hybrid algorithm meets the deadlines better than the other two scheduling algorithms and obtains a good QoS.
In Chapter 4 we modeled the cost optimization problem from the cloud user perspective. We proposed two methods for budget reduction in cloud storage systems. The first one is a binary linear programming method for budget-aware selection of storage services in multi-cloud environments. Based on this method we conclude that the final gain depends on the initial data demand; however, in some cases it is surprisingly high (e.g. 80%), so this method can be successfully used for any scenario considering public Clouds. The second method is for storage service selection in the presence of budget constraints. In order to optimize the cost of storage services while meeting the constraints on budget and QoS parameters such as latency, we proposed a greedy algorithm. The results show that the allocation of cloud providers depends on the transfer costs and on the
available budget. The proposed method satisfies the data demand.
In Chapter 5 we presented and analyzed the case study of the CyberWater project. We presented the problem of Big Data modeling, integration and reduction in the context of cyber-infrastructure systems. Then we presented the proposed Cloud storage solution for the CyberWater project. The motivation of this chapter comes from the new age of applications (social media, smart cities, cyber-infrastructures, environment monitoring and control, healthcare, etc.), which produce big data and many new mechanisms for data creation, rather than new mechanisms for data storage.
6.1 Contributions of PhD Thesis
This thesis has several original contributions in the research area of cost optimization in Cloud computing systems through resource management techniques and methods. They can be summarized
as follows:
• Critical analysis of Cloud computing cost models - we analyzed cost models depending on the types of services offered in Cloud computing. We also reviewed the cost factors for Cloud computing services and presented how these are distributed between the Cloud provider and the Cloud user for every type of service (e.g. IaaS, PaaS and SaaS). Energy consumption is the main factor in the cost structure of a given service, having a major contribution to the total cost;
• Key issues and challenges for an energy-efficient cloud storage service - we presented a short analysis of a Cloud storage service and a model to compute the user cost, and we described the main challenges for storage services, which are to take advantage of the scalability, agility and reliability of the Cloud;
• Analysis of the impact of virtual machine heterogeneity on power consumption in datacenters, for the case of data-intensive applications - we showed that a good balance between workloads, usage patterns and virtual machine computing power is mandatory in order to achieve power efficiency. The results show that the power consumption is proportional to the degree of heterogeneity. As we showed, this has a big impact on the power consumption of a set of virtual machines that perform data-intensive tasks. It also has an impact on cost, because energy consumption represents an important cost factor;
• Efficient re-scheduling service for cost-aware Cloud storage systems - we presented an efficient re-scheduling heuristic based on cost-aware data storage. The re-scheduling algorithm proposed in this thesis has an important characteristic: it is generic, because it can be used with a large variety of scheduling heuristics. The evaluation model led to a classification of the scheduling algorithms used together with the proposed re-scheduling procedure, based on their performance in the case of errors and re-scheduling. The assessments concluded with the observation that the algorithms with the best performance when no errors occur tend to achieve the best scores also when re-scheduling is needed. By defining and implementing these classification methods, one can analyze which combination of scheduling algorithms and re-scheduling strategies is the most appropriate to use, depending on the type of graph;
• Cost-efficient hybrid scheduling algorithm for task migration - we achieved the goal of designing and benchmarking a hybrid scheduling algorithm for Many Task Computing for cost-effective data processing. The problem that we tried to solve was scheduling tasks in a heterogeneous Big Data environment, taking into account the data dependencies between tasks and the deadlines of the tasks. We described how this scheduling algorithm can be integrated with various Big Data platforms such as Hadoop, the OpenStack infrastructure as a service and the BlueMix platform as a service. With the help of our built-in
task scheduling simulator, we compared the results of the proposed algorithm with the min-min and min-max heuristics. In this way we proved that our hybrid algorithm meets the deadlines better than the other two scheduling algorithms and obtains a good QoS;
• Cost-efficient method for budget-aware storage service selection - we proposed a binary linear optimization method for the buying cost of Cloud storage capacity from public Cloud providers. We can conclude that the final gain depends on the initial data demand; however, in some cases it is surprisingly high (e.g. 80%), so this method can be successfully used for any scenario considering public Clouds;
• Cost-efficient method for the selection of storage services in the presence of budget constraints - we modeled this as a multi-objective optimization problem. In order to optimize the cost of storage services while meeting the constraints on budget and QoS parameters such as latency, we proposed a greedy algorithm. We were able to find near-optimal solutions while relaxing some constraints. The results show that the allocation of cloud providers depends on the transfer costs and on the available budget;
• Application of the proposed cost reduction methods in a research project - we presented how the proposed methods can be applied in a real-life project. The CyberWater case study highlights the need for cost-efficient data processing methods, which need to be integrated in order to build a robust and resilient system that offers support in the decision-making process for water resource management;
• Analysis of the data requirements of the CyberWater research project - we analyzed the problem of modeling and integrating heterogeneous data sources in the context of cyber-infrastructure systems and Big Data;
• Cost-efficient storage architecture for the CyberWater project - we presented a cost-efficient
storage architecture for water resources data management and processing. Based on this
architecture we presented a service-oriented system for the prediction of water pollution propagation.
6.2 Future Work
6.2.1 Resource-Aware Reduction Techniques for Big Data
Decision-making is critical in real-time systems and mobile systems [125] and has an important role
in business [60]. This process uses data as input, but not the whole data set; a representative and
relevant data set must be extracted from the data. This is the subject of data reduction. On the other
hand, recognizing the significance of crowd data is another challenge with respect to making sense of Big
Data: it means distinguishing "wrong" information from "disagreeing" information and finding metrics
to determine certainty [94]. Since we face a large variety of solutions for specific applications
and platforms, a thorough and systematic analysis of the existing data reduction models,
methods and algorithms used in Big Data is needed [31, 32].
Current research focuses on dimension reduction, which targets spatio-temporal data and
processes and is achieved in terms of parameters, grouping or state. Guhaniyogi et al. [83] address
dimension reduction in both the parameter and the data space. Johnson et al. [110] create clusters
of sites based upon their temporal variability. Leininger et al. [130] propose methods to model
extensive classification data (land-use, census, and topography). Wu et al. [217], considering the
problem of predicting migratory bird settling, propose a threshold vector-autoregressive model for
the Conway-Maxwell Poisson (CMP) intensity parameter that allows for regime switching based
on climate conditions. The goal of Dunstan et al. [65] is to study how communities of species respond to
environmental changes. In this respect they classify species into one of a few archetypal forms of
environmental response using regression models. Hooten et al. [95] are concerned with ecological
diffusion partial differential equations (PDEs) and propose an optimal approximation of the PDE
solver which dramatically improves efficiency. Yang et al. [222] are concerned with prediction in
ecological studies based on high-frequency time signals (from sensing devices), and in this respect
they develop nonlinear multivariate time-frequency functional models.
Big Data Reduction Techniques
In Data Analysis, as part of Qualitative Research for large datasets, two techniques were proposed
in past decades: Content Analysis (counting the number of occurrences of a word in a text, but without
considering the context) and Thematic Analysis (themes are patterns that occur repeatedly in
data sets and which are important to the research question). The first form of data reduction is
to decide which data from the initial data set is going to be analyzed (since not all data may be
relevant, some of it can be eliminated). In this respect, a method for categorizing data should be
defined [127].
• Structural coding - a code related to a question is applied to the responses to that question in the
text. Data can then be sorted using these codes (structural coding acting as labeling).
• Frequencies - word counting can be a good method to determine repeated ideas in a text. It
requires prior knowledge about the text, since one should know beforehand the keywords that will
be searched. An improvement is to count not words but code applications (themes).
• Co-occurrence - multiple codes exist inside a segment of text. This allows Boolean queries (find
segments with code A AND code B).
• Hierarchical Clustering - using co-occurrence matrices (or code similarity matrices) as input,
the goal is to derive natural groupings (clusters) in large datasets; a matrix element value
v(i, j) = n means that code i and code j co-occur in n participant files (see the sketch after this list).
• Multidimensional Scaling - the input is also a similarity matrix, and ideas that are considered
close to each other are represented as points with a small distance between them. This makes it
intuitive to visualize the clusters graphically.
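As an illustration of the hierarchical clustering item above, the following is a minimal sketch in Python; the code co-occurrence matrix and the distance threshold are hypothetical values chosen only for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical co-occurrence matrix: v[i, j] = number of participant
# files in which code i and code j co-occur.
cooccurrence = np.array([
    [10, 8, 1, 0],
    [ 8, 9, 2, 1],
    [ 1, 2, 7, 6],
    [ 0, 1, 6, 8],
], dtype=float)

# Turn co-occurrence (similarity) into a distance matrix in [0, 1].
similarity = cooccurrence / cooccurrence.max()
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)

# Condensed distance vector required by scipy's linkage().
condensed = squareform(distance, checks=False)

# Average-linkage hierarchical clustering, cut at a chosen threshold.
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=0.5, criterion="distance")
print(labels)  # e.g. codes 0-1 in one cluster, codes 2-3 in another
```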
Big Data does not raise only engineering concerns (how to manage the large volume of data
effectively) but also semantic concerns (how to extract meaningful information regardless of
implementation or application-specific aspects). A meaningful data integration process requires the
following stages, not necessarily in this order [16] (a minimal sketch follows the list):
• Define the problem to be resolved;
• Search the data to find the candidate datasets that meet the problem criteria;
• ETL (Extract, Transform and Load) of the appropriate parts of the candidate data for future
processing;
• Entity Resolution - check if data is unique, comprehensive and relevant;
• Answer the problem - perform computations to give a solution to the initial problem.
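The sketch below strings the five stages together with pandas, under loose assumptions; the dataset file names, the selected columns and the deduplication key are hypothetical.

```python
import pandas as pd

# Stage 1 - Define the problem: average water quality per station (hypothetical).
PROBLEM_COLUMNS = ["station_id", "timestamp", "ph"]

# Stage 2 - Search: candidate datasets that expose the needed columns.
candidates = ["sensors_2015.csv", "sensors_2016.csv"]  # hypothetical files

# Stage 3 - ETL: extract only the relevant parts of each candidate.
frames = [pd.read_csv(path, usecols=PROBLEM_COLUMNS) for path in candidates]
data = pd.concat(frames, ignore_index=True)

# Stage 4 - Entity resolution: keep unique, comprehensive, relevant records.
data = data.drop_duplicates(subset=["station_id", "timestamp"])
data = data.dropna(subset=["ph"])

# Stage 5 - Answer the problem: a simple aggregation as the solution.
answer = data.groupby("station_id")["ph"].mean()
print(answer)
```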
Using the Web of Data, which according to some statistics contains 31 billion RDF triples, it
is possible to find all the data about people and their creations (books, films, musical creations, etc.),
translate the data into a single target vocabulary, discover all the resources about a specific entity
and then integrate this data into a single coherent representation. RDF and Linked Data (such
as the pre-crawled web data sets BTC 2011, with 2 billion RDF triples, or Sindice 2011, with 11 billion
RDF triples extracted from 280 million web pages annotated with RDF) are schema-less models
that suit Big Data, considering that less than 10% of it is genuinely relational data. The challenge
is to combine DBMSs with reasoning (the next smart databases) that goes beyond OWL, RIF
or SPARQL, and for this reason use cases are needed from the community in order to determine
exactly what requirements the future DB must satisfy. A web portal should allow people to search
keywords in ontologies, in the data itself and in the mappings created by users [16].
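As a hedged illustration of querying Linked Data in this spirit, the snippet below uses the rdflib Python library on a small in-memory graph; the example vocabulary, resource URIs and literals are invented for the sketch.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF

# Build a tiny in-memory graph (a stand-in for a crawled Linked Data set).
g = Graph()
EX = Namespace("http://example.org/")          # hypothetical vocabulary
author = URIRef("http://example.org/people/1")
g.add((author, FOAF.name, Literal("Ada Lovelace")))
g.add((author, EX.created, Literal("Notes on the Analytical Engine")))

# Discover all statements about a specific entity with a SPARQL query.
query = """
    SELECT ?p ?o WHERE {
        <http://example.org/people/1> ?p ?o .
    }
"""
for predicate, obj in g.query(query):
    print(predicate, obj)
```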
Descriptive analytics
Descriptive analytics is oriented towards descriptive statistics (counts, sums, averages, percentages, min,
max and simple arithmetic) that summarize certain groupings or filtered subsets of the data; these
are typically simple counts of some functions, criteria or events, for example the number of posts on
a forum, the number of likes on Facebook or the number of sensors in a specific area. The techniques
behind descriptive analytics are standard aggregation in databases, filtering techniques and basic
statistics. Descriptive analytics use filters on the data before applying specific statistical functions.
We can use geo-filters to get metrics for a geographic region (a country) or temporal filters to extract
data only for a specific period of time (a week). More complex descriptive analytics include
dimensionality reduction or stochastic variation.
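A minimal sketch of this filter-then-aggregate style of descriptive analytics, assuming a hypothetical table of sensor readings with country and timestamp columns:

```python
import pandas as pd

# Hypothetical sensor readings: one row per measurement.
readings = pd.DataFrame({
    "country":   ["RO", "RO", "DE", "RO"],
    "timestamp": pd.to_datetime(
        ["2016-03-01", "2016-03-02", "2016-03-02", "2016-03-20"]),
    "value":     [7.1, 7.4, 6.9, 8.0],
})

# Geo-filter (one country) and temporal filter (one week).
week = (readings["timestamp"] >= "2016-03-01") & \
       (readings["timestamp"] <  "2016-03-08")
subset = readings[(readings["country"] == "RO") & week]

# Basic descriptive statistics on the filtered data.
print(subset["value"].agg(["count", "sum", "mean", "min", "max"]))
```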
Dimensionality reduction represents an important tool in information analysis. Scaling down data
dimensions is also important in recognition and classification processes. It is important to notice
that sparse local operators, which imply less than quadratic complexity, and faithful multi-scale
models make the design of dimension reduction procedures a delicate balance between modeling
accuracy and efficiency. Moreover, the efficiency of dimension reduction tools is measured in terms
of memory and computational complexity. The authors provide theoretical support and demonstrate
that by working in the natural eigenspace of the data one can reduce the process complexity
while maintaining the model fidelity [4].
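As an illustration of dimensionality reduction by projecting onto a data eigenspace (the general idea, not the specific method of [4]), a short PCA sketch with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic high-dimensional data: 1000 samples with 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))

# Project onto the leading eigenvectors of the data covariance,
# keeping enough components to explain 90% of the variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_.sum())
```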
Stochastic variational inference can be used for Gaussian process models in order to enable the
application of Gaussian process (GP) models to data sets containing millions of data points. The
key finding is that GPs can be decomposed so that they depend on a set of globally relevant
inducing variables, which factorize the model in the manner necessary to perform variational
inference. These expressions allow the transfer of a multitude of Gaussian process techniques to big
data [68].
Predictive analytics
Predictive analytics, which comprises probabilistic techniques, refers to: (i) temporal predictive models
that can be used to summarize existing data and then to extrapolate to the future, where data
does not exist; (ii) non-temporal predictive models (e.g., a model that, based on someone's existing
social media activity data, predicts his/her potential to influence [26]; or sentiment analysis).
The most challenging aspect here is to validate the model in the context of Big Data analysis. One
example of such a model, based on clustering, is presented in the following.
A novel technique for effectively processing big graph data on the Cloud addresses the challenges
that arise when data is processed in heterogeneous environments, such as parallel memory bottlenecks,
deadlocks and inefficiency. The data is compressed based on spatial-temporal features,
exploring the correlations that exist in spatial data. Taking into consideration those correlations, the graph
data is partitioned into clusters where the workload can be shared by inference based on time
series similarity. The clustering algorithm compares the data streams according to the topology
of real-world streaming data graphs. Furthermore, because the data items
in streaming big data sets are heterogeneous and carry very rich order information themselves, an
order compression algorithm is developed to further reduce the size of big data sets. The clustering
algorithm runs on the cluster-head. It takes a time series set X and a similarity threshold as
inputs. The output is a clustering result which specifies each cluster-head node and its related
leaf nodes [188]. A threshold-based sketch of this kind of clustering is given below.
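The following is a hedged sketch of threshold-based time-series clustering in the spirit of [188] (not the authors' actual algorithm): each series joins the first cluster-head it is sufficiently similar to, otherwise it becomes a new cluster-head. The similarity measure and the threshold value are assumptions.

```python
import numpy as np

def similarity(a, b):
    # Assumed similarity: inverse of the mean absolute difference.
    return 1.0 / (1.0 + np.mean(np.abs(np.asarray(a) - np.asarray(b))))

def threshold_clustering(series_set, threshold):
    """Greedy clustering: returns (head_index, leaf_indices) pairs."""
    clusters = []
    for idx, series in enumerate(series_set):
        for head_idx, leaves in clusters:
            if similarity(series, series_set[head_idx]) >= threshold:
                leaves.append(idx)   # attach as a leaf of this cluster-head
                break
        else:
            clusters.append((idx, []))  # start a new cluster-head
    return clusters

# Toy time series set X and a hypothetical similarity threshold.
X = [[1, 2, 3], [1, 2, 4], [9, 9, 9], [8, 9, 10]]
print(threshold_clustering(X, threshold=0.5))  # e.g. [(0, [1]), (2, [3])]
```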
The prediction models used by predictive analytics should have the following properties:
simplicity (a simple mathematical model for a time series), flexibility (the possibility to configure and
extend the model), visualization (the evolution of the predicted values can be seen in parallel with
the real measured values) and computation speed (considering full vectorization techniques for
array operations). Let us consider a data series $V_1, V_2, \ldots, V_n$ extracted by a descriptive analytics
technique. For the prediction problem we consider $P(V_{t+1})$, which denotes the predicted value for
the moment $t + 1$ (the next value). This value is:
$$PV_{t+1} = P(V_{t+1}) = f(V_t, V_{t-1}, \ldots, V_{t-window}),$$
where $window$ represents a specific interval with $window + 1$ values, and $f$ can be a linear function
such as the mean, median or standard deviation, or a complex function that uses bio-inspired techniques
(an adaptive one or a method based on neural networks).
The linear prediction can be expressed as follows:
$$PV_{t+1} = f_w = \frac{1}{window + 1} \sum_{i=0}^{window} w_i V_{t-i},$$
where $w = [w_i]_{0 \le i \le window}$ is a vector of weights. If $\forall i,\ w_i = 1$ then we obtain the mean function.
It is also possible to consider a specific distribution of weights, for example $w_i = t - i$.
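A minimal numpy sketch of this weighted linear predictor; the window length, the illustrative series and the weight schemes below follow the formulas above and are chosen only for the example.

```python
import numpy as np

def predict_next(values, weights):
    """Weighted linear prediction PV_{t+1} over the last len(weights) values."""
    window_plus_1 = len(weights)
    recent = np.asarray(values[-window_plus_1:], dtype=float)
    # The most recent value V_t gets weight w_0, so reverse the slice.
    return np.dot(weights, recent[::-1]) / window_plus_1

series = [10.0, 12.0, 11.0, 13.0, 14.0]   # illustrative measurements
window = 3                                 # window + 1 = 4 values used

mean_weights = np.ones(window + 1)                 # w_i = 1 -> mean function
print(predict_next(series, mean_weights))          # 12.5

t = len(series) - 1
decay_weights = np.array([t - i for i in range(window + 1)], dtype=float)
print(predict_next(series, decay_weights))         # recent values weigh more
```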
Predictive analytics is very useful for estimating future behaviors, especially when the data
is not accessible (it is not possible to obtain or measure it) or is too expensive (e.g., in
money or time) to measure or compute. The main challenge is to validate the predicted
data. One solution is to wait for the real value (in the future), measure the error, and then
propagate it in the system in order to improve the future behavior. Another solution is to measure
the impact of the predicted data in the applications that use it.
Prescriptive analytics
Prescriptive analytics predicts multiple futures based on the decision maker's actions. A predictive
model of the data is created with two components: an actionable one (decision-making support) and
a feedback system (which tracks the outcome of the decisions made). Prescriptive analytics can be used for
recommendation systems because it is possible to predict the consequences based on the predictive
models used. A self-tuning database system is an example that we present in the following.
Starfish is a self-tuning system for big data analytics, built on top of Hadoop. This system is
designed in the spirit of self-tuning database systems [69]. Cohen et al. proposed the acronym
MAD (Magnetism, Agility, and Depth) in order to express the features that users expect from a
system for big data analytics [38]. Magnetism represents the property of a system that attracts all
sources of data, regardless of issues (e.g., the possible presence of outliers, unknown schema,
lack of structure, missing values) that keep many data sources out of conventional data warehouses.
Agility represents the property of a system to adapt in sync with rapid data evolution. Depth
represents the property of a system to support analytics needs that go far beyond conventional
rollups and drilldowns, to complex statistical and machine-learning analysis. Hadoop is a
MAD system that is very popular for big data analytics. This type of system poses new challenges
on the path to self-tuning, such as: data opacity until processing, file-based processing, and heavy
use of programming languages.
Furthermore, three more features in addition to MAD are becoming important in analytics
systems: data-lifecycle awareness, elasticity, and robustness. Data-lifecycle awareness means optimizing
the movement, storage, and processing of big data during its entire lifecycle, by going
beyond query execution. Elasticity means adjusting resource usage and operational costs to
the workload and user requirements. Robustness means that this type of system continues to
provide service, possibly with graceful degradation, in the face of undesired events like hardware
failures, software bugs, and data corruption.
The Starfish system has three levels of tuning: job-level tuning, workflow-level tuning, and
workload-level tuning. The novelty in Starfish's approach comes from how it focuses simultaneously
on different workload granularities (overall workload, workflows, and jobs, both procedural and
declarative) as well as across various decision points (provisioning, optimization, scheduling, and
data layout). This approach enables Starfish to handle the significant interactions arising among
choices made at different levels [69].
To evaluate a prescriptive analytics model we need a feedback system (to track the adjusted
outcome based on the actions taken) and a model for taking actions (take actions based on the
predicted outcome and on the feedback). We define several metrics for evaluating the performance
of prescriptive analytics. Precision is the fraction of the retrieved data that is relevant to
the user's information need. Recall is the fraction of the data relevant to the query that
is successfully retrieved. Fall-out is the proportion of non-relevant data that is retrieved, out
of all non-relevant documents available. The F-measure is the weighted harmonic mean of precision and
recall:
$$F_{measure} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}.$$
The general formula for this metric is:
$$F_\beta = \frac{(1 + \beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}.$$
This metric measures the effectiveness of retrieval with respect to a user who attaches β times as
much importance to recall as precision.
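A short sketch computing these retrieval metrics from hypothetical relevant/retrieved item sets (the sets and the collection size are invented for the example):

```python
def retrieval_metrics(relevant, retrieved, total_items, beta=1.0):
    """Precision, recall, fall-out and F-beta for one query (sets of item ids)."""
    relevant, retrieved = set(relevant), set(retrieved)
    true_positives = len(relevant & retrieved)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    non_relevant = total_items - len(relevant)
    fall_out = (len(retrieved - relevant) / non_relevant) if non_relevant else 0.0
    if precision + recall == 0:
        f_beta = 0.0
    else:
        f_beta = ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)
    return precision, recall, fall_out, f_beta

# Hypothetical example: 100 items, 10 relevant, 8 retrieved (6 of them relevant).
print(retrieval_metrics(relevant=range(10),
                        retrieved=[0, 1, 2, 3, 4, 5, 90, 91],
                        total_items=100, beta=1.0))
```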
As a general conclusion, we can summarize the actions performed by these three types of analytics
as follows: descriptive analytics summarizes the data (data reduction, sums, counts, aggregation, etc.),
predictive analytics predicts data that we do not have (influence scoring, trends, social analysis, etc.)
and prescriptive analytics guides the decision making towards a specific outcome.
Bibliography
[1] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin.
Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for analytical workloads.
Proceedings of the VLDB Endowment, 2(1):922–933, 2009.
[2] Hussam Abu-Libdeh, Lonnie Princehouse, and Hakim Weatherspoon. Racs: a case for cloud storage
diversity. In Proceedings of the 1st ACM symposium on Cloud computing, pages 229–240. ACM,
2010.
[3] SPS accelerating cavity CERN-PHOTO. Cern accelerating science. to do, 2011.
[4] Yonathan Aflalo, Ron Kimmel, and Dan Raviv. Scale invariant geometry for nonrigid shapes. SIAM
Journal on Imaging Sciences, 6(3):1579–1597, 2013.
[5] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. Big data and cloud computing: current state
and future opportunities. In Proceedings of the 14th International Conference on Extending Database
Technology, pages 530–533. ACM, 2011.
[6] Mansir Aminu, Abdul-Nassir Matori, Khamaruzaman Wan Yusof, Amirhossein Malakahmad, and
Rosilawati Binti Zainol. A gis-based water quality model for sustainable tourism planning of bertam
river in cameron highlands, malaysia. Environmental Earth Sciences, 73(10):6525–6537, 2014.
[7] Jordi Arjona Aroca, Antonio Fernández Anta, Miguel A Mosteiro, Christopher Thraves, and Lin
Wang. Power-efficient assignment of virtual machines to physical machines. Future Generation
Computer Systems, 54:82–94, 2016.
[8] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The datacenter as a computer: An introduction
to the design of warehouse-scale machines. Synthesis lectures on computer architecture, 8(3):1–154,
2013.
[9] Paolo Bellavista, Antonio Corradi, Mario Fanelli, and Luca Foschini. A survey of context data
distribution for mobile ubiquitous systems. ACM Computing Surveys (CSUR), 44(4):24, 2012.
[10] Anton Beloglazov, Jemal Abawajy, and Rajkumar Buyya. Energy-aware resource allocation heuristics
for efficient management of data centers for cloud computing. Future Generation Computer Systems,
28(5):755–768, 2012.
[11] Andreas Berl and Hermann De Meer. An energy consumption model for virtualized office environments. Future Generation Computer Systems, 27(8):1047–1055, 2011.
[12] Ramon Bertran, Yolanda Becerra, David Carrera, Vicenç Beltran, Marc Gonzàlez, Xavier Martorell, Nacho Navarro, Jordi Torres, and Eduard Ayguadé. Energy accounting for shared virtualized
environments under dvfs using pmc-based power models. Future Generation Computer Systems,
28(2):457–468, 2012.
[13] Nik Bessis, Stelios Sotiriadis, Florin Pop, and Valentin Cristea. Optimizing the energy efficiency of
message exchanging for service distribution in interoperable infrastructures. In Intelligent Networking
and Collaborative Systems (INCoS), 2012 4th International Conference on, pages 105–112. IEEE,
2012.
[14] Nik Bessis, Stelios Sotiriadis, Florin Pop, and Valentin Cristea. Using a novel message-exchanging optimization (meo) model to reduce energy consumption in distributed systems. Simulation Modelling
Practice and Theory, 39:104–120, 2013.
[15] W Lloyd Bircher and Lizy K John. Complete system power estimation using processor performance
events. Computers, IEEE Transactions on, 61(4):563–577, 2012.
[16] Christian Bizer, Peter Boncz, Michael L Brodie, and Orri Erling. The meaningful use of big data:
four perspectives–four challenges. ACM SIGMOD Record, 40(4):56–60, 2012.
[17] Cristina Boeres, George Chochia, and Peter Thanisch. On the scope of applicability of the ETF
algorithm. Springer, 1995.
[18] Ata E Husain Bohra and Vipin Chaudhary. Vmeter: Power modelling for virtualized clouds. In
Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International
Symposium on, pages 1–8. Ieee, 2010.
[19] Mike Botts, George Percivall, Carl Reed, and John Davidson. Ogc® sensor web enablement:
Overview and high level architecture. In GeoSensor networks, pages 175–190. Springer, 2008.
[20] Danah Boyd and Kate Crawford. Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, communication & society, 15(5):662–679, 2012.
[21] Arne Bröring, Johannes Echterhoff, Simon Jirka, Ingo Simonis, Thomas Everding, Christoph Stasch,
Steve Liang, and Rob Lemmens. New generation sensor web enablement. Sensors, 11(3):2652–2699,
2011.
[22] Hans Ulrich Buhl, Maximilian Röglinger, Dipl-Kfm Florian Moser, and Julia Heidemann. Big data.
Business & Information Systems Engineering, 5(2):65–69, 2013.
[23] W Buytaert, C Vitolo, SM Reaney, and K Beven. Hydrological models as web services: Experiences
from the environmental virtual observatory project. In AGU Fall Meeting Abstracts, volume 1, page
1491, 2012.
[24] Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang
Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, et al. Windows azure storage: a highly
available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM
Symposium on Operating Systems Principles, pages 143–157. ACM, 2011.
[25] Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. Data integration
under integrity constraints. In Seminal Contributions to Information Systems Engineering, pages
335–352. Springer, 2013.
[26] Erik Cambria, Dheeraj Rajagopal, Daniel Olsher, and Dipankar Das. Big social data analysis. Big
Data Computing, pages 401–414, 2013.
[27] Sivadon Chaisiri, Bu-Sung Lee, and Dusit Niyato. Optimization of resource provisioning cost in
cloud computing. Services Computing, IEEE Transactions on, 5(2):164–177, 2012.
[28] Cindy X Chen. Spatio-temporal databases. In Encyclopedia of GIS, pages 1121–1121. Springer, 2008.
[29] Jianjun Chen, Chris Douglas, Michi Mutsuzaki, Patrick Quaid, Raghu Ramakrishnan, Sriram Rao,
and Russell Sears. Walnut: a unified cloud object store. In Proceedings of the 2012 ACM SIGMOD
International Conference on Management of Data, pages 743–754. ACM, 2012.
[30] Nengcheng Chen, Liping Di, Genong Yu, and Min Min. A flexible geospatial sensor observation
service for diverse sensor data based on web service. ISPRS Journal of Photogrammetry and Remote
Sensing, 64(2):234–242, 2009.
[31] Nengcheng Chen, Ke Wang, Changjiang Xiao, and Jianya Gong. A heterogeneous sensor web node
meta-model for the management of a flood monitoring system. Environmental Modelling & Software,
54:222–237, 2014.
[32] Yong Chen, Xian-He Sun, Rajeev Thakur, Philip C Roth, and William D Gropp. Lacio: a new
collective i/o strategy for parallel i/o systems. In Parallel & Distributed Processing Symposium
(IPDPS), 2011 IEEE International, pages 794–804. IEEE, 2011.
[33] M Cheveresan, C Dinu, and R Drobot. Decision support tools for water quality management. 14th
SGEM GeoConference on Water Resources. Forest, Marine And Ocean Ecosystems, 1(GEM2014
Conference Proceedings, ISBN 978-619-7105-13-1/ISSN 1314-2704, June 19-25, 2014, Vol. 1):199–
206, 2014.
[34] Hyeyoung Cho, Sungho Kim, and Sik Lee. Analysis of long-term file system activities on cluster
systems. Month, 20(09):0–4, 2009.
[35] Richard Chow, Philippe Golle, Markus Jakobsson, Elaine Shi, Jessica Staddon, Ryusuke Masuoka,
and Jesus Molina. Controlling data in the cloud: outsourcing computation without outsourcing
control. In Proceedings of the 2009 ACM workshop on Cloud computing security, pages 85–90. ACM,
2009.
[36] Sorin N Ciolofan, Mariana Mocanu, Florin Pop, and Valentin Cristea. Improving quality of water
related data in a cyberinfrastructure.
[37] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M Hellerstein, and Caleb Welton. Mad skills:
new analysis practices for big data. Proceedings of the VLDB Endowment, 2(2):1481–1492, 2009.
[38] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton. Mad skills:
New analysis practices for big data. Proc. VLDB Endow., 2(2):1481–1492, August 2009.
[39] Dennis Colarelli and Dirk Grunwald. Massive arrays of idle disks for storage archives. In Proceedings
of the 2002 ACM/IEEE Conference on Supercomputing, SC ’02, pages 1–11, Los Alamitos, CA, USA,
2002. IEEE Computer Society Press.
[40] AMD Cool ‘n’ Quiet. Cool ‘n’ quiet technology installation guide, november 2009.
[41] Marshall Copeland, Julian Soh, Anthony Puca, Mike Manning, and David Gollob. Microsoft azure
and cloud computing. In Microsoft Azure, pages 3–26. Springer, 2015.
[42] Paolo Costa, Matteo Migliavacca, Peter Pietzuch, and Alexander L Wolf. Naas: Network-as-aservice in the cloud. In Proceedings of the 2nd USENIX conference on Hot Topics in Management
of Internet, Cloud, and Enterprise Networks and Services, Hot-ICE, volume 12, pages 1–1, 2012.
[43] Alfredo Cuzzocrea, Il-Yeol Song, and Karen C Davis. Analytics over large-scale multidimensional
data: the big data revolution! In Proceedings of the ACM 14th international workshop on Data
Warehousing and OLAP, pages 101–104. ACM, 2011.
[44] INSPIRE Directive. Directive 2007/2/ec of the european parliament and of the council of 14 march
2007 establishing an infrastructure for spatial information in the european community (inspire).
Published in the official Journal on the 25th April, 2007.
[45] Marco Dorigo and Mauro Birattari. Ant colony optimization. In Encyclopedia of machine learning,
pages 36–39. Springer, 2010.
[46] Idilio Drago, Enrico Bocchi, Marco Mellia, Herman Slatman, and Aiko Pras. Benchmarking personal
cloud storage. In Proceedings of the 2013 conference on Internet measurement conference, pages 205–
212. ACM, 2013.
[47] Idilio Drago, Marco Mellia, Maurizio M Munafo, Anna Sperotto, Ramin Sadre, and Aiko Pras. Inside
dropbox: understanding personal cloud storage services. In Proceedings of the 2012 ACM conference
on Internet measurement conference, pages 481–494. ACM, 2012.
[48] The Economist. The data deluge, February 2010.
[49] Christian Esposito, Massimo Ficco, Francesco Palmieri, and Arcangelo Castiglione. Smart cloud
storage service selection based on fuzzy logic, theory of evidence and game theory. 2015.
[50] Roberto R Expósito, Guillermo L Taboada, Sabela Ramos, Jorge González-Domínguez, Juan
Touriño, and Ramón Doallo. Analysis of i/o performance on an amazon ec2 cluster compute and
high i/o platform. Journal of grid computing, 11(4):613–631, 2013.
[51] Marc Farley. Storage Networking Fundamentals: An Introduction to Storage Devices, Subsystems,
Applications, Management, and File Systems (Cisco Press Fundamentals). Cisco Press, 2004.
[52] Eugen Feller, Louis Rilling, and Christine Morin. Energy-aware ant colony based workload placement
in clouds. In Proceedings of the 2011 IEEE/ACM 12th International Conference on Grid Computing,
pages 26–33. IEEE Computer Society, 2011.
[53] M. Ficco and F. Palmieri. Introducing fraudulent energy consumption in cloud infrastructures: A
new generation of denial-of-service attacks. Systems Journal, IEEE, PP(99):1–11, 2015.
[54] André Freitas, Edward Curry, João Gabriel Oliveira, and Seán O Riain. Querying heterogeneous
datasets on the linked data web: Challenges, approaches, and trends. Internet Computing, IEEE,
16(1):24–33, 2012.
[55] Yongqiang Gao, Haibing Guan, Zhengwei Qi, Yang Hou, and Liang Liu. A multi-objective ant
colony system algorithm for virtual machine placement in cloud computing. Journal of Computer
and System Sciences, 79(8):1230–1242, 2013.
[56] Yongqiang Gao, Haibing Guan, Zhengwei Qi, Yang Hou, and Liang Liu. A multi-objective ant
colony system algorithm for virtual machine placement in cloud computing. Journal of Computer
and System Sciences, 79(8):1230–1242, 2013.
[57] Yue Gao, Fanglin Wang, Huanbo Luan, and Tat-Seng Chua. Brand data gathering from live social
media streams. In Proceedings of International Conference on Multimedia Retrieval, page 169. ACM,
2014.
[58] Nishant Garg. Apache Kafka. Packt Publishing Ltd, 2013.
[59] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The google file system. In ACM SIGOPS
operating systems review, volume 37, pages 29–43. ACM, 2003.
[60] Bogdan Ghit, Mihai Capota, Tim Hegeman, Jan Hidders, Dick Epema, and Alexandru Iosup. V
for vicissitude: The challenge of scaling complex big data workflows. In Cluster, Cloud and Grid
Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, pages 927–932. IEEE,
2014.
[61] Daniel Gmach, Jerry Rolia, and Ludmila Cherkasova. Resource and virtualization costs up in the
cloud: Models and design choices. In Dependable Systems & Networks (DSN), 2011 IEEE/IFIP 41st
International Conference on, pages 395–402. IEEE, 2011.
[62] Hadi Goudarzi and Massoud Pedram. Energy-efficient virtual machine replication and placement in
a cloud computing system. In Cloud Computing (CLOUD), 2012 IEEE 5th International Conference
on, pages 750–757. IEEE, 2012.
[63] Albert Greenberg, James Hamilton, David A Maltz, and Parveen Patel. The cost of a cloud: research
problems in data center networks. ACM SIGCOMM computer communication review, 39(1):68–73,
2008.
[64] Lin Gu, Deze Zeng, Peng Li, and Song Guo. Cost minimization for big data processing in geodistributed data centers. Emerging Topics in Computing, IEEE Transactions on, 2(3):314–323,
2014.
[65] Ajay Gulati, Chethan Kumar, and Irfan Ahmad. Storage workload characterization and consolidation
in virtualized environments. In Workshop on Virtualization Performance: Analysis, Characterization,
and Tools (VPACT), 2009.
[66] Jiawei Han and Micheline Kamber. Data Mining, Southeast Asia Edition: Concepts and Techniques.
Morgan kaufmann, 2006.
[67] Danny Harnik, Dalit Naor, and Itai Segall. Low power mode in cloud storage systems. In Proceedings
of the 2009 IEEE International Symposium on Parallel&Distributed Processing, IPDPS ’09, pages
1–8, Washington, DC, USA, 2009. IEEE Computer Society.
[68] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv preprint
arXiv:1309.6835, 2013.
[69] Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin,
and Shivnath Babu. Starfish: A self-tuning system for big data analytics. In CIDR, volume 11,
pages 261–272, 2011.
[70] Ching-Hsien Hsu, Shih-Chang Chen, Chih-Chun Lee, Hsi-Ya Chang, Kuan-Chou Lai, Kuan-Ching
Li, and Chunming Rong. Energy-aware task consolidation technique for cloud computing. In Cloud
Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on,
pages 115–121. IEEE, 2011.
[71] Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N. Hanson, Owen O’Malley,
Jitendra Pandey, Yuan Yuan, Rubao Lee, and Xiaodong Zhang. Major technical advancements in
apache hive. In Proceedings of the 2014 ACM SIGMOD International Conference on Management
of Data, SIGMOD ’14, pages 1235–1246, New York, NY, USA, 2014. ACM.
[72] Wei Huang, Kai-wen Chen, and Chao Xiao. Integration on heterogeneous data with uncertainty in
emergency system. In Fuzzy Information & Engineering and Operations Research & Management,
pages 483–490. Springer, 2014.
[73] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. Zookeeper: Wait-free
coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX
Annual Technical Conference, USENIXATC’10, pages 11–11, Berkeley, CA, USA, 2010. USENIX
Association.
[74] IBM, Paul Zikopoulos, and Chris Eaton. Understanding Big Data: Analytics for Enterprise Class
Hadoop and Streaming Data. McGraw-Hill Osborne Media, 1st edition, 2011.
[75] Enhanced Intel. Speedstep® technology for the intel® pentium® m processor, 2004.
[76] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed
data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review,
volume 41, pages 59–72. ACM, 2007.
[77] Hesam Izakian, Ajith Abraham, and Václav Snášel. Performance comparison of six efficient pure
heuristics for scheduling meta-tasks on heterogeneous distributed environments. Neural Network
World, 19(6):695–710, 2009.
[78] HV Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M Patel,
Raghu Ramakrishnan, and Cyrus Shahabi. Big data and its technical challenges. Communications
of the ACM, 57(7):86–94, 2014.
[79] Yifeng Jiang. HBase Administration Cookbook. Packt Publishing, 2012.
[80] Lei Jiao, Jun Li, Tianyin Xu, and Xiaoming Fu. Cost optimization for online social networks on
geo-distributed clouds. In Network Protocols (ICNP), 2012 20th IEEE International Conference on,
pages 1–10. IEEE, 2012.
[81] Lei Jiao, Jun Lit, Wei Du, and Xiaoming Fu. Multi-objective data placement for multi-cloud socially
aware services. In INFOCOM, 2014 Proceedings IEEE, pages 28–36. IEEE, 2014.
[82] Heeseung Jo, Youngjin Kwon, Hwanju Kim, Euiseong Seo, Joonwon Lee, and Seungryoul Maeng.
Ssd-hdd-hybrid virtual disk in consolidated environments. In Proceedings of the 2009 International
Conference on Parallel Processing, Euro-Par’09, pages 375–384, Berlin, Heidelberg, 2010. SpringerVerlag.
[83] Mohammad Mahdi Kashef and Jörn Altmann. A cost model for hybrid clouds. In Economics of
Grids, Clouds, Systems, and Services, pages 46–60. Springer, 2012.
[84] Rini T. Kaushik and Milind Bhandarkar. Greenhdfs: Towards an energy-conserving, storage-efficient,
hybrid hadoop compute cluster. In Proceedings of the 2010 International Conference on Power Aware
Computing and Systems, HotPower’10, pages 1–9, Berkeley, CA, USA, 2010. USENIX Association.
[85] Atefeh Khosravi, Saurabh Kumar Garg, and Rajkumar Buyya. Energy and carbon-efficient placement of virtual machines in distributed cloud data centers. In Euro-Par 2013 Parallel Processing,
pages 317–328. Springer, 2013.
[86] Jinoh Kim and Doron Rotem. Frep: Energy proportionality for disk storage using replication. J.
Parallel Distrib. Comput., 72(8):960–974, August 2012.
[87] M Kim, A Mohindra, V Muthusamy, R Ranchal, V Salapura, A Slominski, and R Khalaf. Building scalable, secure, multi-tenant cloud services on ibm bluemix. IBM Journal of Research and
Development, 60(2-3):8–1, 2016.
[88] Rob Kitchin. Big data and human geography opportunities, challenges and risks. Dialogues in human
geography, 3(3):262–267, 2013.
[89] Rob Kitchin. The real-time city? big data and smart urbanism. GeoJournal, 79(1):1–14, 2014.
[90] Joanna Kołodziej and Samee Ullah Khan. Multi-level hierarchic genetic-based scheduling of independent jobs in dynamic heterogeneous grid environment. Information Sciences, 214:1–19, 2012.
[91] Joanna Kołodziej, Samee Ullah Khan, and Fatos Xhafa. Genetic algorithms for energy-aware scheduling in computational grids. In P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2011
International Conference on, pages 17–24. IEEE, 2011.
[92] Joanna Kolodziej, Magdalena Szmajduch, Tahir Maqsood, Sajjad Ahmad Madani, Nasro Min-Allah,
and Samee U Khan. Energy-aware grid scheduling of independent tasks and highly distributed
data. In Frontiers of Information Technology (FIT), 2013 11th International Conference on, pages
211–216. IEEE, 2013.
[93] Joanna Kołodziej, Samee U Khan, Magdalena Szmajduch, Lizhe Wang, Dan Chen, et al. Genetic-based solutions for independent batch scheduling in data grids. 2013.
[94] Roman Kopetzky, Markus Günther, Natalia Kryvinska, Andreas Mladenow, Christine Strauss, and
Christian Stummer. Strategic management of disruptive technologies: a practical framework in the
context of voice services and of computing towards the cloud. International Journal of Grid and
Utility Computing, 4(1):47–59, 2013.
[95] Lawrence T. Kou and George Markowsky. Multidimensional bin packing algorithms. IBM Journal
of Research and development, 21(5):443–448, 1977.
[96] Andrei Lavinia, Ciprian Dobre, Florin Pop, and Valentin Cristea. A failure detection system for
large scale distributed systems. In Complex, Intelligent and Software Intensive Systems (CISIS),
2010 International Conference on, pages 482–489. IEEE, 2010.
[97] Adam M Leadbetter and Peter N Vodden. Semantic linking of complex properties, monitoring
processes and facilities in web-based representations of the environment. International Journal of
Digital Earth, (ahead-of-print):1–25, 2015.
[98] Marcello Leida, Alex Gusmini, and John Davies. Semantics-aware data integration for heterogeneous
data sources. Journal of Ambient Intelligence and Humanized Computing, 4(4):471–491, 2013.
[99] Min Yeol Lim, Allan Porterfield, and Robert Fowler. Softpower: fine-grain power estimations using
performance counters. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 308–311. ACM, 2010.
[100] Ching-Chi Lin, Pangfeng Liu, and Jan-Jan Wu. Energy-aware virtual machine dynamic provision
and scheduling for cloud computing. In Cloud Computing (CLOUD), 2011 IEEE International
Conference on, pages 736–737. IEEE, 2011.
[101] Xing Lin, Yun Mao, Feifei Li, and Robert Ricci. Towards fair sharing of block storage in a multitenant cloud. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing,
pages 15–15. USENIX Association, 2012.
[102] Wenxin Liu and EK Park. Big data as an e-health service. In Computing, Networking and Communications (ICNC), 2014 International Conference on, pages 982–988. IEEE, 2014.
[103] Charles Z. Loboz. Cloud resource usage: Extreme distributions invalidating traditional capacity
planning models. In Proceedings of the 2Nd International Workshop on Scientific Cloud Computing,
ScienceCloud ’11, pages 7–14, New York, NY, USA, 2011. ACM.
[104] David Loshin. Chapter 7 - big data tools and techniques. In David Loshin, editor, Big Data Analytics,
pages 61 – 72. Morgan Kaufmann, Boston, 2013.
[105] David Loshin. Chapter 9 - nosql data management for big data. In David Loshin, editor, Big Data
Analytics, pages 83 – 90. Morgan Kaufmann, Boston, 2013.
[106] Jun-Zhou Luo, Jia-Hui Jin, Ai-Bo Song, and Fang Dong. Cloud computing: architecture and key
technologies. Journal of China Institute of Communications, 32(7):3–21, 2011.
[107] P Maciel, G Callou, E Tavares, E Sousa, B Silva, et al. Estimating reliability importance and total
cost of acquisition for data center power infrastructures. In Systems, Man, and Cybernetics (SMC),
2011 IEEE International Conference on, pages 421–426. IEEE, 2011.
[108] Muthucumaru Maheswaran and Howard Jay Siegel. A dynamic matching and scheduling algorithm
for heterogeneous computing systems. In Heterogeneous Computing Workshop, 1998.(HCW 98)
Proceedings. 1998 Seventh, pages 57–69. IEEE, 1998.
[109] Zaigham Mahmood and Saqib Saeed. Software engineering frameworks for the cloud computing
paradigm. Springer, 2013.
[110] Andrew Makhorin. GLPK (GNU Linear Programming Kit), version 4.42. URL http://www.gnu.org/software/glpk, 2004.
[111] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh,
and Angela H Byers. Big data: The next frontier for innovation, competition, and productivity.
McKinsey Global Institute, 2011.
[112] Benedikt Martens, Marc Walterbusch, and Frank Teuteberg. Costing of cloud computing services:
A total cost of ownership approach. In System Science (HICSS), 2012 45th Hawaii International
Conference on, pages 1563–1572. IEEE, 2012.
[113] Carlos Martínez-Cano, Alberto Galvis, Franco Alvis, and Micha Werner. Model integration to
improve an early warning system for the pollution control of the cauca river, colombia. 2014.
[114] M Matthies, J Berlekamp, S Lautenbach, N Graf, and S Reimer. Decision support system for the
elbe river water quality management. In Proceedings of the International Congress on Modelling and
Simulation (MODSIM 2003), Townsville, Australia, 2003.
[115] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner. Openflow: enabling innovation in campus networks. ACM
SIGCOMM Computer Communication Review, 38(2):69–74, 2008.
[116] Yousri Mhedheb, Foued Jrad, Jie Tao, Jiaqi Zhao, Joanna Kołodziej, and Achim Streit. Load and
thermal-aware vm scheduling on the cloud. In Algorithms and Architectures for Parallel Processing,
pages 101–114. Springer, 2013.
[117] Subhas Chandra Misra and Arka Mondal. Identification of a company’s suitability for the adoption of
cloud computing and modelling its corresponding return on investment. Mathematical and Computer
Modelling, 53(3):504–521, 2011.
[118] Sparsh Mittal. A survey of techniques for approximate computing. ACM Computing Surveys (CSUR),
48(4):62, 2016.
[119] Andreas Mladenow, Natalia Kryvinska, and Christine Strauss. Towards cloud-centric service environments. Journal of Service Science Research, 4(2):213–234, 2012.
[120] Christoph Mobius, Waltenegus Dargie, and Alexander Schill. Power consumption estimation models
for processors, virtual machines, and servers. Parallel and Distributed Systems, IEEE Transactions
on, 25(6):1600–1614, 2014.
[121] Mariana Mocanu and Alexandru Craciun. Monitoring watershed parameters through software services. In EIDWT, pages 287–292, 2012.
[122] Mariana Mocanu, Lucia Vacariu, Radu Drobot, and Marian Muste. Information-centric systems for
supporting decision-making in watershed resource development. In Control Systems and Computer
Science (CSCS), 2013 19th International Conference on, pages 611–616. IEEE, 2013.
[123] Bhupendra Moharil, Chaitanya Gokhale, Vijayendra Ghadge, Pranav Tambvekar, Sumitra Pundlik,
and Gaurav Rai. Real time generalized log file management and analysis using pattern matching
and dynamic clustering. International Journal of Computer Applications, 91(16):1–6, 2014.
[124] Eugen Molnar, Natalia Kryvinska, and M Greguś. Customer driven big-data analytics for the companies servitization.
[125] Rafael Moreno-Vozmediano, Rubén S Montero, and Ignacio Martín Llorente. Key challenges in cloud
computing: Enabling the future internet of services. Internet Computing, IEEE, 17(4):18–25, 2013.
[126] Ovidiu Muresan, Florin Pop, Dorian Gorgan, and Valentin Cristea. Satellite image processing applications in mediogrid. In Parallel and Distributed Computing, 2006. ISPDC’06. The Fifth International Symposium on, pages 253–262. IEEE, 2006.
[127] Emily Namey, Greg Guest, Lucy Thairu, and Laura Johnson. Data reduction techniques for large
qualitative data sets. Handbook for team-based qualitative research, pages 137–162, 2007.
[128] Catalin Negru and Valentin Cristea. Cost models–pillars for efficient cloud computing: position
paper. International Journal of Intelligent Systems Technologies and Applications, 12(1):28–38, 2013.
[129] Catalin Negru, Mariana Mocanu, and Valentin Cristea. Impact of virtual machines heterogeneity on
data center power consumption in data-intensive applications. In Adaptive Resource Management
and Scheduling for Cloud Computing, pages 91–102. Springer, 2015.
[130] Catalin Negru, Mariana Mocanu, Valentin Cristea, Stelios Sotiriadis, and Nik Bessis. Analysis of
power consumption in heterogeneous virtual machine environments. Soft Computing, pages 1–12,
2016.
[131] Catalin Negru, Florin Pop, and Valentin Cristea. Cost optimization for data storage in public clouds:
A user perspective. In Proceedings of 13th International Conference on Informatics in Economy, 2014.
[132] Catalin Negru, Florin Pop, Valentin Cristea, Nik Bessisy, and Jing Li. Energy efficient cloud storage
service: Key issues and challenges. In Proceedings of the 2013 Fourth International Conference on
Emerging Intelligent Data and Web Technologies, EIDWT ’13, pages 763–766, Washington, DC,
USA, 2013. IEEE Computer Society.
[133] Cătălin NEGRU, Florin POP, Ciprian DOBRE, and Valentin CRISTEA. Performance analysis of
lustre file systems based on benchmarking for a cluster system. U.P.B. Scientific Bulletin Series C,
75:27–36, 2013.
[134] Catalin Negru, Florin Pop, Ovidiu Cristian Marcu, Mariana Mocanu, and Valentin Cristea. Budget
constrained selection of cloud storage services for advanced processing in datacenters. In RoEduNet
International Conference-Networking in Education and Research (RoEduNet NER), 2015 14th, pages
158–162. IEEE, 2015.
[135] Pop F. & Cristea V. Negru, C. Cost optimization for data storage in public clouds: A user perspective.
In In Proceedings of 13th International Conference on Informatics in Economy.
[136] Bogdan Nicolae, Pierre Riteau, Kate Keahey, et al. Transparent throughput elasticity for iaas cloud
storage using guest-side block-level caching. In UCC’14: 7th IEEE/ACM International Conference
on Utility and Cloud Computing, 2014.
[137] Ewa Niewiadomska-Szynkiewicz, Andrzej Sikora, Piotr Arabas, Mariusz Kamola, Marcin Mincer,
and Joanna Kołodziej. Dynamic power management in energy-aware computer networks and data
intensive computing systems. Future Generation Computer Systems, 37:284–296, 2014.
[138] Di Niu, Chen Feng, and Baochun Li. Pricing cloud bandwidth reservations under demand uncertainty.
In ACM SIGMETRICS Performance Evaluation Review, volume 40, pages 151–162. ACM, 2012.
[139] William D Norcott and Don Capps. Iozone filesystem benchmark. URL: www.iozone.org, 55, 2003.
[140] The Open Geospatial Consortium (OGC). Why is the ogc involved in sensor webs?, 2008.
[141] Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. Manning Publications
Co., Greenwich, CT, USA, 2011.
[142] Mayur R Palankar, Adriana Iamnitchi, Matei Ripeanu, and Simson Garfinkel. Amazon s3 for science
grids: a viable solution? In Proceedings of the 2008 international workshop on Data-aware distributed
computing, pages 55–64. ACM, 2008.
[143] Venkatesh Pallipadi. Enhanced intel speedstep technology and demand-based switching on linux.
Intel Developer Service, 2009.
[144] Rina Panigrahy, Kunal Talwar, Lincoln Uyeda, and Udi Wieder. Heuristics for vector bin packing.
research.microsoft.com, 2011.
[145] Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, and Konrad Lai. Bloom filtering cache
misses for accurate data speculation and prefetching. In ACM International Conference on Supercomputing 25th Anniversary Volume, pages 347–356. ACM, 2014.
[146] Zachary NJ Peterson, Mark Gondree, and Robert Beverly. A position paper on data sovereignty:
the importance of geolocating data in the cloud. In Proceedings of the 3rd USENIX conference on
Hot topics in cloud computing, pages 9–9. USENIX Association, 2011.
[147] Padmanabhan Pillai and Kang G Shin. Real-time dynamic voltage scaling for low-power embedded
operating systems. In ACM SIGOPS Operating Systems Review, volume 35, pages 89–102. ACM,
2001.
[148] Jaroslav Pokorny. Nosql databases: a step to database scalability in web environment. International
Journal of Web Information Systems, 9(1):69–82, 2013.
[149] Florin Pop, Ciprian Dobre, and Valentin Cristea. Performance analysis of grid dag scheduling algorithms using monarc simulation tool. In Parallel and Distributed Computing, 2008. ISPDC’08.
International Symposium on, pages 131–138. IEEE, 2008.
[150] Florin Pop, Ciprian Dobre, Valentin Cristea, Nik Bessis, Fatos Xhafa, and Leonard Barolli. Deadline
scheduling for aperiodic tasks in inter-cloud environments: a new approach to resource management.
The Journal of Supercomputing, 71(5):1754–1765, 2015.
[151] Florin Pop, Ciprian Dobre, Catalin Negru, and Valentin Cristea. Re-scheduling service for distributed
systems. In Advances in Intelligent Control Systems and Computer Science, pages 423–437. Springer
Berlin Heidelberg, 2013.
[152] AMD PowerNOW. Technology, amd white paper, november 2000.
[153] George Prekas, Mia Primorac, Adam Belay, Christos Kozyrakis, and Edouard Bugnion. Energy
proportionality and workload consolidation for latency-critical applications. In Proceedings of the
Sixth ACM Symposium on Cloud Computing, pages 342–355. ACM, 2015.
[154] Ioan Raicu, Ian T Foster, and Yong Zhao. Many-task computing for grids and supercomputers.
In Many-Task Computing on Grids and Supercomputers, 2008. MTAGS 2008. Workshop on, pages
1–11. IEEE, 2008.
[155] Zia Ur Rehman, Farookh K Hussain, and Omar K Hussain. Towards multi-criteria cloud service
selection. In Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2011 Fifth
International Conference on, pages 44–48. IEEE, 2011.
[156] Donald S Remer and Armando P Nieto. A compendium and comparison of 25 project evaluation
techniques. part 1: Net present value and rate of return methods. International Journal of Production
Economics, 42(1):79–96, 1995.
[157] John F. Roddick, Erik Hoel, Max J. Egenhofer, Dimitris Papadias, and Betty Salzberg. Spatial,
temporal and spatio-temporal databases - hot issues and directions for phd research. SIGMOD Rec.,
33(2):126–131, June 2004.
[158] Arkaitz Ruiz-Alvarez and Marty Humphrey. An automated approach to cloud storage service selection. In Proceedings of the 2nd international workshop on Scientific cloud computing, pages 39–48.
ACM, 2011.
[159] Andrei Sfrent and Florin Pop. Asymptotic scheduling for many task computing in big data platforms.
Information Sciences, 2015.
[160] Weiyi Shang, Bram Adams, and Ahmed E. Hassan. Using pig as a data preparation language for
large-scale mining software repositories studies: An experience report. J. Syst. Softw., 85(10):2195–
2204, October 2012.
[161] Mohsen Sharifi, Hadi Salimi, and Mahsa Najafzadeh. Power-efficient distributed scheduling of virtual
machines using workload-aware consolidation techniques. The Journal of Supercomputing, 61(1):46–
66, 2012.
[162] Gilbert C Sih and Edward A Lee. A compile-time scheduling heuristic for interconnection-constrained
heterogeneous processor architectures. Parallel and Distributed Systems, IEEE Transactions on,
4(2):175–187, 1993.
[163] Vivek K Singh, Mingyan Gao, and Ramesh Jain. Situation recognition: an evolving problem for
heterogeneous dynamic big multimedia data. In Proceedings of the 20th ACM international conference
on Multimedia, pages 1209–1218. ACM, 2012.
[164] Yashaswi Singh, Farah Kandah, and Weiyi Zhang. A secured cost-effective multi-cloud storage in
cloud computing. In Computer Communications Workshops (INFOCOM WKSHPS), 2011 IEEE
Conference on, pages 619–624. IEEE, 2011.
[165] C Skotner, A Klinting, HC Ammentorp, F Hansen, J Høst-Madsen, QM Lu, and Han Junshan. A tailored gis-based forecasting system of songhua river basin, china. In Proceedings of Esri International
user conference, 2013.
[166] S. Sotiriadis, N. Bessis, A. Anjum, and R. Buyya. An inter-cloud meta-scheduling (icms) simulation
framework: Architecture and evaluation. IEEE Transactions on Services Computing, PP(99):1–1,
2015.
[167] Stelios Sotiriadis, Nik Bessis, and Nick Antonopoulos. Towards inter-cloud schedulers: A survey
of meta-scheduling approaches. In P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC),
2011 International Conference on, pages 59–66. IEEE, 2011.
[168] Christoph Stasch, Theodor Foerster, Christian Autermann, and Edzer Pebesma. Spatio-temporal
aggregation of european air quality observations in the sensor web. Computers & Geosciences,
47:111–118, 2012.
[169] Murray Stokely, Amaan Mehrabian, Christoph Albrecht, Francois Labelle, and Arif Merchant. Projecting disk usage based on historical trends in a cloud environment. In Proceedings of the 3rd
Workshop on Scientific Cloud Computing Date, ScienceCloud ’12, pages 63–70, New York, NY,
USA, 2012. ACM.
[170] Smitha Sundareswaran, Anna Squicciarini, and Dongyang Lin. A brokerage-based approach for cloud
service selection. In Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on, pages
558–565. IEEE, 2012.
[171] Byung Chul Tak, Bhuvan Urgaonkar, and Anand Sivasubramaniam. To move or not to move: The
economics of cloud computing. In Proceedings of the 3rd USENIX conference on Hot topics in cloud
computing, pages 5–5. USENIX Association, 2011.
[172] Caihong Tang, Yujun Yi, Zhifeng Yang, and Xi Cheng. Water pollution risk simulation and prediction
in the main canal of the south-to-north water transfer project. Journal of Hydrology, 519:2111–2120,
2014.
[173] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M Patel, Sanjeev
Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, et al. Storm@twitter. In
Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages
147–156. ACM, 2014.
[174] Hong-Linh Truong and Schahram Dustdar. Composable cost estimation and monitoring for computational applications in cloud computing environments. Procedia Computer Science, 1(1):2175–2184,
2010.
[175] Vernon Turner, John F Gantz, David Reinsel, and Stephen Minton. The digital universe of opportunities: Rich data and the increasing value of the internet of things. IDC Analyze the Future,
2014.
[176] Ruben Van den Bossche, Kurt Vanmechelen, and Jan Broeckhove. Cost-optimal scheduling in hybrid
iaas clouds for deadline constrained workloads. In Cloud Computing (CLOUD), 2010 IEEE 3rd
International Conference on, pages 228–235. IEEE, 2010.
[177] B van Loenen and M Grothe. Inspire as enabler of open data objectives. 2014.
[178] Ana Lucia Varbanescu and Alexandru Iosup. On many-task big data processing: from gpus to
clouds. In MTAGS Workshop, held in conjunction with ACM/IEEE International Conference for
High Performance Computing, Networking, Storage and Analysis (SC), pages 1–8. ACM, 2013.
[179] Akshat Verma, Ricardo Koller, Luis Useche, and Raju Rangaswami. Srcmap: Energy proportional
storage using dynamic consolidation. In Proceedings of the 8th USENIX Conference on File and
Storage Technologies, FAST’10, pages 20–20, Berkeley, CA, USA, 2010. USENIX Association.
[180] Zhanming Wan, Yang Hong, Sadiq Khan, Jonathan Gourley, Zachary Flamig, Dalia Kirschbaum, and
Guoqiang Tang. A cloud-based global flood disaster community cyber-infrastructure: Development
and demonstration. Environmental Modelling & Software, 58:86–94, 2014.
[181] Cong Wang, Kui Ren, and Jia Wang. Secure and practical outsourcing of linear programming in
cloud computing. In INFOCOM, 2011 Proceedings IEEE, pages 820–828. IEEE, 2011.
[182] Feng Wang, Qin Xin, Bo Hong, Scott A Brandt, Ethan L Miller, Darrell DE Long, and Tyce T
McLarty. File system workload analysis for large scale scientific computing applications. 2004.
[183] M. Whitehorn. Aster data founders explain unified approach to data big and small, 2015.
[184] Erik Wittern, Jörn Kuhlenkamp, and Michael Menzel. Cloud service selection based on variability
modeling. In Service-Oriented Computing, pages 127–141. Springer, 2012.
[185] Peng Xiao, Zhigang Hu, Dongbo Liu, Guofeng Yan, and Xilong Qu. Virtual machine power measuring
technique with bounded error in cloud environments. Journal of Network and Computer Applications,
36(2):818–828, 2013.
[186] Jian Xu and Rachel Pottinger. Integrating domain heterogeneous data sources using decomposition
aggregation queries. Information Systems, 39:80–107, 2014.
[187] Chaowei Yang, Robert Raskin, Michael Goodchild, and Mark Gahegan. Geospatial cyberinfrastructure: past, present and future. Computers, Environment and Urban Systems, 34(4):264–277,
2010.
[188] Chi Yang, Xuyun Zhang, Changmin Zhong, Chang Liu, Jian Pei, Kotagiri Ramamohanarao, and
Jinjun Chen. A spatiotemporal compression based approach for efficient big data processing on
cloud. Journal of Computer and System Sciences, 2014.
[189] Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D Stott Parker. Map-reduce-merge: simplified
relational data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD international
conference on Management of data, pages 1029–1040. ACM, 2007.
[190] Jianguo Yao, Xue Liu, Wenbo He, and Ashikur Rahman. Dynamic control of electricity cost with
power demand smoothing and peak shaving for distributed internet data centers. In Distributed
Computing Systems (ICDCS), 2012 IEEE 32nd International Conference on, pages 416–424. IEEE,
2012.
[191] Wenbin Yao and Liang Lu. A selection algorithm of service providers for optimized data placement in
multi-cloud storage environment. In Intelligent Computation in Big Data Era, pages 81–92. Springer,
2015.
[192] MYA Younis and Kashif Kifayat. Secure cloud computing for critical infrastructure: A survey.
Liverpool John Moores University, United Kingdom, Tech. Rep, 2013.
[193] Yen-Ting Yu, Yujia Zhu, Wilfred Ng, Juniarto Samsudin, and Zuyi Li. Optimized sort partition: A
file assignment strategy to achieve minimized response time for parallel storage systems. In APMRC,
2012 Digest, pages 1–8. IEEE, 2012.
[194] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark:
cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics
in cloud computing, volume 10, page 10, 2010.
[195] Qi Zhang, Lu Cheng, and Raouf Boutaba. Cloud computing: state-of-the-art and research challenges.
Journal of internet services and applications, 1(1):7–18, 2010.