University of Leeds
School of Computing
MSc Advanced Computer Science
Data Management in Cloud Computing
Bishakha Dutta Gupta
MSc Advanced Computer Science
Session (2010/11)
The candidate confirms that the work submitted is their own and the appropriate credit has been given where
reference has been made to the work of others.
I understand that failure to attribute material which is obtained from another source may be considered as
plagiarism.
(Signature of Student) -----------------------------------
Summary
Handling large volumes of data in a scalable way on clouds has long been a concern. In this project, we introduce a framework built on Hadoop to store and process large pathology datasets. Hadoop is widely used to process large-scale data efficiently and scalably, yet efforts to apply it to scenarios other than server-side computation such as web indexing remain few. In this project, a Hadoop cluster is implemented and its MapReduce programming model is used to store and process the pathology data efficiently on the Hadoop cluster, or cloud. The approach is then generalised and discussed further so that it can handle similar scientific data processing scenarios.
Acknowledgements
I would like to personally thank my supervisor, Dr. Karim Djemame, for his timely feedback, continuous support and motivation throughout the project, and for helping me to complete the research project on time. I would also like to thank my assessor, Dr. Vania Dimitrova, for providing good feedback on both the mid-term project report and during the progress meeting. I would also like to thank Django Armstrong for his quick and timely help with any technical issues regarding the cloud test bed. I would also like to thank my lovely parents for their incessant love, inspiration and financial support. Finally, I would like to thank my uncle Dr. Jyoti N. Sengupta, aunty Jayati Sengupta and my grandparents for their continuous love and encouragement during my tenure as a student in Leeds.
List of Acronyms
CPU - Central Processing Unit
MB - Megabyte
GB - Gigabyte
TB - Terabyte
PB - Petabyte
HTTP - Hypertext Transfer Protocol
I/O - Input/Output
IaaS - Infrastructure as a Service
PaaS - Platform as a Service
SaaS - Software as a Service
RAM - Random Access Memory
SSH - Secure Shell
VIM - Virtual Infrastructure Manager
VM - Virtual Machine
Amazon EC2 - Amazon Elastic Compute Cloud
API - Application Programming Interface
SNA - Shared Nothing Architecture
HDFS - Hadoop Distributed File System
DFS - Distributed File System
Table of Contents
1. INTRODUCTION ................................................................................................................................................1
1.1 Project Aim: ...................................................................................................................................................1
1.2 Objectives:......................................................................................................................................................1
1.3 Minimum Requirements: ...............................................................................................................................1
1.4 Motivation ......................................................................................................................................................1
1.5 Methodology: .................................................................................................................................................2
1.6 Schedule .........................................................................................................................................................2
1.7 Contributions:.................................................................................................................................................3
1.8 Report Structure: ............................................................................................................................................3
2. CLOUD COMPUTING........................................................................................................................................5
2.1 Introduction ....................................................................................................................................................5
2.2 Properties of a Cloud ......................................................................................................................................6
2.3 Architecture of Cloud Computing ..................................................................................................................6
2.4 Models ............................................................................................................................................................7
2.5 Types of cloud system ....................................................................................................................................9
2.6 Companies that offer Cloud services ...........................................................................................................10
2.7 Virtualization................................................................................................................................................12
2.8 Types of Virtualization ................................................................................................................................13
2.9 Major Players and Products in Virtualization ..............................................................................................14
2.10 Virtualization Project Steps........................................................................................................................14
2.11 Hypervisor: XEN .......................................................................................................................................15
2.12 Characteristics of a cloud ...........................................................................................................................17
2.13 Data management applications...................................................................................................................18
2.13.1 Transactional data management ..........................................................................................................18
2.13.2 Analytical data management ...............................................................................................................19
2.14 Analysis of Data in the Cloud ....................................................................................................................20
2.15 Analyzing the Data using Hadoop Framework ..........................................................................................21
2.16 Cloud Test Bed...........................................................................................................................................21
2.17 Related Work .............................................................................................................................22
3. Case Study: Pathology Application ....................................................................................................................24
3.1 General Description of Pathology data Processing ......................................................................................24
3.2 Accessing the Application............................................................................................................................25
4. HADOOP Approach ...........................................................................................................................................27
4.1 MapReduce ..................................................................................................................................................29
4.2 MapReduce Execution Overview.................................................................................................................31
4.3 Working of Map and Reduce Programming model with an example ..........................................................33
4.4 The Hadoop Distributed File System (HDFS) .............................................................................................34
4.5 Why use Hadoop and MapReduce? .............................................................................................................37
4.6 Why Use Hadoop framework for the Pathology Application Data? ............................................................37
4.7 Hadoop Streaming ........................................................................................................................................38
5. Installation of HADOOP ....................................................................................................................................40
5.1 From two single-node clusters to a multi-node cluster ................................................................................43
6. Implementation of MAP REDUCE programming model ..................................................................................47
6.1 General Description of the current data processing used by the Pathology Application .............................47
6.2 System Design ..............................................................................................................................................47
6.3 Process Flow ................................................................................................................................................48
6.3.1 The pre-processing Step........................................................................................................................48
6.3.2 Loading pre-processed data into HDFS ................................................................................................49
6.3.3 Process Data using the Map Reduce Programming Model...................................................................49
6.3.4 Running the Python Code in Hadoop ...................................................................................................50
7. Evaluation and Experimentation ........................................................................................................................52
7.1 Response time to process the Pathology Data on a single node / Cluster ....................................................52
7.2 Response time to run the Pathology Application based on Data Size ..........................................54
7.3 Results and comparison................................................................................................................................58
7.4 Evaluation of the Software ...........................................................................................................................58
7.5 Further Work ................................................................................................................................................59
7.6 How would this Project Idea Help Other Applications? ..............................................................................59
7.7 Meeting Minimum Requirements ................................................................................................................59
7.8 Objectives Met .............................................................................................................................................60
8. Conclusion ..........................................................................................................................................................61
8.1 Project Success .............................................................................................................................................61
8.2 Summary ......................................................................................................................................................61
Bibliography ...........................................................................................................................................................62
Appendix A - Project Reflection ............................................................................................................................66
Appendix B - Critical Technical Issues and solution .............................................................................................68
Appendix C - Hadoop configuration files ..............................................................................................................72
C.1 Core-site.xml / mapred-site.xml / hdfs-site.xml Configuration files on master Node (Debian02) .............72
C.2 Core-site.xml / mapred-site.xml / hdfs-site.xml Configuration files on Slave Nodes (Debian05) ............73
C.3 Core-site.xml / mapred-site.xml / hdfs-site.xml Configuration files on Slave Nodes (Debian06) ..............75
Appendix D - MapReduce Program processing 1 GB data ....................................................................................77
Appendix E - Installation of Java / Python .............................................................................................................80
Installation of JAVA ..........................................................................................................................................80
Installation of Python .........................................................................................................................................81
Appendix F – Schedule ..........................................................................................................................................82
Appendix G – Interim Project Report.....................................................................................................................83
Table of Figures
Figure 1: Schedule used for the project with the major tasks and time scales/deadlines to complete these tasks. ..3
Figure 2: cloud computing Architecture ..................................................................................................................7
Figure 3: Types of cloud Models .............................................................................................................................8
Figure 4: Types of cloud system ............................................................................................................................10
Figure 5: Virtualization .........................................................................................................................................12
Figure 6: Types of Virtualization ..........................................................................................................................13
Figure 7: An example of 2 nodes in the cloud showing the Layered architecture. ................................................21
Figure 8: Data is distributed across nodes at load time. ........................................................................................28
Figure 9: Map Reduce programming Model .........................................................................................................29
Figure 10: Different colors represent different keys. All values with the same key are presented to a single
reduce task. .............................................................................................................................................................30
Figure 11: Mapping creates a new output list by applying a function to individual elements of an input list. .....30
Figure 12: Reducing a list iterates over the input values to produce an aggregate value as output. ......................31
Figure 13: Map Reduce Execution Overview........................................................................................................32
Figure 14: Map Reduce Programming Flow .........................................................................................................33
Figure 15: Map Reduce Flow Example .................................................................................................................34
Figure 16: Data Nodes holding blocks of multiple files with a replication factor of 2. The NameNode maps the
filenames onto the block IDs. .................................................................................................................................35
Figure 17: HDFS Architecture...............................................................................................................................36
Figure 18: Set up of a Multi Node cluster .............................................................................................................44
Figure 19: multi-node cluster setup ......................................................................................................................44
Figure 20: Running the program for the first time takes 47 seconds approximately and 37 seconds for the second
and consecutive runs...............................................................................................................................................53
Figure 21: Running the program for the first time on a Hadoop cluster consisting of two nodes takes 32 seconds
and for the second and consecutive runs takes 27 seconds. ...................................................................................53
Figure 22: Running the program for the first time on a Hadoop cluster consisting of three nodes takes 30
seconds and for the second and consecutive runs takes 22 seconds .......................................................................54
Figure 23: Difference of running the Pathology data on a Hadoop cluster with a single node and a Hadoop
cluster with two and three nodes ............................................................................................................................54
Figure 24: The difference of the time taken to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on
a Hadoop cluster with a single node. ......................................................................................................................55
Figure 25: The difference of the time taken to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on
a Hadoop cluster with two nodes............................................................................................................................56
Figure 26: The difference of the time taken to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on
a Hadoop cluster with three nodes..........................................................................................................................56
Figure 27: The Response time taken to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on single
and a multi node cluster setup ................................................................................................................................57
Figure 28: Processing time taken by a Hadoop cluster with one, two and three Nodes for 1GB pathology data. 57
Chapter 1
1. INTRODUCTION
1.1 Project Aim:
With the increasing popularity of cloud computing, the Hadoop software framework is becoming widely used for processing large data on clouds. A cloud, in simple terms, is a large group of interconnected computers. The aim of this research project is to process large datasets from a pathology application on the cloud using the Hadoop framework and its MapReduce programming model, and to evaluate the performance and scalability of this approach. Hadoop is designed for large-scale data processing on clusters of machines.
1.2 Objectives:
• Obtain a good understanding of cloud computing, types of cloud computing, cloud architecture, cloud models and virtualization.
• Understand how the use of clouds could benefit the IT industry.
• Understand the limitations and opportunities of using cloud computing.
• Understand the concept of virtualization and learn to deploy virtual machines on the cloud.
• Understand the Hadoop framework and learn to deploy a Hadoop cluster.
• Learn Hadoop's programming model (MapReduce) and design a framework to process large data using Hadoop and its MapReduce programming model.
• Evaluate the performance and scalability of this framework.
1.3 Minimum Requirements:
• Successfully implement Hadoop as a single-node setup.
• Successfully implement Hadoop as a multi-node setup with two/three machines.
• Design a MapReduce programming model to process the pathology application data using the Python programming language and the Hadoop streaming utility (a minimal sketch of such a streaming job is given after this list).
• Evaluate the performance of the data processing under the proposed Hadoop and MapReduce framework.
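To make the Hadoop streaming idea concrete, the sketch below shows the general shape of a streaming job written in Python: a mapper that reads records from standard input and emits tab-separated key/value pairs, and a reducer that sums the values for each key. This is an illustrative word-count-style sketch only; the comma-separated record format, the choice of the first field as the key, and the script names mapper.py and reducer.py are assumptions, not the actual pathology processing logic, which is described in Chapter 6.

#!/usr/bin/env python
# mapper.py - illustrative Hadoop Streaming mapper (not the project's pathology logic).
# Reads assumed comma-separated records from stdin and emits "key<TAB>1" pairs.
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if fields and fields[0]:
        print("%s\t1" % fields[0])

#!/usr/bin/env python
# reducer.py - illustrative Hadoop Streaming reducer: sums the counts for each key.
# Hadoop sorts the mapper output by key before it reaches the reducer.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key == current_key:
        count += int(value)
    else:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))

A job of this shape would typically be launched with the Hadoop streaming jar, for example: hadoop jar contrib/streaming/hadoop-streaming-*.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input <HDFS input path> -output <HDFS output path> (the jar location varies between Hadoop versions); the actual job used for the pathology data is described in Chapter 6.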
1.4 Motivation
In modern computing, cloud technologies have become increasingly prevalent, with the number of successful solutions growing rapidly. Amazon, Google, Microsoft and IBM are putting forward the abilities of the cloud and attracting many businesses to cloud technologies. The technologies driving the development of cloud computing are relatively new, and thus there are numerous open research questions being worked on. The Hadoop framework for processing large data in the cloud is a very interesting approach to handling large-scale computation in clouds; a few successful implementations of the Hadoop framework are described below.

A successful implementation of Hadoop has been reported for scalable image processing of astronomical images. Astronomical surveys of the sky generate tens of terabytes of images every night. The study of these images involves computational challenges, and these studies benefit only from the highest quality data. Given the quantity of data and the computational load involved, these problems can only be handled by distributing the work over a large number of machines. The report concluded that the use of Hadoop showed great improvement for processing large volumes of astronomical datasets: on a 400-node cluster they were able to process 100,000 files with approximately 300 million pixels in just three minutes (Keith, 2011).

Another successful implementation is in the field of biology, where scientific data is processed on a cloud using Hadoop. The implementation develops a Hadoop-based cloud computing application that processes sequences of microscope images of live cells, and Hadoop was evaluated on this scientific data processing case study. The report concluded that the cloud solution was attractive because it took advantage of many of the desirable features offered by the cloud concept and Hadoop, including scalability, reliability, fault tolerance and easy deployability (Chen, 2010). Therefore, looking at the success of the Hadoop framework in different areas of computing, the current project is another implementation using Hadoop.
1.5 Methodology:
In order to complete the objectives and meet the minimum requirements of this project, a substantial amount of time was spent researching cloud computing and its latest trends and technologies. Once the background reading was completed and a good understanding of the research topic area had been reached, a solution was proposed. The proposed solution reflects the objectives and the minimum requirements mentioned above. A suitable solution for working with the pathology data was created using an agile technique (Macias, 2003). The agile technique is based on iterative and incremental development; testing was therefore carried out alongside development, allowing changes to the design at every stage as needed and giving rapid results. The design was then evaluated for efficiency and scalability on the cloud test bed at the University of Leeds, and conclusions were drawn as to whether the solution adds value to the research area.
1.6 Schedule
A schedule was planned using a Gantt chart to manage the allocation of time to the project tasks. The tasks were allotted sufficient time for their completion. The schedule started on 1st June with a detailed study of the problem area and hands-on sessions on the School of Computing cloud test bed. Regular progress meetings helped in deciding how to approach the solution and which software tools should be used to design the solution for the problem area. Once a solution was worked out, the implementation was started after a thorough discussion with the supervisor. Sufficient time was dedicated to evaluating the developed software and writing the final report. At every step of the plan there was an assessment to check whether the tasks met the requirements of the project.
Figure 1: Schedule used for the project with the major tasks and time scales/deadlines to complete these tasks.
1.7 Contributions:
The research project makes the following contributions:
• The solution proposed and implemented will be useful in making decisions regarding the time and cost involved in processing the large data of an application using the Hadoop cloud framework.
• The evaluation of the proposed solution based on performance and scalability will enable suggestions to be made for future work or improvements in cloud technologies.
1.8 Report Structure:
The following is the structure of the report:
• Chapter 2 describes the background research carried out to obtain an understanding of what cloud computing is, its architecture, cloud types, cloud models, cloud technologies, cloud trends, the virtualization concept, types of virtualization, and a brief description of managing data on clouds.
• Chapter 3 describes a case study and discusses the real-world scenario and how cloud computing technologies and techniques could be beneficial.
• Chapter 4 deals with the Hadoop framework and how it helps in processing large volumes (petabytes) of data using its two important components: the MapReduce programming model and the Hadoop Distributed File System.
• Chapter 5 delineates the step-by-step installation of a Hadoop cluster with a single node and a Hadoop cluster with two nodes.
• Chapter 6 describes a solution using the MapReduce programming model to handle the processing of the pathology application data discussed in Chapter 3.
• Chapter 7 evaluates the implemented solution described in Chapter 6 by running experiments.
• Chapter 8 concludes by discussing the overall impact of the project.
Chapter 2
2. CLOUD COMPUTING
“Clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development
platforms and/or services). These resources can be dynamically reconfigured to adjust to a variable load (scale),
allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use
model in which guarantees are offered by the Infrastructure Provider by means of customized SLA” - (Vaquero,
2010).
“Cloud computing seems to be little more than a marketing umbrella, encompassing topics such as distributed
computing, grid computing, utility computing, and software-as-a-service, that have already received significant
research focus and commercial implementation. There also exist an increasing number of large companies that
are offering cloud computing infrastructure products and services” - (Daniel, 2009).
2.1 Introduction
With traditional desktop computing, we run software programs on each computer we own, and the documents we create are stored on the computer on which they were created. Documents created on a computer can be accessed from other computers on the same network, but they cannot be accessed outside that network. This entire scenario is PC-centred. With cloud computing, the software programs we use no longer run on our PC but are stored on servers that are accessed via the internet. The major advantage of this is that the software is still available for use even if the computer crashes. The same applies to the documents we create: they are placed on a collection of servers and accessed via the internet, and not only can we access a document, but anyone with permission can access it and make changes to it in real time. Therefore, the cloud computing model is not PC-centric (Miller, 2008).
“The cloud” is the key to defining cloud computing technology. The cloud, in simple terms, is a large group of interconnected computers; these computers can be PCs or network servers, and they can be private or public. For example, Google hosts a cloud that consists of both small PCs and larger servers. Google's cloud is a private cloud: Google owns it, and it is publicly accessible only by Google's users. Any authorized user can access data and applications from a computer using an internet connection, and for the user the infrastructure and the technology are completely invisible. Cloud computing is not network computing. With network computing, documents or applications are hosted on a single server owned by the company and can be accessed from any computer on the company network. The concept of cloud computing is much bigger: it encompasses multiple servers, multiple networks and multiple companies. Unlike network computing, cloud services and cloud storage can be accessed from anywhere using an internet connection, whereas with network computing access is limited to the company's network. Cloud computing is also not outsourcing, where a company subcontracts its computing services to an outside firm. The outsourcing firm may host a company's application or data, but the application or data is accessed only by the employees of the company using the company's network, and not by the entire world via the internet (Miller, 2008).
2.2 Properties of a Cloud
Cloud computing is user-centric: once we are connected to the cloud, the documents, data, images and applications stored on the cloud are not only ours but can also be shared with others.
Cloud computing is centred on tasks: the focus is not on what an application can do, but on how the application can do it for you.
Cloud computing is powerful: connecting hundreds and thousands of computers in a cloud creates a wealth of computing power that cannot be found on a single PC.
Cloud computing is widely accessible: as data is stored on the cloud, users can retrieve valuable information from multiple sources or repositories. The advantage is that we are not restricted to data from a single source, as we are on a desktop PC.
Perhaps the best-known and most popular examples of cloud computing applications are Google Docs and Spreadsheets, Google Calendar, Gmail and Picasa. All these applications are hosted on Google's servers and are accessible to any user with an internet connection. Thus, with the help of cloud computing, there is a shift from computer-dependent, isolated data use to data that can be shared and accessed by anyone from anywhere, and from the application to the task. In cloud computing, the user does not even have to know where exactly the data is located. All that matters is that the data is on the cloud and is immediately available to the user and other authorized users (Miller, 2008).
2.3 Architecture of Cloud Computing
The key to cloud computing is the network of servers, or even individual PCs, that are interconnected. As these computers run in parallel, combining the power of each provides supercomputing-like power that is publicly accessible via the internet. Users connect to the cloud via the internet; to a user, the cloud appears as a single device, application or document, and the hardware, and how that hardware is managed by the operating system, are completely invisible. When we talk about a cloud computing system, it is easier to understand if we divide it into two sections: the front end and the back end (Jonathan, 2011). The internet is usually the network by which the front end and the back end are connected to each other. The front end is the computer user or client; it includes the user's computer or network and the application required to access the cloud system. The cloud section of the system is the back end. “Various computers, servers and data storage systems that create the "cloud" of computing services form the back end of the system” (Jonathan, 2011). The process begins at the front-end interface, from where the user selects a task or service, such as starting an application. The system management then receives the user's request, finds the correct resources and calls the appropriate provisioning services. This service obtains the appropriate resources on the cloud and launches the appropriate web application. Once the application is launched, the system's metering and monitoring functions track the usage of the cloud, and the users are charged accordingly (Miller, 2008).
Communication between networked computers is made possible by using middleware. If a company that provides cloud computing has many clients, there is a high demand for a large volume of storage space, and so hundreds of digital storage devices are required. A cloud computing system must replicate all of its clients' information and store it on other devices, since these devices can occasionally break down. The replicated data is later accessed by the central server during a breakdown to retrieve data that would otherwise be unreachable (Jonathan, 2011). The cloud computing architectural model is presented below.
Figure 2: cloud computing Architecture
(Source: http://www.jot.fm/issues/issue_2009_03/column3/)
2.4 Models
There are five cloud models to consider:
(1) public clouds, (2) private clouds, (3) hybrid clouds, (4) federated clouds and (5) multi-clouds.
IT organizations can choose to deploy applications on public, private, hybrid, federated or multi-clouds. Public clouds are out on the internet, i.e. globally available, while private clouds are typically located on the premises of an organization or company, i.e. locally available. Companies must consider which cloud computing model to employ, i.e. whether to use a single model or a combination of models, rather than relying on one model to solve every problem. An application that is needed only temporarily might be best suited for deployment in a public cloud, because this avoids the need to purchase additional equipment, software and hardware to meet a temporary need. On the other hand, a permanent application, or one that has specific requirements on quality of service or location of data, might be best suited to deployment in a private or hybrid cloud (Sun Microsystems, 2009).
Figure 3: Types of cloud Models
(Source: http://computinged.com/edge/cloud-computing-winning-formula-for-organizations-in-gaining-technology-edge/)
Public clouds
Public clouds are run by third parties, and applications from different customers may be mixed together on the cloud provider's servers. Public clouds are most often hosted away from customer premises, and they provide a way to reduce customer risk (such as natural disasters and infrastructure risks) and cost by providing a flexible, even temporary, extension to enterprise infrastructure. If a public cloud is implemented with performance and security in mind, the existence of other applications running in the cloud should be transparent to both cloud architects and end users. One of the major benefits of public clouds is that they can be much larger than a private cloud, with the ability to scale up and down on demand, shifting infrastructure risks from the enterprise to the cloud provider, even if only temporarily. Thus, a public cloud describes cloud computing in which resources are dynamically provided on a self-service, fine-grained basis over the internet via web services, from a third-party provider who charges on a fine-grained utility computing basis (Vaquero, 2009).
Private clouds
Private clouds are built exclusively for one client, providing security, quality of service and control over data. The company owns the entire infrastructure and has complete control over how applications are deployed on it. Private clouds can be built and managed by the company's own IT staff or by a cloud provider. For mission-critical applications, IT organizations use their own private clouds to protect critical infrastructure, as deploying such applications globally could pose security risks (Vaquero, 2009).
Hybrid clouds
A hybrid storage cloud is essentially a combination of public and private storage clouds. It can be used to handle planned increases in workload. A disadvantage is deciding how to distribute applications across both the private and public clouds; beyond this, the relationship between data and processing resources must also be considered. A hybrid cloud can be of great use and success if the data is small or if the application is stateless (Vaquero, 2009).
Federated Clouds
In this model, different computing infrastructure providers (IPs) can join together to create a federated cloud. The advantages of such a model include cost savings, since providers do not need to over-provision for spikes in capacity demand. The biggest advantage of the federated model is the lack of reliance on a single vendor and the higher availability of services due to the greater distribution of computing resources across different infrastructures. One of the primary sources of reluctance towards a complete cloud hosting solution is the reliance on a single vendor for the entire availability of the site; a federated cloud would eliminate this issue (Abbott, Keeven & Fisher partners, 2011).
Multi cloud
Consider a multi-cloud engine such as RightScale, which interacts with the APIs of the cloud infrastructures and manages aspects of each cloud. The advantage of RightScale is that it does not lock you into any one particular cloud: we are free to choose among different providers of cloud services, and we can even deploy and move our applications across multiple clouds. Users can also modify cloud-specific input parameters and launch servers with the same configurations on the new cloud (rightscale.com, 2011).
“Deployments spanning multiple clouds can enable disaster recovery scenarios, geography-specific data and
processing, or hybrid architectures across private and public clouds” - (rightscale.com, 2011).
The present project uses the cloud test bed of the School of Computing, University of Leeds, which is a private cloud owned by the School.
2.5 Types of cloud system
In this section, an attempt is made to distinguish the kinds of systems in which clouds are mostly used. Many companies use software services as the basis of their business. These Service Providers (SPs) provide services and give users access to those services via internet-based interfaces. Clouds outsource the provision of the computing infrastructure required to host these services. Infrastructure Providers (IPs) offer this infrastructure "as a service", shifting computing resources or services from the SPs to the IPs, so that the SPs gain flexibility and reduce costs (Vaquero, 2010).
Figure 4: Types of cloud system
(Source: http://www.bitsandbuzz.com/article/dont-get-stuck-in-a-cloud/)
Infrastructure as a Service
A large set of computing resources, such as processing and storage capacity, is managed by the IPs. They are able to split, assign and dynamically resize these resources via virtualization to build systems as demanded by customers. This is the scenario of "Infrastructure as a Service" (IaaS) (Vaquero, 2010).
Software as a Service
The most common type of cloud service is "Software as a Service" (SaaS). This type of cloud computing delivers a single application through the browser to thousands of customers. An example of SaaS is the online word processor (Vaquero, 2010), an online alternative to typical office applications. On the customer side, it means no investment in servers or software licensing; on the provider side, with just one application to maintain, costs are low compared to conventional hosting (Vaquero, 2010). Customers pay for using the software but do not pay for owning it. Users access an application via an API, an Application Programming Interface that allows a remote program to use or communicate with another program or service.
Platform as a Service.
This form of cloud computing delivers development environments and platforms as a service. We build our own applications, which run on the provider's infrastructure and are delivered to our users via the internet from the provider's servers. Platform as a Service (PaaS) is quite constrained by the vendor's design and capabilities, so complete freedom is not available, but predictability and pre-integration are gained. Examples of PaaS include Coghead and the Google App Engine (Eric, 2011).
2.6 Companies that offer Cloud services
In this section we look at some companies that offer cloud services.
Amazon
Amazon is a primary provider of cloud computing services. One of its services is the Elastic Compute Cloud, also known as EC2. Developers and companies can rent capacity on Amazon's proprietary cloud of servers, one of the biggest server farms in the world. EC2 enables customers to request a set of virtual machines, onto which they can deploy any application of their choice. Hence, customers can create, launch and terminate server instances on demand, creating elasticity. Users pick the size and power they want for their virtual servers and Amazon does the rest. EC2 is only one part of the offering; Amazon also provides developers with direct access to its software and machines, allowing them to build low-cost, reliable and powerful web-based applications, and developers pay only for the computing they use (Miller, 2008).
Google App Engine
Google also offers cloud computing services. Its service is the Google App Engine, which enables developers to build their web applications on the same infrastructure that powers Google's own applications. Google App Engine provides a fully integrated application environment, and applications are easy to build, maintain and scale: all that is needed is to develop the application and upload it to the App Engine cloud. Unlike other cloud hosting services, Google App Engine is free to use (Miller, 2008).
Salesforce.com
The company is well known for its cloud computing developments and its sales-management SaaS. Its cloud computing architecture is dubbed Force.com. The service provided is on-demand and runs across the internet. Salesforce provides its own developer toolkit and API, and charges fees per user. It has a directory of web-based applications called AppExchange. Developers can access AppExchange applications uploaded by other developers; they can also share their own applications or keep them private so that they are accessible only to authorized clients. Most of the applications in the AppExchange library are free to use, and most are sales-related, such as sales analysis tools and financial analysis apps (Miller, 2008).
IBM
IBM also offers cloud computing solutions. The company targets small and medium-sized businesses with its on-demand cloud-based suite. The services provided by the suite include email continuity, email archiving, data backup and recovery, etc. IBM manages its cloud hardware using Hadoop, which is based on the MapReduce model used by Google (Miller, 2008).
IBM, Amazon, Salesforce.com and Google are not the only companies that provide cloud services; there are other, smaller companies, including 3tera, 10gen, Nirvanix, etc. (Miller, 2008).
2.7 Virtualization
A data center consists of hundreds or thousands of servers, and most of the time a server's capacity is not fully used, so there is unused processing power that goes to waste. In such cases it is possible to make a physical server think that it is multiple servers, each running its own operating system. This technique is termed server virtualization, and it reduces the need for more physical machines (Jonathan Strickland, 2011). A traditional server runs a single application on an operating system, which leads to tremendous cost in a number of areas: hardware, operations, management and maintenance. To handle these issues and costs, enterprise IT has adopted a compelling tool called virtualization technology. For each of these applications the average utilization is about 5-10%; on average, 90-95% of a server's capacity goes unused across the environment. What virtualization technology does is take advantage of this and run these environments side by side on a much smaller number of physical servers. The environments could be databases, business applications, web servers, etc.; we can take these down and consolidate them onto a much smaller number of physical servers. Running multiple logical servers on a single physical machine, also termed server consolidation, is a popular way to save money spent on hardware and to make administration and backup easier (Shinder, 2008). Virtual applications reduce hardware costs and ease application deployment. Each of these environments now runs side by side on a single machine, and each of them is isolated and fully encapsulated.
Figure 5: Virtualization
(Source: CNET/James Urquhart)
The reasons why we use virtualization are clear from the above and include the following:
• It saves money: virtualization reduces the number of servers, which means significant savings on hardware cost and on the amount of energy needed to run the hardware.
• It is good for the environment: the energy savings brought by adopting virtualization technologies reduce the need to build many power plants and thus help to conserve energy resources.
• It reduces the work of system administrators: with virtualization, administrators do not have to support as many machines and can work on tasks that need more strategic administration.
• It makes better use of hardware: there is a higher hardware utilization rate, as there are enough virtual machines on each server to increase its utilization from the typical 5-10% to as much as 90-95%.
• It makes software installation easier: vendors are increasingly inclined to deliver their products preinstalled in virtual machines, reducing the traditional installation and configuration work (Bernard Golden, 2007).
2.8 Types of Virtualization
Most of the activity in virtualization technology focuses on server virtualization. There are three main types of virtualization:
1) Hardware emulation: a machine's hardware environment is represented in software so that multiple operating systems can be installed on a single machine.
2) Paravirtualization: a software layer coordinates access from the various operating systems to the underlying hardware.
3) Operating system virtualization: self-contained representations of the underlying operating system are created in order to provide applications with isolated execution environments. Each self-contained container reflects the underlying operating system version (Bernard Golden, 2007).
Figure 6: Types of Virtualization
Hypervisors: a hypervisor is virtualization software that allows a single piece of hardware to be shared by multiple operating systems. Hypervisors are also termed virtual machine managers. Each operating system appears to have the host's processor, memory and other resources all to itself, but it is the hypervisor that controls the host's resources and processors, allocating what is needed by each operating system and making sure that the virtual machines (guest operating systems) cannot disrupt each other (searchservervirtualization.techtarget.com, 2011). A few well-known hypervisors are VMware, Xen and KVM.
Virtual Infrastructure Manager: examples of Virtual Infrastructure Managers include:
• OpenNebula (libvirt.org, 2011): it controls virtual machines (VMs) in a collection of distributed resources by orchestrating storage, network and virtualization technologies. The OpenNebula driver lets us manage a private or hybrid (Amazon EC2 or Elastic Hosts based) cloud using a standard libvirt virtualization interface and API, as well as the related tools and VM description files (opennebula.org, 2011).
• Eucalyptus is another virtual infrastructure manager and has the ability to deploy public or private clouds. Users can run and control the virtual machine instances deployed with Eucalyptus. It is part of the Ubuntu Enterprise Cloud and is used to enable hybrid cloud infrastructures between public and private clouds (open.eucalyptus.com, 2011).
• Nimbus deploys virtual machines on remote resources leased by clients, configuring them in the way desired by the user. It is a collection of open source tools that provide an Infrastructure-as-a-Service cloud solution (nimbusproject.org, 2011).
2.9 Major Players and Products in Virtualization
The list below represents the major players in virtualization:
VMware: "Provides hardware emulation virtualization products called VMware Server and ESX Server" (Bernard Golden, 2007).
Xen: "A new open source contender. Provides a paravirtualization solution. Xen comes bundled with most Linux distributions." (Bernard Golden, 2007).
XenSource: "Provides products that are commercial extensions of Xen focused on Windows virtualization." (Bernard Golden, 2007).
OpenVZ: "An open source product providing operating system virtualization. Available for both Windows and Linux." (Bernard Golden, 2007).
SWsoft: "The commercial sponsor of OpenVZ. Provides commercial version of OpenVZ called Virtuozzo." (Bernard Golden, 2007).
OpenSolaris: "The open source version of Sun's Solaris operating system provides operating system virtualization and will also provide Xen support in an upcoming release." (Bernard Golden, 2007).
2.10 Virtualization Project Steps
Once we have evaluated virtualization and its benefits, we can implement a virtualization project. A virtualization project can be implemented using these five steps:
• Evaluate the workload of our current servers: check whether virtualization can benefit us in terms of cost, management and maintenance.
• Define our system architecture: decide what type of virtualization we would use and what kind of use case we need to support.
• Select the hosting hardware and virtualization software: evaluate the capabilities of the virtualization software to ensure that it supports the selected use case.
• Migrate the existing servers to the new environment: establish whether migration products can help us move our systems or whether we need to move them manually.
• Administer the virtualized environment: check whether the tools for virtualization product management are sufficient for our needs or whether we should work with more general system management tools to monitor our environment (Bernard Golden, 2007).
2.11 Hypervisor: XEN
In this section, the Xen hypervisor is discussed, since it is used to create the virtual machines for this project. The Xen hypervisor is described as the most secure and fastest infrastructure virtualization solution currently available, and it supports a wide range of operating systems including Linux, Solaris and Windows. With Xen virtualization, a thin software layer known as the Xen hypervisor is inserted between the server's hardware and the operating system (xen.org, 2011). This provides an abstraction layer that allows each physical server to effectively run one or more "virtual servers".
Xen was chosen over other hypervisors because it has a "thin hypervisor" model: it has no device drivers, keeps guests isolated, is roughly a 2 MB executable, and relies on service domains for its functionality.
Currently, Xen hypervisor version 4.0.1 is installed on the machine "testgrid3", as shown below:
sc10bdg@testgrid3 /opt/images/user-images/sc10bdg % dmesg | grep Xen\ version
[ 0.000000] Xen version: 4.0.1 (preserve-AD) (dom0)
Each Xen virtual machine is defined by (1) a Xen configuration file, which can be modified, and (2) a disk image (root.img). We modify the Xen configuration file to point the kernel entry to the directory in which the kernel image is located; we also change the location of the disk image and the amount of memory. Once the changes are made, we save the Xen configuration file under a unique name (xen.org, 2011). In this research project we create three virtual machines, and thus three configuration files and three image files are used. The setup of the three configuration files is given below:
sc10bdg@testgrid3 /opt/images/user-images/sc10bdg % ls -ltr
-rw-r--r-- 1 sc10bdg sc10bdg         238 Jul  4 14:13 hadoop.cfg
-rw-r--r-- 1 sc10bdg sc10bdg         240 Jul  8 23:42 hadoop1.cfg
-rw-r--r-- 1 sc10bdg sc10bdg         240 Jul  8 23:42 hadoop2.cfg
-rw-r--r-- 1 sc10bdg sc10bdg 23068672000 Jul 26 02:21 hadoop1.img
-rw-r--r-- 1 sc10bdg sc10bdg 23069720576 Jul 26 02:17 hadoop.img
-rw-r--r-- 1 sc10bdg sc10bdg 23068672000 Jul 26 02:21 hadoop2.img
sc10bdg@testgrid3 /opt/images/user-images/sc10bdg % cat hadoop.cfg
kernel = "/boot/vmlinuz-2.6.32.24"
memory = 512
name = "hadoop"
vif = ['mac=00:03:0a:00:0A:02',]
disk = ['tap2:tapdisk:aio:/opt/images/user-images/sc10bdg/hadoop.img,xvda,w']
root = "/dev/xvda"
extra = "fastboot console=hvc0 xencons=tty"
sc10bdg@testgrid3 /opt/images/user-images/sc10bdg % cat hadoop1.cfg
kernel = "/boot/vmlinuz-2.6.32.24"
memory = 512
name = "hadoop1"
vif = ['mac=00:03:0a:00:0A:03',]
disk = ['tap2:tapdisk:aio:/opt/images/user-images/sc10bdg/hadoop1.img,xvda,w']
root = "/dev/xvda"
extra = "fastboot console=hvc0 xencons=tty"
sc10bdg@testgrid3 /opt/images/user-images/sc10bdg % cat hadoop2.cfg
kernel = "/boot/vmlinuz-2.6.32.24"
memory = 512
name = "hadoop2"
vif = ['mac=00:03:0a:00:0A:04',]
disk = ['tap2:tapdisk:aio:/opt/images/user-images/sc10bdg/hadoop2.img,xvda,w']
root = "/dev/xvda"
extra = "fastboot console=hvc0 xencons=tty"
We can start the images by running the commands "xm create hadoop", "xm create hadoop1" and "xm create hadoop2" (xen.org, 2011). To check whether the virtual machines are running, we run the command below:
sc10bdg@testgrid3 /opt/images/user-images/sc10bdg % sudo xm list
Name                ID   Mem  VCPUs  State   Time(s)
hadoop              21    512      1  -b----     17.1
hadoop1             22    512      1  -b----     16.2
hadoop2             23   1024      4  -b----     30.7
patho-sc10bdg       24   1024      4  -b----     33.5
If we need to shut down the virtual machines, we execute the commands below:
sc10bdg@testgrid3 /opt/images/user-images/sc10bdg % sudo xm shutdown 21
sc10bdg@testgrid3 /opt/images/user-images/sc10bdg % sudo xm shutdown 22
sc10bdg@testgrid3 /opt/images/user-images/sc10bdg % sudo xm shutdown 23
To check that the virtual machines have shut down, we execute the "xm list" command again (xen.org, 2011):
sc10bdg@testgrid3 /opt/images/user-images/sc10bdg % sudo xm list
Name                ID   Mem  VCPUs  State   Time(s)
patho-sc10bdg       24   1024      4  -b----     33.5
2.12 Characteristics of a cloud
The next section shows what types of database applications could be considered for cloud deployment; therefore, in this section we first discuss a few of the most important characteristics of cloud computing.
It is elastic, provided the workload is parallelizable. One of the advantages of cloud computing is its ability to handle changing conditions. During a seasonal or unexpected increase in demand for a product sold by an e-commerce company, or during a growth phase of a social networking site, additional computational resources can be allocated to handle the increased demand (Daniel, 2009). In this environment, we only pay for what we use or need, so we can obtain increased resources to handle spikes in workload and then release the additional resources once the spike has subsided. This is also termed pay-as-you-go, much like a metered taxi. However, additional computational resources are obtained by allocating additional server instances to a task. Amazon's Elastic Compute Cloud (EC2) provides computing resources as small, large and extra-large virtual private server instances. If an application cannot take advantage of the additional server instances by offloading some of its work to run in parallel, then having the additional server instances is not of much help.
Data is stored on an untrusted host. Moving data off an organization's or a person's premises increases the security risks, and appropriate precautions must be taken to handle this. The name "cloud computing" suggests that storage and computing resources are delivered from an external location, and that location is subject to the local rules and regulations of its country. In the United States, for example, the US Patriot Act gives the government the right to demand access to the data stored on any computer; if the data is hosted by a third party in the US, it may have to be handed over without the client using the hosting service having any knowledge of it (Daniel, 2009). Since many cloud computing vendors give the client little control over where data is stored, the customer either has to accept this risk or can encrypt the data, keep the key off the host, and store only the encrypted data on the host.
Replication of data. Availability, accessibility and durability of data are important features for cloud storage providers, as losing data or data being unavailable could hurt the customer's business by missing the targets set in service level agreements (SLAs) and could damage business reputation. Data availability is typically achieved through replication. Large cloud service providers, with data centers spread over the globe, deal with fault tolerance by replicating the data. Amazon's S3 cloud storage service replicates data across availability zones and regions so that data and applications are not affected even if an entire location fails (Daniel, 2009). Clients should understand the details of the replication scheme carefully; for example, Amazon's Elastic Block Store replicates data only within a single availability zone and is therefore more prone to failures than Amazon's S3 cloud storage.
2.13 Data management applications
Looking at the above characteristics we can get an idea of what types of data management applications could be moved into the cloud. In this section we describe the suitability of moving transactional and analytical databases into the cloud.
2.13.1 Transactional data management
By "transactional data management" we refer to the databases that are the essential sustaining element of banking, airline reservation and online e-commerce applications. These applications tend to be fairly write-intensive, and unavailability of their databases can hamper the business, as most of them are mission- or business-critical (Daniel, 2009). Transactional data management applications are not a good option for cloud deployment, for the following reasons:
Transactional databases do not use a shared-nothing architecture.
"The transactional database market is dominated by Oracle, IBM DB2, Microsoft SQL Server, and Sybase" (Olofson, 2006).
Microsoft SQL Server and Sybase can be deployed using an SNA. IBM has released a shared-nothing implementation of DB2, but it is designed to help scale analytical applications running on data warehouses (Paul McInerney, 2011).
"Oracle has no shared-nothing architecture. Implementing a transactional database system using a shared-nothing architecture is non-trivial, since data is partitioned across sites and, in general, transactions cannot be restricted to accessing data from a single site. This results in complex distributed locking, commit protocols, and in data being shipped over the network leading to increased time delay and network bandwidth problems. Furthermore the main benefit of a shared-nothing architecture is its scalability" (Daniel, 2009).
However, this benefit is not very relevant for transactional data processing, as the majority of deployments are less than 1 TB in size.
Shared-nothing architecture
A shared-nothing architecture is a distributed computing architecture consisting of multiple nodes, where each node has its own private memory, input/output devices and disks, independent of any other machine in the network. Each machine is self-sufficient and shares nothing across the network. This type of system has become popular and is highly scalable (Akshaya Bhatia, 2011).
There are high risks in storing transactional data on an untrusted host. These databases contain the operational data needed to power business-critical and mission-critical processes. This data often includes personal information about customers and sensitive information such as credit card numbers, for which any security breach or privacy violation is unacceptable. Therefore, transactional data management applications are not well suited for cloud deployment (Daniel, 2009).
"Though, there are a couple of companies that will sell you a transactional database that can run in Amazon's cloud: EnterpriseDB's Postgres Plus Advanced Server and Oracle. However, there has yet to be any published case studies of customers successfully implementing a mission critical transactional database using these cloud products and, at least in Oracle's case, the cloud version seems to be mainly intended for database backup" (Monash, 2008).
2.13.2 Analytical data management
By "analytical data management" we refer to databases that are queried for business planning, analysis, problem solving, and decision support. Historical data from various operational databases is typically involved in the analysis (Daniel, 2009). The scale of an analytical data management system is generally bigger than that of a transactional system: while the scale of a transactional database is around 1 TB, analytical systems are crossing the petabyte barrier (Monash, 2011).
"Furthermore, analytical systems tend to be read-mostly (or read-only), with occasional batch inserts. Analytical data management consists of $3.98 billion" (Vesset, 2006) "of the $14.6 billion database market" (Olofson, 2006) "and is growing at a rate of 10.3% annually" (Vesset, 2006).
In this section we see why analytical data management systems are well suited to running in a cloud environment.
Shared-nothing architecture is good for analytical data management.
"Teradata, Netezza, Greenplum, DATAllegro (recently acquired by Microsoft), Vertica, and Aster Data all use a shared-nothing architecture (at least in the storage layer) in their analytical DBMS products, with IBM DB2 and recently Oracle also adding shared-nothing analytical products. The ever increasing amount of data involved in data analysis workloads is the primary driver behind the choice of a shared-nothing architecture, as the architecture is widely believed to scale the best" (Daniel, 2009).
Workloads involved in data analysis normally consist of star schema joins and multidimensional aggregations, which are easy to parallelize across machines in a shared-nothing network. Complex distributed locking and commit protocols are avoided because writes are infrequent in these workloads (Daniel, 2009).
Sensitive data can be left out. In certain scenarios we can identify the data that could cause damage if accessed by a third party. Once such data has been identified, we can either remove it from the analytical data store or include it only after encrypting it.
Looking at the above characteristics, it can be concluded that analytical data management applications are well suited for cloud deployment. The cloud is a particularly attractive deployment option for medium-sized businesses, especially those that do not currently have a data warehouse because of the high capital expenditure involved, and for sudden projects that arise from changing business requirements (Daniel, 2009).
2.14 Analysis of Data in the Cloud
As we have seen above, analytical database systems are the ones best placed to move into the cloud. We will focus on one class of software solutions: MapReduce-like software. Before looking at this in detail, we consider some characteristics and features that such solutions should have, based on the characteristics of a cloud DBMS.
Cloud DBMS characteristics
Efficiency. The cost of cloud computing is structured so that you pay only for what you use; the price increases linearly with the storage, network bandwidth and computation power consumed. Hence, if data analysis software product ABC needs more compute units than software product XYZ to perform the same task, then running ABC will cost more than running XYZ (Daniel, 2009).
Fault tolerance. Fault tolerance for analytical data workloads is measured differently from fault tolerance for transactional workloads. For read-only queries there are no write transactions to commit and no updates to lose on a node failure. A fault-tolerant analytical DBMS is therefore simply one that does not have to restart a query if one of the nodes involved in query processing fails (Daniel, 2009).
Ability to run in a heterogeneous environment. The performance of the nodes in a cloud is often neither equal nor consistent; some nodes perform worse than others. There are numerous possible reasons for this, one of them being partial hardware failure leading to performance degradation of a node (Daniel, 2009). If the total amount of work needed to run a query is divided equally among the cloud nodes, then the time taken to complete the query is the time taken by the slowest node to complete its assigned task, so a single slow node degrades total query performance. A system designed to run in a heterogeneous environment should therefore take appropriate measures to prevent this from occurring.
2.15 Analyzing the Data using Hadoop Framework
Hadoop is an open source Apache project written in Java. It provides users with a distributed file system and a method for distributed computation. It is based on Google's File System and MapReduce concepts, which describe how to build a framework capable of executing intensive computations across a number of machines (Michael G. Noll, 2011). Although it can be used on a single machine, its real power lies in its ability to work with hundreds or thousands of machines, each with several processor cores, and to distribute workload across these machines effectively. Hadoop is built to process large volumes of data (hundreds of gigabytes, terabytes or petabytes). It therefore includes a distributed file system which divides the input data and sends sub-parts of the original data to different machines in the cluster to store; the problem is then processed in parallel across the machines in the cluster and results are computed as efficiently as possible. However, whenever multiple machines need to cooperate with one another, the probability of failures rises. Hadoop is designed to handle data congestion issues and hardware failures robustly.
2.16 Cloud Test Bed
The School of Computing has a cloud test bed comprising 8 machines. It is mainly used for research on some of the open questions in the field of distributed and cloud computing. The cloud is firewalled and can be accessed from within the School of Computing using SSH.
Figure 7: An example of 2 nodes in the cloud showing the Layered architecture.
(Source: University of Leeds, School of computing)
The specifications of each node machine are:
CPU: Quad Core 2.83GHz
Memory: 4GB RAM
NIC: 1GBit
We can access the cloud test bed via SSH using a login and password created by the cloud administrator. The work required in the project can be done using command-line tools, and access to these tools was made available. Training on how to use the cloud test bed was given over a certain period. Additionally, a simple tutorial provided by the cloud administrator contained the basic commands and steps needed to create a virtual machine using the Xen hypervisor, as well as information on where the Xen documentation could be found for additional commands that can be used on the cloud test bed.
A virtual machine (10.0.10.2 / debian02) was initially created using the Xen hypervisor and accessed using SSH from within the cloud test bed.
Example:
Access the cloud test bed: ssh [email protected] and the password given by the administrator
Access the Virtual machine: ssh [email protected] and the password
A template file was created containing the information required to provision a virtual machine onto the cloud from a specified Debian image. Multiple template files were created so that multiple virtual machines could be deployed onto the cloud.
2.17 Related Work
A few related works are presented in the list below:
• A successful implementation of Hadoop has been reported for scalable image processing of astronomical images. Astronomical surveys of the sky generate tens of terabytes of images every night. The study of these images involves computational challenges, and these studies benefit from the highest quality data. Given the quantity of data and the computational load, these problems can be addressed only by distributing the work over a large number of nodes or machines. The report concluded that Hadoop showed great improvements for processing large volumes of astronomical datasets: on a 400-node cluster they were able to process 100,000 files with approximately 300 million pixels in just three minutes (Keith, 2011).
• Another successful implementation used MapReduce for data-intensive scientific analyses. As scientific data analysis deals with large volumes of data, efficient concurrent and parallel programming is key to handling such data. The project concluded that scientific data analyses can achieve scalability and speed-up using the MapReduce technique. It also stated that tightly coupled applications can benefit from the MapReduce technique if the data size used is appropriate, since the overhead introduced by a particular runtime diminishes as the amount of data and computation increases (Jaliya, 2008).
• Another successful implementation is in the field of biology, where scientific data is processed on a cloud using Hadoop. The implementation develops a Hadoop-based cloud computing application that processes sequences of microscope images of live cells. The cloud solution was attractive because it took advantage of many of the desirable features offered by the cloud concept and Hadoop, including scalability, reliability, fault tolerance, easy deployability, etc. (Chen, 2010).
• A paper by Tomi Aarnio of the Helsinki University of Technology on parallel data processing with MapReduce stated that, when using MapReduce in a cluster, the computation power is improved by adding new nodes to the network; with more nodes, more tasks run in parallel. It also noted that the code becomes easier, simpler and smaller to maintain and understand, as the user expresses the problem using only the Map and Reduce functions (Tomi, 2009).
• Another paper, "Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce", described a framework to store and retrieve large sets of RDF triples using Hadoop. HDFS and the MapReduce framework were used to store and retrieve the RDF data efficiently. The results showed that huge amounts of data can be stored on Hadoop clusters built from cheap commodity hardware and that the data can still be retrieved fast enough (Mohammed, 2009).
Chapter 3
3. Case Study: Pathology Application
3.1 General Description of Pathology data Processing
There is an ongoing project at the University of Leeds on the reconstruction of 3D volumes from 2D stained serial sections using image-based registration at multiple scales. The novelty of the project is the use of the full (cellular) resolution and full extent of the images to perform accurate registration while taking into account local tissue imperfections such as tearing. Over the last two and a half years the group has developed a tool that is now in regular use within Leeds by various researchers and is usable by a trained lab technician (Djemame, 2010). The disadvantage of the current system is that the application runs on a single machine, so data processing can take a long time because of the memory limitations of a single computer. We consider the development of a scalable and reliable cloud system using the Hadoop framework to see whether this allows us to process the data more efficiently.
Figure: a) The existing application for slice alignment, showing reconstruction of ~200 serial sections through a mouse embryo; b)-d) various methods of visualizing the reconstructed volumes – 3D rendering in color of the whole volume, arbitrary 2D sections drawn through the entire volume, and visualization of anatomic structures segmented (i.e. visualized with a surface) in 3D.
3.2 Accessing the Application
In the non-cloud version, the client or user requests patient data by sending an HTTP request to the image server at St. James's hospital, which holds the pathology images (e.g. tissue images); the server then sends the results back to the user based on the request parameters. The pathology application made available for this project can be accessed as shown below:
`[email protected] /opt/images % ssh [email protected]
Once we are on the application server, the details of the pathology application and the image server can be found under the /opt directory:
`-patho01 ~ # cd /opt
`-patho01 /opt # ls -ltr
drwx------ 6 root root 4096 2011-07-29 18:32 patho
drwx------ 9 root root 20480 2011-07-29 19:40 imageserver
We can start the Aperio server by executing an existing shell script, as shown below:
`-patho01 /opt #
`-patho01 /opt # cd imageserver
`-patho01 /opt/imageserver # ls -ltr *run_server*
-rwx------ 1 root root 632 2010-12-08 00:20 run_server.sh
`-patho01 /opt/imageserver # ./run_server.sh
Waiting for server to start...
Server started
Server has port
The images are stored in the /slides directory of the image server. Below are the details of the images being used
to obtain the pathology data.
`-patho01 /opt/imageserver/slides # ls -ltr
total 1919892
-rw------- 1 root root 177552579 2009-12-29 15:01 CMU-1.svs
-rw------- 1 root root 390750635 2009-12-29 15:06 CMU-2.svs
-rw------- 1 root root 253815723 2009-12-29 15:16 CMU-3.svs
-rw------- 1 root root 1141873193 2010-11-16 15:25 117473.svs
Once the image server has started, we execute a sample query to run the application and collect the output (data) it gives. In this project we work with the data obtained in this way. The core of the application is the program "deconvolution-model", which calculates the number of nuclei and their coordinates for a given area of an image.
`-patho01 /opt/patho # ls -ltr
-rwx------ 1 root root 227 2010-11-23 15:28 deconvolution-model.sh
`-patho01 /opt/patho # ./deconvolution-model.sh
Native Image size : 93120 x 60946
Native zoom: 40
Loading Deconvolution vectors: estimated_cdv.cdv
[0.471041,0.735899,0.486387]
[0.349641,0.78081,0.517771]
[0.7548,0.001,0.655955]
Channel 1
No.nuclei: 5
346.448,439.332
46.8712,102.746
369.361,7.27193
230.885,197.594
138.333,128.117
In this research project we describe the advantages of applying Hadoop and the MapReduce programming model to the pathology data, show how megabytes and gigabytes of such data can be processed effectively and scalably using the Hadoop framework, and evaluate its performance.
Chapter 4
4. HADOOP Approach
Hadoop is a fault-tolerant distributed system used for storing and processing data, and it is highly scalable. The scalability comes from the Hadoop Distributed File System, a high-bandwidth clustered storage layer, and from MapReduce, a fault-tolerant distributed processing layer. It analyzes and processes a variety of older and newer data to support business operations. In most systems data is moved to the node that performs the computation; in Hadoop, processing is done where the data resides. The Hadoop cloud or cluster is a disruptive technology in the data center, and one of its major advantages is that jobs can be submitted and scheduled within the data center in an orderly way (hadoop.apache.org, 2011).
“Even if hundreds or thousands of CPU cores are placed on a single machine, it would not be possible to deliver
input data to these cores fast enough for processing. Individual hard drives can only sustain read speeds
between 60-100 MB/second. These speeds have been increasing over time, but not at the same breakneck pace
as processors. Optimistically assuming the upper limit of 100 MB/second, and assuming four independent I/O
channels are available to the machine, that provides 400 MB of data every second. A 4 terabyte data set would
thus take over 10,000 seconds to read--about three hours just to load the data! With 100 separate machines each
with two I/O channels on the job, this drops to three minutes.”- (developer.yahoo.com, 2011).
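As a quick sanity check of the figures quoted above, the arithmetic can be reproduced in a few lines of Python (this small calculation is ours and is not part of the cited tutorial):
# Back-of-the-envelope check of the read-time figures quoted above.
data_mb = 4 * 1000 * 1000              # a 4 terabyte data set, expressed in megabytes
channel_mb_s = 100                     # optimistic read speed per I/O channel (MB/s)

# One machine with four independent I/O channels: 400 MB/s in total.
single_machine_s = data_mb / (4 * channel_mb_s)
print(single_machine_s)                # 10000 seconds, i.e. just under three hours

# 100 machines, each with two I/O channels: 20,000 MB/s in total.
cluster_minutes = data_mb / (100 * 2 * channel_mb_s) / 60.0
print(cluster_minutes)                 # roughly 3.3 minutes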
Hadoop processes large amounts of data by connecting many commodity computers together and making them work in parallel. A theoretical 100-CPU machine would cost a very large amount of money; it would in fact cost more than 100 single-CPU machines. Hadoop instead ties together smaller, more reasonably priced computers to form a single cost-effective compute cluster.
Computation on large amounts of data has been done before in distributed settings; what makes Hadoop unique is its simplified programming model. When data is loaded into a Hadoop cluster it is distributed across the machines of the cluster: the Hadoop Distributed File System (HDFS) splits large data files into parts which are managed by different machines in the cluster, and each part is replicated across several machines so that a single machine failure does not make any data unavailable (developer.yahoo.com, 2011).
In the Hadoop programming framework data is record-oriented. Individual input files are broken into records in a format specific to the application logic, and subsets of these records are then processed by each process running on a machine in the cluster. Using knowledge from the DFS, these processes are scheduled by the Hadoop framework based on the location of the records or data: the files are spread across the DFS as chunks and are processed by the process running on the node that holds them. By reading data from the local disk directly into the CPU, the Hadoop framework prevents unwanted network transfers and strain on the network. Thus Hadoop can achieve high performance through data locality, following the strategy of moving the computation to the data (hadoop.apache.org, 2011).
Figure 8: Data is distributed across nodes at load time.
(Source: http://developer.yahoo.com/hadoop/tutorial/module1.html)
Hadoop has been demonstrated on clusters of up to 4000 nodes. Sort performance on 900 nodes is good: sorting 9TB of data on 900 nodes takes around 1 hour 40 minutes, and this can be improved using the non-default configuration values below (hadoop.apache.org, 2011).
dfs.block.size = 134217728
dfs.namenode.handler.count = 40
mapred.reduce.parallel.copies = 20
mapred.child.java.opts = -Xmx512m
fs.inmemory.size.mb = 200
io.sort.factor = 100
io.sort.mb = 200
io.file.buffer.size = 131072
Sort performance on 1400 and 2000 nodes is also good: sorting 14TB of data on a 1400-node cluster takes 2.2 hours, while sorting 20TB of data on a 2000-node cluster takes 2.5 hours, after updating the configuration files with the settings below (hadoop.apache.org, 2011):
mapred.job.tracker.handler.count = 60
mapred.reduce.parallel.copies = 50
tasktracker.http.threads = 50
mapred.child.java.opts = -Xmx1024m
4.1 MapReduce
Programs in Hadoop must be written in a particular programming model, MapReduce. Running MapReduce programs requires a successfully configured Hadoop environment. MapReduce programs process large volumes of data in parallel, dividing the workload across a cluster of machines. In MapReduce the data elements cannot be updated in place, i.e. if a mapping task tries to change its input (key, value) pairs, the change is not reflected in the input files. If such a change is needed, we instead generate new output (key, value) pairs and forward them to the next phase of execution by the Hadoop system (hadoop.apache.org/mapreduce, 2011).
In MapReduce, records are processed by tasks called Mappers. The output that is generated from the
Mappers is brought together into a second set of tasks called Reducers; here the results from different Mappers
are merged together.
Figure 9: Map Reduce programming Model
(Source: http://map-reduce.wikispaces.asu.edu/)
Analogy to SQL:
• Map is like the GROUP BY clause of an aggregate query.
• Reduce is like an aggregate function computed over all rows with the same GROUP BY attribute.
The concept also works like a UNIX pipeline:
• cat input | grep | sort | uniq -c | cat > output
• Input | Map | Shuffle and Sort | Reduce | Output
“MapReduce works by breaking the processing into two phases: the Map phase and the
Reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the
programmer. The programmer also specifies two functions: the map function and the reduce function. The input
to our map phase is the raw data. The map function is simple. We pull out the fields we are interested in. So, the
map function is just a data preparation phase, setting up the data in such a way that the reducer function can do
its work on it. The map function is also a good place to drop bad records: here we filter out data that are
missing, suspect, or erroneous”- (Tom White, 2009).
MapReduce is a good fit for many applications, such as:
• Log processing
• Web index building
• Image processing, etc. (Miha, 2011)
As discussed above, the first phase of a MapReduce program is called mapping. A list of data elements is provided, one element at a time, to the Mapper function, which transforms each element individually into an output data element.
Figure 10: Different colors represent different keys. All values with the same key are presented to a single reduce task.
(Source: http://developer.yahoo.com/hadoop/tutorial/module4.html)
Figure 11: Mapping creates a new output list by applying a function to individual elements of an input list.
(Source: http://developer.yahoo.com/hadoop/tutorial/module4.html)
An example of map: suppose we have a function toUpper(str) that returns an uppercase version of its input string. We can use this function with map to turn a list of strings into a list of uppercase strings. We are not modifying the input strings here; we return new strings that form part of the new output list.
Reducing combines values together. A reducer function receives an iterator over the input values from the input list and combines these values, returning a single output value. Reducing is used to produce a summary of the data, turning a large volume of data into a smaller summary. For example, "+" can be used as a reducing function to return the sum of a list of input values (Yahoo! Inc, 2011).
Figure 12: Reducing a list iterates over the input values to produce an aggregate value as output.
(Source: http://developer.yahoo.com/hadoop/tutorial/module4.html)
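As a small, self-contained illustration of these two ideas, the following Python fragment (our own sketch, not code from the cited tutorial) applies mapping and reducing to ordinary lists:
# Mapping: apply a function to every element of an input list, producing a
# new output list; the input elements themselves are not modified.
def to_upper(s):
    return s.upper()

words = ["nucleus", "tissue", "slide"]
upper_words = list(map(to_upper, words))      # ['NUCLEUS', 'TISSUE', 'SLIDE']

# Reducing: combine all values of a list into one summary value,
# here using "+" (addition) as the reducing function.
from functools import reduce                  # built in to Python 2, imported in Python 3
counts = [5, 4, 7, 3]
total = reduce(lambda a, b: a + b, counts)    # 19
print(upper_words)
print(total)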
To process large volumes of data, the Hadoop MapReduce framework builds on these map and reduce concepts. The two main components of a MapReduce program are the ones that implement the Mapper and the Reducer. In MapReduce each value is associated with a key, which identifies related values. The map and reduce functions do not receive just values but (key, value) pairs, and their output must also emit both a key and a value. A Mapper can map a single input into one or many outputs, and a Reducer can process an input list and emit a single output or many different outputs. All values having the same key are sent to a single Reducer, and each Reducer works independently (Yahoo! Inc, 2011).
4.2 MapReduce Execution Overview
The overall flow of a MapReduce operation is shown in the illustration below. The following sequence of actions occurs when the user program invokes the MapReduce job.
Figure 13: Map Reduce Execution Overview
(Source: http://hadoop.apache.org/)
Input files:
This is where the data for a MapReduce task is initially stored; the files usually reside in the Hadoop Distributed File System (HDFS). The format of the input files is arbitrary: they could be binary, or multi-line input records. The input files are generally very large, gigabytes or petabytes in size (Yahoo! Inc, 2011).
Mapper:
Given a key and a value, the map function generates new (key, value) pairs according to the user-defined program and forwards the output towards the reducers. Each map task runs an instance of the Mapper as a separate Java process. Mappers cannot intentionally communicate with one another, which improves reliability: each map task is governed entirely by the reliability of its local machine (Yahoo! Inc, 2011).
Partition & Shuffle:
Shuffling is the process of collecting the map outputs and moving them to the reducers. Values having the same key are always reduced together, regardless of which Mapper they originate from (Yahoo! Inc, 2011).
Sort:
Each reduce task is responsible for reducing the values associated with several intermediate keys. Before being presented to the Reducer, the intermediate keys on a node are sorted automatically (Yahoo! Inc, 2011).
Reduce:
Each reduce task creates a Reducer instance. The reduce method is called once for each key; it receives the key and an iterator over the values associated with that key (Yahoo! Inc, 2011).
Output Format:
The (key, value) pairs output by the reduce phase are written to output files. These files are then available on the Hadoop distributed file system and can either be used by another MapReduce job or inspected and analyzed directly (Yahoo! Inc, 2011).
Figure 14: Map Reduce Programming Flow
(Source: (Yahoo! Inc, 2011).)
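To tie these phases together, the following miniature in-memory imitation (our own sketch in Python; it does not use the Hadoop API) shows how map output is sorted, grouped by key, and then handed to the reduce step one key at a time:
from itertools import groupby

# Map phase: emit (key, value) pairs -- here (word, 1) for every word in the input records.
records = ["hello world", "hello hadoop", "world of data"]
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle and sort phase: sort by key so that identical keys become adjacent.
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: each key is seen exactly once, together with an iterator over its values.
for key, pairs in groupby(mapped, key=lambda kv: kv[0]):
    print("%s\t%d" % (key, sum(value for _, value in pairs)))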
4.3 Working of Map and Reduce Programming model with an example
Let us consider that we want to find the maximum temperature for a given year using the MapReduce framework; below is an example of how to go about the solution.
Consider the following line of input data as an example
0067112990999991986051507004...9999999N9+000022+99999999999...
This line is presented to the map function in the form of a key-value pair:
(0, 0067112990999991986051507004...9999999N9+000022+99999999999...)
Here 0 is the key; within the file, the keys are the line offsets. Given the line, the map function extracts the required fields and emits them as output. In this example we consider weather data and we would like to find the maximum temperature in a given year, so our fields of interest are the year and the temperature, and the map function output takes the form below (Tom White, 2009).
(1986, 0)
(1986, 22)
(1986, −11)
(1987, 111)
(1987, 78)
Before the output from the map function is sent to the reduce function, it is processed by the MapReduce framework: the key-value pairs are sorted and grouped by key, with the values collected into lists. So the reduce function in our example sees the following input:
(1987, [111, 78])
(1986, [0, 22, −11])
The reduce function now has to iterate through each list and pick out the maximum temperature. From our example above, the reduce function output would be:
(1987, 111)
(1986, 22)
The entire data flow is illustrated in the figure below
Figure 15: Map Reduce Flow Example
(Source: Hadoop: The Definitive Guide MapReduce for the Cloud - by Tom White)
So we basically need three things: (1) a map function, (2) a reduce function, and (3) some code to run the job.
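To make these three pieces concrete, the two small Python scripts below sketch how the map and reduce functions of this example could be written for use with Hadoop Streaming (described in Section 4.7). They are our own illustration rather than code from the cited book, and the character offsets used to pull the year and the temperature out of each record follow the NCDC weather record layout, so they are assumptions that would need checking against the real input data.
#!/usr/bin/env python
# max_temp_mapper.py -- emits "year<TAB>temperature" for every valid input line read from stdin.
import sys

for line in sys.stdin:
    line = line.strip()
    if len(line) < 93:
        continue                            # skip short or malformed records
    year = line[15:19]                      # assumed position of the year field
    temp = line[87:92]                      # assumed position of the signed temperature field
    if temp != "+9999":                     # +9999 marks a missing reading
        print("%s\t%d" % (year, int(temp)))

#!/usr/bin/env python
# max_temp_reducer.py -- reads "year<TAB>temperature" lines, already sorted by year,
# and emits the maximum temperature seen for each year.
import sys

current_year, max_temp = None, None
for line in sys.stdin:
    year, temp = line.strip().split("\t")
    temp = int(temp)
    if year == current_year:
        max_temp = max(max_temp, temp)
    else:
        if current_year is not None:
            print("%s\t%d" % (current_year, max_temp))
        current_year, max_temp = year, temp
if current_year is not None:
    print("%s\t%d" % (current_year, max_temp))

The "code to run the job" is then just the Hadoop Streaming command line, with these two scripts passed as the mapper and the reducer.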
4.4 The Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System provides high-throughput access to application data and is suitable for applications that need to work with large data sets. It is designed to hold terabytes or petabytes of data and to provide high-throughput access to this data. Data files are stored redundantly across a number of machines for higher availability and durability in the face of failures. In this section we discuss the design of the distributed file system and how to operate it. It is designed to be robust in the following ways (Yahoo! Inc, 2011):
1) HDFS provides data reliability: if one or many machines in the cluster malfunction, the data should still be available.
2) HDFS provides scalable and fast access to data: it should be able to serve a larger number of clients simply by adding more machines to the cluster.
3) HDFS integrates well with Hadoop MapReduce by allowing data to be processed locally where possible.
The design of HDFS is based on the Google File System (GFS).
It is a block-structured file system: individual files are broken into fixed-size blocks, and the blocks are distributed and stored across the cluster machines that have data storage capacity. Each such machine in the cluster is referred to as a DataNode. A file consists of several blocks, which are not necessarily stored on the same node; the target machine that holds each block is chosen randomly on a block-by-block basis. Since several nodes are involved in serving a file, the file would become unavailable if one of those machines crashed. HDFS handles this problem by replicating each block across a number of nodes in the cluster.
Figure 16: Data Nodes holding blocks of multiple files with a replication factor of 2. The NameNode maps the filenames
onto the block IDs.
(Source: Yahoo! Inc, 2011)
Most block-structured file systems use a block size of 4 or 8 KB, but by default HDFS uses a block size of 64 MB. The files stored in HDFS are not part of the ordinary file system: if we type "ls" on a machine running the DataNode daemon, it displays the contents of the Linux file system but does not include any of the files stored in HDFS. For HDFS files we need to type "bin/hadoop dfs -ls"; the reason for this is that HDFS has a separate namespace, isolated from the local file contents. The files inside HDFS are managed and stored by the DataNode service and are named with block IDs, so it is not possible to interact with them using ordinary Linux file commands such as ls, ls -ltr, mv, cp, etc. (Tom White, 2009).
HDFS has its own utilities for managing files, which behave very much like the ordinary Linux file commands, for example bin/hadoop dfs -ls to list files and bin/hadoop dfs -rmr /home/hdusr/patho.txt to remove a file from the HDFS file system. File data is accessed in a write-once, read-many-times model, but the file metadata structures (i.e. the information about file and directory names) can be modified concurrently by a number of clients, so it is important that this metadata does not become desynchronized. The metadata is therefore handled by a single machine, the NameNode, which stores the metadata for the whole file system. As the metadata per file is relatively small, this information can be kept in the NameNode machine's main memory, allowing the metadata to be accessed quickly (hadoop.apache.org/hdfs/, 2011).
When a file needs to be opened, the client contacts the NameNode and receives a list of the locations of the blocks that make up the file; these locations identify the DataNodes holding each block. The client then reads the file data directly from the DataNode servers, in parallel; the NameNode is not involved in this bulk data transfer.
There are multiple redundant systems that help preserve the NameNode's file system metadata if the NameNode fails or crashes irrecoverably. A NameNode failure is more severe for the cluster than the failure of a DataNode: the cluster continues to operate if an individual DataNode crashes, but the loss of the NameNode stops the cluster working and leaves it inaccessible until the NameNode is restored manually.
Figure 17: HDFS Architecture
(http://hadoop.apache.org/core/docs/current/hdfs_design.html)
4.5 Why use Hadoop and MapReduce?
Apart from MapReduce (Ghemawat, 2008) and its related open source implementation Hadoop (hadoop.apache.org, 2011), there are other tools used to automate the parallelization of large-scale data analysis, among them useful extensions (Olston, 2008) and Microsoft's Dryad/SCOPE stack (Chaiken, 2008). MapReduce is one of the most useful tools in the cloud for performing data analysis (lexemetech.com, 2008). Google's MapReduce, Microsoft's Dryad, and Clustera all investigate distributed programming and execution frameworks: MapReduce aims at simplicity, while Dryad provides generality but makes programs more complex to write; Clustera is similar to Dryad but uses a different scheduling mechanism. MapReduce has been used for a variety of applications and has proved successful (Ghemawat, 2004), (Catanzaro, 2008), (Elsayed, 2008).
The following characteristics state the reasons why the MapReduce programming model is a useful tool for performing data analysis:
Fault tolerance. Dealing with fault tolerance is MapReduce's highest priority. Data analysis jobs are divided into smaller tasks and, upon a failure, the tasks of the failed machine are transparently reassigned to another machine.
“In a set of experiments in the original MapReduce paper, it was shown that explicitly killing 200 out of 1746
worker processes involved in a MapReduce job resulted in only a 5% degradation in query performance.”
(Vaquero, 2010).
Ability to run in a heterogeneous environment. MapReduce is also designed to run in a heterogeneous environment. Towards the end of a MapReduce job, tasks that are still in progress are executed redundantly on other machines, and a task is marked as completed as soon as either the primary or the backup execution has completed.
“In a set of experiments in the original MapReduce paper, it was shown that backup task execution improves
query performance by 44% by alleviating the adverse affect caused by slower machines” (Vaquero, 2010).
Efficiency. MapReduce performance is highly dependent on the application it is used for. For the analysis of unstructured data, where brute-force scanning is the right execution strategy, it is likely to be a good fit (Vaquero, 2010).
4.6 Why Use Hadoop framework for the Pathology Application Data?
Currently the entire pathology data set is stored and processed on a single machine. Typically, a single machine has a few gigabytes of memory; if the input data is several terabytes, then a hundred or more machines would be needed to hold it in RAM, and even then no single machine would be able to process such a huge chunk of data.
Hard drives are much larger nowadays, and a single machine can hold multiple terabytes of information on its hard drives. However, the intermediate data sets generated while performing a large-scale computation can easily fill up more space than the original data set occupied. Another problem in a single-machine environment is failure: if the machine crashes, there is no way for the program to recover and the data is lost.
We use Hadoop to handle the above problems, as it is designed to process large amounts of data effectively by connecting many commodity computers together to work in parallel, tying smaller and more reasonably priced computers into a single cost-effective compute cluster. It is thus a time- and cost-effective way to work with large data (Yahoo! Inc, 2011).
4.7 Hadoop Streaming
The Hadoop distribution comes with a utility called Hadoop Streaming. This utility allows us to create and run map and reduce jobs with any script or executable as the mapper and the reducer. To run a job with Hadoop Streaming we can use the following command:
$ bin/hadoop jar contrib/streaming/hadoop-version-streaming.jar
The above command with no arguments will only print some usage information.
An Example is shown below:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
Streaming allows programs written in any language to be used as the Hadoop Mapper and Reducer implementations. Mappers and Reducers receive their input on stdin and write their output, in the form of (key, value) pairs, to stdout. In streaming, input and output are represented textually: the input is of the form key <tab> value <newline>, and streaming splits each line on the tab character to obtain the key and the value. The output of a streaming program is written to stdout in the same format, i.e. key <tab> value <newline>.
The output from the mappers, which forms the input to the reducers, is sorted so that the values for the same key are adjacent to one another. We can write our own scripts in Python, Bash, Perl, or any other language, provided that the needed interpreter is present on each node of the cluster.
Below is an example to run real commands on a single machine or a cluster:
$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -mapper \
MapProgram -reducer ReduceProgram -input /some/dfs/path \
-output /some/other/dfs/path
The above assumes that MapProgram and ReduceProgram are each present on every node in the cluster. If they are not present on the cluster nodes but are present on the node launching the job, then the two programs can be shipped to the remaining nodes in the cluster with the -file option, as shown below:
$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -mapper \
MapProgram -reducer ReduceProgram -file \
MapProgram -file ReduceProgram
There are scenarios where one would like to process the input data using only the map function, with no reduce step. To achieve this we set the "mapred.reduce.tasks" property in mapred-site.xml to zero. The MapReduce framework will then not create any reducer tasks, making the output of the mapper program the final output. Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-jobconf mapred.reduce.tasks=0".
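To make the streaming conventions described in this section concrete, the following pair of Python scripts is a minimal word-count sketch (our own; it assumes a Python interpreter is available on every node) that could be plugged into the streaming command lines shown above as the mapper and the reducer:
#!/usr/bin/env python
# wc_mapper.py -- reads raw text lines from stdin and emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

#!/usr/bin/env python
# wc_reducer.py -- reads "word<TAB>count" lines (sorted by word by the framework)
# and emits the total count for each word.
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print("%s\t%d" % (word, sum(int(count) for _, count in group)))

If the job is run with "-reduce NONE" (or mapred.reduce.tasks set to zero), the output of wc_mapper.py alone becomes the final output.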
Chapter 5
5. Installation of HADOOP
This section describes the steps for setting up a single-node Hadoop cluster; the next section of this chapter then extends the single-node setup to a multi-node Hadoop cluster. After the installation, some example programs are run to check the setup (Yahoo! Inc, 2011).
The implementation was done using the Hadoop version:
Hadoop 0.20.2, released February 2010
1) We assume that Java 1.6.x is installed, which can be checked by running the command java -version. We also need to make sure that SSH is installed and running; if it is not installed, we need to install it.
2) Download a stable HADOOP release:
Download the Hadoop 0.20.2 release from the Apache download mirrors and extract the contents of the package under "/usr/local/hadoop-0.20.2". Unpack the downloaded archive using the command
tar -xvf hadoop-0.20.2.tar.gz
3) Update the Hadoop-related environment variables under $HOME/.bashrc
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop-0.20.2
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
4) The only environment variable we have to configure for Hadoop in conf/hadoop-env.sh is JAVA_HOME. Open conf/hadoop-env.sh in an editor and set the JAVA_HOME environment variable as below.
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
5) We now configure the files in the conf/ directory, which define where Hadoop stores its data files, the network ports it listens on, etc. The setup uses Hadoop's Distributed File System, HDFS, even though we are currently working with a single-node setup. The hadoop.tmp.dir variable is set to the directory "/app/hadoop/tmp"; we create this directory and set its ownership and permissions using "chmod 777 /app/hadoop/tmp" (hadoop.apache.org, 2011). We then add the following lines between the <configuration> ... </configuration> tags of the *-site.xml configuration files, as shown below.
In conf/core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>
In conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
In conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
</configuration>
6) The first step in starting up the Hadoop installation is to format the Hadoop Distributed File System (HDFS), which is implemented on top of the local file system of our "cluster" (which currently consists of a single local machine). This needs to be done the first time a Hadoop cluster is set up.
debian02:/usr/local/hadoop-0.20.2/bin# /usr/local/hadoop-0.20.2/bin/hadoop namenode -format
11/07/01 16:18:29 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = debian02/10.0.10.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707;
compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
11/07/01 16:18:29 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully
formatted.
11/07/01 16:18:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at debian02/10.0.10.2
7) We now start our single-node cluster on debian02 by running /usr/local/hadoop-0.20.2/bin/start-all.sh.
This starts the NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker processes on our single-node Hadoop cluster. We can check that the expected Hadoop processes are running by executing the command "jps" at the terminal, as shown below.
debian02:/usr/local/hadoop-0.20.2/conf# jps
7103 NameNode
7483 Jps
7198 DataNode
7289 SecondaryNameNode
7363 JobTracker
7447 TaskTracker
We can also check with “netstat” if Hadoop is listening on the configured ports.
debian02:/usr/local/hadoop-0.20.2/conf# netstat -plten | grep java
8) We now run an example Hadoop MapReduce job. We use the existing WordCount example job, which reads text files and counts how often each distinct word occurs. The output is a text file in which each line contains a word and the count of how often it occurred, separated by a tab. We use six e-books for this example, downloading each e-book as a text file in plain text UTF-8 encoding and storing the files in a temporary directory of our choice.
9) Once the files are on our local file system, we copy them from the local file system to HDFS (the Hadoop Distributed File System) before running the MapReduce job.
debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -copyFromLocal /tmp/ebooks /user/hduser/ebooks
We can check if the files are uploaded by executing the below command:
debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -ls /user/hduser/ebooks
Found 6 items
-rw-r--r-- 1 root supergroup 336710 2011-08-11 23:33 /user/hduser/ebooks/pg132.txt
-rw-r--r-- 1 root supergroup 581878 2011-08-11 23:33 /user/hduser/ebooks/pg1661.txt
-rw-r--r-- 1 root supergroup 1916262 2011-08-11 23:33 /user/hduser/ebooks/pg19699.txt
-rw-r--r-- 1 root supergroup 674566 2011-08-11 23:33 /user/hduser/ebooks/pg20417.txt
-rw-r--r-- 1 root supergroup 1540059 2011-08-11 23:33 /user/hduser/ebooks/pg4300.txt
-rw-r--r-- 1 root supergroup  384408 2011-08-11 23:33 /user/hduser/ebooks/pg972.txt
10) We now run the MapReduce job to test if our Hadoop Setup is correct. Below is the command which is used
to run the Word Count job.
debian02:/usr/local/hadoop-0.20.2# bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/ebooks
/user/hduser/ebooks-output
11/08/12 02:08:57 INFO input.FileInputFormat: Total input paths to process : 6
11/08/12 02:08:58 INFO mapred.JobClient: Running job: job_201108120205_0001
11/08/12 02:08:59 INFO mapred.JobClient: map 0% reduce 0%
11/08/12 02:09:15 INFO mapred.JobClient: map 33% reduce 0%
11/08/12 02:09:31 INFO mapred.JobClient: map 66% reduce 11%
11/08/12 02:09:40 INFO mapred.JobClient: map 100% reduce 22%
11/08/12 02:09:46 INFO mapred.JobClient: map 100% reduce 33%
11/08/12 02:09:52 INFO mapred.JobClient: map 100% reduce 100%
11/08/12 02:09:54 INFO mapred.JobClient: Job complete: job_201108120205_0001
The command reads the files in the HDFS directory /user/hduser/ebooks, processes them, and stores the result in the HDFS directory /user/hduser/ebooks-output. We can check whether the result has been stored successfully by looking under the directory /user/hduser/ebooks-output.
11) We can then inspect the result in HDFS, or copy it from HDFS to the local file system, using the command below.
debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -cat /user/hduser/ebooks-output/part-00000
12) Once the jobs are run we stop our single-node cluster by running the command “./stop-all.sh”
5.1 From two single-node clusters to a multi-node cluster
We build a multi-node cluster using two Linux VM boxes, debian02 as the master and debian05 as the slave. The best way to do this is to install and configure Hadoop 0.20.2 on each node and test the "local" Hadoop setup on each of the two Linux machines, and in a second step to combine these two single-node clusters into one multi-node cluster, in which one Linux machine becomes the master (but also acts as a slave with regard to data storage and processing) and the other box becomes a slave only (hadoop.apache.org, 2011).
Figure 18: Set up of a Multi Node cluster
(Source: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/)
We now configure one Linux machine as the master node (debian02) and the other Linux machine as a slave node (debian05). The master node (debian02) also acts as a slave, because we currently have only two machines in our cluster but still want to spread the data storage and processing across both.
Figure 19: multi-node cluster setup
(Source: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/)
The master machine (debian02) will run the “master” daemons: NameNode for the Hadoop distributed file
system storage layer and the Job Tracker for the MapReduce processing layer. Both the nodes will run the
“slave” daemons: DataNode for the Hadoop distributed File system layer and Task Tracker for MapReduce
processing layer. The “master” daemons are the ones responsible for coordination and management of the
“slave” daemons while the “slave” daemons will do the actual data storage and data processing work.
The conf/masters file defines on which machines Hadoop will start secondary NameNodes in our multi-node cluster setup. In our setup this is just the master node, i.e. debian02. The primary NameNode and the JobTracker will always run on the master machine. We can start a Hadoop daemon manually on a machine via bin/hadoop-daemon.sh start [namenode | secondarynamenode | datanode | jobtracker | tasktracker], which does not take the conf/masters and conf/slaves files into account. In our case the machines used are debian02 (10.0.10.2) as master and debian05 (10.0.10.5) as slave, and we run bin/start-dfs.sh and then bin/start-mapred.sh on the master machine.
We update the conf/masters and conf/slaves files on the master machine as shown below:
debian02:/usr/local/hadoop-0.20.2/conf# cat master
10.0.10.2
debian02:/usr/local/hadoop-0.20.2/conf# cat slaves
10.0.10.2
10.0.10.5
We then have to change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines as follows. First, we change the fs.default.name variable (in conf/core-site.xml), which specifies the NameNode host (10.0.10.2) and port (54310); in our case this is the master machine, debian02. Second, we change the mapred.job.tracker variable (in conf/mapred-site.xml), which specifies the JobTracker host and port; again, this is the master node debian02 in our case. Third, we change the dfs.replication variable (in conf/hdfs-site.xml), which specifies the default block replication, i.e. on how many machines a single file should be replicated. The default value of dfs.replication is 3; as we are currently using only two nodes, we set dfs.replication to 2. Once the *-site.xml files are modified on the debian02 machine, we make the same changes in the *-site.xml files on the slave machine (hadoop.apache.org, 2011).
As in our single-node cluster setup, we need to format the Hadoop distributed file system via the NameNode of our multi-node setup, and we do this on the master machine, debian02. This needs to be done when a Hadoop cluster is set up for the first time. We must never format a running Hadoop NameNode, as this would erase all the data in HDFS and corrupt the system.
debian02:/usr/local/hadoop-0.20.2/bin# /usr/local/hadoop-0.20.2/bin/hadoop namenode –format
We start the cluster in two steps. First, the HDFS daemons are started: the NameNode daemon on the master machine (debian02) and DataNode daemons on all slaves (here debian02 and debian05). Second, the MapReduce daemons are started: the JobTracker on the master node, and TaskTracker daemons on all slaves (here debian02 and debian05).
The following Java processes should run on the master machine at this time:
debian02:/usr/local/hadoop-0.20.2/conf# jps
6401 DataNode
6231 SecondaryNameNode
6928 Jps
6688 JobTracker
6778 TaskTracker
6042 NameNode
and the following on slave.
debian05:/app/hadoop/tmp/dfs/data/current# jps
4629 Jps
4419 DataNode
4546 TaskTracker
Running an example MapReduce job:
To test whether the Hadoop cluster setup is correct, we run the WordCount MapReduce job again. After downloading the e-texts, we copy them to HDFS, run the WordCount MapReduce job from the master machine, and retrieve the job result from HDFS.
debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -copyFromLocal /tmp/ebooks /user/hduser/ebooks
debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -ls ebooks
Found 6 items
debian02:/usr/local/hadoop-0.20.2# bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/ebooks
/user/hduser/ebooks-output1-deb02
11/08/12 00:18:14 INFO input.FileInputFormat: Total input paths to process : 6
11/08/12 00:18:15 INFO mapred.JobClient: Running job: job_201108120006_0001
11/08/12 00:18:16 INFO mapred.JobClient: map 0% reduce 0%
11/08/12 00:18:50 INFO mapred.JobClient: map 100% reduce 22%
11/08/12 00:18:59 INFO mapred.JobClient: map 100% reduce 100%
11/08/12 00:19:01 INFO mapred.JobClient: Job complete: job_201108120006_0001
Running the MapReduce job successfully shows that there are no issues with the installation and configuration setup, confirming that the installation was a success. We now have a running Hadoop cluster with two nodes.
To stop the Hadoop cluster, we first stop the MapReduce daemons and then the HDFS daemons on the master machine, as shown below:
debian02:/usr/local/hadoop-0.20.2/bin# ./stop-mapred.sh
debian02:/usr/local/hadoop-0.20.2/bin# ./stop-dfs.sh
Chapter 6
6. Implementation of MAP REDUCE programming model
6.1 General Description of the current data processing used by the Pathology
Application
The client executes the pathology application with certain parameters to obtain the number of nuclei in a given area of a pathology image, and gets an output that consists of the set of parameters passed and the number of nuclei. In this section, we implement the MapReduce programming model to process this data containing the parameters and the number of nuclei, and to examine the benefits of using MapReduce and the HDFS when there are many such similar data records that have to be combined to return a single output value. We process the data in two phases, the Map phase and the Reduce phase, bringing out the characteristics and features of the MapReduce programming model.
6.2 System Design
The pathology data must first be structured, as described in chapter 4, so that it can be processed by the MapReduce programming model. The output must be structured in a (key, value) format. So, we first design a pre-processing script, written in the Python programming language, to convert the pathology data from the format shown in Dataformat 6.1 to the format shown in Dataformat 6.2 and write it to a single file. Once we have the (key, value) format that can be processed by the Mapper/Reducer programs, we load the processed data file into the Hadoop Distributed File System. Once the file is in the HDFS, we run the Mapper and Reducer programs using the Hadoop streaming method discussed in chapter 4, to obtain a reduced output containing the parameters and the number of nuclei.
This is discussed in chapter 3 where the pathology program “deconvolution-model” is executed to obtain
the pathology data containing the Number of Nuclei and its associated parameters given a sub area of an image.
Native Image size : 66000 x 45402
Native zoom: 20
Loading Deconvolution vectors: estimated_cdv.cdv
[0.471041,0.735899,0.486387]
[0.349641,0.78081,0.517771]
[0.7548,0.001,0.655955]
Channel 1
No.nuclei: 4
Dataformat 6.1: Pathology Data
Native Image size : 66000 x 45402 Native zoom: 20 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.735899,0.486387] [0.349641,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 4
Dataformat 6.2: Pathology Data in Key/Value Format
6.3 Process Flow
The pathology data is placed under a folder /patho under the Hadoop directory. A shell script is then written which performs two tasks. First, it takes the pathology data in each individual file under the /patho directory and preprocesses it by running the preprocessing program “pre-processing.py”, creating a single file with the preprocessed data named “nuclieprocessed.txt”, which is placed under the same directory /patho. Once the file is created, the second task of the shell script is to copy the processed pathology data to the Hadoop Distributed File System (HDFS) under the /user/hduser/patho directory. Once the pathology data is uploaded to the HDFS, it is ready to be processed by the MapReduce programming model to produce a single output file with the required data. These steps are described in detail below.
6.3.1 The pre-processing Step
The preprocessing program used to format the pathology data is presented below
Code for Pre-Processing:
import glob

# Files to process
files = glob.glob('/export/mailgrp4_f/sc10bdg/pathodir/*patho*')

# Processed output file
nuclieprocessed = file('nuclieprocessed1.txt', 'w')

for f in files:
    infile = open(f)
    # Array holding the lines of one pathology record
    arr = []
    # Variable containing each line's information
    line = infile.readline()
    while line:
        # Read each line in the file
        line = line.strip()
        # Load the data from the file into the array
        arr.append(line)
        line = infile.readline()
    infile.close()
    # Write one line per record to the output file: the first seven fields form
    # the key and the "No.nuclei" field is the value. The key/value separator was
    # garbled in the source listing; a tab is assumed here, matching the split
    # used in reducer.py below.
    nuclieprocessed.write(arr[0]+' '+arr[1]+' '+arr[2]+' '+arr[3]+' '+arr[4]+' '
                          +arr[5]+' '+arr[6]+'\t'+arr[7]+'\n')

nuclieprocessed.close()
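The key/value separator written between the parameter fields and the “No.nuclei” field was garbled in the source listing; a tab is assumed here, since a tab is the default key/value separator understood by Hadoop Streaming. As an illustration only (not part of the project code), the short sketch below builds the processed line for the sample record of Dataformat 6.1 and shows the single tab-delimited line it is reduced to:

# Illustrative sketch only: build the processed line for the sample record in
# Dataformat 6.1. The first seven lines of the record are space-joined to form
# the key; an (assumed) tab separates it from the "No.nuclei" line, which is
# the value later seen by the Reducer.
key_fields = ['Native Image size : 66000 x 45402',
              'Native zoom: 20',
              'Loading Deconvolution vectors: estimated_cdv.cdv',
              '[0.471041,0.735899,0.486387]',
              '[0.349641,0.78081,0.517771]',
              '[0.7548,0.001,0.655955]',
              'Channel 1']
value_field = 'No.nuclei: 4'
record = ' '.join(key_fields) + '\t' + value_field
print record    # one tab-delimited line per pathology record (Python 2 syntax)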
6.3.2 Loading pre-processed data into HDFS
The “nuclieprocessed.txt” file created under the specified output directory is then uploaded to the HDFS using a shell script:
UploadHDFS.sh
#!/bin/bash
HADOOP_DIR=/usr/local/hadoop-0.20.2
# 1. Convert the Pathology data files into a single processed file by calling the pre-processing program
# Processed file is placed under the HADOOP_DIR
${HADOOP_DIR}/pre-process.py
# 2. Store the processed file on the HDFS
${HADOOP_DIR}/bin/hadoop dfs -copyFromLocal /usr/local/hadoop-0.20.2/Patho/nuclieprocessed.txt /user/hduser/patho/
Once the file is uploaded and available on the HDFS we can use the MapReduce Programming model to process
the data.
6.3.3 Process Data using the Map Reduce Programming Model
Input: Pathology data will be in the form of (key, value) pairs after pre-processing where key is the parameter
and the value is the Number of Nuclei.
Output: If there are multiple occurrences of the same data in the input file, the values are combined together
based on the Key, returning a single output value containing the parameter and the No. of Nuclei.
The logic behind the MapReduce program written in Python is that it uses the Hadoop streaming concept to pass data between the Mapper and the Reducer programs via standard input (STDIN) and standard output (STDOUT). It uses Python's sys.stdin to read the input data and prints the output data to sys.stdout.
Mapper program description
The Map code is saved as mapper.py under the HADOOP_HOME directory. The Mapper program reads data from standard input (STDIN) and writes a list of lines to standard output (STDOUT). The Mapper script does not generate the unique occurrences of each parameter and its associated value (i.e. the number of nuclei); instead it leaves it to the Reduce step to obtain the final unique occurrences. We must set the execute permission of the mapper.py file (chmod +x mapper.py), otherwise we could run into permission problems.
Code for Map:
#!/usr/bin/env python
import sys

# Input comes from STDIN (standard input); each line is stripped and re-emitted
# unchanged, leaving all de-duplication to the reduce phase.
for line in sys.stdin:
    line = line.strip()
    print '%s' % (line)
Reducer program description: The code below is saved with the filename “reducer.py” under the HADOOP_HOME directory. The program reads the output of the mapper.py program and obtains the unique occurrence of each parameter and its associated nuclei value. The result is then sent to STDOUT (standard output). The reducer.py file needs to have the execute permission set (chmod +x reducer.py), otherwise it will throw errors while running the program.
Code for Reduce:
#!/usr/bin/env python
import sys

# Parameters
current_Para = None
current_value = None

# Input comes from STDIN: the mapper output, sorted by key by Hadoop streaming
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py into Parameter details and Value
    # (no. of nuclei); the tab key/value separator is assumed, as the separator
    # character was garbled in the source listing
    Parameter, value = line.split('\t', 1)
    # The logic for reduction: emit each parameter only once
    if current_Para == Parameter:
        current_value = value
    else:
        if current_Para:
            # write result to STDOUT
            print '%s\t%s' % (current_Para, current_value)
        current_value = value
        current_Para = Parameter

# write the last parameter to STDOUT
if current_Para == Parameter:
    print '%s\t%s' % (current_Para, current_value)
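Before submitting the streaming job to the cluster, the Mapper and Reducer logic can be checked locally by emulating the map, sort and reduce pipeline that Hadoop Streaming provides. The sketch below is illustrative only (it is not part of the project code) and assumes the tab-separated record format described above, with abbreviated sample records:

# Illustrative local check of the streaming contract: identity map, sort, then
# the same de-duplication logic as reducer.py, without a running Hadoop cluster.
sample = [
    'Native Image size : 55000 x 11234 Native zoom: 5 Channel 1\tNo.nuclei: 5',
    'Native Image size : 55000 x 11234 Native zoom: 5 Channel 1\tNo.nuclei: 5',
    'Native Image size : 66000 x 45402 Native zoom: 20 Channel 1\tNo.nuclei: 4',
]

# Map phase: mapper.py simply strips and re-emits each input line
mapped = [line.strip() for line in sample]

# Shuffle/sort phase: Hadoop Streaming sorts the mapper output by key
mapped.sort()

# Reduce phase: emit each parameter key once, mirroring reducer.py
current_para, current_value = None, None
for line in mapped:
    para, value = line.split('\t', 1)
    if para != current_para:
        if current_para is not None:
            print '%s\t%s' % (current_para, current_value)
        current_para, current_value = para, value
print '%s\t%s' % (current_para, current_value)

Running this locally prints the two distinct sample records once each, which is the same de-duplication behaviour as the cluster job shown later in this section.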
6.3.4 Running the Python Code in Hadoop
To run the MapReduce program we need to check that the data to be processed, “nuclieprocessed.txt”, is on the Hadoop Distributed File System:
debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -ls /user/hduser/pathology1
-rw-r--r--   1 root supergroup   1251 2011-08-04 04:32 /user/hduser/pathology1/nuclieprocessed.txt
The input file contained the following data:
Native Image size : 66000 x 45402 Native zoom: 20 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.735899,0.486387] [0.349641,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 4
Native Image size : 66000 x 45402 Native zoom: 20 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.735899,0.486387] [0.349641,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 4
Native Image size : 55000 x 11234 Native zoom: 5 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 5
Native Image size : 55000 x 11234 Native zoom: 20 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 6
Native Image size : 55000 x 11234 Native zoom: 5 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.51777] [0.548,0.001,0.633355] Channel 1 No.nuclei: 6
Native Image size : 55000 x 11234 Native zoom: 5 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 5
Once the data is available on the Hadoop Distributed File System, we run the Python MapReduce job on the cluster, which currently has two machines. We use Hadoop Streaming, as discussed above, to pass data between the Map code and the Reduce code via standard input (STDIN) and standard output (STDOUT). We run the MapReduce program using the command below:
debian02:/usr/local/hadoop-0.20.2# bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file /usr/local/hadoop-0.20.2/mappy.py -mapper /usr/local/hadoop-0.20.2/mappy.py -file /usr/local/hadoop-0.20.2/reducer.py -reducer /usr/local/hadoop-0.20.2/reducer.py -input /user/hduser/patho/* -output /user/hduser/patho-output
Executing the above command gives the output below:
packageJobJar: [/usr/local/hadoop-0.20.2/mapper.py, /usr/local/hadoop-0.20.2/reducer.py,
/hadoop/tmp/dir/hadoop-unjar8894788337428936557/] [] /tmp/streamjob7431796847369503584.jar
tmpDir=null
11/08/07 21:34:55 INFO streaming.StreamJob: Running job: job_201108072123_0001
11/08/07 21:34:55 INFO streaming.StreamJob: Tracking URL:
http://debian02:50030/jobdetails.jsp?jobid=job_201108072123_0001
11/08/07 21:34:56 INFO streaming.StreamJob: map 0% reduce 0%
11/08/07 21:35:15 INFO streaming.StreamJob: map 100% reduce 100%
11/08/07 21:35:18 INFO streaming.StreamJob: Job complete: job_201108072123_0001
We then check whether the output is stored successfully in /user/hduser/patho-output and can inspect the contents of the file using the -cat command. It should contain the reduced output.
debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -cat /user/hduser/patho-output/part-00000
Native Image size : 55000 x 11234 Native zoom: 20 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 6
Native Image size : 55000 x 11234 Native zoom: 5 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 5
Native Image size : 55000 x 11234 Native zoom: 5 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.51777] [0.548,0.001,0.633355] Channel 1 No.nuclei: 6
Native Image size : 66000 x 45402 Native zoom: 20 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.735899,0.486387] [0.349641,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 4
Chapter 7
7. Evaluation and Experimentation
Once we have successfully stored and processed the pathology data using the Hadoop framework and its MapReduce programming model, we evaluate the performance of the implemented Hadoop framework and observe the benefits of the proposed solution. In this section, we measure the performance using quantitative metrics such as scalability and the time taken to process the pathology data. The first experiment observes the change in the processing time of the pathology data as the number of machines or nodes in the Hadoop cluster changes, using a fixed data size. The second experiment observes how the processing time of the pathology data changes with different data sizes on a Hadoop single-node setup and on a Hadoop multi-node cluster setup.
7.1 Response time to process the Pathology Data on a single node / Cluster
In this experiment, I have run the MapReduce program on a pathology data set of size 24KB on a single-node Hadoop setup, and have also run the MapReduce program on pathology data of the same size on a Hadoop cluster with two machines and on a Hadoop cluster with three machines. The results obtained by running the experiment are shown in figure 20, figure 21 and figure 22, and the difference in time between running the experiment on a single node and on a cluster with two and three nodes is shown in figure 23. We first run the MapReduce program to process the data of size 24KB and observe that the first run takes 47 seconds, while the second run reduces the processing time by approximately 10 seconds; the run time is almost the same for consecutive runs. The reason for the drop in processing time is that the data is cached, i.e. the data is stored transparently so that future requests can be served faster. The data stored in the cache is normally values that have been computed before or duplicate data.
Figure 20 shows the difference in processing time between the first, second and consecutive runs.
[Chart: Data Processing on a cluster with a single Node; x-axis: Number of runs (1-5); y-axis: Time taken in seconds.]
Figure 20: Running the program for the first time takes approximately 47 seconds, and 37 seconds for the second and consecutive runs.
We now run the MapReduce program to process the 24KB data on a Hadoop cluster with two nodes and observe that the first run takes approximately 32 seconds, while the second run reduces the processing time to 27 seconds; the run time is almost the same for consecutive runs.
[Chart: Data processing using a cluster with two Nodes; x-axis: Number of runs (1-5); y-axis: Time taken in seconds.]
Figure 21: Running the program for the first time on a Hadoop cluster consisting of two nodes takes 32 seconds, and the second and consecutive runs take 27 seconds.
We now run the MapReduce program to process the 24KB data on a Hadoop cluster with three nodes and observe that the first run takes approximately 30 seconds, while the second run reduces the processing time to approximately 22 seconds.
[Chart: Data processing using a cluster with three Nodes; x-axis: Number of runs (1-5); y-axis: Time taken in seconds.]
Figure 22: Running the program for the first time on a Hadoop cluster consisting of three nodes takes 30 seconds, and the second and consecutive runs take 22 seconds.
We can now see that processing the data on a Hadoop cluster with a single node takes approximately 37 seconds, while processing on a Hadoop cluster with two nodes takes 32 seconds and with three nodes takes 30 seconds, i.e. there is a reduction in time when the number of nodes is increased. Thus, we can observe that when more nodes are added to the cluster the processing time of the data decreases, as the computation power of the cluster increases. The data in a cluster is distributed among the slave nodes for processing; therefore, instead of one machine processing the data there are three of them working on it, reducing the overall processing time. The difference between running the pathology data on a Hadoop cluster with a single node, two nodes and three nodes is shown below:
[Chart: Data processing on a Hadoop cluster; series: Using 1 Node, Using 2 Nodes, Using 3 Nodes; x-axis: Number of Runs (1-5); y-axis: Time taken to process the data in seconds.]
Figure 23: Difference of running the Pathology data on a Hadoop cluster with a single node and a Hadoop cluster with two and three nodes.
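To make the scalability effect concrete, the speedup and parallel efficiency implied by the steady-state run times reported above (approximately 37, 32 and 30 seconds on one, two and three nodes) can be worked out directly. The sketch below is illustrative arithmetic only and is not part of the project code:

# Illustrative arithmetic: speedup and parallel efficiency for the 24KB job,
# using the steady-state run times reported in section 7.1.
times = {1: 37.0, 2: 32.0, 3: 30.0}   # nodes -> seconds (approximate)
for nodes in sorted(times):
    speedup = times[1] / times[nodes]
    efficiency = speedup / nodes
    print 'nodes=%d  time=%.0fs  speedup=%.2fx  efficiency=%.0f%%' % (nodes, times[nodes], speedup, efficiency * 100)

For this small (24KB) input the gains are modest because job start-up overhead dominates the run time; as the next experiment shows, larger inputs benefit more clearly from the additional nodes.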
7.2 Response time to run the pathology Application based on Data Size
In this experiment we have run the MapReduce program on pathology data of size 24KB, 131KB, 3074KB and 39MB on a single-node Hadoop setup, and have also run the MapReduce program on the same set of data sizes on a Hadoop cluster with two machines and a Hadoop cluster with three machines. The results obtained by running the experiment are shown in figure 24, figure 25 and figure 26, and the difference in response time between running the experiment on a single node and on a cluster with two and three nodes, with respect to the data size, is shown in figure 27. We first run the MapReduce program to process the data of size 24KB, 131KB, 3074KB and 39MB and observe that, on a Hadoop cluster with a single node, the time taken to process 24KB of data is 37 seconds, 131KB of data is 37.92 seconds, 3074KB of data is 39.49 seconds and 39MB of data is 40.70 seconds. Figure 24 below shows the difference in the time taken to process the data of different sizes on a Hadoop cluster with a single node. We can observe that there is very little time difference between processing the data of size 24KB and the data of size 39MB, bringing out the computation power of Hadoop to handle large data.
[Chart: Data Processing Time Taken on a single Machine; x-axis: Data size in bytes: 25020 (24K), 133857 (131KB), 3147516 (3074KB), 40917708 (39MB), 273833892; y-axis: Time taken in seconds.]
Figure 24: The difference of the time taken to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on a Hadoop cluster with a single node.
We now run the MapReduce program to process the data of size 24KB, 131KB, 3074KB and 39MB on a Hadoop cluster with two nodes. We observe that the time taken to process 24KB of data is 32 seconds, 131KB of data is 24.32 seconds, 3074KB of data is 25.28 seconds and 39MB of data is 32.18 seconds. Figure 25 below shows the difference in the time taken to process the data of different sizes on a Hadoop cluster with two nodes.
[Chart: Data Processing Time Taken on a cluster with 2 Nodes; x-axis: Data size in bytes: 25020 (24K), 133857 (131KB), 3147516 (3074KB), 40917708 (39MB), 273833892; y-axis: Time taken in seconds.]
Figure 25: The difference of the time taken to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on a Hadoop cluster with two nodes.
We now run the MapReduce program to process the data of size 24KB, 131KB, 3074KB and 39MB on a Hadoop cluster with three nodes. We observe that the time taken to process 24KB of data is 30 seconds, 131KB of data is 22 seconds, 3074KB of data is 24 seconds and 39MB of data is 30 seconds. Figure 26 below shows the difference in the time taken to process the data of different sizes on a Hadoop cluster with three nodes. We can observe that as the data size increases the computation time to process the data also increases, which is expected.
[Chart: Data processing Time taken on a cluster with 3 Nodes; x-axis: Data size in bytes: 25020 (24K), 133857 (131KB), 3147516 (3074KB), 40917708 (39MB), 273833892; y-axis: Time taken in seconds.]
Figure 26: The difference of the time taken to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on a Hadoop cluster with three nodes.
We can now see that processing the data on a Hadoop cluster with a single node takes approximately 37, 37.92, 39.49 and 40.70 seconds for data sets of size 24KB, 131KB, 3074KB and 39MB, while processing the same data sets on a Hadoop cluster with two nodes takes 21, 24.32, 25.28 and 32.18 seconds, and on a Hadoop cluster with three nodes takes 30, 22, 24 and 30 seconds. In other words, even as the data size increases it is handled better by a Hadoop cluster with two or three nodes than by a Hadoop cluster with a single node, and again there is a reduction in time when the number of nodes is increased for a given data size. Thus, when the volume of the data is larger it is processed efficiently by Hadoop. The differences in running the pathology data on a Hadoop cluster with a single node and a Hadoop cluster with two and three nodes over the set of different data sizes are shown below:
[Chart: Data processing time taken on a Hadoop cluster based on data size; series: Time Taken on a single Machine, Time Taken on a cluster with 2 Nodes, Time taken on a cluster with 3 Nodes; x-axis: Data size in bytes: 25020 (24K), 133857 (131KB), 3147516 (3074KB), 40917708 (39MB), 273833892; y-axis: Time Taken in seconds.]
Figure 27: The Response time taken to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on a single and a multi-node cluster setup.
The above graph clearly shows that the time taken to process the data of size 25020 bytes is almost the same for a Hadoop cluster with a single node, two nodes and three nodes. On the other hand, there is a notable difference in the processing time of the data of size 273,833,892 bytes between a single node, two nodes and three nodes, supporting the point that Hadoop works better with large volumes of data than with small amounts of data.
[Chart: Processing time taken for 1GB of data; series: Run 1, Run 2; categories: using a single Node cluster, using a cluster with two nodes, using a cluster with three Nodes; y-axis: Time taken in seconds (0-800).]
Figure 28: Processing time taken by a Hadoop cluster with one, two and three Nodes for 1GB pathology data.
In Figure 28, it is clear that the time taken to process the data of size 1,643,003,352 bytes using a Hadoop cluster with a single node, two nodes and three nodes decreases as the number of nodes increases. There is also a notable difference in the processing time of this data between a single node, two nodes and three nodes, again indicating that Hadoop works best with large volumes of data rather than data of smaller size. We can also see that processing the data of size 39MB takes approximately the same time as processing the data of size 24KB, bringing out the computation power of Hadoop.
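A rough back-of-the-envelope comparison of the single-node times reported above makes this point explicit: the input grows by more than three orders of magnitude while the job time grows only marginally, which suggests that fixed per-job overhead, rather than the data volume, dominates at these sizes (consistent with the overhead observation of Jaliya (2008) discussed in the next section). The sketch below is illustrative arithmetic only and is not part of the project code:

# Illustrative arithmetic: single-node processing time versus input size, using
# the sizes (in bytes) and times (in seconds) reported in section 7.2.
single_node = [(25020, 37.0), (133857, 37.92), (3147516, 39.49), (40917708, 40.70)]
base_size, base_time = single_node[0]
for size, secs in single_node:
    print 'data size x%-8.1f  time x%.2f' % (float(size) / base_size, secs / base_time)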
7.3 Results and comparison
The results collected from running the experiments discussed above are contrasted with existing work.
• From the experiments performed above we can observe that processing 1GB of data on a Hadoop cluster with a single machine takes approximately 10 minutes and 6 seconds, processing the same data on a Hadoop cluster with two machines takes 7 minutes and 29 seconds, and with three nodes there is a further reduction in time to 6 minutes and 49 seconds. Thus, we can see that there is a reduction in time when the number of machines in the cluster is increased. Therefore, we can conclude that when more machines are added to the cluster the processing time of the data decreases, i.e. the computation power of the cluster increases. I have contrasted my work with one of the papers written by Tomi Aarnio (2009) that observes the same and was discussed in the sub-section "Related work" under chapter 1. His work also stated that when using MapReduce in a cluster the computation power is improved by adding new nodes to the network; thus, with more nodes there are more parallel tasks (Tomi, 2009).
• In experiment two, we can also observe that the Hadoop cluster processes data with a size of more than 39MB, i.e. 273MB and 1GB as shown in Figures 27 and 28, more efficiently, and there is a notable difference in the time taken to process the data on a single node, two nodes and three nodes. This observation is contrasted with a similar work on "Data Intensive Scientific Analyses" discussed in the related work under chapter 1, where the results showed that applications could benefit from using the MapReduce technique if the data size used is appropriate, and that the overhead introduced by a particular runtime diminishes as the amount of data and the computation increases (Jaliya, 2008).
Therefore, the results obtained in this project agree with those found in the literature, especially regarding the two main characteristics of Hadoop: that it works best with large data, and that the computation power of the Hadoop cluster increases with the number of machines.
7.4 Evaluation of the Software
The development of the software in this project was done using the Agile technique. Code quality was one of the important requirements and was considered while implementing the proposed solution. The code follows proper naming standards and uses meaningful variable names to make it more readable, with appropriate comments for a better understanding of how the code works. Proper documentation has been created for running the implemented solution. The current MapReduce programming model implementation would need modifications in order to work with other similar applications.
7.5 Further Work
The following are the key areas for improvement:
• The effect on the performance of the software of implementing the above MapReduce programming model in the Java programming language, rather than using the Python programming language and the Hadoop streaming utility.
• We could also demonstrate how Hadoop's MapReduce framework could be extended to work with image data for image processing. This would include working with the pathology images to obtain the pathology data used in this project: loading the pathology images as a sequence file containing the pixel values of the image, storing the sequence file with the pixel data on the HDFS, obtaining the image file used by the application from the HDFS instead of the local file system, and evaluating the performance and scalability of such a system.
• Using the HDFS-FUSE functionality to work with the pathology image files, where the image files are accessed by the pathology application from the HDFS instead of the local file system. HDFS-FUSE lets you mount the Hadoop Distributed File System in user space; the hadoop-fuse package enables us to use a Hadoop Distributed File System cluster like a traditional Linux file system.
7.6 How would this Project Idea Help Other Applications?
As discussed earlier, the results obtained from the above experiments clearly show that Hadoop is a technology worth considering for time-efficient and scalable data processing. It could be helpful for organizations that derive and manage business value from their enormous volumes of growing data. In the health care industry, Hadoop could be useful for analyzing volumes of electronic health records, clinical outcomes and treatment protocols. It can also be useful for image analysis and social network analysis, which involve large data and complex algorithms that could be difficult with SQL. In financial sectors, Hadoop can be used to analyze daily transaction data.
7.7 Meeting Minimum Requirements
The minimum requirements stated in section 1.3 under chapter 1 have been met and are discussed briefly below:
• In chapter 5, a successful implementation of the Hadoop framework using a single node was presented, and an example application was run without errors to show that the single-node Hadoop cluster implementation was successful. Once the single-node Hadoop setup was successful, the setup was extended to a Hadoop cluster with two and three nodes. Again, examples were run to show that the Hadoop multi-node cluster setup was a success.
• In chapter 6, the MapReduce programming solution was implemented, and the code used for the Map phase and the Reduce phase was discussed to obtain a reduced output of the given pathology data. A sample input file to the Map phase was given and the output obtained from the Reduce phase was shown.
• In chapter 7, the solution implemented to process the pathology data using the Hadoop framework was evaluated using two experiments. The first experiment showed the performance of the data processing based on the number of nodes in the cluster. The second experiment focused on the response time to process data based on the size of the data.
7.8 Objectives Met
The objectives stated in section 1.2 under chapter 1 have been met. Each objective has been documented in this
report once it has been met.
Chapter 8
8. Conclusions
8.1 Project Success
This research project has discussed the benefits of using Hadoop framework and its MapReduce programming
model to process the pathology data on a Hadoop cluster or cloud. Chapter 4 has indicated that storing data on a
single machine could lead to loss of data if the machine malfunctions or crashes. It has also been shown during
the experiments that processing of data is done relatively faster when the number of nodes or machines in the
cluster increases. The processing of the data is done faster when having more than one machine because the
number of machines in a cluster increases the computation power of the cluster. We could also conclude from
the discussion in chapter 7 that Hadoop works well with large datasets. The time difference between processing data of smaller size (24KB) and data of larger size (39MB) is not that large, bringing out the power of Hadoop data processing. The main idea put forward in this project is that Hadoop can be used to process large data robustly and efficiently.
8.2 Summary
The aim of this project, to process volumetric pathology application data using the Hadoop framework and to evaluate its performance and scalability on a cloud, has been achieved.
Bibliography
ABBOTT, KEEVEN & FISHER PARTNERS. (). Federated Cloud. Available:
http://akfpartners.com/techblog/2011/05/09/federated-cloud/. Last accessed 15th june 2011.
akshaya bhatia. (). Shared Nothing Architecture. Available:
http://it.toolbox.com/wiki/index.php/Shared_Nothing_Architecture. Last accessed 15th june 2011.
Bernard Golden (2007). Virtualization For Dummies. United States of America: Paperback. p1-384.
C. Monash. The 1-petabyte barrier is crumbling. http://www.networkworld.com/community/node/31439.
C. Olofson. Worldwide RDBMS 2005 vendor shares. Technical Report 201692, IDC, May 2006
Chen Zhang , Hans De Sterck , Ashraf Aboulnaga , Haig Djambazian , and Rob Sladek. (2010). Case Study of
Scientific Data Processing on a Cloud Using Hadoop. High Performance Computing Systems and Applications. (
), p.400-415.
Catanzaro, B., N. Sundaram, and K. Keutzer, "A MapReduce framework for programming graphics processors," in Workshop on Software Tools for MultiCore Systems, 2008.
c.olofson. (2006). worldwide Embedded DBMS 2005 vendor shares. Technical Report_IDC. 1 (1), p1-12.
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data
processing.
In SIGMOD Conference, pages 1099–1110, 2008.
D. Vesset. Worldwide data warehousing tools 2005 vendor shares. Technical Report 203229, IDC, August 2006.
D. Magee, K. Djemame, D. Treanor. Reconstruction of 3D Volumes from multiple 2D Gigapixel Microscopy Images using Transparent Cloud Technology. Internal Document, School of Computing, 2010, Leeds UK.
Daniel J. Abadi. (2009). Data Management in the Cloud: Limitations and Opportunities. Bulletin of the IEEE
Computer Society Technical Committee on Data Engineering. 1 (1), p1-10.
Elsayed, T., J. Lin, and D. Oard, "Pairwise Document Similarity in Large Collections with MapReduce," Proc. Annual Meeting of the Association for Computational Linguistics, 2008.
lexemetech.com. Available: http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html.
eucalyptus. FAQ. Available: http://open.eucalyptus.com/wiki/FAQ. Last accessed 20 Aug 2011.
Eric Knorr, Galen Gruman. (). what-cloud-computing-really-means.Available:
http://www.infoworld.com/d/cloud-computing/what-cloud-computing-really-means-031. Last accessed 15th
June 2011.
F. Macias, M. Holcombe, and M. Gheorghe. A formal experiment comparing extreme programming with
traditional software construction. In Computer Science, 2003. ENC 2003. Proceedings
of the Fourth Mexican International Conference on, pages 73 – 80, sept. 2003.
hadoop. (). Welcome to Apache™ Hadoop™. Available: http://hadoop.apache.org/. Last accessed 25 Aug 2011.
hadoop. (). Welcome to Hadoop™ MapReduce!. Available: http://hadoop.apache.org/mapreduce/. Last accessed
25 Aug 2011.
hadoop. (). Welcome to Hadoop™ HDFS!. Available: http://hadoop.apache.org/hdfs/. Last accessed 25 Aug
2011.
Jaliya Ekanayake, Shrideep Pallickara, Geoffrey Fox. (2008). MapReduce for Data Intensive Scientific Analyses. 2008 Fourth IEEE International Conference on eScience, pp. 277-284.
Jonathan Strickland. (). Cloud Computing Architecture. Available: http://computer.howstuffworks.com/cloudcomputing1.htm. Last accessed 15th june 2011.
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. pages 137–150, December
2004
K. Wiley, A. Connolly, J. Gardner ,S. Krugho ,M. Balazinska, B. Howe, Y. Kwon and Y. Bu. (01/2011).
Astronomy in the Cloud: Using MapReduce for Image Coaddition. Bulletin of the American Astronomical
Society. 43 (1), 344.12.
libvirt.org. (). OpenNebula Virtual Infrastructure Manager driver.Available: http://libvirt.org/drvone.html. Last
accessed 15th june 2011.
Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham. (2009). Storage and
Retrieval of Large RDF Graph Using Hadoop and MapReduce. . ( ), .
Michael G. Noll. (). Running Hadoop On Ubuntu Linux (Multi-Node Cluster). Available: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/. Last accessed 15th June 2011.
Michael Miller (2008). Cloud computing. United States of America: Que. P1-283
Miha Ahronovitz,Kuldip Pabla. ( ). What is Hadoop?. Available:
http://ahrono.com/app/download/1900278304/Hadoop+Paper+v4aspdf.pdf. Last accessed 28th Aug 2011.
nimbus. ( ). . Available: http://www.nimbusproject.org/doc/nimbus/faq/. Last accessed 20 Aug 2011.
opennebula.org. (). opennebula. Available: http://opennebula.org/. Last accessed 15th june 2011.
Paul McInerney. (). DB2 partitioning features. Available: http://www.ibm.com/developerworks/data/library/techarticle/dm-0608mcinerney/index.html. Last accessed 15th June 2011.
RightScale. Top reasons amazon ec2 instances disappear. http://blog.rightscale.com/2008/02/02/
R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: Easy and efficient
parallel processing of massive data sets. In Proc. of VLDB, 2008.
rightscale. (). multi-cloud-engine. Available: http://www.rightscale.com/products/features/multi-cloud-engine.php. Last accessed 15th June 2011.
Sun Microsystems , Inc. (2009). Introduction to cloud computing architecture. White Papers. 1 (1), p1-40.
Shinder, D. L., & Vanover, R. . (2008). 20 Things You Should Know About Virtualization. . CNET Networks
Inc, TechRepbulic. 1 (1), p1-6.
searchservervirtualization.techtarget.com. (). Hypervisor. Available:
http://searchservervirtualization.techtarget.com/definition/hypervisor. Last accessed 15th june 2011.
Tomi Aarnio. (2009). Parallel data processing with MapReduce. TKK T-110.5190 Seminar on Internetworking. (
), .
Vaquero LM., Rodero-Merino L., Cáceres J., Lindner M. A Break in the Clouds: Towards a Cloud Definition.
ACM Computer Communication Reviews. January 2009.
Xen Hypervisor. ( ). Xen Hypervisor - Leading Open Source Hypervisor for Servers. Available:
http://xen.org/products/xenhyp.html. Last accessed 20 Aug 2011.
Yahoo! Inc. (). Hadoop Tutorial from Yahoo. Available: http://developer.yahoo.com/hadoop/tutorial/index.html.
Last accessed 15th june 2011.
Appendix A - Project Reflection
This project was one of the most different and challenging projects undertaken during my master's degree. I have learned and worked on something that was never taught in a classroom, and therefore found it very challenging. I have enjoyed learning the cloud computing concepts and how one can work with virtual machines. This project has given me the opportunity to learn new skills which will be helpful in my career ahead. Overall, I have enjoyed implementing the Hadoop framework all by myself, solving various technical issues through a great deal of research. I have made sure to document every step taken during the implementation to help other students doing something similar and to help them learn from my experience.
Initially, when I took up this project I was not very sure if I could complete it within the assigned time, but with the help and assistance of my project supervisor I was able to set a schedule with realistic objectives that could be achieved in the given time period. I have learnt while working on my project that planning, organizing and time management are the key aspects of a successful project. I made sure that I had enough buffer time after every technical implementation, as during the planning phase it is difficult to know whether an implementation will succeed in a given time, since there are always some technical errors or other. I would suggest that any student taking up a research project spends adequate time on sketching out a good project schedule.
As a part of my project I had weekly meetings with my supervisor, which were very helpful in keeping my work up to date. It is always important that your supervisor knows every piece of work you are doing so that he can guide you. I am glad to have had a very supportive supervisor who has always encouraged me in the work I did. The feedback given to me during each project meeting helped me improve the development of concepts. After having functional expertise in projects for three years, working on a strongly technical project was very difficult for me. However, the support provided by my supervisor kept me focused on the development work. I would suggest that other students get help from their supervisors on a regular basis during their project, as during a learning phase proper guidance is essential to accomplish the goal.
I also had the opportunity to meet my assessor halfway through my project work, to present to her and my supervisor the work done to date, and to discuss the objectives of my project. The collaborative feedback given to me by both my assessor and supervisor helped me evaluate my progress and performance. Feedback is one of the important ways in which one can improve the work. I made sure that I began my write-up while working on the implementation rather than waiting until the end to write it. This gave me ample time at the end to make changes to the write-up. I would suggest that students keep their write-up up to date without delaying it until the end.
There were numerous technical challenges faced while working on this project. Most of the issues faced were also faced by other developers, but there was very little documentation on how to resolve the problems. I ended up doing a lot of trial and error to help resolve the issues faced. I made sure I read many articles on similar work done by others and the way they went about providing a solution for the problem. I was provided with workarounds for a few problems which were difficult to resolve. Every piece of work developed by me was discussed with my supervisor to make sure of the validity of the solution developed. Thus, choosing a project which was interesting and challenging kept me motivated and focused throughout the project. I personally feel satisfied with the work done, as it was a very good learning experience. I learnt things through my own effort and dedication. I have gained knowledge in an area of computer science which is growing rapidly. I feel elated about the fact that my project was a contribution to cloud computing, and the knowledge learnt will help me in my future career. I think a student should take up projects that are new in the field of computer science and which they find interesting, as it will help them in research as well as in the IT industry.
Overall, I had a great experience in learning more and adding new skills to the skill sets that I already had, helping me grow as a developer. I would advise students to take up challenging work, as things start becoming easier once you start working on them, and also to take plenty of feedback from their supervisor, as this is a learning phase. So, learn as much as you can.
Appendix B - Critical Technical Issues and solution
1) The DataNode process does not start, showing the below error in the log files of the DataNode.
The log files can be accessed under the $HADOOP_HOME/logs directory.
debian04:/usr/local/hadoop-0.20.2/logs# ls -ltr *datanode*
-rw-r--r-- 1 root root 10623 2011-08-25 05:38 hadoop-root-datanode-debian04.log
The log files show the error "java.io.IOException: Incompatible namespaceIDs":
************************************************************/
2011-08-25 05:38:06,941 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = debian04/10.0.10.4
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.3-dev
STARTUP_MSG: build = -r ; compiled by 'root' on Sat Jul 23 22:10:32 BST 2011
************************************************************/
2011-08-25 05:38:12,265 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
Incompatible namespaceIDs in /app/hadoop/tmp/dfs/data: namenode namespaceID = 434273426; datanode
namespaceID = 254963473
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
2011-08-25 05:38:12,270 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at debian04/10.0.10.4
************************************************************/
The Issue was resolved by changing the “hadoop.tmp.dir” property in the /conf/core-site.xml to a new directory
path “ /app/hadoop/temp” instead of “ /app/hadoop/tmp” and formatting the name node.
debian04:/usr/local/hadoop-0.20.2/conf# cat core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/temp</value>
<description>A base for other temporary directories.</description>
</property>
debian04:/usr/local/hadoop-0.20.2/bin# jps
22027 Jps
21264 SecondaryNameNode
21078 NameNode
21422 TaskTracker
21815 DataNode
21338 JobTracker
The issue occurs if the NameNode is formatted more than once. In a Hadoop cluster the NameNode should be formatted only once.
2) The Local file is not copied to the HDFS file system and executing the below command gives the error
debian04:/usr/local/hadoop-0.20.2# bin/hadoop dfs -copyFromLocal /tmp/dir dir
11/07/01 20:15:01 WARN hdfs.DFSClient: DataStreamer Exception:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser/ebook/pg4300.txt could
only be replicated to 0 nodes, instead of 1
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
There could be two reasons for this problem:
• The dfs.replication value specifies on how many machines the file has to be replicated. If this value is set higher than the number of slave machines involved, we can get this error; in that case we change the dfs.replication property in the "hdfs-site.xml" from the default value of 3 to the number of slaves in the cluster.
• The issue could also be caused by the local file system having no free space. Once the file system had more space, the issue was resolved.
debian04:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              22G   11G   11G  48% /
3) The JobTracker process does not start in a Hadoop cluster and gives the error: BindException: Address already in use.
This issue was resolved by changing the port details in the /conf/mapred-site.xml in the master and slave
machines.
In my case I changed the port value from 54311 to 54312 in the /conf/mapred-site.xml file and restarted the Hadoop daemons, and the issue was resolved.
4) The “java heap space” error is displayed while executing an example application on a Hadoop cluster.
debian02:/usr/local/hadoop-0.20.2# bin/hadoop jar hadoop*examples*.jar
wordcount dir ebook-dir
11/07/06 15:22:04 INFO input.FileInputFormat: Total input paths to
process : 4
11/07/06 15:22:04 INFO mapred.JobClient: Running job:
job_201107061520_0001
11/07/06 15:22:05 INFO mapred.JobClient: map 0% reduce 0%
11/07/06 15:22:20 INFO mapred.JobClient: Task Id :
attempt_201107061520_0001_m_000001_0, Status : FAILED
Error: Java heap space
11/07/06 15:22:24 INFO mapred.JobClient: Task Id :
attempt_201107061520_0001_m_000000_0, Status : FAILED
java.io.IOException: Spill failed
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1123)
.
.
11/07/06 15:22:30 INFO mapred.JobClient: Task Id :
attempt_201107061520_0001_m_000001_1, Status : FAILED
java.io.IOException: Cannot run program "bash": java.io.IOException:
error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:32
9)
This error occurs because there is no swap space. When there is no dedicated swap partition, a workaround is possible by means of swap files; the steps to be executed are listed below.
1. First, create an empty file which will serve as a swap file by issuing the following command:
dd if=/dev/zero of=/swap bs=1024 count=1048576
where /swap is the desired name of the swap file, and count=1048576 sets the size to 1024 MB swap
2. Set up a Linux swap area with:
mkswap /swap
3. set the permissions as follows:
chmod 0600 /swap
4. Add the new swap file to /etc/fstab:
/swap  swap  swap  defaults,noatime  0 0
This way it will be loaded automatically on boot.
5. To enable the new swap space immediately, issue: swapon -a
Check with free -m that everything went right; we should see additional swap space available.
Appendix C - Hadoop configuration files
C.1 Core-site.xml / mapred-site.xml / hdfs-site.xml Configuration files on master Node (Debian02)
debian02:/usr/local/hadoop-0.20.2/conf# cat core-site.xml_bkp
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp/dir</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://10.0.10.2:54310</value>
</property>
</configuration>
debian02:/usr/local/hadoop-0.20.2/conf# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>10.0.10.2:54312</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
debian02:/usr/local/hadoop-0.20.2/conf# cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
Node details on the /conf/master and /conf/slave files on the master machine
debian02:/usr/local/hadoop-0.20.2/conf# cat slaves
10.0.10.2
10.0.10.5
10.0.10.6
debian02:/usr/local/hadoop-0.20.2/conf# cat masters
10.0.10.2
C.2 Core-site.xml / mapred-site.xml / hdfs-site.xml Configuration files on Slave Nodes (Debian05)
debian05:/usr/local/hadoop-0.20.2/conf# cat core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/tmp/dir</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://10.0.10.2:54310</value>
</property>
</configuration>
debian05:/usr/local/hadoop-0.20.2/conf# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>10.0.10.2:54312</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
debian05:/usr/local/hadoop-0.20.2/conf# cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
Node details on the /conf/master and /conf/slave files on the slave machine (debian05)
debian05:/usr/local/hadoop-0.20.2/conf# cat masters
localhost
debian05:/usr/local/hadoop-0.20.2/conf# cat slaves
10.0.10.2
10.0.10.5
10.0.10.6
C.3 Core-site.xml / mapred-site.xml / hdfs-site.xml Configuration files on Slave Nodes (Debian06)
debian06:/usr/local/hadoop-0.20.2/conf# cat core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/tmp/dir</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://10.0.10.2:54310</value>
</property>
</configuration>
debian06:/usr/local/hadoop-0.20.2/conf# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>10.0.10.2:54312</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
debian06:/usr/local/hadoop-0.20.2/conf# cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
Node details on the /conf/master and /conf/slave files on the slave machine (debian06)
debian06:/usr/local/hadoop-0.20.2/conf# cat masters
localhost
debian06:/usr/local/hadoop-0.20.2/conf# cat slaves
10.0.10.2
10.0.10.5
10.0.10.6
Appendix D - MapReduce Program processing 1 GB data
debian02:/usr/local/hadoop-0.20.2# bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file /usr/local/hadoop-0.20.2/mappy.py -mapper /usr/local/hadoop-0.20.2/mappy.py -file /usr/local/hadoop-0.20.2/reducer.py -reducer /usr/local/hadoop-0.20.2/reducer.py -input /user/hduser/patho/* -output /user/hduser/patho-1GB-2
packageJobJar: [/usr/local/hadoop-0.20.2/mappy.py, /usr/local/hadoop-0.20.2/reducer.py, /hadoop/tmp/dir/hadoop-unjar1363922815805956110/] [] /tmp/streamjob3943827107289634139.jar tmpDir=null
11/08/21 03:57:34 INFO mapred.FileInputFormat: Total input paths to process : 1
11/08/21 03:57:35 INFO streaming.StreamJob: getLocalDirs(): [/hadoop/tmp/dir/mapred/local]
11/08/21 03:57:35 INFO streaming.StreamJob: Running job: job_201108190519_0004
11/08/21 03:57:35 INFO streaming.StreamJob: To kill this job, run:
11/08/21 03:57:35 INFO streaming.StreamJob: /usr/local/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=10.0.10.2:54312 -kill job_201108190519_0004
11/08/21 03:57:35 INFO streaming.StreamJob: Tracking URL:
http://debian02:50030/jobdetails.jsp?jobid=job_201108190519_0004
11/08/21 03:57:36 INFO streaming.StreamJob: map 0% reduce 0%
11/08/21 03:57:53 INFO streaming.StreamJob: map 3% reduce 0%
11/08/21 03:57:54 INFO streaming.StreamJob: map 7% reduce 0%
11/08/21 03:57:56 INFO streaming.StreamJob: map 10% reduce 0%
11/08/21 03:57:57 INFO streaming.StreamJob: map 12% reduce 0%
11/08/21 03:57:58 INFO streaming.StreamJob: map 15% reduce 0%
11/08/21 03:57:59 INFO streaming.StreamJob: map 17% reduce 0%
11/08/21 03:58:01 INFO streaming.StreamJob: map 19% reduce 0%
11/08/21 03:58:02 INFO streaming.StreamJob: map 20% reduce 0%
11/08/21 03:58:04 INFO streaming.StreamJob: map 21% reduce 0%
11/08/21 03:58:09 INFO streaming.StreamJob: map 22% reduce 0%
11/08/21 03:58:13 INFO streaming.StreamJob: map 23% reduce 0%
11/08/21 03:58:20 INFO streaming.StreamJob: map 24% reduce 0%
11/08/21 03:58:37 INFO streaming.StreamJob: map 26% reduce 0%
11/08/21 03:58:40 INFO streaming.StreamJob: map 27% reduce 0%
11/08/21 03:58:44 INFO streaming.StreamJob: map 31% reduce 0%
11/08/21 03:58:46 INFO streaming.StreamJob: map 34% reduce 0%
11/08/21 03:58:49 INFO streaming.StreamJob: map 35% reduce 0%
11/08/21 03:58:52 INFO streaming.StreamJob: map 36% reduce 0%
11/08/21 03:58:54 INFO streaming.StreamJob: map 37% reduce 0%
11/08/21 03:59:01 INFO streaming.StreamJob: map 38% reduce 0%
11/08/21 03:59:04 INFO streaming.StreamJob: map 39% reduce 3%
11/08/21 03:59:07 INFO streaming.StreamJob: map 39% reduce 4%
11/08/21 03:59:11 INFO streaming.StreamJob: map 40% reduce 4%
11/08/21 03:59:13 INFO streaming.StreamJob: map 42% reduce 4%
11/08/21 03:59:14 INFO streaming.StreamJob: map 46% reduce 4%
11/08/21 03:59:17 INFO streaming.StreamJob: map 47% reduce 4%
11/08/21 03:59:24 INFO streaming.StreamJob: map 47% reduce 5%
11/08/21 03:59:28 INFO streaming.StreamJob: map 48% reduce 5%
11/08/21 03:59:34 INFO streaming.StreamJob: map 48% reduce 7%
11/08/21 03:59:36 INFO streaming.StreamJob: map 48% reduce 9%
11/08/21 03:59:44 INFO streaming.StreamJob: map 50% reduce 9%
11/08/21 03:59:48 INFO streaming.StreamJob: map 55% reduce 9%
11/08/21 03:59:52 INFO streaming.StreamJob: map 56% reduce 9%
11/08/21 03:59:53 INFO streaming.StreamJob: map 56% reduce 11%
11/08/21 04:00:07 INFO streaming.StreamJob: map 56% reduce 12%
11/08/21 04:00:13 INFO streaming.StreamJob: map 56% reduce 13%
11/08/21 04:00:19 INFO streaming.StreamJob: map 56% reduce 15%
11/08/21 04:00:24 INFO streaming.StreamJob: map 56% reduce 17%
11/08/21 04:00:28 INFO streaming.StreamJob: map 56% reduce 19%
11/08/21 04:00:30 INFO streaming.StreamJob: map 57% reduce 19%
11/08/21 04:00:32 INFO streaming.StreamJob: map 58% reduce 19%
11/08/21 04:00:35 INFO streaming.StreamJob: map 61% reduce 19%
11/08/21 04:00:38 INFO streaming.StreamJob: map 67% reduce 19%
11/08/21 04:00:40 INFO streaming.StreamJob: map 72% reduce 19%
11/08/21 04:00:42 INFO streaming.StreamJob: map 77% reduce 19%
11/08/21 04:00:43 INFO streaming.StreamJob: map 79% reduce 19%
11/08/21 04:00:48 INFO streaming.StreamJob: map 80% reduce 19%
11/08/21 04:01:14 INFO streaming.StreamJob: map 80% reduce 20%
11/08/21 04:01:16 INFO streaming.StreamJob: map 80% reduce 21%
11/08/21 04:01:25 INFO streaming.StreamJob: map 80% reduce 24%
11/08/21 04:01:29 INFO streaming.StreamJob: map 83% reduce 24%
11/08/21 04:01:31 INFO streaming.StreamJob: map 86% reduce 24%
11/08/21 04:01:32 INFO streaming.StreamJob: map 86% reduce 25%
11/08/21 04:01:34 INFO streaming.StreamJob: map 88% reduce 25%
11/08/21 04:01:36 INFO streaming.StreamJob: map 89% reduce 25%
11/08/21 04:01:39 INFO streaming.StreamJob: map 91% reduce 25%
11/08/21 04:01:42 INFO streaming.StreamJob: map 93% reduce 25%
11/08/21 04:01:45 INFO streaming.StreamJob: map 94% reduce 25%
11/08/21 04:01:48 INFO streaming.StreamJob: map 96% reduce 27%
11/08/21 04:01:51 INFO streaming.StreamJob: map 99% reduce 27%
11/08/21 04:01:54 INFO streaming.StreamJob: map 100% reduce 27%
11/08/21 04:02:14 INFO streaming.StreamJob: map 100% reduce 29%
11/08/21 04:02:23 INFO streaming.StreamJob: map 100% reduce 32%
11/08/21 04:03:02 INFO streaming.StreamJob: map 100% reduce 33%
11/08/21 04:03:23 INFO streaming.StreamJob: map 100% reduce 67%
11/08/21 04:03:31 INFO streaming.StreamJob: map 100% reduce 72%
11/08/21 04:03:46 INFO streaming.StreamJob: map 100% reduce 83%
11/08/21 04:03:56 INFO streaming.StreamJob: map 100% reduce 89%
11/08/21 04:04:14 INFO streaming.StreamJob: map 100% reduce 100%
11/08/21 04:04:23 INFO streaming.StreamJob: Job complete: job_201108190519_0004
11/08/21 04:04:23 INFO streaming.StreamJob: Output: /user/hduser/patho-1GB-2
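The run above is a Hadoop Streaming job, which pipes each input split to an external mapper over stdin and reads tab-separated key/value lines back on stdout, then does the same for the reducer after the shuffle. The actual pathology mapper and reducer are not reproduced here; the following is only a minimal word-count style sketch of that streaming contract, with hypothetical script names mapper.py and reducer.py written for the Python 2.6 interpreter installed in Appendix E.

#!/usr/bin/env python
# mapper.py - hypothetical minimal streaming mapper (not the project's pathology mapper).
# Reads raw input lines from stdin and emits tab-separated <token, 1> pairs on stdout.
import sys

for line in sys.stdin:
    for token in line.strip().split():
        print '%s\t%d' % (token, 1)

#!/usr/bin/env python
# reducer.py - hypothetical minimal streaming reducer.
# Streaming delivers mapper output sorted by key, so equal keys arrive consecutively;
# this loop sums the counts per key and emits <key, total>.
import sys

current_key, current_count = None, 0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if key == current_key:
        current_count += int(value)
    else:
        if current_key is not None:
            print '%s\t%d' % (current_key, current_count)
        current_key, current_count = key, int(value)
if current_key is not None:
    print '%s\t%d' % (current_key, current_count)

A job of this form is typically submitted with the streaming jar shipped in the Hadoop 0.20.2 contrib directory, passing the two scripts through the -mapper, -reducer and -file options together with the HDFS -input and -output paths (here the output path /user/hduser/patho-1GB-2 reported in the log).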
Appendix E - Installation of Java / Python
Installation of JAVA
debian02:/usr/local# echo $JAVA_HOME
*** No java installed ***
*** Install Java ***
debian02:/usr/local# apt-get install sun-java6-jdk
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
*** Fix ***
1) Update the source list:
apt-get update
2) Install the Java JDK and JRE with apt-get:
apt-get install sun-java6-jdk sun-java6-jre
3) After the installation is done, the JDK and JRE are installed under /usr/lib/jvm/java-6-sun (a symlink to the versioned directory, here java-6-sun-1.6.0.22):
debian02:/usr/lib/jvm# ls -ltr
total 4
lrwxrwxrwx 1 root root 19 2011-07-01 15:28 java-6-sun -> java-6-sun-1.6.0.22
drwxr-xr-x 8 root root 4096 2011-07-01 15:28 java-6-sun-1.6.0.22
4) Verify the installation:
debian02:~# java -version
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
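For Hadoop itself to pick up this JDK, JAVA_HOME also has to be set in Hadoop's own environment file, conf/hadoop-env.sh. A minimal example, assuming the Hadoop installation path shown in the job log above:
debian02:~# vi /usr/local/hadoop-0.20.2/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun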
Installation of Python
1) Download Python 2.6.7 tar file from
http://www.python.org/getit/releases/2.6.7/
2) debian02:~# cd /usr/local
debian02:/usr/local# ls -ltr
total 154176
-rw------- 1 root staff 13322372 2011-07-29 20:24 Python-2.6.7.tgz
3) Untar the tar file creating a directory Python-2.6.7
debian02:/usr/local# cd Python-2.6.7
debian02:/usr/local/Python-2.6.7#
4) To build on UNIX, execute "./configure" from the current directory; when it completes, type "make". This creates an executable "./python". Once it has built successfully, to install into the /usr/local directory, first type "su root" and then execute "make install".
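Once "make install" completes, a quick sanity check (a hypothetical verification step, not part of the original transcript) is to confirm that the interpreter the streaming scripts will invoke is the new 2.6.7 build; with the default prefix, "make install" places it at /usr/local/bin/python.

# check_python.py - hypothetical one-off check; assumes /usr/local/bin/python
# (the freshly built 2.6.7 interpreter) is the one found first on the PATH.
import sys
print sys.version      # expected to start with "2.6.7"
print sys.executable   # expected to be /usr/local/bin/python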
Appendix F – Schedule
The timeline of the project is given below for reference.
Appendix G – Interim Project Report
The interim project report feedback is attached at the end of this report for reference.