University of Leeds
School of Computing
MSc Advanced Computer Science

Data Management in Cloud Computing

Bishakha Dutta Gupta
MSc Advanced Computer Science
Session (2010/11)

The candidate confirms that the work submitted is their own and that appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.

(Signature of Student) -----------------------------------

Summary

Handling large volumes of data in a scalable way in clouds has long been a concern. In this project we introduce a framework built on Hadoop to store and process large pathology datasets. Hadoop has been widely used to process large-scale data efficiently and scalably, yet there have been few efforts to apply it to scenarios other than server-side computation such as web indexing. In this project a Hadoop cluster, or cloud framework, is implemented and its MapReduce programming model is used to store and process the pathology data efficiently on the cluster. The approach is then generalized and discussed further as a way of handling similar scientific data processing scenarios.

Acknowledgements

I would like to thank my supervisor, Dr. Karim Djemame, for his timely feedback, continuous support and motivation throughout the project, and for helping me to complete the research project on time. I would also like to thank my assessor, Dr. Vania Dimitrova, for providing good feedback on both the mid-term project report and during the progress meeting. I thank Django Armstrong for his quick and timely help with any technical issues regarding the cloud test bed. I would also like to thank my lovely parents for their incessant love, inspiration and financial support. Finally, I thank my uncle Dr. Jyoti N. Sengupta, aunty Jayati Sengupta and my grandparents for their continuous love and encouragement during my time as a student in Leeds.

List of Acronyms

CPU  Central Processing Unit
MB  Megabyte
GB  Gigabyte
TB  Terabyte
PB  Petabyte
HTTP  Hypertext Transfer Protocol
I/O  Input/Output
IaaS  Infrastructure as a Service
PaaS  Platform as a Service
SaaS  Software as a Service
RAM  Random Access Memory
SSH  Secure Shell
VIM  Virtual Infrastructure Manager
VM  Virtual Machine
Amazon EC2  Amazon Elastic Compute Cloud
API  Application Programming Interface
SNA  Shared-Nothing Architecture
HDFS  Hadoop Distributed File System
DFS  Distributed File System

Table of Contents
1. INTRODUCTION ..... 1
1.1 Project Aim: ..... 1
1.2 Objectives: ..... 1
1.3 Minimum Requirements: ..... 1
1.4 Motivation ..... 1
1.5 Methodology: ..... 2
1.6 Schedule ..... 2
1.7 Contributions: ..... 3
1.8 Report Structure: ..... 3
2. CLOUD COMPUTING ..... 5
2.1 Introduction ..... 5
2.2 Properties of a Cloud ..... 6
2.3 Architecture of Cloud Computing ..... 6
2.4 Models ..... 7
2.5 Types of cloud system ..... 9
2.6 Companies that offer Cloud services ..... 10
2.7 Virtualization ..... 12
2.8 Types of Virtualization ..... 13
2.9 Major Players and Products in Virtualization ..... 14
2.10 Virtualization Project Steps ..... 14
2.11 Hypervisor: XEN ..... 15
2.12 Characteristics of a cloud ..... 17
2.13 Data management applications ..... 18
2.13.1 Transactional data management ..... 18
2.13.2 Analytical data management ..... 19
2.14 Analysis of Data in the Cloud ..... 20
2.15 Analyzing the Data using Hadoop Framework ..... 21
2.16 Cloud Test Bed ..... 21
2.16 Related Work ..... 22
3. Case Study: Pathology Application ..... 24
3.1 General Description of Pathology data Processing ..... 24
3.2 Accessing the Application ..... 25
4. HADOOP Approach ..... 27
4.1 MapReduce ..... 29
4.2 MapReduce Execution Overview ..... 31
4.3 Working of Map and Reduce Programming model with an example ..... 33
4.4 The Hadoop Distributed File System (HDFS) ..... 34
4.5 Why use Hadoop and MapReduce? ..... 37
4.6 Why Use Hadoop framework for the Pathology Application Data? ..... 37
4.7 Hadoop Streaming ..... 38
5. Installation of HADOOP ..... 40
5.1 From two single-node clusters to a multi-node cluster ..... 43
6. Implementation of MAP REDUCE programming model ..... 47
6.1 General Description of the current data processing used by the Pathology Application ..... 47
6.2 System Design ..... 47
6.3 Process Flow ..... 48
6.3.1 The pre-processing Step ..... 48
6.3.2 Loading pre-processed data into HDFS ..... 49
6.3.3 Process Data using the Map Reduce Programming Model ..... 49
6.3.4 Running the Python Code in Hadoop ..... 50
7. Evaluation and Experimentation ..... 52
7.1 Response time to process the Pathology Data on a single node / Cluster ..... 52
7.2 Response time to run the pathology Application on based on Data Size ..... 54
7.3 Results and comparison ..... 58
7.4 Evaluation of the Software ..... 58
7.5 Further Work ..... 59
7.6 How would this Project Idea Help Other Applications? ..... 59
7.7 Meeting Minimum Requirements ..... 59
7.8 Objectives Met ..... 60
8.
Conclusion ..........................................................................................................................................................61 8.1 Project Success .............................................................................................................................................61 8.2 Summary ......................................................................................................................................................61 Bibliography ...........................................................................................................................................................62 Appendix A - Project Reflection ............................................................................................................................66 Appendix B - Critical Technical Issues and solution .............................................................................................68 Appendix C - Hadoop configuration files ..............................................................................................................72 C.1 Core-site.xml / mapred-site.xml / hdfs-site.xml Configuration files on master Node (Debian02) .............72 C.2 Core-site.xml / mapred-site.xml / hdfs-site.xml Configuration files on Slave Nodes (Debian05) ............73 C.3 Core-site.xml / mapred-site.xml / hdfs-site.xml Configuration files on Slave Nodes (Debian06) ..............75 Appendix D - MapReduce Program processing 1 GB data ....................................................................................77 Appendix E - Installation of Java / Python .............................................................................................................80 Installation of JAVA ..........................................................................................................................................80 Installation of Python .........................................................................................................................................81 Appendix F – Schedule ..........................................................................................................................................82 Appendix G – Interim Project Report.....................................................................................................................83 Table of Figures Figure 1: Schedule used for the project with the major tasks and time scales/deadlines to complete these tasks. ..3 Figure 2: cloud computing Architecture ..................................................................................................................7 Figure 3: Types of cloud Models .............................................................................................................................8 Figure 4: Types of cloud system ............................................................................................................................10 Figure 5: Virtualization .........................................................................................................................................12 Figure 6: Types of Virtualization ..........................................................................................................................13 Figure 7: An example of 2 nodes in the cloud showing the Layered architecture. ................................................21 Figure 8: Data is distributed across nodes at load time. 
........................................................................................28 Figure 9: Map Reduce programming Model .........................................................................................................29 Figure 10: Different colors represent different keys. All values with the same key are presented to a single reduce task. .............................................................................................................................................................30 Figure 11: Mapping creates a new output list by applying a function to individual elements of an input list. .....30 Figure 12: Reducing a list iterates over the input values to produce an aggregate value as output. ......................31 Figure 13: Map Reduce Execution Overview........................................................................................................32 Figure 14: Map Reduce Programming Flow .........................................................................................................33 Figure 15: Map Reduce Flow Example .................................................................................................................34 Figure 16: Data Nodes holding blocks of multiple files with a replication factor of 2. The NameNode maps the filenames onto the block IDs. .................................................................................................................................35 Figure 17: HDFS Architecture...............................................................................................................................36 Figure 18: Set up of a Multi Node cluster .............................................................................................................44 Figure 19: multi-node cluster setup ......................................................................................................................44 Figure 20: Running the program for the first time takes 47 seconds approximately and 37 seconds for the second and consecutive runs...............................................................................................................................................53 Figure 21: Running the program for the first time on a Hadoop cluster consisting of two nodes takes 32 seconds and for the second and consecutive runs takes 27 seconds. ...................................................................................53 Figure 22: Running the program for the first time on a Hadoop cluster consisting of three nodes takes 30 seconds and for the second and consecutive runs takes 22 seconds .......................................................................54 Figure 23: Difference of running the Pathology data on a Hadoop cluster with a single node and a Hadoop cluster with two and three nodes ............................................................................................................................54 Figure 24: The difference of the time taken to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on a Hadoop cluster with a single node. 
......................................................................................................................55 Figure 25: The difference of the time taken to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on a Hadoop cluster with two nodes............................................................................................................................56 Figure 26: The difference of the time taken to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on a Hadoop cluster with three nodes..........................................................................................................................56 Figure 27: The Response time taken to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on single and a multi node cluster setup ................................................................................................................................57 Figure 28: Processing time taken by a Hadoop cluster with one, two and three Nodes for 1GB pathology data. 57 -1- Chapter 1 1. INTRODUCTION 1.1 Project Aim: With the increasing popularity of cloud computing Hadoop software Framework is becoming widely used for processing large data on clouds. The cloud in simple terms is a large group of computers that are interconnected. The main objectives of this research project are to process large datasets of a pathology application on the cloud using the Hadoop framework and its programming model Map Reduce and evaluate the performance and scalability using the Hadoop Framework. Hadoop is used for large scale data processing on cluster of machines. 1.2 Objectives: Obtain a good understanding of cloud computing, types of cloud computing, cloud architecture, cloud models and virtualization. Understanding how the use of clouds could benefit the IT industry. Understanding the limitations and opportunities using cloud computing Understand the concept of virtualization and learn to deploy virtual machines on the cloud. Understand Hadoop Framework and learn to deploy a Hadoop cluster. Learn Hadoop's programming model (Map Reduce) and design a framework to process large data using Hadoop and its Map Reduce programming model. Evaluate the performance and scalability of this framework. 1.3 Minimum Requirements: Successfully implementing Hadoop as a single node setup Successfully implementing Hadoop as a multi node setup with two/three machines. Designing a MapReduce programming model to process the pathology application data using Python programming language and the Hadoop streaming utility. Evaluating the performance of the data processing under the proposed Hadoop and its programming model MapReduce framework. 1.4 Motivation In the modern computing, Cloud technologies have become progressively prevalent with the number of successful solutions growing rapidly. Amazon, Google, Microsoft and IBM's putting forward the abilities of the cloud, are attracting many businesses to use cloud technologies. The technologies driving the development of -2cloud computing are relatively new and thus there are numerous open research questions being worked on. The Hadoop framework for processing large data in cloud is a very interesting concept to handle a large scale computation in clouds, below are few of some successful implementation of using Hadoop Framework. A successful implementation of Hadoop has been reported for scalable image-processing related to astronomical images. Astronomical surveys of the sky generate tens of terabytes of images every night. 
The study of these images involves computation challenges and these studies benefit only from the highest quality data. Given the quantity of data and the computational load involved, these problems could only be handled by distributing the work over a large number of machines. The report concluded the use of Hadoop showing great improvement for processing large volumes of astronomical datasets. On a 400-node cluster they were able to process 100,000 files with approximately 300 million pixels in just three minutes (Keith, 2011). Another successful implementation is in the field of biology where scientific data is processed on a cloud using Hadoop. The implementation develops a Hadoop-based cloud computing application that processes sequences of microscope images of live cells. Hadoop was evaluated working on a scientific data processing case study. The report concluded that the cloud solution was attractive because it took advantage of many of the desirable features offered by the cloud concept and Hadoop, including scalability, reliability, and fault-tolerance, easy deploy ability, etc (Chen, 2010).Therefore, looking at the success of Hadoop Framework on different areas of computer, the current project proposed will be another implementation using Hadoop. 1.5 Methodology: In order to complete the objectives and to meet the minimum requirements of this project most of the time has been spent productively on researching about cloud computing and its latest trends and technologies. Once the background reading was completed and a good understanding of the Research topic area had been reached a solution was proposed. The solution proposed reflects the objectives and the minimum requirements mentioned above. A suitable solution for working with the pathology data was created by using the agile technique (Macias, 2003). The agile technique is based on iterative and incremental development and thus the testing was done along with the development allowing changes to the design at every stage during the development as needed which gave rapid results. The design was then evaluated based on efficiency and scalability on the cloud test bed at the Leeds University and conclusions were drawn to see if the solution adds value to the research area. 1.6 Schedule A schedule was planned using Gantt chart to manage the allocation time of the project tasks. The tasks were allotted sufficient time for their completion. The schedule started from June 1st with a detailed understanding of the problem area and to have hands on session on the School of Computing cloud test bed. There were regular progress meetings that helped deciding on how to go about with the solution and the software tools that to be used for designing the solution for the problem area. Once a solution was worked out the implementation was -3started after a thorough discussion with the supervisor. Sufficient time was dedicated for evaluating the developed software and final report writing. At every step of the plan there was an assessment to check whether the tasks met the requirements of the project. Figure 1: Schedule used for the project with the major tasks and time scales/deadlines to complete these tasks. 1.7 Contributions: The following contribution has been given by the research project: The solution proposed and implemented will be useful in making decisions in regards to time and cost involved to process large data of an application using the Hadoop cloud framework. 
The evaluation of the proposed solution based on performance and scalability will enable suggestions to be made for future work or improvements in cloud technologies. 1.8 Report Structure: The following is the structure of the report: Chapter 2 describes the background research in order to obtain an understanding of what is cloud computing, its architecture, cloud types, cloud models, cloud technologies, cloud trends, virtualization concept, types of virtualization and a brief description on managing data on clouds. -4 Chapter 3 describes a case study, and discusses the real world scenario and how cloud computing technologies and techniques could be beneficial. Chapter 4 deals with the Hadoop Framework and how it helps in processing large volumes (Petabytes) of data using its two important components; Map Reduce Programming model and the Hadoop Distributed File systems. Chapter 5 delineates step by step Installation of the Hadoop cluster with a single node and a Hadoop cluster with two Nodes. Chapter 6 describes a solution using the Map Reduce programming model to handle the processing of the Pathology application data that was discussed in chapter 3. Chapter 7 evaluates the implemented solution described in chapter 6 by running experiments. Chapter 8 concludes the overall impact of the project. -5- Chapter 2 2. CLOUD COMPUTING “Clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically reconfigured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized SLA” - (Vaquero, 2010). “Cloud computing seems to be little more than a marketing umbrella, encompassing topics such as distributed computing, grid computing, utility computing, and software-as-a-service, that have already received significant research focus and commercial implementation. There also exist an increasing number of large companies that are offering cloud computing infrastructure products and services” - (Daniel, 2009). 2.1 Introduction With traditional desktop computing we can run a software program on each computer we own and documents that we create are stored on the computer that it was created on. Although documents that are created on a computer can be accessed from any other computer on the network but they cannot be accessed outside the network. This entire scenario is PC centered. With the use of cloud computing the software programs we use are no more run from our PC but are stored on servers that are accessed via the internet. The major advantage of this is that the software is still available for use even if the computer crashes. Similar is the case for the documents that are created by us , they are placed on a collection of servers and is accessed by the internet, not only can we access the document but any one with permission can access the document and can make changes to the document in real time. Therefore, cloud computing model is not PC- centric (Miller, 2008). “The cloud” is the key to define cloud computing technology. The cloud in simple term is a large group of computers that are interconnected, these computers can be PC‟s or network servers, which could be even private or public. 
For example, Google hosts a cloud that consist of small PC‟s and larger servers and their cloud is a private cloud, Google owns the cloud that is publicly accessible by Google‟s users only. Any authorized user can access the data and application from a computer using an internet connection and for the user the infrastructure and the technology is completely invisible. Cloud computing is not network computing. With network computing documents or applications are hosted on a single server owned by the company and can be accessed from any computer on the network. The concept of cloud computing is much bigger. It encompasses multiple servers, multiple networks and multiple companies. Unlike network computing, cloud services and cloud storage can be accessed from anywhere using an internet connection but with network computing the -6accesses is limited only within the company‟s network. Cloud computing is also not outsourcing where a company subcontracts its computing services to an outside firm. The outsourcing firm can host a company‟s application or data, but the application or data is accessed only by the employees of the company by using the company‟s network and not to the entire world via the internet (Miller, 2008). 2.2 Properties of a Cloud Cloud computing is user-centric: Once we are connected to the cloud, the documents, data, images, application stored on the cloud is not only ours but can also be shared with others. Cloud computing is centered on tasks: Focus is not on what an application can do, the focus is more on how the application can do it for you. Cloud computing is powerful: connecting hundreds and thousands of computers in a cloud means there is a wealth of computing power which is not possible to be found on a single PC. Cloud computing is widely accessible: As the data is stored on the cloud, users have the privilege that they could retrieve more valuable information from multiple sources or repositories. The advantage is we are not restricted to data from a single source as we are on a desktop PC. Perhaps the best and the popular examples of cloud computing applications are Google Docs and spreadsheets, Google Calendar, Gmail and Picassa. All these applications are hosted on Google‟s server and accessible to any user with an internet connection. Thus, with the help of cloud computing there is a shift from the computer-dependent isolated data use to data that could be shared and accessed by anyone from anywhere, and from application to task. In cloud computing technology, the user doesn‟t even have to know where exactly the data is located. All that matters is data is on the cloud and is immediately available to the user and other authorized users (Miller, 2008). 2.3 Architecture of Cloud Computing The key to cloud computing is the network of servers or even individual PC‟s that are interconnected. As these computers run in parallel, combining the powers of each provide supercomputing like power that are publicly accessible via the internet. Users using the computer via the internet connect to the cloud. To a user the cloud is a single device, application or document. For the user the hardware and how the hardware is managed by the operating system are completely invisible. When we talk about a cloud computing system it is easier to understand if we divide it into two sections: one section as the front end and the other section as the back end (Jonathan, 2011). Internet is usually the network by which the front end and the back end are connected to each other. 
The front end is the computer user or the client. The cloud section of the system is the backend. The Front end includes the computer of the user or the network and the application that is required to access the cloud system. “Various computers, servers and data storage systems that create the "cloud" of computing services form the back end of the system”- (Jonathan, 2011). It begins with the front end interface from where the user selects the task or service like starting an application. The system management then gets the users request, which -7finds the correct resources and calls the appropriate provisioning services. This service gets the appropriate resources on the cloud and launches the appropriate web application. Once the application is launched the system‟s metric and monitoring function tracks the usage of the cloud and based on that the users are charged (Miller, 2008). Communication between networked computers is possible by using a middleware. If a company that provides cloud computing has a lot of clients there would be a high demand for a large volume of storage space thus hundreds of digital storage devices are required by many companies. A cloud computing system must replicate all its clients' information/data and store it on other devices as there could be occasional break down of these devices. These replicated data are later accessed by the central server during break down to retrieve data that would be unreachable if there was no replication of the data (Jonathan, 2011). The Cloud computing architectural model is presented below. Figure 2: cloud computing Architecture (Source: http://www.jot.fm/issues/issue_2009_03/column3/) 2.4 Models There are five cloud models to consider: (1) Public, (2) private, (3) hybrid clouds, (4) federated clouds and (5) Multi clouds IT organizations can choose to deploy applications on public, private, hybrid clouds, federated clouds and multi clouds. Public clouds are out there on the internet i.e. globally available and private clouds are typically located on premises of an organization or company i.e. locally available. Companies may make considerations in regards to which cloud computing model they choose to employ i.e. if they just want to use a single model or a combination of the two rather than having one model to solve single or different problems. An application that is needed temporarily might be best suited for deployment in a public cloud because it could help avoiding the need to purchase additional equipment, software and hardware to solve a temporary need .On the other hand, a permanent application or one that has specific requirements on quality of service or location of data, might best be suited to be deployed in a private or hybrid cloud (Sun Microsystems, 2009). -8- Figure 3: Types of cloud Models (Source: http://computinged.com/edge/cloud-computing-winning-formula-for-organizations-in-gaining-technology-edge/) Public clouds Public clouds are run by third parties and different applications from various customers could be incorporated into the cloud provider‟s servers. Public clouds are most often hosted away from customer premises, and they provide a way to reduce customer risk like natural disaster, infrastructure risks and the cost by providing a flexible, even temporary extension to enterprise infrastructure. If a public cloud is implemented by keeping in mind the performance and security, the existence of other applications running in the cloud should be transparent to both cloud architects and end users. 
One of the major benefits of public clouds is that they can be much larger than a private cloud with the ability to scale up and down on demand, and moving infrastructure risks from the enterprise to the provider of the cloud even if it is temporary.Thus, public cloud describes as cloud computing where resources are dynamically provided on a self-service and fine-grained basis over the internet via web services, from a third-party provider who charges based on a fine-grained utility computing basis (Vaquero, 2009). Private clouds Private clouds are built exclusively for one client, providing security, quality of services and control over data. The company owns the entire infrastructure and has complete control over how the applications are deployed on it. They can be built and managed by the company‟s own IT staff or by a cloud provider.For, mission critical applications IT organizations use their own private clouds to protect critical infrastructures as placing such application globally could have security risks (Vaquero, 2009). Hybrid clouds A hybrid storage cloud basically is a combination of public and private storage clouds. It can be used to handle planned increase in workload. It has a disadvantage when deciding on how to distribute the applications across both a private and public cloud. Other than this issue, it also needs to consider the relationship between data and processing resources. A hybrid cloud can be of great use and success if the data is small or if the application is stateless (Vaquero, 2009). -9- Federated Clouds This model is one in which different computing infrastructure providers (IP‟s) can join together to create a federated cloud. The advantages of such a model includes cost savings due to not over provisioning for spikes in capacity demand. The biggest advantage of the federated model is the lack of reliability on a single vendor and having a higher availability of the services due to greater distribution of computing resources across different infrastructure. One of the primary reluctance to a complete cloud hosting solution is the reliance on a single vendor for the entire availability of the site. A federated cloud would eliminate this issue (Abbott, Keeven & Fisher partners, 2011). Multi cloud Let us consider a multi cloud engine "RightScale." They basically interact with API's of the cloud infrastructure and manage the aspects of each cloud. The advantage of RightScale is that it does not lock you into any one particular cloud - we are free to choose among different providers of cloud services, and we can even deploy and move our applications across multiple clouds. Users also have the privilege to modify the input parameters that are cloud-specific and users can launch servers with the same configurations on the new cloud (rightscale.com, 2011). “Deployments spanning multiple clouds can enable disaster recovery scenarios, geography-specific data and processing, or hybrid architectures across private and public clouds” - (rightscale.com, 2011). The present project would be using the University of Leeds, School of computing, which a private cloud Test Bed owned by the School of computing. 2.5 Types of cloud system In this section an attempt has been made to distinguish the kind of systems where Clouds are mostly used. Many companies use software services as their business basis. These Service Providers (SPs) provides services and also gives access to those services to the users who use this service via internet-based interfaces. 
Clouds outsource the provision of the computing infrastructure required to host services. Infrastructure Providers (IPs) offers this infrastructure „as a service‟, shifting computing resources or services from the SPs to the IPs, such that the SP‟s can gain in flexibility and reduce costs (Vaquero, 2010). - 10 - Figure 4: Types of cloud system (Source: http://www.bitsandbuzz.com/article/dont-get-stuck-in-a-cloud/) Infrastructure as a Service A large set of computing services or resources such as processing and storing capacity are managed by the IP‟s. They are able to split, assign and dynamically resize these resources via virtualization to build systems as demanded by customers. This is the scenario of “Infrastructure as a Service” (IaaS) (Vaquero, 2010). Software as a Service The most common type of cloud service development is “Software as a Service” (SaaS). This types of cloud computing delivers a single application through the browser to thousands of customers. An example of SaaS is the online word processors (Vaquero, 2010) which are an online alternative of typical office applications. This scenario is called Software as a Service (SaaS). On the customer side, it means no investment in servers or software licensing; on the provider side, with just one app to maintain, costs are low compared to conventional hosting (Vaquero, 2010).customers pay for using it but doesn‟t pay for owning the software. Users access an application via an API. An API is an Application development Interface that allows a remote program to use or communicate with another program or service. Platform as a Service. This form of cloud computing delivers development environments / platforms as a service. We build our own applications that run on the provider's infrastructure and are delivered to our users via the internet from the provider's servers. This is Platform as a Service (PaaS) quite constrained by the vendor's design and capabilities, so you don't get complete freedom, but you do get predictability and pre-integration. Examples of PaaS include Coghead and the new Google Apps Engine (Eric, 2011). 2.6 Companies that offer Cloud services In this section we look at some companies that offer cloud services. - 11 Amazon It is a primary provider of cloud computing services. One of the services is the Elastic Compute Cloud also known as EC2.Developers and companies are allowed to rent the capacity on Amazon‟s propriety cloud of servers which happens to be he biggest server farms in the world.EC2 enables customers to request a set of virtual machine, onto which they can deploy any application of their choice. Hence, customers can create, launch and terminate server instances on demand creating elasticity. Users pick the size and power they want for their virtual servers and Amazon does the rest.EC2 is just a part Amazon also provides developers with direct access to its softwares and machines. Developers can then build low cost, reliable and powerful web based application. The cloud provided by Amazon gives access to developers to do the rest. Developers pay for the computing (Miller, 2008). Google App Engine Google also offers cloud computing services. The service is the Google apps engine which enables developers to build their web application utilizing the same infrastructure that power‟s Google‟s applications. A fully integrated application environment is provided by the Google App engine. The Google App engine is easy to build, maintain and scale. 
All it needs is to develop the application and upload it to the App engine cloud. Google App Engine unlike other cloud hosting application is free to use (Miller, 2008). Salesforce.com The company is very well known for cloud computing developments and its sales management Saas. Dubbed Force.com is the company‟s cloud computing architecture. The service provided is on-demand and runs across the internet. Salesforce provides its own developer‟s toolkit and API and charges fees per user usage. It has a directory of web-based applications called AppExchange. The AppExchange applications uploaded by other developers can be accessed by developers. The developers can also share their developed applications or can also make them private so that they can be accessed by authorized clients. Most of the applications are free to use in the AppExchange library. Most of the applications are sales related like sale analysis tool, financial analysis apps. etc (Miller, 2008). IBM IBM also offers cloud computing solution. Small and medium size business is targeted by the company by their on-demand cloud based suite. The services provided by the cloud computing suite include email continuity, email archiving, backup of data and recovery… etc. It manages its cloud hardware using Hadoop which is based on Map Reduce software that is used by Google (Miller, 2008). IBM, Amazon, Salesforce.com, Google are not the only companies that provide cloud services there are other smaller companies including 3tera, 10gen, Nirvanix… etc (Miller, 2008). - 12 - 2.7 Virtualization A data center consists of hundreds and thousands of servers and most of the time the server‟s capacity is not fully used and thus there's unused processing power that goes waste. In such cases there could be a possibility to make a physical server think that it is, multiple servers, each running with its own operating system. The technique is termed as server virtualization. This technique reduces the need to have more physical machines (Jonathan Strickland, 2011).A traditional server contains a single application running on an operating system. This leads to tremendous cost in number of areas in terms of hardware, operations, management and maintenance. To handle these issues and cost, enterprise IT has come up with the most compelling tool called the Virtualization Technology. For each of these applications the average utilization is about 5-10%. These servers are barely used across the environment 90-95% of its capacity are not used in an average. So what basically the virtualization technology does is that it takes the advantage of that and runs these environments side by side on an much lower number of physical servers. The environments could be databases, business applications, web servers etc we can take these down and consolidate them into much smaller number of physical servers. Running multiple logical servers on a single physical machine which is also termed as server consolidation is a popular way to save money spent on hardware and make administration and backup easier (Shinder, 2008). Virtual applications reduce hardware costs and ease application deployment. Each of these environments now run side by side on a single machine and each of them are isolated and fully encapsulated. 
Figure 5: Virtualization (Source: CNET/James Urquhart) The Reasons are clear from above to why we need to use virtualization and they are the following: It saves money: Virtualization reduces the number of servers; this means there is a significant savings on hardware cost and also on the amount of energy needed to run the hardware. - 13 It’s good for the environment: Energy savings that could be brought by adopting virtualization technologies would reduce the need to build many power plants and would thus help to conserve energy resources. Reduces work of system administrators: With virtualization administrators would not have to support many machines and could work on tasks that need more strategic administration. Better use from hardware: There is a higher hardware utilization rate as there are enough virtual machines on each server to increase its utilization from the typical 5 – 10 % to as much as 90 - 95%. It makes software installation easier: Vendors are more inclined towards delivering their products preinstalled in virtual machines thus decreasing the traditional installation and configuration work (Bernard Golden, 2007). 2.8 Types of Virtualization Most of the activity in the virtualization Technology focuses on server virtualization. There are three main types of virtualization: 1) Hardware emulation: A Machines hardware environment is represented as software so that we can install multiple operating systems on a single machine. 2) Para virtualization: The software layer coordinates the access from various operating systems to the underlying hardware. 3) Operating system virtualization: A self-contained representations is created of the underlying operating system in order to provide the applications with isolated execution environments. The underlying operating system version is reflected by each self-contained container (Bernard Golden, 2007). Figure 6: Types of Virtualization Hypervisors: It‟s virtualization software that allows a single hardware to be shared by multiple operating systems. Hypervisors are also termed as a virtual machine manager. Each operating system appears to have the host's processor, memory and other resources all to itself but it is the hypervisor that is controlling the host resources and processors, allocating what is needed by each of the operating system and making sure that the - 14 virtual machines (guest operating systems) cannot disrupt each other (searchservervirtualization.techtarget.com, 2011). Few of the Hypervisors are VmWare, Xen and Kvm. Virtual Infrastructure Manager: Examples for a Virtual Infrastructure Manager would be OpenNebula (libvirt.org, 2011). It controls Virtual Machines (VM) in a collection of distributed resources by arranging storage, network and virtualization technologies. OpenNebula driver lets us manage the private or hybrid (Amazon EC2 or Elastic Hosts based) cloud using a standard library virtualization interface, API as well as the related tools and VM description files (opennebula.org, 2011). Eucalyptus is another virtual infrastructure manager and has the ability to deploy public or private clouds. The virtual machine instances deployed can be run and controlled by the users using Eucalyptus. It is a part of the Ubuntu Enterprise cloud. It is used to enable hybrid cloud infrastructures between the public and the private clouds (open.eucalyptus.com, 2011). Nimbus deploys virtual machines on those remote resources that are leased by the clients by configuring them in way that is desired by the user. 
It is a collection of open source tools that provide Infrastructure as a service cloud solution (nimbusproject.org, 2011). 2.9 Major Players and Products in Virtualization This list below represents the major players in virtualization: VMware: “Provides hardware emulation virtualization products called VMware Server and ESX Server “(Bernard Golden, 2007). Xen: “A new open source contender. Provides a paravirtualization solution. Xen comes bundled with most Linux distributions.”- (Bernard Golden, 2007). XenSource: “Provides products that are commercial extensions of Xen focused on Windows virtualization. “(Bernard Golden, 2007). OpenVZ: “An open source product providing operating system virtualization. Available for both Windows and Linux.” - (Bernard Golden, 2007). SWsoft: “The commercial sponsor of OpenVZ. Provides commercial version of OpenVZ called Virtuozzo.” (Bernard Golden, 2007). OpenSolaris: “The open source version of Sun’s Solaris operating system provides operating system virtualization and will also provide Xen support in an upcoming release.” - (Bernard Golden, 2007). 2.10 Virtualization Project Steps Once we have evaluated virtualization and its benefits we can then implement a virtualization project. A virtualization project can be implemented using these five steps: Evaluate the workload of our current server - 15 We check to see if virtualization can benefit us in regards to cost, management and maintenance. Define our system architecture. What type of virtualization would we use, and what kind of use case would we need to support Select your hosting hardware and virtualization software. We evaluate the capabilities of the virtualization software to ensure that it supports the use case selected by us. Migrate the existing servers to the new environment. Ensure if the new migration products can help us move our systems or if we need to move them manually. Administer your virtualized environment. Check if the tools for virtualization product management are sufficient for our needs or whether we should work with more general system management tools to monitor our environment (Bernard Golden, 2007). 2.11 Hypervisor: XEN In this section, hypervisor “Xen” will be discussed, since this will be used to create two virtual machines. The Xen hypervisor is the most secure and the fastest infrastructure virtualization solution currently available. It supports a wide range of operating systems including Linux, Solaris and Windows. A thin software layer known as the Xen hypervisor is inserted with the Xen Virtualization between the hardware of the server and the Operating System (xen.org, 2011). Thus, an abstraction layer is provided that allows each physical server to run effectively one or more "virtual servers". The reason why Xen hypervisor was chosen to other hypervisors was because Xen has a “thin hypervisor” model. It has no device drivers and keeps guests isolated, 2 MB executable and its functionality relies on service domains. Currently Xen hypervisor version 4.0.1 is installed on machine “testgrid3” as shown below `[email protected] /opt/images/user-images/sc10bdg % dmesg | grep Xen\ version [ 0.000000] Xen version: 4.0.1 (preserve-AD) (dom0) The Xen hypervisor consists of (1) Xen.conf: A Xen configuration file which can be modified and 2) Root.img: contains the disk image. We modify the Xen configuration file and point the kernel setup to the directory in which the kernel image is located. We also need to change location of the disk image and the memory. 
Once the changes are made we save the Xen configuration file with a unique name (xen.org, 2011). In this research project we will be creating three - 16 virtual machines and thus have used three configurations and image files. The setup for the three configuration files are given as below: `[email protected] /opt/images/user-images/sc10bdg % ls -ltr -rw-r--r-- 1 sc10bdg sc10bdg 238 Jul 4 14:13 hadoop.cfg -rw-r--r-- 1 sc10bdg sc10bdg 240 Jul 8 23:42 hadoop1.cfg -rw-r--r-- 1 sc10bdg sc10bdg 240 Jul 8 23:42 hadoop2.cfg -rw-r--r-- 1 sc10bdg sc10bdg 23068672000 Jul 26 02:21 hadoop1.img -rw-r--r-- 1 sc10bdg sc10bdg 23069720576 Jul 26 02:17 hadoop.img -rw-r--r-- 1 sc10bdg sc10bdg 23068672000 Jul 26 02:21 hadoop2.img `[email protected] /opt/images/user-images/sc10bdg % cat hadoop.cfg kernel = "/boot/vmlinuz-2.6.32.24" memory = 512 name = "hadoop" vif = ['mac=00:03:0a:00:0A:02',] disk = ['tap2:tapdisk:aio:/opt/images/user-images/sc10bdg/hadoop.img,xvda,w'] root = "/dev/xvda" extra = "fastboot console=hvc0 xencons=tty" `[email protected] /opt/images/user-images/sc10bdg % cat hadoop1.cfg kernel = "/boot/vmlinuz-2.6.32.24" memory = 512 name = "hadoop1" vif = ['mac=00:03:0a:00:0A:03',] disk = ['tap2:tapdisk:aio:/opt/images/user-images/sc10bdg/hadoop1.img,xvda,w'] root = "/dev/xvda" extra = "fastboot console=hvc0 xencons=tty" `[email protected] /opt/images/user-images/sc10bdg % cat hadoop.cfg kernel = "/boot/vmlinuz-2.6.32.24" memory = 512 name = "hadoop2" vif = ['mac=00:03:0a:00:0A:04',] disk = ['tap2:tapdisk:aio:/opt/images/user-images/sc10bdg/hadoop2.img,xvda,w'] root = "/dev/xvda" extra = "fastboot console=hvc0 xencons=tty" We can start the images by running the command “xm create hadoop” and “xm create hadoop1” and “xm create hadoop2” (xen.org, 2011). To check if the virtual machines are running we run the below command. - 17 `[email protected] /opt/images/user-images/sc10bdg % sudo xm list Name ID Mem VCPUs State Time(s) hadoop 21 512 1 -b---- 17.1 hadoop1 22 512 1 -b---- 16.2 hadoop2 23 1024 4 -b---- 30.7 patho-sc10bdg 24 1024 4 -b---- 33.5 If we have to shut down the Virtual machines we execute the below commands `[email protected] /opt/images/user-images/sc10bdg % sudo xm shutdown 21 `[email protected] /opt/images/user-images/sc10bdg % sudo xm shutdown 22 `[email protected] /opt/images/user-images/sc10bdg % sudo xm shutdown 23 To check if the Virtual machines have shut down we execute the “xm list” command again (xen.org, 2011) `[email protected] /opt/images/user-images/sc10bdg % sudo xm list Name ID Mem VCPUs State Time(s) patho-sc10bdg 24 1024 4 -b---- 33.5 2.12 Characteristics of a cloud As in the next section we show what types of database applications could be considered for a cloud deployment. Therefore, in this section we first discuss few of the characteristics of a cloud computing that are most important. It is elastic, provided workload is parallelizable. One of the advantages of cloud computing is its ability to handle changing conditions. During seasonal or unexpected increase in demand for a product sold by an ecommerce company, or during a increase in growth phase for a social networking site, additional computational resources can be allocated to handle the increasing demands (Daniel, 2009). In this environment, we only pay for what one uses or needs, so we can obtain the increased resources to handle spikes in workload and then release the additional resources once the spike has subsided. This is also termed as pay as you go Ex: Metered Taxi‟s. 
However, additional computational resources are obtained by allocating additional server instances to a task. Amazon's Elastic Compute Cloud (EC2), for example, provides computing resources as small, large and extra-large virtual private server instances. If an application cannot take advantage of additional server instances by offloading some of its work to run in parallel, then the extra instances are of little help.

Storing data on an untrusted host. Moving data off an organization's or person's premises increases security risks, and appropriate precautions must be taken to handle this. Although the name "cloud computing" suggests that storage and computing resources are delivered from some indeterminate outside location, the data is in fact subject to the local rules and regulations of the country where it is hosted. In the United States, the US Patriot Act gives the government the right to demand access to data stored on any computer; if the data is hosted by a third party in the US, the data may need to be handed over without the client using the hosting service having any knowledge of it (Daniel, 2009). Since few cloud computing vendors give the client control over where data is stored, the customer must either accept this risk or encrypt the data, keeping the key outside the host, and place only the encrypted data on the host.

Replication of data. Availability, accessibility and durability of data are important features for cloud storage providers, as losing data or having it unavailable could affect a customer's business by missing the targets set in service level agreements (SLAs) and by damaging business reputation. Data availability is typically achieved through replication. Large cloud service providers, with data centers spread over the globe, deal with fault tolerance by replicating the data. Amazon's S3 cloud storage service replicates data across availability zones and regions so that data and applications are not affected even if an entire location fails (Daniel, 2009). The client should understand the details of the replication scheme carefully: Amazon's Elastic Block Store, for instance, replicates data only within the same availability zone and is therefore more prone to failure than Amazon's S3 cloud storage.

2.13 Data management applications

Looking at the above characteristics we can get an idea of what types of data management applications could be moved into the cloud. In this section we describe the suitability of moving transactional and analytical databases into the cloud.

2.13.1 Transactional data management

By "transactional data management" we refer to the databases that are the essential sustaining element of banking, airline reservation and online e-commerce applications. These applications tend to be fairly write-intensive, and unavailability of their databases can hamper the business, as most of them are mission-critical or business-critical applications (Daniel, 2009). Transactional data management applications are not a good option for cloud deployment, for the following reasons:

Transactional databases do not use a shared-nothing architecture. "The transactional database market is dominated by Oracle, IBM DB2, Microsoft SQL Server, and Sybase" (Olofson, 2006). Microsoft SQL Server and Sybase can be deployed using SNA. IBM released a shared-nothing implementation of DB2; it is designed to help scale the analytical applications running on data warehouses (Paul McInerney, 2011). "Oracle has no shared-nothing architecture. Implementing a transactional database system using a shared-nothing architecture is non-trivial, since data is partitioned across sites and, in general, transactions can not be restricted to accessing data from a single site. This results in complex distributed locking, commit protocols, and in data being shipped over the network leading to increased time delay and network bandwidth problems. Furthermore the main benefit of a shared-nothing architecture is its scalability" (Daniel, 2009). However, this feature is not very relevant for transactional data processing, as the majority of deployments are less than 1 TB in size.

Shared-nothing architecture. A shared-nothing architecture is a distributed computing architecture consisting of multiple nodes, each with its own private memory, input/output devices and disks, independent of any other machine in the network. Each machine is self-sufficient and shares nothing across the network. This type of system has become popular and is highly scalable (Akshaya Bhatia, 2011).

There could be high risks in storing transactional data on an untrusted host. These databases contain the operational data needed to power business-critical and mission-critical business processes. This data often includes personal information such as customer details, as well as sensitive information like credit card numbers, and no security breach or privacy violation is acceptable. Therefore, transactional data management applications are not well suited for cloud deployment (Daniel, 2009). "Though, there are a couple of companies that will sell you a transactional database that can run in Amazon's cloud: EnterpriseDB's Postgres Plus Advanced Server and Oracle. However, there has yet to be any published case studies of customers successfully implementing a mission critical transactional database using these cloud products and, at least in Oracle's case, the cloud version seems to be mainly intended for database backup" (Monash, 2008).

2.13.2 Analytical data management

By "analytical data management" we refer to databases that are queried for business planning, analysis, problem solving, and decision support. The analysis typically involves historical data from various operational databases (Daniel, 2009). The scale of an analytical data management system is generally larger than that of a transactional system: while the scale of a transactional database is about 1 TB, analytical systems are crossing the petabyte barrier (Monash, 2011). "Furthermore, analytical systems tend to be read-mostly (or read-only), with occasional batch inserts. Analytical data management consists of $3.98 billion" (Vesset, 2006) "of the $14.6 billion database market" (Olofson, 2006) "and is growing at a rate of 10.3% annually" (Vesset, 2006). In this section we see why analytical data management systems are well suited to run in a cloud environment.

Shared-nothing architecture is good for analytical data management. "Teradata, Netezza, Greenplum, DATAllegro (recently acquired by Microsoft), Vertica, and Aster Data all use a shared-nothing architecture (at least in the storage layer) in their analytical DBMS products, with IBM DB2 and recently Oracle also adding shared-nothing analytical products.
The ever increasing amount of data involved in data analysis workloads is the primary driver behind the choice of a shared-nothing architecture, as the architecture is widely believed to scale the best” -(Daniel, 2009). Workloads involved in data analysis normally consist of star-schema joins and multidimensional aggregations, which are easy to parallelize across machines in a shared-nothing network. Complex distributed locking and commit protocols are avoided because writes are infrequent (Daniel, 2009).
Sensitive data could be left out. In certain scenarios we can identify the data that could cause damage if accessed by a third party. Once such data has been identified, we could either remove it from the analytical data store or include it only after encrypting it.
Looking at the above characteristics, it can be concluded that analytical data management applications are well suited for cloud deployment. The cloud could be a deployment option for medium-sized businesses, especially those that do not currently have a data warehouse because of the high capital expenditure involved, and for sudden projects that arise from changing business requirements (Daniel, 2009).
2.14 Analysis of Data in the Cloud
As we can see from the above, analytical database systems are the ones best placed to move into the cloud. We will focus on one class of software solution: MapReduce-like software. Before looking at this in detail, we consider some characteristics and features that such solutions should have, based on the cloud DBMS characteristics below.
Cloud DBMS characteristics
Efficiency: The cost of cloud computing is structured in such a way that you pay only for what you use; the price increases linearly with storage, network usage and computational power. Hence, if data analysis software product ABC needs more computation units than software product XYZ to perform the same task, then product ABC will cost more than XYZ (Daniel, 2009).
Fault tolerance: Fault tolerance for analytical data workloads is measured differently from fault tolerance for transactional workloads. For read-only queries there are no write transactions to commit and no updates to lose when there is a node failure. A fault-tolerant analytical DBMS is therefore simply one that does not have to restart a query if one of the nodes involved in query processing fails (Daniel, 2009).
Should run on a heterogeneous environment: The performance of the nodes in a cloud is often neither uniform nor consistent; some nodes perform worse than others. There are numerous reasons why this might occur, one of them being hardware problems leading to performance degradation of a node (Daniel, 2009). If the total amount of work needed to run a query is divided equally among the cloud nodes, there is a concern that the time taken to complete the query will equal the time taken by the slowest node to complete its assigned task. A slow node would in turn affect total query performance. Therefore, a system designed to run in a heterogeneous environment should take appropriate measures to prevent this from occurring.
2.15 Analyzing the Data using Hadoop Framework
Hadoop is an open source Apache project written in Java. It provides users with a distributed file system and a method for distributed computation.
It is based on Google's File System and MapReduce concepts, which describe methods to build a framework capable of executing intensive computations across a number of machines (Michael G. Noll, 2011). Although it can be used on a single machine, its actual power lies in its ability to work with hundreds or thousands of machines, each with several processor cores, and to distribute workload across those machines effectively. Hadoop is built to process large volumes of data (hundreds of gigabytes, terabytes or petabytes). Hadoop therefore includes a distributed file system which divides the input data and sends sub-parts of the original data to different machines in the cluster to store. The problem is thus processed in parallel by the machines in the cluster and the results are computed as efficiently as possible. However, whenever multiple cooperating machines are in use, the probability of failures rises. Hadoop is designed to handle data congestion issues and hardware failures robustly.
2.16 Cloud Test Bed
The School of Computing hosts a cloud test bed comprising 8 machines. It is used for research on some of the open questions in the field of distributed and cloud computing. The cloud is firewalled and can be accessed from within the School of Computing using SSH.
Figure 7: An example of 2 nodes in the cloud showing the layered architecture. (Source: University of Leeds, School of Computing)
The specifications of each node machine are: CPU: Quad Core 2.83 GHz; Memory: 4 GB RAM; NIC: 1 GBit.
We can access the cloud test bed via SSH using a login and a password created by the cloud administrator. The work required in the project can be done using command line tools, and access to these tools was made available. To learn how to use the cloud test bed, training was given for a certain period. Additionally, a simple tutorial was provided by the administrator of the cloud containing the basic commands and steps needed to create a virtual machine using the Xen hypervisor. The tutorial also included information on where the Xen documentation could be found for additional commands that can be used on the cloud test bed. A virtual machine (10.0.10.2 / Debian02) was initially created using the Xen hypervisor and accessed using SSH from within the cloud test bed.
Example:
Access the cloud test bed: ssh [email protected] and the password given by the administrator
Access the virtual machine: ssh [email protected] and the password
A template file was created that included the information required to provision a virtual machine onto the cloud from a specified Debian image. Multiple template files were created so that multiple virtual machines could be deployed onto the cloud.
2.17 Related Work
A few related works are presented below:
A successful implementation of Hadoop has been reported for scalable image processing of astronomical images. Astronomical surveys of the sky generate tens of terabytes of images every night. The study of these images involves computational challenges, and these studies benefit from the highest-quality data. Given the quantity of data and the computational load, these problems can be addressed only by distributing the volume of work over a large number of nodes or machines. The report concluded that the use of Hadoop showed great improvement for processing large volumes of astronomical datasets.
On a 400-node cluster they were able to process 100,000 files with approximately 300 million pixels in just three minutes (Keith, 2011). Another successful implementation of MapReduce used for data intensive scientific analyses. As scientific data analysis deals with volumes of data efficient concurrent and parallel programming are the key for handling such data. The project concluded that scientific data could achieve scalability and speed - 23 up using the Map Reduce Technique. It also stated that applications that are tightly coupled could benefit using the mapreduce technique if the data size used is appropriate and the overhead introduced by a particular runtime diminishes as the amount of data and the computation increases (Jaliya, 2008). Another successful implementation is in the field of Biology where scientific data is processed on a cloud using Hadoop. The implementation develops a Hadoop-based cloud computing application that processes sequences of microscope images of live cells. The cloud solution was attractive because it took advantage of many of the desirable features offered by the cloud concept and Hadoop, including scalability, reliability, fault-tolerance, easy deploy ability, etc (Chen, 2010). In one of the paper, by Tomi Aarnio, from the Helsinki University of Technology on Parallel data processing with MapReduce stated that when using MapReduce in a cluster the computation power is improved by adding new nodes to the network. Thus with more nodes there is more parallel task. It also mentioned how code written becomes easier, simpler and smaller to maintain and understand as the user represents the problem using only the Map and the Reduce function (Tomi, 2009). In another paper "Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce" described a framework to store and retrieve large RDF triples using hadoop. The HDFS and MapReduce framework was efficiently used to store and retrieve the RDF data. The results showed that huge data can be stored on Hadoop clusters built on cheap commodity hardware can retrieve data fast enough (Mohammed, 2009). - 24 - Chapter 3 3. Case Study: Pathology Application 3.1 General Description of Pathology data Processing There exists a current project in the University of Leeds that relates to the reconstruction of 3D volumes from 2D stained serial sections using image based registration at multiple scales. The novelty of the project is the use of the full (cellular) resolution and full extent of the images to perform accurate registration while taking into account local tissue imperfection like tearing. Over the last two and a half years the group have developed a tool that is now in regular use within the Leeds by various researchers, and is usable by a trained lab technician (Djemame, 2010) but the disadvantage of the current system is that the application runs on a single machine and that data processing could end up taking a lot of time because of memory limitations of a single computer. We consider the development of a scalable and reliable cloud system using HADOOP Framework to see if using this method we can process the data efficiently. a) b) c) d) a) The existing application for slice alignment showing reconstruction of ~200 serial sections through a mouse embryo, b)-d) various methods of visualizing the reconstructed volumes – 3D rendering in color of the whole - 25 volume, arbitrary 2D sections drawn through the entire volume, and visualization of anatomic structures segmented (i.e. 
visualized with a surface) in 3D. 3.2 Accessing the Application In the non-cloud version the client or the user requests the patient data sending a „HTTP‟ request to the image server that resides at the St. James hospital which consists of the pathology images like tissue images and the server then sends back the results based on the parameters to the user. We can access the pathology application that was made available for the project and can be accessed as below `[email protected] /opt/images % ssh [email protected] Once we are in the application server under the /opt directory we have the details of the pathology application and the image server `-patho01 ~ # cd /opt `-patho01 /opt # ls -ltr drwx------ 6 root root 4096 2011-07-29 18:32 patho drwx------ 9 root root 20480 2011-07-29 19:40 imageserver We can start the Aperio server by executing an already existing shell script to start the server as below `-patho01 /opt # `-patho01 /opt # cd imageserver `-patho01 /opt/imageserver # ls -ltr *run_server* -rwx------ 1 root root 632 2010-12-08 00:20 run_server.sh `-patho01 /opt/imageserver # ./run_server.sh Waiting for server to start... Server started Server has port The images are stored in the /slides directory of the image server. Below are the details of the images being used to obtain the pathology data. `-patho01 /opt/imageserver/slides # ls -ltr total 1919892 -rw------- 1 root root 177552579 2009-12-29 15:01 CMU-1.svs -rw------- 1 root root 390750635 2009-12-29 15:06 CMU-2.svs -rw------- 1 root root 253815723 2009-12-29 15:16 CMU-3.svs - 26 -rw------- 1 root root 1141873193 2010-11-16 15:25 117473.svs Once the Image server has started we execute a sample query to run the application and the collect the output (data) it gives. In this project, we will be working with this data obtained. The main core of the application is the program “deconvolution-model” that calculates the No. of Nuclei and its coordinates given an area of an Image. `-patho01 /opt/patho # ls –ltr -rwx------ 1 root root 227 2010-11-23 15:28 deconvolution-model.sh `-patho01 /opt/patho # ./deconvolution-model.sh Native Image size : 93120 x 60946 Native zoom: 40 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.735899,0.486387] [0.349641,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 5 346.448,439.332 46.8712,102.746 369.361,7.27193 230.885,197.594 138.333,128.117 In this research project, we will try to describe the advantages of using Hadoop and the Map Reduce programming model on the pathology data to show how Megabytes and Gigabytes of such data can be processed effectively and scalably using the Hadoop Framework and evaluate its performance. - 27 - Chapter 4 4. HADOOP Approach Hadoop is a distributed system which is fault-tolerant and is used for storing data and is also highly scalable. The scalability is due to Hadoop distributed file system which is a clustered storage of high bandwidth and also because of Map Reduce fault tolerant distributed processing. It analyzes and processes variety of older and newer data to obtain business operations. In most cases, data is moved to the node that performs the computation. In Hadoop processing of the data is done where the data resides. The Hadoop cloud or cluster in data center is disruptive. One of the major advantages of Hadoop is that the jobs can be submitted within the datacenter orderly (hadoop.apache.org, 2011). 
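The passage quoted next (from the Yahoo! Hadoop tutorial) illustrates why disk bandwidth, not CPU, becomes the bottleneck at this scale. As a quick sanity check, the arithmetic behind the quoted figures can be reproduced in a few lines of Python; this is only a back-of-the-envelope sketch using the numbers stated in the quote (100 MB/s per I/O channel, four channels on a single machine, 100 machines with two channels each, a 4 TB data set):

# Back-of-the-envelope check of the read-time figures quoted below.
TB = 10 ** 12            # bytes in a terabyte (decimal units)
MB = 10 ** 6             # bytes in a megabyte

dataset = 4 * TB                      # 4 TB input data set
per_channel = 100 * MB                # optimistic read speed per I/O channel (bytes/s)

single_machine = 4 * per_channel      # one machine, four independent I/O channels
cluster = 100 * 2 * per_channel       # 100 machines, two I/O channels each

print("one machine : %.0f s (~%.1f hours)" % (dataset / single_machine,
                                              dataset / (single_machine * 3600.0)))
print("100 machines: %.0f s (~%.1f minutes)" % (dataset / cluster,
                                                dataset / (cluster * 60.0)))

The sketch gives roughly 10,000 seconds (about three hours) on one machine and about 200 seconds (around three minutes) on 100 machines, which is the comparison made in the quotation below.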
"Even if hundreds or thousands of CPU cores are placed on a single machine, it would not be possible to deliver input data to these cores fast enough for processing. Individual hard drives can only sustain read speeds between 60-100 MB/second. These speeds have been increasing over time, but not at the same breakneck pace as processors. Optimistically assuming the upper limit of 100 MB/second, and assuming four independent I/O channels are available to the machine, that provides 400 MB of data every second. A 4 terabyte data set would thus take over 10,000 seconds to read--about three hours just to load the data! With 100 separate machines each with two I/O channels on the job, this drops to three minutes." - (developer.yahoo.com, 2011).
Hadoop processes large amounts of data by connecting many commodity computers together and making them work in parallel. A theoretical 100-CPU machine would cost a very large amount of money; it would in fact be costlier than 100 single-CPU machines. Hadoop instead ties together smaller, more reasonably priced computers to form a single cost-effective compute cluster. Computation on large amounts of data has been done before in distributed settings; what makes Hadoop unique is its simplified programming model. When data is loaded into a Hadoop cluster it is distributed to all the machines of the cluster. The Hadoop Distributed File System (HDFS) splits large data files into parts which are managed by different machines in the cluster. Each part is replicated across several machines, so that a single machine failure does not result in any data being unavailable (developer.yahoo.com, 2011).
In the Hadoop programming framework data is record oriented. Individual input files are broken into records in a format specific to the application logic. Subsets of these records are then processed by each process running on a machine in the cluster. Using knowledge from the DFS, these processes are scheduled by the Hadoop framework based on the location of the record or data. The files are spread across the DFS as chunks and are processed by the process running on the node that holds them. The Hadoop framework thereby prevents unwanted network transfers and strain on the network, since data is read from the local disk directly into the CPU. Thus, Hadoop can achieve high performance through data locality, i.e. its strategy of moving the computation to the data (hadoop.apache.org, 2011).
Figure 8: Data is distributed across nodes at load time. (Source: http://developer.yahoo.com/hadoop/tutorial/module1.html)
Hadoop has been shown to work on clusters consisting of 4,000 nodes. Sort performance on 900 nodes is good: sorting 9 TB of data on 900 nodes takes around 1 hour 40 minutes, and this can be improved using the non-default configuration values given below (hadoop.apache.org, 2011).
dfs.block.size = 134217728
dfs.namenode.handler.count = 40
mapred.reduce.parallel.copies = 20
mapred.child.java.opts = -Xmx512m
fs.inmemory.size.mb = 200
io.sort.factor = 100
io.sort.mb = 200
io.file.buffer.size = 131072

Sort performance on 1,400 and 2,000 nodes is also good: sorting 14 TB of data on a 1,400-node cluster takes 2.2 hours, while sorting 20 TB of data on a 2,000-node cluster takes 2.5 hours (hadoop.apache.org, 2011), after updating the configuration files to the setup below:

mapred.job.tracker.handler.count = 60
mapred.reduce.parallel.copies = 50
tasktracker.http.threads = 50
mapred.child.java.opts = -Xmx1024m

4.1 MapReduce
Programs in Hadoop must be written using a particular programming model, "MapReduce", and running MapReduce programs requires a successfully configured Hadoop environment. Large volumes of data are computed in a parallel fashion using MapReduce programs; the workload is thus divided across clusters of machines. In MapReduce, data elements cannot be updated in place, i.e. if a mapping task tries to change an input (key, value) pair, the change is not reflected in the input files. If such a change is required, we generate new output (key, value) pairs and forward them to the next phase of the execution by the Hadoop system (hadoop.apache.org/mapreduce, 2011). In MapReduce, records are processed by tasks called Mappers. The output generated by the Mappers is brought together by a second set of tasks called Reducers, where the results from the different Mappers are merged together.
Figure 9: Map Reduce programming Model (Source: http://map-reduce.wikispaces.asu.edu/)
Analogy to SQL:
Map is the GROUP BY clause of an aggregate query.
Reduce is an aggregate function computed over all rows with the same GROUP BY attribute.
The concept also works like a Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle and Sort | Reduce | Output
"MapReduce works by breaking the processing into two phases: the Map phase and the Reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function. The input to our map phase is the raw data. The map function is simple. We pull out the fields we are interested in. So, the map function is just a data preparation phase, setting up the data in such a way that the reducer function can do its work on it. The map function is also a good place to drop bad records: here we filter out data that are missing, suspect, or erroneous" - (Tom White, 2009).
MapReduce is a good fit for a lot of applications, such as log processing, web index building, image processing, etc. (Miha, 2011).
As discussed above, the first phase of a MapReduce program is termed mapping. A list of data elements is given, one at a time, to the Mapper function; the Mapper then individually transforms each element into an output data element.
Figure 10: Different colors represent different keys. All values with the same key are presented to a single reduce task. (Source: http://developer.yahoo.com/hadoop/tutorial/module4.html)
Figure 11: Mapping creates a new output list by applying a function to individual elements of an input list. (Source: http://developer.yahoo.com/hadoop/tutorial/module4.html)
As an example of map, suppose we had a function toUpper(str) that returns an uppercase copy of its input string.
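In ordinary Python, the same two ideas can be written down directly. The sketch below is only a conceptual illustration, not Hadoop code: it uses the toUpper function just mentioned and, as the reducing function, the "+" operator discussed in the following paragraph.

# Conceptual map and reduce on ordinary Python lists (not a Hadoop job).

def to_upper(s):
    # toUpper(str): return an uppercase copy; the input string is not modified
    return s.upper()

words = ["hadoop", "hdfs", "mapreduce"]

# Mapping: apply the function to each element, producing a new output list.
upper_words = [to_upper(w) for w in words]
print(upper_words)          # ['HADOOP', 'HDFS', 'MAPREDUCE']

# Reducing: combine a list of values into one summary value, here with "+".
values = [4, 5, 6]
total = 0
for v in values:
    total = total + v
print(total)                # 15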
We could use this function with map to turn a list of strings into a list of uppercase strings. We are not modifying the input string here but are returning a new string that will form part of the new output list. Reducing combines values together. A reducer function gets an iterator of input values from the input list. It then combines the values together, returning a single output value. Reducing is basically used to produce a summary of the data basically, turning a large volume of data into a smaller summary of the data. Like, a "+" could be used as a reducing function that would return a sum of a list of input values (Yahoo! Inc, 2011). Figure 12: Reducing a list iterates over the input values to produce an aggregate value as output. (Source: http://developer.yahoo.com/hadoop/tutorial/module4.html) To process large volumes of data the Hadoop MapReduce framework takes the map and the Reduce concept. The two main components of the MapReduce Program are the ones that implement the Mapper and the Reducer program. In MapReduce each value is associated with a key which identifies related values. The “map” and the “reduce” functions just don‟t receive values but a (key, value) pair. The output of these functions should also emit both key and a value. A Mapper can map a single input into one or many outputs and a reducer can compute an input list and emit a single output or many different outputs. All the values having the same key are sent to a single reducer and this is performed independently (Yahoo! Inc, 2011). 4.2 MapReduce Execution Overview The overall flow of a MapReduce operation is shown by the illustration below. The following sequence of actions occurs when the Mapper function is called by the user program. - 32 - Figure 13: Map Reduce Execution Overview (Source: http://hadoop.apache.org/) Input files: The data is initially stored here for a MapReduce task mostly the files reside in Hadoop Distributed File System (HDFS). The format of the input files is arbitrary and they could be in binary format or input records that are multi-lined. The input files are generally very large having their size in gigabytes or petabytes (Yahoo! Inc, 2011). Mapper: Given the key and a value the map function generates a (key, value) according to the user-defined program and forwards the output to the reducer. Each map task initiates an instance of Mapper as a separate java process. Each Mapper cannot intentionally communicate with one another allowing reliability such that every map task is governed entirely by the local machine's reliability (Yahoo! Inc, 2011). Partition & Shuffle: Shuffling is the processes of collecting the Map outputs and moving it to the reducer. The values having the same key are reduced together and don‟t depend on which Mapper it originates from (Yahoo! Inc, 2011). Sort: The responsibility of the reduce task is to reduce the values that are associated with many intermediate keys. Before being presented to the Reducer the intermediate keys on a node is sorted automatically (Yahoo! Inc, 2011). - 33 Reduce: Each reduce task creates a reducer instance. The reduce method is called once for each key it receives a key and an iterator over the values of the respective key. The key and the value associated with it are returned by the iterator (Yahoo! Inc, 2011). Output Format: The (key, value) pairs output of the reduce phase is written to output files. 
These files are then available on the Hadoop distributed file systems and could be either used by other MapReduce job or for inspection or analysis (Yahoo! Inc, 2011). Figure 14: Map Reduce Programming Flow (Source: (Yahoo! Inc, 2011).) 4.3 Working of Map and Reduce Programming model with an example Let us consider we want to see the maximum temperature of a given year using the MapReduce framework below is an example to go about with the solution. Consider the following line of input data as an example 0067112990999991986051507004...9999999N9+000022+99999999999... This line is presented to the map function in the form of key-value pairs: - 34 (0, 0067112990999991986051507004...9999999N9+000022+99999999999...) here „0‟ is the key. Within the file the keys are the line offsets. Given the line the Map function extracts the required fields and emits them as output. In this example we consider the weather data and we would like to find the maximum temperature in a given year. So, our fields of interest are year and temperature and thus the Map Function output is in the below form (Tom White, 2009). (1986, 0) (1986, 22) (1986, −11) (1987, 111) (1987, 78) Once we have the output from the map function before being sent to the reduce function the data is processed by the MapReduce framework. The processing involved sorts and groups the key-value pairs based on the key. The values are in the form of lists. So our reduce function in our example sees the following input: (1987, [111, 78]) (1986, [0, 22, −11]) Thus the reduce function now has to iterate through the list and pick up the maximum temperature from the list. So from our above example the reduce function output would be the below: (1987, 111) (1986, 22) The entire data flow is illustrated in the figure below Figure 15: Map Reduce Flow Example (Source: Hadoop: The Definitive Guide MapReduce for the Cloud - by Tom White) So we basically need three things: (1) a map function, (2) a reduce function, and (3) some code to run the job. 4.4 The Hadoop Distributed File System (HDFS) Hadoop distributed file systems provides a high throughput access to data of an application and is suitable for applications that need to work with large data sets. It is designed to hold terabytes or petabytes of data and - 35 provides higher throughput access to this data. Files containing data are stored redundantly across number of machines for higher availability and durability to failure. In this section we will discuss on the design of the distributed file system and how we could operate it. It is designed in such a way so that it is more robust (Yahoo! Inc, 2011). 1) HDFS makes sure that there is data reliability so if one or many machines in the cluster malfunction it should make data available. 2) HDFS should make sure that it can provide scalable and fast access to data. It should make sure it can serve as many as number of clients by increasing the number of machines in the cluster. 3) HDFS should make sure that it integrates properly with Hadoop Map Reduce by allowing data or information to be computed locally when possible. HDFS has a design that is based on the Google File System (GFS) design. It has a file system that is block structured and the individual files are broken into fixed size blocks. The blocks are then distributed and stored across the cluster of machines having data storage capacity. Each machine in the cluster is referred to as DataNode. 
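The block arithmetic behind this layout is easy to make concrete. The short sketch below is illustrative only: the 64 MB block size is HDFS's default (mentioned in the next paragraph), the replication factor of 2 matches Figure 16, and the 1 GB file size is a made-up example.

# Illustrative arithmetic: how HDFS splits a file into fixed-size blocks.
import math

block_size = 64 * 1024 * 1024          # default HDFS block size: 64 MB
replication = 2                        # replication factor used in Figure 16
file_size = 1024 * 1024 * 1024         # hypothetical 1 GB file

blocks = int(math.ceil(file_size / float(block_size)))
print("blocks in the file         : %d" % blocks)                   # 16
print("block replicas on DataNodes: %d" % (blocks * replication))   # 32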
A file consists of several blocks, and they may or may not be necessarily stored on the same node; the destination or target machine that hold each block of the file are chosen randomly on a block by block basis. If several machines or nodes are involved in serving a file, then the file would be unavailable if one of the machines in the cluster crashes. HDFS handles this problem by replicating each block across a number of nodes in the cluster it. Figure 16: Data Nodes holding blocks of multiple files with a replication factor of 2. The NameNode maps the filenames onto the block IDs. (Source: Yahoo! Inc, 2011) Most file system that is block-structured use a block size of 4 or 8 KB but by default HDFS uses a block size of 64MB.The files that are stored in HDFS are not part of the ordinary file system. If we type a “ls” on DataNode demon running machine it will display the contents of the Linux file system but will not include any the HDFS stored files to display. For HDFS stored files we need to type “bin/hadoop dfs –ls” the reason for this is HDFS - 36 has a separate namespace, isolated from the local file contents. The files inside HDFS are managed and stored by the DataNode service and are named with block ids. It is not possible to interact with files stored in HDFS using the Linux file commands like ls -ltr, ls, mv, cp. etc (Tom White, 2009). HDFS has its own utilities for managing file which act very similar to the ordinary Linux file system like bin/hadoop dfs –ls to list the files and bin/hadoop dfs –rmr /home/hdusr/patho.txt to remove a file from the HDFS file system. The model in which file data is accessed is in a write once and read many times, the file metadata structures i.e the information in regards to the names of files and the directories can be modified concurrently by number of clients. Thus, it is important that metadata information is not desynchronized. The metadata information is thus handled by a single machine i.e the NameNode. It stores the metadata for the file system. As the metadata per file is relatively low, this information can be stored in the Namenode Machines main memory, allowing the metadata to be accessed faster (hadoop.apache.org/hdfs/, 2011). In a scenario, where a file needs to be opened, the NameNode is contacted by the client and gets a list containing the locations of the blocks that comprise the file. The locations that are identified are the DataNodes that hold each block. Files are then read by the clients directly from the DataNode servers, in parallel. NameNode is not involved in this bulk data transfer. There are multiple redundant systems that help to preserve Namenode file systems metadata if the Namenode fails or crashes irrecoverably. The Namenode failure is severe for the cluster than if a DataNode fails or crashes. The entire cluster will still continue to operate if an individual DataNodes crashes or fails but the loss of the NameNode would stop the working of the cluster and the cluster will remain inaccessible until it is restored manually. Figure 17: HDFS Architecture (http://hadoop.apache.org/core/docs/current/hdfs_design.html) - 37 - 4.5 Why use Hadoop and MapReduce? Apart from MapReduce (Ghemawat, 2008) and the related software, open source Hadoop (hadoop.apache.org, 2011) there are other tools that are used to automate the parallelization of large data analysis and few of them are, useful extensions (Olston, 2008), and Microsoft‟s Dryad/SCOPE stack (Chaiken, 2008). 
MapReduce is one of the most useful tools in the cloud for performing data analysis (lexemetech.com, 2008). Google's MapReduce, Microsoft's Dryad and Clustera all investigate distributed programming and execution frameworks. MapReduce aims at simplicity, while Dryad provides generality but makes programs more complex to write. Clustera is similar to Dryad but uses a different scheduling mechanism. MapReduce has been used for a variety of applications and has proved successful (Ghemawat, 2004), (Catanzaro, 2008), (Elsayed, 2008). The following characteristics explain why the MapReduce programming model is a useful tool for performing data analysis:
Fault tolerance: Dealing with fault tolerance is its highest priority. Data analysis jobs are divided into smaller tasks, and upon a failure the tasks from the failed machine are reassigned to another machine transparently. "In a set of experiments in the original MapReduce paper, it was shown that explicitly killing 200 out of 1746 worker processes involved in a MapReduce job resulted in only a 5% degradation in query performance." (Vaquero, 2010).
Run in a heterogeneous environment: MapReduce is also designed to run in a heterogeneous environment. As a MapReduce job nears completion, tasks that are still "in progress" are executed redundantly on other machines, and a task is marked as completed as soon as either the primary or the backup execution has completed. "In a set of experiments in the original MapReduce paper, it was shown that backup task execution improves query performance by 44% by alleviating the adverse affect caused by slower machines" (Vaquero, 2010).
Efficiency: MapReduce performance is highly dependent on the applications it is used for. For analysis of unstructured data, where a brute-force scan is the right execution strategy, it is likely to be a good fit (Vaquero, 2010).
4.6 Why Use Hadoop Framework for the Pathology Application Data?
Currently, the entire pathology data set is stored and processed on a single machine. Typically, a single machine has a few gigabytes of memory. If the input data were several terabytes, a hundred or more machines would be required to hold it in RAM, and even then no single machine would be able to process that huge chunk of data. Hard drives are much larger nowadays and a single machine can hold multiple terabytes of information on its hard drives, but the intermediate data sets generated while performing a large-scale computation can easily fill up more space than the original data set occupied. Another problem in a single-machine environment is failure: if the machine crashes, there is no way for the program to recover and the data is lost. We use Hadoop to handle the above problems, as it is designed to process large amounts of information effectively by connecting many commodity computers together to work in parallel. It ties smaller and more reasonably priced computers together into a single cost-effective compute cluster. Thus, it is a time- and cost-effective method for working with large data (Yahoo! Inc, 2011).
4.7 Hadoop Streaming
The Hadoop distribution comes with a utility called Hadoop Streaming. This utility allows us to create and run map and reduce jobs with any script or executable as the mapper and the reducer. To run a job with Hadoop Streaming we can use the following command:
$ bin/hadoop jar contrib/streaming/hadoop-version-streaming.jar
The above command with no arguments will only print some usage information.
An Example is shown below: $HADOOP_HOME/hadoop-streaming.jar \ -input myInputDirs \ -output myOutputDir \ -mapper /bin/cat \ -reducer /bin/wc Streaming allows programs that are written in any language to be used as Hadoop Mapper and a Hadoop Reducer implementations. Mappers and Reducers receive input on stdin and pass the output which is in the form of key, value pairs on stdout. In streaming, input and output are represented textually. The input is in the form of Key <tab> Value <newline> streaming then splits the input of such form based on the tab character for each line to obtain the key and the value. The output of the streaming program is also written to stdout in a similar format i.e key <tab> value <newline> The output from the mapper which is an input to the reducer are sorted such that the values are adjacent to one another for the same key. We can write our own scripts in python, bash, perl, or any other language but provided that the needed interpreter is present on each of the nodes in the cluster. Below is an example to run real commands on a single machine or a cluster: $ bin/hadoop jar contrib/streaming-hadoop-0.20.2-streaming.jar -mapper \ MapProgram -reducer ReduceProgram -input /some/dfs/path \ - 39 -output /some/other/dfs/path The above assumes that Mapper Program and Reducer Program are each present on every node in the cluster. If they are not present on the cluster nodes but is present on the node launching the job then the two programs can be shipped to the remaining nodes in the cluster with the option -file as shown below: $ bin/hadoop jar contrib/streaming-hadoop-0.20.2-streaming.jar -mapper \ MapProgram -reducer ReduceProgram -file \ MapProgram -file ReduceProgram There could be scenarios where one would like to process input data using only a single map function. To obtain this we need to set the "mapred.reduce.tasks" property in the mapred-site.xml to zero. The reducer tasks will then not be created in the mapreduce framework making the output of the mapper program as the final output. Hadoop Streaming supports the "-reduce NONE" option this is equivalent to "-jobconf mapred.reduce.tasks=0". - 40 - Chapter 5 5. Installation of HADOOP This section will describe steps for setting up a single-node Hadoop cluster and then extend the single node setup to a multi node Hadoop cluster setup in the next section of this chapter. After the installation some example programs will be run to check the setup (Yahoo! Inc, 2011). The implementation was done using the Hadoop version: Hadoop 0.20.2, released February 2010 1) Assuming that JavaTM 1.6.x is installed which can be checked by running the command java -version. We need to make sure that SSH is installed and is running. If we do not have it installed we need to install it. 2) Download a stable HADOOP release: Download the Hadoop 0.20.2 version from the Apache Download Mirrors and extract the contents of the Hadoop package under the location "/usr/local/ hadoop-0.20.2". Unpack the downloaded Hadoop Version using the command tar -xvf hadoop-0.20.2.tar.gz 3) Update the Hadoop-related environment variables under $HOME/.bashrc # Set Hadoop-related environment variables export HADOOP_HOME=/usr/local/hadoop-0.20.2 # Add Hadoop bin/ directory to PATH export PATH=$PATH:$HADOOP_HOME/bin 4) The environment variable we have to configure for Hadoop under /conf/hadoop_env.sh is the JAVA_HOME. Open “/conf/hadoop-env.sh” in an editor and set the JAVA_HOME environment variable to the below. # The java implementation to use. Required. 
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun export JAVA_HOME=/usr/lib/jvm/java-6-sun 5) We will now configure the directory “/conf” where Hadoop will store its data files, the network ports it listens to, etc. The setup will use Hadoop‟s Distributed File System, HDFS, even though we are currently working with Hadoop single node setup. The hadoop.tmp.dir variable has the directory set to “/app/hadoop/tmp”. We create the directory “/app/hadoop/tmp” and set the ownerships and permissions using “chmod 777 /app/hadoop/tmp” (hadoop.apache.org, 2011).Add the following lines between the <configuration> ...</configuration> tags in the configuration *-site. XML files as shown below. <configuration> - 41 <property> <name>hadoop.tmp.dir</name> <value>/app/hadoop/tmp</value> <description>A base for other temporary directories.</description> </property> <property> <name>fs.default.name</name> <value>hdfs://localhost:54310</value> </property> </configuration> <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration> <configuration> <property> <name>mapred.job.tracker</name> <value>localhost:54311</value> </property> </configuration> 6) The first step to start the Hadoop installation is by formatting the Hadoop distributed file system (HDFS), which is implemented on top of the local file system of our “cluster”. Currently, it includes only one single local machine. We need to do this the first time we set up a Hadoop cluster. debian02:/usr/local/hadoop-0.20.2/bin# /usr/local/hadoop-0.20.2/bin/hadoop namenode -format 11/07/01 16:18:29 INFO namenode.NameNode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = debian02/10.0.10.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010 ************************************************************/ 11/07/01 16:18:29 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted. - 42 11/07/01 16:18:29 INFO namenode.NameNode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at debian02/10.0.10.2 7) We now start our single-node cluster by debian02:/usr/local/hadoop-0.20.2/bin/start-all.sh This will start the Namenode, DataNode, Jobtracker, secondarynamenode and Tasktracker processes on our Hadoop cluster with a single Node. We can check if the expected Hadoop processes are running by executing the command “jps” at the terminal as shown below debian02:/usr/local/hadoop-0.20.2/conf# jps 7103 NameNode 7483 Jps 7198 DataNode 7289 SecondaryNameNode 7363 JobTracker 7447 TaskTracker We can also check with “netstat” if Hadoop is listening on the configured ports. debian02:/usr/local/hadoop-0.20.2/conf# netstat -plten | grep java 8) We will now run an example Hadoop MapReduce job. We will use the existing Word Count example job which reads the text files and counts how often words that are unique occur. The output is a text file where each line of the file contains a word and the count of how often it occurred, separated by a tab. We will use six e-books for this example: Thus, we download each e-book as text files in Plain Text UTF-8 encoding and store the files in a temporary directory of our choice. 9) Once the files are on our local file system we then copy the Local files to HDFS (Hadoop Distributed File System). 
Before we run the MapReduce job, we need to copy the files from our local file system to HDFS. debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -copyFromLocal /tmp/ebooks /user/hduser/ebooks We can check if the files are uploaded by executing the below command: debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -ls /user/hduser/ebooks Found 6 items -rw-r--r-- 1 root supergroup 336710 2011-08-11 23:33 /user/hduser/ebooks/pg132.txt -rw-r--r-- 1 root supergroup 581878 2011-08-11 23:33 /user/hduser/ebooks/pg1661.txt -rw-r--r-- 1 root supergroup 1916262 2011-08-11 23:33 /user/hduser/ebooks/pg19699.txt -rw-r--r-- 1 root supergroup 674566 2011-08-11 23:33 /user/hduser/ebooks/pg20417.txt -rw-r--r-- 1 root supergroup 1540059 2011-08-11 23:33 /user/hduser/ebooks/pg4300.txt - 43 -rw-r--r-- 1 root supergroup 384408 2011-08-11 23:33 /user/hduser/ebooks/pg972.txt 10) We now run the MapReduce job to test if our Hadoop Setup is correct. Below is the command which is used to run the Word Count job. debian02:/usr/local/hadoop-0.20.2# bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/ebooks /user/hduser/ebooks-output 11/08/12 02:08:57 INFO input.FileInputFormat: Total input paths to process : 6 11/08/12 02:08:58 INFO mapred.JobClient: Running job: job_201108120205_0001 11/08/12 02:08:59 INFO mapred.JobClient: map 0% reduce 0% 11/08/12 02:09:15 INFO mapred.JobClient: map 33% reduce 0% 11/08/12 02:09:31 INFO mapred.JobClient: map 66% reduce 11% 11/08/12 02:09:40 INFO mapred.JobClient: map 100% reduce 22% 11/08/12 02:09:46 INFO mapred.JobClient: map 100% reduce 33% 11/08/12 02:09:52 INFO mapred.JobClient: map 100% reduce 100% 11/08/12 02:09:54 INFO mapred.JobClient: Job complete: job_201108120205_0001 The command will read the files that are in the HDFS directory /ebooks, processes it, and stores the result in the HDFS directory /ebooks-output. We can check if the result is successfully stored in HDFS directory by having a look under the directory /user/hduser/ebooks-output 11) We then retrieve the result from HDFS or we can copy it from HDFS to the local file system using the below command. debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -cat /user/hduser/ebooks-output/part-00000 12) Once the jobs are run we stop our single-node cluster by running the command “./stop-all.sh” 5.1 From two single-node clusters to a multi-node cluster We will build a multi-node cluster using two Linux VM boxes i.e debian02 as the master and debian05 as the slave. The best way to do this is to install and configure hadoop 0.20.2 on each node and test the “local” Hadoop setup for each of the two Linux machines, and in a second step to combine these two single-node clusters into one multi-node cluster in which one Linux machine will become the master but will also act as a slave with regard to data storage and processing and the other box will become only a slave (hadoop.apache.org, 2011). - 44 - Figure 18: Set up of a Multi Node cluster (Source: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/) We now will configure one Linux machine as a master node (debian02) and the other Linux machine as a slave node (debian05). The master node (debian02) will also act as a slave because we are currently working with two machines in our cluster but still want to spread the data storage and processing to both the machines. 
Figure 19: multi-node cluster setup (Source: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/) The master machine (debian02) will run the “master” daemons: NameNode for the Hadoop distributed file system storage layer and the Job Tracker for the MapReduce processing layer. Both the nodes will run the “slave” daemons: DataNode for the Hadoop distributed File system layer and Task Tracker for MapReduce processing layer. The “master” daemons are the ones responsible for coordination and management of the “slave” daemons while the “slave” daemons will do the actual data storage and data processing work. - 45 The conf/masters file defines on which machines Hadoop will start the secondary Name Nodes in our multi-node cluster setup. In our setup this is just the master node i.e debian02. The primary Name Node and the Job Tracker will always be the master machine. We can start a Hadoop daemon manually on a machine via bin/hadoop-daemon.sh start [namenode | secondarynamenode | datanode | jobtracker | tasktracker], which will not take the conf/masters and conf/slaves files into account. In our case the master machine used is debian02 (10.0.10.2) and debian05 (10.0.10.5) and we run the bin/start-dfs.sh and then the bin/start-mapred.sh on the master machine. Update /conf/masters and /conf/slaves files accordingly on the master machine as shown below debian02:/usr/local/hadoop-0.20.2/conf# cat master 10.0.10.2 debian02:/usr/local/hadoop-0.20.2/conf# cat slaves 10.0.10.2 10.0.10.5 We then have to change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfssite.xml on ALL machines as follows. First, we have to change the fs.default.name variable (in conf/coresite.xml) which specifies the NameNode host (10.0.10.2) and port (54310). In our case, this is the master machine i.e debian02. Second, we have to change the “mapred.job.tracker” variable (in conf/mapredsite.xml) which specifies the Job Tracker host and port. Again, this is debian02 the master node in our case. Third, we change the dfs.replication variable in conf/hdfs-site.xml which specifies the default block replication. It states on how many machines a single file must be replicated. The default value of dfs.replication is 3. As we are currently using only two nodes, we set dfs.replication to 2.Once the *-site.xml files are modified in the debian02 machine we then go ahead and make similar changes in the /*-site.xml files in the slave machines (hadoop.apache.org, 2011). Again as in our single-node cluster setup we need to format the Hadoop distributed file system for the NameNode in our Multi Node setup and we do this in the master machine i.e debian02. We need to do this every time we set up a Hadoop cluster for the first time. We should never format a running Hadoop NameNode as this will erase the data in the HDFS and also corrupt the system. debian02:/usr/local/hadoop-0.20.2/bin# /usr/local/hadoop-0.20.2/bin/hadoop namenode –format We start the cluster in two steps. Firstly, the HDFS daemons are initiated i.e the NameNode daemon is started on the master machine i.e debian02 and DataNode daemons are started on the slave i.e on debian05. Secondly, the MapReduce daemons are initiated: the Job Tracker is started on master node, and Task Tracker daemons are started on the slave i.e debian05. 
The following Java processes should run on the master machine at this time: - 46 debian02:/usr/local/hadoop-0.20.2/conf# jps 6401 DataNode 6231 SecondaryNameNode 6928 Jps 6688 JobTracker 6778 TaskTracker 6042 NameNode and the following on slave. debian05:/app/hadoop/tmp/dfs/data/current# jps 4629 Jps 4419 DataNode 4546 TaskTracker Running an example Mapreduce Job: To test if the Hadoop cluster setup is correct we will run the Word count MapReduce job again . After downloading the e-texts, we have copied them to the HDFS, run the Word Count MapReduce job from the master machine, and retrieved the job result from HDFS debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -copyFromLocal /tmp/ebooks /user/hduser/ebooks debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -ls ebooks Found 6 items debian02:/usr/local/hadoop-0.20.2# bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/ebooks /user/hduser/ebooks-output1-deb02 11/08/12 00:18:14 INFO input.FileInputFormat: Total input paths to process : 6 11/08/12 00:18:15 INFO mapred.JobClient: Running job: job_201108120006_0001 11/08/12 00:18:16 INFO mapred.JobClient: map 0% reduce 0% 11/08/12 00:18:50 INFO mapred.JobClient: map 100% reduce 22% 11/08/12 00:18:59 INFO mapred.JobClient: map 100% reduce 100% 11/08/12 00:19:01 INFO mapred.JobClient: Job complete: job_201108120006_0001 Running the mapreduce job successfully states that there are no issues with the installation and configuration setup and thus confirming that the Installation was a success .We now have a running Hadoop cluster with two nodes. To stop the Hadoop cluster we first stop the HDFS daemons in the master machine and then stop the mapred daemons as shown below debian02:/usr/local/hadoop-0.20.2/bin# ./stop-mapred.sh debian02:/usr/local/hadoop-0.20.2/bin# ./stop-dfs.sh - 47 - Chapter 6 6. Implementation of MAP REDUCE programming model 6.1 General Description of the current data processing used by the Pathology Application The client executes the pathology application with certain parameters to get the number of nuclei of a given area of a pathology image and gets an output that consists of a set of parameters passed and the Number of nuclei. In this section, we will implement the MapReduce programming model to process this data containing the parameters and the Number of Nuclie and to check the benefits of using the MapReduce and the HDFS when there are number of such similar data and how the data are combined together to return a single output value. We process the data using two Phases i.e the Map Phase and the Reduce Phase bringing out the characteristics and features of a map reduce programming Model. 6.2 System Design The pathology data is to be first structured in a way as read in earlier chapter 4, so that, it can be processed by the MapReduce programming model. The output must be structured in such a way that it‟s in a Key, Value format. So, we first design a pre-processing script written in Python Programming Language to convert the Pathology data format shown in Dataformat 6.1 to a format shown in Dataformat 6.2 and write it to a single file. Once we get the format (key, Value) which can be processed by the Mapper/Reducer Program we load the processed file with the data to the Hadoop distributed file system. Once the file is in the HDFS we run the Mapper and the Reducer program using the Hadoop streaming method discussed in earlier chapter 4, to obtain a reduced output having the parameter and the Number of nuclei. 
How this data is obtained is discussed in Chapter 3, where the pathology program "deconvolution-model" is executed to obtain the pathology data containing the number of nuclei and its associated parameters for a given sub-area of an image.

Native Image size : 66000 x 45402
Native zoom: 20
Loading Deconvolution vectors: estimated_cdv.cdv
[0.471041,0.735899,0.486387]
[0.349641,0.78081,0.517771]
[0.7548,0.001,0.655955]
Channel 1
No.nuclei: 4

Dataformat 6.1: Pathology Data

Native Image size : 66000 x 45402 Native zoom: 20 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.735899,0.486387] [0.349641,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 4

Dataformat 6.2: Pathology Data in Key value Format (the eight lines of Dataformat 6.1 flattened into a single space-separated record)

6.3 Process Flow
The pathology data is placed in a folder /patho under the Hadoop directory. A shell script is then written which performs two tasks. Firstly, it takes the pathology data in each individual file under the /patho directory and preprocesses it by running the pre-processing program "pre-process.py", creating a single file with the preprocessed data named "nuclieprocessed.txt", placed under the same directory "/patho". Once the file is created, the second task of the shell script is to copy the processed pathology data to the Hadoop Distributed File System (HDFS) under the "/user/hduser/patho" directory. Once the pathology data is uploaded to the HDFS, it is ready to be processed by the MapReduce programming model to produce a single output file with the required data. The steps mentioned are described in detail below.
6.3.1 The pre-processing Step
The pre-processing program used to format the pathology data is presented below.
Code for Pre-Processing:
import glob

# Files to process
files = glob.glob('/export/mailgrp4_f/sc10bdg/pathodir/*patho*')

# Processed output file (the single key/value file later loaded into HDFS)
nuclieprocessed = open('nuclieprocessed.txt', 'w')

for f in files:
    infile = open(f)
    # Array holding each line of one pathology data file
    arr = []
    line = infile.readline()
    while line:
        # Read each line in the file, strip the newline and collect it
        line = line.strip()
        arr.append(line)
        line = infile.readline()
    infile.close()
    # Write the eight collected lines as one space-separated record
    nuclieprocessed.write(arr[0]+' '+arr[1]+' '+arr[2]+' '+arr[3]+' '+
                          arr[4]+' '+arr[5]+' '+arr[6]+' '+arr[7]+'\n')

nuclieprocessed.close()

6.3.2 Loading pre-processed data into HDFS
The "nuclieprocessed.txt" file created under the specified output directory is then uploaded into the HDFS using a shell script, UploadHDFS.sh:
#!/bin/bash
HADOOP_DIR=/usr/local/hadoop-0.20.2
# 1. Convert the pathology data files into a single processed file by calling the pre-processing program
# Processed file is placed under the HADOOP_DIR
${HADOOP_DIR}/pre-process.py
# 2. Store processed file on the HDFS
${HADOOP_DIR}/bin/hadoop dfs -copyFromLocal /usr/local/hadoop-0.20.2/Patho/nuclieprocessed.txt /user/hduser/patho/
Once the file is uploaded and available on the HDFS we can use the MapReduce programming model to process the data.
6.3.3 Process Data using the Map Reduce Programming Model
Input: Pathology data in the form of (key, value) records after pre-processing, where the key is the set of parameters and the value is the number of nuclei.
Output: If there are multiple occurrences of the same record in the input file, the values are combined together based on the key, returning a single output value containing the parameters and the No. of Nuclei.
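Before looking at the streaming scripts themselves, the intended effect of this reduce step can be reproduced with plain Python on a couple of toy records; the records below are invented purely for illustration and are not real pathology output.

# Toy illustration of the reduce semantics: duplicate (key, value) records
# collapse to a single output line per key. The records are made-up examples.
records = [
    ("paramsA", "4"),
    ("paramsA", "4"),   # duplicate of the first record
    ("paramsB", "5"),
]

reduced = {}
for key, value in records:
    reduced[key] = value          # keep one value per key

for key in sorted(reduced):       # Hadoop also presents keys in sorted order
    print("%s\t%s" % (key, reduced[key]))
# paramsA    4
# paramsB    5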
The logic of the MapReduce program written in Python is that it uses the Hadoop Streaming concept to pass data between the Mapper and the Reducer programs via standard input (STDIN) and standard output (STDOUT). It uses Python's sys.stdin to read the input data and prints the output data to sys.stdout.

Mapper program description
The Map code is saved as mapper.py under HADOOP_HOME. The Mapper program reads data from standard input (STDIN) and writes a list of lines to standard output (STDOUT). The Mapper script does not generate the unique occurrences of each parameter and its associated value (i.e. the number of nuclei); instead, it lets the Reduce step obtain the final unique occurrences. We must make mapper.py executable (chmod +x mapper.py), otherwise we could run into permission problems.

Code for Map:

#!/usr/bin/env python
import sys

# Input comes from STDIN (standard input); each line is stripped and passed through unchanged
for line in sys.stdin:
    line = line.strip()
    print '%s' % (line)

Reducer program description:
The code below is saved with the filename "reducer.py" under the HADOOP_HOME directory. The program reads the output of mapper.py and obtains the unique occurrence of each parameter and its associated nuclei value. The result is then sent to STDOUT (standard output). The reducer.py file also needs execute permission (chmod +x reducer.py), otherwise errors are thrown while running the program.

Code for Reduce:

#!/usr/bin/env python
import sys

# Current key (parameter string) and its associated value (number of nuclei)
current_Para = None
current_value = 0

for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Parse the input we got from mapper.py into the parameter details (key) and the
    # value (No. of nuclei); the nuclei count is the last space-separated field
    Parameter, value = line.rsplit(' ', 1)
    # The logic for reduction: Hadoop sorts the map output by key, so all occurrences
    # of the same parameter arrive one after another
    if current_Para == Parameter:
        # Duplicate occurrence of the same parameter: keep a single value
        current_value = value
    else:
        if current_Para:
            # Write the result for the previous parameter to STDOUT
            print '%s\t%s' % (current_Para, current_value)
        current_value = value
        current_Para = Parameter

# Write the result for the last parameter
if current_Para == Parameter:
    print '%s\t%s' % (current_Para, current_value)

6.3.4 Running the Python Code in Hadoop
To run the MapReduce program, the data to be processed, "nuclieprocessed.txt", must be available on the Hadoop Distributed File System.
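Because Hadoop Streaming simply pipes data through the two scripts, they can be smoke-tested on the local machine before a job is submitted to the cluster. The helper below is a hypothetical sketch, not part of the project code; it reproduces the map, sort and reduce pipeline locally, assuming mapper.py and reducer.py sit in the current directory and Python 2.6 is used.

# local_test.py -- hypothetical helper, equivalent to:
#   cat nuclieprocessed.txt | ./mapper.py | sort | ./reducer.py
import subprocess

def run_local(input_path, mapper='./mapper.py', reducer='./reducer.py'):
    infile = open(input_path)
    # Run the mapper on the raw input
    map_proc = subprocess.Popen([mapper], stdin=infile, stdout=subprocess.PIPE)
    map_out, _ = map_proc.communicate()
    infile.close()
    # Hadoop sorts the map output by key before the reduce phase; sorted() stands in for that here
    sorted_input = '\n'.join(sorted(map_out.splitlines())) + '\n'
    # Feed the sorted lines to the reducer and return its output
    red_proc = subprocess.Popen([reducer], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    red_out, _ = red_proc.communicate(sorted_input)
    return red_out

if __name__ == '__main__':
    print run_local('nuclieprocessed.txt')

With the scripts behaving as expected locally, we confirm that the processed file is present on HDFS: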
debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -ls /user/hduser/pathology1 -rw-r--r-- 1 root supergroup 1251 2011-08-04 04:32 /user/hduser/pathology1/nuclieprocessed.txt The input file contained the following data Native Image size : 66000 x 45402 Native zoom: 20 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.735899,0.486387] [0.349641,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 4 Native Image size : 66000 x 45402 Native zoom: 20 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.735899,0.486387] [0.349641,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 4 Native Image size : 55000 x 11234 Native zoom: 5 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 5 - 51 Native Image size : 55000 x 11234 Native zoom: 20 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 6 Native Image size : 55000 x 11234 Native zoom: 5 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.51777] [0.548,0.001,0.633355] Channel 1 No.nuclei: 6 Native Image size : 55000 x 11234 Native zoom: 5 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 5 Once, the data is available on the Hadoop distributed file system we run the Python MapReduce job on the cluster with currently two machines. We use the Hadoop Streaming as discussed above to pass data between the Map code and the Reduce code via standard Input (STDIN) and standard output (STDOUT). We run the MapReduce program using the below command: debian02:/usr/local/hadoop-0.20.2# bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file /usr/local/hadoop-0.20.2/mappy.py -mapper /usr/local/hadoop-0.20.2/mappy.py -file /usr/local/hadoop0.20.2/reducer.py -reducer /usr/local/hadoop-0.20.2/reducer.py -input /user/hduser/patho/* -output /user/hduser/patho-output The above command executed gives us the below output packageJobJar: [/usr/local/hadoop-0.20.2/mapper.py, /usr/local/hadoop-0.20.2/reducer.py, /hadoop/tmp/dir/hadoop-unjar8894788337428936557/] [] /tmp/streamjob7431796847369503584.jar tmpDir=null 11/08/07 21:34:55 INFO streaming.StreamJob: Running job: job_201108072123_0001 11/08/07 21:34:55 INFO streaming.StreamJob: Tracking URL: http://debian02:50030/jobdetails.jsp?jobid=job_201108072123_0001 11/08/07 21:34:56 INFO streaming.StreamJob: map 0% reduce 0% 11/08/07 21:35:15 INFO streaming.StreamJob: map 100% reduce 100% 11/08/07 21:35:18 INFO streaming.StreamJob: Job complete: job_201108072123_0001 We then check if the output is stored successfully on /user/hduser/patho-output and can inspect the contents of the file using the –cat command. It should contain the reduced output. 
debian02:/usr/local/hadoop-0.20.2# bin/hadoop dfs -cat /user/hduser/patho-output/part-00000
Native Image size : 55000 x 11234 Native zoom: 20 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 6
Native Image size : 55000 x 11234 Native zoom: 5 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 5
Native Image size : 55000 x 11234 Native zoom: 5 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.734899,0.486387] [0.34234,0.78081,0.51777] [0.548,0.001,0.633355] Channel 1 No.nuclei: 6
Native Image size : 66000 x 45402 Native zoom: 20 Loading Deconvolution vectors: estimated_cdv.cdv [0.471041,0.735899,0.486387] [0.349641,0.78081,0.517771] [0.7548,0.001,0.655955] Channel 1 No.nuclei: 4

Chapter 7
7. Evaluation and Experimentation

Having successfully stored and processed the pathology data using the Hadoop framework and its MapReduce programming model, we now evaluate the performance of the implemented framework and examine the benefits of the proposed solution. In this chapter, we measure performance using quantitative metrics: scalability and the time taken to process the pathology data. The first experiment observes how the processing time of the pathology data changes with the number of machines (nodes) in the Hadoop cluster for a fixed data size. The second experiment observes how the processing time changes with the data size, both on a single-node Hadoop setup and on a multi-node Hadoop cluster.

7.1 Response time to process the Pathology Data on a single node / cluster

In this experiment, the MapReduce program was run on pathology data of size 24 KB on a single-node Hadoop setup, on a Hadoop cluster with two machines, and on a Hadoop cluster with three machines. The results are shown in figure 20, figure 21 and figure 22; the difference in time between a single node and clusters with two and three nodes is shown in figure 23.

We first run the MapReduce program to process the 24 KB of data and observe that the first run takes 47 seconds, while the second run is approximately 10 seconds faster, and this shorter run time remains almost the same for subsequent runs. The drop in processing time occurs because the data is cached, i.e. stored transparently so that future requests can be served faster; the cached data are typically values that have been computed before, or duplicate data. Figure 20 shows the difference in processing time between the first, second and subsequent runs.

[Chart: Data processing on a cluster with a single node; x-axis: number of runs (1-5), y-axis: time taken in seconds (0-50).]
Figure 20: Running the program for the first time takes approximately 47 seconds, and 37 seconds for the second and consecutive runs.
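The run times reported in this chapter were read from the timestamps in the Hadoop job output. For completeness, one way the elapsed wall-clock time of a run could be captured programmatically is sketched below; the wrapper, its output directory name and the exact command line are illustrative assumptions, not part of the project code.

# time_run.py -- hypothetical timing wrapper around one Hadoop Streaming run
import subprocess
import time

def time_hadoop_run(command):
    start = time.time()
    subprocess.call(command, shell=True)   # blocks until the streaming job finishes
    return time.time() - start

if __name__ == '__main__':
    cmd = ('/usr/local/hadoop-0.20.2/bin/hadoop jar '
           '/usr/local/hadoop-0.20.2/contrib/streaming/hadoop-*streaming*.jar '
           '-file /usr/local/hadoop-0.20.2/mappy.py -mapper /usr/local/hadoop-0.20.2/mappy.py '
           '-file /usr/local/hadoop-0.20.2/reducer.py -reducer /usr/local/hadoop-0.20.2/reducer.py '
           '-input /user/hduser/patho/* -output /user/hduser/patho-output-timed')
    print 'Elapsed time: %.1f seconds' % time_hadoop_run(cmd)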
We now run the MapReduce program to process the 24 KB of data on a Hadoop cluster with two nodes and observe that the first run takes approximately 32 seconds, while the second run reduces to 27 seconds, and the run time remains almost the same for subsequent runs.

[Chart: Data processing using a cluster with two nodes; x-axis: number of runs (1-5), y-axis: time taken in seconds (24-33).]
Figure 21: Running the program for the first time on a Hadoop cluster consisting of two nodes takes 32 seconds; the second and consecutive runs take 27 seconds.

We now run the MapReduce program to process the 24 KB of data on a Hadoop cluster with three nodes and observe that the first run takes approximately 30 seconds, while the second run reduces to approximately 22 seconds.

[Chart: Data processing using a cluster with three nodes; x-axis: number of runs (1-5), y-axis: time taken in seconds (0-35).]
Figure 22: Running the program for the first time on a Hadoop cluster consisting of three nodes takes 30 seconds; the second and consecutive runs take 22 seconds.

We can now see that processing the data takes approximately 37 seconds on a Hadoop cluster with a single node, 32 seconds with two nodes, and 30 seconds with three nodes, i.e. there is a reduction in time when the number of nodes is increased. Thus, as more nodes are added to the cluster the processing time decreases, because the computation power of the cluster increases. The data in a cluster is distributed among the slave nodes for processing; instead of one machine processing the data, there are three machines working on it, reducing the overall processing time. The difference between running the pathology data on a Hadoop cluster with a single node, two nodes and three nodes is shown below:

[Chart: Data processing using Hadoop clusters with one, two and three nodes; x-axis: number of runs (1-5), y-axis: time taken to process the data in seconds (0-50); series: using 1 node, using 2 nodes, using 3 nodes.]
Figure 23: Difference of running the pathology data on a Hadoop cluster with a single node and on Hadoop clusters with two and three nodes.

7.2 Response time to run the Pathology Application based on Data Size

In this experiment the MapReduce program was run on pathology data of size 24 KB, 131 KB, 3074 KB and 39 MB on a single-node Hadoop setup, and on the same data sizes on Hadoop clusters with two and three machines. The results are shown in figure 24, figure 25 and figure 26; the difference in response time between a single node and clusters with two and three nodes with respect to the data size is shown in figure 27.

We first run the MapReduce program to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB and observe that on a Hadoop cluster with a single node the time taken is 37 seconds for 24 KB, 37.92 seconds for 131 KB, 39.49 seconds for 3074 KB and 40.70 seconds for 39 MB.
Figure 24 below shows the difference in the time taken to process data of different sizes on a Hadoop cluster with a single node. We can observe that there is very little time difference between processing the 24 KB and the 39 MB data sets, bringing out the computational power of Hadoop in handling large data.

[Chart: Data processing time taken on a single machine; x-axis: data size in bytes (25020 (24K), 133857 (131KB), 3147516 (3074KB), 40917708 (39MB), 273833892), y-axis: time taken in seconds (0-140).]
Figure 24: The difference in the time taken to process data of size 24 KB, 131 KB, 3074 KB and 39 MB on a Hadoop cluster with a single node.

We now run the MapReduce program to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on a Hadoop cluster with two nodes. We observe that the time taken is 32 seconds for 24 KB, 24.32 seconds for 131 KB, 25.28 seconds for 3074 KB and 32.18 seconds for 39 MB. Figure 25 below shows the difference in the time taken to process data of different sizes on a Hadoop cluster with two nodes.

[Chart: Data processing time taken on a cluster with 2 nodes; x-axis: data size in bytes (25020 (24K), 133857 (131KB), 3147516 (3074KB), 40917708 (39MB), 273833892), y-axis: time taken in seconds (0-100).]
Figure 25: The difference in the time taken to process data of size 24 KB, 131 KB, 3074 KB and 39 MB on a Hadoop cluster with two nodes.

We now run the MapReduce program to process the data of size 24 KB, 131 KB, 3074 KB and 39 MB on a Hadoop cluster with three nodes. We observe that the time taken is 30 seconds for 24 KB, 22 seconds for 131 KB, 24 seconds for 3074 KB and 30 seconds for 39 MB. Figure 26 below shows the difference in the time taken to process data of different sizes on a Hadoop cluster with three nodes. We can also observe that, as the data size increases, the computation time to process the data increases, which is expected.

[Chart: Data processing time taken on a cluster with 3 nodes; x-axis: data size in bytes (25020 (24K), 133857 (131KB), 3147516 (3074KB), 40917708 (39MB), 273833892), y-axis: time taken in seconds (0-80).]
Figure 26: The difference in the time taken to process data of size 24 KB, 131 KB, 3074 KB and 39 MB on a Hadoop cluster with three nodes.

We can now see that processing the data sets of size 24 KB, 131 KB, 3074 KB and 39 MB takes approximately 37, 37.92, 39.49 and 40.70 seconds on a Hadoop cluster with a single node, 21, 24.32, 25.28 and 32.18 seconds on a cluster with two nodes, and 30, 22, 24 and 30 seconds on a cluster with three nodes. In other words, even as the data size increases it is handled better by a Hadoop cluster with two or three nodes than by a cluster with a single node, and again there is a reduction in time when the number of nodes is increased for a given data size. Thus, larger volumes of data are processed efficiently by Hadoop.
The differences in running the pathology data on a Hadoop cluster with a single node and on Hadoop clusters with two and three nodes over the different data sizes are shown below:

[Chart: Data processing time taken on a Hadoop cluster based on data size; x-axis: data size in bytes (25020 (24K), 133857 (131KB), 3147516 (3074KB), 40917708 (39MB), 273833892), y-axis: time taken in seconds (0-140); series: time taken on a single machine, on a cluster with 2 nodes, on a cluster with 3 nodes.]
Figure 27: The response time taken to process data of size 24 KB, 131 KB, 3074 KB and 39 MB on single-node and multi-node cluster setups.

The graph above clearly shows that the data of size 25,020 bytes is processed in almost the same time by a Hadoop cluster with a single node, two nodes or three nodes. On the other hand, there is a notable difference when processing the data of size 273,833,892 bytes on clusters with one, two and three nodes, supporting the point that Hadoop works better with large volumes of data than with small amounts of data.

[Chart: Processing time taken for 1 GB of data; x-axis: run number (Run 1, Run 2), y-axis: time taken in seconds (0-800); series: using a single-node cluster, a cluster with two nodes, a cluster with three nodes.]
Figure 28: Processing time taken by a Hadoop cluster with one, two and three nodes for 1 GB of pathology data.

Figure 28 clearly shows that the time taken to process the data of size 1,643,003,352 bytes decreases as the number of nodes increases, and that there is a notable difference between the single-node, two-node and three-node clusters, again indicating that Hadoop works best with large volumes of data. We can also see that processing the 39 MB data set takes approximately the same time as processing the 24 KB data set, bringing out the computational power of Hadoop.

7.3 Results and comparison

The results collected from the experiments discussed above are contrasted with existing work. From the experiments we observe that processing 1 GB of data on a Hadoop cluster with a single machine takes approximately 10 minutes and 6 seconds, processing the same data on a cluster with two machines takes 7 minutes and 29 seconds, and with three nodes there is a further reduction to 6 minutes and 49 seconds. Thus, there is a reduction in time as the number of machines in the cluster is increased; we can conclude that adding machines to the cluster decreases the processing time of the data, i.e. the computation power of the cluster increases. This work is contrasted with the paper by Tomi Aarnio (2009), which observes the same and was discussed in the "Related work" sub-section of chapter 1. His work also stated that when using MapReduce in a cluster the computation power is improved by adding new nodes to the network; thus, with more nodes there are more parallel tasks (Tomi, 2009).
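To make the comparison concrete, the 1 GB timings quoted above can be expressed as a parallel speedup S(n) = T(1)/T(n), where T(n) is the processing time on n nodes. The figures below are simply a re-reading of the reported run times (606 s, 449 s and 409 s), not an additional measurement:

S(n) = T(1) / T(n)
S(2) = 606 s / 449 s ≈ 1.35
S(3) = 606 s / 409 s ≈ 1.48

The speedup is noticeably sub-linear (below 2 and 3 respectively), which is consistent with the job start-up and scheduling overheads of the MapReduce runtime discussed in the related work (Jaliya, 2008).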
In experiment two, we can also observe that the Hadoop cluster processes data larger than 39 MB (i.e. 273 MB and 1 GB, shown in figures 27 and 28) more efficiently, and that there is a notable difference in the time taken to process the data on a single node, two nodes and three nodes. This observation is contrasted with the similar work on "Data Intensive Scientific Analyses" discussed in the related work of chapter 1, whose results showed that applications can benefit from the MapReduce technique if the data size used is appropriate, and that the overhead introduced by a particular runtime diminishes as the amount of data and computation increases (Jaliya, 2008). Therefore, the results obtained in this project agree with those found in the literature, especially regarding the two main characteristics of Hadoop: it works best with large data, and the computation power of a Hadoop cluster increases with the number of machines.

7.4 Evaluation of the Software

The software in this project was developed using an agile technique. Code quality was one of the important requirements and was considered throughout the implementation of the proposed solution. The code follows proper naming standards, uses meaningful variable names to make it more readable, and is commented appropriately for a better understanding of how it works. Documentation has been created for running the implemented solution. The MapReduce programming model implemented here would need modifications in order to work with other, similar applications.

7.5 Further Work

The key areas for improvement are:

- Investigating the effect on performance of implementing the above MapReduce programming model in the Java programming language, rather than in Python with the Hadoop Streaming utility.
- Demonstrating how Hadoop's MapReduce framework could be extended to work with image data for image processing. This would include working with the pathology images used to obtain the pathology data in this project: loading the pathology images into HDFS as a sequence file containing the pixel values of each image, having the application obtain the image file from HDFS instead of from the local file system, and evaluating the performance and scalability of such a system.
- Using the HDFS FUSE functionality to work with the pathology image files, so that the image files are accessed by the pathology application from HDFS instead of the local file system. HDFS-FUSE lets you mount the Hadoop Distributed File System in user space; the hadoopfuse package enables us to use the HDFS cluster like a traditional Linux file system.

7.6 How would this Project Idea Help Other Applications?

As discussed earlier, the results obtained from the above experiments clearly show that Hadoop is a technology worth considering for time-efficient and scalable data processing. It could be helpful for organizations that derive and manage business value from enormous volumes of growing data. In the health care industry, Hadoop could be useful for analyzing volumes of electronic health records, clinical outcomes and treatment protocols.
It can also be useful for image analysis and social network analysis which involves large data and complex algorithms that could be difficult with SQL. In financial sectors, Hadoop can used to analyze daily transaction data. 7.7 Meeting Minimum Requirements The minimum requirements stated in section 1.3 under chapter 1 have been met and is discussed in brief in the following section: - 60 In chapter 5, a successful implementation of the Hadoop framework using a single Node was presented and an example application was run without errors to show that the Hadoop cluster implementation was successful using a single node. Once the implementation using of the single Node Hadoop setup was successful, the setup was extended with the Hadoop cluster having two and three nodes. Again examples were run to show that the Hadoop multi-node cluster setup was a success. In chapter 6, the MapReduce programming solution was implemented and the code used for the Map phase and the reduced phase was discussed to obtain a reduced output of the given pathology data. A sample input file to the Map phase was given and the output obtained from the reduced phase was show. In chapter 7, the solution implemented to process the pathology data using the Hadoop framework was evaluated using two experiments. The first experiment showed the performance of the data processing based on the number of nodes introduced in the cluster. The second experiment focused on the response time to process data based on the size of the data. 7.8 Objectives Met The objectives stated in section 1.2 under chapter 1 have been met. Each objective has been documented in this report once it has been met. - 61 - Chapter 8 8. Conclusions 8.1 Project Success This research project has discussed the benefits of using Hadoop framework and its MapReduce programming model to process the pathology data on a Hadoop cluster or cloud. Chapter 4 has indicated that storing data on a single machine could lead to loss of data if the machine malfunctions or crashes. It has also been shown during the experiments that processing of data is done relatively faster when the number of nodes or machines in the cluster increases. The processing of the data is done faster when having more than one machine because the number of machines in a cluster increases the computation power of the cluster. We could also conclude from the discussion in chapter 7 that Hadoop works well with large datasets. The time difference is not that large to process data of smaller size (24KB) and that of larger size (39MB) bringing out the power of Hadoop data processing. The main idea put forward in this project is that Hadoop can be used to process large data robustly and efficiently. 8.2 Summary The aim of this project is to process volumetric pathology application data using the Hadoop framework and to evaluate its performance and scalability on a cloud have been achieved. - 62 - Bibliography ABBOTT, KEEVEN & FISHER PARTNERS. (). Federated Cloud. Available: http://akfpartners.com/techblog/2011/05/09/federated-cloud/. Last accessed 15th june 2011. akshaya bhatia. (). Shared Nothing Architecture. Available: http://it.toolbox.com/wiki/index.php/Shared_Nothing_Architecture. Last accessed 15th june 2011. Bernard Golden (2007). Virtualization For Dummies. United States of America: Paperback. p1-384. C. Monash. The 1-petabyte barrier is crumbling. http://www.networkworld.com/community/node/ 31439. C. Olofson. Worldwide RDBMS 2005 vendor shares. 
Technical Report 201692, IDC, May 2006 Chen Zhang , Hans De Sterck , Ashraf Aboulnaga , Haig Djambazian , and Rob Sladek. (2010). Case Study of Scientific Data Processing on a Cloud Using Hadoop. High Performance Computing Systems and Applications. ( ), p.400-415. Catanzaro, B., N. Sundaram, and K. Keutzer, \A MapReduce framework for programming graphics processors," in Workshop on Software Tools for MultiCore Systems, 2008. c.olofson. (2006). worldwide Embedded DBMS 2005 vendor shares. Technical Report_IDC. 1 (1), p1-12. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD Conference, pages 1099–1110, 2008. D. Vesset. Worldwide data warehousing tools 2005 vendor shares. Technical Report 203229, IDC, August 2006. - 63 D. Magee, K. Djemame, D. Treanor ,Reconstruction of 3D Volumes from multiple 2D Gigapixel Microscopy Images using Transparent Cloud Technology, Internal Document , School of Computing , 2010 , Leeds UK Daniel J. Abadi. (2009). Data Management in the Cloud: Limitations and Opportunities. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 1 (1), p1-10. Elsayed, T., J. Lin, and D. Oard, \Pairwise Document Similarity in Large Collections with MapReduce," Proc. Annual Meeting of the Association for Computational Linguistics, 2008. exemetech.com http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html. eucalyptus. ( ). . Available: http://open.eucalyptus.com/wiki/FAQ. Last accessed 20 Aug 2011. Eric Knorr, Galen Gruman. (). what-cloud-computing-really-means.Available: http://www.infoworld.com/d/cloud-computing/what-cloud-computing-really-means-031. Last accessed 15th June 2011. F. Macias, M. Holcombe, and M. Gheorghe. A formal experiment comparing extreme programming with traditional software construction. In Computer Science, 2003. ENC 2003. Proceedings of the Fourth Mexican International Conference on, pages 73 – 80, sept. 2003. hadoop. (). Welcome to Apache™ Hadoop™. Available: http://hadoop.apache.org/. Last accessed 25 Aug 2011. hadoop. (). Welcome to Hadoop™ MapReduce!. Available: http://hadoop.apache.org/mapreduce/. Last accessed 25 Aug 2011. hadoop. (). Welcome to Hadoop™ HDFS!. Available: http://hadoop.apache.org/hdfs/. Last accessed 25 Aug 2011. - 64 Jaliya Ekanayake, Shrideep Pallickara, Geoffrey Fox. (2008). MapReduce for Data Intensive Scientific Analyses. escience, 2008 Fourth IEEE International Conference on eScience. ( ), pp.277-284. Jonathan Strickland. (). Cloud Computing Architecture. Available: http://computer.howstuffworks.com/cloudcomputing1.htm. Last accessed 15th june 2011. J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. pages 137–150, December 2004 K. Wiley, A. Connolly, J. Gardner ,S. Krugho ,M. Balazinska, B. Howe, Y. Kwon and Y. Bu. (01/2011). Astronomy in the Cloud: Using MapReduce for Image Coaddition. Bulletin of the American Astronomical Society. 43 (1), 344.12. libvirt.org. (). OpenNebula Virtual Infrastructure Manager driver.Available: http://libvirt.org/drvone.html. Last accessed 15th june 2011. Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham. (2009). Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce. . ( ), . Michael G. Noll. (). Running Hadoop On Ubuntu Linux (Multi-Node Cluster). Available: http://www.michaelnoll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/. Last accessed 15th june 2011. Michael Miller (2008). 
Cloud computing. United States of America: Que. P1-283 Miha Ahronovitz,Kuldip Pabla. ( ). What is Hadoop?. Available: http://ahrono.com/app/download/1900278304/Hadoop+Paper+v4aspdf.pdf. Last accessed 28th Aug 2011. nimbus. ( ). . Available: http://www.nimbusproject.org/doc/nimbus/faq/. Last accessed 20 Aug 2011. opennebula.org. (). opennebula. Available: http://opennebula.org/. Last accessed 15th june 2011. - 65 Paul McInerney. (). DB2 partitioning features. Available: http://www.ibm.com/developerworks/data/library/techarticle/dm-0608mcinerney/index.html. Last accessed 15th june 2011.. RightScale. Top reasons amazon ec2 instances disappear. http://blog.rightscale.com/2008/02/02/ R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: Easy and efficient parallel processing of massive data sets. In Proc. of VLDB, 2008. rightscale. (). multi-cloud-engine. Available: http://www.rightscale.com/products/features/multi-cloudengine.php. Last accessed 15th june 2011. Sun Microsystems , Inc. (2009). Introduction to cloud computing architecture. White Papers. 1 (1), p1-40. Shinder, D. L., & Vanover, R. . (2008). 20 Things You Should Know About Virtualization. . CNET Networks Inc, TechRepbulic. 1 (1), p1-6. searchservervirtualization.techtarget.com. (). Hypervisor. Available: http://searchservervirtualization.techtarget.com/definition/hypervisor. Last accessed 15th june 2011. Tomi Aarnio. (2009). Parallel data processing with MapReduce. TKK T-110.5190 Seminar on Internetworking. ( ), . Vaquero LM., Rodero-Merino L., Cáceres J., Lindner M. A Break in the Clouds: Towards a Cloud Definition. ACM Computer Communication Reviews. January 2009. Xen Hypervisor. ( ). Xen Hypervisor - Leading Open Source Hypervisor for Servers. Available: http://xen.org/products/xenhyp.html. Last accessed 20 Aug 2011. Yahoo! Inc. (). Hadoop Tutorial from Yahoo. Available: http://developer.yahoo.com/hadoop/tutorial/index.html. Last accessed 15th june 2011. - 66 - Appendix A - Project Reflection This project was one of the most different and a challenging project done in my master‟s degree. I have learned and worked on something that never had a class room teaching and therefore found it very challenging. I have enjoyed learning the cloud computing concepts and how one could work with virtual machines. This project has given me the opportunity to learn new skill which will be helpful in my career ahead. Overall, I have enjoyed implementing the Hadoop Framework all by myself solving various technical issues by loads of research. I have made sure to document every step taken during the implementation to help other students doing something similar and helping them learn from my experience. Initially, when I took up this project I was not very sure if I could complete the project with in the assigned time but with the help of my project supervisor and his assistance I was able to have a schedule with realistic objectives that could be achieved in the given time period. I have learnt while working on my project that planning, organizing and time management are the key aspects for a successful project. I made sure that I have enough buffer time after every technical implementation as during the planning phase it is difficult to know if for a given time the implementation could be a success as there are always some or the other technical errors. I would suggest any student taking up a research project to spend adequate time on sketching out a good project schedule. 
As a part of my project I had weekly meetings with my supervisor which was very helpful in having my work up to date. It‟s always needed to have your supervisor know every piece of work you are doing to help him guide. I am glad to have a very supportive supervisor who has always encouraged me with the work I did. The feedback given to me during each project meeting helped me improve on the development of concepts. After having functional expertise in projects for three years, working on a strong technical project was very difficult for me. However, the support provided by my supervisor kept me focused on the development work. I would suggest other student to get help from their supervisors on regular basis during their project, as during a learning phase it‟s mandatory to have a proper guidance to accomplish the goal. I also had an opportunity to meet my assessor half way during my project work and present to her and to my supervisor on the work done till date and discuss on the objectives of my project. The collaborative feedback given to me by both my assessor and supervisor helped me evaluate on my progress and performance. Feedback is one of the important ways one can improve the work. I made sure that I begin my write up while working on - 67 the implementation and not wait till the end to write it. This gave me ample time in the end to do some changes on the write-up. I would suggest students to have there write up written up to date without delaying it till the end. There were numerous technical challenges faced while working on this project. Most of the issues faced was also faced by other developers but there was very little documentation on how to resolve the problems. I ended up doing lots of trial and error methods to help resolve the issues faced. I made sure I read loads of articles on the similar work done by others and the way they worked on providing a solution for the problem. I was provided help with workarounds for few problems which was difficult to resolve. Every piece of work developed by me was discussed with my supervisor to make sure on the validity of the solution developed. Thus choosing a project which was interesting and challenging kept me motivated and focused through out the project. I personally feel satisfied on the work done by me as it was a very nice learning experience. I learnt things on my own effort and dedication. I have gained knowledge in an area of computer science which is growing rapidly. I feel elated about the fact that my project was a contribution to cloud computing and the knowledge learnt would help me in my future career. I think a student should take up projects that are new in the field of computer science and which they find interesting as it would help them in research as well as in an IT industry. Overall, I had a great experience in learning more and also adding new skills to the skill sets that I was expertise in and helping me grow as a developer. I would advice students to take up challenging work as things start becoming easy when you start working on it and also take lots of feedback from their supervisor, as this is a learning phase. So, learn as much as you can. - 68 - Appendix B - Critical Technical Issues and solution 1) The datanode process does not start showing the below error in the log files of the datanode. The log files can be accessed under the $HADOOP_HOME/logs directory. 
debian04:/usr/local/hadoop-0.20.2/logs# ls –ltr *datanode* -rw-r--r-- 1 root root 10623 2011-08-25 05:38 hadoop-root-datanode-debian04.log The log files have the below error “ java.io.IOException: Incompatible namespaceIDs” ************************************************************/ 2011-08-25 05:38:06,941 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting DataNode STARTUP_MSG: host = debian04/10.0.10.4 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.3-dev STARTUP_MSG: build = -r ; compiled by 'root' on Sat Jul 23 22:10:32 BST 2011 ************************************************************/ 2011-08-25 05:38:12,265 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /app/hadoop/tmp/dfs/data: namenode namespaceID = 434273426; datanode namespaceID = 254963473 at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233) at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148) at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298) at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216) at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238) at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246) at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368) 2011-08-25 05:38:12,270 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down DataNode at debian04/10.0.10.4 ************************************************************/ The Issue was resolved by changing the “hadoop.tmp.dir” property in the /conf/core-site.xml to a new directory path “ /app/hadoop/temp” instead of “ /app/hadoop/tmp” and formatting the name node. - 69 debian04:/usr/local/hadoop-0.20.2/conf# cat core-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>hadoop.tmp.dir</name> <value>/app/hadoop/temp</value> <description>A base for other temporary directories.</description> </property> debian04:/usr/local/hadoop-0.20.2/bin# jps 22027 Jps 21264 SecondaryNameNode 21078 NameNode 21422 TaskTracker 21815 DataNode 21338 JobTracker The issue occurs if the namenode is formatted more than once.In a Hadoop cluster the Namenode should be formatted only once. 
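As a quick diagnostic for this class of failure, the namespaceID recorded by the NameNode and by the DataNode can be compared directly: HDFS stores it in a VERSION file under the name and data directories. The helper below is a hypothetical sketch, not part of the project code; the paths follow the /app/hadoop/tmp layout shown in the error above and would need to be adjusted to the hadoop.tmp.dir actually in use.

# check_namespace_ids.py -- hypothetical diagnostic helper
def read_namespace_id(version_file):
    # The VERSION file contains lines such as "namespaceID=434273426"
    for line in open(version_file):
        if line.startswith('namespaceID='):
            return line.strip().split('=', 1)[1]
    return None

if __name__ == '__main__':
    name_id = read_namespace_id('/app/hadoop/tmp/dfs/name/current/VERSION')
    data_id = read_namespace_id('/app/hadoop/tmp/dfs/data/current/VERSION')
    print 'NameNode namespaceID: %s' % name_id
    print 'DataNode namespaceID: %s' % data_id
    if name_id != data_id:
        print 'Mismatch: this DataNode was initialised against a different NameNode format.'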
2) The Local file is not copied to the HDFS file system and executing the below command gives the error debian04:/usr/local/hadoop-0.20.2# bin/hadoop dfs -copyFromLocal /tmp/dir dir 11/07/01 20:15:01 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser/ebook/pg4300.txt could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) - 70 There could be two reasons for this problem The dfs.repliction value which specifies on how many machines the file has to be replicated if this value is set more than the number of slave machines involved we could get this error we then change the dfs.replication property in the "hdfs-site.xml" from the default value of 3 to the number of slaves in the cluster. Issue could be also because of the local file system size having no space. Once the file system had more space the Issue was resolved. debian04:~# df -h Filesystem Size Used Avail Use% Mounted on /dev/sda1 22G 11G 11G 48% / 3) The jobtracker process doesn‟t start in a Hadoop cluster and gives the error - BindException: Address already in use This issue was resolved by changing the port details in the /conf/mapred-site.xml in the master and slave machines. In my case I changed the value of the port value of 54311 to 54312 in the /conf/mapred-site.xml file and restarted the Hadoop daemons and the Issue was resolved. 4) The “java heap space” error is displayed while executing a example application on a Hadoop cluster. debian02:/usr/local/hadoop-0.20.2# bin/hadoop jar hadoop*examples*.jar wordcount dir ebook-dir 11/07/06 15:22:04 INFO input.FileInputFormat: Total input paths to process : 4 11/07/06 15:22:04 INFO mapred.JobClient: Running job: job_201107061520_0001 11/07/06 15:22:05 INFO mapred.JobClient: map 0% reduce 0% 11/07/06 15:22:20 INFO mapred.JobClient: Task Id : attempt_201107061520_0001_m_000001_0, Status : FAILED Error: Java heap space - 71 11/07/06 15:22:24 INFO mapred.JobClient: Task Id : attempt_201107061520_0001_m_000000_0, Status : FAILED java.io.IOException: Spill failed at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1123) . . 11/07/06 15:22:30 INFO mapred.JobClient: Task Id : attempt_201107061520_0001_m_000001_1, Status : FAILED java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory at java.lang.ProcessBuilder.start(ProcessBuilder.java:460) at org.apache.hadoop.util.Shell.runCommand(Shell.java:149) at org.apache.hadoop.util.Shell.run(Shell.java:134) at org.apache.hadoop.fs.DF.getAvailable(DF.java:73) at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:32 9) This error is because there is no swap space. When there’s no dedicated swap partition, a workaround is possible by the means of swap files below are the steps that need to be executed. 1. First, create an empty file which will serve as a swap file by issuing the following command: dd if=/dev/zero of=/swap bs=1024 count=1048576 where /swap is the desired name of the swap file, and count=1048576 sets the size to 1024 MB swap 2. 
Set up a Linux swap area with: mkswap /swap 3. set the permissions as follows: chmod 0600 /swap 4. Add the new swap file to /etc/fstab: /swap swap swap defaults,noatime 00 This way it will be loaded automatically on boot. 5. To enable the new swap space immediately, issue: swapon -a Check with free -m if everything went right. we should be seeing additional swap space available. - 72 - Appendix C - Hadoop configuration files C.1 Core-site.xml / mapred-site.xml / hdfs-site.xml Configuration files on master Node (Debian02) debian02:/usr/local/hadoop-0.20.2/conf# cat core-site.xml_bkp <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>hadoop.tmp.dir</name> <value>/app/hadoop/tmp/dir</value> <description>A base for other temporary directories.</description> </property> <property> <name>fs.default.name</name> <value>hdfs://10.0.10.2:54310</value> </property> </configuration> debian02:/usr/local/hadoop-0.20.2/conf# cat mapred-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>mapred.job.tracker</name> <value>10.0.10.2:54312</value> <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description> </property> </configuration> debian02:/usr/local/hadoop-0.20.2/conf# cat mapred-site.xml - 73 <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>mapred.job.tracker</name> <value>10.0.10.2:54312</value> <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description> </property> </configuration> debian02:/usr/local/hadoop-0.20.2/conf# cat hdfs-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>dfs.replication</name> <value>2</value> </property> </configuration> Node details on the /conf/master and /conf/slave files on the master machine debian02:/usr/local/hadoop-0.20.2/conf# cat slaves 10.0.10.2 10.0.10.5 10.0.10.6 debian02:/usr/local/hadoop-0.20.2/conf# cat masters 10.0.10.2 C.2 Core-site.xml / mapred-site.xml / hdfs-site.xml Configuration files on Slave Nodes (Debian05) debian05:/usr/local/hadoop-0.20.2/conf# cat core-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> - 74 <property> <name>hadoop.tmp.dir</name> <value>/hadoop/tmp/dir</value> <description>A base for other temporary directories.</description> </property> <property> <name>fs.default.name</name> <value>hdfs://10.0.10.2:54310</value> </property> </configuration> debian05:/usr/local/hadoop-0.20.2/conf# cat mapred-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>mapred.job.tracker</name> <value>10.0.10.2:54312</value> <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. 
</description> </property> </configuration> debian05:/usr/local/hadoop-0.20.2/conf# cat cat hdfs-site.xml cat: cat: No such file or directory <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>dfs.replication</name> <value>2</value> </property> </configuration> Node details on the /conf/master and /conf/slave files on the slave machine (debian05) - 75 debian05:/usr/local/hadoop-0.20.2/conf# cat masters localhost debian05:/usr/local/hadoop-0.20.2/conf# cat slaves 10.0.10.2 10.0.10.5 10.0.10.6 C.3 Core-site.xml / mapred-site.xml / hdfs-site.xml Configuration files on Slave Nodes (Debian06) debian06:/usr/local/hadoop-0.20.2/conf# cat core-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>hadoop.tmp.dir</name> <value>/hadoop/tmp/dir</value> <description>A base for other temporary directories.</description> </property> <property> <name>fs.default.name</name> <value>hdfs://10.0.10.2:54310</value> </property> </configuration> debian06:/usr/local/hadoop-0.20.2/conf# cat mapred-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>mapred.job.tracker</name> <value>10.0.10.2:54312</value> <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description> </property> - 76 </configuration> debian06:/usr/local/hadoop-0.20.2/conf# cat hdfs-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. 
--> <configuration> <property> <name>dfs.replication</name> <value>3</value> </property> </configuration> Node details on the /conf/master and /conf/slave files on the slave machine (debian06) debian06:/usr/local/hadoop-0.20.2/conf# cat masters localhost debian06:/usr/local/hadoop-0.20.2/conf# cat slaves 10.0.10.2 10.0.10.5 10.0.10.6 - 77 - Appendix D - MapReduce Program processing 1 GB data debian02:/usr/local/hadoop-0.20.2# bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file /usr/local/hadoop-0.20.2/mappy.py -mapper /usr/local/hadoop-0.20.2/mappy.py -file /usr/local/hadoop0.20.2/reducer.py -reducer /usr/local/hadoop-0.20.2/reducer.py -input /user/hduser/patho/* -output /user/hduser/patho-1GB-2 packageJobJar: [/usr/local/hadoop-0.20.2/mappy.py, /usr/local/hadoop-0.20.2/reducer.py, /hadoop/tmp/dir/hadoopunjar1363922815805956110/] [] /tmp/streamjob3943827107289634139.jar tmpDir=null 11/08/21 03:57:34 INFO mapred.FileInputFormat: Total input paths to process : 1 11/08/21 03:57:35 INFO streaming.StreamJob: getLocalDirs(): [/hadoop/tmp/dir/mapred/local] 11/08/21 03:57:35 INFO streaming.StreamJob: Running job: job_201108190519_0004 11/08/21 03:57:35 INFO streaming.StreamJob: To kill this job, run: 11/08/21 03:57:35 INFO streaming.StreamJob: /usr/local/hadoop-0.20.2/bin/../bin/hadoop job Dmapred.job.tracker=10.0.10.2:54312 -kill job_201108190519_0004 11/08/21 03:57:35 INFO streaming.StreamJob: Tracking URL: http://debian02:50030/jobdetails.jsp?jobid=job_201108190519_0004 11/08/21 03:57:36 INFO streaming.StreamJob: map 0% reduce 0% 11/08/21 03:57:53 INFO streaming.StreamJob: map 3% reduce 0% 11/08/21 03:57:54 INFO streaming.StreamJob: map 7% reduce 0% 11/08/21 03:57:56 INFO streaming.StreamJob: map 10% reduce 0% 11/08/21 03:57:57 INFO streaming.StreamJob: map 12% reduce 0% 11/08/21 03:57:58 INFO streaming.StreamJob: map 15% reduce 0% 11/08/21 03:57:59 INFO streaming.StreamJob: map 17% reduce 0% 11/08/21 03:58:01 INFO streaming.StreamJob: map 19% reduce 0% 11/08/21 03:58:02 INFO streaming.StreamJob: map 20% reduce 0% 11/08/21 03:58:04 INFO streaming.StreamJob: map 21% reduce 0% 11/08/21 03:58:09 INFO streaming.StreamJob: map 22% reduce 0% 11/08/21 03:58:13 INFO streaming.StreamJob: map 23% reduce 0% 11/08/21 03:58:20 INFO streaming.StreamJob: map 24% reduce 0% 11/08/21 03:58:37 INFO streaming.StreamJob: map 26% reduce 0% 11/08/21 03:58:40 INFO streaming.StreamJob: map 27% reduce 0% 11/08/21 03:58:44 INFO streaming.StreamJob: map 31% reduce 0% 11/08/21 03:58:46 INFO streaming.StreamJob: map 34% reduce 0% 11/08/21 03:58:49 INFO streaming.StreamJob: map 35% reduce 0% 11/08/21 03:58:52 INFO streaming.StreamJob: map 36% reduce 0% 11/08/21 03:58:54 INFO streaming.StreamJob: map 37% reduce 0% 11/08/21 03:59:01 INFO streaming.StreamJob: map 38% reduce 0% 11/08/21 03:59:04 INFO streaming.StreamJob: map 39% reduce 3% 11/08/21 03:59:07 INFO streaming.StreamJob: map 39% reduce 4% 11/08/21 03:59:11 INFO streaming.StreamJob: map 40% reduce 4% 11/08/21 03:59:13 INFO streaming.StreamJob: map 42% reduce 4% - 78 11/08/21 03:59:14 INFO streaming.StreamJob: map 46% reduce 4% 11/08/21 03:59:17 INFO streaming.StreamJob: map 47% reduce 4% 11/08/21 03:59:24 INFO streaming.StreamJob: map 47% reduce 5% 11/08/21 03:59:28 INFO streaming.StreamJob: map 48% reduce 5% 11/08/21 03:59:34 INFO streaming.StreamJob: map 48% reduce 7% 11/08/21 03:59:36 INFO streaming.StreamJob: map 48% reduce 9% 11/08/21 03:59:44 INFO streaming.StreamJob: map 50% reduce 9% 11/08/21 03:59:48 INFO streaming.StreamJob: map 55% reduce 9% 
11/08/21 03:59:52 INFO streaming.StreamJob: map 56% reduce 9% 11/08/21 03:59:53 INFO streaming.StreamJob: map 56% reduce 11% 11/08/21 04:00:07 INFO streaming.StreamJob: map 56% reduce 12% 11/08/21 04:00:13 INFO streaming.StreamJob: map 56% reduce 13% 11/08/21 04:00:19 INFO streaming.StreamJob: map 56% reduce 15% 11/08/21 04:00:24 INFO streaming.StreamJob: map 56% reduce 17% 11/08/21 04:00:28 INFO streaming.StreamJob: map 56% reduce 19% 11/08/21 04:00:30 INFO streaming.StreamJob: map 57% reduce 19% 11/08/21 04:00:32 INFO streaming.StreamJob: map 58% reduce 19% 11/08/21 04:00:35 INFO streaming.StreamJob: map 61% reduce 19% 11/08/21 04:00:38 INFO streaming.StreamJob: map 67% reduce 19% 11/08/21 04:00:40 INFO streaming.StreamJob: map 72% reduce 19% 11/08/21 04:00:42 INFO streaming.StreamJob: map 77% reduce 19% 11/08/21 04:00:43 INFO streaming.StreamJob: map 79% reduce 19% 11/08/21 04:00:48 INFO streaming.StreamJob: map 80% reduce 19% 11/08/21 04:01:14 INFO streaming.StreamJob: map 80% reduce 20% 11/08/21 04:01:16 INFO streaming.StreamJob: map 80% reduce 21% 11/08/21 04:01:25 INFO streaming.StreamJob: map 80% reduce 24% 11/08/21 04:01:29 INFO streaming.StreamJob: map 83% reduce 24% 11/08/21 04:01:31 INFO streaming.StreamJob: map 86% reduce 24% 11/08/21 04:01:32 INFO streaming.StreamJob: map 86% reduce 25% 11/08/21 04:01:34 INFO streaming.StreamJob: map 88% reduce 25% 11/08/21 04:01:36 INFO streaming.StreamJob: map 89% reduce 25% 11/08/21 04:01:39 INFO streaming.StreamJob: map 91% reduce 25% 11/08/21 04:01:42 INFO streaming.StreamJob: map 93% reduce 25% 11/08/21 04:01:45 INFO streaming.StreamJob: map 94% reduce 25% 11/08/21 04:01:48 INFO streaming.StreamJob: map 96% reduce 27% 11/08/21 04:01:51 INFO streaming.StreamJob: map 99% reduce 27% 11/08/21 04:01:54 INFO streaming.StreamJob: map 100% reduce 27% 11/08/21 04:02:14 INFO streaming.StreamJob: map 100% reduce 29% 11/08/21 04:02:23 INFO streaming.StreamJob: map 100% reduce 32% 11/08/21 04:03:02 INFO streaming.StreamJob: map 100% reduce 33% 11/08/21 04:03:23 INFO streaming.StreamJob: map 100% reduce 67% 11/08/21 04:03:31 INFO streaming.StreamJob: map 100% reduce 72% 11/08/21 04:03:46 INFO streaming.StreamJob: map 100% reduce 83% 11/08/21 04:03:56 INFO streaming.StreamJob: map 100% reduce 89% 11/08/21 04:04:14 INFO streaming.StreamJob: map 100% reduce 100% - 79 11/08/21 04:04:23 INFO streaming.StreamJob: Job complete: job_201108190519_0004 11/08/21 04:04:23 INFO streaming.StreamJob: Output: /user/hduser/patho-1GB-2 - 80 - Appendix E - Installation of Java / Python Installation of JAVA debian02:/usr/local# echo $JAVA_HOME *** No java installed *** *** Install Java *** debian02:/usr/local# apt-get install sun-java6-jdk E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing? 
*** Fix *** 1)Update the source list apt-get update 2)Install java JDK and JRE with apt-get install apt-get install sun-java6-jdk sun-java6-jre 3)After installation done, jdk and jre will install at /usr/lib/jvm/java-6-sun1.6.0.06debian02:/usr/lib/jvm# ls -ltr total 4 lrwxrwxrwx 1 root root 19 2011-07-01 15:28 java-6-sun -> java-6-sun-1.6.0.22 drwxr-xr-x 8 root root 4096 2011-07-01 15:28 java-6-sun-1.6.0.22 4)debian02:~# java -version java version "1.6.0_22" Java(TM) SE Runtime Environment (build 1.6.0_22-b04) Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode) - 81 Installation of Python 1) Download Python 2.6.7 tar file from http://www.python.org/getit/releases/2.6.7/ 2) debian02:~# cd /usr/local debian02:/usr/local# ls -ltr total 154176 -rw------- 1 root staff 13322372 2011-07-29 20:24 Python-2.6.7.tgz 3) Untar the tar file creating a directory Python-2.6.7 debian02:/usr/local# cd Python-2.6.7 debian02:/usr/local/Python-2.6.7# 4) To start building on UNIX: we execute the configure file from the terminal "./configure" from the Current directory and when it execution completes, type "make". This will create an Executable "./python"; one this runs successfully to install in the /usr/local directory , first type "su root" and then execute "make install". - 82 - Appendix F – Schedule The time line of the project is as below for reference. - 83 - Appendix G – Interim Project Report The Interim project report feedback is attached at the end of the report for reference.