Project number: 317871 Project acronym: BIOBANKCLOUD Project title: Scalable, Secure Storage of Biobank Data Project website: http://www.biobankcloud.eu Project coordinator: Jim Dowling (KTH) Coordinator e-mail: [email protected] WORK PACKAGE 6: Integration and Evaluation WP leader: Michael Humml WP leader organization: CHARITE WP leader e-mail: [email protected] PROJECT DELIVERABLE D6.1 BiobankCloud Platform-as-a-Service Due date: 30th November, 2015 (M36) D6.1 – BiobankCloud Platform-as-a-Service Editor Jim Dowling (KTH) Contributors Jim Dowling, Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh, Ermias Gebremelski (KTH), Marc Bux (HU), Tiago Oliveira, Ricardo Mendes (LIS) Disclaimer This work was partially supported by the European Commission through the FP7-ICT program under project BiobankCloud, number 317871. The information in this document is provided as is, and no warranty is given or implied that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability. The opinions expressed in this deliverable are those of the authors. They do not necessarily represent the views of all BiobankCloud partners. D6.1 – BiobankCloud Platform-as-a-Service Executive Summary In this deliverable, we introduce two new software services that we built to deploy and use BiobankCloud, respectively. Firstly, we present Karamel, a standalone application that enables a BiobankCloud cluster to be deployed in just a few mouse clicks. Karamel is an orchestration engine for the configuration management framework Chef, enabling it to coordinate the installation of software on distributed systems. Karamel also integrates with cloud APIs to create virtual machines for different cloud platforms. Together, these features enable Karamel to provide an end-to-end system that creates virtual machines and coordinates the provisioning and configuration of the software for those virtual machines. The second service that we present is HopsWorks, a Software-as-a-Service (SaaS) user interface to our Hadoop platform and BiobankCloud. All of the software components in BiobankCloud have been integrated in HopsWorks, from our security model, to HopsFS, HopsYARN, SAASFEE Bioinformatics toolkit, and Charon for sharing files between clusters. Together, Karamel and HopsWorks enable non-sophisticated users to deploy BiobankCloud on cloud infrastructures, and immediately be able to use the software to curate data (Biobankers) or run workflows (Bioinformaticians), while storing petabytes of data in secure, isolated studies. The document is structured as an overview to both services containing help guides for both Karamel and HopsWorks. D6.1 – BiobankCloud Platform-as-a-Service Introduction During the course of BiobankCloud, we investigated many different possible approaches to reducing the burden of deploying the BiobankCloud platform. BiobankCloud is based on Hops Hadoop [3] and contains a large number of complex software services, each of which requires installation and configuration. From experience, we realized that there was a compelling need for the automated installation and configuration of BiobankCloud if the platform was to gain wide adoption in the community. As of late 2015, there are many different platforms that support automated installation [1]. 
• Google provides Kubernetes [1] for Google Container Engine, as well as an open-source variant that is not yet as complete, feature-wise, as the managed version;
• Amazon provides OpsWorks [5] as a way to automate the installation of custom software by providing Chef cookbooks to install the software;
• Docker [1] provides a way to install software using container technology, with the advantage of being platform independent;
• OpenStack provides Heat [2] as a way to define clusters in a declarative manner, but it needs a backend configuration management platform, such as Chef [4, 5] or Puppet [5], to install the software;
• JuJu [2] provides a managed way to install applications on Ubuntu hosts in a declarative manner.
All of the above are fine solutions, but we needed a system that was:
1. open: supporting public clouds, private clouds, and on-premises installations;
2. easy-to-use: normal users should be able to click their way to a clustered deployment;
3. configurable: normal users should be able to configure the cluster to their available resources and environment.
Of the above systems, OpsWorks, OpenStack, and JuJu are not open, as they work only on their own platforms; Docker instances are not yet configurable (typically, people run Chef or Puppet on Docker instances to configure them); and Kubernetes is not yet feature-complete enough for non-Google deployments. We designed and developed Karamel to meet all three of our requirements. Karamel is open, supporting Amazon Web Services (AWS), Google Compute Engine (GCE), OpenStack, and bare-metal hosts. Karamel is easy-to-use: users can deploy clusters from a user-friendly web user interface (UI), and they can also configure their clusters in the same UI, easily adding machines and changing the configuration of services (for example, the amount of memory used by services such as the database and Hadoop). Karamel is built as an orchestration engine on top of Chef. Chef is a popular configuration framework for managing and provisioning software on large clusters. Chef supports neither the orchestration of services (starting services in a well-defined order) nor the creation of virtual machines or Docker instances; Chef assumes an existing cluster and works with that. Chef provides two modes of use: with a Chef server or serverless. Karamel is built on serverless Chef, called Chef Solo. In Chef Server mode, all nodes in the cluster run a Chef client that periodically contacts the Chef server for instructions on software to install or configure. The Chef server maintains the configuration information and credentials needed by the services. In Karamel, our Karamel client application plays the role of the Chef server, but only during installation. Karamel injects configuration parameters into Chef Solo runs, enabling the passing of parameters between different services during installation. For example, when deploying master/slave software (such as our database, NDB, or data processing frameworks such as Spark), Karamel installs the master first and passes the master's public OpenSSH key to the slave nodes, so that they can be configured to allow the master passwordless SSH access to the slave machines. Karamel requires Chef cookbooks to install and configure software. At a high level, Chef cookbooks can be thought of as containers for the logic for how to install and configure software services.
At a lower level, Chef cookbooks are containers for software programs, written in Ruby and called recipes, that install and configure software services. Chef cookbooks also provide parameters for configuring how software is installed (Chef attributes). These Chef attributes are used by recipes to customize the software being installed or configured. For this deliverable, we also wrote the Chef cookbooks for all of our software services. Instead of providing web pages containing instructions on how to install BiobankCloud, we now have programs that are version-controlled on GitHub, automatically tested (using the Kitchen framework), and that can be composed in cluster definitions in Karamel. The first part of this deliverable includes a user guide and a developer guide for Karamel, including sample cluster definitions that can be used to deploy BiobankCloud. In the second part of this deliverable, we provide the user, installation, administration, and developer guides for HopsWorks and Hops, our Software-as-a-Service (SaaS). HopsWorks is the frontend to BiobankCloud and integrates all of the software components from BiobankCloud. In Table 1, we show the features provided by HopsWorks and the deliverables in the project from which they were derived.

Feature | Description | Integrated from Deliverable(s)
Two-factor authentication | Secure authorization using smartphones and Yubikeys | D3.4 Security Toolset Final Version
Dynamic User Roles | Users can have different privileges in different studies | D3.4 Security Toolset Final Version
Biobanking forms | Consent forms, Non-consent Forms | D1.3 Legal and Ethical Framework, D3.4 Security Toolset Final Version
Audit Trails | Logging of user activity in the system | D1.4 Disclosure model, D3.4 Security Toolset Final Version
Study membership mgmt | Study owners manage users and their roles | D3.5 Object Model Implementation
Metadata mgmt | Metadata designer and metadata entry for files/directories | D3.5 Object Model Implementation, D2.3 Scalable and Highly Available HDFS
Free-text search | Search for projects/datasets/files/directories using Elasticsearch | D3.5 Object Model Implementation, D2.3 Scalable and Highly Available HDFS
Data set sharing | Sharing data between studies without copying | D1.2 Object model for biobank data sharing
Data set browser | Explore/upload/download files and directories in HopsFS | D3.5 Object Model Implementation, D2.3 Scalable and Highly Available HDFS
SAASFEE | Bioinformatics workflows on YARN using Cuneiform and HiWAY | D5.3 Workflows for NGS data analysis use cases, D6.3 Analysis Pipelines Linked to Public Biological Annotation
Charon | Sharing data between Biobanks | D4.3 Overbank Implementation and Evaluation, D2.4 Secure, scalable, highly-available Filesystem
Apache Zeppelin | Interactive analytics using Spark and Flink | D6.1 BiobankCloud Platform-as-a-Service

Table 1: HopsWorks integrates features from BiobankCloud Deliverables.

HopsWorks as a new UI for Hadoop

Existing models for multi-tenancy in Hadoop, such as Amazon Web Services' Elastic MapReduce (EMR) platform, Google's Dataproc platform, and Altiscale's Hadoop-as-a-Service, provide multi-tenant Hadoop by running separate Hadoop clusters for separate projects or organizations. They improve cluster efficiency by running Hadoop clusters on virtualized or containerized platforms, but in some cases the clusters are not elastic, that is, they cannot be easily scaled up or down in size. There are no tools for securely sharing data between platforms without copying data.
HopsWorks is a front-end to Hadoop that provides a new model for multi-tenancy in Hadoop, based around projects. A project is like a GitHub project: the owner of the project manages membership, and users can have different roles in the project: data scientists can run programs, and data owners can also curate, import, and export data. Users cannot copy data between projects or run programs that process data from different projects, even if the user is a member of multiple projects. That is, we implement multi-tenancy with dynamic roles, where the user's role is based on the currently active project. Users can still share datasets between projects, however. HopsWorks has been enabled by migrating all metadata in HDFS and YARN into an open-source, shared-nothing, in-memory, distributed database called NDB. HopsWorks is open-source and licensed as Apache v2, with database connectors licensed as GPL v2. From late January 2016, HopsWorks will be provided as software-as-a-service for researchers and companies in Sweden from the Swedish ICT SICS Data Center (https://www.sics.se/projects/sicsice-data-center-in-lulea).

HopsWorks Implementation

HopsWorks is a J2EE 7 web application that runs by default on Glassfish and has a modern AngularJS user interface, supporting responsive HTML using the Bootstrap framework (that is, the UI adapts its layout for mobile devices). We have a separate administration application.

Karamel Documentation
Release 0.2
www.karamel.io
December 12, 2015

CONTENTS
1 What is Karamel?
2 Getting Started
2.1 How to run a cluster?
2.2 Launching an Apache Hadoop Cluster with Karamel
2.3 Designing an experiment with Karamel/Chef
2.4 Designing an Experiment: MapReduce Wordcount
3 Web-App
3.1 Board-UI
3.2 Karamel Terminal
3.3 Experiment Designer
4 Cluster Definition
4.1 AWS (Amazon EC2)
4.2 Google Compute Engine
4.3 Bare-metal
5 Deploying BiobankCloud with Karamel
6 Developer Guide
6.1 Code quality
6.2 Build and run from Source
6.3 Building Windows Executables

CHAPTER ONE
WHAT IS KARAMEL?

Karamel is a management tool for reproducibly deploying and provisioning distributed applications on bare-metal, cloud or multi-cloud environments. Karamel provides explicit support for reproducible experiments for distributed systems.
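To make this concrete, the sketch below shows the overall shape of a purely illustrative cluster definition; the organization, cookbook, and recipe names are hypothetical placeholders, and a complete, working example for Apache Hadoop is given in Section 2.2.

name: MyFirstCluster
ec2:
  type: m3.medium              # instance type for the virtual machines
  region: eu-west-1            # AWS region in which the machines are forked
cookbooks:
  mysvc:
    github: "myorg/mysvc-chef" # hypothetical karamelized Chef cookbook hosted on GitHub
groups:
  servers:
    size: 2                    # two machines are created and provisioned for this group
    recipes:
      - mysvc::default         # Chef recipe installed on every machine in the group

A cluster definition is therefore a complete, versionable description of a deployment: the same file can be replayed later, or by someone else, to reproduce the same system.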
Users of Karamel experience the tool as an easy-to-use UI-driven approach to deploying distributed systems or running distributed experiments, where the deployed system or experiment can be easily configured via the UI. Karamel users can open a cluster definition file that describes a distributed system or experiment as: • the application stacks used in the system, containing the set of services in each application stack, • the provider(s) for each application stack in the cluster (the cloud provider or IP addresses of the bare-metal hosts), • the number of nodes that should be created and provisioned for each application stack, • configuration parameters to customize each application stack. Karamel is an orchestration engine that orchestrates: • the creation of virtual machines if a cloud provider is used; • the global order for installing and starting services on each node; • the injection of configuration parameters and passing of parameters between services. Karamel enables the deployment of arbitrarily large distributed systems on both virtualized platforms (AWS, Vagrant) and bare-metal hosts. Karamel is built on the configuration framework, Chef. The distributed system or experiment is defined in YAML as a set of node groups that each implement a number of Chef recipes, where the Chef cookbooks are deployed on github. Karamel orchestrates the execution of Chef recipes using a set of ordering rules defined in a YAML file (Karamelfile) in each cookbook. For each recipe, the Karamelfile can define a set of dependent (possibly external) recipes that should be executed before it. At the system level, the set of Karamelfiles defines a directed acyclic graph (DAG) of service dependencies. Karamel system definitions are very compact. We leverage Berkshelf to transparently download and install transitive cookbook dependencies, so large systems can be defined in a few lines of code. Finally, the Karamel runtime builds and manages the execution of the DAG of Chef recipes, by first launching the virtual machines or configuring the bare-metal boxes and then executing recipes with Chef Solo. The Karamel runtime executes the node setup steps using JClouds and Ssh. Karamel is agentless, and only requires ssh to be installed on the target host. Karamel transparently handles faults by retrying, as virtual machine creation or configuration is not always reliable or timely. Existing Chef cookbooks can easily be karamelized, that is, wrapped and extended with a Karamelfile containing orchestration rules. In contrast to Chef, which is used primarily to manage production clusters, Karamel is designed to support the creation of reproducible clusters for running experiments or benchmarks. Karamel provides additional Chef cookbook support for copying experiment results to persistent storage before tearing down clusters. In Karamel, infrastructure and software are delivered as code while the cluster definitions can be configured by modifying the configuration parameters for the services containined in the cluster definition. Karamel uses Github as the 1 Karamel Documentation, Release 0.2 artifact-server for Chef cookbooks, and all experiment artifacts are globally available - any person around the globe can replay/reproduce the construction of the distributed system. Karamel leverages virtual-machines to provision infrastructures on different clouds. We have cloud-connectors for Amazon EC2, Google Compute Engine, OpenStack and on-premises (bare-metal). 2 Chapter 1. What is Karamel? 
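As a small illustration of how configuration parameters reach the deployed software, the fragment below shows the attrs section of a cluster definition; the cookbook and attribute names are invented for the example and are not taken from a real cookbook.

attrs:
  mysvc:                # hypothetical cookbook name
    heap_mb: 4096       # configuration parameter (Chef attribute) set by the user
# Karamel injects these values into each Chef Solo run, where recipes read them
# as Chef node attributes, e.g. node['mysvc']['heap_mb']; experiment scripts
# reference them in the form #{node.mysvc.heap_mb} (see Section 2.4).

Group-scope values entered in the web UI take precedence over these cluster-scope values, as described in the Web-App chapter.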
CHAPTER TWO
GETTING STARTED

2.1 How to run a cluster?

To run a simple cluster you need:
• a cluster definition file;
• access to a cloud (or bare-metal cluster);
• the Karamel client application.
You can use Karamel as a standalone application with a web UI, or you can embed Karamel as a library in your application, using the Java API to start your cluster.

2.1.1 Linux/Mac

1. Starting Karamel
To run Karamel, download the Linux/Mac binaries from http://www.karamel.io. You first have to unzip the binaries (tar -xf karamel-0.2.tgz). From your machine's terminal (command-line), run the following commands:

cd karamel-0.2
./bin/karamel

This should open a new tab in your web browser if one is already open, or open your default web browser if one is not already open. Karamel will appear on the webpage that is opened.

2.1.2 Windows

1. Starting Karamel
To run Karamel, download the Windows binaries from http://www.karamel.io. You first have to unzip the binaries. From Windows Explorer, navigate to the folder karamel-0.2 (probably in the Downloads folder) and double-click on the karamel.exe file to start Karamel.
2. Customize and launch your cluster
Take a look into the Board-UI.

Fig. 2.1: Karamel Homepage. Click on Menu to load a Cluster Definition file.

2.1.3 Command-Line in Linux/Mac

You can either set environment variables containing your EC2 credentials or enter them from the console. We recommend that you set the environment variables, as shown below.

export AWS_KEY=...
export AWS_SECRET_KEY=...
./bin/karamel -launch examples/hadoop.yml

After you launch a cluster from the command-line, the client loops, printing to stdout the status of the install DAG of Chef recipes every 20 seconds or so. Both the GUI and command-line launchers print stdout and stderr to log files that can be found, relative to the current working directory, with:

tail -f log/karamel.log

2.1.4 Java-API

You can run your cluster from your Java program by using our API.
1. Jar-file dependency
First add a dependency on the karamel-core jar-file; its pom.xml dependency is as follows:

<dependency>
  <groupId>se.kth</groupId>
  <artifactId>karamel-core</artifactId>
  <scope>compile</scope>
</dependency>

2. Karamel Java API
Load the content of your cluster definition into a variable and call the KaramelApi as in this example:

//instantiate the API
KaramelApi api = new KaramelApiImpl();
//load your cluster definition (YAML) into a Java variable
String clusterDefinition = ...;
//the name of the cluster (the name: field in the cluster definition)
String clusterName = ...;
//the API works with JSON, so convert the cluster definition into JSON
String json = api.yamlToJson(clusterDefinition);
//make sure your SSH keys are available; if not, let the API generate them for you
SshKeyPair sshKeys = api.loadSshKeysIfExist("");
if (sshKeys == null) {
  sshKeys = api.generateSshKeysAndUpdateConf(clusterName);
}
//register your SSH keys - that is the way of confirming your SSH keys
api.registerSshKeys(sshKeys);
//check whether your credentials for AWS (or any other cloud) already exist, otherwise register them
Ec2Credentials credentials = api.loadEc2CredentialsIfExist();
api.updateEc2CredentialsIfValid(credentials);
//now you can start your cluster by giving the JSON representation of your cluster
api.startCluster(json);
//you can always check the status of your cluster by running the "status" command through the API
//run status at some time interval to see updates for your cluster
long ms1 = System.currentTimeMillis();
int mins = 0;
while (ms1 + 24 * 60 * 60 * 1000 > System.currentTimeMillis()) {
  mins++;
  System.out.println(api.processCommand("status").getResult());
  Thread.sleep(60000);
}

This code block will print out your cluster status to the console every minute.

2.2 Launching an Apache Hadoop Cluster with Karamel

A cluster definition file is shown below that defines an Apache Hadoop V2 cluster to be launched on AWS/EC2. If you click on Menu->Load Cluster Definition and open this file, you can then proceed to launch this Hadoop cluster by entering your AWS credentials and selecting or generating an OpenSSH keypair. The cluster definition includes a cookbook called 'hadoop', and recipes for HDFS' NameNode (nn) and DataNodes (dn), as well as YARN's ResourceManager (rm) and NodeManagers (nm), and finally a recipe for the MapReduce JobHistoryService (jhs). The nn, rm, and jhs recipes are included in a single group called 'metadata', and a single node will be created (size: 1) on which all three services will be installed and configured. In a second group (the datanodes group), the dn and nm services will be installed and configured on two nodes (size: 2). If you want more instances of a particular group, you simply increase the value of the size attribute (e.g., set "size: 100" for the datanodes group if you want 100 DataNodes and NodeManagers for Hadoop). Finally, we parameterize this cluster deployment with version 2.7.1 of Hadoop (attrs -> hadoop -> version). The attrs section is used to supply parameters that are fed to Chef recipes during installation.

name: ApacheHadoopV2
ec2:
  type: m3.medium
  region: eu-west-1
cookbooks:
  hadoop:
    github: "hopshadoop/apache-hadoop-chef"
    version: "v0.1"
attrs:
  hadoop:
    version: 2.7.1
groups:
  metadata:
    size: 1
    recipes:
      - hadoop::nn
      - hadoop::rm
      - hadoop::jhs
  datanodes:
    size: 2
    recipes:
      - hadoop::dn
      - hadoop::nm

The cluster definition file also includes a cookbooks section. GitHub is our artifact server. We only support the use of cookbooks in our cluster definition file that are located on GitHub. Dependent cookbooks (through Berkshelf) may also be used (from Opscode's repository, Chef Supermarket or GitHub), but the cookbooks referenced in the YAML file must be hosted on GitHub. The reason for this is that the Karamel runtime uses GitHub APIs to query cookbooks for configuration parameters, available recipes, dependencies (Berksfile) and orchestration rules (defined in a Karamelfile). The set of all Karamelfiles for all services is used to build a directed-acyclic graph (DAG) of the installation order for recipes.
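To give a feel for these orchestration rules, the sketch below shows, conceptually, the kind of ordering information a Karamelfile encodes for the Hadoop cookbook used above. This is a conceptual sketch only: the layout and the specific dependencies shown are illustrative, and the literal Karamelfile syntax is defined by Karamel and shipped inside each karamelized cookbook.

# Conceptual ordering rules (not the literal Karamelfile schema):
hadoop::dn:
  run_after:
    - hadoop::nn      # e.g., DataNodes wait for the NameNode recipe to complete
hadoop::nm:
  run_after:
    - hadoop::rm      # e.g., NodeManagers wait for the ResourceManager
hadoop::jhs:
  run_after:
    - hadoop::nn
    - hadoop::rm      # e.g., the JobHistoryService starts after the HDFS and YARN masters

From rules of this kind, gathered from every cookbook referenced in the cluster definition, Karamel derives the global installation DAG for the whole cluster.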
This allows for modular development and automatic composition of cookbooks into cluster, where each cookbook encapsulates its own orchestration rules. In this way, deployment modules for complicated distributed systems can be developed and tested incrementally, where each service defines its own independent deployment model in Chef and Karamel, and independet deployment modules can be automatically composed into clusters in cluster definition files. This approach supports an incremental test and development model, helping improve the quality of deployment software. 2.3 Designing an experiment with Karamel/Chef An experiment in Karamel is a cluster definition file that contains a recipe defining the experiment. As such, an experiment requires a Chef cookbook and recipe, and writing Chef cookbooks and recipes can be a daunting prospect for even experienced developers. Luckily, Karamel provides a UI that can take a bash script or a python program and generate a karamelized Chef cookbook with a Chef recipe for the experiment. The Chef cookbook is automatically uploaded to a GitHub repository that Karamel creates for you. You recipe may have dependencies on other recipes. For example, a MapReduce experiment defined on the above cluster should wait until all the other services have started before it runs. On examination of the Karamelfile for the hadoop cookbook, we can see that hadoop::jhs and hadoop::nm are the last services to start. Our MapReduce experiment can state in the Karamelfile that it should start after the hadoop::jhs and hadoop::nm services have started at all nodes in the cluster. Experiments also have parameters and produce results. Karamel provides UI support for users to enter parameter values in the Configure menu item. An experiment can also download experiment results to your desktop (the Karamel client) by writing to the filename /tmp/<cookbook>__<recipe>.out. For detailed information on how to design experiments, go to experiment designer 2.4 Designing an Experiment: MapReduce Wordcount This experiment is a wordcount program for MapReduce that takes as a parameter an input textfile as a URL. The program counts the number of occurances of each word found in the input file. First, create a new experiment called mapred in GitHub (any organization). You will then need to click on the advanced tickbox to allow us to specify dependencies and parameters. .. We keep them separate experiments to measure their time individually. user=mapred group=mapred textfile=http://www.gutenberg.org/cache/epub/1787/pg1787.txt 6 Chapter 2. Getting Started Karamel Documentation, Release 0.2 Fig. 2.2: Defining the texfile input parameter. Parameters are key-value pairs defined in the Parameter box. The code generator bash script must wait until all HDFS datanodes and YARN nodemanagers are up and running before it is run. To indicate this, we add the following lines to Dependent Recipes textbox: hadoop::dn hadoop::nm Our new cookbook will be dependent on the hadoop cookbook, and we have to enter into the Cookbook Dependencies textbox the relative path to the cookbook on GitHub: cookbook 'hadoop', github: 'hopshadoop/apache-hadoop-chef' The following code snippet runs MapReduce wordcount on the input parameter textfile. The parameter is referenced in the bash script as #{node.mapred.textfile}, which is a combination of node.‘‘<cookbookname>‘‘.‘‘<parameter>‘‘. 2.4. Designing an Experiment: MapReduce Wordcount 7 Karamel Documentation, Release 0.2 Fig. 
2.3: Define the Chef cookbook dependencies as well as the dependent recipes, the recipes that have to start before the experiments in this cookbook. 8 Chapter 2. Getting Started CHAPTER THREE WEB-APP Karamel provides a web-based UI and is a lightweight standalone application that runs on user machines, typically desktops. The user interface it has three different perspectives: board, terminal and experiment designer. 3.1 Board-UI The Board is the landing page that appears in your browswer when you start Karamel. The Board is a view on a cluster definition file that you load. You can modify the cluster using the UI (adding/removing recipes, entering parameter values, save updated cluster definitions) and run the cluster definition from the UI. This way, inexperienced users can launch clusters without needing to read cluster definitions in YAML. 3.1.1 Load Cluster Definition Click on the menu item, and then click on Load Cluster Defn: Fig. 3.1: Load Cluster Definition. Lists are shown in the board perspective of the UI, where each list represents a group of machines that install the same application stack. At the top of each list you see the group-name followed by the number of machines in that group (in parentheses). Each list consists of a set of cards, where each card represents a service (Chef recipe) that will be installed on all the machines in that group. Chef recipes are programs written in Ruby that contain instructions for how to install and configure a piece of software or run an experiment. 9 Karamel Documentation, Release 0.2 Fig. 3.2: Lists of cards in Karamel. Each card is a Chef recipe. 3.1.2 Change group name and size To change the GroupName and/or number of machines in each group, double click on the header of the group. In the following dialog, you can make your changes and submit them (to indicate your are finished). Fig. 3.3: Changing the number of nodes in a NodeGroup 3.1.3 Add a new recipe to a group In the top-left icon in the header of each group, there is a menu to update the group. Select the Add recipe menu item: In order to add a recipe to a group, you must enter the GitHub URL for the (karamelized) Chef cookbook where your recipe resides, and then press fetch to load available recipes from the cookbook. Choose your recipe from the combo-box below: 10 Chapter 3. Web-App Karamel Documentation, Release 0.2 Fig. 3.4: Adding a Chef recipe to a node group. Fig. 3.5: Adding a Chef recipe from a GitHub cookbook to a node group. 3.1. Board-UI 11 Karamel Documentation, Release 0.2 3.1.4 Customize Chef attributes for a group Parameters (Chef attributes) can be entered within the scope of a NodeGroup: group scope values have higher precedence than (override) global cluster scope values. To update chef attributes for a group, select its menu item from the group settings menu: Fig. 3.6: Updating Chef Attributes at the Group level. In the dialog below, there is a tab per used cookbook in that group, in each tab you see all customizable attributes, some of them are mandatory and some optional with some default values. Users must set a value for all of the mandatory attributes (or accept the default value, if one is given). 3.1.5 Customize cloud provider for a group Cluster definition files support the use of multiple (different) cloud providers within the same cluster definition. Each group can specify its own cloud provider. This way, we can support multi-cloud deployments. 
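In a cluster definition file, this corresponds to placing a provider block at group scope. The sketch below is illustrative and assumes, following Chapter 4, that a group may carry its own ec2 or gce block that overrides the cluster-wide provider; the gce values are copied from the Google Compute Engine example in Chapter 4.

ec2:                     # cluster-wide (global) provider
  type: m3.medium
  region: eu-west-1
cookbooks:
  hadoop:
    github: "hopshadoop/apache-hadoop-chef"
groups:
  namenodes:
    size: 1
    recipes:
      - hadoop::nn
      - hadoop::rm
  datanodes:
    gce:                 # group-scope provider: these machines are forked on Google Compute Engine
      type: n1-standard-1
      zone: europe-west1-b
      image: ubuntu-1404-trusty-v20150316
    size: 2
    recipes:
      - hadoop::dn
      - hadoop::nm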
Like attributes, cloud provider settings at the NodeGroup scope will override cloud provider settings at the global scope. Should you have multi-cloud settings in in your cluster, at launch time you must supply credentials for each cloud separately in the launch dialog. Choose the cloud provider for the current group then you will see moe detailed settings for the cloud provider. 3.1.6 Delete a group If you want to delete a group find the menu-item in the group menu. Once you delete a group the list and all the settings related to that group will be disappeared forever. 12 Chapter 3. Web-App Karamel Documentation, Release 0.2 Fig. 3.7: Entering attribute values to customize service. Fig. 3.8: Multi-cloud deployments are supported by specifying different cloud providers for different node groups. 3.1. Board-UI 13 Karamel Documentation, Release 0.2 Fig. 3.9: Configuring a cloud provider per Node Group. Fig. 3.10: Delete a Node Group. 14 Chapter 3. Web-App Karamel Documentation, Release 0.2 Fig. 3.11: Delete Confirmation. 3.1.7 Update cluster-scope attributes When you are done with your group settings you can have some global values for Chef attributes. By choosing Configure button in the middle of the top bar a configuration dialog will pop up, there you see several tabs each named after one used chef-cookbook in the cluster definition. Those attributes are pre-built by cookbook designers for runtime customization. There are two types of attributes mandatory and optional - most of them usually have a default value but if they don’t, the user must fill in mandatory values to be able to proceed. Fig. 3.12: To fill in optional and mandatory attributes. By default each cookbook has a parameter for the operating system’s user-name and group-name. It is recommended to set the same user and group for all cookbooks that you don’t face with permission issues. It is also important to fine-tune your systems with the right parameters, for instance according to type of the machines in your cluster you should allocate enough memory to each system. 3.1.8 Start to Launch Cluster Finally you have to launch your cluster by pressing launch icon in the top bar. There exist a few tabs that user must go through all of them, you might have to specify values and confirm everything. Even though Karamel caches those values, you have to always confirm that Karamel is allowed to use those values for running your cluster. 3.1.9 Set SSH Keys In this step first you need to specify your ssh key pair - Karamel uses that to establish a secure connection to virtual machines. For Linux and Mac operating systems, Karamel finds the default ssh key pair in your operating system and will use it. 3.1. Board-UI 15 Karamel Documentation, Release 0.2 Fig. 3.13: Filling in optional and mandatory attributes. Fig. 3.14: Launch Button. 16 Chapter 3. Web-App Karamel Documentation, Release 0.2 Fig. 3.15: SSH key paths. 3.1.10 Generate SSH Key If you want to change the default ssh-key you can just check the advance box and from there ask Karamel to generate a new key pair for you. 3.1.11 Password Protected SSH Keys If your ssh key is password-protected you need to enter your password in the provided box, and also in case you use bare-metal (karamel doesn’t fork machines from cloud) you have to give sudo-account access to your machines. 3.1.12 Cloud Provider Credentials In the second step of launch you need to give credentials for accessing the cloud of your choice. 
If your cluster is running on a single cloud a tab related to that cloud will appear in the launch dialog and if you use multi-cloud a separate tab for each cloud will appear. Credentials are usually in different formats for each cloud, for more detail information please find it in the related cloud section. 3.1.13 Final Control When you have all the steps passed in the summary tab you can launch your cluster, it will bring you to the terminal there you can control the installation of your cluster. 3.1. Board-UI 17 Karamel Documentation, Release 0.2 Fig. 3.16: Advanced options for SSH keys. 18 Chapter 3. Web-App Karamel Documentation, Release 0.2 Fig. 3.17: Provider-specific credentials. Fig. 3.18: Validity summary for keys and credentials. 3.1. Board-UI 19 Karamel Documentation, Release 0.2 3.2 Karamel Terminal The terminal perspective enables user to monitor and manage running clusters as well as debugging Chef recipes by running them. 3.2.1 Open Terminal The Board-UI redirects to the terminal as soon as a cluster launches. Another way to access the terminal is by clicking on the terminal menu-item from the menu dropdown list, as shown in the screen-shot below. Fig. 3.19: Selecting the terminal perspective. 3.2.2 Button Bar The Terminal has a menu bar in which the available menu items (buttons) change dynamically based on the active page. 3.2.3 Command Bar Under the menu bar, there is a long text area where you can execute commands directly. The buttons (menu items) are, in fact, just widely used commands. To see list of commands click on the Help button. 3.2.4 Main Page The main page in the terminal shows available running clusters - you can run multiple clusters at the same time they just need to have different names - where you see general status of your cluster. There are some actions in front of each cluster where you can obtain more detail information about each cluster. 20 Chapter 3. Web-App Karamel Documentation, Release 0.2 Fig. 3.20: Successful launch redirects to terminal page. 3.2.5 Cluster Status Status page pushes the new status for the chosen cluster very often. In the first table you see phases of the cluster and each of them they passed successfully or not. Fig. 3.21: Cluster Status - A recently started cluster The cluster deployment phases are: #1. Pre-Cleaning #2. Forking Groups #3. Forking Machines #4. Installing As soon as the cluster passes the forking groups phase, a list of machine tables appear under the phases table. Each machine table indicates that the virtual machine (VM) has been forked and some details on the VM are available, such as its IP Addresses (public and private) and its connection status. Inside each machine table there exists a smaller table for showing the tasks that are going to be submitted into that machine. Before all machines became forked and ready, all task tables are empty for all machines. Once all machines have started forking tasks, a list of tasks are displayed for each machine. The Karamel Scheduler orders tasks and decides when each task is ready to be run. The scheduler assigns a status label to each task. 
The task status labels are: • Waiting: the task is still waiting until its dependencies have finished; • Ready: the task is ready to be run but the associated machine has not yet taken it yet because it is running another task; • Ongoing: the task is currently running on the machine; • Succeed: the task has finished successfully; • Failed: the task has failed - each failure will be propagated up into cluster and will cause the cluster to pause the installation. 3.2. Karamel Terminal 21 Karamel Documentation, Release 0.2 Fig. 3.22: Cluster Status - Forking Machines When a task is finished a link to its log content will be displayed in the third column of task table. The log is the merged content of the standard output and standard error streams. Fig. 3.23: Cluster Status - Installing. 3.2.6 Orchestartion DAG The scheduler in Karamel builds a Directed Acyclic Graph (DAG) from the set of tasks in the cluster. In the terminal perspective, the progress of the DAG execution can be visualized by clicking on the “Orchestration DAG” button. Each Node of the DAG represents a task that must be run on a certain machine. Nodes dynamically change their color according to the status change of their tasks. Each color is interpreted as follows: • Blue: Waiting • Ready: Yellow • Ongoing: Blinking orange • Succeed: Green • Failed: Red The Orchestration DAG is not only useful to visualize the cluster progress but can also help in debugging the level of parallelization in the installation graph. If some tasks are acting as global barriers during installation, they can be quickly identified by inspecting the DAG and seeing the nodes with lots of incoming edges and some outgoing edges. As have local orchestration rules in their Karamelfiles, the DAG is built from the set of Karamelfiles. It is not easy 22 Chapter 3. Web-App Karamel Documentation, Release 0.2 Fig. 3.24: Orchestration DAG to manually traverse the DAG, given a set of Karamelfiles, but the visual DAG enables easier inspection of the global order of installation of tasks. 3.2.7 Quick Links Quick links a facility that Karamel provides in the terminal perspective to access web pages for services in your cluster. For example, when you install Apache Hadoop, you might want to access the NameNode or ResourceManager’s web UI. Those links must be designed in karamelized cookbooks (in the metadata.rb file). Karamel parses the metadata.rb files, extracting the webpage links and displaying them in the Quick Links tab. Fig. 3.25: Quick Links 3.2. Karamel Terminal 23 Karamel Documentation, Release 0.2 3.2.8 Statistics Currently Karamel collects information about the duration of all tasks when you deploy a cluster. Duration statistics are available by clicking on statistics button that will show the names of the tasks and their execution time. It might be have you have several instances of each task in your cluster, for example, you may install the hadoop::dn recipe on several machines in your cluster - all such instances will appear in the statistics table. Statistics is a good way for performance measurement for some type of experiments. You can just draw a plot on them to show the performance of your experiment. 3.2.9 Pause/Resume A cluster may pause running either because the user’s order or when a failure happens. It is a good way if user wants to change something or if he wants to avoid running the entire cluster for some reason. 
In that case when you click on the “Pause” button it takes some time until all machines finish their current running task and go into the paused mode. When cluster is paused, a resume button will appear which proceeds running the cluster again. 3.2.10 Purge Purge is a button to destroy and release all the resources both on Clouds and Karamel-runtime, destroying any virtual machines created. It is recommended to use the purge function via Karamel to clean-up resources rather than doing so manually - Karamel makes sure all ssh connections, local threads, virtual machines and security groups are released completely. 3.3 Experiment Designer The experiment Designer perspective in Karamel helps you to design your experiment in a bash script or a python program without needing to know Chef or Git. Take the following steps to design and deploy your experiment. 3.3.1 Find experiment designer When you have Karamel web app up and running, you can access the experiment designer from the Experiment menu-item on the left-hand side of the application. Fig. 3.26: Get into the experiment designer. 24 Chapter 3. Web-App Karamel Documentation, Release 0.2 3.3.2 Login into Github Github is Karamel’s artifact server, here you will have to login into your Github account for the first time while Karamel will remember your credentials for other times. Fig. 3.27: Login button. Fig. 3.28: Github credentials. 3.3.3 Start working on experiment You can either create a new experiment or alternatively load the already designed experiment into the designer. Fig. 3.29: Work on a new or old experiment. 3.3.4 Create a new experiment If you choose to create a new experiment you will need to choose a name for it, optionally describe it and choose which Github repo you want to host your experiment in. As you can see in the below image Karamel connects and fetches your available repos from Github. 3.3. Experiment Designer 25 Karamel Documentation, Release 0.2 Fig. 3.30: New experiment on a Github repo. 3.3.5 Write body of experiment At this point you land into the programming section of your experiment. The default name for the experiment recipe is “experiment”. In the large text-area, as can be seen in the screenshot below, you can write your experiment code either in bash or python. Karamel will automatically wrap your code into a chef recipe. All parameters in experiment come in the format of Chef variables, you should wrap them inside #{} and prefix them node.<cookbookname>. By default, they have the format #{node.cookbook.paramName}, where paramName is the name of your parameter. If you write results of your experiment in a file called /tmp/wordcout__experiment.out - if your cookbook called “wordcount” and your recipe called “experiment”- Karamel will download that file and will put it into ~/.karamel/results/ folder of your client machine. 3.3.6 Define orchestration rules for experiment Placing your experiment in the right order in the cluster orchestration is a very essential part of your experiment design. Click the advanced checkbox, write in the line-separated Cookbook::recipe_name that your experiment requires have finished before the experiment will start. If your experiment is dependent on other cookbooks (for recipes or parameters), you must enter the relative GitHub name for the cookbook and the version/branch in lineseparated format in the second text-area. 3.3.7 Push your experiment into Github You can save your cluster to GitHub by pressing the save button in the top-right hand corner of the webpage. 
This will generate your cookbook and copy all the files to Github by adding, committing, and pushing the new files to GitHub. 26 Chapter 3. Web-App Karamel Documentation, Release 0.2 Fig. 3.31: Experiment bash script. 3.3.8 Approve uploaded experiment to Github Navigate to your Github repo on your web browser and you can see your cookbook. 3.3. Experiment Designer 27 Karamel Documentation, Release 0.2 Fig. 3.32: Orchestration rules for new cluster. Fig. 3.33: Push the experiment to a Github repository. 28 Chapter 3. Web-App Karamel Documentation, Release 0.2 Fig. 3.34: New experiment added to Github. 3.3. Experiment Designer 29 Karamel Documentation, Release 0.2 30 Chapter 3. Web-App CHAPTER FOUR CLUSTER DEFINITION The cluster definition format is an expressive DSL based on YAML as you can see in the following sample. Since Karamel can run several clusters simultaneously, the name of each cluster must be unique. Currently We support four cloud providers: Amazon EC2 (ec2), Google Compute Engine (gce), Openstack Nova (nova) and bare-metal(baremetal). You can define a provider globally within a cluster definition file or you can define a different provider for each group in the cluster definition file. In the group scope, you can overwrite some attributes of the network/machines in the global scope or you can choose an entirely different cloud provider, defining a multi-cloud deployment. Settings and properties for each provider is introduced in later section. For a single cloud deployment, one often uses group-scope provider details to override the type of instance used for machines in the group. For example, one group of nodes may require lots of memory and processing power, while other nodes require less. For AWS, you would achive this by overriding the instanceType attribute. The Cookbooks section specifies GitHub references to the cookbooks used in the cluster definition. It is possible to refer to a specific version or branch for each GitHub repository. We group machines based on the application stack (list of recipes) that should be installed on the machines in the group. The number of machines in each group and list of recipes must be defined under each group name. 31 Karamel Documentation, Release 0.2 name: spark ec2: type: m3.medium region: eu-west-1 cookbooks: hadoop: github: "hopshadoop/apache-hadoop-chef" spark: github: "hopshadoop/spark-chef" branch: "master" groups: namenodes: size: 1 recipes: - hadoop::nn - hadoop::rm - hadoop::jhs - spark::master datanodes: size: 2 recipes: - hadoop::dn - hadoop::nm - spark::slave 4.1 AWS(Amazon EC2) In cluster definitions, we use key word ec2 for deploying the cluster on Amazon EC2 Cloud. The following code snippet shows all supported attributes for AWS. ec2: type: c4.large region: eu-west-1 ami: ami-47a23a30 price: 0.1 vpc: vpc-f70ea392 subnet: subnet-e7830290 Type of the virtual machine, region (data center) and Amazon Machine Image are the basic properties. We support spot instances that is a way to control your budget. Since Amazon prices are changing based on demand, price is a limit you can set if you are not willing to pay beyond that limit (price unit is USD). 4.1.1 Virtual Private Cloud on AWS-EC2 We support AWS VPC on EC2 for better performance. First you must define your VPC in EC2 with the following steps then include your vpc and subnet id in the cluster definition as it is shown above. 1. Make a VPC and a subnet assigned to it under your ec2. 2. Check the “Auto-assign Public IP” item for your subnet. 3. 
Make an internet gateway and attach it to the VPC. 4. Make a routing table for your VPC and add a row for your gateway into it, on this row open all ips ‘0.0.0.0/0’. 32 Chapter 4. Cluster Definition Karamel Documentation, Release 0.2 5. Add your vpc-id and subnet-id into the ec2 section of your yaml like the following example. Also make sure you are using the right image and type of instance for your vpc. 4.2 Google Compute Engine To deploy the cluster on Google’s infrastructure, we use the keyword gce in the cluster definition YAML file. Following code snippet shows the current supported attributes: gce: type: n1-standard-1 zone: europe-west1-b image: ubuntu-1404-trusty-v20150316 Machine type, zone of the VMs, and the VM image can be specified by the user. Karamel uses Compute Engine’s OAuth 2.0 authentication method. Therefore, an OAuth 2.0 client ID needs to be created through the Google’s Developer Console. The description on how to generate a client ID is available here. You need to select Service account as the application type. After generating a service account, click on Generate new JSON key button to download a generated JSON file that contains both private and public keys. You need to register the fullpath of the generated JSON file with Karamel API. 4.3 Bare-metal Bare-metal clusters are supported, but the machines must first be prepared with support for login using a ssh-key that is stored on the Karamel client. The target hosts must be contactable using ssh from the Karamel client, and the target hosts’ ip-addresses must be specified in the cluster definition. If you have many ip-addresses in a range, it is possible to give range of addresses instead of specifying them one by one (the second example below). The public key stored on the Karamel client should be copied to the .ssh/authorized_keys file in the home folder of the sudo account on the target machines that will be used to install the software. The username goes into the cluster definition is the sudo account, and if there is a password required to get sudo access, it must be entered in the Web UI or entered through Karamel’s programmatic API. baremetal: username: ubuntu ips: - 192.168.33.12 - 192.168.33.13 - 192.168.33.14 - 192.168.44.15 4.3.1 IP-Range baremetal: username: ubuntu ips: - 192.168.33.12-192.168.33.14 - 192.168.44.15 4.2. Google Compute Engine 33 Karamel Documentation, Release 0.2 34 Chapter 4. Cluster Definition CHAPTER FIVE DEPLOYING BIOBANKCLOUD WITH KARAMEL BiobankCloud is a Platform-as-a-Service (PaaS) for biobanking with Big Data (Hadoop). BiobankCloud brings together • Hops Hadoop with • SAASFEE, a Bioinformatics platform for YARN that provides both a workflow language (Cuneiform) and a 2nd-level scheduler (HiWAY) • Charon, a cloud-of-clouds filesystem, for sharing data between BiobankCloud clusters. We have written karamelized Chef cookbooks for installing all of the components of BiobankCloud, and we provide some sample cluster definitions for installing small, medium, and large BiobankCloud clusters. Users are, of course, expected to adapt these sample cluster definitions to their cloud provider or bare-metal environment as well as their needs. The following is a brief description of the karmelized Chef cookbooks that we have developed to support the installation of BiobankCloud. The cookbooks are all publicly available at: http://github.com/. 
• hopshadoop/apache-hadoop-chef • hopshadoop/hops-hadoop-chef • hopshadoop/elasticsearch-chef • hopshadoop/ndb-chef • hopshadoop/zeppelin-chef • hopshadoop/hopsworks-chef • hopshadoop/spark-chef • hopshadoop/flink-chef • biobankcloud/charon-chef • biobankcloud/hiway-chef The following is a cluster definition file that installs BiobankCloud on a single m3.xlarge instance on AWS/EC2: name: BiobankCloudSingleNodeAws ec2: type: m3.xlarge region: eu-west-1 cookbooks: hops: github: "hopshadoop/hops-hadoop-chef" branch: "master" hadoop: github: "hopshadoop/apache-hadoop-chef" 35 Karamel Documentation, Release 0.2 branch: "master" hopsworks: github: "hopshadoop/hopsworks-chef" branch: "master" ndb: github: "hopshadoop/ndb-chef" branch: "master" spark: github: "hopshadoop/spark-chef" branch: "hops" zeppelin: github: "hopshadoop/zeppelin-chef" branch: "master" elastic: github: "hopshadoop/elasticsearch-chef" branch: "master" charon: github: "biobankcloud/charon-chef" branch: "master" hiway: github: "biobankcloud/hiway-chef" branch: "master" attrs: hdfs: user: glassfish conf_dir: /mnt/hadoop/etc/hadoop hadoop: dir: /mnt yarn: user: glassfish nm: memory_mbs: 9600 vcores: 4 mr: user: glassfish spark: user: glassfish hiway: home: /mnt/hiway user: glassfish release: false hiway: am: memory_mb: '512' vcores: '1' worker: memory_mb: '3072' vcores: '1' hopsworks: user: glassfish twofactor_auth: "true" hops: use_hopsworks: "true" ndb: DataMemory: '50' IndexMemory: '15' dir: "/mnt" shared_folder: "/mnt" 36 Chapter 5. Deploying BiobankCloud with Karamel Karamel Documentation, Release 0.2 mysql: dir: "/mnt" charon: user: glassfish group: hadoop user_email: [email protected] use_only_aws: true groups: master: size: 1 recipes: - ndb::mysqld - ndb::mgmd - ndb::ndbd - hops::ndb - hops::rm - hops::nn - hops::dn - hops::nm - hopsworks - zeppelin - charon - elastic - spark::master - hiway::hiway_client - hiway::cuneiform_client - hiway::hiway_worker - hiway::cuneiform_worker - hiway::variantcall_worker The following is a cluster definition file that installs a very large, highly available, BiobankCloud cluster on 56 m3.xlarge instance on AWS/EC2: name: BiobankCloudMediumAws ec2: type: m3.xlarge region: eu-west-1 cookbooks: hops: github: "hopshadoop/hops-hadoop-chef" branch: "master" hadoop: github: "hopshadoop/apache-hadoop-chef" branch: "master" hopsworks: github: "hopshadoop/hopsworks-chef" branch: "master" ndb: github: "hopshadoop/ndb-chef" branch: "master" spark: github: "hopshadoop/spark-chef" branch: "hops" zeppelin: github: "hopshadoop/zeppelin-chef" branch: "master" elastic: github: "hopshadoop/elasticsearch-chef" 37 Karamel Documentation, Release 0.2 branch: "master" charon: github: "biobankcloud/charon-chef" branch: "master" hiway: github: "biobankcloud/hiway-chef" branch: "master" attrs: hdfs: user: glassfish conf_dir: /mnt/hadoop/etc/hadoop hadoop: dir: /mnt yarn: user: glassfish nm: memory_mbs: 9600 vcores: 8 mr: user: glassfish spark: user: glassfish hiway: home: /mnt/hiway user: glassfish release: false hiway: am: memory_mb: '512' vcores: '1' worker: memory_mb: '3072' vcores: '1' hopsworks: user: glassfish twofactor_auth: "true" hops: use_hopsworks: "true" ndb: DataMemory: '8000' IndexMemory: '1000' dir: "/mnt" shared_folder: "/mnt" mysql: dir: "/mnt" charon: user: glassfish group: hadoop user_email: [email protected] use_only_aws: true groups: master: size: 1 bbcui: - ndb::mgmd - ndb::mysqld - hops::ndb - hops::client 38 Chapter 5. 
Deploying BiobankCloud with Karamel Karamel Documentation, Release 0.2 - hopsworks - spark::yarn - charon - zeppelin - hiway::hiway_client - hiway::cuneiform_client metadata: size: 2 recipes: - hops::ndb - hops::rm - hops::nn - ndb::mysqld elastic: size: 1 recipes: - elastic database: size: 2 recipes: - ndb::ndbd workers: size: 50 recipes: - hops::ndb - hops::dn - hops::nm - hiway::hiway_worker - hiway::cuneiform_worker - hiway::variantcall_worker Alternative configurations are, of course, possible. You could run several Elasticsearch instances for high availability and more master instances if you have many active clients. 39 Karamel Documentation, Release 0.2 40 Chapter 5. Deploying BiobankCloud with Karamel CHAPTER SIX DEVELOPER GUIDE We have organized our code into two main projects, karamel-core and karamel-ui. The core is our engine for launching, installing and monitoring clusters. The UI is a standalone web application containing several designers and visualizers. There is a REST-API in between the UI and the core. The core and REST-API are programmed in Java 7, and the UI is programmed in Angular JS. 6.1 Code quality 1. Testability and mockability: Write your code in a way that you test each unit separately. Split concerns into different modules that you can mock one when testing the other. We use JUnit-4 for unit testing and mockito for mocking. 2. Code styles: Write a DRY (Don’t repeat yourself) code, use spaces instead of tab and line width limit is 120. 3. We use Google Guava and its best practices, specially the basic ones such as nullity checks and preconditions. 6.2 Build and run from Source Ubuntu Requirements: apt-get install lib32z1 lib32ncurses5 lib32bz2-1.0 Centos 7 Requirements: Install zlib.i686, ncurses-libs.i686, and bzip2-libs.i686 on CentOS 7 Building from root directory: mvn install Running: cd karamel-ui/target/appassembler ./bin/karamel 6.3 Building Window Executables You need to have 32-bit libraries to build the windows exe from Linux, as the launch4j plugin requires them. 41 Karamel Documentation, Release 0.2 sudo apt-get install gcc binutils-mingw-w64-x86-64 -y # Then replace 32-bit libraries with their 64-bit equivalents cd /home/ubuntu/.m2/repository/net/sf/ cd launch4j/launch4j/3.8.0/launch4j-3.8.0-workdir-linux/bin rm ld windres ln -s /usr/bin/x86_64-w64-mingw32-ld ./ld ln -s /usr/bin/x86_64-w64-mingw32-windres ./windres Then run maven with the -Pwin to run the plugin: mvn -Dwin package 42 Chapter 6. Developer Guide HopsWorks Documentation www.hops.io December 12, 2015 CONTENTS 1 2 3 4 Hops Overview 1.1 Audience . . . . . . . . . . . 1.2 Revision History . . . . . . . 1.3 What is Hops? . . . . . . . . . 1.4 HopsWorks . . . . . . . . . . 1.4.1 Users . . . . . . . . . 1.4.2 Projects and DataSets . 1.4.3 Analytics . . . . . . . 1.4.4 MetaData Management 1.4.5 Free-text search . . . . 1.5 HopsFS . . . . . . . . . . . . 1.6 HopsYarn . . . . . . . . . . . 1.7 BiobankCloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4 4 5 5 6 6 7 7 7 7 8 8 System Requirements 2.1 Recommended Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Entire Hops platform on a single baremetal machine . . . . . . . . . . . . . . . . 2.3 Entire Hops platform on a single virtualbox instance (vagrant) . . . . . . . . . . 2.4 DataNode and NodeManager . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
1 Hops Overview: 1.1 Audience; 1.2 Revision History; 1.3 What is Hops?; 1.4 HopsWorks (1.4.1 Users; 1.4.2 Projects and DataSets; 1.4.3 Analytics; 1.4.4 MetaData Management; 1.4.5 Free-text search); 1.5 HopsFS; 1.6 HopsYarn; 1.7 BiobankCloud
2 System Requirements: 2.1 Recommended Setup; 2.2 Entire Hops platform on a single baremetal machine; 2.3 Entire Hops platform on a single virtualbox instance (vagrant); 2.4 DataNode and NodeManager; 2.5 NameNode, ResourceManager, NDB Data Nodes, HopsWorks, and ElasticSearch
3 Hops Installation: 3.1 Cloud Platforms (AWS, GCE, OpenStack) (3.1.1 Karamel/Chef); 3.2 On-Premises (baremetal) Installation; 3.3 Vagrant (Virtualbox); 3.4 Windows; 3.5 Apple OSX/Mac; 3.6 Hops Chef Cookbooks; 3.7 BiobankCloud Chef Cookbooks
4 HopsWorks User Guide: 4.1 First Login (no 2-Factor Authentication); 4.2 First Login with 2-Factor Authentication; 4.3 Register a New Account on HopsWorks; 4.4 Forgotten Password / Lost Smartphone; 4.5 Update your Profile/Password; 4.6 If it goes wrong; 4.7 Create a New Project; 4.8 Delete a Project; 4.9 Data Set Browser; 4.10 Upload Data; 4.11 Compress Files; 4.12 Share a Data Set; 4.13 Free-text Search; 4.14 Jobs; 4.15 Charon; 4.16 Apache Zeppelin; 4.17 Metadata Management; 4.18 MetaData Designer; 4.19 MetaData Attachment and Entry
5 HopsFS User Guide: 5.1 Unsupported HDFS Features; 5.2 NameNodes (5.2.1 Formatting the Filesystem; 5.2.2 NameNode Caches; 5.2.3 Adding/Removing NameNodes); 5.3 DataNodes; 5.4 HopsFS Clients; 5.5 Compatibility with HDFS Clients; 5.6 HopsFS Async Quota Management; 5.7 Block Reporting
6 Hops-YARN User Guide: 6.1 Removed/Replaced YARN Features; 6.2 ResourceManager (6.2.1 Adding/Removing a ResourceManager); 6.3 YARN Clients; 6.4 YARN NodeManager
7 HopsWorks Administrator Guide: 7.1 Activating users; 7.2 User fails to receive an email to validate her account; 7.3 User receives email, but fails to validate the account; 7.4 Configuring email for HopsWorks; 7.5 User successfully validates the account, but still can't login; 7.6 User account has been disabled due to too many unsuccessful login attempts; 7.7 Disabling a user account; 7.8 Re-activating a user account; 7.9 Managing Project Quotas; 7.10 Disabling/Re-enabling Projects; 7.11 Ubikeys in HopsWorks (7.11.1 Glassfish Administration)
8 HopsFS Configuration: 8.1 Leader Election; 8.2 NameNode Cache; 8.3 Distributed Transaction Hints; 8.4 Quota Management; 8.5 Block Reporting; 8.6 Distributed Unique ID generator; 8.7 Namespace and Block Pool ID; 8.8 Client Configurations; 8.9 Data Access Layer (DAL) (8.9.1 MySQL Cluster Network Database Driver Configuration; 8.9.2 Loading a DAL Driver); 8.10 HopsFS-EC Configuration
9 Hops-YARN Configuration: 9.1 Configuring Hops-YARN fail-over; 9.2 Batch Processing of Operations (9.2.1 Database back pressure; 9.2.2 Proxy provider); 9.3 Configuring Hops-YARN distributed mode
10 Hops Developer Guide: 10.1 Extending HopsFS INode metadata (10.1.1 Example use case; 10.1.2 Adding a table to the schema; 10.1.3 Defining the Entity Class; 10.1.4 Defining the DataAccess interface; 10.1.5 Implementing the DataAccess interface; 10.1.6 Implementing the EntityContext; 10.1.7 Using custom locks); 10.2 Erasure Coding API Access (10.2.1 Java API; 10.2.2 Creation of Encoded Files; 10.2.3 Encoding of Existing Files; 10.2.4 Reverting To Replication Only; 10.2.5 Deletion Of Encoded Files)
11 License Compatibility
CHAPTER ONE
HOPS OVERVIEW
1.1 Audience
This document contains four different guides: installation, user, administration, and developer guides. We recommend the following guides for each type of reader:
• Data Scientists: User Guide
• Hadoop Administrators: Installation Guide, Administration Guide
• Data Curators: User Guide
• Hops Developers: Installation Guide, User Guide, Developer Guide
1.2 Revision History
Date: Nov 2015. Release: 2.4.0. Description: First release of Hops Documentation.
1.3 What is Hops?
Hops is a next-generation distribution of Apache Hadoop that supports:
• Hadoop-as-a-Service,
• Project-Based Multi-Tenancy,
• Secure sharing of DataSets across projects,
• Extensible metadata that supports free-text search using Elasticsearch,
• YARN quotas for projects.
The key innovation that enables these features is a new architecture for scale-out, consistent metadata for both the Hadoop Filesystem (HDFS) and YARN (Hadoop's Resource Manager).
The new metadata layer enables us to support multiple stateless NameNodes and TBs of metadata stored in the MySQL Cluster Network Database (NDB). NDB is a distributed, relational, in-memory, open-source database. This has enabled us to provide services such as tools for designing extended metadata (whose integrity with filesystem data is ensured through foreign keys in the database), and to extend HDFS' metadata to enable new features such as erasure-coded replication, reducing storage requirements by 50% compared to triple replication in Apache HDFS. Extended metadata has also enabled us to implement quota-based scheduling for YARN, where projects can be given quotas of CPU hours/minutes and memory, thus enabling resource usage in Hadoop-as-a-Service to be accounted and enforced.
Hops builds on YARN to provide support for application and resource management. All YARN frameworks can run on Hops, but currently we only provide UI support for general data-parallel processing frameworks such as Apache Spark, Apache Flink, and MapReduce. We also support frameworks used by BiobankCloud for data-parallel bioinformatics workflows, including SAASFEE and Adam. In the future, other frameworks will be added to the mix.
1.4 HopsWorks
HopsWorks is the UI front-end to Hops. It supports user authentication through either a native solution, LDAP, or two-factor authentication. There are both user and administrator views in HopsWorks. HopsWorks implements a perimeter security model, where command-line access to Hadoop services is restricted, and all jobs and interactive analyses are run from the HopsWorks UI and Apache Zeppelin (an iPython-notebook-style web application).
HopsWorks provides first-class support for DataSets and Projects. Each DataSet has a home project. Each project has a number of default DataSets:
• Resources: contains programs and small amounts of data
• Logs: contains outputs (stdout, stderr) for YARN applications
HopsWorks implements dynamic role-based access control for projects. That is, users do not have static global privileges: a user's privileges depend on which project is currently active. For example, the user may be a Data Owner in one project, but only a Data Scientist in another. The following roles are supported.
Fig. 1.1: Dynamic Roles ensure strong multi-tenancy between projects in HopsWorks.
A Data Scientist can
• run interactive analytics through Apache Zeppelin
• run batch jobs (Spark, Flink, MR)
• upload to a restricted DataSet (called Resources) that contains only programs and resources
A Data Owner can
• upload/download data to the project,
• add and remove members of the project
• change the role of project members
• create and delete DataSets
• import and export data from DataSets
• design and update metadata for files/directories/DataSets
HopsWorks covers: users, projects and datasets, analytics, metadata management, and free-text search.
1.4.1 Users
• Users authenticate with a valid email address.
• An optional 2nd factor can be enabled for authentication. Supported devices are smartphones (Android, Apple, Windows) or Yubikey USB sticks.
1.4.2 Projects and DataSets
HopsWorks provides the following features:
• project-based multi-tenancy with dynamic roles;
• CPU hour quotas for projects (supported by HopsYARN);
• the ability to share DataSets securely between projects (reuse of DataSets without copying);
• a DataSet browser;
• import/export of data using the browser.
1.4.3 Analytics
HopsWorks provides two services for executing applications on YARN:
• Apache Zeppelin: interactive analytics for Spark, Flink, and other data-parallel frameworks;
• YARN batch jobs: batch-based submission (including Spark, MapReduce, Flink, Adam, and SaasFee).
1.4.4 MetaData Management
HopsWorks provides support for the design and entry of extended metadata for files and directories:
• design your own extended metadata using an intuitive UI;
• enter extended metadata using an intuitive UI.
1.4.5 Free-text search
HopsWorks integrates with Elasticsearch to provide free-text search for files/directories and their extended metadata:
• global free-text search for projects and DataSets in the filesystem;
• project-based free-text search of all files and extended metadata within a project.
1.5 HopsFS
HopsFS is a new implementation of the Hadoop Filesystem (HDFS), based on Apache Hadoop 2.x (http://hadoop.apache.org/releases.html), that supports multiple stateless NameNodes, where the metadata is stored in an in-memory distributed database (NDB). HopsFS enables NameNode metadata to be both customized and analyzed, because it can be easily accessed via SQL or the native API (NDB API). HopsFS replaces HDFS 2.x's Primary-Secondary Replication model with an in-memory, shared-nothing database. HopsFS provides the DAL-API as an abstraction layer over the database, and implements a leader election protocol using the database. This means HopsFS no longer needs several services required by highly available Apache HDFS: quorum journal nodes, Zookeeper, and the Snapshot server.
Fig. 1.2: HopsFS Architecture.
1.6 HopsYarn
HopsYARN introduces a new metadata layer for Apache YARN, where the cluster state is stored in a distributed, in-memory, transactional database. HopsYARN enables us to provide quotas for Projects, in terms of how many CPU minutes and how much memory are available for use by each project. Quota-based scheduling is built as a layer on top of the capacity scheduler, enabling us to retain the benefits of the capacity scheduler.
Fig. 1.3: Hops YARN Architecture.
Apache Spark
We support Apache Spark for both interactive analytics and jobs.
Apache Zeppelin
Apache Zeppelin is built in to HopsWorks. We have extended Zeppelin with access control, ensuring that only users in the same project can access and share the same Zeppelin notebooks. We will soon provide source-code control for notebooks using GitHub.
Apache Flink Streaming
Apache Flink provides a dataflow processing model and is highly suitable for stream processing. We support it in HopsWorks.
Other Services
HopsWorks is a web application that runs on a highly secure Glassfish server. ElasticSearch is used to provide free-text search services. MySQL Cluster (NDB) stores the metadata for both Hops and HopsWorks.
1.7 BiobankCloud
BiobankCloud extends HopsWorks with platform-specific support for Biobanking and Bioinformatics. These services are:
• An audit log for user actions;
• Project roles compliant with the draft European General Data Protection Regulation;
• Consent form management for projects (studies);
• Charon, a service for securely sharing data between clusters using public clouds;
• SaasFee (Cuneiform), a YARN-based application for building scalable bioinformatics pipelines.
CHAPTER TWO
SYSTEM REQUIREMENTS
The Hops stack can be installed on both cloud platforms and on-premises (baremetal). The recommended machine specifications given below do not take into account whether local storage or a cloud storage platform is used.
For best performance due to improved data locality, we recommend local storage (instance storage in Amazon Web Services (AWS)/EC2). 2.1 Recommended Setup We recommend either Ubuntu/Debian or CentOS/Redhat as operating system (OS), with the same OS on all machines. A typical deployment of Hops Hadoop uses • DataNodes/NodeManagers: a set of commodity servers in a 12-24 SATA hard-disk JBOD setup; • NameNodes/ResourceManagers/NDB-database-nodes/HopsWorks-app-server: a homogeneous set of commodity (blade) servers with good CPUs, a reasonable amount of RAM, and one or two hard-disks; • MySQL Cluster Data nodes: a homogeneous set of commodity (blade) servers with a good amount of RAM (up to 512 GB) and good CPU(s). A good quality SATA disk is needed to store database logs. SSDs can also be used, but are typically not required. • Hopsworks: a single commodity (blade) server with a good amount of RAM (up to 128 GB) and good CPU(s). A good quality disk is needed to store logs. Either SATA or a large SSD can be used. For cloud platforms, such as AWS, we recommend using enhanced networking for the MySQL Cluster Data Nodes and the NameNodes/ResourceManagers. High latency connections between these machines will negatively affect system throughput. 2.2 Entire Hops platform on a single baremetal machine You can run HopsWorks and the entire Hops stack on a bare-metal single machine for development or testing purposes, but you will need at least: 10 Component Operating System RAM CPU Hard disk space Network Minimum Requirements Linux, Mac 8 GB of RAM 2 GHz dual-core minimum. 64-bit. 15 GB free space 1 Gb Ethernet 2.3 Entire Hops platform on a single virtualbox instance (vagrant) You can run HopsWorks and the entire Hops stack on a single virtualbox instance for development or testing purposes, but you will need at least: Component Operating System RAM CPU Hard disk space Network Minimum Requirements Linux, Mac, Windows (using Virtualbox) 10 GB of RAM 2 GHz dual-core minimum. 64-bit. 15 GB free space 1 Gb Ethernet 2.4 DataNode and NodeManager A typical deployment of Hops Hadoop installs both the Hops DataNode and NodeManager on a set of commodity servers, running without RAID (replication is done in software) in a 12-24 harddisk JBOD setup. Depending on your expected workloads, you can put as much RAM and CPU in the nodes as needed. Configurations can have up to (and probably more) than 512 GB RAM and 32 cores. The recommended setup for these machines in production (on a cost-performance basis) is: Component Operating System RAM CPU Hard disk Network Recommended (late 2015) Linux, Mac, Windows (using Virtualbox) 128 GB RAM Two CPUs with 12 cores. 64-bit. 12 x 4 TB SATA disks 1 Gb Ethernet 2.5 NameNode, ResourceManager, NDB Data Nodes, HopsWorks, and ElasticSearch NameNodes, ResourceManagers, NDB database nodes, ElasticSearch, and the HopsWorks application server require relatively more memory and not as much hard-disk space as DataNodes. The machines can be blade servers with only a disk or two. SSDs will not give significant performance improvements to any of these services, except the HopsWorks application server if you copy a lot of data in and out of the cluster via HopsWorks. The NDB database nodes will require free disk space that is at least 20 times the size of the RAM they use. Depending on how large your cluster is, the ElasticSearch server can be colocated with the HopsWorks application server or moved to its own machine with lower RAM and CPU requirements than the other services. 
1 GbE gives great performance, but 10 GbE really makes it rock! You can deploy 10 GbE incrementally: first between the NameNodes/ResourceManagers <–> NDB database nodes to improve metadata processing performance, and then on the wider cluster. The recommended setup for these machines in production (on a cost-performance basis) is: Component Operating System RAM CPU Hard disk Network Recommended (late 2015) Linux, Mac, Windows (using Virtualbox) 128 GB RAM Two CPUs with 12 cores. 64-bit. 12 x 4 TB SATA disks 1 Gb Ethernet CHAPTER THREE HOPS INSTALLATION The Hops stack includes a number of services also requires a number of third-party distributed services: • Java 1.7 (OpenJDK or Oracle JRE/JDK) • NDB 7.4+ (MySQL Cluster) • J2EE7 web application server (default: Glassfish) • ElasticSearch 1.7+ Due to the complexity of installing and configuring all Hops’ services, we recommend installing Hops using the automated installer Karamel/Chef (http://www.karamel.io). We do not provide detailed documentation on the steps for installing and configuring all services in Hops. Instead, Chef cookbooks contain all the installation and configuration steps needed to install and configure Hops. The Chef cookbooks are available at: https://github.com/hopshadoop. 3.1 Cloud Platforms (AWS, GCE, OpenStack) Hops can be installed on a cloud platform using Karamel/Chef. 3.1.1 Karamel/Chef 1. Download and install Karamel (http://www.karamel.io). 2. Run Karamel. 3. Click on the “Load Cluster Definition” menu item in Karamel. You are now prompted to select a cluster definition YAML file. Go to the examples/stable directory, and select a cluster definition file for your target cloud platform for one of the following cluster types: (a) Amazon Web Services EC2 (AWS) (b) Google Compute Engine (GCE) (c) OpenStack (d) On-premises (bare metal) 13 For more information on how to configure cloud-based installations, go to help documentation at http://www.karamel.io. For on-premises installations, we provide some additional installation details and tips later in this section. Choosing which services to run on which nodes You now need to decide which services you will install on which nodes. In Karamel, we design a set of Node Groups, where each Node Group defines a stack of services to be installed on a machine. Each machine will only have one Node Group set of services. We now provide two recommended setup: • a single node cluster that includes all services on a single node. • a tiny cluster set of heavy stacks that includes a lot of services on each node. • a small cluster set of heavy stacks that includes lots of services on each node. • a large cluster set of light stacks that includes fewer services on each node. Single Node Setup You can run the entire HopsWorks application platform on a single node. You will have a NodeGroup with the following services on the single node: 1. HopsWorks, Elasticsearch, Zeppelin, MySQL Server, NDB Mgmt Server, HDFS NameNode, YARN ResourceManager, NDB Data Node(s), HDFS DataNode, YARN NodeManager Tiny Cluster Setup We recommend the following setup that includes the following NodeGroups, and requires at least 2 nodes to be deployed: 1. HopsWorks, Elasticsearch, Zeppelin, MySQL Server, NDB Mgmt Server, HDFS NameNode, YARN ResourceManager, NDB Data Node 2. HDFS DataNode, YARN NodeManager This is really only a test setup, but you will have one node dedicated to YARN applications and file storage, while the other node handles the metadata layer services. 
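To make this concrete, the following is a minimal sketch of the groups section of a Karamel cluster definition for such a two-NodeGroup setup. The group names (metadata, workers) are arbitrary, and the recipe lists simply reuse recipe names from the cookbooks listed elsewhere in this document; adapt the sizes and recipe lists to your own environment:

groups:
  metadata:
    size: 1
    recipes:
      - ndb::mgmd
      - ndb::mysqld
      - ndb::ndbd
      - hops::ndb
      - hops::nn
      - hops::rm
      - elastic
      - hopsworks
      - zeppelin
  workers:
    size: 1
    recipes:
      - hops::ndb
      - hops::dn
      - hops::nm

This groups section would sit below the name, provider (for example, ec2), cookbooks, and attrs sections shown in the cluster definition examples earlier.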
Small Cluster Setup We recommend the following setup that includes four NodeGroups, and requires at least 4 nodes to be deployed: 1. HopsWorks, Elasticsearch, Zeppelin, MySQL Server, NDB Mgmt Server, 2. HDFS NameNode, YARN ResourceManager, MySQL Server 3. NDB Data Node 4. HDFS DataNode, YARN NodeManager A highly available small cluster would require at least two instances of the last three NodeGroups. HopsWorks can also be deployed on mulitple instances, but Elasticsearch needs to be specially configured if it is to be sharded across many insances. Large Cluster Setup We recommend the following setup that includes six NodeGroups, and requires at least 4 nodes to be deployed: 1. Elasticsearch 2. HopsWorks, Zeppelin, MySQL Server, NDB Mgmt Server 3. HDFS NameNode, MySQL Server 4. YARN ResourceManager, MySQL Server 5. NDB Data Node 6. HDFS DataNode, YARN NodeManager A highly available large cluster would require at least two instances of every NodeGroup. HopsWorks can also be deployed on mulitple instances, while Elasticsearch needs to be specially configured if it is to be sharded across many insances. Otherwise, the other services can be easily scaled out by simply adding instances in Karamel. For improved performance, the metadata layer could be deployed on a better network (10 GbE at the time of writing), and the last NodeGroup (DataNode/NodeManager) instances could be deployed on cheaper network infrastructure (bonded 1 GbE or 10 GbE, at the time of writing). HopsWorks Configuration in Karamel Karamel Chef recipes support a large number of parameters that can be set while installing Hops. These parameters include, but are not limited to,: • usernames to install and run services as, • usernames and passwords for services, and • sizing and tuning configuration parameters for services (resources used, timeouts, etc). Here are some of the most important security parameters to set when installing services: • Superuser username and password for the MySQL Server(s) * Default: ‘kthfs’ and ‘kthfs’ • Administration username and password for the Glassfish administration account(s) * Default: ‘adminuser’ and ‘adminpw’ • Administration username and password for HopsWorks * Default: ‘[email protected] ‘ and ‘admin’ Here are some of the most important sizing configuration parameters to set when installing services: • DataMemory for NDB Data Nodes • YARN NodeManager amount of memory and number of CPUs • Heap size and Direct Memory for the NameNode • Heap size for Glassfish • Heap size for Elasticsearch 3.2 On-Premises (baremetal) Installation For on-premises (bare-metal) installations, you will need to prepare for installation by: 1. identifying a master host, from which you will run Karamel; 1 ’[email protected] (a) the master must have a display for Karamel’s user interface; (b) the master must be able to ping (and connect using ssh) to all of the target hosts. 2. identifying a set of target hosts, on which the Hops software and 3rd party services will be installed. (a) the target nodes should have http access to the open Internet to be able to download software during the installation process. (Cookbooks can be configured to download software from within the private network, but this requires a good bit of configuration work for Chef attributes, changing all download URLs). The master must be able to connect using SSH to all the target nodes, on which the software will be installed. 
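Before launching the cluster, it can save time to verify this connectivity from the master host first. The following is a small sketch (not part of the Karamel installation itself), assuming a hosts.txt file with one target IP address per line, as used in the key-distribution steps below:

# Verify that each target host answers ping and accepts connections on the ssh port (22)
while read host; do
  ping -c 1 -W 2 "$host" > /dev/null && echo "$host: ping OK" || echo "$host: ping FAILED"
  nc -z -w 2 "$host" 22 && echo "$host: ssh port open" || echo "$host: ssh port closed"
done < hosts.txt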
If you have not already copied the master’s public key to the .ssh/authorized_keys file of all target hosts, you can do so by preparing the machines as follows: 1. Create an openssh public/private key pair on the master host for your user account. On Linux, you can use the ssh-keygen utility program to generate the keys, which will by default be stored in the $HOME/.ssh/id_rsa and $HOME/.ssh/id_rsa.pub files. If you decided to enter a password for the ssh keypair, you will need to enter it again in Karamel when you reach the ssh dialog, part of Karamel’s Launch step. We recommend no password (passwordless) for the ssh keypair. 2. Create a user account USER on the all the target machines with full sudo privileges (root privileges) and the same password on all target machines. 3. Copy the $HOME/.ssh/id_rsa.pub file on the master to the /tmp folder of all the target hosts. A good way to do this is to use pscp utility along with a file (hosts.txt) containing the line-separated hostnames (or IP addresss) for all the target machines. You may need to install the pssh utility programs (pssh), first. $sudo apt-get install pssh or $yum install pssh $vim hosts.txt # Enter the row-separated IP addresses of all target nodes in hosts.txt 128.112.152.122 18.31.0.190 128.232.103.201 ..... $pscp -h hosts.txt -P PASSWORD -i USER ~/.ssh/id_rsa.pub /tmp $pssh -h hosts.txt -i USER -P PASSWORD mkdir -p /home/USER/.ssh $pssh -h hosts.txt -i USER -P PASSWORD cat /tmp/id_rsa.pub >> /home/USER/.ssh/authorized_keys Update your Karamel cluster definition file to include the IP addresses of the target machines and the USER account name. After you have clicked on the launch menu item, you will come to a Ssh dialog. On the ssh dialog, you need to open the advanced section. Here, you will need to enter the password for the USER account on the target machines (sudo password text input box). If your ssh keypair is password protected, you will also need to enter it again here in the keypair password text input box. Note Redhat/Centos is not yet supported by Karamel, but you can install Hops using Chef-solo by logging into each machine separately. The chef cookbooks are written to work for both the Debian/Ubuntu and Redhat/Centos platform families. 3.3 Vagrant (Virtualbox) You can install HopsWorks and Hops on your laptop/desktop with Vagrant. You will need to have the following software packages installed: • chef-dk, version >0.5+ (but not >0.8+) • git • vagrant • vagrant omnibus plugin • virtualbox You can now run vagrant, using: $ $ $ $ $ $ sudo apt-get install virtualbox vagrant vagrant plugin install vagrant-omnibus git clone https://github.com/hopshadoop/hopsworks-chef.git cd hopsworks-chef berks vendor cookbooks vagrant up You can then access Hopsworks from your browser at http://127.0.0.1:8080/hopsworks. The default credentials are: username: [email protected] password: admin You can access the Hopsworks administration application http://127.0.0.1:8080/hopsworks/index.xhtml. The default credentials are: from your browser at username: [email protected] password: admin The Glassfish web application server is also available from your browser at http://127.0.0.1:4848. The default credentials are: username: adminuser password: adminpw The MySQL Server is also available from the command-line, if you ssh into the vagrant host (vagrant ssh). The default credentials are: username: kthfs password: kthfs It goes without saying, but for production deployments, we recommend changing all of these default credentials. 
The credentials can all be changed in Karamel during the installation phase. 3.4 Windows You can also install HopsWorks on vagrant and Windows. You will need to follow the vagrant instructions as above (installing the same software packages) aswell as installing: • Powershell After cloning the github repo, from the powershell, you can run: $ cd hopsworks-chef $ berks vendor cookbooks $ vagrant up 3.5 Apple OSX/Mac You can follow the Vagrant instructions above for Linux to install for OSX. Note that MySQL Cluster is not recommended for production installation on OSX, although it is OK, for developmenet setups. 3.6 Hops Chef Cookbooks Hops’ automated installation is orchestrated by Karamel and the installation/configuration logic is written as ruby programs in Chef. Chef supports the modularization of related programs in a unit of software, called a Chef cookbook. A Chef cookbook can be seen as a collection of programs, where each program contains instructions for how to install and configure software services. A cookbook may consist one or more programs that are known as recipes. These Chef recipes are executed by either a Chef client (that can talk to a Chef server) or chef-solo, a standalone program that has no dependencies on a Chef Server. Karamel uses chef-solo to execute Chef recipes on nodes. The benefit of this approach is that it is agentless. That is, Karamel only needs ssh to be installed on the target machine to be able to install and setup Hops. Karamel also provides dependency injection for Chef recipes, supplying the parameters (Chef attributes) used to execute recipes. Some stages/recipes return results (such as the IP address of the NameNode) that are used in subsequent recipes (for example, to generate configuration files containing the IP address of the NameNode, such as core-site.xml). The following is a brief description of the Chef cookbooks that we have developed to support the installation of Hops. The recipes have the naming convention: <cookbook>/<recipe>. You can determine the URL for each cookbook by prefixing the name with http://github.com/. All of the recipes have been karamelized, that is a Karamelfile containing orchestration rules has been added to all cookbooks. • hopshadoop/apache-hadoop-chef – This cookbook contains recipes for installing the Apache Hadoop services: HDFS NameNode (hadoop::nn), HDFS DataNode (hadoop::dn), YARN ResourceManager (hadoop::rm), YARN NodeManager (hadoop::nm), Hadoop Job HistoryServer for MapReduce (hadoop::jhs), Hadoop ProxyServer (hadoop::ps). • hopshadoop/hops-hadoop-chef – This cookbook contains is a wrapper cookbook for the Apache Hadoop cookbook. It install Hops, but makes use of the Apache Hadoop Chef cookbook to install and configure software. The recipes it provides are: HopsFS NameNode (hops::nn), HopsFS DataNode (hops::dn), HopsYARN ResourceManager (hops::rm), HopsYARN NodeManager (hops::nm), Hadoop Job HistoryServer for MapReduce (hops::jhs), Hadoop ProxyServer (hops::ps). • hopshadoop/elasticsearch-chef – This cookbook is a wrapper cookbook for the official Elasticsearch Chef cookbook, but it has been extended with Karamel orchestration rules. • hopshadoop/ndb-chef – This cookbook contains recipes for installing MySQL Cluster services: NDB Management Server (ndb::mgmd), NDB Data Node (ndb::ndbd), MySQL Server (ndb::mysqld), Memcached for MySQL Cluster (ndb::memcached). • hopshadoop/zeppelin-chef – This cookbook contains a default recipe for installing Apache Zeppelin. 
• hopshadoop/hopsworks-chef – This cookbook contains a default recipe for installing HopsWorks. • hopshadoop/spark-chef – This cookbook contains recipes for installing the Apache Spark Master, Worker, and a YARN client. • hopshadoop/flink-chef – This cookbook contains recipes for installing the Apache Flink jobmanager, taskmanager, and a YARN client. 3.7 BiobankCloud Chef Cookbooks • biobankcloud/charon-chef This cookbook contains a default recipe for installing Charon. • biobankcloud/hiway-chef This cookbook contains recipes for installing HiWAY, Cuneiform, the BiobankCloud workflows, and some example workflows. CHAPTER FOUR HOPSWORKS USER GUIDE If you are using 2-Factor authentication, jump ahead to “First Login with 2-Factor Authentication”. 4.1 First Login (no 2-Factor Authentication) Fig. 4.1: HopsWorks Login Page On initial installation, you can login with the default username and password. username: [email protected] password: admin If you manage to login successfully, you will arrive on the landing page: align center 20 figclass align-center HopsWorks Landing (Home) Page In the landing page, you can see a box for projects, a search bar (to find projects and data sets), an audit trail, and user menu (to change user settings or log out). If it goes wrong If login does not succeed, something has gone wrong during installation. The possible sources of error and the Web Application Server (Glassfish) and the database (MySQL Clusters). Actions: • Double-check that system meets the minimum system requirements for HopsWorks. Is there enough available disk space and memory? • Re-run the installation, as something may have gone wrong during installation. • Investigate Glassfish misconfiguration problems. Is Glassfish running? is the hopsworks.war application installed? Are the JDBC connections working? Is JavaMail configured correctly? • Investigate MySQL Cluster misconfiguration problems. Are the mgm server, data nodes, and MySQL server running? Do the hops and hopsworks databases exist and are they populated with tables and rows? If not, something went wrong during installation. 4.2 First Login with 2-Factor Authentication For 2-Factor Authentication, you cannot login directly via the web browser. You first need to generate your 2nd factor credentials for the default account ([email protected], admin). Login to the target machine where HopsWorks is installed, and run: sudo /bin/hopsworks-2fa It should return something like: +--------------+------------------+ | email | secret | +--------------+------------------+ | [email protected] | V3WBPS4G2WMQ53VA | +--------------+------------------+ Fig. 4.2: Google Authenticator - Enter the Provided Key V3WBPS4G2WMQ53VA for [email protected] as a Time-Based Key. You now need to start Google Authenticator on your smartphone. If you don’t have ‘Google Authenticator’ installed, install it from your app store. It is available for free on: • Android as Google Authenticator • iOS (Apple iPhone) as OTP Auth), and • Windows Phone as Microsoft Authenticator). After starting your Google Authenticator application, create an account (set up account), and add as the account email the default installation email address ([email protected]) and add as the provided key , the secret value returned by /bin/hopsworks-2fa (for example, ‘V3WBPS4G2WMQ53VA’). The key is a time-based key, if you need to specify the type of provided key. This should register your second factor on your phone. 
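If you want to double-check the secret without a smartphone, the same time-based code can be computed on the command line. This is only an optional sanity check and assumes the oathtool utility (from the oath-toolkit project) is available on your machine; it is not part of the HopsWorks installation:

# Install oathtool (Debian/Ubuntu) and compute the current 6-digit code from the base32 secret
sudo apt-get install oathtool
oathtool --totp -b V3WBPS4G2WMQ53VA

The six digits printed should match the code that Google Authenticator displays for the same secret at the same moment.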
You can now go to the start-page on Google Authenticator to read the six-digit one-time password (OTP). Note that the OTP is updated every 30 seconds. On HopsWorks login page, you will need to supply the 6-digit number (OTP) shown for [email protected] when on the login page, along with the username and password. Fig. 4.3: HopsWorks Two-Factor Authentication Login Page 4.3 Register a New Account on HopsWorks The process for registering a new account is as follows: 1. Register your email address and details and use the camera from within Google Authenticator to store your 2nd factor credential; 2. Validate your email address by clicking on the link in the validation email you received; 3. Wait until an administrator has approved your account (you will receive a confirmation email). Fig. 4.4: HopsWorks User Registration Page Register a new account with a valid email account. If you have two-factor authentication enabled, you will then need to scan the QR code to save it on your phone. If you miss this step, you will have to recover your smartphone credentials at a later stage. In both cases, you should receive an email asking you to validate your account. The sender of the email will be either the default [email protected] or a gmail address that was supplied while installing HopsWorks. If you do not receive an email, wait a minute. If you still haven’t received it, you should contact the administrator. Validate the email address used in registration If you click on the link supplied in the registration email, it will validate your account. You will not be able to login until an administrator has approved your account. 1 . After your account has been approved, you can now go to HopsWork’s login page and start your Google Authenticator application on your smartphone. On HopsWorks login page, you will need to enter 1 If you are an administrator, you can jump now to the Hops Administration Guide to see how to validate account registrations, if you have administrator privileges. Fig. 4.5: Two-factor authentication: Scan the QR Code with Google Authenticator • the email address your registered with • the password you registered with • on Google Authenticator find the 6-digit number shown for the email address your registered with and enter it into HopsWorks. 4.4 Forgotten Password / Lost Smartphone If you forget your password or lose your 2nd factor device (smartphone or yubikey), you will need to recover your credentials. On the login screen, click on Need Help? to recover your password or replace the QR code for your smartphone. 4.5 Update your Profile/Password After you have logged in, in the upper right-hand corner of the screen, you will see your email address with a caret icon. Click on the caret icon, then click on the menu item Account. A dialog will pop-up, from where you can change your password and other parts of your profile. You cannot change your email address and will need to create a new account if you wish to change your email address. You can also logout by clicking on the sign out menu item. 4.6 If it goes wrong Contact an administrator or go to the Administration Guide section of this document. If you are an administrator: • Does your organization have a firewall that blocks outbound SMTP access? HopsWorks needs SMTP outbound access over TLS using SSL (port 587 or 465). • Is the Glassfish server up and running? Can you login to the Glassfish Administration console (on port 4848)? • Inside Glassfish, check the JavaMail settings. Is the gmail username/password correct? 
Are the SMTP server settings correct (hostname/ip, port, protocol (SSL, TLS))? User fails to receive an email to validate her account • This may be a misconfigured gmail address/password or a network connectivity issue. • Does your organization have a firewall that blocks outbound SMTP access? • For administrators: was the correct gmail username/password supplied when installing? • If you are not using a Gmail address, are the smtp server settings correct (ip-address or hostname, port, protocol (SSL, TLS))? User receives the validate-your-email message, but is not able to validate the account • Can you successfully access the HopsWorks homepage? If not, there may be a problem with the network or the webserver may be down. • Is the Glassfish webserver running and hopsworks.war application installed, but you still can’t logon? It may be that MySQL Cluster is not running. • Check the Glassfish logs for problems and the Browser logs. User successfully validates the account, but still can’t login The user account status may not be in the correct state, see next section for how to update user account status. User account has been disabled due to too many unsuccessful login attempts From the HopsWorks administration application, the administrator can re-enable the account by going to “User Administration” and taking the action “Approve account”. User account has been disabled due to too many unsuccessful login attempts Contact your system administrator who will re-enable your account. 4.7 Create a New Project You can create a project by clicking on the New button in the Projects box. This will pop-up a dialog, in which you enter the project name, an optional description, and select an optional set of services to be used in the project. You can also select an initial set of members for the project, who will be given the role of Data Scientist in the project. Member roles can later be updated in the Project settings by the project owner or a member with the data owner role. 4.8 Delete a Project Right click on the project to be deleted in the projects box. You have the options to: • Remove and delete data sets; – If the user deletes the project, the files are moved to trash in HopsFS; • Remove and keep data sets. 4.9 Data Set Browser The Data Set tab enables you to browse Data Sets, files and directories in this project. It is mostly used as a file browser for the project’s HDFS subtree. You cannot navigate to directories outside of this project’s subtree. 4.10 Upload Data Files can be uploaded using HopsWorks’ web interface. Go to the project you want to upload the file(s) to. You must have the Data Owner role for that project to be able to upload files. In the Data Sets tab, you will see a button Upload Files. Option Upload File Description You have to have the Data Owner role to be able to upload files. Click on the Upload File button to select a file from your local disk. Then click Upload All to upload the file(s) you selected. You can also upload folders. 4.11 Compress Files HopFS supports erasure-coding of files, which reduces storage requirements for large files by roughly 50%. If a file consists of 6 file blocks or more (that is, if the file is larger than 384 MB in size, for a default block size of 64 MB), then it can be compressed. Smaller files cannot be compressed. 4.12 Share a Data Set Only a data owner or the project owner has privileges to share Data Sets. 
To share a Data Set, go to the Data Sets Browser in your project, and right-click on the Data Set to be shared and then select the Share option. A popup dialog will then prompt you to select (1) a target project with which the Data Set is to be Shared and whether the Data Set will be shared as read-only (Can View) or as read-write (Can edit). To complete the sharing process, a Data Owner in the target project has to click on the shared Data Set, and then click on Acccept to complete the process. 4.13 Free-text Search Option Search from Landing Page Search from Project Page Description On landing page, enter the search term in the search bar and press return. Returns project names and Data Set names that match the entered term. From within the context of a project, enter the search term in the search bar and press return. The search returns any files or directories whose name or extended metadata matches the search term. 4.14 Jobs The Jobs tabs is the way to create and run YARN applications. HopsWorks supports the following YARN applications: • Apache Spark, • Apache Flink, • MapReduce (MR), • Adam (a bioinformatics data parallel framework), • SAASFEE (HiWAY/Cuneiform) (a bioinformatics data parallel framework). Option New Job Description Create a Job for any of the following YARN frameworks by clicking New Job : Spark/MR/Flink/Adam/Cuneiform. • Step 1: enter job-specific parameters • Step 2: enter YARN parameters. • Step 3: click on Create Job. Run Job After a job has been created, it can be run by clicking on its Run button. The logs for jobs are viewable in HopsWorks, as stdout and stderr files. These output files are also stored in the Logs/<app-framework>/<log-files> directories. After a job has been created, it can be edited, deleted, and scheduled by clickin on the More actions button. 4.15 Charon Charon is a cloud-of-clouds filesystem that enables the sharing of data between Hops clusters using public clouds. To do share data with a target cluster, you need to: • acquire the cluster-id of the target cluster and enter it as a cluster-id in the Charon service UI - you can read the cluster-id at the top of the page for the Charon service; • enter a token-id that is used as a secret key between the source and target cluster; • select a folder to share with the target cluster-id; • copy files to the shared folder from HDFS that you wish to share with the target cluster; • the files within that folder are copied to the public cloud(s), from where they are downloaded to the target cluster. 4.16 Apache Zeppelin Apache Zeppelin is an interactive notebook web application for running Spark or Flink code on Hops YARN. You can turn interpreters for Spark/Flink/etc on and off in the Zeppelin tab, helping, respectively, to reduce time required to execute a Note (paragraph) in Zeppelin or reclaim resources. More details can be found at: https://zeppelin.incubator.apache.org/ 4.17 Metadata Management Metadata enables data curation, that is, ensuring that data is properly catalogued and accessible to appropriate users. Metadata in HopsWorks is used primarily to discover and and retrieve relevant data sets or files by users by enabling users to attach arbitrary metadata to Data Sets, directories or files in HopsWorks. Metadata is associated with an individual file or Data Set or directory. This extended metadata is stored in the same database as the metadata for HopsFS and foreign keys link the extended metadata with the target file/directory/Data Set, ensuring its integrity. 
Extended metadata is exported to Elastic Search, from where it can be queried and the associated Data Set/Project/file/directory can be identified (and acted upon). 4.18 MetaData Designer Within the context of a project, click on the Metadata Designer button in the left-hand panel. It will bring up a metadata designer view that can be used to: • Design a new Metadata Template • Extend an existing Metadata Template • Import/Export a Metadata Template Within the Metadata Designer, you can define a Metadata template as one or more tables. Each table consists of a number of typed columns. Supported column types are: • string • single-select selection box • multi-select selection box Columns can also have constraints defined on them. On a column, click on cog icon (configure), where you can make the field: • searchable: included in the Elastic Search index; • required: when entering metadata, this column will make it is mandatory for users to enter a value for this column. 4.19 MetaData Attachment and Entry Within the context of a project, click on the Data Sets tab. From here, click on a Data Set. Inside the Data Set, if you select any file or directory, the rightmost panel will display any extended metadata associated with the file or directory. If no extended metadata is assocated with the file/directory, you will see “No metadata template attached” in the rightmost panel. You can attach an existing metadata template to the file or directory by right-clicking on it, and selecting Add metadata template. The metadata can then be selected from the set of available templates (designed or uploaded). After one or more metadata templates have been attached to the file/directory, if the file is selected, the metadata templates are now visible in the rightmost panel. The metadata can be edited in place by clicking on the + icon beside the metadata attribute. More than one extended metadata value can be added for each attribute, if the attribute is a string attribute. Metadata values can also be removed, and metadata templates can be removed from files/directories using the Data Set service. CHAPTER FIVE HOPSFS USER GUIDE HopsFS consist of the following types of nodes: NameNodes, DataNodes, and Clients. All the configurations parameters are defined in core-site.xml and hdfs-site.xml files. Currently Hops only supports non-secure mode of operations. As Hops is a fork of the Hadoop code base, most of the Hadoop configuration parameters and features are supported in Hops. In the following sections we highlight differences between HDFS and HopsFS and point out new configuration parameters and the parameters that are not supported due to different metadata management scheme . 5.1 Unsupported HDFS Features HopsFS is a drop-in replacement for HDFS and it supports most of the configuration1 parameters defined for Apache HDFS. As the architecture of HopsFS is fundamentally different from HDFS, some of the features such as journaling, secondary NameNode etc., are not required in HopsFS. Following is the list of HDFS features and configurations that are not applicable in HopsFS • Secondary NameNode The secondary NameNode is no longer supported. HopsFS supports multiple active NameNodes. Thus hdfs haadmin * command; and dfs.namenode.secondary.* and dfs.ha.* configuration parameters are not supported in HopsFS. • Checkpoint Node and FSImage HopsFS does not require checkpoint node as all the metadata is stored in NDB. 
Thus hdfs dfsadmin -{saveNamespace | metaSave | restoreFailedStorage | rollEdits | fetchImage} command; and dfs.namenode.name.dir.*, dfs.image.*, dfs.namenode.checkpoint.* configuration parameters are not supported in HopsFS. • Quorum Based Journaling and EditLog The write ahead log (EditLog) is not needed as all the metadata mutations are stored in NDB. Thus dfs.namenode.num.extra.edits.*, dfs.journalnode.* and dfs.namenode.edits.* configuration parameters are not supported in HopsFS. • NameNode Federation and ViewFS In HDFS the namespace is statically partitioned among multiple namenodes to support large namespace. In essence these are independent HDFS clusters where ViewFS provides a unified view of the namespace. HDFS Federation and ViewFS are no longer supported as the namespace in HopsFS scales to billions of files and directories. Thus dfs.nameservices.* configuration parameters are not supported in HopsFS. 1 http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml 31 • ZooKeeper ZooKeeper is no longer required as the coordination and membership service. A coordination and membership management service is implemented using the transactional shared memory (NDB). As HopsFS is under heavy development some features such as rolling upgrades and snapshots are not yet supported. These features will be activated in future releases. 5.2 NameNodes HopsFS supports multiple NameNodes. A NameNode is configured as if it is the only NameNode in the system. Using the database a NameNode discovers all the existing NameNodes in the system. One of the NameNodes is declared the leader for housekeeping and maintenance operations. All the NameNodes in HopsFS are active. Secondary NameNode and Checkpoint Node configurations are not supported. See section (page 36) for detail list of configuration parameters and features that are no longer supported in HopsFS. For each NameNode define fs.defaultFS configuration parameter in the core-site.xml file. In order to load NDB driver set the dfs.storage.driver.* parameters in the hdfs-site.xml file. These parameter are defined in detail here (page 48). A detailed description of all the new configuration parameters for leader election, NameNode caches, distributed transaction handling, quota management, id generation and client configurations are defined here (page 44). The NameNodes are started/stopped using the following commands (executed as HDFS superuser): > $HADOOP_HOME/sbin/start-nn.sh > $HADOOP_HOME/sbin/stop-nn.sh The Apache HDFS commands for starting/stopping NameNodes can also be used: > $HADOOP_HOME/sbin/hadoop-daemon.sh --script hdfs start namenode > $HADOOP_HOME/sbin/hadoop-daemon.sh --script hdfs stop namenode Configuring HopsFS NameNode is very similar to configuring a HDFS NameNode. While configuring a single Hops NameNode, the configuration files are written as if it is the only NameNode in the system. The NameNode automatically detects other NameNodes using NDB. 5.2.1 Formating the Filesystem Running the format command on any NameNode truncates all the tables in the database and inserts default values in the tables. NDB atomically performs the truncate operation which can fail or take very long time to complete for very large tables. In such cases run the /hdfs namenode -dropAndCreateDB command first to drop and recreate the database schema followed by the format command to insert default values in the database tables. 
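As a sketch, the two-step alternative described above looks roughly as follows when executed as the HDFS superuser on one of the NameNodes (assuming the hdfs binary is located under $HADOOP_HOME/bin):

> $HADOOP_HOME/bin/hdfs namenode -dropAndCreateDB
> $HADOOP_HOME/bin/hdfs namenode -format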
In NDB dropping and recreating a database is much quicker than truncating all the tables in the database. 5.2.2 NameNode Caches In published Hadoop workloads, metadata accesses follow a heavy-tail distribution where 3% of files account for 80% of accesses. This means that caching recently accessed metadata at NameNodes could give a significant performance boost. Each NameNode has a local cache that stores INode objects for recently accessed files and directories. Usually, the clients read/write files in the same sub-directory. Using RANDOM_STICKY load balancing policy to distribute filesystem operations among the NameNodes lowers the latencies for filesystem operations as most of the path components are already available in the NameNode cache. See HopsFS Client’s (page 33) and Cache Configuration Parameters (page 44) for more details. 5.2.3 Adding/Removing NameNodes As the namenodes are stateless any NameNode can be removed with out effecting the state of the system. All on going operations that fail due to stopping the NameNode are automatically forwarded by the clients to the remaining namenodes in the system. Similarly, the clients automatically discover the newly started namenodes. See client configuration parameters (page 47) that determines how quickly a new NameNode starts receiving requests from the existing clients. 5.3 DataNodes The DataNodes periodically acquire an updated list of NameNodes in the system and establish a connection (register) with the new NameNodes. Like clients, the DataNodes also uniformly distribute the filesystem operations among all the NameNodes in the system. Currently the DataNodes only support round-robin policy to distribute the filesystem operations. HopsFS DataNodes configuration is identical to HDFS DataNodes. In HopsFS a DataNode connects to all the NameNodes. Make sure that the fs.defaultFS parameter points to valid NameNode in the system. The DataNode will connect to the NameNode and obtain a list of all the active NameNodes in the system, and then connects/registers with all the NameNodes in the system. The DataNodes can started/stopped using the following commands (executed as HDFS superuser): > $HADOOP_HOME/sbin/start-dn.sh > $HADOOP_HOME/sbin/stop-dn.sh The Apache HDFS commands for starting/stopping Data Nodes can also be used: > $HADOOP_HOME/sbin/hadoop-deamon.sh --script hdfs start datanode > $HADOOP_HOME/sbin/hadoop-deamon.sh --script hdfs stop datanode 5.4 HopsFS Clients For load balancing the clients uniformly distributes the filesystem operations among all the NameNodes in the system. HopsFS clients support RANDOM, ROUND_ROBIN, and RANDOM_STICKY policies to distribute the filesystem operations among the NameNodes. Random and round-robin policies are self explanatory. Using sticky policy the filesystem client randomly picks a NameNode and forwards all subsequent operation to the same NameNode. If the NameNode fails then the clients randomly picks another NameNode. This maximizes the NameNode cache hits. In HDFS the client connects to the fs.defaultFS NameNode. In HopsFS, clients obtain the list of active NameNodes from the NameNode defined using fs.defaultFS parameter. The client then uniformly distributes the subsequent filesystem operations among the list of NameNodes. In core-site.xml we have introduced a new parameter dfs.namenodes.rpc.addresses that holds the rpc address of all the NameNodes in the system. 
If the NameNode pointed to by fs.defaultFS is dead, the client tries to connect to one of the NameNodes defined by dfs.namenodes.rpc.addresses. As long as the addresses defined by these two parameters contain at least one valid address, the client is able to communicate with HopsFS. A detailed description of all the new client configuration parameters is given here (page 47).
5.5 Compatibility with HDFS Clients
HopsFS is fully compatible with HDFS clients, although they do not distribute operations over the NameNodes, as they assume there is a single active NameNode.
5.6 HopsFS Async Quota Management
In HopsFS the commands and the APIs for quota management are identical to HDFS. In HDFS all quota management operations are performed synchronously, while in HopsFS quota management is performed asynchronously for performance reasons. In the following example, the maximum namespace quota for /QDir is set to 10. When a new sub-directory or file is created in this folder, the quota update information propagates up the filesystem tree until it reaches /QDir. Each quota update propagation operation is implemented as an independent transaction.
Fig. 5.1: HopsFS Quota Update
For write-heavy workloads, a user might be able to consume more disk space/namespace than allowed before the filesystem recognizes that the quota limits have been violated. After the quota updates have been applied, the filesystem will not allow the user to violate the quota limits any further. In most existing Hadoop clusters, write operations are a small fraction of the workload. Additionally, considering the size of the filesystem, we think this is a small trade-off for improving the throughput of read operations, which typically comprise 90-95% of a filesystem workload. In HopsFS, asynchronous quota updates are highly optimized; we batch the quota updates wherever possible. In the linked section (page 45) there is a complete list of parameters that determine how aggressively asynchronous quota updates are applied.
5.7 Block Reporting
DataNodes periodically synchronize the set of blocks stored locally with the metadata representing those blocks using a block report. Block reports are sent from DataNodes to NameNodes to indicate the set of valid blocks at a DataNode, and the NameNode compares the sent list with its metadata. For block report load balancing, the DataNodes ask the leader NameNode which NameNode they should send their block report to. The leader NameNode uses a round-robin policy to distribute block reports among the NameNodes. In order to avoid a sudden influx of a large number of block reports, which can slow down other filesystem operations, the leader NameNode also performs admission control for block reports: it only allows a configurable number of block reports to be processed at any given time. In the linked section (page 46) there is a complete list of parameters for block report admission control.
CHAPTER SIX
HOPS-YARN USER GUIDE
Hops-YARN is very similar to Apache Hadoop YARN when it comes to using it. The goal of this section is to present the things that change. We first present some major features of Apache Hadoop YARN that have been removed or replaced in Hops-YARN. We then present how the different parts of the YARN system (ResourceManager, NodeManager, Client) should be configured and used in Hops-YARN.
6.1 Removed/Replaced YARN Features
Hops-YARN is a drop-in replacement for Apache Hadoop YARN and it supports most of the configuration parameters defined for Apache Hadoop YARN. As we have completely rewritten the failover mechanism, some recovery options are not required in Hops-YARN. The following is the list of YARN configurations that are not applicable in Hops-YARN.
• ZooKeeper
ZooKeeper is no longer required, as the coordination and membership service is implemented using the transactional shared memory (NDB). As a result the following options are not supported in Hops-YARN: yarn.resourcemanager.zk-address, yarn.resourcemanager.zk-numretries, yarn.resourcemanager.zk-retry-interval-ms, yarn.resourcemanager.zk-state-store.parent-path, yarn.resourcemanager.zk-timeout-ms, yarn.resourcemanager.zk-acl, yarn.resourcemanager.zk-state-store.root-node.acl, yarn.resourcemanager.ha.automatic-failover.zk-base-path.
• StateStore
Hops-YARN is entirely designed to store its state in the transactional shared memory (NDB). As a result, NDBRMStateStore is the only state store that is still supported. It follows that options specific to other state stores are not supported in Hops-YARN: yarn.resourcemanager.fs.state-store.uri, yarn.resourcemanager.fs.state-store.retry-policy-spec.
• Administration commands
Two administration commands are now obsolete: transitionToActive and transitionToStandby. The selection of the active ResourceManager is now completely automated and managed by the group membership service. As a result, transitionToActive is not supported anymore. transitionToStandby does not present any interesting use case in Hops-YARN: if one wants to remove a ResourceManager from the system, they can simply stop it, and the automatic failover will make sure that a new ResourceManager transparently replaces it. Moreover, as the transition to active is automated, it is possible that the leader election elects the ResourceManager that was just transitioned to standby and makes it the "new" active ResourceManager.
As Hops-YARN is still at an early stage of its development, some features are still under development and not supported yet. The main unsupported features are: failover when running in distributed mode, and the fair scheduler.
6.2 ResourceManager
Even though Hops-YARN allows the ResourceManager to be distributed, with scheduling running on one node (the Scheduler) and resource tracking running on several other nodes (the ResourceTrackers), the configuration of the ResourceManager is similar to the configuration of Apache Hadoop YARN. When running in distributed mode, all the nodes participating in resource management should be configured as a ResourceManager would be configured. They will then automatically detect each other and elect a leader to be the Scheduler.
6.2.1 Adding/Removing a ResourceManager
As the ResourceManagers automatically detect each other through NDB, adding a new ResourceManager simply consists of configuring and starting a new node in the same way as the first ResourceManager. Removing a ResourceManager is not yet supported in distributed mode. In non-distributed mode, stopping the ResourceManager is enough to remove it. If the stopped ResourceManager was in standby, nothing will happen. If the stopped ResourceManager was the active ResourceManager, failover will automatically be triggered and a new ResourceManager will take the active role.
6.3 YARN Clients
Hops-YARN is fully compatible with Apache Hadoop YARN clients.
As in Apache Hadoop YARN, the clients have to be configured with the list of all possible Schedulers in order to find the leading one and start communicating with it. When running a Hops-YARN client, it is possible to configure it to use the ConfiguredLeaderFailoverHAProxyProvider as the yarn.client.failover-proxy-provider. This allows the client to find the leader faster than by going through all the possible leaders present in the configuration file. It also allows the client to find the leader even if it is not present in the client configuration file, as long as one of the ResourceManagers present in the client configuration file is alive.
6.4 YARN NodeManager
In non-distributed mode, the NodeManagers should be configured to use ConfiguredLeaderFailoverHAProxyProvider as the failover proxy provider. This allows them to automatically find the leading ResourceManager and connect to it. In distributed mode, the NodeManagers should be configured to use ConfiguredLeastLoadedRMFailoverHAProxyProvider as the failover proxy provider. This allows them to automatically find the least-loaded ResourceTracker and connect to it.
CHAPTER SEVEN
HOPSWORKS ADMINISTRATOR GUIDE
HopsWorks has an administrator application that allows you, the administrator, to perform management actions, and to monitor and control HopsWorks and Hops.
7.1 Activating users
You, the administrator, have to approve each new user account before the user is able to log in to HopsWorks. When you approve the account, you have to assign a role to the user, either:
• user
• administrator
Users that are assigned an administrator role are granted privileges to log in to the administrator application and control users and the system. Be careful about which users are assigned an administrator role. The vast majority of users will be assigned a user role.
Fig. 7.1: Approve User Accounts so that Users are able to Login
7.2 User fails to receive an email to validate her account
• Does your organization have a firewall that blocks outbound SMTP access?
• Log in to the Glassfish web server and check the JavaMail settings. The JNDI name should be mail/BBCMail. Is the gmail username/password correct? Are the SMTP server settings correct (IP address or hostname, port, protocol (SSL, TLS))?
7.3 User receives email, but fails to validate the account
• Can you successfully access the HopsWorks homepage?
• Is the Glassfish web server running and the hopsworks.war application installed?
• Is MySQL Cluster running?
7.4 Configuring email for HopsWorks
Log in to Glassfish, see Glassfish Administration (page 42), and update the JavaMail settings to set the email account, password, SMTP server IP and port, and whether SSL/TLS is used.
7.5 User successfully validates the account, but still can't login
Go to the User Administration view. From here, select the user whose account is to be enabled, and update the user's account status to validated.
7.6 User account has been disabled due to too many unsuccessful login attempts
Go to the User Administration view. From here, select the user whose account is to be re-enabled, and update the user's account status to validated.
7.7 Disabling a user account
Go to the User Administration view. From here, select the user whose account is to be disabled, and update the user's account status to disabled.
7.8 Re-activating a user account
In the User Administration view, you can select the action that changes the user status to activated.
7.9 Managing Project Quotas
Each project is by default allocated a number of CPU hours in HopsYARN and an amount of available disk storage space in HopsFS:
• HopsYARN Quota
• HopsFS Quota
We recommend that you override the default values for these quotas during the installation process, by overriding the Chef attributes:
• hopsworks/yarn_default_quota
• hopsworks/hdfs_default_quota
In the Projects view, for any given project, the administrator can update the remaining amount of HopsYARN quota (in CPU hours) and the amount of disk space allocated in HopsFS for the project.
Fig. 7.2: Project Administration: update quotas, disable/enable projects.
7.10 Disabling/Re-enabling Projects
In the Projects view, any given project can be disabled (and subsequently re-enabled). Disabling a project prevents members of the project from accessing data in the project, running jobs stored in the project, or accessing the project at all.
7.11 Yubikeys in HopsWorks
Yubikeys can be used as the second-factor authentication device, but a Yubikey needs to be programmed before it is given to a user. We recommend programming the Yubikey using Ubuntu's Yubikey OTP tool. From the Yubikey OTP tool, you will have to copy the Public Identity and Secret Key fields to the corresponding fields in the HopsWorks administration tool when you validate a user. That is, you should save the Public Identity and Secret Key fields for the Yubikey sticks, and when a user registers with one of those Yubikey sticks, you should enter its Public Identity and Secret Key fields when approving the user's account.
$ sudo apt-get install yubikey-personalization-gui
$ yubikey-personalization-gui
Installing and starting the Yubikey OTP tool in Ubuntu.
Fig. 7.3: Registering YubiKey sticks using the Yubikey OTP tool.
7.11.1 Glassfish Administration
If you didn't supply your own username/password for Glassfish administration during installation, you can log in with the default Glassfish username and password:
https://<hostname>:4848
username: adminuser
password: adminpw
Users are referred to the Glassfish documentation for more information on configuring Glassfish.
Fig. 7.4: Registering YubiKey sticks using the Yubikey OTP tool.
Fig. 7.5: Copy the Public Identity and Secret Key fields from the Yubikey OTP tool and enter them into the corresponding fields in the HopsWorks Administration UI when you validate a user.
CHAPTER EIGHT
HOPSFS CONFIGURATION
This section contains the new/modified configuration parameters for HopsFS. All the configuration parameters are defined in the hdfs-site.xml and core-site.xml files.
8.1 Leader Election
The Leader Election service is used by both HopsFS and Hops-YARN. Its configuration parameters are defined in the core-site.xml file; a sample configuration is shown after this list.
• dfs.leader.check.interval: The length of the time period in milliseconds after which the NameNodes run the leader election protocol. One of the active NameNodes is chosen as a leader to perform housekeeping operations. All NameNodes periodically update a counter in the database to mark that they are active, and all NameNodes periodically check for changes in the membership of the NameNodes. By default the time period is set to one second. Increasing the time interval leads to slower failure detection.
• dfs.leader.missed.hb: This property specifies when a NameNode is declared dead. By default a NameNode is declared dead if it misses two consecutive heartbeats. Higher values of this property lead to slower failure detection. The minimum supported value is 2.
• dfs.leader.tp.increment: HopsFS uses an eventual leader election algorithm where the heartbeat time period (dfs.leader.check.interval) is automatically incremented if it detects that NameNodes are being falsely declared dead due to missed heartbeats caused by network/database/CPU overload. By default the heartbeat time period is incremented by 100 milliseconds, but this can be overridden using this parameter.
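The core-site.xml fragment below simply restates the documented defaults explicitly; it is an illustration, not a tuning recommendation.

<!-- core-site.xml: leader election (values shown are the documented defaults) -->
<property>
  <name>dfs.leader.check.interval</name>
  <value>1000</value> <!-- run the leader election protocol every second -->
</property>
<property>
  <name>dfs.leader.missed.hb</name>
  <value>2</value> <!-- declare a NameNode dead after two missed heartbeats -->
</property>
<property>
  <name>dfs.leader.tp.increment</name>
  <value>100</value> <!-- grow the heartbeat period by 100 ms when NameNodes are falsely declared dead -->
</property>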
8.2 NameNode Cache
The NameNode cache configuration parameters are defined in the hdfs-site.xml file. They are:
• dfs.resolvingcache.enabled: (true/false) Enables/disables the cache for the NameNode.
• dfs.resolvingcache.type: Each NameNode caches inode metadata in a local cache for quick path resolution. We support different implementations of the cache, i.e., INodeMemcache, PathMemcache, OptimalMemcache and InMemory.
– INodeMemcache: stores individual inodes in Memcached.
– PathMemcache: a coarse-grained cache where the entire file path (the key) along with its associated inode objects is stored in Memcached.
– OptimalMemcache: combines INodeMemcache and PathMemcache.
– InMemory: same as INodeMemcache, but instead of using Memcached it uses an in-memory LRU ConcurrentLinkedHashMap.
We recommend the InMemory cache as it yields higher throughput.
For INodeMemcache/PathMemcache/OptimalMemcache the following configuration parameters must be set:
• dfs.resolvingcache.memcached.server.address: Memcached server address.
• dfs.resolvingcache.memcached.connectionpool.size: Number of connections to the Memcached server.
• dfs.resolvingcache.memcached.key.expiry: Determines when the Memcached entries expire. The default value is 0, that is, the entries never expire. Whenever the NameNode encounters an entry that is no longer valid, it updates it.
The InMemory cache specific configuration is:
• dfs.resolvingcache.inmemory.maxsize: Maximum number of entries that can be stored in the cache before the LRU algorithm kicks in.
8.3 Distributed Transaction Hints
In HopsFS the metadata is partitioned using the inodes' ids. HopsFS tries to enlist the transactional filesystem operation on the database node that holds the metadata for the file/directory being manipulated by the operation. The distributed transaction hints configuration parameters are defined in the hdfs-site.xml file.
• dfs.ndb.setpartitionkey.enabled: (true/false) Enables/disables the transaction partition key hint.
• dfs.ndb.setrandompartitionkey.enabled: (true/false) Enables/disables the random partition key hint, used when HopsFS fails to determine an appropriate partition key for the transactional filesystem operation.
8.4 Quota Management
In order to boost performance and increase the parallelism of metadata operations, quota updates are applied asynchronously, i.e., disk and namespace usage statistics are updated asynchronously in the background. With the asynchronous quota system, it is possible that some users over-consume namespace/disk space before the background quota system throws an exception. The following parameters control how aggressively the quota subsystem updates the quota statistics. The quota management configuration parameters are defined in the hdfs-site.xml file; a sample configuration follows this list.
• dfs.quota.enabled: Enables/disables quotas. By default quotas are enabled.
• dfs.namenode.quota.update.interval: The quota update manager applies the outstanding quota updates every dfs.namenode.quota.update.interval milliseconds.
• dfs.namenode.quota.update.limit: The maximum number of outstanding quota updates that are applied in each round.
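The hdfs-site.xml fragment below is a sketch; the interval and limit values are illustrative only and should be tuned to the workload.

<!-- hdfs-site.xml: asynchronous quota updates (interval and limit values are illustrative) -->
<property>
  <name>dfs.quota.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.quota.update.interval</name>
  <value>1000</value> <!-- apply outstanding quota updates every second -->
</property>
<property>
  <name>dfs.namenode.quota.update.limit</name>
  <value>100000</value> <!-- cap the number of quota updates applied per round -->
</property>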
8.5 Block Reporting
• dfs.block.report.load.balancing.max.blks.per.time.window: This is a global configuration parameter. The leader NameNode only admits block reports such that the total number of blocks processed by the block reporting subsystem of HopsFS does not exceed dfs.block.report.load.balancing.max.blks.per.time.window in a given block report processing time window.
• dfs.block.report.load.balancing.time.window.size: This parameter determines the size of the block report processing time window, defined in milliseconds. If dfs.block.report.load.balancing.max.blks.per.time.window is set to one million and dfs.block.report.load.balancing.time.window.size is set to one minute, then the leader NameNode will ensure that every minute at most 1 million blocks are accepted for processing by the admission control system of the filesystem.
• dfs.blk.report.load.balancing.update.threashold.time: The parameter dfs.block.report.load.balancing.max.blks.per.time.window can be changed using the command hdfs namenode -setBlkRptProcessSize noOfBlks. The parameter is stored in the database, and the NameNodes periodically read the new value from the database. This parameter determines how frequently a NameNode checks for changes to that value. The default is set to 60*1000 milliseconds.
8.6 Distributed Unique ID Generator
The ClusterJ API does not support any means to auto-generate primary keys; unique key generation is left to the application. Each NameNode has an ID generation daemon. The ID generator keeps pools of pre-allocated IDs for inodes, blocks, and quota entities. The distributed unique ID generator configuration parameters are defined in hdfs-site.xml.
• dfs.namenode.quota.update.id.batchsize, dfs.namenode.inodeid.batchsize, dfs.namenode.blockid.batchsize: When the ID generator is about to run out of IDs, it prefetches a batch of new IDs. These parameters define the prefetch batch size for quota updates, inodes, and blocks, respectively.
• dfs.namenode.quota.update.updateThreshold, dfs.namenode.inodeid.updateThreshold, dfs.namenode.blockid.updateThreshold: These parameters define when the ID generator should prefetch a new batch of IDs. Their values are defined as percentages, i.e., 0.5 means prefetch a new batch of IDs when 50 percent of the current IDs have been consumed by the NameNode.
• dfs.namenode.id.updateThreshold: Defines how often the ID monitor should check whether the ID pools are running low on pre-allocated IDs.
8.7 Namespace and Block Pool ID
• dfs.block.pool.id and dfs.name.space.id: Due to the state shared among the NameNodes, HopsFS only supports a single namespace and a single block pool. The default namespace and block pool ids can be overridden using these parameters.
A consolidated configuration sketch for the block reporting and ID generation parameters above is shown below.
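The hdfs-site.xml sketch restates the example values used in the block reporting section above and adds illustrative ID generator settings; none of these numbers are recommendations.

<!-- hdfs-site.xml: block report admission control (example values from the text) -->
<property>
  <name>dfs.block.report.load.balancing.max.blks.per.time.window</name>
  <value>1000000</value> <!-- at most one million blocks per window -->
</property>
<property>
  <name>dfs.block.report.load.balancing.time.window.size</name>
  <value>60000</value> <!-- one-minute processing window -->
</property>
<!-- ID generator prefetching (batch size and threshold are illustrative) -->
<property>
  <name>dfs.namenode.inodeid.batchsize</name>
  <value>1000</value>
</property>
<property>
  <name>dfs.namenode.inodeid.updateThreshold</name>
  <value>0.5</value> <!-- prefetch when half of the current batch has been consumed -->
</property>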
8.8 Client Configurations
All the client configuration parameters are defined in the core-site.xml file.
• dfs.namenodes.rpc.addresses: HopsFS supports multiple active NameNodes. A client can send an RPC request to any of the active NameNodes. This parameter specifies a list of active NameNodes in the system. The list has the following format: [hdfs://ip:port, hdfs://ip:port, ...]. It is not necessary that this list contains all the active NameNodes in the system; a single valid reference to an active NameNode is sufficient. At startup the client obtains an updated list of NameNodes from a NameNode mentioned in the list. If this list is empty, the client tries to connect to fs.default.name.
• dfs.namenode.selector-policy: The clients uniformly distribute the RPC calls among all the NameNodes in the system based on one of the following policies: ROUND_ROBIN, RANDOM, RANDOM_STICKY. By default the NameNode selection policy is set to RANDOM_STICKY.
• dfs.clinet.max.retires.on.failure: The client retries the RPC call if the RPC fails due to the failure of a NameNode. This configuration parameter specifies how many times the client retries the RPC before throwing an exception. This property is directly related to the number of expected simultaneous NameNode failures. Set this value to 1 in the case of low failure rates, such as one dead NameNode at any given time. It is recommended that this property be set to a value >= 1.
• dfs.client.max.random.wait.on.retry: An RPC can fail because of many factors, such as NameNode failure or network congestion. Changes in the membership of the NameNodes can lead to contention on the remaining NameNodes. In order to avoid contention on the remaining NameNodes in the system, the client waits for a random period between [0, MAX_VALUE] ms before retrying the RPC. This property specifies MAX_VALUE; by default it is set to 1000 ms.
• dfs.client.refresh.namenode.list: All clients periodically refresh their view of the active NameNodes in the system. By default the client checks for changes in the membership of the NameNodes every minute. Higher values can be chosen for scenarios where the membership does not change frequently.
8.9 Data Access Layer (DAL)
Using the DAL layer, HopsFS metadata can be stored in different databases. HopsFS provides a driver to store the metadata in MySQL Cluster Network Database (NDB).
8.9.1 MySQL Cluster Network Database Driver Configuration
The database-specific parameters are stored in a .properties file. The configuration file contains the following parameters.
• com.mysql.clusterj.connectstring: Address of the management server of the MySQL NDB Cluster.
• com.mysql.clusterj.database: Name of the database schema that contains the metadata tables.
• com.mysql.clusterj.connection.pool.size: The number of connections that are created in the ClusterJ connection pool. If it is set to 1, then all the sessions share the same connection; all requests for a SessionFactory with the same connect string and database will share a single SessionFactory. A setting of 0 disables pooling; each request for a SessionFactory will receive its own unique SessionFactory.
• com.mysql.clusterj.max.transactions: Maximum number of transactions that can be executed simultaneously using the ClusterJ client. The maximum supported number of transactions is 1024.
• io.hops.metadata.ndb.mysqlserver.host: Address of the MySQL Server. For higher performance we use the MySQL Server to perform aggregate queries on the filesystem metadata.
• io.hops.metadata.ndb.mysqlserver.port: If not specified, the default value of 3306 is used.
• io.hops.metadata.ndb.mysqlserver.username: A valid username to access the MySQL Server.
• io.hops.metadata.ndb.mysqlserver.password: MySQL Server user password.
• io.hops.metadata.ndb.mysqlserver.connection.pool.size: Number of NDB connections used by the MySQL Server. The default is set to 10.
• Database Sessions Pool: For performance reasons the data access layer maintains a pool of preallocated ClusterJ session objects. The following parameters control the behavior of the session pool.
– io.hops.session.pool.size: Defines the size of the session pool. The pool should be at least as big as the number of active transactions in the system.
The number of active transactions in the system can be calculated as (dfs.datanode.handler.count + dfs.namenode.handler.count + dfs.namenode.subtree-executor-limit).
– io.hops.session.reuse.count: A session is used N times and then garbage collected. Note: due to improved memory management in ClusterJ >= 7.4.7, N can be set to higher values, i.e., Integer.MAX_VALUE for the latest ClusterJ libraries.
8.9.2 Loading a DAL Driver
In order to load a DAL driver, the following configuration parameters are added to the hdfs-site.xml file.
• dfs.storage.driver.jarFile: path to the driver jar file, if the driver's jar file is not included in the class path.
• dfs.storage.driver.class: the main class that initializes the driver.
• dfs.storage.driver.configfile: path to a file that contains configuration parameters for the driver jar file. The path is supplied to the dfs.storage.driver.class as an argument during initialization. See the hops ndb driver configuration parameters (page 47).
8.10 HopsFS-EC Configuration
The erasure coding API is flexibly configurable and hence comes with some new configuration options, which are shown here. All configuration options can be set by creating an erasure-coding-site.xml file in the Hops configuration folder. Note that Hops comes with reasonable default values for all of these options; however, erasure coding needs to be enabled manually. A sample configuration is shown after this list.
• dfs.erasure_coding.enabled: (true/false) Enables/disables erasure coding.
• dfs.erasure_coding.codecs.json: List of available erasure coding codecs. This value is a JSON field, i.e.:
<value>
[
  {
    "id" : "xor",
    "parity_dir" : "/raid",
    "stripe_length" : 10,
    "parity_length" : 1,
    "priority" : 100,
    "erasure_code" : "io.hops.erasure_coding.XORCode",
    "description" : "XOR code"
  },
  {
    "id" : "rs",
    "parity_dir" : "/raidrs",
    "stripe_length" : 10,
    "parity_length" : 4,
    "priority" : 300,
    "erasure_code" : "io.hops.erasure_coding.ReedSolomonCode",
    "description" : "ReedSolomonCode code"
  },
  {
    "id" : "src",
    "parity_dir" : "/raidsrc",
    "stripe_length" : 10,
    "parity_length" : 6,
    "parity_length_src" : 2,
    "erasure_code" : "io.hops.erasure_coding.SimpleRegeneratingCode",
    "priority" : 200,
    "description" : "SimpleRegeneratingCode code"
  }
]
</value>
• dfs.erasure_coding.parity_folder: The HDFS folder to store parity information in. The default value is /parity.
• dfs.erasure_coding.recheck_interval: How frequently the system should schedule encodings or repairs and check their state. The default value is 300000 ms.
• dfs.erasure_coding.repair_delay: How long the system should wait before scheduling a repair. The default is 1800000 ms.
• dfs.erasure_coding.parity_repair_delay: How long the system should wait before scheduling a parity repair. The default is 1800000 ms.
• dfs.erasure_coding.active_encoding_limit: Maximum number of active encoding jobs. The default is 10.
• dfs.erasure_coding.active_repair_limit: Maximum number of active repair jobs. The default is 10.
• dfs.erasure_coding.active_parity_repair_limit: Maximum number of active parity repair jobs. The default is 10.
• dfs.erasure_coding.deletion_limit: Delete operations to be handled during one round. The default is 100.
• dfs.erasure_coding.encoding_manager: Implementation of the EncodingManager to be used. The default is io.hops.erasure_coding.MapReduceEncodingManager.
• dfs.erasure_coding.block_rapair_manager: Implementation of the repair manager to be used. The default is io.hops.erasure_coding.MapReduceBlockRepairManager.
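As a sketch, a minimal erasure-coding-site.xml that only enables the feature and restates a few of the defaults listed above might look as follows.

<!-- erasure-coding-site.xml: enable erasure coding (values restate the documented defaults) -->
<property>
  <name>dfs.erasure_coding.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.erasure_coding.parity_folder</name>
  <value>/parity</value>
</property>
<property>
  <name>dfs.erasure_coding.recheck_interval</name>
  <value>300000</value> <!-- check encoding/repair state every 5 minutes -->
</property>
<property>
  <name>dfs.erasure_coding.repair_delay</name>
  <value>1800000</value> <!-- wait 30 minutes before scheduling a repair -->
</property>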
CHAPTER NINE
HOPS-YARN CONFIGURATION
Hops-YARN configuration is very similar to the Apache Hadoop YARN configuration. A few additional configuration parameters are needed to configure the new services provided by Hops-YARN. This section presents the new/modified configuration parameters for Hops-YARN. All the new configuration parameters should be entered in yarn-site.xml.
9.1 Configuring Hops-YARN fail-over
• yarn.resourcemanager.scheduler.port: The port used by the scheduler service (the port still needs to be specified in yarn.resourcemanager.scheduler.address).
• yarn.resourcemanager.resource-tracker.port: The port used by the resource-tracker service (the port still needs to be specified in yarn.resourcemanager.resource-tracker.address).
• yarn.resourcemanager.admin.port: The port used by the admin service (the port still needs to be specified in yarn.resourcemanager.admin.address).
• yarn.resourcemanager.port: The port used by the resource manager service (the port still needs to be specified in yarn.resourcemanager.resourcemanager.address).
• yarn.resourcemanager.groupMembership.address: The address of the group membership service. The group membership service is used by the clients and NodeManagers to obtain the list of live ResourceManagers.
• yarn.resourcemanager.groupMembership.port: The port used by the group membership service (the port still needs to be specified in yarn.resourcemanager.groupMembership.address).
• yarn.resourcemanager.ha.rm-ids: Contains a list of ResourceManagers. This is used to establish the first connection to the group membership service.
• yarn.resourcemanager.store.class: Should be set to org.apache.hadoop.yarn.server.resourcemanager.recovery.NDBRMStateStore.
9.2 Batch Processing of Operations
In Hops-YARN, RPCs that describe operations on the Application Master interface, the Administrator interface, and the Client interface are received by the ResourceManager. RPCs for the Resource Tracker interface are received by the ResourceTracker nodes. For reasons of performance and consistency, the Hops-YARN ResourceManager processes incoming RPCs in batches. Hops-YARN first fills an adaptive processing buffer with a bounded-size batch of RPCs. If the batch has not been filled before a timer expires (hops.yarn.resourcemanager.batch.max.duration), the batch is processed immediately. New RPCs are blocked until the accepted batch of RPCs has been processed. Once all of the RPCs have been completely executed, the state of the ResourceManager is pushed to the database and the next RPCs are accepted. The size of the batch of RPCs that are accepted is thus limited by two factors: the number of RPCs and the time for which the batch has been accumulating. The first factor guarantees that the number of state changes in the database is limited and that committing the new state to the database won't take too long. The second factor guarantees that a new state is committed within a given time even if few RPCs are received. A sample configuration is shown after this list.
• hops.yarn.resourcemanager.batch.max.size: The maximum number of RPCs in a batch.
• hops.yarn.resourcemanager.batch.max.duration: The maximum time to wait before processing a batch of RPCs (default: 10 ms).
• hops.yarn.resourcemanager.max.allocated.containers.per.request: In very large clusters, some applications may try to allocate tens of thousands of containers at once. This can take a few seconds and, because of the RPC batching system, block any other RPC from being handled during this time. In order to limit the impact of such big requests, it is possible to set this option to limit the number of containers an application gets in each request. This results in a suboptimal use of the cluster each time such an application starts.
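A yarn-site.xml sketch for these parameters is shown below; the batch size and container limit are illustrative values, while the 10 ms duration is the documented default.

<!-- yarn-site.xml: RPC batching (batch size and container limit are illustrative) -->
<property>
  <name>hops.yarn.resourcemanager.batch.max.size</name>
  <value>500</value>
</property>
<property>
  <name>hops.yarn.resourcemanager.batch.max.duration</name>
  <value>10</value> <!-- process a partially filled batch after 10 ms -->
</property>
<property>
  <name>hops.yarn.resourcemanager.max.allocated.containers.per.request</name>
  <value>2000</value> <!-- cap the containers granted to one application per request -->
</property>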
9.2.1 Database back pressure
In order to exert back pressure when the database is overloaded, we block the execution of new RPCs. We detect that the database is overloaded by looking at the length of the queue of operations waiting to be committed, as well as the duration of individual commits. If the queue becomes too long, or the duration of any individual commit becomes too long, we exert back pressure on the RPCs.
• hops.yarn.resourcemanager.commit.and.queue.threshold: The upper bound on the length of the queue of operations waiting to be committed.
• hops.yarn.resourcemanager.commit.queue.max.length: The upper bound on the time each individual commit should take.
9.2.2 Proxy provider
• yarn.client.failover-proxy-provider: Two new proxy providers have been added alongside the existing ConfiguredRMFailoverProxyProvider.
• ConfiguredLeaderFailoverHAProxyProvider: this proxy provider has the same goal as the ConfiguredRMFailoverProxyProvider (connecting to the leading ResourceManager), but it uses the group membership service, whereas the ConfiguredRMFailoverProxyProvider goes through all the ResourceManagers present in the configuration file to find the leader. This allows the ConfiguredLeaderFailoverHAProxyProvider to be faster and to find the leader even if it is not present in the configuration file.
• ConfiguredLeastLoadedRMFailoverHAProxyProvider: this proxy provider establishes a connection with the ResourceTracker that currently has the lowest load (least loaded). This proxy provider is to be used in distributed mode in order to balance the load coming from the NodeManagers across the ResourceTrackers.
9.3 Configuring Hops-YARN distributed mode
Hops-YARN distributed mode can be enabled by setting the following flags to true (see the sketch after this list):
• hops.yarn.resourcemanager.distributed-rt.enable: Set to true to indicate that the system should run in distributed mode.
• hops.yarn.resourcemanager.ndb-event-streaming.enable: Set to true to indicate that the ResourceManager (scheduler) should use the streaming API to the database to receive updates on the state of the NodeManagers. Set it to true if you want to use the streaming API for higher performance.
• hops.yarn.resourcemanager.ndb-rt-event-streaming.enable: Set to true to indicate that the ResourceTrackers should use the streaming API to the database to receive updates on the state of the NodeManagers. Set it to true if you want to use the streaming API for higher performance.
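The yarn-site.xml sketch below simply sets all three flags to true.

<!-- yarn-site.xml: enable Hops-YARN distributed mode with NDB event streaming -->
<property>
  <name>hops.yarn.resourcemanager.distributed-rt.enable</name>
  <value>true</value>
</property>
<property>
  <name>hops.yarn.resourcemanager.ndb-event-streaming.enable</name>
  <value>true</value> <!-- the scheduler receives NodeManager state via the streaming API -->
</property>
<property>
  <name>hops.yarn.resourcemanager.ndb-rt-event-streaming.enable</name>
  <value>true</value> <!-- the ResourceTrackers receive NodeManager state via the streaming API -->
</property>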
CHAPTER TEN
HOPS DEVELOPER GUIDE
10.1 Extending HopsFS INode metadata
For the implementation of new features, it is often necessary to modify the hdfs_inodes table or add new tables in order to store extended metadata. With Hops-HDFS, this can be achieved simply by adding a new table with a foreign key that refers to hdfs_inodes. Adding new tables has the benefit that the original data structures do not need to be modified, and old code paths that do not require the additional metadata are not burdened with additional reading costs. This guide gives a walkthrough of how to add additional INode-related metadata.
10.1.1 Example use case
Let's assume we would like to store per-user access times for each INode. To do this, we need to store the id of the inode, the name of the user, and the timestamp representing the most recent access.
10.1.2 Adding a table to the schema
First, we need to add a new table storing the metadata to our schema. Therefore, we go to the hops-metadata-dal-impl-ndb project and add the following to the schema/schema.sql file.

CREATE TABLE `hdfs_access_time_log` (
  `inode_id` int(11) NOT NULL,
  `user` varchar(32) NOT NULL,
  `access_time` bigint(20) NOT NULL,
  PRIMARY KEY (`inode_id`, `user`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1$$

Additionally, we make the table and column names available to the Java code by adding the following to the io.hops.metadata.hdfs.TablesDef class in hops-metadata-dal.

public static interface AccessTimeLogTableDef {
  public static final String TABLE_NAME = "hdfs_access_time_log";
  public static final String INODE_ID = "inode_id";
  public static final String USER = "user";
  public static final String ACCESS_TIME = "access_time";
}

Note: Don't forget to update your database with the new schema.
10.1.3 Defining the Entity Class
Having defined the database table, we need to define an entity class representing our database entries in the Java code. We do this by adding the following AccessTimeLogEntry class to the hops-metadata-dal project.

package io.hops.metadata.hdfs.entity;

public class AccessTimeLogEntry {
  private final int inodeId;
  private final String user;
  private final long accessTime;

  public AccessTimeLogEntry(int inodeId, String user, long accessTime) {
    this.inodeId = inodeId;
    this.user = user;
    this.accessTime = accessTime;
  }

  public int getInodeId() {
    return inodeId;
  }

  public String getUser() {
    return user;
  }

  public long getAccessTime() {
    return accessTime;
  }
}

10.1.4 Defining the DataAccess interface
We need a way to interact with our new entity in the database. The preferred way of doing this in Hops is to define a DataAccess interface to be implemented by a database implementation. Let's define the following interface in the hops-metadata-dal project. For now, we only require functionality to add and modify log entries and to read individual entries for a given INode and user.

package io.hops.metadata.hdfs.dal;

public interface AccessTimeLogDataAccess<T> extends EntityDataAccess {
  void prepare(Collection<T> modified, Collection<T> removed) throws StorageException;
  T find(int inodeId, String user) throws StorageException;
}

10.1.5 Implementing the DataAccess interface
Having defined the interface, we need to implement it using NDB to read and persist our data. Therefore, we add a ClusterJ implementation of our interface to the hops-metadata-dal-impl-ndb project.
package io.hops.metadata.ndb.dalimpl.hdfs;

public class AccessTimeLogClusterj implements TablesDef.AccessTimeLogTableDef,
    AccessTimeLogDataAccess<AccessTimeLogEntry> {

  private ClusterjConnector connector = ClusterjConnector.getInstance();

  @PersistenceCapable(table = TABLE_NAME)
  public interface AccessTimeLogEntryDto {
    @PrimaryKey
    @Column(name = INODE_ID)
    int getInodeId();
    void setInodeId(int inodeId);

    @PrimaryKey
    @Column(name = USER)
    String getUser();
    void setUser(String user);

    @Column(name = ACCESS_TIME)
    long getAccessTime();
    void setAccessTime(long accessTime);
  }

  @Override
  public void prepare(Collection<AccessTimeLogEntry> modified,
      Collection<AccessTimeLogEntry> removed) throws StorageException {
    HopsSession session = connector.obtainSession();
    List<AccessTimeLogEntryDto> changes = new ArrayList<AccessTimeLogEntryDto>();
    List<AccessTimeLogEntryDto> deletions = new ArrayList<AccessTimeLogEntryDto>();
    if (removed != null) {
      for (AccessTimeLogEntry logEntry : removed) {
        Object[] pk = new Object[2];
        pk[0] = logEntry.getInodeId();
        pk[1] = logEntry.getUser();
        AccessTimeLogEntryDto persistable =
            session.newInstance(AccessTimeLogEntryDto.class, pk);
        deletions.add(persistable);
      }
    }
    if (modified != null) {
      for (AccessTimeLogEntry logEntry : modified) {
        AccessTimeLogEntryDto persistable = createPersistable(logEntry, session);
        changes.add(persistable);
      }
    }
    session.deletePersistentAll(deletions);
    session.savePersistentAll(changes);
  }

  @Override
  public AccessTimeLogEntry find(int inodeId, String user) throws StorageException {
    HopsSession session = connector.obtainSession();
    Object[] key = new Object[2];
    key[0] = inodeId;
    key[1] = user;
    AccessTimeLogEntryDto dto = session.find(AccessTimeLogEntryDto.class, key);
    AccessTimeLogEntry logEntry = create(dto);
    return logEntry;
  }

  private AccessTimeLogEntryDto createPersistable(AccessTimeLogEntry logEntry,
      HopsSession session) throws StorageException {
    AccessTimeLogEntryDto dto = session.newInstance(AccessTimeLogEntryDto.class);
    dto.setInodeId(logEntry.getInodeId());
    dto.setUser(logEntry.getUser());
    dto.setAccessTime(logEntry.getAccessTime());
    return dto;
  }

  private AccessTimeLogEntry create(AccessTimeLogEntryDto dto) {
    AccessTimeLogEntry logEntry = new AccessTimeLogEntry(
        dto.getInodeId(), dto.getUser(), dto.getAccessTime());
    return logEntry;
  }
}

Having defined a concrete implementation of the DataAccess interface, we need to make it available to the EntityManager by adding it to the HdfsStorageFactory in the hops-metadata-dal-impl-ndb project. Edit its initDataAccessMap() function by adding the newly defined DataAccess as follows.

private void initDataAccessMap() {
  [...]
  dataAccessMap.put(AccessTimeLogDataAccess.class, new AccessTimeLogClusterj());
}

10.1.6 Implementing the EntityContext
Hops-HDFS uses context objects to cache the state of entities during transactions before persisting them in the database during the commit phase. We need to implement such a context for our new entity in the hops project.
package io.hops.transaction.context;

public class AccessTimeLogContext
    extends BaseEntityContext<Object, AccessTimeLogEntry> {

  private final AccessTimeLogDataAccess<AccessTimeLogEntry> dataAccess;

  /* Finder to be passed to the EntityManager */
  public enum Finder implements FinderType<AccessTimeLogEntry> {
    ByInodeIdAndUser;

    @Override
    public Class getType() {
      return AccessTimeLogEntry.class;
    }

    @Override
    public Annotation getAnnotated() {
      switch (this) {
        case ByInodeIdAndUser:
          return Annotation.PrimaryKey;
        default:
          throw new IllegalStateException();
      }
    }
  }

  /*
   * Our entity uses inode id and user as a composite key.
   * Hence, we need to implement a composite key class.
   */
  private class Key {
    int inodeId;
    String user;

    public Key(int inodeId, String user) {
      this.inodeId = inodeId;
      this.user = user;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (o == null || getClass() != o.getClass()) {
        return false;
      }
      Key key = (Key) o;
      if (inodeId != key.inodeId) {
        return false;
      }
      return user.equals(key.user);
    }

    @Override
    public int hashCode() {
      int result = inodeId;
      result = 31 * result + user.hashCode();
      return result;
    }

    @Override
    public String toString() {
      return "Key{" + "inodeId=" + inodeId + ", user='" + user + '\'' + '}';
    }
  }

  public AccessTimeLogContext(AccessTimeLogDataAccess<AccessTimeLogEntry> dataAccess) {
    this.dataAccess = dataAccess;
  }

  @Override
  Object getKey(AccessTimeLogEntry logEntry) {
    return new Key(logEntry.getInodeId(), logEntry.getUser());
  }

  @Override
  public void prepare(TransactionLocks tlm)
      throws TransactionContextException, StorageException {
    Collection<AccessTimeLogEntry> modified =
        new ArrayList<AccessTimeLogEntry>(getModified());
    modified.addAll(getAdded());
    dataAccess.prepare(modified, getRemoved());
  }

  @Override
  public AccessTimeLogEntry find(FinderType<AccessTimeLogEntry> finder, Object... params)
      throws TransactionContextException, StorageException {
    Finder afinder = (Finder) finder;
    switch (afinder) {
      case ByInodeIdAndUser:
        return findByPrimaryKey(afinder, params);
    }
    throw new UnsupportedOperationException(UNSUPPORTED_FINDER);
  }

  private AccessTimeLogEntry findByPrimaryKey(Finder finder, Object[] params)
      throws StorageCallPreventedException, StorageException {
    final int inodeId = (Integer) params[0];
    final String user = (String) params[1];
    Key key = new Key(inodeId, user);
    AccessTimeLogEntry result;
    if (contains(key)) {
      result = get(key); // Get it from the cache
      hit(finder, result, params);
    } else {
      aboutToAccessStorage(finder, params); // Throws an exception if reading
                                            // after the reading phase
      result = dataAccess.find(inodeId, user); // Fetch the value from the database
      gotFromDB(key, result); // Put the new value into the cache
      miss(finder, result, params);
    }
    return result;
  }
}

Having defined an EntityContext, we need to make it available through the EntityManager by adding it to the HdfsStorageFactory in the hops project, modifying it as follows.

private static ContextInitializer getContextInitializer() {
  return new ContextInitializer() {
    @Override
    public Map<Class, EntityContext> createEntityContexts() {
      Map<Class, EntityContext> entityContexts = new HashMap<Class, EntityContext>();
      [...]
      entityContexts.put(AccessTimeLogEntry.class, new AccessTimeLogContext(
          (AccessTimeLogDataAccess) getDataAccess(AccessTimeLogDataAccess.class)));
      return entityContexts;
    }
  };
}

10.1.7 Using custom locks
Your metadata extension relies on the INode object being correctly locked in order to prevent concurrent modifications.
However, it might be necessary to modify attributes without locking the INode in advance. In that case, one needs to add a new lock type. A good place to get started with this is looking at the Lock, HdfsTransactionLocks, LockFactory and HdfsTransactionalLockAcquirer classes in the hops project.
10.2 Erasure Coding API Access
HopsFS provides erasure coding functionality in order to decrease storage costs without the loss of high availability. Hops offers a powerful erasure coding API that is configurable on a per-file basis. Codes can be freely configured, and different configurations can be applied to different files. Given that Hops monitors your erasure-coded files directly in the NameNode, maximum control over encoded files is guaranteed. This page explains how to configure and use the erasure coding functionality of Hops.
Apache HDFS stores 3 copies of your data to provide high availability, so 1 petabyte of data actually requires 3 petabytes of storage. For many organizations, this results in enormous storage costs. HopsFS also supports erasure coding to reduce the storage required by 44% compared to HDFS, while still providing high availability for your data.
10.2.1 Java API
The erasure coding API is exposed to the client through the DistributedFileSystem class. The following sections give examples of how to use its functionality. Note that the following examples rely on erasure coding being properly configured; information about how to do this can be found in the HopsFS-EC Configuration section.
10.2.2 Creation of Encoded Files
The erasure coding API offers the ability to request the encoding of a file while it is being created. Doing so has the benefit that file blocks can initially be placed in a way that meets the placement constraints for erasure-coded files, without needing to rewrite them during the encoding process. The actual encoding process will take place asynchronously on the cluster.

Configuration conf = new Configuration();
DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
// The path of the file to be created
Path path = new Path("/testFile");
// Use the configured "src" codec and reduce
// the replication to 1 after successful encoding
EncodingPolicy policy = new EncodingPolicy("src" /* Codec id as configured */, (short) 1);
// Create the file with the given policy and
// write it with an initial replication of 2
FSDataOutputStream out = dfs.create(path, (short) 2, policy);
// Write some data to the stream and close it as usual
out.close();
// Done. The encoding will be executed asynchronously
// as soon as resources are available.

Multiple versions of the create function, complementing the original versions with erasure coding functionality, exist. For more information please refer to the class documentation.
10.2.3 Encoding of Existing Files
The erasure coding API offers the ability to request the encoding of existing files. A replication factor to be applied after successfully encoding the file can be supplied, as well as the desired codec. The actual encoding process will take place asynchronously on the cluster.

Configuration conf = new Configuration();
DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
String path = "/testFile";
// Use the configured "src" codec and reduce the replication to 1
// after successful encoding
EncodingPolicy policy = new EncodingPolicy("src" /* Codec id as configured */, (short) 1);
// Request the asynchronous encoding of the file
dfs.encodeFile(path, policy);
// Done. The encoding will be executed asynchronously
// as soon as resources are available.
10.2.4 Reverting To Replication Only
The erasure coding API allows the encoding to be reverted so that a file falls back to replication only. A replication factor can be supplied and is guaranteed to be reached before any parity information is deleted.

Configuration conf = new Configuration();
DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
// The path to an encoded file
String path = "/testFile";
// Request the asynchronous revocation process and
// set the replication factor to be applied
dfs.revokeEncoding(path, (short) 2);
// Done. The file will be replicated asynchronously and
// its parity will be deleted subsequently.

10.2.5 Deletion Of Encoded Files
Deletion of encoded files does not require any special care. The system will automatically take care of deleting any additionally stored information.
CHAPTER ELEVEN
LICENSE COMPATIBILITY
We combine Apache- and GPL-licensed code, from Hops and MySQL Cluster respectively, by providing a DAL API (similar to JDBC). We dynamically link our DAL implementation for MySQL Cluster with the Hops code. Both binaries are distributed separately. Hops derives from Hadoop and, as such, it is available under the Apache version 2.0 open-source licensing model. MySQL Cluster and its client connectors, on the other hand, are licensed under the GPL version 2.0 licensing model. Similar to the JDBC model, we have introduced a Data Access Layer (DAL) API to bridge our code, licensed under the Apache model, with the MySQL Cluster connector libraries, licensed under the GPL v2.0 model. The DAL API is licensed under the Apache v2.0 model. The DAL API is statically linked to both Hops and our client library for MySQL Cluster that implements the DAL API. Our client library that implements the DAL API for MySQL Cluster, however, is licensed under the GPL v2.0 model, but static linking of Apache v2 code to GPL v2 code is allowed, as stated in the MySQL FOSS License Exception. The FOSS License Exception permits use of the GPL-licensed MySQL Client Libraries with software applications licensed under certain other FOSS licenses without causing the entire derivative work to be subject to the GPL. However, to comply with the terms of both licensing models, the DAL API needs to be generic, and different implementations of it for different databases are possible. Although we currently only support MySQL Cluster, you are free to develop your own DAL API client and run Hops on a different database. The main requirements for the database are support for transactions, read/write locks, and at least read-committed isolation.
Feature | Description | Integrated from Deliverable(s)
Two-factor authentication | Secure authorization using smartphones and Yubikeys | D3.4 Security Toolset Final Version
Dynamic User Roles | Users can have different privileges in different studies | D3.4 Security Toolset Final Version
Biobanking forms | Consent forms, Non-consent Forms | D1.3 Legal and Ethical Framework, D3.4 Security Toolset Final Version
Audit Trails | Logging of user activity in the system | D1.4 Disclosure model, D3.4 Security Toolset Final Version
Study membership mgmt | Study owners manage users and their roles | D3.5 Object Model Implementation
Metadata mgmt | Metadata designer and metadata entry for files/directories | D3.5 Object Model Implementation, D2.3 Scalable and Highly Available HDFS
Free-text search | Search for projects/datasets/files/directories using Elasticsearch | D3.5 Object Model Implementation, D2.3 Scalable and Highly Available HDFS
Data set sharing | Sharing data between studies without copying | D3.5 Object Model Implementation, D2.3 Scalable and Highly Available HDFS
Data set browser | Explore/upload/download files and directories in HopsFS | D1.2 Object model for biobank data sharing
SAASFEE | Bioinformatics workflows on YARN using Cuneiform and HiWAY | D5.3 Workflows for NGS data analysis use cases, D6.3: Analysis Pipelines Linked to Public Biological Annotation
Charon | Sharing data between Biobanks | D4.3 Overbank Implementation and Evaluation, D2.4 Secure, scalable, highly-available Filesystem
Apache Zeppelin | Interactive analytics using Spark and Flink | ...
Table 1: HopsWorks integrates features from BiobankCloud Deliverables.
HopsWorks as a new UI for Hadoop
Existing models for multi-tenancy in Hadoop, such as Amazon Web Services' Elastic MapReduce (EMR) platform, Google's Dataproc platform, and Altiscale's Hadoop-as-a-Service, provide multi-tenant Hadoop by running separate Hadoop clusters for separate projects or organizations. They improve cluster efficiency by running Hadoop clusters on virtualized or containerized platforms, but in some cases the clusters are not elastic, that is, they cannot be easily scaled up or down in size. There are no tools for securely sharing data between platforms without copying data.
HopsWorks is a front-end to Hadoop that provides a new model for multi-tenancy in Hadoop, based around projects. A project is like a GitHub project: the owner of the project manages membership, and users can have different roles in the project: data scientists can run programs, and data owners can also curate, import, and export data. Users cannot copy data between projects or run programs that process data from different projects, even if the user is a member of multiple projects. That is, we implement multi-tenancy with dynamic roles, where the user's role is based on the currently active project. Users can still share datasets between projects, however. HopsWorks has been enabled by migrating all metadata in HDFS and YARN into an open-source, shared-nothing, in-memory, distributed database called NDB. HopsWorks is open-source and licensed as Apache v2, with database connectors licensed as GPL v2. From late January 2016, HopsWorks will be provided as software-as-a-service for researchers and companies in Sweden from the Swedish ICT SICS Data Center (https://www.sics.se/projects/sicsice-data-center-in-lulea).
HopsWorks Implementation
HopsWorks is a J2EE 7 web application that runs by default on Glassfish and has a modern AngularJS user interface, supporting responsive HTML using the Bootstrap framework (that is, the UI adapts its layout for mobile devices). We have a separate administration application that is also a J2EE application but provides a JSF user interface. For reasons of security, the applications are kept separate, as we can deploy the administration application on a firewalled machine, while HopsWorks needs to be user-facing and open to clients, who may reside outside the internal network.
Conclusions
In this deliverable, we introduced Karamel (http://www.karamel.io), a new orchestration application for Chef and JClouds that enables the easy configuration and installation of BiobankCloud on both cloud platforms and on-premises (baremetal) hosts.
We also presented our SaaS platform for using BiobankCloud, HopsWorks, which provides an intuitive web-based user interface to the platform. Together, these tools help lower the barrier to entry for both Biobankers and Bioinformaticians in getting started with Hadoop and BiobankCloud. Our first experiences with presenting these tools to the community have been positive, and we will deploy them at three Biobanks in 2016, as part of the BBMRI Competence Center. BBMRI will take on the development of BiobankCloud and promote its use within the community. In a separate development, from February 2016, HopsWorks will be used to provide Hadoop-as-a-Service in Sweden to researchers and industry, where it will be deployed on 152 hosts in the Swedish ICT SICS North data center.
Bibliography
[1] David Bernstein. Containers and cloud: From LXC to Docker to Kubernetes. IEEE Cloud Computing, (3):81–84, 2014.
[2] Geoffrey C. Fox, Judy Qiu, Supun Kamburugamuve, Shantenu Jha, and Andre Luckow. HPC-ABDS high performance computing enhanced Apache big data stack. In Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on, pages 1057–1066. IEEE, 2015.
[3] Salman Niazi, Mahmoud Ismail, Stefan Grohsschiedt, and Jim Dowling. D2.3, Scalable and Highly Available HDFS, 2014.
[4] Fredrik Önnberg. Software configuration management: A comparison of Chef, CFEngine and Puppet. 2012.
[5] Liming Zhu, Donna Xu, An Binh Tran, Xiwei Xu, Len Bass, Ingo Weber, and Srini Dwarakanathan. Achieving reliable high-frequency releases in cloud environments. IEEE Software, 32(2):73–80, 2015.