D6.1 Biobank Platform as a Service
Project number: 317871
Project acronym: BIOBANKCLOUD
Project title: Scalable, Secure Storage of Biobank Data
Project website: http://www.biobankcloud.eu
Project coordinator: Jim Dowling (KTH)
Coordinator e-mail: [email protected]
WORK PACKAGE 6:
Integration and Evaluation
WP leader: Michael Hummel
WP leader organization: CHARITE
WP leader e-mail: [email protected]
PROJECT DELIVERABLE
D6.1
BiobankCloud Platform-as-a-Service
Due date: 30th November, 2015 (M36)
Editor
Jim Dowling (KTH)
Contributors
Jim Dowling, Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh, Ermias Gebremelski (KTH),
Marc Bux (HU), Tiago Oliveira, Ricardo Mendes (LIS)
Disclaimer
This work was partially supported by the European Commission through the FP7-ICT program
under project BiobankCloud, number 317871.
The information in this document is provided as is, and no warranty is given or implied that the
information is fit for any particular purpose.
The user thereof uses the information at its sole risk and liability. The opinions expressed in
this deliverable are those of the authors. They do not necessarily represent the views of all
BiobankCloud partners.
Executive Summary
In this deliverable, we introduce two new software services that we built to deploy and use
BiobankCloud, respectively. Firstly, we present Karamel, a standalone application that enables
a BiobankCloud cluster to be deployed in just a few mouse clicks. Karamel is an orchestration engine for the configuration management framework Chef, enabling it to coordinate the
installation of software on distributed systems. Karamel also integrates with cloud APIs to
create virtual machines for different cloud platforms. Together, these features enable Karamel
to provide an end-to-end system that creates virtual machines and coordinates the provisioning and configuration of the software for those virtual machines. The second service that we
present is HopsWorks, a Software-as-a-Service (SaaS) user interface to our Hadoop platform
and BiobankCloud. All of the software components in BiobankCloud have been integrated
in HopsWorks, from our security model to HopsFS, HopsYARN, the SAASFEE bioinformatics toolkit, and Charon for sharing files between clusters.
Together, Karamel and HopsWorks enable non-expert users to deploy BiobankCloud on cloud infrastructures and immediately use the software to curate data (Biobankers) or run workflows (Bioinformaticians), while storing petabytes of data in secure, isolated studies.
The document is structured as an overview of both services, containing the help guides for both Karamel and HopsWorks.
Introduction
During the course of BiobankCloud, we investigated many different possible approaches to
reducing the burden of deploying the BiobankCloud platform. BiobankCloud is based on Hops
Hadoop [3] and contains a large number of complex software services, each of which requires
installation and configuration. From experience, we realized that there was a compelling need
for the automated installation and configuration of BiobankCloud if the platform was to gain
wide adoption in the community.
As of late 2015, there are many different platforms that support automated installation [1].
• Google provides Kubernetes [1] for Google Cloud Engine, as well as an open-source variant that is not yet as complete, feature-wise, as the managed version;
• Amazon provides OpsWorks [5] as a way to automate the installation of custom software
by providing Chef cookbooks to install the software;
• Docker [1] provides a way to install software using container technology, with the advantage of being platform independent;
• OpenStack provides Heat [2] as a way to define clusters in a declarative manner, but
needs a backend configuration management platform, such as Chef [4, 5] or Puppet [5],
to install the software;
• JuJu [2] provides a managed way to install applications on Ubuntu hosts in a declarative
manner.
All of the above are fine solutions, but we needed a system that was:
1. open: supporting public clouds, private clouds, and on-premises installations;
2. easy-to-use: normal users should be able to click their way to a clustered deployment;
3. configurable: normal users should be able to configure the cluster to their available resources and environment.
Of the above systems, OpsWorks, OpenStack, and JuJu are not open, working only on their own platforms; Docker instances are not yet configurable (typically, people run Chef or Puppet on Docker instances to configure them); and Kubernetes is not yet feature-complete enough for non-Google deployments.
We designed and developed Karamel to meet all three requirements. Karamel is open, supporting Amazon Web Services (AWS), Google Compute Engine (GCE), OpenStack, and bare-metal hosts. Karamel is easy to use: users can deploy clusters through a user-friendly web user interface (UI). Users can also configure their clusters in the Web UI, easily adding machines and changing the configuration of services (for example, the amount of memory used by services such as the database and Hadoop).
Karamel is built as an orchestration engine on top of Chef. Chef is a popular configuration
framework for managing and provisioning software on large clusters. Chef does not support
either the orchestration of services (starting services in a well-defined order) or the creation
of virtual machines or docker instances. Chef assumes an existing cluster and works with
that. Chef provides two modes of use: using a Chef server or serverless. Karamel is built
on serverless Chef, called Chef Solo. In Chef Server mode, all nodes in the cluster run a
chef client that periodically contacts the chef server for instructions on software to install or
configure. The Chef server maintains the configuration information and credentials needed by
the services. In Karamel, our Karamel client application plays the role of the Chef server, but
only during installation. Karamel injects configuration parameters into Chef solo runs, enabling
the passing of parameters between different services during installation. For example, when
deploying Master/Slave software (such as our database, NDB, or data processing frameworks such as Spark), Karamel installs the master first and passes the master's public OpenSSH key to the slave nodes, so that they can be configured to allow the master passwordless SSH access to the slave machines.
Karamel requires Chef cookbooks to install and configure software. At a high level, Chef
cookbooks can be thought of as containers for logic for how to install and configure software
services. At a lower level, Chef cookbooks are containers for programs, written in Ruby and called recipes, that install and configure software services. Chef cookbooks also provide
parameters for configuring how software is installed (Chef attributes). These Chef attributes
are used by recipes to customize the software being installed or configured. In this deliverable,
we also wrote the Chef cookbooks for all our software services. Instead of providing web
pages containing instructions on how to install BiobankCloud, we now have programs that are
version-controlled in GitHub, automatically tested (using the Kitchen framework), and can be
composed in cluster definitions in Karamel.
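As a minimal illustration (a sketch only, not one of our production cookbooks; the service name and attribute keys are hypothetical), a recipe inside a cookbook typically installs packages, renders configuration from Chef attributes, and manages a service:

# recipes/install.rb -- hypothetical recipe sketch
package 'openjdk-7-jdk'

directory node['myservice']['data_dir'] do
  owner node['myservice']['user']
  recursive true
end

# render a config template, customized by Chef attributes
template '/etc/myservice/myservice.conf' do
  source 'myservice.conf.erb'
  variables(heap_mb: node['myservice']['heap_mb'])
  notifies :restart, 'service[myservice]'
end

service 'myservice' do
  action [:enable, :start]
end

The corresponding default attribute values would live in attributes/default.rb, and they can be overridden from a Karamel cluster definition.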
The first part of this deliverable includes a user guide and a developer guide for Karamel,
including sample cluster definitions that can be used to deploy BiobankCloud.
In the second part of this deliverable, we provide the user, installation, administration, and
developer guides for our Software-as-a-Service (SaaS) for HopsWorks and Hops. HopsWorks is
the frontend to BiobankCloud and integrates all the software components from BiobankCloud.
In Table 1, we show the features provided by HopsWorks and the project deliverables from which they were derived.
Each entry below lists a feature, its description, and the deliverable(s) from which it was integrated.
• Two-factor authentication: Secure authorization using smartphones and Yubikeys (D3.4 Security Toolset Final Version)
• Dynamic User Roles: Users can have different privileges in different studies (D3.4 Security Toolset Final Version)
• Biobanking forms: Consent forms, Non-consent Forms (D1.3 Legal and Ethical Framework ...)
• Audit Trails: Logging of user activity in the system (D3.4 Security Toolset Final Version)
• Study membership mgmt: Study owners manage users and their roles (D1.4 Disclosure model, D3.4 Security Toolset Final Version)
• Metadata mgmt: Metadata designer and metadata entry for files/directories (D3.5 Object Model Implementation)
• Free-text search: Search for projects/datasets/files/directories using Elasticsearch (D3.5 Object Model Implementation, D2.3 Scalable and Highly Available HDFS)
• Data set sharing: Sharing data between studies without copying (D3.5 Object Model Implementation, D2.3 Scalable and Highly Available HDFS)
• Data set browser: Explore/upload/download files and directories in HopsFS (D3.5 Object Model Implementation, D2.3 Scalable and Highly Available HDFS, D1.2 Object model for biobank data sharing)
• SAASFEE: Bioinformatics workflows on YARN using Cuneiform and HiWAY (D5.3 Workflows for NGS data analysis use cases, D6.3 Analysis Pipelines Linked to Public Biological Annotation)
• Charon: Sharing data between Biobanks (D4.3 Overbank Implementation and Evaluation, D2.4 Secure, scalable, highly-available Filesystem ...)
• Apache Zeppelin: Interactive analytics using Spark and Flink (D6.1 BiobankCloud Platform-as-a-Service)
Table 1: HopsWorks integrates features from BiobankCloud Deliverables.
HopsWorks as a new UI for Hadoop
Existing models for multi-tenancy in Hadoop, such as Amazon Web Services' Elastic MapReduce (EMR) platform, Google's Dataproc platform, and Altiscale's Hadoop-as-a-Service, provide multi-tenant Hadoop by running separate Hadoop clusters for separate projects or organizations. They improve cluster efficiency by running Hadoop clusters on virtualized or containerized platforms, but, in some cases, the clusters are not elastic, that is, they cannot be easily scaled up or down in size. There are also no tools for securely sharing data between platforms without copying data.
HopsWorks is a front-end to Hadoop that provides a new model for multi-tenancy in Hadoop,
based around projects. A project is like a GitHub project - the owner of the project manages
membership, and users can have different roles in the project: data scientists can run programs and data owners can also curate, import, and export data. Users can’t copy data between
projects or run programs that process data from different projects, even if the user is a member
of multiple projects. That is, we implement multi-tenancy with dynamic roles, where the user’s
role is based on the currently active project. Users can still share datasets between projects,
however. HopsWorks has been enabled by migrating all metadata in HDFS and YARN into
an open-source, shared nothing, in-memory, distributed database, called NDB. HopsWorks is
open-source and licensed as Apache v2, with database connectors licensed as GPL v2. From late
January 2016, HopsWorks will be provided as software-as-a-service for researchers and companies in Sweden from the Swedish ICT SICS Data Center (https://www.sics.se/projects/sicsice-data-center-in-lulea).
HopsWorks Implementation
HopsWorks is a J2EE 7 web application that runs by default on Glassfish and has a modern AngularJS user interface, supporting responsive HTML using the Bootstrap framework (that is, the UI adapts its layout for mobile devices). We have a separate administration application
Karamel Documentation
Release 0.2
www.karamel.io
December 12, 2015
CONTENTS

1 What is Karamel?
2 Getting Started
  2.1 How to run a cluster?
  2.2 Launching an Apache Hadoop Cluster with Karamel
  2.3 Designing an experiment with Karamel/Chef
  2.4 Designing an Experiment: MapReduce Wordcount
3 Web-App
  3.1 Board-UI
  3.2 Karamel Terminal
  3.3 Experiment Designer
4 Cluster Definition
  4.1 AWS (Amazon EC2)
  4.2 Google Compute Engine
  4.3 Bare-metal
5 Deploying BiobankCloud with Karamel
6 Developer Guide
  6.1 Code quality
  6.2 Build and run from Source
  6.3 Building Windows Executables
CHAPTER ONE
WHAT IS KARAMEL?
Karamel is a management tool for reproducibly deploying and provisioning distributed applications on bare-metal,
cloud or multi-cloud environments. Karamel provides explicit support for reproducible experiments for distributed
systems. Users of Karamel experience the tool as an easy-to-use UI-driven approach to deploying distributed systems
or running distributed experiments, where the deployed system or experiment can be easily configured via the UI.
Karamel users can open a cluster definition file that describes a distributed system or experiment as:
• the application stacks used in the system, containing the set of services in each application stack,
• the provider(s) for each application stack in the cluster (the cloud provider or IP addresses of the bare-metal
hosts),
• the number of nodes that should be created and provisioned for each application stack,
• configuration parameters to customize each application stack.
Karamel is an orchestration engine that orchestrates:
• the creation of virtual machines if a cloud provider is used;
• the global order for installing and starting services on each node;
• the injection of configuration parameters and passing of parameters between services.
Karamel enables the deployment of arbitrarily large distributed systems on both virtualized platforms (AWS,
Vagrant) and bare-metal hosts.
Karamel is built on the configuration framework, Chef. The distributed system or experiment is defined in YAML as a
set of node groups that each implement a number of Chef recipes, where the Chef cookbooks are deployed on github.
Karamel orchestrates the execution of Chef recipes using a set of ordering rules defined in a YAML file (Karamelfile)
in each cookbook. For each recipe, the Karamelfile can define a set of dependent (possibly external) recipes that should
be executed before it. At the system level, the set of Karamelfiles defines a directed acyclic graph (DAG) of service
dependencies. Karamel system definitions are very compact. We leverage Berkshelf to transparently download and
install transitive cookbook dependencies, so large systems can be defined in a few lines of code. Finally, the Karamel
runtime builds and manages the execution of the DAG of Chef recipes, by first launching the virtual machines or
configuring the bare-metal boxes and then executing recipes with Chef Solo. The Karamel runtime executes the node
setup steps using JClouds and Ssh. Karamel is agentless, and only requires ssh to be installed on the target host.
Karamel transparently handles faults by retrying, as virtual machine creation or configuration is not always reliable or
timely.
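As a tiny illustration (a fragment only, with an illustrative cluster name; complete, runnable examples follow in the Getting Started and Cluster Definition chapters), a cluster definition names the cookbooks it uses and groups nodes by the recipes to be installed on them:

name: MyTinyCluster
ec2:
  type: m3.medium
  region: eu-west-1
cookbooks:
  hadoop:
    github: "hopshadoop/apache-hadoop-chef"
groups:
  datanodes:
    size: 2
    recipes:
      - hadoop::dn
      - hadoop::nm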
Existing Chef cookbooks can easily be karamelized, that is, wrapped and extended with a Karamelfile containing
orchestration rules. In contrast to Chef, which is used primarily to manage production clusters, Karamel is designed
to support the creation of reproducible clusters for running experiments or benchmarks. Karamel provides additional
Chef cookbook support for copying experiment results to persistent storage before tearing down clusters.
In Karamel, infrastructure and software are delivered as code, while the cluster definitions can be configured by modifying the configuration parameters for the services contained in the cluster definition. Karamel uses Github as the
artifact-server for Chef cookbooks, and all experiment artifacts are globally available - any person around the globe
can replay/reproduce the construction of the distributed system.
Karamel leverages virtual-machines to provision infrastructures on different clouds. We have cloud-connectors for
Amazon EC2, Google Compute Engine, OpenStack and on-premises (bare-metal).
CHAPTER TWO
GETTING STARTED
2.1 How to run a cluster?
To run a simple cluster you need:
• a cluster definition file;
• access to a cloud (or bare-metal cluster);
• the Karamel client application.
You can use Karamel as a standalone application with a Web UI or embed Karamel as a library in your application,
using the Java-API to start your cluster.
2.1.1 Linux/Mac
1. Starting Karamel
To run Karamel, download the Linux/Mac binaries from http://www.karamel.io. You first have to unzip
the binaries (tar -xf karamel-0.2.tgz). From your machine’s terminal (command-line), run the
following commands:
cd karamel-0.2
./bin/karamel
This should open a new window in your web browser if it is already running, or open your default web browser if one is not already open. Karamel will appear on the opened webpage.
2.1.2 Windows
1. Starting Karamel
To run Karamel, download the Windows binaries from http://www.karamel.io. You first have to unzip the binaries.
From Windows Explorer, navigate to the folder karamel-0.2 (probably in the Downloads folder) and double-click on the karamel.exe file to start Karamel.
2. Customize and launch your cluster: take a look at the Board-UI.
Fig. 2.1: Karamel Homepage. Click on Menu to load a Cluster Definition file.
2.1.3 Command-Line in Linux/Mac
You can either set environment variables containing your EC2 credentials or enter them from the console. We recommend you set the environment variables, as shown below.
export AWS_KEY=...
export AWS_SECRET_KEY=...
./bin/karamel -launch examples/hadoop.yml
After you launch a cluster from the command-line, the client loops, printing out to stdout the status of the install DAG
of Chef recipes every 20 seconds or so. Both the GUI and command-line launchers print out stdout and stderr to log
files that can be found from the current working directory in:
tail -f log/karamel.log
2.1.4 Java-API:
You can run your cluster in your Java program by using our API.
1. Jar-file dependency: first add a dependency on the karamel-core jar-file; its pom file dependency is as follows:
<dependency>
<groupId>se.kth</groupId>
<artifactId>karamel-core</artifactId>
<scope>compile</scope>
</dependency>
2. Karamel Java API: load the content of your cluster definition into a variable and call the KaramelApi as in this example:
//instantiate the API
KaramelApi api = new KaramelApiImpl();

//load your cluster definition into a java variable
String clusterDefinition = ...;

//the API works with JSON, so convert the cluster definition into JSON
String json = api.yamlToJson(clusterDefinition);

//make sure your ssh keys are available; if not, let the API generate them for you
SshKeyPair sshKeys = api.loadSshKeysIfExist("");
if (sshKeys == null) {
  sshKeys = api.generateSshKeysAndUpdateConf(clusterName);
}

//register your ssh keys; that is the way of confirming your ssh keys
api.registerSshKeys(sshKeys);

//check if your credentials for AWS (or any other cloud) already exist, otherwise register them
Ec2Credentials credentials = api.loadEc2CredentialsIfExist();
api.updateEc2CredentialsIfValid(credentials);

//now you can start your cluster by giving the json representation of your cluster
api.startCluster(json);

//you can always check the status of your cluster by running the "status" command through the API
//run status at some time-interval to see updates for your cluster
long ms1 = System.currentTimeMillis();
int mins = 0;
while (ms1 + 24 * 60 * 60 * 1000 > System.currentTimeMillis()) {
  mins++;
  System.out.println(api.processCommand("status").getResult());
  Thread.currentThread().sleep(60000);
}
This code block will print out your cluster status to the console every minute.
2.2 Launching an Apache Hadoop Cluster with Karamel
A cluster definition file is shown below that defines an Apache Hadoop V2 cluster to be launched on AWS/EC2. If you click on Menu->Load Cluster Definition and open this file, you can then proceed to launch this Hadoop cluster by entering your AWS credentials and selecting or generating an OpenSSH keypair.
The cluster definition includes a cookbook called 'hadoop', and recipes for HDFS' NameNode (nn) and DataNodes (dn), as well as YARN's ResourceManager (rm) and NodeManagers (nm), and finally a recipe for the MapReduce JobHistoryService (jhs). The nn, rm, and jhs recipes are included in a single group called 'metadata', and a single node will be created (size: 1) on which all three services will be installed and configured. In a second group (the datanodes group), the dn and nm services will be installed and configured on two nodes (size: 2). If you want more instances of a particular group, you simply increase the value of the size attribute (e.g., set "size: 100" for the datanodes group if you want 100 DataNodes and NodeManagers for Hadoop). Finally, we parameterize this cluster deployment with version 2.7.1 of Hadoop (attrs -> hadoop -> version). The attrs section is used to supply parameters that are fed to Chef recipes during installation.
name: ApacheHadoopV2

ec2:
  type: m3.medium
  region: eu-west-1

cookbooks:
  hadoop:
    github: "hopshadoop/apache-hadoop-chef"
    version: "v0.1"

attrs:
  hadoop:
    version: 2.7.1

groups:
  metadata:
    size: 1
    recipes:
      - hadoop::nn
      - hadoop::rm
      - hadoop::jhs
  datanodes:
    size: 2
    recipes:
      - hadoop::dn
      - hadoop::nm
The cluster definition file also includes a cookbooks section. Github is our artifact server. We only support the use
of cookbooks in our cluster definition file that are located on GitHub. Dependent cookbooks (through Berkshelf)
may also be used (from Opscode’s repository, Chef supermarket or GitHub), but the cookbooks referenced in the
YAML file must be hosted on GitHub. The reason for this is that the Karamel runtime uses Github APIs to query
cookbooks for configuration parameters, available recipes, dependencies (Berksfile) and orchestration rules (defined
in a Karamelfile). The set of all Karamelfiles for all services is used to build a directed-acyclic graph (DAG) of the
installation order for recipes. This allows for modular development and the automatic composition of cookbooks into clusters, where each cookbook encapsulates its own orchestration rules. In this way, deployment modules for complicated distributed systems can be developed and tested incrementally, where each service defines its own independent deployment model in Chef and Karamel, and independent deployment modules can be automatically composed into clusters
in cluster definition files. This approach supports an incremental test and development model, helping improve the
quality of deployment software.
2.3 Designing an experiment with Karamel/Chef
An experiment in Karamel is a cluster definition file that contains a recipe defining the experiment. As such, an
experiment requires a Chef cookbook and recipe, and writing Chef cookbooks and recipes can be a daunting prospect
for even experienced developers. Luckily, Karamel provides a UI that can take a bash script or a python program and
generate a karamelized Chef cookbook with a Chef recipe for the experiment. The Chef cookbook is automatically
uploaded to a GitHub repository that Karamel creates for you. Your recipe may have dependencies on other recipes. For
example, a MapReduce experiment defined on the above cluster should wait until all the other services have started
before it runs. On examination of the Karamelfile for the hadoop cookbook, we can see that hadoop::jhs and
hadoop::nm are the last services to start. Our MapReduce experiment can state in the Karamelfile that it should
start after the hadoop::jhs and hadoop::nm services have started at all nodes in the cluster.
Experiments also have parameters and produce results. Karamel provides UI support for users to enter parameter
values in the Configure menu item. An experiment can also download experiment results to your desktop (the
Karamel client) by writing to the filename /tmp/<cookbook>__<recipe>.out. For detailed information on how to design experiments, see the Experiment Designer section.
2.4 Designing an Experiment: MapReduce Wordcount
This experiment is a wordcount program for MapReduce that takes as a parameter an input text file, given as a URL. The program counts the number of occurrences of each word found in the input file. First, create a new experiment called mapred in GitHub (in any organization). You will then need to click on the advanced tickbox to allow us to specify dependencies and parameters.
user=mapred
group=mapred
textfile=http://www.gutenberg.org/cache/epub/1787/pg1787.txt
Fig. 2.2: Defining the textfile input parameter. Parameters are key-value pairs defined in the Parameter box.
The generated bash script must wait until all HDFS DataNodes and YARN NodeManagers are up and running before it is run. To indicate this, we add the following lines to the Dependent Recipes textbox:
hadoop::dn
hadoop::nm
Our new cookbook will be dependent on the hadoop cookbook, and we have to enter into the Cookbook
Dependencies textbox the relative path to the cookbook on GitHub:
cookbook 'hadoop', github: 'hopshadoop/apache-hadoop-chef'
The following code snippet runs MapReduce wordcount on the input parameter textfile. The parameter is referenced in the bash script as #{node.mapred.textfile}, which follows the pattern node.<cookbookname>.<parameter>.
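A sketch of what the experiment body can look like is shown below. The Hadoop installation path, the HDFS directories, and the examples jar location are assumptions that depend on how the hadoop cookbook installs Hadoop, and should be adjusted to your deployment.

# download the input text file given by the 'textfile' parameter
wget "#{node.mapred.textfile}" -O /tmp/input.txt
# copy the input into HDFS (paths are illustrative)
/srv/hadoop/bin/hdfs dfs -mkdir -p /user/mapred/wordcount/input
/srv/hadoop/bin/hdfs dfs -copyFromLocal -f /tmp/input.txt /user/mapred/wordcount/input/
# run the stock wordcount example and time it
START=$(date +%s)
/srv/hadoop/bin/hadoop jar /srv/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar \
    wordcount /user/mapred/wordcount/input /user/mapred/wordcount/output
END=$(date +%s)
# write the result file that Karamel collects for this cookbook/recipe
echo "wordcount took $((END - START)) seconds" > /tmp/mapred__experiment.out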
Fig. 2.3: Define the Chef cookbook dependencies as well as the dependent recipes, the recipes that
have to start before the experiments in this cookbook.
CHAPTER THREE
WEB-APP
Karamel provides a web-based UI and is a lightweight standalone application that runs on user machines, typically
desktops. The user interface has three different perspectives: board, terminal, and experiment designer.
3.1 Board-UI
The Board is the landing page that appears in your browser when you start Karamel. The Board is a view of a cluster definition file that you load. You can modify the cluster using the UI (adding/removing recipes, entering parameter values, saving updated cluster definitions) and run the cluster definition from the UI. This way, inexperienced
users can launch clusters without needing to read cluster definitions in YAML.
3.1.1 Load Cluster Definition
Click on the menu item, and then click on Load Cluster Defn:
Fig. 3.1: Load Cluster Definition.
Lists are shown in the board perspective of the UI, where each list represents a group of machines that install the same
application stack. At the top of each list you see the group-name followed by the number of machines in that group
(in parentheses). Each list consists of a set of cards, where each card represents a service (Chef recipe) that will be
installed on all the machines in that group. Chef recipes are programs written in Ruby that contain instructions for
how to install and configure a piece of software or run an experiment.
Fig. 3.2: Lists of cards in Karamel. Each card is a Chef recipe.
3.1.2 Change group name and size
To change the GroupName and/or number of machines in each group, double click on the header of the group. In the
following dialog, you can make your changes and submit them (to indicate you are finished).
Fig. 3.3: Changing the number of nodes in a NodeGroup
3.1.3 Add a new recipe to a group
In the top-left icon in the header of each group, there is a menu to update the group. Select the Add recipe menu
item:
In order to add a recipe to a group, you must enter the GitHub URL for the (karamelized) Chef cookbook where
your recipe resides, and then press fetch to load available recipes from the cookbook. Choose your recipe from the
combo-box below:
Fig. 3.4: Adding a Chef recipe to a node group.
Fig. 3.5: Adding a Chef recipe from a GitHub cookbook to a node group.
3.1.4 Customize Chef attributes for a group
Parameters (Chef attributes) can be entered within the scope of a NodeGroup: group scope values have higher precedence than (override) global cluster scope values. To update chef attributes for a group, select its menu item from the
group settings menu:
Fig. 3.6: Updating Chef Attributes at the Group level.
In the dialog below, there is a tab for each cookbook used in that group; in each tab you see all customizable attributes, some mandatory and some optional, with default values where available. Users must set a value for all of the mandatory attributes (or accept the default value, if one is given).
3.1.5 Customize cloud provider for a group
Cluster definition files support the use of multiple (different) cloud providers within the same cluster definition. Each
group can specify its own cloud provider. This way, we can support multi-cloud deployments. Like attributes, cloud
provider settings at the NodeGroup scope will override cloud provider settings at the global scope. Should you have multi-cloud settings in your cluster, at launch time you must supply credentials for each cloud separately in the launch dialog.
Choose the cloud provider for the current group and you will then see more detailed settings for that cloud provider.
3.1.6 Delete a group
If you want to delete a group, find the menu-item in the group menu.
Once you delete a group, the list and all the settings related to that group will disappear forever.
Fig. 3.7: Entering attribute values to customize service.
Fig. 3.8: Multi-cloud deployments are supported by specifying different cloud providers for different node groups.
Fig. 3.9: Configuring a cloud provider per Node Group.
Fig. 3.10: Delete a Node Group.
Fig. 3.11: Delete Confirmation.
3.1.7 Update cluster-scope attributes
When you are done with your group settings, you can set global values for Chef attributes. By choosing the Configure button in the middle of the top bar, a configuration dialog will pop up; there you see several tabs, each named after a chef-cookbook used in the cluster definition. These attributes are pre-defined by cookbook designers for run-time customization. There are two types of attributes, mandatory and optional; most of them have a default value, but where they don't, the user must fill in the mandatory values to be able to proceed.
Fig. 3.12: To fill in optional and mandatory attributes.
By default, each cookbook has a parameter for the operating system's user-name and group-name. It is recommended to set the same user and group for all cookbooks, so that you don't face permission issues.
It is also important to fine-tune your systems with the right parameters; for instance, according to the type of the machines in your cluster, you should allocate enough memory to each system.
3.1.8 Start to Launch Cluster
Finally, you launch your cluster by pressing the launch icon in the top bar. There are a few tabs that the user must go through; you might have to specify values and confirm everything. Even though Karamel caches those values, you always have to confirm that Karamel is allowed to use those values for running your cluster.
3.1.9 Set SSH Keys
In this step, you first need to specify your ssh key pair; Karamel uses it to establish a secure connection to the virtual machines. On Linux and Mac operating systems, Karamel finds the default ssh key pair in your operating system and will use it.
Fig. 3.13: Filling in optional and mandatory attributes.
Fig. 3.14: Launch Button.
Fig. 3.15: SSH key paths.
3.1.10 Generate SSH Key
If you want to change the default ssh-key, you can just check the advanced box and from there ask Karamel to generate a new key pair for you.
3.1.11 Password Protected SSH Keys
If your ssh key is password-protected, you need to enter your password in the provided box. Also, in case you use bare-metal machines (Karamel doesn't fork machines from a cloud), you have to give sudo-account access to your machines.
3.1.12 Cloud Provider Credentials
In the second step of the launch, you need to give credentials for accessing the cloud of your choice. If your cluster is running on a single cloud, a tab related to that cloud will appear in the launch dialog; if you use multiple clouds, a separate tab for each cloud will appear. Credentials are usually in a different format for each cloud; for more detailed information, see the related cloud section.
3.1.13 Final Control
When you have passed all the steps, you can launch your cluster from the summary tab. This will bring you to the terminal, where you can follow the installation of your cluster.
Fig. 3.16: Advanced options for SSH keys.
Fig. 3.17: Provider-specific credentials.
Fig. 3.18: Validity summary for keys and credentials.
3.2 Karamel Terminal
The terminal perspective enables the user to monitor and manage running clusters, as well as to debug Chef recipes by running them.
3.2.1 Open Terminal
The Board-UI redirects to the terminal as soon as a cluster launches. Another way to access the terminal is by clicking
on the terminal menu-item from the menu dropdown list, as shown in the screen-shot below.
Fig. 3.19: Selecting the terminal perspective.
3.2.2 Button Bar
The Terminal has a menu bar in which the available menu items (buttons) change dynamically based on the active
page.
3.2.3 Command Bar
Under the menu bar, there is a long text area where you can execute commands directly. The buttons (menu items)
are, in fact, just widely used commands. To see the list of commands, click on the Help button.
3.2.4 Main Page
The main page in the terminal shows the available running clusters (you can run multiple clusters at the same time; they just need to have different names), where you see the general status of each cluster. There are some actions in front of each cluster through which you can obtain more detailed information about it.
Fig. 3.20: Successful launch redirects to terminal page.
3.2.5 Cluster Status
The status page frequently refreshes the status of the chosen cluster. In the first table, you see the phases of the cluster deployment and whether each of them has passed successfully or not.
Fig. 3.21: Cluster Status - A recently started cluster
The cluster deployment phases are: (1) Pre-Cleaning, (2) Forking Groups, (3) Forking Machines, and (4) Installing.
As soon as the cluster passes the forking groups phase, a list of machine tables appears under the phases table. Each machine table indicates that the virtual machine (VM) has been forked, and some details on the VM are available, such as its IP addresses (public and private) and its connection status.
Inside each machine table there is a smaller table showing the tasks that are going to be submitted to that machine. Before all machines have been forked and are ready, all task tables are empty.
Once all machines have been forked and are ready, a list of tasks is displayed for each machine. The Karamel Scheduler
orders tasks and decides when each task is ready to be run. The scheduler assigns a status label to each task.
The task status labels are:
• Waiting: the task is still waiting until its dependencies have finished;
• Ready: the task is ready to be run, but the associated machine has not yet taken it because it is running another task;
• Ongoing: the task is currently running on the machine;
• Succeed: the task has finished successfully;
• Failed: the task has failed; each failure is propagated up to the cluster and causes the cluster to pause the installation.
Fig. 3.22: Cluster Status - Forking Machines
When a task is finished, a link to its log content is displayed in the third column of the task table. The log is the
merged content of the standard output and standard error streams.
Fig. 3.23: Cluster Status - Installing.
3.2.6 Orchestration DAG
The scheduler in Karamel builds a Directed Acyclic Graph (DAG) from the set of tasks in the cluster. In the terminal
perspective, the progress of the DAG execution can be visualized by clicking on the “Orchestration DAG” button.
Each Node of the DAG represents a task that must be run on a certain machine. Nodes dynamically change their color
according to the status change of their tasks. Each color is interpreted as follows:
• Waiting: blue
• Ready: yellow
• Ongoing: blinking orange
• Succeed: green
• Failed: red
The Orchestration DAG is not only useful to visualize the cluster progress but can also help in debugging the level
of parallelization in the installation graph. If some tasks are acting as global barriers during installation, they can be
quickly identified by inspecting the DAG and seeing the nodes with lots of incoming edges and some outgoing edges.
As cookbooks only have local orchestration rules in their Karamelfiles, the DAG is built from the set of Karamelfiles. It is not easy to manually traverse the DAG, given a set of Karamelfiles, but the visual DAG enables easier inspection of the global order of installation of tasks.
Fig. 3.24: Orchestration DAG
3.2.7 Quick Links
Quick links are a facility that Karamel provides in the terminal perspective to access web pages for services in your cluster.
For example, when you install Apache Hadoop, you might want to access the NameNode or ResourceManager’s
web UI. Those links must be designed in karamelized cookbooks (in the metadata.rb file). Karamel parses the
metadata.rb files, extracting the webpage links and displaying them in the Quick Links tab.
Fig. 3.25: Quick Links
3.2.8 Statistics
Currently, Karamel collects information about the duration of all tasks when you deploy a cluster. Duration statistics are available by clicking on the statistics button, which shows the names of the tasks and their execution times. You may have several instances of each task in your cluster; for example, you may install the hadoop::dn recipe on several machines in your cluster, and all such instances will appear in the statistics table. Statistics are a good basis for performance measurement in some types of experiments; you can simply plot them to show the performance of your experiment.
3.2.9 Pause/Resume
A cluster may pause running either because the user ordered it to or because a failure happened. Pausing is useful if the user wants to change something or wants to avoid running the entire cluster for some reason. When you click on the "Pause" button, it takes some time until all machines finish their currently running task and go into the paused mode. When the cluster is paused, a resume button will appear, which resumes running the cluster again.
3.2.10 Purge
Purge is a button to destroy and release all the resources, both in the clouds and in the Karamel runtime, destroying any virtual machines created. It is recommended to use the purge function in Karamel to clean up resources rather than doing so manually; Karamel makes sure all ssh connections, local threads, virtual machines, and security groups are released completely.
3.3 Experiment Designer
The Experiment Designer perspective in Karamel helps you design your experiment as a bash script or a python program without needing to know Chef or Git. Take the following steps to design and deploy your experiment.
3.3.1 Find experiment designer
When you have the Karamel web app up and running, you can access the experiment designer from the Experiment
menu-item on the left-hand side of the application.
Fig. 3.26: Get into the experiment designer.
3.3.2 Login into Github
Github is Karamel's artifact server. Here, you will have to log in to your Github account the first time; Karamel will remember your credentials for subsequent sessions.
Fig. 3.27: Login button.
Fig. 3.28: Github credentials.
3.3.3 Start working on experiment
You can either create a new experiment or, alternatively, load an already designed experiment into the designer.
Fig. 3.29: Work on a new or old experiment.
3.3.4 Create a new experiment
If you choose to create a new experiment, you will need to choose a name for it, optionally describe it, and choose which Github repo you want to host your experiment in. As you can see in the image below, Karamel connects to Github and fetches your available repos.
Fig. 3.30: New experiment on a Github repo.
3.3.5 Write body of experiment
At this point, you land in the programming section of your experiment. The default name for the experiment recipe is "experiment". In the large text-area, as can be seen in the screenshot below, you can write your experiment code in either bash or python. Karamel will automatically wrap your code into a Chef recipe. All parameters in the experiment come in the format of Chef variables: you should wrap them inside #{} and prefix them with node.<cookbookname>. By default, they have the format #{node.cookbook.paramName}, where paramName is the name of your parameter. If you write the results of your experiment to a file called /tmp/wordcount__experiment.out (assuming your cookbook is called "wordcount" and your recipe is called "experiment"), Karamel will download that file and put it into the ~/.karamel/results/ folder on your client machine.
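For example, the last lines of such an experiment body could write its result file like this (a sketch only; the measured workload and the timing variables are illustrative):

START=$(date +%s)
# ... run the experiment workload here ...
END=$(date +%s)
echo "experiment took $((END - START)) seconds" > /tmp/wordcount__experiment.out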
3.3.6 Define orchestration rules for experiment
Placing your experiment at the right point in the cluster orchestration is an essential part of your experiment design. Click the advanced checkbox and write, line-separated, the cookbook::recipe_name entries that must have finished before the experiment starts. If your experiment depends on other cookbooks (for recipes or parameters), you must enter the relative GitHub name of the cookbook and the version/branch, in line-separated format, in the second text-area.
3.3.7 Push your experiment into Github
You can save your experiment to GitHub by pressing the save button in the top right-hand corner of the webpage. This will generate your cookbook and copy all the files to Github by adding, committing, and pushing the new files to GitHub.
Fig. 3.31: Experiment bash script.
3.3.8 Approve uploaded experiment to Github
Navigate to your Github repo in your web browser and you will see your cookbook.
Fig. 3.32: Orchestration rules for new cluster.
Fig. 3.33: Push the experiment to a Github repository.
Fig. 3.34: New experiment added to Github.
CHAPTER FOUR
CLUSTER DEFINITION
The cluster definition format is an expressive DSL based on YAML as you can see in the following sample. Since
Karamel can run several clusters simultaneously, the name of each cluster must be unique.
Currently, we support four cloud providers: Amazon EC2 (ec2), Google Compute Engine (gce), OpenStack Nova (nova), and bare-metal (baremetal). You can define a provider globally within a cluster definition file, or you can define a different provider for each group in the cluster definition file. In the group scope, you can overwrite some attributes of the network/machines defined in the global scope, or you can choose an entirely different cloud provider, defining a multi-cloud deployment. Settings and properties for each provider are introduced in later sections. For a single-cloud deployment, one often uses group-scope provider details to override the type of instance used for machines in the group. For example, one group of nodes may require lots of memory and processing power, while other nodes require less. For AWS, you would achieve this by overriding the instanceType attribute.
The Cookbooks section specifies GitHub references to the cookbooks used in the cluster definition. It is possible to
refer to a specific version or branch for each GitHub repository.
We group machines based on the application stack (list of recipes) that should be installed on the machines in the
group. The number of machines in each group and list of recipes must be defined under each group name.
name: spark

ec2:
  type: m3.medium
  region: eu-west-1

cookbooks:
  hadoop:
    github: "hopshadoop/apache-hadoop-chef"
  spark:
    github: "hopshadoop/spark-chef"
    branch: "master"

groups:
  namenodes:
    size: 1
    recipes:
      - hadoop::nn
      - hadoop::rm
      - hadoop::jhs
      - spark::master
  datanodes:
    size: 2
    recipes:
      - hadoop::dn
      - hadoop::nm
      - spark::slave
4.1 AWS (Amazon EC2)
In cluster definitions, we use the keyword ec2 for deploying the cluster on the Amazon EC2 cloud. The following code
snippet shows all supported attributes for AWS.
ec2:
  type: c4.large
  region: eu-west-1
  ami: ami-47a23a30
  price: 0.1
  vpc: vpc-f70ea392
  subnet: subnet-e7830290
The type of the virtual machine, the region (data center), and the Amazon Machine Image (AMI) are the basic properties.
We also support spot instances as a way to control your budget. Since Amazon's prices change based on demand, price is a limit (in USD) that you can set if you are not willing to pay beyond that limit.
4.1.1 Virtual Private Cloud on AWS-EC2
We support AWS VPC on EC2 for better performance. First, define your VPC in EC2 with the following steps, then include your vpc and subnet ids in the cluster definition, as shown above.
1. Make a VPC and a subnet assigned to it under your ec2.
2. Check the “Auto-assign Public IP” item for your subnet.
3. Make an internet gateway and attach it to the VPC.
4. Make a routing table for your VPC and add a row for your gateway to it; on this row, open all IPs ('0.0.0.0/0').
5. Add your vpc-id and subnet-id to the ec2 section of your YAML, as in the example above. Also make sure you are using the right image and type of instance for your VPC.
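For reference, the same VPC setup can also be scripted with the AWS CLI; the sketch below is an assumption of equivalent commands (the CIDR blocks and the returned resource ids are illustrative), not part of Karamel itself.

# create a VPC and a subnet with public IP auto-assignment
aws ec2 create-vpc --cidr-block 10.0.0.0/16
aws ec2 create-subnet --vpc-id vpc-xxxxxxxx --cidr-block 10.0.0.0/24
aws ec2 modify-subnet-attribute --subnet-id subnet-xxxxxxxx --map-public-ip-on-launch
# create and attach an internet gateway
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --internet-gateway-id igw-xxxxxxxx --vpc-id vpc-xxxxxxxx
# create a route table, open all IPs through the gateway, and associate it with the subnet
aws ec2 create-route-table --vpc-id vpc-xxxxxxxx
aws ec2 create-route --route-table-id rtb-xxxxxxxx --destination-cidr-block 0.0.0.0/0 --gateway-id igw-xxxxxxxx
aws ec2 associate-route-table --route-table-id rtb-xxxxxxxx --subnet-id subnet-xxxxxxxx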
4.2 Google Compute Engine
To deploy the cluster on Google's infrastructure, we use the keyword gce in the cluster definition YAML file. The following code snippet shows the currently supported attributes:
gce:
  type: n1-standard-1
  zone: europe-west1-b
  image: ubuntu-1404-trusty-v20150316
Machine type, zone of the VMs, and the VM image can be specified by the user.
Karamel uses Compute Engine's OAuth 2.0 authentication method. Therefore, an OAuth 2.0 client ID needs to be created through Google's Developer Console. The description of how to generate a client ID is available here. You need to select Service account as the application type. After generating a service account, click on the Generate new JSON key button to download a generated JSON file that contains both private and public keys. You need to register the full path of the generated JSON file with the Karamel API.
4.3 Bare-metal
Bare-metal clusters are supported, but the machines must first be prepared with support for login using a ssh-key that
is stored on the Karamel client. The target hosts must be contactable using ssh from the Karamel client, and the target
hosts’ ip-addresses must be specified in the cluster definition. If you have many ip-addresses in a range, it is possible
to give range of addresses instead of specifying them one by one (the second example below). The public key stored
on the Karamel client should be copied to the .ssh/authorized_keys file in the home folder of the sudo account on the
target machines that will be used to install the software. The username that goes into the cluster definition is the sudo account, and if a password is required to get sudo access, it must be entered in the Web UI or through Karamel's programmatic API.
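As a preparation sketch for the cluster definitions shown below (an assumption about your environment: a sudo user named ubuntu and an existing key pair ~/.ssh/id_rsa on the Karamel client), the client's public key can be authorized on each target host like this:

# authorize the Karamel client's public key on every bare-metal host
for ip in 192.168.33.12 192.168.33.13 192.168.33.14 192.168.44.15; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub ubuntu@$ip
done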
baremetal:
  username: ubuntu
  ips:
    - 192.168.33.12
    - 192.168.33.13
    - 192.168.33.14
    - 192.168.44.15
4.3.1 IP-Range
baremetal:
  username: ubuntu
  ips:
    - 192.168.33.12-192.168.33.14
    - 192.168.44.15
CHAPTER FIVE
DEPLOYING BIOBANKCLOUD WITH KARAMEL
BiobankCloud is a Platform-as-a-Service (PaaS) for biobanking with Big Data (Hadoop). BiobankCloud brings together:
• Hops Hadoop;
• SAASFEE, a bioinformatics platform for YARN that provides both a workflow language (Cuneiform) and a 2nd-level scheduler (HiWAY);
• Charon, a cloud-of-clouds filesystem, for sharing data between BiobankCloud clusters.
We have written karamelized Chef cookbooks for installing all of the components of BiobankCloud, and we provide
some sample cluster definitions for installing small, medium, and large BiobankCloud clusters. Users are, of course,
expected to adapt these sample cluster definitions to their cloud provider or bare-metal environment as well as their
needs.
The following is a brief description of the karamelized Chef cookbooks that we have developed to support the installation of BiobankCloud. The cookbooks are all publicly available at: http://github.com/.
• hopshadoop/apache-hadoop-chef
• hopshadoop/hops-hadoop-chef
• hopshadoop/elasticsearch-chef
• hopshadoop/ndb-chef
• hopshadoop/zeppelin-chef
• hopshadoop/hopsworks-chef
• hopshadoop/spark-chef
• hopshadoop/flink-chef
• biobankcloud/charon-chef
• biobankcloud/hiway-chef
The following is a cluster definition file that installs BiobankCloud on a single m3.xlarge instance on AWS/EC2:
name: BiobankCloudSingleNodeAws

ec2:
  type: m3.xlarge
  region: eu-west-1

cookbooks:
  hops:
    github: "hopshadoop/hops-hadoop-chef"
    branch: "master"
  hadoop:
    github: "hopshadoop/apache-hadoop-chef"
    branch: "master"
  hopsworks:
    github: "hopshadoop/hopsworks-chef"
    branch: "master"
  ndb:
    github: "hopshadoop/ndb-chef"
    branch: "master"
  spark:
    github: "hopshadoop/spark-chef"
    branch: "hops"
  zeppelin:
    github: "hopshadoop/zeppelin-chef"
    branch: "master"
  elastic:
    github: "hopshadoop/elasticsearch-chef"
    branch: "master"
  charon:
    github: "biobankcloud/charon-chef"
    branch: "master"
  hiway:
    github: "biobankcloud/hiway-chef"
    branch: "master"

attrs:
  hdfs:
    user: glassfish
    conf_dir: /mnt/hadoop/etc/hadoop
  hadoop:
    dir: /mnt
  yarn:
    user: glassfish
    nm:
      memory_mbs: 9600
      vcores: 4
  mr:
    user: glassfish
  spark:
    user: glassfish
  hiway:
    home: /mnt/hiway
    user: glassfish
    release: false
    hiway:
      am:
        memory_mb: '512'
        vcores: '1'
      worker:
        memory_mb: '3072'
        vcores: '1'
  hopsworks:
    user: glassfish
    twofactor_auth: "true"
  hops:
    use_hopsworks: "true"
  ndb:
    DataMemory: '50'
    IndexMemory: '15'
    dir: "/mnt"
    shared_folder: "/mnt"
  mysql:
    dir: "/mnt"
  charon:
    user: glassfish
    group: hadoop
    user_email: [email protected]
    use_only_aws: true

groups:
  master:
    size: 1
    recipes:
      - ndb::mysqld
      - ndb::mgmd
      - ndb::ndbd
      - hops::ndb
      - hops::rm
      - hops::nn
      - hops::dn
      - hops::nm
      - hopsworks
      - zeppelin
      - charon
      - elastic
      - spark::master
      - hiway::hiway_client
      - hiway::cuneiform_client
      - hiway::hiway_worker
      - hiway::cuneiform_worker
      - hiway::variantcall_worker
The following is a cluster definition file that installs a very large, highly available BiobankCloud cluster on 56 m3.xlarge instances on AWS/EC2:
name: BiobankCloudMediumAws

ec2:
  type: m3.xlarge
  region: eu-west-1

cookbooks:
  hops:
    github: "hopshadoop/hops-hadoop-chef"
    branch: "master"
  hadoop:
    github: "hopshadoop/apache-hadoop-chef"
    branch: "master"
  hopsworks:
    github: "hopshadoop/hopsworks-chef"
    branch: "master"
  ndb:
    github: "hopshadoop/ndb-chef"
    branch: "master"
  spark:
    github: "hopshadoop/spark-chef"
    branch: "hops"
  zeppelin:
    github: "hopshadoop/zeppelin-chef"
    branch: "master"
  elastic:
    github: "hopshadoop/elasticsearch-chef"
    branch: "master"
  charon:
    github: "biobankcloud/charon-chef"
    branch: "master"
  hiway:
    github: "biobankcloud/hiway-chef"
    branch: "master"

attrs:
  hdfs:
    user: glassfish
    conf_dir: /mnt/hadoop/etc/hadoop
  hadoop:
    dir: /mnt
  yarn:
    user: glassfish
    nm:
      memory_mbs: 9600
      vcores: 8
  mr:
    user: glassfish
  spark:
    user: glassfish
  hiway:
    home: /mnt/hiway
    user: glassfish
    release: false
    hiway:
      am:
        memory_mb: '512'
        vcores: '1'
      worker:
        memory_mb: '3072'
        vcores: '1'
  hopsworks:
    user: glassfish
    twofactor_auth: "true"
  hops:
    use_hopsworks: "true"
  ndb:
    DataMemory: '8000'
    IndexMemory: '1000'
    dir: "/mnt"
    shared_folder: "/mnt"
  mysql:
    dir: "/mnt"
  charon:
    user: glassfish
    group: hadoop
    user_email: [email protected]
    use_only_aws: true

groups:
  bbcui:
    size: 1
    recipes:
      - ndb::mgmd
      - ndb::mysqld
      - hops::ndb
      - hops::client
      - hopsworks
      - spark::yarn
      - charon
      - zeppelin
      - hiway::hiway_client
      - hiway::cuneiform_client
  metadata:
    size: 2
    recipes:
      - hops::ndb
      - hops::rm
      - hops::nn
      - ndb::mysqld
  elastic:
    size: 1
    recipes:
      - elastic
  database:
    size: 2
    recipes:
      - ndb::ndbd
  workers:
    size: 50
    recipes:
      - hops::ndb
      - hops::dn
      - hops::nm
      - hiway::hiway_worker
      - hiway::cuneiform_worker
      - hiway::variantcall_worker
Alternative configurations are, of course, possible. You could run several Elasticsearch instances for high availability
and more master instances if you have many active clients.
CHAPTER SIX
DEVELOPER GUIDE
We have organized our code into two main projects, karamel-core and karamel-ui. The core is our engine for launching, installing and monitoring clusters. The UI is a standalone web application containing several designers and
visualizers. There is a REST-API in between the UI and the core.
The core and REST-API are programmed in Java 7, and the UI is programmed in Angular JS.
6.1 Code quality
1. Testability and mockability: write your code in a way that you can test each unit separately. Split concerns into different modules so that you can mock one when testing the other. We use JUnit 4 for unit testing and Mockito for mocking.
2. Code style: write DRY (Don't Repeat Yourself) code, use spaces instead of tabs, and keep lines at most 120 characters wide.
3. We use Google Guava and its best practices, especially the basic ones such as nullity checks and preconditions; a short sketch of these conventions follows.
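The following is a minimal sketch (the class and interface names are illustrative, not actual Karamel classes) of how these conventions combine JUnit 4, Mockito, and Guava preconditions:

import static com.google.common.base.Preconditions.checkNotNull;
import static org.mockito.Mockito.*;
import org.junit.Test;

public class ClusterPlannerTest {

  interface CloudConnector {                 // mockable collaborator
    int forkMachines(String group, int size);
  }

  static class ClusterPlanner {              // unit under test, with nullity checks
    private final CloudConnector connector;
    ClusterPlanner(CloudConnector connector) {
      this.connector = checkNotNull(connector, "connector");
    }
    int provision(String group, int size) {
      return connector.forkMachines(checkNotNull(group, "group"), size);
    }
  }

  @Test
  public void provisionDelegatesToConnector() {
    CloudConnector connector = mock(CloudConnector.class);
    when(connector.forkMachines("datanodes", 2)).thenReturn(2);
    new ClusterPlanner(connector).provision("datanodes", 2);
    verify(connector).forkMachines("datanodes", 2);
  }
}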
6.2 Build and run from Source
Ubuntu Requirements:
apt-get install lib32z1 lib32ncurses5 lib32bz2-1.0
Centos 7 Requirements:
Install zlib.i686, ncurses-libs.i686, and bzip2-libs.i686 on CentOS 7
Building from root directory:
mvn install
Running:
cd karamel-ui/target/appassembler
./bin/karamel
6.3 Building Windows Executables
You need to have 32-bit libraries to build the windows exe from Linux, as the launch4j plugin requires them.
sudo apt-get install gcc binutils-mingw-w64-x86-64 -y
# Then replace 32-bit libraries with their 64-bit equivalents
cd /home/ubuntu/.m2/repository/net/sf/
cd launch4j/launch4j/3.8.0/launch4j-3.8.0-workdir-linux/bin
rm ld windres
ln -s /usr/bin/x86_64-w64-mingw32-ld ./ld
ln -s /usr/bin/x86_64-w64-mingw32-windres ./windres
Then run maven with the -Pwin to run the plugin:
mvn -Dwin package
HopsWorks Documentation
www.hops.io
December 12, 2015
CONTENTS
1 Hops Overview
  1.1 Audience
  1.2 Revision History
  1.3 What is Hops?
  1.4 HopsWorks
    1.4.1 Users
    1.4.2 Projects and DataSets
    1.4.3 Analytics
    1.4.4 MetaData Management
    1.4.5 Free-text search
  1.5 HopsFS
  1.6 HopsYarn
  1.7 BiobankCloud
2 System Requirements
  2.1 Recommended Setup
  2.2 Entire Hops platform on a single baremetal machine
  2.3 Entire Hops platform on a single virtualbox instance (vagrant)
  2.4 DataNode and NodeManager
  2.5 NameNode, ResourceManager, NDB Data Nodes, HopsWorks, and ElasticSearch
3 Hops Installation
  3.1 Cloud Platforms (AWS, GCE, OpenStack)
    3.1.1 Karamel/Chef
  3.2 On-Premises (baremetal) Installation
  3.3 Vagrant (Virtualbox)
  3.4 Windows
  3.5 Apple OSX/Mac
  3.6 Hops Chef Cookbooks
  3.7 BiobankCloud Chef Cookbooks
4 HopsWorks User Guide
  4.1 First Login (no 2-Factor Authentication)
  4.2 First Login with 2-Factor Authentication
  4.3 Register a New Account on HopsWorks
  4.4 Forgotten Password / Lost Smartphone
  4.5 Update your Profile/Password
  4.6 If it goes wrong
  4.7 Create a New Project
  4.8 Delete a Project
  4.9 Data Set Browser
  4.10 Upload Data
  4.11 Compress Files
  4.12 Share a Data Set
  4.13 Free-text Search
  4.14 Jobs
  4.15 Charon
  4.16 Apache Zeppelin
  4.17 Metadata Management
  4.18 MetaData Designer
  4.19 MetaData Attachment and Entry
5 HopsFS User Guide
  5.1 Unsupported HDFS Features
  5.2 NameNodes
    5.2.1 Formating the Filesystem
    5.2.2 NameNode Caches
    5.2.3 Adding/Removing NameNodes
  5.3 DataNodes
  5.4 HopsFS Clients
  5.5 Compatibility with HDFS Clients
  5.6 HopsFS Async Quota Management
  5.7 Block Reporting
6 Hops-YARN User Guide
  6.1 Removed/Replaced YARN Features
  6.2 ResourceManager
    6.2.1 Adding/Removing a ResourceManager
  6.3 YARN Clients
  6.4 YARN NodeManager
7 HopsWorks Administrator Guide
  7.1 Activating users
  7.2 User fails to receive an email to validate her account
  7.3 User receives email, but fails to validate the account
  7.4 Configuring email for HopsWorks
  7.5 User successfully validates the account, but still can't login
  7.6 User account has been disabled due to too many unsuccessful login attempts
  7.7 Disabling a user account
  7.8 Re-activating a user account
  7.9 Managing Project Quotas
  7.10 Disabling/Re-enabling Projects
  7.11 Ubikeys in HopsWorks
    7.11.1 Glassfish Administration
8 HopsFS Configuration
  8.1 Leader Election
  8.2 NameNode Cache
  8.3 Distributed Transaction Hints
  8.4 Quota Management
  8.5 Block Reporting
  8.6 Distributed Unique ID generator
  8.7 Namespace and Block Pool ID
  8.8 Client Configurations
  8.9 Data Access Layer (DAL)
    8.9.1 MySQL Cluster Network Database Driver Configuration
    8.9.2 Loading a DAL Driver
  8.10 HopsFS-EC Configuration
9 Hops-YARN Configuration
  9.1 Configuring Hops-YARN fail-over
  9.2 Batch Processing of Operations
    9.2.1 Database back pressure
    9.2.2 Proxy provider
  9.3 Configuring Hops-YARN distributed mode
10 Hops Developer Guide
  10.1 Extending HopsFS INode metadata
    10.1.1 Example use case
    10.1.2 Adding a table to the schema
    10.1.3 Defining the Entity Class
    10.1.4 Defining the DataAccess interface
    10.1.5 Implementing the DataAccess interface
    10.1.6 Implementing the EntityContext
    10.1.7 Using custom locks
  10.2 Erasure Coding API Access
    10.2.1 Java API
    10.2.2 Creation of Encoded Files
    10.2.3 Encoding of Existing Files
    10.2.4 Reverting To Replication Only
    10.2.5 Deletion Of Encoded Files
11 License Compatibility
CHAPTER
ONE
HOPS OVERVIEW
1.1 Audience
This document contains four guides: an installation guide, a user guide, an administration guide, and a developer guide. We recommend the following guides for each type of reader:
• Data Scientists
– User Guide
• Hadoop Administrators
– Installation Guide
– Administration Guide
• Data Curators
– User Guide
• Hops Developers
– Installation Guide
– User Guide
– Developer Guide
1.2 Revision History
Date       Release   Description
Nov 2015   2.4.0     First release of Hops Documentation.
1.3 What is Hops?
Hops is a next-generation distribution of Apache Hadoop that supports:
• Hadoop-as-a-Service,
• Project-Based Multi-Tenancy,
• Secure sharing of DataSets across projects,
• Extensible metadata that supports free-text search using Elasticsearch,
• YARN quotas for projects.
The key innovation that enables these features is a new architecture for scale-out, consistent metadata for both the Hadoop Filesystem (HDFS) and YARN (Hadoop's Resource Manager). The new metadata layer enables us to support multiple stateless NameNodes and TBs of metadata stored in MySQL Cluster Network Database (NDB). NDB is a distributed, relational, in-memory, open-source database. This has enabled us to provide services such as tools for designing extended metadata (whose integrity with filesystem data is ensured through foreign keys in the database), and also to extend HDFS' metadata to enable new features such as erasure-coded replication, reducing storage requirements by 50% compared to triple replication in Apache HDFS. Extended metadata has also enabled us to implement quota-based scheduling for YARN, where projects can be given quotas of CPU hours/minutes and memory, thus enabling resource usage in Hadoop-as-a-Service to be accounted and enforced.
Hops builds on YARN to provide support for application and resource management. All YARN frameworks
can run on Hops, but currently we only provide UI support for general data-parallel processing frameworks
such as Apache Spark, Apache Flink, and MapReduce. We also support frameworks used by BiobankCloud
for data-parallel bioinformatics workflows, including SAASFEE and Adam. In future, other frameworks
will be added to the mix.
1.4 HopsWorks
HopsWorks is the UI front-end to Hops. It supports user authentication through either a native solution,
LDAP, or two-factor authentication. There are both user and administrator views for HopsWorks. HopsWorks
implements a perimeter security model, where command-line access to Hadoop services is restricted, and
all jobs and interactive analyses are run from the HopsWorks UI and Apache Zeppelin (an iPython notebook
style web application).
HopsWorks provides first-class support for DataSets and Projects. Each DataSet has a home project. Each
project has a number of default DataSets:
• Resources: contains programs and small amounts of data
• Logs: contains outputs (stdout, stderr) for YARN applications
HopsWorks implements dynamic role-based access control for projects. That is, users do not have static
global privileges. A user’s privileges depend on what the user’s active project is. For example, the user may
be a Data Owner in one project, but only a Data Scientist in another project. Depending on which project is
active, the user may be a Data Owner or a Data Scientist.
The following roles are supported:
Fig. 1.1: Dynamic Roles ensures strong multi-tenancy between projects in HopsWorks.
A Data Scientist can
• run interactive analytics through Apache Zeppelin
• run batch jobs (Spark, Flink, MR)
• upload to a restricted DataSet (called Resources) that contains only programs and resources
A Data Owner can
• upload/download data to the project,
• add and remove members of the project
• change the role of project members
• create and delete DataSets
• import and export data from DataSets
• design and update metadata for files/directories/DataSets
HopsWorks covers users, projects and DataSets, analytics, metadata management, and free-text search.
1.4.1 Users
• Users authenticate with a valid email address. An optional second factor can be enabled for authentication; supported devices are smartphones (Android, Apple, Windows) or Yubikey USB sticks.
1.4.2 Projects and DataSets
HopsWorks provides the following features:
• project-based multi-tenancy with dynamic roles;
• CPU hour quotas for projects (supported by HopsYARN);
• the ability to share DataSets securely between projects (reuse of DataSets without copying);
• DataSet browser;
• import/export of data using the Browser.
1.4.3 Analytics
HopsWorks provides two services for executing applications on YARN:
• Apache Zeppelin: interactive analytics for Spark, Flink, and other data-parallel frameworks;
• YARN batch jobs: batch-based submission (including Spark, MapReduce, Flink, Adam, and SAASFEE).
1.4.4 MetaData Management
HopsWorks provides support for the design and entry of extended metadata for files and directories:
• design your own extended metadata using an intuitive UI;
• enter extended metadata using an intuitive UI.
1.4.5 Free-text search
HopsWorks integrates with Elasticsearch to provide free-text search for files/directories and their extended
metadata:
• Global free-text search for projects and DataSets in the filesystem;
• Project-based free-text search of all files and extended metadata within a project.
1.5 HopsFS
HopsFS is a new implementation of the Hadoop Filesystem (HDFS), based on Apache Hadoop 2.x [1], that
supports multiple stateless NameNodes, where the metadata is stored in an in-memory distributed database
(NDB). HopsFS enables NameNode metadata to be both customized and analyzed, because it can be easily
accessed via SQL or the native API (NDB API).
HopsFS replaces HDFS 2.x’s Primary-Secondary Replication model with an in-memory, shared nothing
database. HopsFS provides the DAL-API as an abstraction layer over the database, and implements a leader
election protocol using the database. This means HopsFS no longer needs several services required by
highly available Apache HDFS: quorum journal nodes, Zookeeper, and the Snapshot server.
[1] http://hadoop.apache.org/releases.html
Fig. 1.2: HopsFS Architecture.
1.6 HopsYarn
HopsYARN introduces a new metadata layer for Apache YARN, where the cluster state is stored in a distributed, in-memory, transactional database. HopsYARN enables us to provide quotas for Projects, in terms
of how many CPU minutes and memory are available for use by each project. Quota-based scheduling is
built as a layer on top of the capacity scheduler, enabling us to retain the benefits of the capacity scheduler.
Apache Spark We support Apache Spark for both interactive analytics and jobs.
Apache Zeppelin Apache Zeppelin is built-in to HopsWorks. We have extended Zeppelin with access
control, ensuring only users in the same project can access and share the same Zeppelin notebooks. We will
soon provide source-code control for notebooks using GitHub.
Apache Flink Streaming Apache Flink provides a dataflow processing model and is highly suitable for
stream processing. We support it in HopsWorks.
Other Services HopsWorks is a web application that runs on a highly secure Glassfish server. Elasticsearch is used to provide free-text search services. MySQL Cluster (NDB) is used to store the platform's metadata.
1.7 BiobankCloud
BiobankCloud extends HopsWorks with platform-specific support for Biobanking and Bioinformatics.
These services are:
• An audit log for user actions;
• Project roles compliant with the draft European General Data Protection Regulation;
Fig. 1.3: Hops YARN Architecture.
• Consent form management for projects (studies);
• Charon, a service for securely sharing data between clusters using public clouds;
• SaasFee (cuneiform), a YARN-based application for building scalable bioinformatics pipelines.
CHAPTER
TWO
SYSTEM REQUIREMENTS
The Hops stack can be installed on both cloud platforms and on-premises (baremetal). The recommended
machine specifications given below do not take into account whether local storage is used or a cloud storage
platform is used. For best performance due to improved data locality, we recommend local storage (instance
storage in Amazon Web Services (AWS)/EC2).
2.1 Recommended Setup
We recommend either Ubuntu/Debian or CentOS/Redhat as operating system (OS), with the same OS on all
machines. A typical deployment of Hops Hadoop uses
• DataNodes/NodeManagers: a set of commodity servers in a 12-24 SATA hard-disk JBOD setup;
• NameNodes/ResourceManagers/NDB-database-nodes/HopsWorks-app-server: a homogeneous set of
commodity (blade) servers with good CPUs, a reasonable amount of RAM, and one or two hard-disks;
• MySQL Cluster Data nodes: a homogeneous set of commodity (blade) servers with a good amount of
RAM (up to 512 GB) and good CPU(s). A good quality SATA disk is needed to store database logs.
SSDs can also be used, but are typically not required.
• Hopsworks: a single commodity (blade) server with a good amount of RAM (up to 128 GB) and good
CPU(s). A good quality disk is needed to store logs. Either SATA or a large SSD can be used.
For cloud platforms, such as AWS, we recommend using enhanced networking for the MySQL Cluster
Data Nodes and the NameNodes/ResourceManagers. High latency connections between these machines
will negatively affect system throughput.
2.2 Entire Hops platform on a single baremetal machine
You can run HopsWorks and the entire Hops stack on a bare-metal single machine for development or testing
purposes, but you will need at least:
Component          Minimum Requirements
Operating System   Linux, Mac
RAM                8 GB of RAM
CPU                2 GHz dual-core minimum. 64-bit.
Hard disk space    15 GB free space
Network            1 Gb Ethernet
2.3 Entire Hops platform on a single virtualbox instance (vagrant)
You can run HopsWorks and the entire Hops stack on a single virtualbox instance for development or testing
purposes, but you will need at least:
Component          Minimum Requirements
Operating System   Linux, Mac, Windows (using Virtualbox)
RAM                10 GB of RAM
CPU                2 GHz dual-core minimum. 64-bit.
Hard disk space    15 GB free space
Network            1 Gb Ethernet
2.4 DataNode and NodeManager
A typical deployment of Hops Hadoop installs both the Hops DataNode and NodeManager on a set of
commodity servers, running without RAID (replication is done in software) in a 12-24 harddisk JBOD
setup. Depending on your expected workloads, you can put as much RAM and CPU in the nodes as needed.
Configurations can have up to (and probably more) than 512 GB RAM and 32 cores.
The recommended setup for these machines in production (on a cost-performance basis) is:
Component          Recommended (late 2015)
Operating System   Linux, Mac, Windows (using Virtualbox)
RAM                128 GB RAM
CPU                Two CPUs with 12 cores. 64-bit.
Hard disk          12 x 4 TB SATA disks
Network            1 Gb Ethernet
2.5 NameNode, ResourceManager, NDB Data Nodes, HopsWorks,
and ElasticSearch
NameNodes, ResourceManagers, NDB database nodes, ElasticSearch, and the HopsWorks application
server require relatively more memory and not as much hard-disk space as DataNodes. The machines can
be blade servers with only a disk or two. SSDs will not give significant performance improvements to any
of these services, except the HopsWorks application server if you copy a lot of data in and out of the cluster
via HopsWorks. The NDB database nodes will require free disk space that is at least 20 times the size of the
RAM they use. Depending on how large your cluster is, the ElasticSearch server can be colocated with the
HopsWorks application server or moved to its own machine with lower RAM and CPU requirements than
the other services.
1 GbE gives great performance, but 10 GbE really makes it rock! You can deploy 10 GbE incrementally:
first between the NameNodes/ResourceManagers <–> NDB database nodes to improve metadata processing
performance, and then on the wider cluster.
The recommended setup for these machines in production (on a cost-performance basis) is:
Component          Recommended (late 2015)
Operating System   Linux, Mac, Windows (using Virtualbox)
RAM                128 GB RAM
CPU                Two CPUs with 12 cores. 64-bit.
Hard disk          12 x 4 TB SATA disks
Network            1 Gb Ethernet
CHAPTER
THREE
HOPS INSTALLATION
The Hops stack includes a number of services and also requires a number of third-party distributed services:
• Java 1.7 (OpenJDK or Oracle JRE/JDK)
• NDB 7.4+ (MySQL Cluster)
• J2EE7 web application server (default: Glassfish)
• ElasticSearch 1.7+
Due to the complexity of installing and configuring all Hops’ services, we recommend installing Hops using
the automated installer Karamel/Chef (http://www.karamel.io). We do not provide detailed documentation
on the steps for installing and configuring all services in Hops. Instead, Chef cookbooks contain all the
installation and configuration steps needed to install and configure Hops. The Chef cookbooks are available
at: https://github.com/hopshadoop.
3.1 Cloud Platforms (AWS, GCE, OpenStack)
Hops can be installed on a cloud platform using Karamel/Chef.
3.1.1 Karamel/Chef
1. Download and install Karamel (http://www.karamel.io).
2. Run Karamel.
3. Click on the “Load Cluster Definition” menu item in Karamel. You are now prompted to select a
cluster definition YAML file. Go to the examples/stable directory, and select a cluster definition file
for your target cloud platform for one of the following cluster types:
(a) Amazon Web Services EC2 (AWS)
(b) Google Compute Engine (GCE)
(c) OpenStack
(d) On-premises (bare metal)
For more information on how to configure cloud-based installations, go to help documentation at
http://www.karamel.io. For on-premises installations, we provide some additional installation details and
tips later in this section.
Choosing which services to run on which nodes
You now need to decide which services you will install on which nodes. In Karamel, we design a set of Node Groups, where each Node Group defines a stack of services to be installed on a machine. Each machine runs the set of services of exactly one Node Group. We provide the following recommended setups:
• a single-node cluster that includes all services on a single node;
• a tiny cluster of heavy stacks that includes many services on each node;
• a small cluster of heavy stacks that includes many services on each node;
• a large cluster of light stacks that includes fewer services on each node.
Single Node Setup You can run the entire HopsWorks application platform on a single node. You will have
a NodeGroup with the following services on the single node:
1. HopsWorks, Elasticsearch, Zeppelin, MySQL Server, NDB Mgmt Server, HDFS NameNode, YARN
ResourceManager, NDB Data Node(s), HDFS DataNode, YARN NodeManager
Tiny Cluster Setup
We recommend the following setup that includes the following NodeGroups, and requires at least 2 nodes
to be deployed:
1. HopsWorks, Elasticsearch, Zeppelin, MySQL Server, NDB Mgmt Server, HDFS NameNode, YARN
ResourceManager, NDB Data Node
2. HDFS DataNode, YARN NodeManager
This is really only a test setup: one node is dedicated to YARN applications and file storage, while the other node handles the metadata-layer services.
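The following is a minimal sketch of the groups section of a Karamel cluster definition for this tiny setup. Recipe names are reused from the BiobankCloud cluster definition in the previous chapter (including hops::ndb, which installs the NDB driver there); group names and sizes are illustrative:

groups:
  metadata:
    size: 1
    recipes:
      - ndb::mgmd
      - ndb::mysqld
      - ndb::ndbd
      - hops::ndb
      - hops::nn
      - hops::rm
      - hopsworks
      - elastic
      - zeppelin
  workers:
    size: 1
    recipes:
      - hops::ndb
      - hops::dn
      - hops::nm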
Small Cluster Setup
We recommend the following setup that includes four NodeGroups, and requires at least 4 nodes to be
deployed:
1. HopsWorks, Elasticsearch, Zeppelin, MySQL Server, NDB Mgmt Server,
2. HDFS NameNode, YARN ResourceManager, MySQL Server
3. NDB Data Node
4. HDFS DataNode, YARN NodeManager
A highly available small cluster would require at least two instances of the last three NodeGroups. HopsWorks can also be deployed on multiple instances, but Elasticsearch needs to be specially configured if it is to be sharded across many instances.
Large Cluster Setup
We recommend the following setup, which includes six NodeGroups and requires at least 6 nodes to be deployed:
1. Elasticsearch
2. HopsWorks, Zeppelin, MySQL Server, NDB Mgmt Server
3. HDFS NameNode, MySQL Server
4. YARN ResourceManager, MySQL Server
5. NDB Data Node
6. HDFS DataNode, YARN NodeManager
A highly available large cluster would require at least two instances of every NodeGroup. HopsWorks can also be deployed on multiple instances, while Elasticsearch needs to be specially configured if it is to be sharded across many instances. Otherwise, the other services can easily be scaled out by simply adding instances in Karamel. For improved performance, the metadata layer could be deployed on a better network (10 GbE at the time of writing), and the last NodeGroup (DataNode/NodeManager) instances could be deployed on cheaper network infrastructure (bonded 1 GbE or 10 GbE, at the time of writing).
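As a sketch, the six NodeGroups above could be expressed in the groups section of a cluster definition as follows. Recipe names follow the BiobankCloud definition in the previous chapter; group names and sizes are illustrative and should be scaled to your workload:

groups:
  elastic:
    size: 1
    recipes:
      - elastic
  frontend:
    size: 1
    recipes:
      - hopsworks
      - zeppelin
      - ndb::mysqld
      - ndb::mgmd
  namenodes:
    size: 2
    recipes:
      - hops::ndb
      - hops::nn
      - ndb::mysqld
  resourcemanagers:
    size: 2
    recipes:
      - hops::ndb
      - hops::rm
      - ndb::mysqld
  database:
    size: 2
    recipes:
      - ndb::ndbd
  workers:
    size: 10
    recipes:
      - hops::ndb
      - hops::dn
      - hops::nm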
HopsWorks Configuration in Karamel
Karamel Chef recipes support a large number of parameters that can be set while installing Hops. These parameters include, but are not limited to:
• usernames to install and run services as,
• usernames and passwords for services, and
• sizing and tuning configuration parameters for services (resources used, timeouts, etc).
Here are some of the most important security parameters to set when installing services:
• Superuser username and password for the MySQL Server(s). Default: 'kthfs' and 'kthfs'.
• Administration username and password for the Glassfish administration account(s). Default: 'adminuser' and 'adminpw'.
• Administration username and password for HopsWorks. Default: '[email protected]' and 'admin'.
Here are some of the most important sizing configuration parameters to set when installing services (an example attrs snippet is sketched after this list):
• DataMemory for NDB Data Nodes
• YARN NodeManager amount of memory and number of CPUs
• Heap size and Direct Memory for the NameNode
• Heap size for Glassfish
• Heap size for Elasticsearch
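The snippet below is a minimal sketch of how such parameters appear in the attrs section of a Karamel cluster definition. The ndb, yarn, hopsworks, and mysql attribute names are taken from the BiobankCloud cluster definition shown earlier in this deliverable; any password attributes (for example, a MySQL superuser password) are assumptions here, and their exact names should be checked in the corresponding cookbook's attribute documentation:

attrs:
  ndb:
    DataMemory: '8000'       # memory (MB) reserved for data on each NDB data node
    IndexMemory: '1000'      # memory (MB) reserved for indexes on each NDB data node
  yarn:
    nm:
      memory_mbs: 9600       # memory made available by each NodeManager
      vcores: 8              # virtual cores made available by each NodeManager
  hopsworks:
    user: glassfish
    twofactor_auth: "true"
  mysql:
    dir: "/mnt"
    # password: <superuser password>   # assumed attribute name; check the ndb-chef cookbook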
3.2 On-Premises (baremetal) Installation
For on-premises (bare-metal) installations, you will need to prepare for installation by:
1. identifying a master host, from which you will run Karamel;
(a) the master must have a display for Karamel’s user interface;
(b) the master must be able to ping (and connect using ssh) to all of the target hosts.
2. identifying a set of target hosts, on which the Hops software and 3rd party services will be installed.
(a) the target nodes should have http access to the open Internet to be able to download software
during the installation process. (Cookbooks can be configured to download software from within
the private network, but this requires a good bit of configuration work for Chef attributes, changing all download URLs).
The master must be able to connect using SSH to all the target nodes, on which the software will be installed.
If you have not already copied the master’s public key to the .ssh/authorized_keys file of all target hosts,
you can do so by preparing the machines as follows:
1. Create an openssh public/private key pair on the master host for your user account. On Linux, you
can use the ssh-keygen utility program to generate the keys, which will by default be stored in the
$HOME/.ssh/id_rsa and $HOME/.ssh/id_rsa.pub files. If you decided to enter a password for the ssh keypair, you will need to enter it again in Karamel when you reach the ssh dialog,
part of Karamel’s Launch step. We recommend no password (passwordless) for the ssh keypair.
2. Create a user account USER on all the target machines with full sudo privileges (root privileges) and the same password on all target machines.
3. Copy the $HOME/.ssh/id_rsa.pub file on the master to the /tmp folder of all the target hosts. A good way to do this is to use the pscp utility along with a file (hosts.txt) containing the line-separated hostnames (or IP addresses) of all the target machines. You may need to install the pssh utility programs (pssh) first.
$ sudo apt-get install pssh
or
$ yum install pssh
$ vim hosts.txt
# Enter the row-separated IP addresses of all target nodes in hosts.txt
128.112.152.122
18.31.0.190
128.232.103.201
.....
$ pscp -h hosts.txt -P PASSWORD -i USER ~/.ssh/id_rsa.pub /tmp
$ pssh -h hosts.txt -i USER -P PASSWORD mkdir -p /home/USER/.ssh
$ pssh -h hosts.txt -i USER -P PASSWORD "cat /tmp/id_rsa.pub >> /home/USER/.ssh/authorized_keys"
Update your Karamel cluster definition file to include the IP addresses of the target machines and the USER account name. After you have clicked on the launch menu item, you will come to an Ssh dialog. On the ssh dialog, you need to open the advanced section. Here, you will need to enter the password for the USER account on the target machines (in the sudo password text input box). If your ssh keypair is password protected, you will also need to enter that password here, in the keypair password text input box. A sketch of a bare-metal cluster definition fragment is shown below.
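For illustration only, a bare-metal cluster definition fragment might declare the SSH user and the target IP addresses roughly as follows; the exact keys (baremetal, username, ips), as well as the group names and sizes, are assumptions and should be checked against the cluster definitions shipped in Karamel's examples/stable directory:

name: BiobankCloudOnPremises
baremetal:
  username: USER
groups:
  metadata:
    size: 2
    baremetal:
      ips:
        - 128.112.152.122
        - 18.31.0.190
    recipes:
      - hops::ndb
      - hops::nn
      - hops::rm
      - ndb::mysqld
  workers:
    size: 1
    baremetal:
      ips:
        - 128.232.103.201
    recipes:
      - hops::ndb
      - hops::dn
      - hops::nm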
Note Redhat/Centos is not yet supported by Karamel, but you can install Hops using Chef-solo
by logging into each machine separately. The chef cookbooks are written to work for both
the Debian/Ubuntu and Redhat/Centos platform families.
3.3 Vagrant (Virtualbox)
You can install HopsWorks and Hops on your laptop/desktop with Vagrant. You will need to have the
following software packages installed:
• chef-dk, version 0.5 or higher (but not 0.8 or higher)
• git
• vagrant
• vagrant omnibus plugin
• virtualbox
You can now run vagrant, using:
$ sudo apt-get install virtualbox vagrant
$ vagrant plugin install vagrant-omnibus
$ git clone https://github.com/hopshadoop/hopsworks-chef.git
$ cd hopsworks-chef
$ berks vendor cookbooks
$ vagrant up
You can then access Hopsworks from your browser at http://127.0.0.1:8080/hopsworks. The default credentials are:
username: [email protected]
password: admin
You can access the HopsWorks administration application from your browser at http://127.0.0.1:8080/hopsworks/index.xhtml. The default credentials are:
username: [email protected]
password: admin
The Glassfish web application server is also available from your browser at http://127.0.0.1:4848. The
default credentials are:
username: adminuser
password: adminpw
The MySQL Server is also available from the command-line, if you ssh into the vagrant host (vagrant
ssh). The default credentials are:
username: kthfs
password: kthfs
It goes without saying, but for production deployments, we recommend changing all of these default credentials. The credentials can all be changed in Karamel during the installation phase.
3.4 Windows
You can also install HopsWorks with Vagrant on Windows. You will need to follow the Vagrant instructions above (installing the same software packages) as well as installing:
• Powershell
After cloning the github repo, from the powershell, you can run:
$ cd hopsworks-chef
$ berks vendor cookbooks
$ vagrant up
3.5 Apple OSX/Mac
You can follow the Vagrant instructions above for Linux to install on OSX. Note that MySQL Cluster is not recommended for production installations on OSX, although it is fine for development setups.
3.6 Hops Chef Cookbooks
Hops’ automated installation is orchestrated by Karamel and the installation/configuration logic is written
as ruby programs in Chef. Chef supports the modularization of related programs in a unit of software,
called a Chef cookbook. A Chef cookbook can be seen as a collection of programs, where each program
contains instructions for how to install and configure software services. A cookbook may consist of one or more programs that are known as recipes. These Chef recipes are executed by either a Chef client (that
can talk to a Chef server) or chef-solo, a standalone program that has no dependencies on a Chef Server.
Karamel uses chef-solo to execute Chef recipes on nodes. The benefit of this approach is that it is agentless.
That is, Karamel only needs ssh to be installed on the target machine to be able to install and set up Hops.
Karamel also provides dependency injection for Chef recipes, supplying the parameters (Chef attributes)
used to execute recipes. Some stages/recipes return results (such as the IP address of the NameNode) that
are used in subsequent recipes (for example, to generate configuration files containing the IP address of the
NameNode, such as core-site.xml).
The following is a brief description of the Chef cookbooks that we have developed to support the installation of Hops. The recipes follow the naming convention <cookbook>::<recipe>. You can determine the URL for each cookbook by prefixing the name with http://github.com/. All of the recipes have been karamelized, that is, a Karamelfile containing orchestration rules has been added to each cookbook. A sketch showing how these cookbooks are referenced from a cluster definition is given at the end of this section.
• hopshadoop/apache-hadoop-chef
– This cookbook contains recipes for installing the Apache Hadoop services: HDFS NameNode
(hadoop::nn), HDFS DataNode (hadoop::dn), YARN ResourceManager (hadoop::rm), YARN
NodeManager (hadoop::nm), Hadoop Job HistoryServer for MapReduce (hadoop::jhs), Hadoop
ProxyServer (hadoop::ps).
• hopshadoop/hops-hadoop-chef
– This cookbook is a wrapper cookbook for the Apache Hadoop cookbook. It installs Hops, but makes use of the Apache Hadoop Chef cookbook to install and configure the software. The recipes it provides are: HopsFS NameNode (hops::nn), HopsFS DataNode (hops::dn), HopsYARN ResourceManager (hops::rm), HopsYARN NodeManager (hops::nm), Hadoop Job HistoryServer for MapReduce (hops::jhs), and Hadoop ProxyServer (hops::ps).
• hopshadoop/elasticsearch-chef
– This cookbook is a wrapper cookbook for the official Elasticsearch Chef cookbook, but it has
been extended with Karamel orchestration rules.
• hopshadoop/ndb-chef
– This cookbook contains recipes for installing MySQL Cluster services: NDB Management
Server (ndb::mgmd), NDB Data Node (ndb::ndbd), MySQL Server (ndb::mysqld), Memcached
for MySQL Cluster (ndb::memcached).
• hopshadoop/zeppelin-chef
– This cookbook contains a default recipe for installing Apache Zeppelin.
• hopshadoop/hopsworks-chef
– This cookbook contains a default recipe for installing HopsWorks.
• hopshadoop/spark-chef
– This cookbook contains recipes for installing the Apache Spark Master, Worker, and a YARN
client.
• hopshadoop/flink-chef
– This cookbook contains recipes for installing the Apache Flink jobmanager, taskmanager, and a
YARN client.
3.7 BiobankCloud Chef Cookbooks
• biobankcloud/charon-chef
– This cookbook contains a default recipe for installing Charon.
• biobankcloud/hiway-chef
– This cookbook contains recipes for installing HiWAY, Cuneiform, the BiobankCloud workflows, and some example workflows.
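To tie these cookbooks back to Karamel, the sketch below shows how a cluster definition lists cookbooks and then refers to their recipes inside node groups (branch names, group names, and group sizes are illustrative; see the full BiobankCloud definition in the Karamel chapter of this deliverable):

cookbooks:
  hops:
    github: "hopshadoop/hops-hadoop-chef"
    branch: "master"
  hiway:
    github: "biobankcloud/hiway-chef"
    branch: "master"

groups:
  namenodes:
    size: 1
    recipes:
      - hops::nn
      - hops::rm
  workers:
    size: 2
    recipes:
      - hops::dn
      - hops::nm
      - hiway::hiway_worker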
CHAPTER
FOUR
HOPSWORKS USER GUIDE
If you are using 2-Factor authentication, jump ahead to “First Login with 2-Factor Authentication”.
4.1 First Login (no 2-Factor Authentication)
Fig. 4.1: HopsWorks Login Page
On initial installation, you can login with the default username and password.
username: [email protected]
password: admin
If you manage to login successfully, you will arrive on the landing page:
HopsWorks Landing (Home) Page
On the landing page, you can see a box for projects, a search bar (to find projects and data sets), an audit trail, and a user menu (to change user settings or log out).
If it goes wrong
If login does not succeed, something has gone wrong during installation. The possible sources of error are the web application server (Glassfish) and the database (MySQL Cluster).
Actions:
• Double-check that the system meets the minimum system requirements for HopsWorks. Is there enough available disk space and memory?
• Re-run the installation, as something may have gone wrong during installation.
• Investigate Glassfish misconfiguration problems. Is Glassfish running? Is the hopsworks.war application installed? Are the JDBC connections working? Is JavaMail configured correctly?
• Investigate MySQL Cluster misconfiguration problems. Are the mgm server, data nodes, and MySQL
server running? Do the hops and hopsworks databases exist and are they populated with tables and
rows? If not, something went wrong during installation.
4.2 First Login with 2-Factor Authentication
For 2-Factor Authentication, you cannot login directly via the web browser. You first need to generate your
2nd factor credentials for the default account ([email protected], admin). Login to the target machine where
HopsWorks is installed, and run:
sudo /bin/hopsworks-2fa
It should return something like:
+--------------+------------------+
| email        | secret           |
+--------------+------------------+
| [email protected] | V3WBPS4G2WMQ53VA |
+--------------+------------------+
Fig. 4.2: Google Authenticator - Enter the Provided Key V3WBPS4G2WMQ53VA for [email protected] as
a Time-Based Key.
You now need to start Google Authenticator on your smartphone. If you don't have Google Authenticator installed, install it from your app store. It is available for free on:
• Android as Google Authenticator,
• iOS (Apple iPhone) as OTP Auth, and
• Windows Phone as Microsoft Authenticator.
After starting your Google Authenticator application, create an account (set up account), add as the account email the default installation email address ([email protected]), and add as the provided key the secret value returned by /bin/hopsworks-2fa (for example, 'V3WBPS4G2WMQ53VA'). The key is a time-based key, if you need to specify the type of provided key. This should register your second factor on your phone.
You can now go to the start-page on Google Authenticator to read the six-digit one-time password (OTP).
Note that the OTP is updated every 30 seconds. On the HopsWorks login page, you will need to supply the 6-digit number (OTP) shown for [email protected], along with the username and password.
Fig. 4.3: HopsWorks Two-Factor Authentication Login Page
4.3 Register a New Account on HopsWorks
The process for registering a new account is as follows:
1. Register your email address and details and use the camera from within Google Authenticator to store
your 2nd factor credential;
2. Validate your email address by clicking on the link in the validation email you received;
3. Wait until an administrator has approved your account (you will receive a confirmation email).
Fig. 4.4: HopsWorks User Registration Page
Register a new account with a valid email account. If you have two-factor authentication enabled, you will
then need to scan the QR code to save it on your phone. If you miss this step, you will have to recover your
smartphone credentials at a later stage.
In both cases, you should receive an email asking you to validate your account. The sender of the email
will be either the default [email protected] or a gmail address that was supplied while installing
HopsWorks. If you do not receive an email, wait a minute. If you still haven’t received it, you should contact
the administrator.
Validate the email address used in registration
If you click on the link supplied in the registration email, it will validate your account. You will not be able to log in until an administrator has approved your account [1].
After your account has been approved, you can go to the HopsWorks login page and start your Google Authenticator application on your smartphone. On the HopsWorks login page, you will need to enter
[1] If you are an administrator, you can jump ahead to the Hops Administration Guide to see how to validate account registrations.
Fig. 4.5: Two-factor authentication: Scan the QR Code with Google Authenticator
• the email address you registered with
• the password you registered with
• the 6-digit number shown in Google Authenticator for the email address you registered with.
4.4 Forgotten Password / Lost Smartphone
If you forget your password or lose your 2nd factor device (smartphone or yubikey), you will need to recover
your credentials. On the login screen, click on Need Help? to recover your password or replace the QR
code for your smartphone.
4.5 Update your Profile/Password
After you have logged in, in the upper right-hand corner of the screen, you will see your email address with
a caret icon. Click on the caret icon, then click on the menu item Account. A dialog will pop-up, from
where you can change your password and other parts of your profile. You cannot change your email address
and will need to create a new account if you wish to change your email address. You can also logout by
clicking on the sign out menu item.
4.6 If it goes wrong
Contact an administrator or go to the Administration Guide section of this document. If you are an administrator:
• Does your organization have a firewall that blocks outbound SMTP access? HopsWorks needs SMTP
outbound access over TLS using SSL (port 587 or 465).
• Is the Glassfish server up and running? Can you login to the Glassfish Administration console (on
port 4848)?
• Inside Glassfish, check the JavaMail settings. Is the gmail username/password correct? Are the SMTP
server settings correct (hostname/ip, port, protocol (SSL, TLS))?
User fails to receive an email to validate her account
• This may be a misconfigured gmail address/password or a network connectivity issue.
• Does your organization have a firewall that blocks outbound SMTP access?
• For administrators: was the correct gmail username/password supplied when installing?
• If you are not using a Gmail address, are the smtp server settings correct (ip-address or hostname,
port, protocol (SSL, TLS))?
User receives the validate-your-email message, but is not able to validate the account
• Can you successfully access the HopsWorks homepage? If not, there may be a problem with the
network or the webserver may be down.
• Is the Glassfish webserver running and hopsworks.war application installed, but you still can’t logon?
It may be that MySQL Cluster is not running.
• Check the Glassfish logs for problems and the Browser logs.
User successfully validates the account, but still can’t login
The user account status may not be in the correct state, see next section for how to update user account
status.
User account has been disabled due to too many unsuccessful login attempts
From the HopsWorks administration application, the administrator can re-enable the account by going to
“User Administration” and taking the action “Approve account”.
User account has been disabled due to too many unsuccessful login attempts
Contact your system administrator who will re-enable your account.
4.7 Create a New Project
You can create a project by clicking on the New button in the Projects box. This will pop-up a dialog, in
which you enter the project name, an optional description, and select an optional set of services to be used
in the project. You can also select an initial set of members for the project, who will be given the role of
Data Scientist in the project. Member roles can later be updated in the Project settings by the project owner
or a member with the data owner role.
4.8 Delete a Project
Right click on the project to be deleted in the projects box. You have the options to:
• Remove and delete data sets;
– If the user deletes the project, the files are moved to trash in HopsFS;
• Remove and keep data sets.
4.9 Data Set Browser
The Data Set tab enables you to browse Data Sets, files and directories in this project. It is mostly used as
a file browser for the project’s HDFS subtree. You cannot navigate to directories outside of this project’s
subtree.
4.10 Upload Data
Files can be uploaded using HopsWorks’ web interface. Go to the project you want to upload the file(s) to.
You must have the Data Owner role for that project to be able to upload files. In the Data Sets tab, you will
see a button Upload Files.
Option        Description
Upload File   You have to have the Data Owner role to be able to upload files. Click on the
              Upload File button to select a file from your local disk. Then click Upload All
              to upload the file(s) you selected. You can also upload folders.
4.11 Compress Files
HopsFS supports erasure-coding of files, which reduces storage requirements for large files by roughly 50%.
If a file consists of 6 file blocks or more (that is, if the file is larger than 384 MB in size, for a default block
size of 64 MB), then it can be compressed. Smaller files cannot be compressed.
4.12 Share a Data Set
Only a Data Owner or the project owner has privileges to share Data Sets. To share a Data Set, go to the Data Sets Browser in your project, right-click on the Data Set to be shared, and select the Share option. A popup dialog will then prompt you to select (1) a target project with which the Data Set is to be shared and (2) whether the Data Set will be shared as read-only (Can View) or as read-write (Can edit). To complete the sharing process, a Data Owner in the target project has to click on the shared Data Set and then click on Accept.
4.13 Free-text Search
Option                     Description
Search from Landing Page   On the landing page, enter the search term in the search bar and press
                           return. Returns project names and Data Set names that match the
                           entered term.
Search from Project Page   From within the context of a project, enter the search term in the
                           search bar and press return. The search returns any files or directories
                           whose name or extended metadata matches the search term.
4.14 Jobs
The Jobs tab is the way to create and run YARN applications. HopsWorks supports the following YARN
applications:
• Apache Spark,
• Apache Flink,
• MapReduce (MR),
• Adam (a bioinformatics data parallel framework),
• SAASFEE (HiWAY/Cuneiform) (a bioinformatics data parallel framework).
Option    Description
New Job   Create a Job for any of the following YARN frameworks by clicking New Job:
          Spark/MR/Flink/Adam/Cuneiform.
          • Step 1: enter job-specific parameters.
          • Step 2: enter YARN parameters.
          • Step 3: click on Create Job.
Run Job   After a job has been created, it can be run by clicking on its Run button.
The logs for jobs are viewable in HopsWorks as stdout and stderr files. These output files are also stored in the Logs/<app-framework>/<log-files> directories. After a job has been created, it can be edited, deleted, and scheduled by clicking on the More actions button.
4.15 Charon
Charon is a cloud-of-clouds filesystem that enables the sharing of data between Hops clusters using public clouds. To share data with a target cluster, you need to:
• acquire the cluster-id of the target cluster and enter it as a cluster-id in the Charon service UI (you can read your own cluster-id at the top of the page for the Charon service);
• enter a token-id that is used as a secret key between the source and target clusters;
• select a folder to share with the target cluster-id;
• copy the files from HDFS that you wish to share with the target cluster into the shared folder;
• the files within that folder are copied to the public cloud(s), from where they are downloaded to the target cluster.
4.16 Apache Zeppelin
Apache Zeppelin is an interactive notebook web application for running Spark or Flink code on Hops YARN. You can turn the interpreters for Spark/Flink/etc. on and off in the Zeppelin tab; turning an interpreter on reduces the time required to execute a Note (paragraph) in Zeppelin, while turning it off reclaims resources. More details can be found at:
https://zeppelin.incubator.apache.org/
4.17 Metadata Management
Metadata enables data curation, that is, ensuring that data is properly catalogued and accessible to appropriate users.
Metadata in HopsWorks is used primarily to discover and retrieve relevant Data Sets or files, by enabling users to attach arbitrary metadata to Data Sets, directories, or files in HopsWorks. Metadata is associated with an individual file, directory, or Data Set. This extended metadata is stored in the same database as the metadata for HopsFS, and foreign keys link the extended metadata with the target file/directory/Data Set, ensuring its integrity. Extended metadata is exported to Elasticsearch, from where it can be queried and the associated Data Set/Project/file/directory can be identified (and acted upon).
4.18 MetaData Designer
Within the context of a project, click on the Metadata Designer button in the left-hand panel. It will
bring up a metadata designer view that can be used to:
• Design a new Metadata Template
• Extend an existing Metadata Template
• Import/Export a Metadata Template
Within the Metadata Designer, you can define a Metadata template as one or more tables. Each table consists
of a number of typed columns. Supported column types are:
• string
• single-select selection box
• multi-select selection box
Columns can also have constraints defined on them. On a column, click on cog icon (configure), where you
can make the field:
• searchable: included in the Elastic Search index;
• required: when entering metadata, this column makes it mandatory for users to enter a value for this column.
4.19 MetaData Attachment and Entry
Within the context of a project, click on the Data Sets tab. From here, click on a Data Set. Inside the Data
Set, if you select any file or directory, the rightmost panel will display any extended metadata associated
with the file or directory. If no extended metadata is associated with the file/directory, you will see "No
metadata template attached” in the rightmost panel. You can attach an existing metadata template to the file
or directory by right-clicking on it, and selecting Add metadata template. The metadata can then be
selected from the set of available templates (designed or uploaded).
After one or more metadata templates have been attached to the file/directory, if the file is selected, the
metadata templates are now visible in the rightmost panel. The metadata can be edited in place by clicking
on the + icon beside the metadata attribute. More than one extended metadata value can be added for each
attribute, if the attribute is a string attribute.
Metadata values can also be removed, and metadata templates can be removed from files/directories using
the Data Set service.
CHAPTER
FIVE
HOPSFS USER GUIDE
HopsFS consists of the following types of nodes: NameNodes, DataNodes, and clients. All the configuration parameters are defined in the core-site.xml and hdfs-site.xml files.
Currently Hops only supports the non-secure mode of operation. As Hops is a fork of the Hadoop code base, most of the Hadoop configuration parameters and features are supported in Hops. In the following sections we highlight the differences between HDFS and HopsFS and point out new configuration parameters, as well as the parameters that are not supported due to the different metadata management scheme.
5.1 Unsupported HDFS Features
HopsFS is a drop-in replacement for HDFS and it supports most of the configuration parameters [1] defined for Apache HDFS. As the architecture of HopsFS is fundamentally different from that of HDFS, some features, such as journaling and the secondary NameNode, are not required in HopsFS. The following is the list of HDFS features and configurations that are not applicable in HopsFS:
• Secondary NameNode The secondary NameNode is no longer supported. HopsFS supports multiple active NameNodes. Thus hdfs haadmin * command; and dfs.namenode.secondary.* and
dfs.ha.* configuration parameters are not supported in HopsFS.
• Checkpoint Node and FSImage HopsFS does not require checkpoint node as all the metadata
is stored in NDB. Thus hdfs dfsadmin -{saveNamespace | metaSave | restoreFailedStorage | rollEdits | fetchImage} command; and dfs.namenode.name.dir.*, dfs.image.*,
dfs.namenode.checkpoint.* configuration parameters are not supported in HopsFS.
• Quorum Based Journaling and EditLog The write ahead log (EditLog) is not needed as
all the metadata mutations are stored in NDB. Thus dfs.namenode.num.extra.edits.*,
dfs.journalnode.* and dfs.namenode.edits.* configuration parameters are not supported in
HopsFS.
• NameNode Federation and ViewFS In HDFS the namespace is statically partitioned among multiple NameNodes to support a large namespace. In essence, these are independent HDFS clusters where ViewFS provides a unified view of the namespace. HDFS Federation and ViewFS are no longer supported, as the namespace in HopsFS scales to billions of files and directories. Thus the dfs.nameservices.* configuration parameters are not supported in HopsFS.
[1] http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
• ZooKeeper ZooKeeper is no longer required as the coordination and membership service. A coordination and membership management service is implemented using the transactional shared
memory (NDB).
As HopsFS is under heavy development some features such as rolling upgrades and snapshots are not yet
supported. These features will be activated in future releases.
5.2 NameNodes
HopsFS supports multiple NameNodes. A NameNode is configured as if it were the only NameNode in the system. Using the database, a NameNode discovers all the existing NameNodes in the system. One of the NameNodes is declared the leader for housekeeping and maintenance operations. All the NameNodes in HopsFS are active. Secondary NameNode and Checkpoint Node configurations are not supported. See section (page 36) for a detailed list of configuration parameters and features that are no longer supported in HopsFS.
For each NameNode, define the fs.defaultFS configuration parameter in the core-site.xml file. In order to load the NDB driver, set the dfs.storage.driver.* parameters in the hdfs-site.xml file. These parameters are defined in detail here (page 48).
A detailed description of all the new configuration parameters for leader election, NameNode caches, distributed transaction handling, quota management, id generation and client configurations are defined here
(page 44).
The NameNodes are started/stopped using the following commands (executed as HDFS superuser):
> $HADOOP_HOME/sbin/start-nn.sh
> $HADOOP_HOME/sbin/stop-nn.sh
The Apache HDFS commands for starting/stopping NameNodes can also be used:
> $HADOOP_HOME/sbin/hadoop-daemon.sh --script hdfs start namenode
> $HADOOP_HOME/sbin/hadoop-daemon.sh --script hdfs stop namenode
Configuring a HopsFS NameNode is very similar to configuring an HDFS NameNode. While configuring a single Hops NameNode, the configuration files are written as if it were the only NameNode in the system. The NameNode automatically detects the other NameNodes using NDB.
5.2.1 Formating the Filesystem
Running the format command on any NameNode truncates all the tables in the database and inserts default values in the tables. NDB performs the truncate operation atomically, which can fail or take a very long time to complete for very large tables. In such cases, first run the hdfs namenode -dropAndCreateDB command to drop and recreate the database schema, followed by the format command to insert default values in the database tables. In NDB, dropping and recreating a database is much quicker than truncating all the tables in the database.
5.2.2 NameNode Caches
In published Hadoop workloads, metadata accesses follow a heavy-tailed distribution where 3% of files account for 80% of accesses. This means that caching recently accessed metadata at the NameNodes can give a significant performance boost. Each NameNode has a local cache that stores INode objects for recently accessed files and directories. Usually, clients read/write files in the same sub-directory. Using the RANDOM_STICKY load-balancing policy to distribute filesystem operations among the NameNodes lowers the latency of filesystem operations, as most of the path components are already available in the NameNode cache. See HopsFS Clients (page 33) and Cache Configuration Parameters (page 44) for more details.
5.2.3 Adding/Removing NameNodes
As the NameNodes are stateless, any NameNode can be removed without affecting the state of the system. All ongoing operations that fail because a NameNode is stopped are automatically redirected by the clients to the remaining NameNodes in the system.
Similarly, the clients automatically discover newly started NameNodes. See the client configuration parameters (page 47) that determine how quickly a new NameNode starts receiving requests from existing clients.
5.3 DataNodes
The DataNodes periodically acquire an updated list of NameNodes in the system and establish a connection
(register) with the new NameNodes. Like clients, the DataNodes also uniformly distribute the filesystem
operations among all the NameNodes in the system. Currently the DataNodes only support the round-robin policy to distribute the filesystem operations.
HopsFS DataNode configuration is identical to HDFS DataNode configuration. In HopsFS a DataNode connects to all the NameNodes. Make sure that the fs.defaultFS parameter points to a valid NameNode in the system. The DataNode will connect to that NameNode, obtain a list of all the active NameNodes in the system, and then connect to and register with all of them.
The DataNodes can be started/stopped using the following commands (executed as HDFS superuser):
> $HADOOP_HOME/sbin/start-dn.sh
> $HADOOP_HOME/sbin/stop-dn.sh
The Apache HDFS commands for starting/stopping DataNodes can also be used:
> $HADOOP_HOME/sbin/hadoop-daemon.sh --script hdfs start datanode
> $HADOOP_HOME/sbin/hadoop-daemon.sh --script hdfs stop datanode
5.4 HopsFS Clients
For load balancing, the clients uniformly distribute the filesystem operations among all the NameNodes in the system. HopsFS clients support RANDOM, ROUND_ROBIN, and RANDOM_STICKY policies to distribute the filesystem operations among the NameNodes. The random and round-robin policies are self-explanatory. With the sticky policy, the filesystem client randomly picks a NameNode and forwards all subsequent operations to the same NameNode; if that NameNode fails, the client randomly picks another NameNode. This maximizes NameNode cache hits.
In HDFS the client connects to the fs.defaultFS NameNode. In HopsFS, clients obtain the list of active NameNodes from the NameNode defined by the fs.defaultFS parameter and then uniformly distribute the subsequent filesystem operations among that list of NameNodes.
In core-site.xml we have introduced a new parameter, dfs.namenodes.rpc.addresses, that holds the RPC addresses of all the NameNodes in the system. If the NameNode pointed to by fs.defaultFS is dead, then the client tries to connect to a NameNode defined by dfs.namenodes.rpc.addresses. As long as the addresses defined by the two parameters contain at least one valid NameNode address, the client is able to communicate with HopsFS. A detailed description of all the new client configuration parameters is given here (page 47).
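As an illustration only, the core-site.xml of a client could contain entries like the following; the host names and ports are placeholders, and the exact list syntax is described in the client configuration section (page 47).
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://bbc1.example.com:8020</value> <!-- placeholder default NameNode -->
</property>
<property>
  <name>dfs.namenodes.rpc.addresses</name>
  <value>hdfs://bbc1.example.com:8020, hdfs://bbc2.example.com:8020</value> <!-- placeholder list of NameNode RPC addresses -->
</property>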
5.5 Compatibility with HDFS Clients
HopsFS is fully compatible with HDFS clients, although they do not distribute operations over NameNodes,
as they assume there is a single active NameNode.
5.6 HopsFS Async Quota Management
In HopsFS the commands and the APIs for quota management are identical to those of HDFS. In HDFS all quota management operations are performed synchronously, while in HopsFS quota management is performed asynchronously for performance reasons. In the following example the maximum namespace quota for /QDir is set to 10. When a new sub-directory or file is created in this folder, the quota update information propagates up the filesystem tree until it reaches /QDir. Each quota update propagation operation is implemented as an independent transaction.
Fig. 5.1: HopsFS Quota Update
For write-heavy workloads, a user might be able to consume more disk space/namespace than they are allowed before the filesystem recognizes that the quota limits have been violated. After the quota updates are applied, the filesystem will not allow the user to further violate the quota limits. In most existing Hadoop clusters, write operations are a small fraction of the workload. Additionally, considering the size of the filesystem, we think this is a small trade-off for improving the throughput of read operations, which typically comprise 90-95% of a typical filesystem workload.
In HopsFS asynchronous quota updates are highly optimized; we batch the quota updates wherever possible. In the linked section (page 45) there is a complete list of parameters that determine how aggressively asynchronous quota updates are applied.
5.7 Block Reporting
DataNodes periodically synchronize the set of blocks stored locally with the metadata representing those
blocks using a block report. Block reports are sent from DataNodes to NameNodes to indicate the set of
valid blocks at a DataNode, and the NameNode compares the sent list with its metadata. For block report load balancing, the DataNodes ask the leader NameNode which NameNode they should send the block report to. The leader NameNode uses a round-robin policy to distribute block reports among the NameNodes.
In order to avoid a sudden influx of a large number of block reports, which can slow down other filesystem operations, the leader NameNode also performs admission control for block reports: it only allows a configurable number of block reports to be processed at a given time. In the linked section (page 46) there is a complete list of parameters for block report admission control.
CHAPTER SIX: HOPS-YARN USER GUIDE
From a user's perspective, Hops-YARN is very similar to Apache Hadoop YARN. The goal of this section is to present what changes. We first present the major features of Apache Hadoop YARN that have been removed or replaced in Hops-YARN. We then present how the different parts of the YARN system (ResourceManager, NodeManager, Client) should be configured and used in Hops-YARN.
6.1 Removed/Replaced YARN Features
Hops-YARN is a drop-in replacement for Apache Hadoop YARN and it supports most of the configuration parameters defined for Apache Hadoop YARN. As we have completely rewritten the failover mechanism, some recovery options are not required in Hops-YARN. The following is the list of YARN configurations that are not applicable in Hops-YARN.
• ZooKeeper ZooKeeper is no longer required as the coordination and membership service is implemented using the transactional shared memory (NDB). As a result the following options are not supported in Hops-YARN: yarn.resourcemanager.zk-address, yarn.resourcemanager.zk-num-retries, yarn.resourcemanager.zk-retry-interval-ms, yarn.resourcemanager.zk-state-store.parent-path, yarn.resourcemanager.zk-timeout-ms, yarn.resourcemanager.zk-acl, yarn.resourcemanager.zk-state-store.root-node.acl, yarn.resourcemanager.ha.automatic-failover.zk-base-path.
• StateStore Hops-YARN is entirely designed to store its state in the transactional shared memory (NDB). As a result, NDBRMStateStore is the only state store that is still supported. It follows that options specific to other state stores are not supported in Hops-YARN: yarn.resourcemanager.fs.state-store.uri, yarn.resourcemanager.fs.state-store.retry-policy-spec.
• Administration commands Two administration commands are now obsolete: transitionToActive and transitionToStandby. The selection of the active ResourceManager is now completely automated and managed by the group membership service; as a result, transitionToActive is no longer supported. transitionToStandby does not present any interesting use case in Hops-YARN: if one wants to remove a ResourceManager from the system, they can simply stop it and the automatic failover will make sure that a new ResourceManager transparently replaces it. Moreover, as the transition to active is automated, it is possible that the leader election elects the ResourceManager that was just transitioned to standby and makes it the "new" active ResourceManager.
As Hops-YARN is still at an early stage of its development, some features are not supported yet. The main unsupported features are fail-over when running in distributed mode and the fair scheduler.
6.2 ResourceManager
Even though Hops-YARN allows the ResourceManager to be distributed, with the scheduling running on one node (the Scheduler) and the resource tracking running on several other nodes (the ResourceTrackers), the configuration of the ResourceManager is similar to the configuration of Apache Hadoop YARN. When running in distributed mode, all the nodes participating in resource management should be configured as a ResourceManager would be configured. They will then automatically detect each other and elect a leader to be the Scheduler.
6.2.1 Adding/Removing a ResourceManager
As the ResourceManagers automatically detect each other through NDB, adding a new ResourceManager simply consists of configuring and starting a new node as was done for the first ResourceManager. Removing a ResourceManager is not yet supported in distributed mode. In non-distributed mode, stopping the ResourceManager is enough to remove it. If the stopped ResourceManager was in standby, nothing will happen. If the stopped ResourceManager was the active ResourceManager, the failover will automatically be triggered and a new ResourceManager will take the active role.
6.3 YARN Clients
Hops-YARN is fully compatible with Apache Hadoop YARN clients. As in Apache Hadoop YARN, they have to be configured with the list of all possible schedulers in order to find the leader and start communicating with it.
When running the Hops-YARN client, it is possible to configure it to use the ConfiguredLeaderFailoverHAProxyProvider as the yarn.client.failover-proxy-provider. This allows the client to find the leader faster than going through all the possible leaders present in the configuration file. It also allows the client to find the leader even if it is not present in the client configuration file, as long as one of the ResourceManagers present in the client configuration file is alive.
6.4 YARN NodeManager
In non-distributed mode, the NodeManagers should be configured to use ConfiguredLeaderFailoverHAProxyProvider as the failover proxy provider. This allows them to automatically find the leading ResourceManager and connect to it.
In distributed mode, the NodeManagers should be configured to use ConfiguredLeastLoadedRMFailoverHAProxyProvider as the failover proxy provider. This allows them to automatically find the ResourceTracker that is least loaded and connect to it.
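As a hedged sketch only: the document gives the provider class names but not their fully qualified package names, so the value below must be replaced with the fully qualified class name shipped with your Hops-YARN distribution. The yarn.client.failover-proxy-provider property is described further in the proxy provider section of the Hops-YARN configuration chapter.
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <!-- Non-distributed mode: ConfiguredLeaderFailoverHAProxyProvider.
       Distributed mode: ConfiguredLeastLoadedRMFailoverHAProxyProvider.
       Replace with the fully qualified class name from your Hops-YARN build. -->
  <value>ConfiguredLeaderFailoverHAProxyProvider</value>
</property>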
CHAPTER SEVEN: HOPSWORKS ADMINISTRATOR GUIDE
HopsWorks has an administrator application that allows you, the administrator, to perform management actions and to monitor and control HopsWorks and Hops.
7.1 Activating users
You, the administrator, have to approve each new user account before the user is able to log in to HopsWorks. When you approve the account, you have to assign the user one of the following roles:
• user
• administrator
Users that are assigned the administrator role are granted privileges to log in to the administrator application and to control users and the system. Be careful about which users are assigned the administrator role. The vast majority of users will be assigned the user role.
Fig. 7.1: Approve User Accounts so that Users are able to Login
7.2 User fails to receive an email to validate her account
• Does your organization have a firewall that blocks outbound SMTP access?
• Log in to the Glassfish web server and check the JavaMail settings. The JNDI name should be mail/BBCMail. Is the Gmail username/password correct? Are the SMTP server settings correct (IP address or hostname, port, protocol (SSL, TLS))?
7.3 User receives email, but fails to validate the account
• Can you successfully access the HopsWorks homepage?
• Is the Glassfish web server running and the hopsworks.war application installed?
• Is MySQL Cluster running?
7.4 Configuring email for HopsWorks
Log in to Glassfish (see Glassfish Administration (page 42)) and update the JavaMail settings to set the email account, password, SMTP server IP and port, and whether SSL/TLS is used.
7.5 User successfully validates the account, but still can't log in
Go to the User Administration view. From here, select the user whose account will be enabled, and update
the user’s account status to validated.
7.6 User account has been disabled due to too many unsuccessful
login attempts
Go to the User Administration view. From here, select the user whose account will be re-enabled, and
update the user’s account status to validated.
7.7 Disabling a user account
Go to the User Administration view. From here, select the user whose account will be disabled, and update
the user’s account status to disabled.
7.8 Re-activating a user account
In the user administration view, you can select the action that changes the user status to
activated.
7.9 Managing Project Quotas
Each project is by default allocated a number of CPU hours in HopsYARN and an amount of available disk
storage space in HopsFS:
• HopsYARN Quota
• HopsFS Quota
We recommend that you override the default values for the Quota during the installation process, by overriding the Chef attributes:
• hopsworks/yarn_default_quota
• hopsworks/hdfs_default_quota
In the Projects view, for any given project, the administrator can update the remaining HopsYARN quota (in CPU hours) and the amount of disk space allocated in HopsFS for the project.
Fig. 7.2: Project Administration: update quotas, disable/enable projects.
7.10 Disabling/Re-enabling Projects
In the Projects view, any given project can be disabled (and subsequently re-enabled). Disabling a project will prevent members of the project from accessing data in the project, running jobs stored in the project, or accessing the project at all.
7.11 Yubikeys in HopsWorks
Yubikeys can be used as a second-factor authentication device, but a Yubikey needs to be programmed before it is given to a user. We recommend programming the Yubikey using Ubuntu's Yubikey OTP tool. From the Yubikey OTP tool, you will have to copy the Public Identity and Secret Key fields (from Yubikey OTP) to the corresponding fields in the HopsWorks Administration tool when you validate a user. That is, you should save the Public Identity and Secret Key fields for the Yubikey sticks, and when a user registers with one of those Yubikey sticks, you should then enter the Public Identity and Secret Key fields when approving the user's account.
$ sudo apt-get install yubikey-personalization-gui
$ yubikey-personalization-gui
Installing and starting Yubikey OTP tool in Ubuntu.
Fig. 7.3: Registering YubiKey sticks using Yubikey OTP tool.
7.11.1 Glassfish Administration
If you didn't supply your own username/password for Glassfish administration during installation, you can log in with the default username and password for Glassfish:
https://<hostname>:4848
username: adminuser
password: adminpw
Refer to the Glassfish documentation for more information on configuring Glassfish.
Fig. 7.4: Registering YubiKey sticks using Yubikey OTP tool.
Fig. 7.5: Copy the Public Identity and Secret Key fields from the Yubikey OTP tool and enter them into the corresponding fields in the HopsWorks Administration UI when you validate a user.
CHAPTER EIGHT: HOPSFS CONFIGURATION
This section contains new/modified configuration parameters for HopsFS. All the configuration parameters
are defined in hdfs-site.xml and core-site.xml files.
8.1 Leader Election
The leader election service is used by HopsFS and Hops-YARN. The configuration parameters for the leader election service are defined in the core-site.xml file; a configuration sketch follows the parameter list below.
• dfs.leader.check.interval: The length of the time period in milliseconds after which the NameNodes run the leader election protocol. One of the active NameNodes is chosen as a leader to perform housekeeping operations. All NameNodes periodically update a counter in the database to mark that they are active, and all NameNodes periodically check for changes in the membership of the NameNodes. By default the time period is set to one second. Increasing the time interval leads to slower failure detection.
• dfs.leader.missed.hb: This property specifies when a NameNode is declared dead. By default a
NameNode is declared dead if it misses two consecutive heartbeats. Higher values of this property
would lead to slower failure detection. The minimum supported value is 2.
• dfs.leader.tp.increment: HopsFS uses an eventual leader election algorithm where the heartbeat time
period (dfs.leader.check.interval) is automatically incremented if it detects that the NameNodes
are falsely declared dead due to missed heartbeats caused by network/database/CPU overload.
By default the heartbeat time period is incremented by 100 milliseconds, however it can be
overridden using this parameter.
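The following core-site.xml sketch simply restates the default values described above; it is illustrative, and it assumes the values are given as plain numbers without time-unit suffixes.
<property>
  <name>dfs.leader.check.interval</name>
  <value>1000</value> <!-- run the leader election protocol every second -->
</property>
<property>
  <name>dfs.leader.missed.hb</name>
  <value>2</value> <!-- declare a NameNode dead after two missed heartbeats -->
</property>
<property>
  <name>dfs.leader.tp.increment</name>
  <value>100</value> <!-- increment (ms) of the heartbeat period on falsely detected failures -->
</property>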
8.2 NameNode Cache
The NameNode cache configuration parameters are defined in the hdfs-site.xml file; a configuration sketch is given at the end of this section. The parameters are:
• dfs.resolvingcache.enabled: (true/false) Enables/disables the cache for the NameNode.
• dfs.resolvingcache.type: Each NameNode caches the inodes metadata in a local cache for quick path
resolution. We support different implementations for the cache, i.e., INodeMemcache, PathMemcache, OptimalMemcache and InMemory.
– INodeMemcache: stores individual inodes in Memcached.
– PathMemcache: a coarse-grained cache where the entire file path (the key), along with its associated inode objects, is stored in Memcached.
– OptimalMemcache: combines INodeMemcache and PathMemcache.
– InMemory: the same as INodeMemcache, but instead of using Memcached it uses an in-memory LRU ConcurrentLinkedHashMap. We recommend the InMemory cache as it yields higher throughput.
For INodeMemcache/PathMemcache/OptimalMemcache the following configuration parameters must be set.
• dfs.resolvingcache.memcached.server.address: Memcached server address.
• dfs.resolvingcache.memcached.connectionpool.size: Number of connections to the memcached
server.
• dfs.resolvingcache.memcached.key.expiry: It determines when the memcached entries expire. The
default value is 0, that is, the entries never expire. Whenever the NameNode encounters an entry that
is no longer valid, it updates it.
The InMemory cache-specific configuration is:
• dfs.resolvingcache.inmemory.maxsize: The maximum number of entries that can be stored in the cache before the LRU eviction algorithm kicks in.
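A minimal hdfs-site.xml sketch that enables the recommended InMemory cache; the maxsize value below is an arbitrary illustration, not a documented default.
<property>
  <name>dfs.resolvingcache.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.resolvingcache.type</name>
  <value>InMemory</value>
</property>
<property>
  <name>dfs.resolvingcache.inmemory.maxsize</name>
  <value>100000</value> <!-- example only: max entries before LRU eviction -->
</property>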
8.3 Distributed Transaction Hints
In HopsFS the metadata is partitioned using the inodes' ids. HopsFS tries to enlist the transactional filesystem operation on the database node that holds the metadata for the file/directory being manipulated by the operation. The distributed transaction hints configuration parameters are defined in the hdfs-site.xml file.
• dfs.ndb.setpartitionkey.enabled: (true/false) Enable/Disable transaction partition key hint.
• dfs.ndb.setrandompartitionkey.enabled: (true/false) Enable/Disable random partition key hint
when HopsFS fails to determine appropriate partition key for the transactional filesystem operation.
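For illustration, enabling both hints in hdfs-site.xml looks like the following sketch (both parameters are boolean flags, as described above).
<property>
  <name>dfs.ndb.setpartitionkey.enabled</name>
  <value>true</value> <!-- enable transaction partition key hints -->
</property>
<property>
  <name>dfs.ndb.setrandompartitionkey.enabled</name>
  <value>true</value> <!-- fall back to a random partition key when no key can be determined -->
</property>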
8.4 Quota Management
In order to boost the performance and increase the parallelism of metadata operations, quota updates are applied asynchronously, i.e., disk and namespace usage statistics are updated in the background. With an asynchronous quota system it is possible that some users over-consume namespace/disk space before the background quota system throws an exception. The following parameters control how aggressively the quota subsystem updates the quota statistics; they are defined in the hdfs-site.xml file, and a configuration sketch follows the list.
• dfs.quota.enabled: Enables/disables quota. By default quota is enabled.
• dfs.namenode.quota.update.interval: The quota update manager applies the outstanding quota updates after every dfs.namenode.quota.update.interval milliseconds.
• dfs.namenode.quota.update.limit: The maximum number of outstanding quota updates that are applied in each round.
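A sketch of the corresponding hdfs-site.xml entries; the interval and limit values below are illustrative examples, not documented defaults.
<property>
  <name>dfs.quota.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.quota.update.interval</name>
  <value>1000</value> <!-- example: apply outstanding quota updates every second -->
</property>
<property>
  <name>dfs.namenode.quota.update.limit</name>
  <value>100000</value> <!-- example: max outstanding quota updates applied per round -->
</property>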
8.5 Block Reporting
• dfs.block.report.load.balancing.max.blks.per.time.window: This is a global configuration parameter. The leader NameNode only allows a certain number of block reports, such that the maximum number of blocks processed by the block reporting sub-system of HopsFS does not exceed dfs.block.report.load.balancing.max.blks.per.time.window in a given block report processing time window.
• dfs.block.report.load.balancing.time.window.size: This parameter determines the block report processing time window size. It is defined in milliseconds. If dfs.block.report.load.balancing.max.blks.per.time.window is set to one million and dfs.block.report.load.balancing.time.window.size is set to one minute, then the leader NameNode will ensure that every minute at most 1 million blocks are accepted for processing by the admission control system of the filesystem.
• dfs.blk.report.load.balancing.update.threashold.time: The parameter dfs.block.report.load.balancing.max.blks.per.time.window can be changed using the command hdfs namenode -setBlkRptProcessSize noOfBlks. The parameter is stored in the database, and the NameNodes periodically read the new value from the database. dfs.blk.report.load.balancing.update.threashold.time determines how frequently a NameNode checks for changes to this parameter. The default is set to 60*1000 milliseconds.
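As an illustration, the hdfs-site.xml sketch below encodes the one-million-blocks-per-minute example from the text; it assumes both time values are given in plain milliseconds.
<property>
  <name>dfs.block.report.load.balancing.max.blks.per.time.window</name>
  <value>1000000</value> <!-- at most one million blocks per window -->
</property>
<property>
  <name>dfs.block.report.load.balancing.time.window.size</name>
  <value>60000</value> <!-- one-minute window, in milliseconds -->
</property>
<property>
  <name>dfs.blk.report.load.balancing.update.threashold.time</name>
  <value>60000</value> <!-- default: re-read the limit from the database every 60*1000 ms -->
</property>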
8.6 Distributed Unique ID generator
The ClusterJ API does not provide any means to auto-generate primary keys; unique key generation is left to the application. Each NameNode has an ID generation daemon that keeps pools of pre-allocated IDs and tracks IDs for inodes, blocks and quota entities. The distributed unique ID generator configuration parameters are defined in hdfs-site.xml.
• dfs.namenode.quota.update.id.batchsize, dfs.namenode.inodeid.batchsize, dfs.namenode.blockid.batchsize: When the ID generator is about to run out of IDs, it prefetches a batch of new IDs. These parameters define the prefetch batch sizes for quota updates, inodes and blocks, respectively.
• dfs.namenode.quota.update.updateThreshold, dfs.namenode.inodeid.updateThreshold, dfs.namenode.blockid.updateThreshold: These parameters define when the ID generator should prefetch a new batch of IDs. Their values are defined as fractions, i.e., 0.5 means prefetch a new batch of IDs when 50 percent of the current batch has been consumed by the NameNode.
• dfs.namenode.id.updateThreshold: It defines how often the IDs Monitor should check if the ID
pools are running low on pre-allocated IDs.
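For illustration only, an hdfs-site.xml sketch for the inode ID pool; the batch size is an arbitrary example and the threshold follows the 0.5 example given above. The block and quota ID parameters follow the same pattern.
<property>
  <name>dfs.namenode.inodeid.batchsize</name>
  <value>1000</value> <!-- example: prefetch 1000 inode ids at a time -->
</property>
<property>
  <name>dfs.namenode.inodeid.updateThreshold</name>
  <value>0.5</value> <!-- prefetch a new batch when half of the current batch is consumed -->
</property>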
8.7 Namespace and Block Pool ID
• dfs.block.pool.id and dfs.name.space.id: Due to the shared state among the NameNodes, HopsFS only supports a single namespace and a single block pool. The default namespace and block pool IDs can be overridden using these parameters.
8.8 Client Configurations
All the client configuration parameters are defined in the core-site.xml file.
• dfs.namenodes.rpc.addresses: HopsFS supports multiple active NameNodes. A client can send an RPC request to any of the active NameNodes. This parameter specifies a list of active NameNodes in the system. The list has the following format: [hdfs://ip:port, hdfs://ip:port, ...]. It is not necessary that this list contains all the active NameNodes in the system; a single valid reference to an active NameNode is sufficient. At startup the client obtains an updated list of NameNodes from a NameNode mentioned in the list. If this list is empty then the client tries to connect to fs.default.name.
• dfs.namenode.selector-policy: The clients uniformly distribute the RPC calls among all the NameNodes in the system based on one of the following policies: ROUND_ROBIN, RANDOM, or RANDOM_STICKY. By default the NameNode selection policy is set to RANDOM_STICKY.
• dfs.clinet.max.retires.on.failure: The client retries the RPC if it fails due to the failure of a NameNode. This configuration parameter specifies how many times the client retries the RPC before throwing an exception. This property is directly related to the number of expected simultaneous NameNode failures; set this value to 1 in the case of low failure rates, such as one dead NameNode at any given time. It is recommended that this property be set to a value >= 1.
• dfs.client.max.random.wait.on.retry: An RPC can fail because of many factors, such as NameNode failure, network congestion, etc. Changes in the membership of the NameNodes can lead to contention on the remaining NameNodes. In order to avoid such contention, the client waits for a random time in [0, MAX_VALUE] ms before retrying the RPC. This property specifies MAX_VALUE; by default it is set to 1000 ms.
• dfs.client.refresh.namenode.list: All clients periodically refresh their view of active NameNodes
in the system. By default after every minute the client checks for changes in the membership of
the NameNodes. Higher values can be chosen for scenarios where the membership does not change
frequently.
8.9 Data Access Layer (DAL)
Using the DAL layer, HopsFS's metadata can be stored in different databases. HopsFS provides a driver to store the metadata in MySQL Cluster Network Database (NDB).
8.9.1 MySQL Cluster Network Database Driver Configuration
Database-specific parameters are stored in a .properties file. The configuration file contains the following parameters.
• com.mysql.clusterj.connectstring: Address of management server of MySQL NDB Cluster.
• com.mysql.clusterj.database: Name of the database schema that contains the metadata tables.
• com.mysql.clusterj.connection.pool.size: This is the number of connections that are created in the
ClusterJ connection pool. If it is set to 1 then all the sessions share the same connection; all requests
for a SessionFactory with the same connect string and database will share a single SessionFactory. A
setting of 0 disables pooling; each request for a SessionFactory will receive its own unique SessionFactory.
• com.mysql.clusterj.max.transactions: The maximum number of transactions that can be executed simultaneously using the ClusterJ client. The maximum supported value is 1024.
• io.hops.metadata.ndb.mysqlserver.host: Address of the MySQL Server. For higher performance we use the MySQL Server to perform aggregate queries on the filesystem metadata.
• io.hops.metadata.ndb.mysqlserver.port: If not specified, then the default value of 3306 will be used.
• io.hops.metadata.ndb.mysqlserver.username: A valid user name to access the MySQL Server.
• io.hops.metadata.ndb.mysqlserver.password: The MySQL Server user password.
• io.hops.metadata.ndb.mysqlserver.connection pool size: The number of NDB connections used by the MySQL Server. The default is set to 10.
• Database Sessions Pool: For performance reasons the data access layer maintains a pool of preallocated ClusterJ session objects. The following parameters are used to control the behavior of the session pool.
– io.hops.session.pool.size: Defines the size of the session pool. The pool should be at least as big as the number of active transactions in the system, which can be calculated as (dfs.datanode.handler.count + dfs.namenode.handler.count + dfs.namenode.subtree-executor-limit).
– io.hops.session.reuse.count: A session is used N times and then garbage collected. Note: due to improved memory management in ClusterJ >= 7.4.7, N can be set to higher values, e.g., Integer.MAX_VALUE, for the latest ClusterJ libraries.
8.9.2 Loading a DAL Driver
In order to load a DAL driver, the following configuration parameters are added to the hdfs-site.xml file; a sketch with placeholder values is given below.
• dfs.storage.driver.jarFile: The path of the driver jar file, if the driver's jar file is not included in the class path.
• dfs.storage.driver.class: The main class that initializes the driver.
• dfs.storage.driver.configfile: The path to a file that contains the configuration parameters for the driver jar file. The path is supplied to the dfs.storage.driver.class as an argument during initialization. See the Hops NDB driver configuration parameters (page 47).
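The sketch below is illustrative only: the jar path, class name and properties file path are placeholders and must be replaced with the values shipped with your hops-metadata-dal-impl-ndb build.
<property>
  <name>dfs.storage.driver.jarFile</name>
  <value>/srv/hops/ndb-dal.jar</value> <!-- placeholder: path to the NDB DAL driver jar -->
</property>
<property>
  <name>dfs.storage.driver.class</name>
  <value>com.example.NdbDriverMainClass</value> <!-- placeholder: use the main class documented for your driver -->
</property>
<property>
  <name>dfs.storage.driver.configfile</name>
  <value>/srv/hops/ndb-config.properties</value> <!-- placeholder: the driver's .properties file (page 47) -->
</property>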
8.10 HopsFS-EC Configuration
The erasure coding API is flexibly configurable and hence comes with some new configuration options, which are shown here. All configuration options can be set by creating an erasure-coding-site.xml file in the Hops configuration folder. Note that Hops comes with reasonable defaults for all of these options; however, erasure coding needs to be enabled manually.
• dfs.erasure_coding.enabled: (true/false) Enable/Disable erasure coding.
• dfs.erasure_coding.codecs.json: The list of available erasure coding codecs. This value is a JSON field, e.g.:
<value>
[
{
"id" : "xor",
"parity_dir" : "/raid",
"stripe_length" : 10,
"parity_length" : 1,
"priority" : 100,
"erasure_code" : "io.hops.erasure_coding.XORCode",
"description" : "XOR code"
},
{
"id" : "rs",
"parity_dir" : "/raidrs",
"stripe_length" : 10,
"parity_length" : 4,
"priority" : 300,
"erasure_code" : "io.hops.erasure_coding.ReedSolomonCode",
"description" : "ReedSolomonCode code"
},
{
"id" : "src",
"parity_dir" : "/raidsrc",
"stripe_length" : 10,
"parity_length" : 6,
"parity_length_src" : 2,
"erasure_code" : "io.hops.erasure_coding.SimpleRegeneratingCode",
"priority" : 200,
"description" : "SimpleRegeneratingCode code"
}
]
</value>
• dfs.erasure_coding.parity_folder: The HDFS folder to store parity information in. Default value is
/parity
• dfs.erasure_coding.recheck_interval: How frequently the system should schedule encoding or repairs and check their state. The default value is 300000 ms.
• dfs.erasure_coding.repair_delay: How long the system should wait before scheduling a repair. Default is 1800000 ms.
• dfs.erasure_coding.parity_repair_delay: How long the system should wait before scheduling a parity repair. Default is 1800000 ms.
• dfs.erasure_coding.active_encoding_limit: Maximum number of active encoding jobs. Default is
10.
• dfs.erasure_coding.active_repair_limit: Maximum number of active repair jobs. Default is 10.
• dfs.erasure_coding.active_parity_repair_limit: Maximum number of active parity repair jobs. Default is 10.
• dfs.erasure_coding.deletion_limit: The number of delete operations to be handled during one round. Default is 100.
• dfs.erasure_coding.encoding_manager: Implementation of the EncodingManager to be used. Default is io.hops.erasure_coding.MapReduceEncodingManager.
• dfs.erasure_coding.block_rapair_manager: Implementation of the repair manager to be used. Default is io.hops.erasure_coding.MapReduceBlockRepairManager.
CHAPTER NINE: HOPS-YARN CONFIGURATION
Hops-YARN configuration is very similar to the Apache Hadoop YARN configuration. A few additional configuration parameters are needed to configure the new services provided by Hops-YARN. This section presents the new/modified configuration parameters for Hops-YARN. All the new configuration parameters should be entered in yarn-site.xml.
9.1 Configuring Hops-YARN fail-over
• yarn.resourcemanager.scheduler.port: The port used by the scheduler service (the port still needs to be specified in yarn.resourcemanager.scheduler.address).
• yarn.resourcemanager.resource-tracker.port: The port used by the resource-tracker service (the port still needs to be specified in yarn.resourcemanager.resource-tracker.address).
• yarn.resourcemanager.admin.port: The port used by the admin service (the port still needs to be specified in yarn.resourcemanager.admin.address).
• yarn.resourcemanager.port: The port used by the ResourceManager service (the port still needs to be specified in yarn.resourcemanager.address).
• yarn.resourcemanager.groupMembership.address: The address of the group membership service. The group membership service is used by the clients and NodeManagers to obtain the list of alive ResourceManagers.
• yarn.resourcemanager.groupMembership.port: The port used by the group membership service (the port still needs to be specified in yarn.resourcemanager.groupMembership.address).
• yarn.resourcemanager.ha.rm-ids: Contains the list of ResourceManagers. It is used to establish the first connection to the group membership service.
• yarn.resourcemanager.store.class: Should be set to org.apache.hadoop.yarn.server.resourcemanager.recovery.NDBRMStateStore.
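A yarn-site.xml sketch for the fail-over configuration; the ResourceManager ids, host name and port are placeholders, and the exact address format expected by the group membership service should be checked against your deployment.
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value> <!-- placeholder ids of the known ResourceManagers -->
</property>
<property>
  <name>yarn.resourcemanager.groupMembership.address</name>
  <value>rm1.example.com:8200</value> <!-- placeholder host:port of the group membership service -->
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.NDBRMStateStore</value>
</property>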
9.2 Batch Processing of Operations
In Hops-YARN, RPCs that describe operations on the Application Master Interface, the Administrator Interface, and the Client Interface are received by the ResourceManager. RPCs for the Resource Tracker Interface are received by the ResourceTracker nodes.
For reasons of performance and consistency, the Hops-YARN ResourceManager processes incoming RPCs in batches. Hops-YARN first fills an adaptive processing buffer with a bounded-size batch of RPCs. If the batch has not been filled before a timer expires (hops.yarn.resourcemanager.batch.max.duration), the batch is processed immediately.
New RPCs are blocked until the accepted batch of RPCs has been processed. Once all the RPCs have been completely executed, the state of the ResourceManager is pushed to the database and the next RPCs are accepted. The size of the accepted batch of RPCs is limited by two factors: the number of RPCs and the time for which the batch has been accumulating. The first factor guarantees that the number of state changes in the database is limited and that the commit of the new state to the database won't take too long. The second factor guarantees that a new state is committed within a given time even if few RPCs are received.
• hops.yarn.resourcemanager.batch.max.size: The maximum number of RPCs in a batch.
• hops.yarn.resourcemanager.batch.max.duration: The maximum time to wait before processing a
batch of RPCs (default: 10 ms).
• hops.yarn.resourcemanager.max.allocated.containers.per.request: In very large clusters some applications may try to allocate tens of thousands of containers at once. Due to the RPC batching system, this can take a few seconds and block any other RPC from being handled during this time. In order to limit the impact of such big requests, it is possible to set this option to limit the number of containers an application gets at each request. This results in a suboptimal use of the cluster each time such an application starts.
9.2.1 Database back pressure
In order to exercise back pressure when the database is overloaded, we block the execution of new RPCs. We identify that the database is overloaded by looking at the length of the queue of operations waiting to be committed as well as the duration of individual commits. If the length of the queue or the duration of any individual commit becomes too long, we exercise back pressure on the RPCs.
• hops.yarn.resourcemanager.commit.and.queue.threshold: The upper bound on the length of the queue of operations waiting to be committed.
• hops.yarn.resourcemanager.commit.queue.max.length: The upper bound on the time each individual commit should take.
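A yarn-site.xml sketch for the batching parameters; apart from the 10 ms default for the batch duration, the values below are illustrative examples rather than documented defaults.
<property>
  <name>hops.yarn.resourcemanager.batch.max.size</name>
  <value>500</value> <!-- example: accept at most 500 RPCs per batch -->
</property>
<property>
  <name>hops.yarn.resourcemanager.batch.max.duration</name>
  <value>10</value> <!-- default: process the batch after at most 10 ms -->
</property>
<property>
  <name>hops.yarn.resourcemanager.max.allocated.containers.per.request</name>
  <value>1000</value> <!-- example cap on containers granted per request -->
</property>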
9.2.2 Proxy provider
• yarn.client.failover-proxy-provider: Two new proxy providers have been added alongside the existing ConfiguredRMFailoverProxyProvider.
• ConfiguredLeaderFailoverHAProxyProvider: This proxy provider has the same goal as the ConfiguredRMFailoverProxyProvider (connecting to the leading ResourceManager), but it uses the group membership service, whereas the ConfiguredRMFailoverProxyProvider goes through all the ResourceManagers present in the configuration file to find the leader. This allows the ConfiguredLeaderFailoverHAProxyProvider to be faster and to find the leader even if it is not present in the configuration file.
• ConfiguredLeastLoadedRMFailoverHAProxyProvider: This proxy provider establishes a connection with the ResourceTracker that currently has the lowest load (least loaded). It is to be used in distributed mode in order to balance the load coming from the NodeManagers across the ResourceTrackers.
9.3 Configuring Hops-YARN distributed mode
Hops-YARN distributed mode can be enabled by setting the following flags to true:
• hops.yarn.resourcemanager.distributed-rt.enable: Set to true to run the resource tracking in distributed mode.
• hops.yarn.resourcemanager.ndb-event-streaming.enable: Set to true to have the ResourceManager (scheduler) use the database streaming API to receive updates on the state of the NodeManagers; this improves performance.
• hops.yarn.resourcemanager.ndb-rt-event-streaming.enable: Set to true to have the ResourceTrackers use the database streaming API to receive updates on the state of the NodeManagers; this improves performance.
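An illustrative yarn-site.xml sketch that enables all three distributed-mode flags described above.
<property>
  <name>hops.yarn.resourcemanager.distributed-rt.enable</name>
  <value>true</value> <!-- run resource tracking in distributed mode -->
</property>
<property>
  <name>hops.yarn.resourcemanager.ndb-event-streaming.enable</name>
  <value>true</value> <!-- scheduler receives NodeManager state via the NDB streaming API -->
</property>
<property>
  <name>hops.yarn.resourcemanager.ndb-rt-event-streaming.enable</name>
  <value>true</value> <!-- ResourceTrackers receive NodeManager state via the NDB streaming API -->
</property>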
CHAPTER TEN: HOPS DEVELOPER GUIDE
10.1 Extending HopsFS INode metadata
For the implementation of new features, it is often necessary to modify the hdfs_inodes table or add
new tables in order to store extended metadata. With Hops-HDFS, this can be simply achieved by adding
a new table with a foreign key that refers to hdfs_inodes. Adding new tables has the benefit that the
original data structures do not need to be modified and old code paths not requiring the additional metadata
are not burdened with additional reading costs. This guide gives a walkthrough on how to add additional
INode-related metadata.
10.1.1 Example use case
Let’s assume we would like to store per user access times for each INode. To do this, we need to store the
id of the inode, the name of the user and the timestamp representing the most recent access.
10.1.2 Adding a table to the schema
First, we need to add a new table storing the metadata to our schema. Therefore, we go to the hops-metadata-dal-impl-ndb project and add the following to the schema/schema.sql file.
CREATE TABLE `hdfs_access_time_log` (
`inode_id` int(11) NOT NULL,
`user` varchar(32) NOT NULL,
`access_time` bigint(20) NOT NULL,
PRIMARY KEY (`inode_id` , `user`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1$$
Additionally we will make the table and column names available to the Java code by adding the following
to the io.hops.metadata.hdfs.TablesDef class in hops-metadata-dal.
public static interface AccessTimeLogTableDef {
public static final String TABLE_NAME = "hdfs_access_time_log";
public static final String INODE_ID = "inode_id";
public static final String USER = "user";
public static final String ACCESS_TIME = "access_time";
}
Note Don’t forget to update your database with the new schema.
10.1.3 Defining the Entity Class
Having defined the database table, we need to define an entity class representing our database entries in the Java code. We do this by adding the following AccessTimeLogEntry class to the hops-metadata-dal project.
package io.hops.metadata.hdfs.entity;
public class AccessTimeLogEntry {
private final int inodeId;
private final String user;
private final long accessTime;
public AccessTimeLogEntry(int inodeId, String user
, long accessTime) {
this.inodeId = inodeId;
this.user = user;
this.accessTime = accessTime;
}
public int getInodeId() {
return inodeId;
}
public String getUser() {
return user;
}
public long getAccessTime() {
return accessTime;
}
}
10.1.4 Defining the DataAccess interface
We need a way of interacting with our new entity in the database. The preferred way of doing this in Hops is defining a DataAccess interface to be implemented by a database implementation. Let's define the following interface in the hops-metadata-dal project. For now, we only require functionality to add and modify log entries and to read individual entries for a given INode and user.
package io.hops.metadata.hdfs.dal;
public interface AccessTimeLogDataAccess<T> extends EntityDataAccess {
void prepare(Collection<T> modified,
Collection<T> removed) throws StorageException;
T find(int inodeId, String user) throws StorageException;
}
10.1.5 Implementing the DataAccess interface
Having defined the interface, we need to implement it using NDB to read and persist our data. Therefore, we add a ClusterJ implementation of our interface to the hops-metadata-dal-impl-ndb project.
package io.hops.metadata.ndb.dalimpl.hdfs;
public class AccessTimeLogClusterj implements TablesDef.AccessTimeLogTableDef,
AccessTimeLogDataAccess<AccessTimeLogEntry> {
private ClusterjConnector connector = ClusterjConnector.getInstance();
@PersistenceCapable(table = TABLE_NAME)
public interface AccessTimeLogEntryDto {
@PrimaryKey
@Column(name = INODE_ID)
int getInodeId();
void setInodeId(int inodeId);
@PrimaryKey
@Column(name = USER)
String getUser();
void setUser(String user);
@Column(name = ACCESS_TIME)
long getAccessTime();
void setAccessTime(long accessTime);
}
@Override
public void prepare(Collection<AccessTimeLogEntry> modified,
Collection<AccessTimeLogEntry> removed) throws StorageException {
HopsSession session = connector.obtainSession();
List<AccessTimeLogEntryDto> changes =
new ArrayList<AccessTimeLogEntryDto>();
List<AccessTimeLogEntryDto> deletions =
new ArrayList<AccessTimeLogEntryDto>();
if (removed != null) {
for (AccessTimeLogEntry logEntry : removed) {
Object[] pk = new Object[2];
pk[0] = logEntry.getInodeId();
pk[1] = logEntry.getUser();
AccessTimeLogEntryDto persistable =
session.newInstance(AccessTimeLogEntryDto.class, pk);
deletions.add(persistable);
}
}
if (modified != null) {
for (AccessTimeLogEntry logEntry : modified) {
AccessTimeLogEntryDto persistable =
createPersistable(logEntry, session);
changes.add(persistable);
}
}
session.deletePersistentAll(deletions);
session.savePersistentAll(changes);
}
@Override
public AccessTimeLogEntry find(int inodeId, String user)
throws StorageException {
HopsSession session = connector.obtainSession();
Object[] key = new Object[2];
key[0] = inodeId;
key[1] = user;
AccessTimeLogEntryDto dto = session.find(AccessTimeLogEntryDto.class, key);
AccessTimeLogEntry logEntry = create(dto);
return logEntry;
}
private AccessTimeLogEntryDto createPersistable(AccessTimeLogEntry logEntry,
HopsSession session) throws StorageException {
AccessTimeLogEntryDto dto = session.newInstance(AccessTimeLogEntryDto.class);
dto.setInodeId(logEntry.getInodeId());
dto.setUser(logEntry.getUser());
dto.setAccessTime(logEntry.getAccessTime());
return dto;
}
private AccessTimeLogEntry create(AccessTimeLogEntryDto dto) {
AccessTimeLogEntry logEntry = new AccessTimeLogEntry(
dto.getInodeId(),
dto.getUser(),
dto.getAccessTime());
return logEntry;
}
}
Having defined a concrete implementation of the DataAccess, we need to make it available to the
EntityManager by adding it to HdfsStorageFactory in the hops-metadata-dal-impl-ndb
project. Edit its initDataAccessMap() function by adding the newly defined DataAccess as follows.
private void initDataAccessMap() {
[...]
dataAccessMap.put(AccessTimeLogDataAccess.class, new AccessTimeLogClusterj());
}
10.1.6 Implementing the EntityContext
Hops-HDFS uses context objects to cache the state of entities during transactions before persisting them in
the database during the commit phase. We will need to implement such a context for our new entity in the
hops project.
package io.hops.transaction.context;
public class AccessTimeLogContext extends
BaseEntityContext<Object, AccessTimeLogEntry> {
private final AccessTimeLogDataAccess<AccessTimeLogEntry> dataAccess;
/* Finder to be passed to the EntityManager */
public enum Finder implements FinderType<AccessTimeLogEntry> {
ByInodeIdAndUser;
@Override
public Class getType() {
return AccessTimeLogEntry.class;
}
@Override
public Annotation getAnnotated() {
switch (this) {
case ByInodeIdAndUser:
return Annotation.PrimaryKey;
default:
throw new IllegalStateException();
}
}
}
/*
* Our entity uses inode id and user as a composite key.
* Hence, we need to implement a composite key class.
*/
private class Key {
int inodeId;
String user;
public Key(int inodeId, String user) {
this.inodeId = inodeId;
this.user = user;
}
@Override
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
return false;
}
Key key = (Key) o;
if (inodeId != key.inodeId) {
return false;
}
return user.equals(key.user);
}
@Override
public int hashCode() {
int result = inodeId;
result = 31 * result + user.hashCode();
return result;
}
@Override
public String toString() {
return "Key{" +
"inodeId=" + inodeId +
", user='" + user + '\'' +
'}';
}
}
public AccessTimeLogContext(AccessTimeLogDataAccess<AccessTimeLogEntry>
dataAccess) {
this.dataAccess = dataAccess;
}
@Override
Object getKey(AccessTimeLogEntry logEntry) {
return new Key(logEntry.getInodeId(), logEntry.getUser());
}
@Override
public void prepare(TransactionLocks tlm)
throws TransactionContextException, StorageException {
Collection<AccessTimeLogEntry> modified =
new ArrayList<AccessTimeLogEntry>(getModified());
modified.addAll(getAdded());
dataAccess.prepare(modified, getRemoved());
}
@Override
public AccessTimeLogEntry find(FinderType<AccessTimeLogEntry> finder,
Object... params) throws TransactionContextException,
StorageException {
Finder afinder = (Finder) finder;
switch (afinder) {
case ByInodeIdAndUser:
return findByPrimaryKey(afinder, params);
}
throw new UnsupportedOperationException(UNSUPPORTED_FINDER);
}
private AccessTimeLogEntry findByPrimaryKey(Finder finder, Object[] params)
throws StorageCallPreventedException, StorageException {
final int inodeId = (Integer) params[0];
final String user = (String) params[1];
Key key = new Key(inodeId, user);
AccessTimeLogEntry result;
if (contains(key)) {
result = get(key); // Get it from the cache
hit(finder, result, params);
} else {
aboutToAccessStorage(finder, params); // Throw an exception
//if reading after the reading phase
result = dataAccess.find(inodeId, user); // Fetch the value
gotFromDB(key, result); // Put the new value into the cache
miss(finder, result, params);
}
return result;
}
}
Having defined an EntityContext, we need to make it available through the EntityManager by adding it to the HdfsStorageFactory in the hops project, modifying it as follows.
private static ContextInitializer getContextInitializer() {
return new ContextInitializer() {
@Override
public Map<Class, EntityContext> createEntityContexts() {
Map<Class, EntityContext> entityContexts =
new HashMap<Class, EntityContext>();
[...]
entityContexts.put(AccessTimeLogEntry.class, new AccessTimeLogContext(
(AccessTimeLogDataAccess) getDataAccess(AccessTimeLogDataAccess.class)));
return entityContexts;
}
};
}
10.1.7 Using custom locks
Your metadata extension relies on the inode object to be correctly locked in order to prevent concurrent
modifications. However, it might be necessary to modify attributes without locking the INode in advance.
In that case, one needs to add a new lock type. A good place to get started with this is looking at the Lock,
HdfsTransactionLocks, LockFactory and HdfsTransactionalLockAcquirer classes in
the hops project.
10.2 Erasure Coding API Access
HopsFS provides erasure coding functionality in order to decrease storage costs without the loss of high availability. Hops offers a powerful erasure coding API that is configurable on a per-file basis. Codes can be freely configured and different configurations can be applied to different files. Given that Hops monitors your erasure-coded files directly in the NameNode, maximum control over encoded files is guaranteed. This page explains how to configure and use the erasure coding functionality of Hops. Apache HDFS stores 3 copies of your data to provide high availability, so 1 petabyte of data actually requires 3 petabytes of storage. For many organizations, this results in enormous storage costs. HopsFS also supports erasure coding to reduce the required storage by 44% compared to HDFS, while still providing high availability for your data.
10.2.1 Java API
The erasure coding API is exposed to the client through the DistributedFileSystem class. The following sections give examples of how to use its functionality. Note that the following examples rely on erasure coding being properly configured; information about how to do this can be found in the erasure coding configuration section.
10.2.2 Creation of Encoded Files
The erasure coding API offers the ability to request the encoding of a file while it is being created. Doing so has the benefit that file blocks can initially be placed in a way that meets the placement constraints for erasure-coded files, without needing to rewrite them during the encoding process. The actual encoding process will take place asynchronously on the cluster.
Configuration conf = new Configuration();
DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
// The path of the file to be created
Path path = new Path("/testFile");
// Use the configured "src" codec and reduce
// the replication to 1 after successful encoding
EncodingPolicy policy = new EncodingPolicy("src" /* Codec id as configured */,
(short) 1);
// Create the file with the given policy and
// write it with an initial replication of 2
FSDataOutputStream out = dfs.create(path, (short) 2, policy);
// Write some data to the stream and close it as usual
out.close();
// Done. The encoding will be executed asynchronously
// as soon as resources are available.
Multiple versions of the create function complementing the original versions with erasure coding functionality exist. For more information please refer to the class documentation.
10.2.3 Encoding of Existing Files
The erasure coding API offers the ability to request the encoding for existing files. A replication factor to be
applied after successfully encoding the file can be supplied as well as the desired codec. The actual encoding
process will take place asynchronously on the cluster.
Configuration conf = new Configuration();
DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
String path = "/testFile";
// Use the configured "src" codec and reduce the replication to 1
// after successful encoding
EncodingPolicy policy = new EncodingPolicy("src" /* Codec id as configured */,
(short) 1);
// Request the asynchronous encoding of the file
dfs.encodeFile(path, policy);
// Done. The encoding will be executed asynchronously
// as soon as resources are available.
10.2.4 Reverting To Replication Only
The erasure coding API allows reverting the encoding and falling back to replication only. A replication factor can be supplied, and it is guaranteed to be reached before any parity information is deleted.
Configuration conf = new Configuration();
DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
// The path to an encoded file
String path = "/testFile";
// Request the asynchronous revocation process and
// set the replication factor to be applied
dfs.revokeEncoding(path, (short) 2);
// Done. The file will be replicated asynchronously and
// its parity will be deleted subsequently.
10.2.5 Deletion Of Encoded Files
Deletion of encoded files does not require any special care. The system will automatically take care of
deletion of any additionally stored information.
CHAPTER ELEVEN: LICENSE COMPATIBILITY
We combine Apache- and GPL-licensed code, from Hops and MySQL Cluster, respectively, by providing a DAL API (similar to JDBC). We dynamically link our DAL implementation for MySQL Cluster with the Hops code. Both binaries are distributed separately. Hops derives from Hadoop and, as such, it is available under the Apache version 2.0 open-source licensing model. MySQL Cluster and its client connectors, on the other hand, are licensed under the GPL version 2.0 licensing model. Similar to the JDBC model, we have introduced a Data Access Layer (DAL) API to bridge our code licensed under the Apache model with the MySQL Cluster connector libraries, licensed under the GPL v2.0 model. The DAL API is licensed under the Apache v2.0 model. The DAL API is statically linked to both Hops and our client library for MySQL Cluster that implements the DAL API. Our client library that implements the DAL API for MySQL Cluster, however, is licensed under the GPL v2.0 model, but static linking of Apache v2 code to GPL v2 code is allowed, as stated in the MySQL FOSS license exception. The FOSS License Exception permits use of the GPL-licensed MySQL Client Libraries with software applications licensed under certain other FOSS licenses without causing the entire derivative work to be subject to the GPL. However, to comply with the terms of both licensing models, the DAL API needs to be generic, and different implementations of it for different databases are possible. Although we currently only support MySQL Cluster, you are free to develop your own DAL API client and run Hops on a different database. The main requirements for the database are support for transactions, read/write locks and at least read-committed isolation.
Feature: Description
• Two-factor authentication: Secure authorization using smartphones and Yubikeys
• Dynamic User Roles: Users can have different privileges in different studies
• Biobanking forms: Consent forms, non-consent forms
• Audit Trails: Logging of user activity in the system
• Study membership mgmt: Study owners manage users and their roles
• Metadata mgmt: Metadata designer and metadata entry for files/directories
• Free-text search: Search for projects/datasets/files/directories using Elasticsearch
• Data set sharing: Sharing data between studies without copying
• Data set browser: Explore/upload/download files and directories in HopsFS
• SAASFEE: Bioinformatics workflows on YARN using Cuneiform and HiWAY
• Charon: Sharing data between Biobanks
• Apache Zeppelin: Interactive analytics using Spark and Flink
These features were integrated from the following deliverables: D3.4 Security Toolset Final Version; D1.3 Legal and Ethical Framework ...; D1.4 Disclosure model; D3.5 Object Model Implementation; D2.3 Scalable and Highly Available HDFS; D1.2 Object model for biobank data sharing; D5.3 Workflows for NGS data analysis use cases; D6.3 Analysis Pipelines Linked to Public Biological Annotation; D4.3 Overbank Implementation and Evaluation; D2.4 Secure, scalable, highly-available Filesystem ...; and D6.1 BiobankCloud Platform-as-a-Service.
Table 1: HopsWorks integrates features from BiobankCloud Deliverables.
HopsWorks as a new UI for Hadoop
Existing models for multi-tenancy in Hadoop, such as Amazon Web Services' Elastic MapReduce (EMR) platform, Google's Dataproc platform, and Altiscale's Hadoop-as-a-Service, provide multi-tenant Hadoop by running separate Hadoop clusters for separate projects or organizations. They improve cluster efficiency by running Hadoop clusters on virtualized or containerized platforms, but in some cases the clusters are not elastic, that is, they cannot be easily scaled up or down in size. There are no tools for securely sharing data between platforms without copying data.
HopsWorks is a front-end to Hadoop that provides a new model for multi-tenancy in Hadoop,
based around projects. A project is like a GitHub project - the owner of the project manages
membership, and users can have different roles in the project: data scientists can run programs and data owners can also curate, import, and export data. Users can’t copy data between
projects or run programs that process data from different projects, even if the user is a member
of multiple projects. That is, we implement multi-tenancy with dynamic roles, where the user’s
role is based on the currently active project. Users can still share datasets between projects,
however. HopsWorks has been enabled by migrating all metadata in HDFS and YARN into
an open-source, shared nothing, in-memory, distributed database, called NDB. HopsWorks is
open-source and licensed as Apache v2, with database connectors licensed as GPL v2. From late
January 2016, HopsWorks will be provided as software-as-a-service for researchers and companies in Sweden from the Swedish ICT SICS Data Center (https://www.sics.se/projects/sicsice-data-center-in-lulea).
HopsWorks Implementation
HopsWorks is a J2EE 7 web application that runs by default on Glassfish and has a modern AngularJS user interface, supporting responsive HTML using the Bootstrap framework (that is, the UI adapts its layout for mobile devices). We have a separate administration application that is also a J2EE application but provides a JSF user interface. For reasons of security, the applications are kept separate: we can deploy the administration application on a firewalled machine, while HopsWorks needs to be user-facing and open to clients, who may reside outside the internal network.
Conclusions
In this deliverable, we introduced Karamel (http://www.karamel.io), a new orchestration application for Chef and JClouds that enables the easy configuration and installation of BiobankCloud
on both cloud platforms and on-premise (baremetal) hosts. We also presented our SaaS platform for using BiobankCloud, HopsWorks, that provides an intuitive web-based user interface
for the platform. Together these tools help lower the barrier of entry for both Biobankers and
Bioinformaticians in getting started with Hadoop and BiobankCloud. Our first experiences
with presenting these tools to the community have been positive, and we will deploy them at
three Biobanks in 2016, as part of the BBMRI Competence center. BBMRI will take on the
development of BiobankCloud and promote its use within the community. In a separate development, from February 2016, HopsWorks will be used to provide Hadoop-as-a-Service in
Sweden to researchers and industry, where it will be deployed on 152 hosts in the Swedish ICT
SICS North data center.
Bibliography
[1] David Bernstein. Containers and cloud: From lxc to docker to kubernetes. IEEE Cloud
Computing, (3):81–84, 2014.
[2] Geoffrey C Fox, Judy Qiu, Supun Kamburugamuve, Shantenu Jha, and Andre Luckow.
Hpc-abds high performance computing enhanced apache big data stack. In Cluster, Cloud
and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on, pages
1057–1066. IEEE, 2015.
[3] Salman Niazi, Mahmoud Ismail, Stefan Grohsschiedt, and Jim Dowling. D2.3, scalable and
highly available hdfs, 2014.
[4] Fredrik Önnberg. Software configuration management: A comparison of chef, cfengine
and puppet. 2012.
[5] Liming Zhu, Donna Xu, An Binh Tran, Xiwei Xu, Len Bass, Ingo Weber, and Srini
Dwarakanathan. Achieving reliable high-frequency releases in cloud environments. Software, IEEE, 32(2):73–80, 2015.