SAS and Hadoop Technology: Deployment Scenarios

SAS® Documentation
The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS® and Hadoop Technology:
Deployment Scenarios. Cary, NC: SAS Institute Inc.
SAS® and Hadoop Technology: Deployment Scenarios
Copyright © 2015, SAS Institute Inc., Cary, NC, USA
All rights reserved. Produced in the United States of America.
For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any
form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the
publisher, SAS Institute Inc.
For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at
the time you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the
publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or
encourage electronic piracy of copyrighted materials. Your support of others' rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer
software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use,
duplication or disclosure of the Software by the United States Government is subject to the license terms of this Agreement
pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a) and DFAR 227.7202-4 and, to the extent
required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is
applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the
Software or documentation. The Government's rights in Software and documentation shall be only those set forth in this
Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513-2414.
January 2016
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute
Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
Contents
Using This Book
Chapter 1 • Introduction to SAS and Hadoop Deployment Scenarios
    Deploying SAS with Hadoop
    How to Use This Guide
    Tips Before You Deploy
    Supported Hadoop Distributions
Chapter 2 • Scenario for SAS Data Loader for Hadoop
    Overview
    Step 1: Install and Configure Software on the Hadoop Cluster
    Step 2: Deploy the vApp and Configure Directives
Chapter 3 • Scenarios for In-Memory Analytics
    Overview
    Scenario 1: Deploy In-Memory Analytics with SAS High-Performance Deployment of Hadoop
    Scenario 2: Deploy In-Memory Analytics on Your Hadoop Cluster
    Scenario 3: Configure Remote Access to Hadoop
Recommended Reading
Index
Using This Book
Audience
This document is for anyone who is interested in learning how to deploy SAS software
with Hadoop. Examples include a Hadoop administrator or IT administrator who will
install SAS software to work with Hadoop, or a SAS representative who helps
customers deploy software.
Chapter 1 • Introduction to SAS and Hadoop Deployment Scenarios
Deploying SAS with Hadoop
How to Use This Guide
Tips Before You Deploy
    Understand Your Hadoop Environment
    Understand the SAS Intelligence Platform
Supported Hadoop Distributions
Deploying SAS with Hadoop
SAS has long supported many different data sources, and Hadoop is no exception.
Currently, more than twenty SAS products, solutions, and technology packages interact
with Hadoop. Each SAS technology provides different functionality—from accessing and
managing Hadoop data to executing in-memory analytics with data in Hadoop.
Because SAS provides multiple options for accessing and processing data in Hadoop,
consider the following approaches to deployment:
• You can configure SAS software that enables processing in your Hadoop cluster. To
  do this, you deploy the SAS in-database deployment package for Hadoop (SAS
  Embedded Process) on the nodes of your Hadoop cluster. Together with
  SAS/ACCESS Interface to Hadoop, SAS Embedded Process enables analysts to
  run programs in Hadoop. This approach eliminates data movement, because SAS
  performs the analytics where the data is stored and uses the distributed, parallel
  processing architecture of Hadoop for improved performance.
  SAS products that can take advantage of data management in the Hadoop cluster
  include SAS Data Loader for Hadoop, SAS Scoring Accelerator for Hadoop, and
  SAS Code Accelerator for Hadoop.
  This guide provides a deployment scenario for SAS Data Loader for Hadoop. For
  more information, see Chapter 2, “Scenario for SAS Data Loader for Hadoop.”
• You can deploy an in-memory analytics environment to work with Hadoop. This
  approach offers the greatest potential for fast analytics on very large data
  sets from Hadoop.
  The SAS software that makes up an in-memory analytics environment is SAS
  High-Performance Analytics and SAS LASR Analytic Server. Products that can take
  advantage of this environment include SAS Visual Analytics and SAS Visual
  Statistics. This guide provides different deployment scenarios for in-memory
  analytics. For more information, see Chapter 3, “Scenarios for In-Memory Analytics.”
• You can configure SAS for basic access to Hadoop so that users can access data in
  Hadoop, just as they would with any other data source. This approach supports the
  widest range of SAS products, and any tools that you already have in place can
  access and manage data from Hadoop.
  Because configuring basic access to Hadoop includes post-installation configuration
  steps for both Base SAS and SAS/ACCESS Interface to Hadoop, this document
  provides no scenario and no additional planning information. Instead, you can find
  instructions in SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
TIP For more information about how SAS and Hadoop work together, see SAS and
Hadoop Technology: Overview.
How to Use This Guide
Use each scenario as a roadmap for deployment. Each scenario includes a summary of
what is deployed, an overview of steps to complete the deployment, and links to SAS
product documentation where you can find detailed instructions.
Each SAS solution, product, or technology has installation documentation and
information to help you configure SAS software with Hadoop. Because some SAS
deployments with Hadoop require multiple software components, you also need
guidance on how best to deploy those components together. To be successful, make
sure that you have access to the product documentation for the software that you
want to deploy.
Tips Before You Deploy
Here are a few considerations before you get started.
Understand Your Hadoop Environment
To deploy SAS software with Hadoop successfully, consider the following tips:
• Gain working knowledge of the Hadoop distribution that you are using (for example,
  Cloudera CDH or Hortonworks Data Platform). Make sure you have working
  knowledge of the Hadoop Distributed File System (HDFS), MapReduce 1,
  MapReduce 2, YARN, Hive, and HiveServer2 services. Review your YARN
  configuration. For more information, see the Apache website or the vendor’s
  website.
• Ensure that the HCatalog, HDFS, Hive, MapReduce, Oozie, Sqoop, and YARN
  services are running on the Hadoop cluster. SAS software uses these services,
  and the appropriate JAR files must be gathered during configuration.
• Know the location of the MapReduce home.
• Know the host name of the Hive server and the name of the NameNode.
• Determine where the HDFS and Hive servers are running. If the Hive server is not
  running on the same machine as the NameNode, note the server and port number of
  the Hive server for future configuration.
• Request permission to restart the MapReduce service.
• Understand and verify your Hadoop user authentication.
• Understand Kerberos or another security protocol for data security. Verify that you
  can connect to your Hadoop cluster (HDFS and Hive) from your client machine
  outside the SAS environment with your defined security protocol.
Note: The scenarios in this document assume that Kerberos has been enabled and
both SAS software and the Hadoop cluster are configured as part of the same
Kerberos realm.
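Some of the checks in the tips above can be scripted before deployment begins. The following Python sketch is one possible approach and is not a SAS-provided tool. The function names are hypothetical. `missing_services` compares the services that you report as running against the services named in this section, and `can_connect` tests only whether a TCP port answers, so it does not replace a real Kerberos-authenticated connection to HDFS or Hive.

```python
import socket

# Services that this section says must be running on the Hadoop cluster.
REQUIRED_SERVICES = {"HCatalog", "HDFS", "Hive", "MapReduce", "Oozie", "Sqoop", "YARN"}


def missing_services(running):
    """Return the required services that are absent from `running`, sorted by name."""
    return sorted(REQUIRED_SERVICES - set(running))


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This proves only that the port is reachable. It does not authenticate,
    so it is not a substitute for a Kerberos-secured HDFS or Hive session.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("namenode.example.com", 8020)` would probe a NameNode. The host name is a placeholder, and 8020 (NameNode) and 10000 (HiveServer2) are common defaults that your distribution might override, so confirm the actual values with your Hadoop administrator.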
Understand the SAS Intelligence Platform
The SAS Intelligence Platform provides the architecture for data management, business
intelligence, and analytics. When you deploy SAS software for Hadoop, it is important to
understand the different computing tiers and servers that the architecture comprises.
For more information, see SAS Intelligence Platform: Overview.
Supported Hadoop Distributions
SAS supports commercial Hadoop distributions from Cloudera, Hortonworks, IBM
BigInsights, MapR, and Pivotal.
For more information about the supported distributions, see SAS 9.4 Support for
Hadoop. In addition, see the full product documentation or system requirements
documentation for SAS products and technologies.
Note: SAS provides support for the installation and integration of Apache Hadoop with
SAS software. SAS does not provide support for other aspects of the administration and
operation of Apache Hadoop. For production environments, customers should seek out
a well-supported third-party distribution of Hadoop so that they can turn to a
dedicated Hadoop vendor for assistance with their production Hadoop needs. For the
complete statement for licensing and support of SAS High-Performance Deployment of
Hadoop, go to Support for Apache Hadoop Software Distributed with SAS Software.
Chapter 2 • Scenario for SAS Data Loader for Hadoop
Overview
    About SAS Data Loader for Hadoop
    What Gets Deployed
Step 1: Install and Configure Software on the Hadoop Cluster
Step 2: Deploy the vApp and Configure Directives
Overview
You can deploy SAS Data Loader for Hadoop to take advantage of the data
management capabilities of Hadoop. This chapter provides an overview of SAS Data
Loader for Hadoop and a roadmap for deploying software.
About SAS Data Loader for Hadoop
SAS Data Loader for Hadoop is a software offering that makes it easier to move,
cleanse, and analyze data in Hadoop. SAS Data Loader for Hadoop provides a set of
directives or wizards that help business users and data scientists perform the following
tasks:
• copy data to and from Hadoop using parallel bulk data transfer
• perform data integration, data quality, and data preparation tasks within Hadoop,
  without writing complex MapReduce code or asking for outside help
• minimize data movement for increased scalability, governance, and performance
• load data in memory to prepare it for high-performance reporting, visualization, or
  analytics
What Gets Deployed
To use SAS Data Loader for Hadoop, you deploy the software components shown in the
following table.
Table 2.1 Software to Deploy for SAS Data Loader for Hadoop

Software: SAS In-Database Technologies for Hadoop
Details: The software that enables the following technologies for SAS Data Loader for Hadoop:
• SAS In-Database Deployment Package for Hadoop
• SAS In-Database Technologies for Data Quality Directives, the component that
  includes SAS Data Quality Accelerator for Hadoop and SAS Quality Knowledge Base.
  SAS Data Quality Accelerator is required to run the Cleanse Data in Hadoop
  directive in the Hadoop cluster. SAS Quality Knowledge Base is a collection of
  files that store data and logic that support data management operations.
• SAS Data Management Accelerator for Spark, the component that runs data
  integration and data quality tasks in a Spark environment.

Software: vApp for SAS Data Loader for Hadoop (client)
Details: The virtual machine that business users run to interface with SAS Data
Loader for Hadoop. The vApp is a complete and isolated operating environment that
business users configure in a supported hypervisor.
Step 1: Install and Configure Software
on the Hadoop Cluster
The following steps should be performed by a Hadoop or systems administrator.
Table 2.2 Software Deployed on the Hadoop Cluster

Step 1: Deploy SAS In-Database Technologies for Hadoop.
Documentation: SAS In-Database Products: Administrator's Guide. See Part 3,
“Administrator’s Guide for SAS Data Loader for Hadoop.”
Note: The procedure for installing and deploying SAS In-Database Technologies for
Hadoop depends on which distribution you have downloaded. Instructions for different
distributions are provided in SAS In-Database Products: Administrator's Guide.

Step 2: Provide information to each person who will deploy the vApp for SAS Data
Loader for Hadoop.
Note: This information includes the Kerberos settings and additional values to
connect to the Hadoop environment. Each person deploying a vApp must have the
correct information.
Documentation: SAS In-Database Products: Administrator's Guide. See the sections
“End-User Configuration Support” and “End-User Security Support” in Part 3,
“Administrator’s Guide for SAS Data Loader for Hadoop.”
Step 2: Deploy the vApp and Configure
Directives
To deploy SAS Data Loader for Hadoop on each client host, a user must set up and run
the vApp in a supported hypervisor. Part of the configuration is to enter information that
is provided by the Hadoop administrator.
Table 2.3 vApp Setup and Configuration

Step 1: Set up and run the vApp.
Documentation: SAS Data Loader for Hadoop: vApp Deployment Guide

Step 2: Configure directives and global settings. Additional configuration might be
required for global settings or for working with the different directives.
Documentation: SAS Data Loader for Hadoop: User's Guide
Chapter 3 • Scenarios for In-Memory Analytics
Overview
    About In-Memory Analytics
    Deployment Considerations
    Core Software
    Supporting Software
    Hadoop Distributions
    SAS Products That Take Advantage of In-Memory Analytics
Scenario 1: Deploy In-Memory Analytics with SAS High-Performance Deployment of Hadoop
    What Gets Deployed
    Step 1: Prepare Your Machines
    Step 2: Deploy SAS High-Performance Computing Management Console and Create User Accounts
    Step 3: Deploy SAS High-Performance Deployment of Hadoop
    Step 4: Install and Configure the SAS High-Performance Analytics Environment
Scenario 2: Deploy In-Memory Analytics on Your Hadoop Cluster
    What Gets Deployed
    Step 1: Prepare Machines on the Hadoop Cluster
    Step 2: Deploy SAS High-Performance Computing Management Console and Create User Accounts
    Step 3: Configure the Hadoop Cluster
    Step 4: Install and Configure the SAS High-Performance Analytics Environment
    Step 5: Deploy SAS Embedded Process
Scenario 3: Configure Remote Access to Hadoop
    What Gets Deployed
    Step 1: Install SAS/ACCESS Interface to Hadoop
    Step 2: Deploy SAS Embedded Process in Your Hadoop Cluster
    Step 3: Configure the Analytics Environment for a Remote Parallel Connection
Overview
You can deploy software that enables SAS software to process and analyze data in a
distributed, in-memory environment. Because different environments are possible, this
chapter provides information to help you understand the SAS software that enables
in-memory analytics, and each scenario provides a roadmap for deploying the software.
About In-Memory Analytics
A key to understanding the advantages of in-memory analytics is to understand how
data is staged and where analytics takes place. To process data, SAS loads data from
Hadoop to the in-memory environment. The in-memory environment then performs the
analysis and only the results are returned to the SAS server or SAS client that
submitted the request.
The in-memory SAS software consists of the SAS High-Performance Analytics
environment and SAS LASR Analytic Server. The set of connected machines where
SAS in-memory software is deployed and where SAS processes the data is referred to
as the analytics cluster. When it is deployed, the analytics cluster reduces processing
time and brings SAS High-Performance Analytics to data volumes that exceed the
memory capacity of a single machine.
To read more about the advantages of using in-memory analytics, see SAS and Hadoop
Technology: Overview.
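To make the staging and processing flow described in this section concrete, here is a toy Python sketch. It is an illustration only and does not represent how SAS software is implemented. Threads in a single process stand in for the machines of an analytics cluster, and all function names are hypothetical. Each worker summarizes only the slice of data that it holds, and the root combines the partial results so that only the final answer is returned to the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_workers):
    """Split rows into n_workers slices, one slice per worker."""
    return [rows[i::n_workers] for i in range(n_workers)]


def worker_summary(rows):
    """Each worker summarizes only the slice that it holds in memory."""
    return (len(rows), sum(rows))


def root_mean(rows, n_workers=4):
    """Root role: distribute slices, gather partial results, and combine them."""
    slices = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_summary, slices))
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    # Only this single number travels back to the client that asked for it.
    return total / count if count else float("nan")
```

Calling `root_mean(list(range(1, 101)))` computes the mean of 1 through 100 from four partial summaries. The point of the sketch is the shape of the exchange: the full data stays with the workers, and the caller receives only the result.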
Deployment Considerations
Different deployments are possible for an in-memory environment.
• In-memory analytics software can be deployed with SAS High-Performance
  Deployment of Hadoop. An advantage to this scenario is that the Hadoop distribution
  provided by SAS can be used to stage data in HDFS as SASHDAT tables. The
  SASHDAT file format is memory efficient, supports SAS formats, and is optimal for
  use with in-memory processing.
  For more information, see “Scenario 1: Deploy In-Memory Analytics with SAS
  High-Performance Deployment of Hadoop.”
• You can co-locate the in-memory environment on a commercial Hadoop distribution.
  You might want to do this if you have experience with a commercial distribution or
  prefer the management interfaces that it provides. After you configure the
  commercial distribution, all the benefits of the SASHDAT file format that are ascribed
  to SAS High-Performance Deployment of Hadoop are available. If you already have
  a Hadoop cluster and the resource demands leave capacity for SAS software, you
  can consider deploying SAS software on the same machines.
  For more information, see “Scenario 2: Deploy In-Memory Analytics on Your Hadoop
  Cluster.”
• You can configure your analytics cluster to access a remote Hadoop cluster. In this
  scenario, minimal SAS software is installed in the Hadoop cluster. Specifically, SAS
  Embedded Process is deployed on the nodes of the Hadoop cluster to handle
  asymmetric, distributed workloads from remote Hadoop to the SAS analytics cluster.
  In this scenario, it is recommended that you deploy the in-memory analytics software
  with SAS High-Performance Deployment of Hadoop to stage data in HDFS as
  SASHDAT tables.
  For more information, see “Scenario 3: Configure Remote Access to Hadoop.”
Note: A co-located deployment consists of SAS in-memory analytics software that is
installed on the same nodes as a distribution of Hadoop. As described previously, the
Hadoop distribution can be the SAS High-Performance Deployment of Hadoop or a
commercial Hadoop distribution. Here are two ways a co-located deployment can help:
• To use SASHDAT tables, a co-located deployment is required. An important
  feature of a co-located deployment is the ability to stage data on the analytics
  cluster. The preferred format is a SASHDAT table, because it takes advantage of
  the redundancy and high-availability features of HDFS. SAS LASR Analytic
  Server and high-performance procedures can read and write SASHDAT tables in
  parallel.
• If you use SAS LASR Analytic Server, a co-located deployment is highly
  recommended. A co-located deployment enables you to read data from
  operational systems and stage it so that SAS can analyze it on the same
  machines. One of the most powerful benefits of SAS LASR Analytic Server is the
  ability to read data in parallel from a co-located data provider. A co-located
  deployment provides optimal performance for in-memory processing.
Core Software
The primary SAS software that enables distributed, in-memory computing is SAS
High-Performance Analytics and SAS LASR Analytic Server.
Table 3.1 Software for In-Memory Analytics

Software: SAS High-Performance Analytics environment (also known as SAS
High-Performance Analytics node installation)
Details:
• Performs analytic tasks in a high-performance environment that is characterized
  by massively parallel processing (MPP). After you deploy it, a root-and-worker
  architecture is established for running distributed, high-performance analytics.
• TKGrid is the primary software installed on each node to provide the SAS
  High-Performance Analytics environment.

Software: SAS LASR Analytic Server
Details:
• A scalable, analytic platform that provides a secure, multi-user environment for
  concurrent access to in-memory data.
• Provides the ability to load Hadoop data into memory and perform a variety of
  distributed processing, exploratory analyses, analytic calculations, and more.
Supporting Software
SAS provides additional software that can enable access, enhance performance, and
facilitate administration of the in-memory environment.
Table 3.2 Supporting Software for In-Memory Analytics

Software: SAS Embedded Process (in-database deployment package for Hadoop)
Details:
• Recommended software that enables reading and writing data to HDFS in parallel
  for SAS High-Performance Analytics.
• For environments that access Hadoop remotely, SAS Embedded Process is required
  to handle asymmetric, parallel loads between HDFS and the analytics cluster.
• SAS Embedded Process is not supported for environments that use SAS
  High-Performance Deployment of Hadoop.

Software: SAS/ACCESS Interface to Hadoop
Details:
• The required access engine that enables SAS software to interface with Hadoop.
• For environments that access Hadoop remotely, SAS/ACCESS Interface to Hadoop
  works with SAS Embedded Process to read data from the Hadoop cluster.

Software: SAS High-Performance Computing Management Console
Details:
• A console that provides an easy-to-use interface for performing administrative
  tasks in the analytics cluster.
• This software is optional.
Note: A SAS client is required to submit programs to the analytics cluster, and
SAS/ACCESS Interface to Hadoop must be installed on the same machine as the SAS
client. Data scientists, analytic experts, and other users interface with the SAS client
software to write SAS programs and submit them to the SAS High-Performance
Analytics environment. An example of a SAS client is SAS Studio.
Hadoop Distributions
The Hadoop distributions mentioned in this section are shown in the following table.
Table 3.3 Supporting Software for In-Memory Analytics

Software: SAS High-Performance Deployment of Hadoop
Details:
• A Hadoop distribution provided by SAS as a convenience for deploying co-located
  Hadoop with your analytics cluster. This software is optional and is not intended
  to replace a commercial distribution of Hadoop.
• Includes the Apache Hadoop framework, which includes Hadoop Common, HDFS,
  Hadoop YARN, and Hadoop MapReduce. Also includes JAR files from SAS that
  provide support for SASHDAT tables.

Software: Commercial Hadoop distributions
Details:
• The collection of Hadoop components (such as HDFS, Hive, and MapReduce) that
  is provided by a vendor.
• For more information about the supported distributions, see SAS 9.4 Support for
  Hadoop. In addition, see the full product documentation or system requirements
  documentation for SAS products and technologies.
SAS Products That Take Advantage of In-Memory Analytics
If you intend to use one or more of the following products, consider deploying an
in-memory environment.
Table 3.4 SAS Products That Take Advantage of In-Memory Analytics

Software: SAS Visual Analytics
Details: Use SAS Visual Analytics to explore large volumes of data very quickly to
identify patterns and trends and to identify opportunities for further analysis.
SAS Visual Analytics provides an easy-to-use, web-based interface for running
analytics.

Software: SAS In-Memory Statistics
Details: Use SAS In-Memory Statistics to perform analytical data preparation,
variable transformations, exploratory analysis, statistical modeling and
machine-learning techniques, integrated modeling comparison, and model scoring.
Data scientists can access data from a variety of sources and use an interactive
programming interface to access data in memory. The IMSTAT procedure enables
in-memory analytics on the data. SAS LASR Analytic Server holds the data in memory
and performs complex analytics.

Software: SAS High-Performance Risk, SAS High-Performance Data Mining, SAS
High-Performance Econometrics, SAS High-Performance Optimization, SAS
High-Performance Statistics, SAS High-Performance Text Mining
Details: These products, which provide multiple features for modeling, are
collectively known as the SAS High-Performance Analytics products.
Scenario 1: Deploy In-Memory Analytics
with SAS High-Performance Deployment
of Hadoop
What Gets Deployed
In this scenario, the SAS in-memory analytics environment is installed on the same
nodes as SAS High-Performance Deployment of Hadoop.
The following table shows nodes and roles, as well as the locations where analytics
software is installed.
Table 3.5 Roles and Software for Co-Located SAS High-Performance Deployment of Hadoop

On the Root Node (SAS Software):
• NameNode: SAS High-Performance Deployment of Hadoop
• Root Node: SAS High-Performance Analytics Environment
• SAS High-Performance Computing Management Console

On Each Worker Node (SAS Software):
• DataNode: SAS High-Performance Deployment of Hadoop
• Worker Node: SAS High-Performance Analytics Environment
Note: Both the Hadoop NameNode and the root node for SAS High-Performance
Analytics are on the same machine. The root node takes on the role of distributing and
coordinating the workload to the worker nodes.
Step 1: Prepare Your Machines
An important step for a successful deployment is to ensure that the machines are
configured appropriately before you deploy SAS High-Performance Deployment of
Hadoop and the SAS High-Performance Analytics environment.
Table 3.6 Configuration to Prepare Your Machines

Overview of Steps
Note: To find the detailed information for the following steps, see Chapter 2,
“Preparing Your System to Deploy the SAS High-Performance Analytics Infrastructure,”
in SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide.
1. Configure system settings on all nodes.
2. Create the /etc/gridhosts file, and list all nodes that will run the analytics
   cluster. During a later step in this scenario, you will reference this file so that the
   analytics cluster is installed on the relevant nodes.
3. Prepare each machine for Kerberos.
4. Prepare to install SAS High-Performance Computing Management Console.
   Note: SAS High-Performance Computing Management Console is an optional
   installation. In this scenario, the console is installed to highlight the features that it
   provides.
5. Prepare to deploy Hadoop.
6. Prepare to deploy the SAS High-Performance Analytics environment.
7. Read information about recommended data names.
8. Make sure the SAS Software Depot has been created and is available to the root
   node.
9. Make sure the installer is root. Also, review the user account and directory
   recommendations.
10. Make sure Java software is installed. Each machine must have a Java Runtime
    Environment (JRE) or Java Development Kit (JDK) installed.
11. Understand user account requirements and umask settings for deploying and
    running the SAS High-Performance Analytics environment.
12. Recommended: Record each SAS port in /etc/services.
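Several of the preparation steps above lend themselves to small helper scripts. The following Python sketch is a hedged illustration, not part of any SAS installer, and every function name in it is hypothetical. It assumes that /etc/gridhosts lists one host name per line (the handling of blank lines and `#` comments is an added convenience, not a documented feature), checks whether a `java` executable is on the PATH, shows how a umask changes the permissions of newly created files, and formats a line in the standard `name port/protocol` layout of /etc/services.

```python
import shutil


def parse_gridhosts(text):
    """Return the unique host names in gridhosts-style text, in file order.

    Assumes one host name per line; blank lines and '#' comments are skipped.
    """
    hosts = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name and name not in hosts:
            hosts.append(name)
    return hosts


def java_on_path():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


def effective_mode(requested, umask):
    """Return the permission bits that remain after `umask` is applied."""
    return requested & ~umask


def services_entry(name, port, protocol="tcp"):
    """Format one line in the /etc/services layout: name, then port/protocol."""
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"{name}\t{port}/{protocol}"
```

For example, `effective_mode(0o666, 0o022)` yields `0o644`, which is why a umask of 022 produces files that other users can read but not write. The service name and port passed to `services_entry` are yours to choose; this guide does not prescribe them.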
Step 2: Deploy SAS High-Performance Computing Management Console and Create User Accounts
When users interact with the analytics cluster, user accounts with passwordless SSH
are required to start and stop SAS LASR Analytic Server and to run programs in the
analytics cluster. The SAS High-Performance Computing Management Console is
designed to ease the creation of user accounts that require passwordless SSH.
Table 3.7 Installing and Configuring the Console and Creating Users
Overview of Steps
Note: To find the detailed information for the following steps, see “Chapter 3 — Deploying
SAS High-Performance Computing Management Console” in SAS High-Performance
Analytics Infrastructure: Installation and Configuration Guide.
1. Install the SAS High-Performance Computing Management Console by using RPM or tar.
2. Configure the SAS High-Performance Computing Management Console.
3. Create user accounts and propagate the SSH key for each account.
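After the SSH keys are propagated, it can be useful to confirm that an account really can log on without a password. The following Python sketch builds an ssh command that uses the OpenSSH BatchMode option, so that the connection fails instead of prompting. The helper names, the host name, and the user name are hypothetical; this check is not part of the SAS console and is shown only as one possible way to verify the result.

```python
import subprocess


def ssh_check_command(host, user=None, timeout=5):
    """Build an ssh command that fails instead of prompting for a password."""
    target = f"{user}@{host}" if user else host
    return [
        "ssh",
        "-o", "BatchMode=yes",              # never fall back to interactive prompts
        "-o", f"ConnectTimeout={timeout}",  # give up quickly on unreachable hosts
        target,
        "true",                             # trivial remote command
    ]


def has_passwordless_ssh(host, user=None, run=subprocess.run):
    """Return True when the key-based login succeeds (exit status 0)."""
    result = run(ssh_check_command(host, user), capture_output=True)
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical account and node; only the command is printed here.
    print(" ".join(ssh_check_command("grid002.example.com", user="sasdemo")))
```

The run parameter exists so that the check can be exercised without a live cluster; in normal use the default, subprocess.run, is enough.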
Step 3: Deploy SAS High-Performance Deployment of Hadoop
SAS High-Performance Deployment of Hadoop is deployed on all machines where you
plan to deploy and run the SAS High-Performance Analytics environment.
Table 3.8 Installing SAS High-Performance Deployment of Hadoop
Overview of Steps
Note: To find the detailed information for the following steps, see “Chapter 4 — Deploying
Co-Located Hadoop” in SAS High-Performance Analytics Infrastructure: Installation and
Configuration Guide. Look for step-by-step instructions in the section “Deploying SAS High-Performance Deployment of Hadoop.”
1. Install SAS High-Performance Deployment of Hadoop.
2. Perform manual steps after installation, including configuration changes for Kerberos.
3. Validate the SAS High-Performance Deployment of Hadoop.
Step 4: Install and Configure the SAS High-Performance Analytics Environment
Deploying the SAS High-Performance Analytics environment requires installing and
configuring components on the machine that will act as the root node, and then on the
remaining worker nodes. Refer to the following table for an overview of steps.
Table 3.9 Deploying the SAS High-Performance Analytics Environment
Overview of Steps
Note: To find the detailed information for the following steps, see “Chapter 5 — Deploying the
SAS High-Performance Analytics Environment” in SAS High-Performance Analytics
Infrastructure: Installation and Configuration Guide.
1. Install the analytics environment by running the TKGrid and TKTGDat shell scripts.
2. Validate the environment by invoking the simsh or MPI command.
3. (Optional) Customize resource settings for your site. To manage the performance, you
   can set limits on processes that run in the analytics cluster, and you can control how
   much memory is requested by programmers.
Note: You can configure the analytics environment for SASHDAT encryption. Because
this scenario does not cover this feature, see “Chapter 5 — Deploying the SAS High-Performance Analytics Environment” in SAS High-Performance Analytics Infrastructure:
Installation and Configuration Guide for more information.
Scenario 2: Deploy In-Memory Analytics on Your Hadoop Cluster
What Gets Deployed
In this scenario, the software that makes up the in-memory environment is deployed on
the same nodes with your commercial Hadoop distribution.
The following table shows nodes and roles, as well as locations where analytics
software is installed.
Table 3.10 Roles and Software for a Co-Located Commercial Hadoop Distribution

On the Root Node
  Commercial Hadoop
    - NameNode
  SAS Software
    - SAS Root Node: SAS High-Performance Analytics Environment
    - SAS High-Performance Management Console (optional)
    - SAS Embedded Process

On Each Worker Node
  Commercial Hadoop
    - DataNode
  SAS Software
    - Worker Node: SAS High-Performance Node
    - SAS Embedded Process

Note: Both the Hadoop NameNode and the root node for SAS High-Performance
Analytics are installed on the same machine. The root node takes on the role of
distributing and coordinating the workload to the worker nodes.
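The layout in the preceding table can also be written down as data and checked before deployment. The following Python sketch is one hypothetical way to do that: the Node class, the role labels (namenode, datanode, hpa_root, hpa_worker, embedded_process), and the host names are invented for illustration and do not correspond to any SAS tool or configuration file.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    host: str
    roles: set = field(default_factory=set)


def check_colocated_layout(nodes):
    """Return a list of problems found in a co-located layout (empty if none)."""
    problems = []
    roots = [n for n in nodes if "hpa_root" in n.roles]
    if len(roots) != 1:
        problems.append(f"expected exactly one root node, found {len(roots)}")
    for n in roots:
        # The root node is expected on the same machine as the NameNode.
        if "namenode" not in n.roles:
            problems.append(f"{n.host}: root node is not on the NameNode machine")
    for n in nodes:
        # SAS Embedded Process is listed for the root node and every worker node.
        if "embedded_process" not in n.roles:
            problems.append(f"{n.host}: SAS Embedded Process is missing")
    return problems


if __name__ == "__main__":
    plan = [
        Node("grid001", {"namenode", "hpa_root", "embedded_process"}),
        Node("grid002", {"datanode", "hpa_worker", "embedded_process"}),
    ]
    print(check_colocated_layout(plan) or "layout matches the table")
```

A check like this only validates a written plan against the table; it does not inspect what is actually installed on the machines.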
Step 1: Prepare Machines on the Hadoop Cluster
An important step for a successful deployment is to ensure that the machines are
configured appropriately before you configure your Hadoop cluster and deploy the
SAS High-Performance Analytics environment.
Table 3.11 Configuration to Prepare Your Machines
Overview of Steps
Note: To find the detailed information for the following steps, see “Chapter 2 — Preparing
Your System to Deploy the SAS High-Performance Analytics Infrastructure” in SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide.
1. Configure system settings on all nodes.
2. Create the /etc/gridhosts file, and list all nodes that will run the analytics
   cluster. During a later step in this scenario, you will reference this file so that the
   analytics cluster is installed on the relevant nodes.
3. Prepare each machine for Kerberos.
4. Prepare to install SAS High-Performance Computing Management Console.
   Note: SAS High-Performance Computing Management Console is an optional
   installation. In this scenario, the console is installed to highlight the features that it
   provides.
5. Prepare to deploy Hadoop.
6. Prepare to deploy the SAS High-Performance Analytics environment.
7. Read information about recommended data names.
8. Make sure the SAS Software Depot has been created and is available to the root
   node.
9. Make sure the installer is root. Also, review the user account and directory
   recommendations.
10. Make sure Java software is installed. Each machine must have a Java Runtime
    Environment (JRE) or Java Development Kit (JDK) installed.
11. Understand user account requirements and umask settings for deploying and
    running the SAS High-Performance Analytics environment.
12. Recommended: Record each SAS port in /etc/services.
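For the recommendation to record SAS ports in /etc/services, the following Python sketch shows how candidate entries might be formatted and checked against ports that are already recorded. It assumes the conventional services layout (service name, then port/protocol, then an optional comment). The service names and port numbers in the example are hypothetical; use the ports that your site actually assigns.

```python
def parse_services(text):
    """Map (port, protocol) to service name for each entry in services-style text."""
    entries = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].split()
        if len(line) < 2 or "/" not in line[1]:
            continue  # blank line, comment, or malformed entry
        port, proto = line[1].split("/", 1)
        if port.isdigit():
            entries[(int(port), proto)] = line[0]
    return entries


def services_line(name, port, proto="tcp", comment=""):
    """Format one entry as: name, port/protocol, and an optional comment."""
    line = f"{name}\t{port}/{proto}"
    return f"{line}\t# {comment}" if comment else line


def port_conflicts(existing_text, wanted):
    """Return the wanted (name, port) pairs whose TCP port is already recorded."""
    taken = parse_services(existing_text)
    return [(name, port) for name, port in wanted if (port, "tcp") in taken]


if __name__ == "__main__":
    existing = "ssh\t22/tcp\t# Secure Shell\nhttp\t80/tcp\twww\n"
    wanted = [("sas-lasr", 10010), ("sas-console", 80)]  # hypothetical names and ports
    print(port_conflicts(existing, wanted))
    print(services_line("sas-lasr", 10010, comment="hypothetical entry"))
```

The sketch only produces text and reports conflicts; appending the lines to /etc/services is left as a manual, reviewed step.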
Step 2: Deploy SAS High-Performance Computing Management Console and Create User Accounts
When users interact with the analytics cluster, user accounts with passwordless SSH
are required to start and stop SAS LASR Analytic Server and to run programs in the
analytics cluster. The SAS High-Performance Computing Management Console is
designed to ease the creation of user accounts that require passwordless SSH.
Table 3.12 Installing and Configuring the Console and Creating Users
Overview of Steps
Note: To find the detailed information for the following steps, see “Chapter 3 — Deploying
SAS High-Performance Computing Management Console” in SAS High-Performance
Analytics Infrastructure: Installation and Configuration Guide.
1. Install the SAS High-Performance Computing Management Console by using RPM or tar.
2. Configure the SAS High-Performance Computing Management Console.
3. Create user accounts and propagate the SSH key for each account.
Step 3: Configure the Hadoop Cluster
For most commercial Hadoop distributions, you perform configuration steps on every
machine in the Hadoop cluster. These steps can include setting environment variables,
propagating the sas.lasr.jar and sas.lasr.hadoop.jar files, and additional configuration.
Table 3.13 Configure the Hadoop Cluster
Overview of Steps
Note: To find the detailed information for the following steps, see “Chapter 4 — Deploying
Co-Located Hadoop” in SAS High-Performance Analytics Infrastructure: Installation and
Configuration Guide. Look for step-by-step instructions for your specific Hadoop distribution in
the section “Configuring Existing Hadoop Clusters.”
1. Read about the prerequisites for existing Hadoop clusters.
2. Follow steps specific to your existing implementation of Hadoop:
   - Cloudera CDH
   - Hortonworks HDP
   - IBM BigInsights
   - MapR Distribution
   - Pivotal HD
3. Create user accounts and propagate the SSH key for each account.
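Propagating the JAR files to every machine in the Hadoop cluster is usually a simple loop over the node list. The following Python sketch only builds the scp commands so that they can be reviewed before anything is copied. The host names and the destination directory are hypothetical, and the correct target location depends on your Hadoop distribution; see the distribution-specific instructions in the installation guide.

```python
import shlex


def copy_commands(hosts, jars, dest_dir):
    """Build one scp command per (host, jar) pair without running anything."""
    return [
        ["scp", jar, f"{host}:{dest_dir}/"]
        for host in hosts
        for jar in jars
    ]


if __name__ == "__main__":
    hosts = ["grid001.example.com", "grid002.example.com"]  # hypothetical nodes
    jars = ["sas.lasr.jar", "sas.lasr.hadoop.jar"]
    dest = "/opt/hadoop/lib"                                # hypothetical path
    for cmd in copy_commands(hosts, jars, dest):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to compare the host list with /etc/gridhosts before the files are copied.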
Step 4: Install and Configure the SAS High-Performance Analytics Environment
Deploying the SAS High-Performance Analytics environment requires installing and
configuring components on the machine that will act as the root node, and then on the
remaining worker nodes. Refer to the following table for an overview of steps.
Table 3.14 Deploying the SAS High-Performance Analytics Environment
Overview of Steps
Note: To find the detailed information for the following steps, see “Chapter 5 — Deploying the
SAS High-Performance Analytics Environment” in SAS High-Performance Analytics
Infrastructure: Installation and Configuration Guide.
1. Install the analytics environment by running the TKGrid, TKTGDat, and TKGrid_REP
   shell scripts.
   Note: TKGrid_REP is required to work with the SAS Embedded Process for parallel
   reading and writing of non-SASHDAT Hadoop data.
2. Validate the environment by invoking the simsh or MPI command.
3. (Optional) Customize resource settings for your site. To manage the performance of the
   SAS High-Performance Analytics environment, you can set limits on processes that run
   in the analytics cluster, and you can control how much memory is requested by
   programmers.
Note: You can configure the analytics environment for SASHDAT encryption. Because
this scenario does not cover this feature, see “Chapter 5 — Deploying the SAS High-Performance Analytics Environment” in SAS High-Performance Analytics Infrastructure:
Installation and Configuration Guide for more information.
Step 5: Deploy SAS Embedded Process
SAS Embedded Process is required for in-database functionality with the Hadoop
cluster, including scoring acceleration, code acceleration, and the SQL pass-through
facility. SAS Embedded Process also enables parallel data loading into SAS LASR Analytic
Server for Hive and Impala tables and for SPD Engine and SPD Server
tables.
SAS Embedded Process is part of the in-database deployment package for Hadoop.
For more information, see “Part 2 — Administrator’s Guide for Hadoop (In-Database
Deployment Package)” in SAS In-Database Products: Administrator's Guide.
Note: A prerequisite to installing SAS Embedded Process is to configure SAS/ACCESS
Interface to Hadoop. For more information, see SAS Hadoop Configuration Guide for
Base SAS and SAS/ACCESS.
Scenario 3: Configure Remote Access to Hadoop
What Gets Deployed
In this scenario, you deploy software on your existing commercial Hadoop cluster. This
deployment enables a remote, parallel connection between a Hadoop cluster and a set
of machines that is dedicated to running in-memory analytics. For this scenario,
SAS/ACCESS Interface to Hadoop is installed on a SAS client machine that is remote
from the Hadoop cluster and the in-memory analytics cluster.
The following table shows SAS Embedded Process software installed in the Hadoop
cluster.
Table 3.15 SAS Embedded Process Software Installed on the Hadoop Cluster

On the Root Node
  Commercial Hadoop
    - NameNode
  SAS Software
    - SAS Embedded Process

On Each Worker Node
  Commercial Hadoop
    - DataNode
  SAS Software
    - SAS Embedded Process

Step 1: Install SAS/ACCESS Interface to Hadoop
In this scenario, SAS/ACCESS Interface to Hadoop is installed on a SAS client that is
remote from the root node of the analytics cluster. For more information about
configuring SAS/ACCESS Interface to Hadoop, see SAS Hadoop Configuration Guide
for Base SAS and SAS/ACCESS.
Step 2: Deploy SAS Embedded Process in Your Hadoop Cluster
Make sure that SAS Embedded Process is deployed on each node in your Hadoop
cluster.
For more information, see “Part 2 — Administrator’s Guide for Hadoop (In-Database
Deployment Package)” in SAS In-Database Products: Administrator's Guide.
Step 3: Configure the Analytics Environment for a Remote Parallel Connection
Make sure that TKGrid_REP is configured across all nodes in the in-memory analytics
cluster. TKGrid_REP is a configuration of TKGrid that enables support for remote
access to Hadoop.
Table 3.16 Deploying the SAS High-Performance Analytics Environment
Overview of Steps
Note: To find the detailed information for the following steps, see “Chapter 6 — Configuring
the Analytics Environment for a Remote Parallel Connection” in SAS High-Performance
Analytics Infrastructure: Installation and Configuration Guide.
1. Prepare for a remote parallel connection.
2. Read how the configuration script works.
3. Run the TKGrid_REP script to configure access.
Recommended Reading
Here is the recommended reading list for this title:
- SAS and Hadoop Technology: Overview
- SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide
- SAS Visual Analytics: Installation and Configuration Guide (Distributed SAS LASR)
- SAS In-Database Products: Administrator's Guide
- SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS
- SAS LASR Analytic Server: Reference Guide
For a complete list of SAS publications, go to sas.com/store/books. If you have
questions about which titles you need, please contact a SAS Representative:
SAS Books
SAS Campus Drive
Cary, NC 27513-2414
Phone: 1-800-727-0025
Fax: 1-919-677-4444
Email: [email protected]
Web address: sas.com/store/books