AWS Data Pipeline: Developer Guide

Amazon Web Services

API Version 2012-10-29

What is AWS Data Pipeline? .................................................................................................................. 1

How Does AWS Data Pipeline Work? ..................................................................................................... 1

Pipeline Definition .......................................................................................................................... 2

Lifecycle of a Pipeline .................................................................................................................... 4

Task Runners ................................................................................................................................ 5

Pipeline Components, Instances, and Attempts .......................................................................... 10

Lifecycle of a Pipeline Task ......................................................................................................... 11

Get Set Up ............................................................................................................................................ 12

Access the Console .............................................................................................................................. 12

Install the Command Line Interface ...................................................................................................... 15

Deploy and Configure Task Runner ...................................................................................................... 19

Install the AWS SDK ............................................................................................................................. 20

Granting Permissions to Pipelines with IAM ......................................................................................... 21

Grant Amazon RDS Permissions to Task Runner ................................................................................ 23

Tutorial: Copy CSV Data from Amazon S3 to Amazon S3 .................................................................... 25

Using the AWS Data Pipeline Console ................................................................................................. 27

Using the Command Line Interface ...................................................................................................... 33

Tutorial: Copy Data From a MySQL Table to Amazon S3 ..................................................................... 40

Using the AWS Data Pipeline Console ................................................................................................. 42

Using the Command Line Interface ...................................................................................................... 48

Tutorial: Launch an Amazon EMR Job Flow ......................................................................................... 56

Using the AWS Data Pipeline Console ................................................................................................. 57

Using the Command Line Interface ...................................................................................................... 63

Tutorial: Import/Export Data in Amazon DynamoDB With Amazon EMR and Hive .............................. 69

Part One: Import Data into Amazon DynamoDB .................................................................................. 69

Using the AWS Data Pipeline Console ........................................................................................ 74

Using the Command Line Interface ............................................................................................. 81

Part Two: Export Data from Amazon DynamoDB ................................................................................. 90

Using the AWS Data Pipeline Console ........................................................................................ 92

Using the Command Line Interface ............................................................................................. 98

Tutorial: Run a Shell Command to Process MySQL Table .................................................................. 107

Using the AWS Data Pipeline Console ............................................................................................... 109

Manage Pipelines ............................................................................................................................... 116

Using AWS Data Pipeline Console .................................................................................................... 116

Using the Command Line Interface .................................................................................................... 121

Troubleshoot AWS Data Pipeline ........................................................................................................ 128

Pipeline Definition Files ...................................................................................................................... 135

Creating Pipeline Definition Files ........................................................................................................ 135

Example Pipeline Definitions .............................................................................................................. 139

Copy SQL Data to a CSV File in Amazon S3 ............................................................................ 139

Launch an Amazon EMR Job Flow ........................................................................................... 141

Run a Script on a Schedule ...................................................................................................... 143

Chain Multiple Activities and Roll Up Data ................................................................................ 144

Copy Data from Amazon S3 to MySQL ..................................................................................... 146

Extract Apache Web Log Data from Amazon S3 using Hive ..................................................... 148

Extract Amazon S3 Data (CSV/TSV) to Amazon S3 using Hive ............................................... 150

Extract Amazon S3 Data (Custom Format) to Amazon S3 using Hive ...................................... 152

Simple Data Types .............................................................................................................................. 153

Expression Evaluation ........................................................................................................................ 155

Objects ............................................................................................................................................... 161

Schedule ................................................................................................................................... 163

S3DataNode .............................................................................................................................. 165

MySqlDataNode ........................................................................................................................ 169

DynamoDBDataNode ................................................................................................................ 172

ShellCommandActivity .............................................................................................................. 176

CopyActivity ............................................................................................................................... 180

EmrActivity ................................................................................................................................ 184

HiveActivity ................................................................................................................................ 188


ShellCommandPrecondition ...................................................................................................... 192

Exists ......................................................................................................................................... 194

S3KeyExists .............................................................................................................................. 197

S3PrefixNotEmpty ..................................................................................................................... 200

RdsSqlPrecondition ................................................................................................................... 203

DynamoDBTableExists .............................................................................................................. 204

DynamoDBDataExists ............................................................................................................... 204

Ec2Resource ............................................................................................................................. 205

EmrCluster ................................................................................................................................ 209

SnsAlarm ................................................................................................................................... 213

Command Line Reference .................................................................................................................. 215

Program AWS Data Pipeline ............................................................................................................... 231

Make an HTTP Request to AWS Data Pipeline .................................................................................. 231

Actions in AWS Data Pipeline ............................................................................................................. 234

Task Runner Reference ...................................................................................................................... 235

Web Service Limits ............................................................................................................................. 238

AWS Data Pipeline Resources ........................................................................................................... 241

Document History ............................................................................................................................... 243


What is AWS Data Pipeline?

AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks.

For example, you can use AWS Data Pipeline to archive your web server's logs to Amazon Simple Storage Service (Amazon S3) each day and then run a weekly Amazon Elastic MapReduce (Amazon EMR) job flow over those logs to generate traffic reports.

In this example, AWS Data Pipeline would schedule the daily tasks to copy data and the weekly task to launch the Amazon EMR job flow. AWS Data Pipeline would also ensure that Amazon EMR waits for the final day's data to be uploaded to Amazon S3 before it begins its analysis, even if there is an unforeseen delay in uploading the logs.

AWS Data Pipeline handles the ambiguities of real-world data management. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up.

How Does AWS Data Pipeline Work?

Three main components of AWS Data Pipeline work together to manage your data:


• Pipeline definition specifies the business logic of your data management. For more information, see Pipeline Definition Files (p. 135).

• The AWS Data Pipeline web service interprets the pipeline definition and assigns tasks to workers to move and transform data.

• Task Runners poll the AWS Data Pipeline web service for tasks and then perform those tasks. In the previous example, Task Runner would copy log files to Amazon S3 and launch Amazon EMR job flows. Task Runner is installed and runs automatically on resources created by your pipeline definitions. You can write a custom task runner application, or you can use the Task Runner application that is provided by AWS Data Pipeline. For more information, see Task Runner (p. 5) and Custom Task Runner (p. 8).

The following illustration shows how these components work together. If the pipeline definition supports nonserialized tasks, AWS Data Pipeline can manage tasks for multiple task runners working in parallel.

Pipeline Definition

A pipeline definition is how you communicate your business logic to AWS Data Pipeline. It contains the following information:

• Names, locations, and formats of your data sources.

• Activities that transform the data.

• The schedule for those activities.

• Resources that run your activities and preconditions.

• Preconditions that must be satisfied before the activities can be scheduled.

• Ways to alert you with status updates as pipeline execution proceeds.

From your pipeline definition, AWS Data Pipeline determines the tasks that will occur, schedules them, and assigns them to task runners. If a task is not completed successfully, AWS Data Pipeline retries the task according to your instructions and, if necessary, reassigns it to another task runner. If the task fails repeatedly, you can configure the pipeline to notify you.

For example, in your pipeline definition, you might specify that in 2013, log files generated by your application will be archived each month to an Amazon S3 bucket. AWS Data Pipeline would then create 12 tasks, each copying over a month's worth of data, regardless of whether the month contained 30, 31, 28, or 29 days.


You can create a pipeline definition in the following ways:

• Graphically, by using the AWS Data Pipeline console.

• Textually, by writing a JSON file in the format used by the command line interface.

• Programmatically, by calling the web service with either one of the AWS SDKs or the AWS Data Pipeline API.

A pipeline definition can contain the following types of components:

Data Node
The location of input data for a task or the location where output data is to be stored. The following data locations are currently supported:
• Amazon S3 bucket
• MySQL database
• Amazon DynamoDB
• Local data node

Activity
An interaction with the data. The following activities are currently supported:
• Copy to a new location
• Launch an Amazon EMR job flow
• Run a Bash script from the command line (requires a UNIX environment to run the script)
• Run a database query
• Run a Hive activity

Precondition
A conditional statement that must be true before an action can run. The following preconditions are currently supported:
• A command-line Bash script was successfully completed (requires a UNIX environment to run the script)
• Data exists
• A specific time or a time interval relative to another event has been reached
• An Amazon S3 location contains data
• An Amazon RDS or Amazon DynamoDB table exists

Schedule
Any or all of the following:
• The time that an action should start
• The time that an action should stop
• How often the action should run

Resource
A resource that can analyze or modify data. The following computational resources are currently supported:
• Amazon EMR job flow
• Amazon EC2 instance

Action
A behavior that is triggered when specified conditions are met, such as the failure of an activity. The following actions are currently supported:
• Amazon SNS notification
• Terminate action

For more information, see Pipeline Definition Files (p. 135) .
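To make these relationships concrete, the following is a minimal, illustrative sketch of a pipeline definition file in the JSON format used by the command line interface. The object names and values are placeholders, and the fields shown (for example, startDateTime, period, onFail, and the {"ref" : ...} notation) are meant only to suggest how components refer to one another; see the object reference later in this guide (for example, Schedule (p. 163) and Ec2Resource (p. 205)) for the authoritative fields and value formats.

{
  "objects" : [
    {
      "id" : "MyDailySchedule",
      "type" : "Schedule",
      "startDateTime" : "2013-01-01T00:00:00",
      "period" : "1 day"
    },
    {
      "id" : "MyShellActivity",
      "type" : "ShellCommandActivity",
      "schedule" : {"ref" : "MyDailySchedule"},
      "runsOn" : {"ref" : "MyEc2Instance"},
      "command" : "echo hello",
      "onFail" : {"ref" : "MyFailureAlarm"}
    },
    {
      "id" : "MyEc2Instance",
      "type" : "Ec2Resource",
      "schedule" : {"ref" : "MyDailySchedule"}
    },
    {
      "id" : "MyFailureAlarm",
      "type" : "SnsAlarm",
      "topicArn" : "arn:aws:sns:us-east-1:123456789012:my-example-topic",
      "subject" : "Pipeline activity failed",
      "message" : "MyShellActivity did not complete successfully."
    }
  ]
}

The activity points to its schedule, resource, and failure action by reference, which is how AWS Data Pipeline assembles the components into runnable tasks.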

Lifecycle of a Pipeline

After you create a pipeline definition, you create a pipeline and then add your pipeline definition to it. Your pipeline must be validated. After you have a valid pipeline definition, you can activate it. At that point, the pipeline runs and schedules tasks. When you are done with your pipeline, you can delete it.

The complete lifecycle of a pipeline is shown in the following illustration.


Task Runners

A task runner is an application that polls AWS Data Pipeline for tasks and then performs those tasks.

You can either use Task Runner as provided by AWS Data Pipeline, or create a custom Task Runner application.

Task Runner

Task Runner is a default implementation of a task runner that is provided by AWS Data Pipeline. When Task Runner is installed and configured, it polls AWS Data Pipeline for tasks associated with pipelines that you have activated. When a task is assigned to Task Runner, it performs that task and reports its status back to AWS Data Pipeline. If your workflow requires non-default behavior, you'll need to implement that functionality in a custom task runner.

There are three ways you can use Task Runner to process your pipeline:

• AWS Data Pipeline installs Task Runner for you on resources that are launched and managed by the web service.

• You install Task Runner on a computational resource that you manage, such as a long-running Amazon EC2 instance or an on-premises server.

• You modify the Task Runner code to create a custom Task Runner, which you then install on a computational resource that you manage.


Task Runner on AWS Data Pipeline-Managed Resources

When a resource is launched and managed by AWS Data Pipeline, the web service automatically installs Task Runner on that resource to process tasks in the pipeline. You specify a computational resource (either an Amazon EC2 instance or an Amazon EMR job flow) for the runsOn field of an activity object. When AWS Data Pipeline launches this resource, it installs Task Runner on that resource and configures it to process all activity objects that have their runsOn field set to that resource. When AWS Data Pipeline terminates the resource, the Task Runner logs are published to an Amazon S3 location before the resource shuts down.

For example, if you use the EmrActivity action in a pipeline and specify an EmrCluster object in the runsOn field, when AWS Data Pipeline processes that activity, it launches an Amazon EMR job flow and uses a bootstrap step to install Task Runner onto the master node. This Task Runner then processes the tasks for activities that have their runsOn field set to that EmrCluster object. The following excerpt from a pipeline definition shows this relationship between the two objects.

{
  "id" : "MyEmrActivity",
  "name" : "Work to perform on my data",
  "type" : "EmrActivity",
  "runsOn" : {"ref" : "MyEmrCluster"},
  "preStepCommand" : "scp remoteFiles localFiles",
  "step" : "s3://myBucket/myPath/myStep.jar,firstArg,secondArg",
  "step" : "s3://myBucket/myPath/myOtherStep.jar,anotherArg",
  "postStepCommand" : "scp localFiles remoteFiles",
  "input" : {"ref" : "MyS3Input"},
  "output" : {"ref" : "MyS3Output"}
},
{
  "id" : "MyEmrCluster",
  "name" : "EMR cluster to perform the work",
  "type" : "EmrCluster",
  "hadoopVersion" : "0.20",
  "keypair" : "myKeyPair",
  "masterInstanceType" : "m1.xlarge",
  "coreInstanceType" : "m1.small",
  "coreInstanceCount" : "10",
  "instanceTaskType" : "m1.small",
  "instanceTaskCount" : "10",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-hadoop,arg1,arg2,arg3",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-otherstuff,arg1,arg2"
}

If you have multiple AWS Data Pipeline-managed resources in a pipeline, Task Runner is installed on each of them, and they all poll AWS Data Pipeline for tasks to process.

Task Runner on User-Managed Resources

You can install Task Runner on computational resources that you manage, such as a long-running Amazon EC2 instance or a physical server. This approach can be useful when, for example, you want to use AWS Data Pipeline to process data that is stored inside your organization's firewall. By installing Task Runner on a server in the local network, you can access the local database securely and then poll AWS Data Pipeline for the next task to run. When AWS Data Pipeline ends processing or deletes the pipeline, the Task Runner instance remains running on your computational resource until you manually shut it down. Similarly, the Task Runner logs persist after pipeline execution is complete.

You download Task Runner, which is in Java Archive (JAR) format, and install it on your computational resource. For more information about downloading and installing Task Runner, see Deploy and Configure Task Runner (p. 19). To connect a Task Runner that you've installed to the pipeline activities it should process, add a workerGroup field to the object, and configure Task Runner to poll for that worker group value. You do this by passing the worker group string as a parameter (for example, --workerGroup=wg-12345) when you run the Task Runner JAR file.


{
  "id" : "MyStoredProcedureActivity",
  "type" : "StoredProcedureActivity",
  "workerGroup" : "wg-12345",
  "command" : "mkdir new-directory"
}

Custom Task Runner

If your data management requires behavior other than the default behavior provided by Task Runner, you need to create a custom task runner. Because Task Runner is an open-source application, you can use it as the basis for creating your custom implementation.

After you write the custom task runner, you install it on a computational resource that you own, such as a long-running EC2 instance or a physical server inside your organization's firewall. To connect your custom task runner to the pipeline activities it should process, add a workerGroup field to the object, and configure your custom task runner to poll for that worker group value.


For example, if you use the ShellCommandActivity action in a pipeline and specify a value for the workerGroup field, when AWS Data Pipeline processes that activity, it passes the task to a task runner that polls the web service for work and specifies that worker group. The following excerpt from a pipeline definition shows how to configure the workerGroup field.

{
  "id" : "CreateDirectory",
  "type" : "ShellCommandActivity",
  "workerGroup" : "wg-67890",
  "command" : "mkdir new-directory"
}

When you create a custom task runner, you have complete control over how your pipeline activities are processed. The only requirement is that you communicate with AWS Data Pipeline as follows:

Poll for tasks—Your task runner should poll AWS Data Pipeline for tasks to process by calling the PollForTask API. If tasks are ready in the work queue, PollForTask returns a response immediately. If no tasks are available in the queue, PollForTask uses long polling and holds the connection open for up to 90 seconds, during which time any newly scheduled tasks are handed to the task agent. Your remote worker should not call PollForTask again on the same worker group until it receives a response, and this may take up to 90 seconds.

Report progress—Your task runner should report its progress to AWS Data Pipeline by calling the ReportTaskProgress API each minute. If a task runner does not report its status after 5 minutes, then every 20 minutes afterwards (configurable), AWS Data Pipeline assumes that the task runner is unable to process the task and assigns it in a subsequent response to PollForTask.

Signal completion of a task—Your task runner should inform AWS Data Pipeline of the outcome when it completes a task by calling the SetTaskStatus API. The task runner calls this action regardless of whether the task was successful. The task runner does not need to call SetTaskStatus for tasks canceled by AWS Data Pipeline.

Pipeline Components, Instances, and Attempts

There are three types of items associated with a scheduled pipeline:

Pipeline Components — Pipeline components represent the business logic of the pipeline and are represented by the different sections of a pipeline definition. Pipeline components specify the data sources, activities, schedule, and preconditions of the workflow. They can inherit properties from parent components. Relationships among components are defined by reference. Pipeline components define the rules of data management; they are not a to-do list.

Instances — When AWS Data Pipeline runs a pipeline, it compiles the pipeline components to create a set of actionable instances. Each instance contains all the information needed to perform a specific task. The complete set of instances is the to-do list of the pipeline. AWS Data Pipeline hands the instances out to task runners to process.

Attempts — To provide robust data management, AWS Data Pipeline retries a failed operation. It continues to do so until the task reaches the maximum number of allowed retry attempts. Attempt objects track the various attempts, results, and failure reasons if applicable. Essentially, it is the instance with a counter.

Note

Retrying failed tasks is an important part of a fault tolerance strategy, and AWS Data Pipeline pipeline definitions provide conditions and thresholds to control retries. However, too many retries can delay detection of an unrecoverable failure because AWS Data Pipeline does not report failure until it has exhausted all the retries that you specify. The extra retries may accrue additional charges if they are running on AWS resources. As a result, carefully consider when it is appropriate to exceed the AWS Data Pipeline default settings that you use to control retries and related settings.
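As a sketch of where such a threshold lives, a retry limit can be set directly on the object whose attempts it governs. The following snippet is illustrative only; maximumRetries is assumed here as the name of the retry-limit field, the other objects are placeholders, and the object reference later in this guide lists the fields your objects actually support.

{
  "id" : "MyCopyActivity",
  "type" : "CopyActivity",
  "schedule" : {"ref" : "MySchedule"},
  "input" : {"ref" : "MyInputData"},
  "output" : {"ref" : "MyOutputData"},
  "maximumRetries" : "2"
}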


Lifecycle of a Pipeline Task

The following diagram illustrates how AWS Data Pipeline and a task runner interact to process a scheduled task.


Get Set Up for AWS Data Pipeline

There are several ways you can interact with AWS Data Pipeline:

Console — a graphical interface you can use to create and manage pipelines. With it, you fill out web forms to specify the configuration details of your pipeline components. The AWS Data Pipeline console provides several templates, which are pre-configured pipelines for common scenarios. As you build your pipeline, a graphical representation of the components appears on the design pane, and arrows indicate the connections between the components. Using the console is the easiest way to get started with AWS Data Pipeline. It creates the pipeline definition for you, and no JSON or programming knowledge is required. The console is available online at https://console.aws.amazon.com/datapipeline/. For more information about accessing the console, see Access the Console (p. 12).

Command Line Interface (CLI) — an application you run on your local machine to connect to AWS Data Pipeline and create and manage pipelines. With it, you issue commands into a terminal window and pass in JSON files that specify the pipeline definition. Using the CLI is the best option if you prefer working from a command line. For more information, see Install the Command Line Interface (p. 15).

Software Development Kit (SDK) — AWS provides an SDK with functions that call AWS Data Pipeline to create and manage pipelines. With it, you can write applications that automate the process of creating and managing pipelines. Using the SDK is the best option if you want to extend or customize the functionality of AWS Data Pipeline. You can download the AWS SDK for Java from http://aws.amazon.com/sdkforjava/ .

Web Service API — AWS provides a low-level interface that you can use to call the web service directly using JSON. Using the API is the best option if you want to create a custom SDK that calls AWS Data Pipeline. For more information, see AWS Data Pipeline API Reference.

In addition, there is the Task Runner application, which is a default implementation of a task runner. Depending on the requirements of your data management, you may need to install Task Runner on a computational resource such as a long-running Amazon EC2 instance or a physical server. For more information about when to install Task Runner, see Task Runner (p. 5). For more information about how to install Task Runner, see Deploy and Configure Task Runner (p. 19).

Access the Console

Topics

Where Do I Go Now? (p. 15)


The AWS Data Pipeline console enables you to do the following:

• Create, save, and activate your pipeline

• View the details of all the pipelines associated with your account

• Modify your pipeline

• Delete your pipeline

You must have an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. When you create an AWS account, AWS automatically signs up the account for all AWS services, including AWS Data Pipeline. With AWS Data Pipeline, you pay only for what you use. For more information about AWS Data Pipeline usage rates, see AWS Data Pipeline.

If you have an AWS account already, skip to the next step. If you don't have an AWS account, use the following procedure to create one.

To create an AWS account

1. Go to AWS and click Sign Up Now.
2. Follow the on-screen instructions. Part of the sign-up process involves receiving a phone call and entering a PIN using the phone keypad.

To access the console

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console at https://console.aws.amazon.com/datapipeline/.
2. If your account doesn't already have data pipelines, the console displays the following introductory screen that prompts you to create your first pipeline. This screen also provides an overview of the process for creating a pipeline, and links to relevant documentation and resources. Click Create Pipeline to create your pipeline.


If you already have pipelines associated with your account, the console displays the page listing all the pipelines associated with your account. Click Create New Pipeline to create your pipeline.


Where Do I Go Now?

You are now ready to start creating your pipelines. For more information about creating a pipeline, see the following tutorials:

Tutorial: Copy CSV Data from Amazon S3 to Amazon S3 (p. 25)

Tutorial: Copy Data From a MySQL Table to Amazon S3 (p. 40)

Tutorial: Launch an Amazon EMR Job Flow (p. 56)

Tutorial: Run a Shell Command to Process MySQL Table (p. 107)

Install the Command Line Interface

The AWS Data Pipeline command line interface (CLI) is a tool you can use to create and manage pipelines from a terminal window. It is written in Ruby and makes calls to the web service on your behalf.

Topics

Install Ruby (p. 15)

Install the RubyGems package management framework (p. 15)

Install Prerequisite Ruby Gems (p. 16)

Install the AWS Data Pipeline CLI (p. 17)

Locate your AWS Credentials (p. 17)

Create a Credentials File (p. 18)

Verify the CLI (p. 18)

Install Ruby

The AWS Data Pipeline CLI requires Ruby 1.8.7. Some operating systems, such as Mac OS, come with Ruby pre-installed.

To verify the Ruby installation and version

• To check whether Ruby is installed, and which version, run the following command in a terminal window. If Ruby is installed, this command displays its version information.

ruby -v

If you don’t have Ruby 1.8.7 installed, use the following procedure to install it.

To install Ruby on Linux/Unix/Mac OS

• Download Ruby from http://www.ruby-lang.org/en/downloads/ and follow the installation instructions for your version of Linux/Unix/Mac OS.

Install the RubyGems package management framework

The AWS Data Pipeline CLI requires a version of RubyGems that is compatible with Ruby 1.8.7.


To verify the RubyGems installation and version

• To check whether RubyGems is installed, run the following command from a terminal window. If RubyGems is installed, this command displays its version information.

gem -v

If you don’t have RubyGems installed, or have a version not compatible with Ruby 1.8.7, you need to download and install RubyGems before you can install the AWS Data Pipeline CLI.

To install RubyGems on Linux/Unix/Mac OS

1. Download RubyGems from http://rubyforge.org/frs/?group_id=126.
2. Install RubyGems using the following command.

sudo ruby setup.rb

Install Prerequisite Ruby Gems

The AWS Data Pipeline CLI requires Ruby 1.8.7 or greater, a compatible version of RubyGems, and the following Ruby gems:

• json (version 1.4 or greater)

• uuidtools (version 2.1 or greater)

• httparty (version .7 or greater)

• bigdecimal (version 1.0 or greater)

• nokogiri (version 1.4.4 or greater)

The following topics describe how to install the AWS Data Pipeline CLI and the Ruby environment it requires.

Use the following procedures to ensure that each of the gems listed above is installed.

To verify whether a gem is installed

• To check whether a gem is installed, run the following command from a terminal window. For example, if 'uuidtools' is installed, this command displays the name and version of the 'uuidtools' RubyGem.

gem search 'uuidtools'

If you don't have 'uuidtools' installed, then you need to install it before you can install the AWS Data Pipeline CLI.

To install 'uuidtools' on Windows/Linux/Unix/Mac OS

• Install 'uuidtools' using the following command.


sudo gem install uuidtools

Install the AWS Data Pipeline CLI

After you have verified the installation of your Ruby environment, you're ready to install the AWS Data Pipeline CLI.

To install the AWS Data Pipeline CLI on Windows/Linux/Unix/Mac OS

1. Download datapipeline-cli.zip from https://s3.amazonaws.com/datapipeline-us-east-1/software/latest/DataPipelineCLI/.
2. Unzip the compressed file. For example, on Linux/Unix/Mac OS use the following command:

unzip datapipeline-cli.zip

This uncompresses the CLI and supporting code into a new directory called dp-cli.
3. If you add the new directory, dp-cli, to your PATH variable, you can use the CLI without specifying the complete path. In this guide, we assume that you've updated your PATH variable, or that you run the CLI from the directory where it is installed.

Locate your AWS Credentials

When you create an AWS account, AWS assigns you an access key ID and a secret access key. AWS uses these credentials to identify you when you interact with a web service. You need these keys for the next step of the CLI installation process.

Note

Your secret access key is a shared secret between you and AWS. Keep this key secret; we use it to bill you for the AWS services that you use. Never include the key in your requests to AWS, and never email it to anyone, even if a request appears to originate from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your secret access key.

The following procedure explains how to locate your access key ID and secret access key in the AWS Management Console.

To view your AWS access credentials

1. Go to the Amazon Web Services website at http://aws.amazon.com.
2. Click My Account/Console, and then click Security Credentials.
3. Under Your Account, click Security Credentials.
4. In the spaces provided, type your user name and password, and then click Sign in using our secure server.
5. Under Access Credentials, on the Access Keys tab, your access key ID is displayed. To view your secret key, under Secret Access Key, click Show.

Make a note of your access key ID and your secret access key; you will use them in the next section.


Create a Credentials File

When you request services from AWS Data Pipeline, you must pass your credentials with the request so that AWS can properly authenticate and eventually bill you. The command line interface obtains your credentials from a JSON document called a credentials file, which is stored in your home directory, ~/. Using a credentials file is the simplest way to make your AWS credentials available to the AWS Data Pipeline CLI.

The credentials file contains the following name-value pairs.

comment: An optional comment within the credentials file.
access-id: The access key ID for your AWS account.
private-key: The secret access key for your AWS account.
endpoint: The endpoint for AWS Data Pipeline in the region where you are making requests.
log-uri: The location of the Amazon S3 bucket where AWS Data Pipeline writes log files.

In the following example credentials file, AKIAIOSFODNN7EXAMPLE represents an access key ID, and wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY represents the corresponding secret access key. The value of log-uri specifies the location of your Amazon S3 bucket and the path to the log files for actions performed by the AWS Data Pipeline web service on behalf of your pipeline.

{
  "access-id": "AKIAIOSFODNN7EXAMPLE",
  "private-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  "endpoint": "datapipeline.us-east-1.amazonaws.com",
  "port": "443",
  "use-ssl": "true",
  "region": "us-east-1",
  "log-uri": "s3://myawsbucket/logfiles"
}

After you replace the values for the access-id, private-key, and log-uri fields with the appropriate information, save the file as credentials.json in your home directory, ~/.

Verify the CLI

To verify that the command line interface (CLI) is installed, use the following command.

datapipeline --help

If the CLI is installed correctly, this command displays the list of commands for the CLI.


Deploy and Configure Task Runner

Task Runner is a task runner application that polls AWS Data Pipeline for scheduled tasks and processes the tasks assigned to it by the web service, reporting status as it does so.

Depending on your application, you may choose to:

• Have AWS Data Pipeline install and manage one or more Task Runner applications for you on computational resources managed by the web service. In this case, you do not need to install or configure Task Runner.

• Manually install and configure Task Runner on a computational resource such as a long-running Amazon EC2 instance or a physical server. To do so, use the following procedures.

• Manually install and configure a custom task runner instead of Task Runner. The procedures for doing so depend on the implementation of the custom task runner.

For more information about Task Runner and when and where it should be configured, see Task Runner (p. 5).

Note

You can only install Task Runner on Linux, UNIX, or Mac OS. Task Runner is not supported on the Windows operating system.

Topics

Install Java (p. 19)

Install Task Runner (p. 235)

Start Task Runner (p. 235)

Verify Task Runner (p. 236)

Install Java

Task Runner requires Java version 1.6 or later. To determine whether Java is installed, and the version that is running, use the following command:

java -version

If you do not have Java 1.6 or later installed on your computer, you can download the latest version from http://www.oracle.com/technetwork/java/index.html.

Install Task Runner

To install Task Runner, download TaskRunner-1.0.jar from Task Runner download and copy it into a folder. Additionally, download mysql-connector-java-5.1.18-bin.jar from http://dev.mysql.com/usingmysql/java/ and copy it into the same folder where you install Task Runner.

Start Task Runner

In a new command prompt window that is set to the directory where you installed Task Runner, start Task Runner with the following command. The --config option points to your credentials file. The --workerGroup option specifies the name of your worker group, which must be the same value as specified in your pipeline for tasks to be processed.


java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=myWorkerGroup

When Task Runner is active, it prints to the terminal window the path where log files are written. The following is an example.

Logging to /myComputerName/.../dist/output/logs

Warning

If you close the terminal window, or interrupt the command with CTRL+C, Task Runner stops, which halts the pipeline runs.

Verify Task Runner

The easiest way to verify that Task Runner is working is to check whether it is writing log files. The log files are stored in the directory where you started Task Runner.

When you check the logs, make sure that you are checking logs for the current date and time. Task Runner creates a new log file each hour, where the hour from midnight to 1am is 00. So the format of the log file name is TaskRunner.log.YYYY-MM-DD-HH, where HH runs from 00 to 23, in UTC. To save storage space, any log files older than eight hours are compressed with GZip.

Install the AWS SDK

The easiest way to write applications that interact with AWS Data Pipeline or to implement a custom task runner is to use one of the AWS SDKs. The AWS SDKs provide functionality that simplifies calling the web service APIs from your preferred programming environment.

For more information about the programming languages and development environments that have AWS SDK support, see the AWS SDK listings.

If you are not writing programs that interact with AWS Data Pipeline, you do not need to install any of the AWS SDKs. You can create and run pipelines using the console or command-line interface.

This guide provides examples of programming AWS Data Pipeline using Java. The following are examples of how to download and install the AWS SDK for Java.

To install the AWS SDK for Java using Eclipse

• Install the AWS Toolkit for Eclipse .

Eclipse is a popular Java development environment. The AWS Toolkit for Eclipse installs the latest version of the AWS SDK for Java. From Eclipse, you can easily modify, build, and run any of the samples included in the SDK.

To install the AWS SDK for Java

• If you are using a Java development environment other than Eclipse, download and install the AWS SDK for Java.


Granting Permissions to Pipelines with IAM

In AWS Data Pipeline, IAM roles determine what your pipeline can access and the actions it can perform. Additionally, when your pipeline creates a resource, such as an Amazon EC2 instance, IAM roles determine the EC2 instance's permitted resources and actions. When you create a pipeline, you specify one IAM role that governs your pipeline and another IAM role to govern your pipeline's resources (referred to as a "resource role"); these can be the same role. Carefully consider the minimum permissions necessary for your pipeline to perform work and define the IAM roles accordingly.

It is important to note that even a modest pipeline might need access to resources and actions in various areas of AWS, for example:

• Accessing files in Amazon S3

• Creating and managing Amazon EMR clusters

• Creating and managing Amazon EC2 instances

• Accessing data in Amazon RDS or Amazon DynamoDB

• Sending notifications using Amazon SNS

When you use the AWS Data Pipeline console, you can choose a pre-defined, default IAM role and resource role or create a new one to suit your needs. However, when using the AWS Data Pipeline CLI, you must create a new IAM role and apply a policy to it yourself, for which you can use the following example policy. For more information about how to create a new IAM role and apply a policy to it, see Managing IAM Policies in the Using IAM guide.

Warning

Carefully review and restrict permissions in the following example policy to only the resources that your pipeline requires.

{
  "Statement": [
    {
      "Action": [
        "s3:*"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    },
    {
      "Action": [
        "ec2:DescribeInstances",
        "ec2:RunInstances",
        "ec2:StartInstances",
        "ec2:StopInstances",
        "ec2:TerminateInstances"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    },
    {
      "Action": [
        "elasticmapreduce:*"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    },
    {
      "Action": [
        "dynamodb:*"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    },
    {
      "Action": [
        "rds:DescribeDBInstances",
        "rds:DescribeDBSecurityGroups"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    },
    {
      "Action": [
        "sns:GetTopicAttributes",
        "sns:ListTopics",
        "sns:Publish",
        "sns:Subscribe",
        "sns:Unsubscribe"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    },
    {
      "Action": [
        "iam:PassRole"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    },
    {
      "Action": [
        "datapipeline:*"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    }
  ]
}


After you define a role and apply its policy, you define a trusted entities list, which indicates the entities or services that are permitted to use your new role. You can use the following IAM trust relationship definition to allow AWS Data Pipeline and Amazon EC2 to use your new pipeline and resource roles. For more information about editing IAM trust relationships, see Modifying a Role in the Using IAM guide.

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "ec2.amazonaws.com",
          "datapipeline.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Grant Amazon RDS Permissions to Task Runner

Amazon RDS allows you to control access to your DB Instances using database security groups (DB Security Groups). A DB Security Group acts like a firewall controlling network access to your DB Instance. By default, network access is turned off to your DB Instances. You must modify your DB Security Groups to let Task Runner access your Amazon RDS instances. Task Runner gains Amazon RDS access from the instance on which it runs, so the accounts and security groups that you add to your Amazon RDS instance depend on where you install Task Runner.

To grant permissions to Task Runner

1. Sign in to the AWS Management Console and open the Amazon RDS console.
2. In the Amazon RDS: My DB Security Groups pane, click your Amazon RDS instance. In the DB Security Group pane, under Connection Type, select EC2 Security Group. Configure the fields in the EC2 Security Group pane as described below.

For Task Runner running on an EC2 resource:
• AWS Account Id: Your AccountId; EC2 Security Group: Your Security Group Name

For Task Runner running on an EMR resource:
• AWS Account Id: Your AccountId; EC2 Security Group: ElasticMapReduce-master
• AWS Account Id: Your AccountId; EC2 Security Group: ElasticMapReduce-slave

For Task Runner running in your local environment (on-premises):
• CIDR: The IP address range of your on-premises machine, or of your firewall if your on-premises computer is behind a firewall.

To allow connections from an RdsSqlPrecondition:
• AWS Account Id: 793385162516; EC2 Security Group: DataPipeline


Tutorial: Copy CSV Data from Amazon S3 to Amazon S3

After you read What is AWS Data Pipeline? (p. 1) and decide you want to use AWS Data Pipeline to automate the movement and transformation of your data, it is time to get started with creating data pipelines. To help you make sense of how AWS Data Pipeline works, let's walk through a simple task.

This tutorial walks you through the process of creating a data pipeline to copy data from one Amazon S3 bucket to another and then send an Amazon SNS notification after the copy activity completes successfully.

You use the Amazon EC2 instance resource managed by AWS Data Pipeline for this copy activity.

Important

This tutorial does not employ the Amazon S3 API for high speed data transfer between Amazon S3 buckets. It is intended only for demonstration purposes to help new customers understand how to create a simple pipeline and the related concepts. For advanced information about data transfer using Amazon S3, see Working with Buckets in the Amazon S3 Developer Guide.

The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each object. For more information, see Pipeline Definition (p. 2).

This tutorial uses the following objects to create a pipeline definition:

Activity
The activity that AWS Data Pipeline performs for this pipeline. This tutorial uses the CopyActivity object to copy CSV data from one Amazon S3 bucket to another.

Important
There are distinct limitations regarding the CSV file format with CopyActivity and S3DataNode. For more information, see CopyActivity (p. 180).

Schedule
The start date, time, and the recurrence for this activity. You can optionally specify the end date and time.

Resource
The resource that AWS Data Pipeline must use to perform this activity. This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to copy data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates the instance after the task finishes.

DataNodes
The input and output nodes for this pipeline. This tutorial uses S3DataNode for both the input and output nodes.

Action
The action that AWS Data Pipeline must take when the specified conditions are met. This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the email address you specify after the task finishes successfully. You must subscribe to the Amazon SNS topic ARN to receive the notifications.
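As a rough sketch of how these objects fit together once the pipeline definition is built, the copy activity can reference the notification action through an onSuccess field, as in the following excerpt. The object names match the ones used later in this tutorial, but the topic ARN is a placeholder and the field names are meant only as an approximation of what the console generates; see SnsAlarm (p. 213) and CopyActivity (p. 180) for the authoritative fields.

{
  "id" : "copy-myS3-data",
  "type" : "CopyActivity",
  "schedule" : {"ref" : "copy-myS3-data-schedule"},
  "input" : {"ref" : "MyS3Input"},
  "output" : {"ref" : "MyS3Output"},
  "runsOn" : {"ref" : "CopyDataInstance"},
  "onSuccess" : {"ref" : "CopyDataNotice"}
},
{
  "id" : "CopyDataNotice",
  "type" : "SnsAlarm",
  "topicArn" : "arn:aws:sns:us-east-1:123456789012:my-example-topic",
  "subject" : "Copy succeeded",
  "message" : "The S3-to-S3 copy activity completed successfully."
}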

The following steps outline how to create a data pipeline to copy data from one Amazon S3 bucket to another Amazon S3 bucket.

1. Create your pipeline definition

2. Validate and save your pipeline definition

3. Activate your pipeline

4. Monitor the progress of your pipeline

5. [Optional] Delete your pipeline

Before You Begin...

Be sure you've completed the following steps.

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).

• Set up the AWS Data Pipeline tools and interface you plan on using. For more information on interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).

• Create an Amazon S3 bucket as a data source. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Upload your data to your Amazon S3 bucket. For more information, see Add an Object to a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Create another Amazon S3 bucket as a data target.

• Create an Amazon SNS topic for sending email notification and make a note of the topic Amazon Resource Name (ARN). For more information, see Create a Topic in the Amazon Simple Notification Service Getting Started Guide.

• [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your own IAM role policy and trust relationships, follow the instructions described in Granting Permissions to Pipelines with IAM (p. 21).

Note

Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier .


Using the AWS Data Pipeline Console

Topics

Create and Configure the Pipeline Definition Objects (p. 27)

Validate and Save Your Pipeline (p. 30)

Verify your Pipeline Definition (p. 30)

Activate your Pipeline (p. 31)

Monitor the Progress of Your Pipeline Runs (p. 31)

[Optional] Delete your Pipeline (p. 33)

The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.

To create your pipeline definition

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.

2. Click Create Pipeline.

3. On the Create a New Pipeline page:

a. In the Pipeline Name box, enter a name (for example, CopyMyS3Data).
b. In Pipeline Description, enter a description.
c. Leave the Select Schedule Type: button set to the default type for this tutorial.

Note
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

d. Leave the Role boxes set to their default values for this tutorial.

Note
If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.

e. Click Create a new pipeline.

Create and Configure the Pipeline Definition Objects

Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.

1. On the Pipeline: name of your pipeline page, select Add activity.

2. In the Activities pane:

a. Enter the name of the activity; for example, copy-myS3-data.
b. In the Type box, select CopyActivity.
c. In the Input box, select Create new: DataNode.
d. In the Output box, select Create new: DataNode.
e. In the Schedule box, select Create new: Schedule.
f. In the Add an optional field... box, select RunsOn.
g. In the Runs On box, select Create new: Resource.
h. In the Add an optional field... box, select On Success.
i. In the On Success box, select Create new: Action.
j. In the left pane, separate the icons by dragging them apart.

You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline uses to perform the copy activity.

The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connection between the various objects.

Next, configure the run date and time for your pipeline.

To configure run date and time for your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.

2. In the Schedules pane:

a. Enter a schedule name for this activity (for example, copy-myS3-data-schedule).
b. In the Type box, select Schedule.
c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.

Note
AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only.

d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).
e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.

To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.

Next, configure the input and the output data nodes for your pipeline.


To configure the input and output data nodes of your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click DataNodes.

2. In the DataNodes pane:

   a. In the DefaultDataNode1 Name box, enter the name for your input node (for example, MyS3Input). In this tutorial, your input node is the Amazon S3 data source bucket.

   b. In the Type box, select S3DataNode.

   c. In the Schedule box, select copy-myS3-data-schedule.

   d. In the Add an optional field ... box, select File Path.

   e. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-input/name of your data file).

   f. In the DefaultDataNode2 Name box, enter the name for your output node (for example, MyS3Output). In this tutorial, your output node is the Amazon S3 data target bucket.

   g. In the Type box, select S3DataNode.

   h. In the Schedule box, select copy-myS3-data-schedule.

   i. In the Add an optional field ... box, select File Path.

   j. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your data file).

Next, configure the resource AWS Data Pipeline must use to perform the copy activity.

To configure the resource

1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.

2. In the Resources pane:

   a. In the Name box, enter the name for your resource (for example, CopyDataInstance).

   b. In the Type box, select Ec2Resource.

   c. In the Schedule box, select copy-myS3-data-schedule.

   d. Leave the Role and Resource Role boxes set to their default values for this tutorial.

      Note
      If you have created your own IAM roles, you can select them now.

Next, configure the SNS notification action AWS Data Pipeline must perform after the copy activity finishes successfully.

To configure the SNS notification action

1. On the Pipeline: name of your pipeline page, in the right pane, click Others.

2. In the Others pane:

   a. In the DefaultAction1 Name box, enter the name for your Amazon SNS notification (for example, CopyDataNotice).

   b. In the Type box, select SnsAlarm.

   c. In the Topic Arn box, enter the ARN of your Amazon SNS topic.

   d. In the Message box, enter the message content.

   e. In the Subject box, enter the subject line for your notification.

   f. Leave the Role box set to the default value for this tutorial.

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.

If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete but you still get a validation error, you must fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1. On the Pipeline: name of your pipeline page, click Save Pipeline.

2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message. If you get an error message, click Close and then, in the right pane, click Errors.

3. The Errors pane lists the objects that failed validation. Click the plus (+) sign next to the object names and look for an error message in red.

4. When you see the error message, click the specific object pane where the error appears and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.

5. After you've fixed the errors listed in the Errors pane, click Save Pipeline.

6. Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.

Verify your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition

1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.

2. On the List Pipelines page, check whether your newly created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.

3. Click the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.

4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.

5. Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.

2. On the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens to confirm the activation.

3. Click Close.

Next, verify that your pipeline is running.

Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.

2. The Instance details: name of your pipeline page lists the status of each instance.

   Note
   If you do not see runs listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.

3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity. You should receive an email about the successful completion of this task at the account you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify that the data was copied.

4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. To troubleshoot failed or incomplete instance runs:

   a. Click the triangle next to an instance. The Instance summary panel opens to show the details of the selected instance.

   b. In the Instance summary panel, click View instance fields to see details of the fields associated with the selected instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure; for example, @failureReason = Resource not healthy terminated.

   c. In the Instance summary panel, in the Select attempt for this instance box, select the attempt number.

   d. In the Instance summary panel, click View attempt fields to see details of the fields associated with the selected attempt.

5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary panel and the View instance fields box to troubleshoot issues with your failed pipeline.

For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).


Important

Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1. On the List Pipelines page, select the check box next to your pipeline.

2. Click Delete.

3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

Topics

Define a Pipeline in JSON Format (p. 33)

Upload the Pipeline Definition (p. 38)

Activate the Pipeline (p. 39)

Verify the Pipeline Status (p. 39)

The following topics explain how to use the AWS Data Pipeline CLI to create and use pipelines to copy data from one Amazon S3 bucket to another. In this example, we perform the following steps:

• Create a pipeline definition using the CLI in JSON format

• Create the necessary IAM roles and define a policy and trust relationships

• Upload the pipeline definition using the AWS Data Pipeline CLI tools

• Monitor the progress of the pipeline

Define a Pipeline in JSON Format

This example scenario shows how to use JSON pipeline definitions and the AWS Data Pipeline CLI to schedule copying data between two Amazon S3 buckets at a specific time interval. This is the full pipeline definition JSON file followed by an explanation for each of its sections.

Note

We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.

{
  "objects": [
    {
      "id": "MySchedule",
      "type": "Schedule",
      "startDateTime": "2012-11-25T00:00:00",
      "endDateTime": "2012-11-26T00:00:00",
      "period": "1 day"
    },
    {
      "id": "S3Input",
      "type": "S3DataNode",
      "schedule": { "ref": "MySchedule" },
      "filePath": "s3://testbucket/file.txt"
    },
    {
      "id": "S3Output",
      "type": "S3DataNode",
      "schedule": { "ref": "MySchedule" },
      "filePath": "s3://testbucket/file-copy.txt"
    },
    {
      "id": "MyEC2Resource",
      "type": "Ec2Resource",
      "schedule": { "ref": "MySchedule" },
      "actionOnTaskFailure": "terminate",
      "actionOnResourceFailure": "retryAll",
      "maximumRetries": "1",
      "role": "test-role",
      "resourceRole": "test-role",
      "instanceType": "m1.medium",
      "instanceCount": "1",
      "securityGroups": [ "test-group", "default" ],
      "keyPair": "test-pair"
    },
    {
      "id": "MyCopyActivity",
      "type": "CopyActivity",
      "runsOn": { "ref": "MyEC2Resource" },
      "input": { "ref": "S3Input" },
      "output": { "ref": "S3Output" },
      "schedule": { "ref": "MySchedule" }
    }
  ]
}

Schedule

The example AWS Data Pipeline JSON file begins with a section that defines the schedule by which to copy the data. Many pipeline components have a reference to a schedule, and you may have more than one. The Schedule component is defined by the following fields:

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-25T00:00:00",
  "endDateTime": "2012-11-26T00:00:00",
  "period": "1 day"
},

Note

In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.

Name
The user-defined name for the pipeline schedule, which is a label for your reference only.
Type
The pipeline component type, which is Schedule.
startDateTime
The date/time (in UTC format) that you want the task to begin.
endDateTime
The date/time (in UTC format) that you want the task to stop.
period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline copy operation can run only one time.
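For comparison, here is a hypothetical variation of the same Schedule (the id and dates are placeholders) that uses a shorter period. Because the one-day window between startDateTime and endDateTime divides evenly into one-hour intervals, AWS Data Pipeline would schedule the copy operation 24 times instead of once:

{
  "id": "MyHourlySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-25T00:00:00",
  "endDateTime": "2012-11-26T00:00:00",
  "period": "1 hour"
},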

Amazon S3 Data Nodes

Next, the input S3DataNode pipeline component defines a location for the input files; in this case, an Amazon S3 bucket location. The input S3DataNode component is defined by the following fields:

{
  "id": "S3Input",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filePath": "s3://testbucket/file.txt"
},

Name

The user-defined name for the input location (a label for your reference only).

Type

The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.


Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled "MySchedule".
Path
The path to the data associated with the data node. The syntax for a data node is determined by its type. For example, a data node for a database follows a different syntax that is appropriate for a database table.
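For example, an S3DataNode can also refer to a set of files in a folder rather than a single file. The following sketch is hypothetical: it assumes the directoryPath field and a placeholder bucket path, so check the S3DataNode reference for the exact fields your pipeline supports:

{
  "id": "S3InputFolder",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "directoryPath": "s3://testbucket/input-folder/"
},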

Next, the output S3DataNode component defines the output destination location for the data. It follows the same format as the input S3DataNode component, except for the name of the component and a different path that indicates the target file.

{
  "id": "S3Output",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filePath": "s3://testbucket/file-copy.txt"
},

Resource

This is a definition of the computational resource that performs the copy operation. In this example, AWS Data Pipeline should automatically create an EC2 instance to perform the copy task and terminate the resource after the task completes. The fields defined here control the creation and function of the Amazon EC2 instance that does the work. The EC2Resource is defined by the following fields:

{
  "id": "MyEC2Resource",
  "type": "Ec2Resource",
  "schedule": { "ref": "MySchedule" },
  "actionOnTaskFailure": "terminate",
  "actionOnResourceFailure": "retryAll",
  "maximumRetries": "1",
  "role": "test-role",
  "resourceRole": "test-role",
  "instanceType": "m1.medium",
  "instanceCount": "1",
  "securityGroups": [ "test-group", "default" ],
  "keyPair": "test-pair"
},

Name
The user-defined name for the resource, which is a label for your reference only.
Type
The type of computational resource to perform work; in this case, an Amazon EC2 instance. There are other resource types available, such as an EmrCluster type.
Schedule
The schedule on which to create this computational resource.
actionOnTaskFailure
The action to take on the Amazon EC2 instance if the task fails. In this case, the instance should terminate so that each failed attempt to copy data does not leave behind idle, abandoned Amazon EC2 instances with no work to perform. Such instances require manual termination by an administrator.
actionOnResourceFailure
The action to perform if the resource is not created successfully. In this case, retry the creation of an Amazon EC2 instance until it is successful.
maximumRetries
The number of times to retry the creation of this computational resource before marking the resource as a failure and stopping any further creation attempts. This setting works in conjunction with the actionOnResourceFailure field.
Role
The IAM role of the account that accesses resources, such as accessing an Amazon S3 bucket to retrieve data.
resourceRole
The IAM role of the account that creates resources, such as creating and configuring an Amazon EC2 instance on your behalf. Role and ResourceRole can be the same role, but specifying them separately provides greater granularity in your security configuration.
instanceType
The size of the Amazon EC2 instance to create. Ensure that you set the size of EC2 instance that best matches the load of the work that you want to perform with AWS Data Pipeline. In this case, we set an m1.medium EC2 instance. For more information about the different instance types and when to use each one, see the Amazon EC2 Instance Types topic at http://aws.amazon.com/ec2/instance-types/.
instanceCount
The number of Amazon EC2 instances in the computational resource pool to service any pipeline components that depend on this resource.
securityGroups
The Amazon EC2 security groups to grant access to this EC2 instance. As the example shows, you can provide more than one group at a time (test-group and default).
keyPair
The name of the SSH public/private key pair used to log in to the Amazon EC2 instance. For more information, see Amazon EC2 Key Pairs.

Activity

The last section in the JSON file is the definition of the activity that represents the work to perform. This example uses CopyActivity to copy data from a file in an Amazon S3 bucket to another file. The CopyActivity component is defined by the following fields:

{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "runsOn": { "ref": "MyEC2Resource" },
  "input": { "ref": "S3Input" },
  "output": { "ref": "S3Output" },
  "schedule": { "ref": "MySchedule" }
}

Name
The user-defined name for the activity, which is a label for your reference only.
Type
The type of activity to perform; in this case, CopyActivity.
runsOn
The computational resource that performs the work that this activity defines. In this example, we provide a reference to the Amazon EC2 instance defined previously. Using the runsOn field causes AWS Data Pipeline to create the EC2 instance for you. The runsOn field indicates that the resource exists in the AWS infrastructure, while a workerGroup value indicates that you want to use your own on-premises resources to perform the work (see the sketch after this list).
Schedule
The schedule on which to run this activity.
Input
The location of the data to copy.
Output
The target location for the data.
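The following hypothetical variation sketches what the same activity could look like if it ran on your own resources through a worker group instead of an AWS Data Pipeline-managed EC2 instance. The worker group name is a placeholder, and no Ec2Resource object would be needed in that case; see the Task Runner documentation for how worker groups are set up:

{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "workerGroup": "my-onpremises-workers",
  "input": { "ref": "S3Input" },
  "output": { "ref": "S3Output" },
  "schedule": { "ref": "MySchedule" }
}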

Upload the Pipeline Definition

You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).

To upload your pipeline definition, use the following command.

On Linux/Unix/Mac OS:

./datapipeline --create pipeline_name --put pipeline_file

On Windows:

ruby datapipeline --create pipeline_name --put pipeline_file

Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.

If your pipeline validates successfully, you receive the following message:

Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline definition pipeline_file.json uploaded.

Note
For more information about any errors returned by the --create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).

Ensure that your pipeline appears in the pipeline list by using the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-pipelines

On Windows:

ruby datapipeline --list-pipelines

The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier that uses the format df-AKIAIOSFODNN7EXAMPLE.


Activate the Pipeline

You must activate the pipeline, by using the --activate command-line parameter, before it begins performing work. Use the following command.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

Verify the Pipeline Status

View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.

Note
It is important to note the difference between the Scheduled Start date/time vs. the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.

Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.

Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status.

Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with their own final status.

Tutorial: Copy Data From a MySQL Table to Amazon S3

Topics

Before You Begin ... (p. 41)

Using the AWS Data Pipeline Console (p. 42)

Using the Command Line Interface (p. 48)

This tutorial walks you through the process of creating a data pipeline to copy data (rows) from a table in a MySQL database to a CSV (comma-separated values) file in an Amazon S3 bucket, and then send an Amazon SNS notification after the copy activity completes successfully. You will use the Amazon EC2 instance resource provided by AWS Data Pipeline for this copy activity.

The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each pipeline object. For more information about pipeline definitions, see Pipeline Definition (p. 2).

This tutorial uses the following objects to create a pipeline definition:

Activity
The activity that AWS Data Pipeline must perform for this pipeline. This tutorial uses CopyActivity to copy data from a MySQL table to an Amazon S3 bucket.
Schedule
The start date, time, and the duration for this activity. You can optionally specify the end date and time.
Resource
The resource AWS Data Pipeline must use to perform this activity. This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to copy data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates the instance after the task finishes.
DataNodes
The input and output nodes for this pipeline. This tutorial uses MySQLDataNode for the source data and S3DataNode for the target data.
Action
The action AWS Data Pipeline must take when the specified conditions are met. This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the email address you specify, after the task finishes successfully.

For information about the additional objects and fields supported by the copy activity, see CopyActivity (p. 180).

The following steps outline how to create a data pipeline to copy data from a MySQL table to an Amazon S3 bucket.

1. Create your pipeline definition

2. Create and configure the pipeline definition objects

3. Validate and save your pipeline definition

4. Verify that your pipeline definition is saved

5. Activate your pipeline

6. Monitor the progress of your pipeline

7. [Optional] Delete your pipeline

Before You Begin ...

Be sure you've completed the following steps.

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).

• Set up the AWS Data Pipeline tools and interface you plan on using. For more information about the interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).

• Create an Amazon S3 bucket as the data target. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Create and launch a MySQL database instance as a data source. For more information, see Launch a DB Instance in the Amazon Relational Database Service (RDS) Getting Started Guide.

  Note
  Make a note of the user name and the password you used for creating the MySQL instance. After you've launched your MySQL database instance, make a note of the instance's endpoint. You will need all this information in this tutorial.

• Connect to your MySQL database instance, create a table, and then add test data values to the newly created table. For more information, go to Create a Table in the MySQL documentation.

• Create an Amazon SNS topic for sending email notification and make a note of the topic Amazon Resource Name (ARN). For more information, go to Create a Topic in the Amazon Simple Notification Service Getting Started Guide.

• [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your own IAM role policy and trust relationships, follow the instructions described in Granting Permissions to Pipelines with IAM (p. 21).


Note

Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.

Using the AWS Data Pipeline Console

Topics

Create and Configure the Pipeline Definition Objects (p. 42)

Validate and Save Your Pipeline (p. 45)

Verify Your Pipeline Definition (p. 45)

Activate your Pipeline (p. 46)

Monitor the Progress of Your Pipeline Runs (p. 47)

[Optional] Delete your Pipeline (p. 48)

The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.

To create your pipeline definition

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.

2. Click Create Pipeline.

3. On the Create a New Pipeline page:

   a. In the Pipeline Name box, enter a name (for example, CopyMySQLData).

   b. In Pipeline Description, enter a description.

   c. Leave the Select Schedule Type button set to the default type for this tutorial.

      Note
      Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

   d. Leave the Role boxes set to their default values for this tutorial.

      Note
      If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.

   e. Click Create a new pipeline.

Create and Configure the Pipeline Definition Objects

Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.

1. On the Pipeline: name of your pipeline page, click Add activity.

2. In the Activities pane:

   a. Enter the name of the activity (for example, copy-mysql-data).

   b. In the Type box, select CopyActivity.

   c. In the Input box, select Create new: DataNode.

   d. In the Schedule box, select Create new: Schedule.

   e. In the Output box, select Create new: DataNode.

   f. In the Add an optional field ... box, select RunsOn.

   g. In the Runs On box, select Create new: Resource.

   h. In the Add an optional field ... box, select On Success.

   i. In the On Success box, select Create new: Action.

   j. In the left pane, separate the icons by dragging them apart.

You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline will use to perform the copy activity.

The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connection between the various objects.

Next, configure the run date and time for your pipeline.

To configure the run date and time for your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.

2. In the Schedules pane:

   a. Enter a schedule name for this activity (for example, copy-mysql-data-schedule).

   b. In the Type box, select Schedule.

   c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.

      Note
      AWS Data Pipeline supports dates and times expressed in "YYYY-MM-DDTHH:MM:SS" format, in UTC/GMT only.

   d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).

   e. [Optional] To specify the date and time to end the activity, in the Add an optional field ... box, select endDateTime, and enter the date and time.

To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.

Next, configure the input and output data nodes for your pipeline.

To configure the input and output data nodes of your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click DataNodes.

2. In the DataNodes pane:

   a. In the DefaultDataNode1 Name box, enter the name for your input node (for example, MySQLInput). In this tutorial, your input node is the Amazon RDS MySQL instance you just created.

   b. In the Type box, select MySQLDataNode.

   c. In the Username box, enter the user name you used when you created your MySQL database instance.

   d. In the Connection String box, enter the endpoint of your MySQL database instance (for example, mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com).

   e. In the *Password box, enter the password you used when you created your MySQL database instance.

   f. In the Table box, enter the name of the source MySQL database table (for example, input-table).

   g. In the Schedule box, select copy-mysql-data-schedule.

   h. In the DefaultDataNode2 Name box, enter the name for your output node (for example, MyS3Output). In this tutorial, your output node is the Amazon S3 data target bucket.

   i. In the Type box, select S3DataNode.

   j. In the Schedule box, select copy-mysql-data-schedule.

   k. In the Add an optional field ... box, select File Path.

   l. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your csv file).

Next, configure the resource that AWS Data Pipeline must use to perform the copy activity.

To configure the resource

1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.

2. In the Resources pane:

   a. In the Name box, enter the name for your resource (for example, CopyDataInstance).

   b. In the Type box, select Ec2Resource.

   c. In the Schedule box, select copy-mysql-data-schedule.

Next, configure the SNS notification action that AWS Data Pipeline must perform after the copy activity finishes successfully.


To configure the SNS notification action

1. On the Pipeline: name of your pipeline page, in the right pane, click Others.

2. In the Others pane:

   a. In the DefaultAction1 Name box, enter the name for your Amazon SNS notification (for example, CopyDataNotice).

   b. In the Type box, select SnsAlarm.

   c. In the Message box, enter the message content.

   d. Leave the entry in the Role box set to the default value.

   e. In the Topic Arn box, enter the ARN of your Amazon SNS topic.

   f. In the Subject box, enter the subject line for your notification.

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.

If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete but you still get a validation error, you must fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1. On the Pipeline: name of your pipeline page, click Save Pipeline.

2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message.

3. If you get an error message, click Close and then, in the right pane, click Errors.

4. The Errors pane lists the objects that failed validation. Click the plus (+) sign next to the object names and look for an error message in red.

5. When you see the error message, click the specific object pane where the error appears and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.

6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.

7. Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.

Verify Your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition

1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.

2. On the List Pipelines page, check whether your newly created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.

3. Click the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.

4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.

5. Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.

2. On the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens to confirm the activation.

3. Click Close.

Next, verify that your pipeline is running.


Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.

2. The Instance details: name of your pipeline page lists the status of each instance.

   Note
   If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.

3. If the Status column of all the objects in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity. You should receive an email about the successful completion of this task at the account you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify that the data was copied.

4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. To troubleshoot failed or incomplete instance runs:

   a. Click the triangle next to an instance. The Instance summary panel opens to show the details of the selected instance.

   b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the additional details box has an entry indicating the reason for failure; for example, @failureReason = Resource not healthy terminated.

   c. In the Instance summary panel, in the Select attempt for this instance box, select the attempt number.

   d. In the Instance summary panel, click View attempt fields to see details of the fields associated with the selected attempt.

5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary panel and the View instance fields box to troubleshoot issues with your failed pipeline.


For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

Important

Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline will delete the pipeline definition including all the associated objects. You will stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1. On the List Pipelines page, select the check box next to your pipeline.

2. Click Delete.

3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

Topics

Define a Pipeline in JSON Format (p. 49)

Schedule (p. 50)

MySQL Data Node (p. 51)

Amazon S3 Data Node (p. 51)

Resource (p. 52)

Activity (p. 53)

Upload the Pipeline Definition (p. 54)

Activate the Pipeline (p. 54)

Verify the Pipeline Status (p. 55)

The following topics explain how to use the AWS Data Pipeline CLI to create a pipeline to copy data from a MySQL table to a file in an Amazon S3 bucket. In this example, we perform the following steps:

• Create a pipeline definition using the CLI in JSON format

• Create the necessary IAM roles and define a policy and trust relationships

• Upload the pipeline definition using the AWS Data Pipeline CLI tools

• Monitor the progress of the pipeline

To complete the steps in this example, you need a MySQL database instance with a table that contains data. To create a MySQL database using Amazon RDS, see Get Started with Amazon RDS (http://docs.aws.amazon.com/AmazonRDS/latest/GettingStartedGuide/Welcome.html). After you have an Amazon RDS instance, see the MySQL documentation to Create a Table (http://dev.mysql.com/doc/refman/5.5/en//creating-tables.html).

Define a Pipeline in JSON Format

This example scenario shows how to use JSON pipeline definitions and the AWS Data Pipeline CLI to copy data (rows) from a table in a MySQL database to a CSV (comma-separated values) file in an Amazon S3 bucket at a specified time interval. This is the full pipeline definition JSON file followed by an explanation for each of its sections.

Note

We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.

{
  "objects": [
    {
      "id": "MySchedule",
      "type": "Schedule",
      "startDateTime": "2012-11-25T00:00:00",
      "endDateTime": "2012-11-26T00:00:00",
      "period": "1 day"
    },
    {
      "id": "MySQLInput",
      "type": "MySqlDataNode",
      "schedule": { "ref": "MySchedule" },
      "table": "table_name",
      "username": "user_name",
      "*password": "my_password",
      "connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
      "selectQuery": "select * from #{table}"
    },
    {
      "id": "S3Output",
      "type": "S3DataNode",
      "filePath": "s3://testbucket/output/output_file.csv",
      "schedule": { "ref": "MySchedule" }
    },
    {
      "id": "MyEC2Resource",
      "type": "Ec2Resource",
      "schedule": { "ref": "MySchedule" },
      "actionOnTaskFailure": "terminate",
      "actionOnResourceFailure": "retryAll",
      "maximumRetries": "1",
      "role": "test-role",
      "resourceRole": "test-role",
      "instanceType": "m1.medium",
      "instanceCount": "1",
      "securityGroups": [ "test-group", "default" ],
      "keyPair": "test-pair"
    },
    {
      "id": "MyCopyActivity",
      "type": "CopyActivity",
      "runsOn": { "ref": "MyEC2Resource" },
      "input": { "ref": "MySQLInput" },
      "output": { "ref": "S3Output" },
      "schedule": { "ref": "MySchedule" }
    }
  ]
}

Schedule

The example AWS Data Pipeline JSON file begins with a section that defines the schedule by which to copy the data. Many pipeline components have a reference to a schedule, and you may have more than one. The Schedule component is defined by the following fields:

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-25T00:00:00",
  "endDateTime": "2012-11-26T00:00:00",
  "period": "1 day"
},

Note

In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.

Name
The user-defined name for the pipeline schedule, which is a label for your reference only.
Type
The pipeline component type, which is Schedule.
startDateTime
The date/time (in UTC format) that you want the task to begin.
endDateTime
The date/time (in UTC format) that you want the task to stop.
period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline copy operation can run only one time.


MySQL Data Node

Next, the input MySqlDataNode pipeline component defines a location for the input data; in this case, an Amazon RDS instance. The input MySqlDataNode component is defined by the following fields:

{
  "id": "MySQLInput",
  "type": "MySqlDataNode",
  "schedule": { "ref": "MySchedule" },
  "table": "table_name",
  "username": "user_name",
  "*password": "my_password",
  "connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
  "selectQuery": "select * from #{table}"
},

Name
The user-defined name for the MySQL database, which is a label for your reference only.
Type
The MySqlDataNode type, which is an Amazon RDS instance using MySQL in this example.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled "MySchedule".
Table
The name of the database table that contains the data to copy. Replace table_name with the name of your database table.
Username
The user name of the database account that has sufficient permission to retrieve data from the database table. Replace user_name with the name of your user account.
Password
The password for the database account, with the asterisk prefix to indicate that AWS Data Pipeline must encrypt the password value. Replace my_password with the correct password for your user account.
connectionString
The JDBC connection string for the CopyActivity object to connect to the database.
selectQuery
A valid SQL SELECT query that specifies which data to copy from the database table. Note that #{table} is a variable that reuses the table name provided by the "table" field in the preceding lines of the JSON file.
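If you want to copy only a subset of the rows, selectQuery can be any valid SELECT statement against the table. The following hypothetical variation (the status column name is a placeholder) copies only the rows that match a condition:

  "selectQuery": "select * from #{table} where status = 'active'"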

Amazon S3 Data Node

Next, the S3Output pipeline component defines a location for the output file; in this case, a CSV file in an Amazon S3 bucket location. The output S3DataNode component is defined by the following fields:

{
  "id": "S3Output",
  "type": "S3DataNode",
  "filePath": "s3://testbucket/output/output_file.csv",
  "schedule": { "ref": "MySchedule" }
},


Name
The user-defined name for the output location (a label for your reference only).
Type
The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.
Path
The path to the data associated with the data node. The syntax for a data node is determined by its type. For example, a data node for a database follows a different syntax that is appropriate for a database table.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled "MySchedule".

Resource

This is a definition of the computational resource that performs the copy operation. In this example, AWS Data Pipeline should automatically create an EC2 instance to perform the copy task and terminate the resource after the task completes. The fields defined here control the creation and function of the Amazon EC2 instance that does the work. The EC2Resource is defined by the following fields:

{
  "id": "MyEC2Resource",
  "type": "Ec2Resource",
  "schedule": { "ref": "MySchedule" },
  "actionOnTaskFailure": "terminate",
  "actionOnResourceFailure": "retryAll",
  "maximumRetries": "1",
  "role": "test-role",
  "resourceRole": "test-role",
  "instanceType": "m1.medium",
  "instanceCount": "1",
  "securityGroups": [ "test-group", "default" ],
  "keyPair": "test-pair"
},

Name
The user-defined name for the resource, which is a label for your reference only.
Type
The type of computational resource to perform work; in this case, an Amazon EC2 instance. There are other resource types available, such as an EmrCluster type.
Schedule
The schedule on which to create this computational resource.
actionOnTaskFailure
The action to take on the Amazon EC2 instance if the task fails. In this case, the instance should terminate so that each failed attempt to copy data does not leave behind idle, abandoned Amazon EC2 instances with no work to perform. Such instances require manual termination by an administrator.
actionOnResourceFailure
The action to perform if the resource is not created successfully. In this case, retry the creation of an Amazon EC2 instance until it is successful.
maximumRetries
The number of times to retry the creation of this computational resource before marking the resource as a failure and stopping any further creation attempts. This setting works in conjunction with the actionOnResourceFailure field.
Role
The IAM role of the account that accesses resources, such as accessing an Amazon S3 bucket to retrieve data.
resourceRole
The IAM role of the account that creates resources, such as creating and configuring an Amazon EC2 instance on your behalf. Role and ResourceRole can be the same role, but specifying them separately provides greater granularity in your security configuration.
instanceType
The size of the Amazon EC2 instance to create. Ensure that you set the size of EC2 instance that best matches the load of the work that you want to perform with AWS Data Pipeline. In this case, we set an m1.medium EC2 instance. For more information about the different instance types and when to use each one, see the Amazon EC2 Instance Types topic at http://aws.amazon.com/ec2/instance-types/.
instanceCount
The number of Amazon EC2 instances in the computational resource pool to service any pipeline components that depend on this resource.
securityGroups
The Amazon EC2 security groups to grant access to this EC2 instance. As the example shows, you can provide more than one group at a time (test-group and default).
keyPair
The name of the SSH public/private key pair used to log in to the Amazon EC2 instance. For more information, see Amazon EC2 Key Pairs.

Activity

The last section in the JSON file is the definition of the activity that represents the work to perform. In this case, we use a CopyActivity component to copy data from the MySQL table to a CSV file in an Amazon S3 bucket. The CopyActivity component is defined by the following fields:

{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "runsOn": { "ref": "MyEC2Resource" },
  "input": { "ref": "MySQLInput" },
  "output": { "ref": "S3Output" },
  "schedule": { "ref": "MySchedule" }
}

Name
The user-defined name for the activity, which is a label for your reference only.
Type
The type of activity to perform; in this case, CopyActivity.
runsOn
The computational resource that performs the work that this activity defines. In this example, we provide a reference to the EC2 instance defined previously. Using the runsOn field causes AWS Data Pipeline to create the EC2 instance for you. The runsOn field indicates that the resource exists in the AWS infrastructure, while a workerGroup value indicates that you want to use your own on-premises resources to perform the work.
Schedule
The schedule on which to run this activity.
Input
The location of the data to copy.
Output
The target location for the data.

Upload the Pipeline Definition

You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).

To upload your pipeline definition, use the following command.

On Linux/Unix/Mac OS:

./datapipeline --create pipeline_name --put pipeline_file

On Windows:

ruby datapipeline --create pipeline_name --put pipeline_file

Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.

If your pipeline validates successfully, you receive the following message:

Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline definition pipeline_file.json uploaded.

Note
For more information about any errors returned by the --create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).

Ensure that your pipeline appears in the pipeline list by using the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-pipelines

On Windows:

ruby datapipeline --list-pipelines

The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier that uses the format df-AKIAIOSFODNN7EXAMPLE.

Activate the Pipeline

You must activate the pipeline, by using the --activate command-line parameter, before it begins performing work. Use the following command.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

Verify the Pipeline Status

View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.

Note
It is important to note the difference between the Scheduled Start date/time vs. the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.

Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.

Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status.

Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with their own final status.

Tutorial: Launch an Amazon EMR Job Flow

If you regularly run an Amazon EMR job flow, such as to analyze web logs or perform analysis of scientific data, you can use AWS Data Pipeline to manage your Amazon EMR job flows. With AWS Data Pipeline you can specify preconditions that must be met before the job flow is launched (for example, ensuring that today's data has been uploaded to Amazon S3), a schedule for repeatedly running the job flow, and the cluster configuration to use for the job flow. The following tutorial walks you through launching a simple job flow as an example. It can be used as a model for a simple Amazon EMR-based pipeline, or as part of a more involved pipeline.

This tutorial walks you through the process of creating a data pipeline for a simple Amazon EMR job flow to run a pre-existing Hadoop Streaming job provided by Amazon EMR, and then send an Amazon SNS notification after the task completes successfully. You will use the Amazon EMR cluster resource provided by AWS Data Pipeline for this task. This sample application is called WordCount, and can also be run manually from the Amazon EMR console.

The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each pipeline object. For more information about pipeline definitions, see Pipeline Definition (p. 2).

This tutorial uses the following objects to create a pipeline definition:

Activity
The activity that AWS Data Pipeline must perform for this pipeline. This tutorial uses the EmrActivity to run a pre-existing Hadoop Streaming job provided by Amazon EMR.

Schedule
The start date, time, and duration for this activity. You can optionally specify the end date and time.

Resource
The resource AWS Data Pipeline must use to perform this activity. This tutorial uses EmrCluster, a set of Amazon EC2 instances provided by AWS Data Pipeline to run the job flow. AWS Data Pipeline automatically launches the Amazon EMR cluster and then terminates the cluster after the task finishes.

Action
The action AWS Data Pipeline must take when the specified conditions are met.


This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the email address you specify after the task finishes successfully.

For more information about the additional objects and fields supported by the Amazon EMR activity, see EmrCluster (p. 209).
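Although the console builds the pipeline definition for you in this tutorial, it can help to see how these four object types fit together. The following is only an illustrative sketch (the object names, topic ARN, and step value are placeholders; the complete, working JSON for this job flow appears in the command line section later in this chapter):

{
  "objects": [
    {
      "id": "MySchedule",
      "type": "Schedule",
      "startDateTime": "2012-11-19T00:00:00",
      "period": "1 day"
    },
    {
      "id": "MyEmrCluster",
      "type": "EmrCluster",
      "schedule": { "ref": "MySchedule" }
    },
    {
      "id": "MyEmrActivity",
      "type": "EmrActivity",
      "schedule": { "ref": "MySchedule" },
      "runsOn": { "ref": "MyEmrCluster" },
      "onSuccess": { "ref": "MySuccessAlarm" },
      "step": "..."
    },
    {
      "id": "MySuccessAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic",
      "role": "DataPipelineDefaultRole",
      "subject": "EMR job flow finished",
      "message": "The WordCount job flow completed successfully."
    }
  ]
}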

The following steps outline how to create a data pipeline to launch an Amazon EMR job flow.

1. Create your pipeline definition

2. Create and configure the pipeline definition objects

3. Validate and save your pipeline definition

4. Verify that your pipeline definition is saved

5. Activate your pipeline

6. Monitor the progress of your pipeline

7. [Optional] Delete your pipeline

Before You Begin ...

Be sure you've completed the following steps.

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).

• Set up the AWS Data Pipeline tools and interface you plan on using. For more information about the interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).

• Create an Amazon SNS topic for sending email notifications and make a note of the topic Amazon Resource Name (ARN). For more information, see Create a Topic in the Amazon Simple Notification Service Getting Started Guide.

Note
Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.

Using the AWS Data Pipeline Console

Topics

Create and Configure the Pipeline Definition Objects (p. 58)

Validate and Save Your Pipeline (p. 60)

Verify Your Pipeline Definition (p. 60)

Activate your Pipeline (p. 61)

Monitor the Progress of Your Pipeline Runs (p. 61)

[Optional] Delete your Pipeline (p. 63)

The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.


To create your pipeline definition

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.
2. Click Create Pipeline.
3. On the Create a New Pipeline page:
   a. In the Pipeline Name box, enter a name (for example, MyEmrJob).
   b. In Pipeline Description, enter a description.
   c. Leave the Select Schedule Type: button set to the default type for this tutorial.
      Note
      Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
   d. Leave the Role boxes set to their default values for this tutorial.
   e. Click Create a new pipeline.

Create and Configure the Pipeline Definition Objects

Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.

1. On the Pipeline: name of your pipeline page, select Add activity.
2. In the Activities pane:
   a. Enter the name of the activity; for example, my-emr-job.
   b. In the Type box, select EmrActivity.
   c. In the Step box, enter:

      /home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,\
      s3n://elasticmapreduce/samples/wordcount/input,-output,\
      s3://myawsbucket/wordcount/output/#{@scheduledStartTime},\
      -mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,-reducer,aggregate

   d. In the Schedule box, select Create new: Schedule.
   e. In the Add an optional field .. box, select Runs On.
   f. In the Runs On box, select Create new: EmrCluster.
   g. In the Add an optional field .. box, select On Success.
   h. In the On Success box, select Create new: Action.

You've now specified the objects that AWS Data Pipeline uses to launch an Amazon EMR job flow. The Pipeline: name of your pipeline pane shows a single activity icon for this pipeline.

Next, configure the run date and time for your pipeline.


To configure the run date and time for your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.
2. In the Schedules pane:
   a. Enter a schedule name for this activity (for example, my-emr-job-schedule).
   b. In the Type box, select Schedule.
   c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.
      Note
      AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only.
   d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).
   e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.

To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.
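For reference, such a schedule corresponds roughly to a Schedule object like the following sketch, which borrows the field names from the pipeline definition shown in the command line section of this chapter; the dates are placeholders. With a startDateTime one day in the past and a one-day period, AWS Data Pipeline would immediately backfill the "past due" run as soon as the pipeline is activated:

{
  "id": "my-emr-job-schedule",
  "type": "Schedule",
  "startDateTime": "2012-11-19T00:00:00",
  "endDateTime": "2012-11-21T00:00:00",
  "period": "1 day"
}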

Next, configure the resource AWS Data Pipeline must use to perform the Amazon EMR job.

To configure the resource

1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.
2. In the Resources pane:
   a. In the Name box, enter the name for your EMR cluster (for example, MyEmrCluster).
   b. Leave the Type box set to the default value.
   c. In the Schedule box, select my-emr-job-schedule.

Next, configure the SNS notification action AWS Data Pipeline must perform after the Amazon EMR job finishes successfully.

To configure the SNS notification action

1. On the Pipeline: name of your pipeline page, in the right pane, click Others.
2. In the Others pane:
   a. In the DefaultAction1 Name box, enter the name for your Amazon SNS notification (for example, EmrJobNotice).
   b. In the Type box, select SnsAlarm.
   c. In the Message box, enter the message content.
   d. Leave the entry in the Role box set to the default.
   e. In the Subject box, enter the subject line for your notification.
   f. In the Topic Arn box, enter the ARN of your Amazon SNS topic.
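Behind these console fields, the notification is stored as an SnsAlarm object in the pipeline definition. A minimal sketch, assuming a placeholder topic ARN, subject, and message (the same object type, with the same fields, appears in the import tutorial later in this guide):

{
  "id": "EmrJobNotice",
  "type": "SnsAlarm",
  "topicArn": "arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic",
  "role": "DataPipelineDefaultRole",
  "subject": "EMR job flow finished",
  "message": "The WordCount job flow completed successfully."
}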


You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.

If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete but you still get a validation error message, you must fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1. On the Pipeline: name of your pipeline page, click Save Pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message.
3. If you get an error message, click Close and then, in the right pane, click Errors.
4. The Errors pane lists the objects failing validation. Click the plus (+) sign next to the object names and look for an error message in red.
5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the Schedules object, click the Schedules pane to fix the error.
6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.
7. Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.

Verify Your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition

1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.
2. On the List Pipelines page, check whether your newly created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.
3. Click the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.
4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.
5. Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2. In the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens up, confirming the activation.
3. Click Close.

Next, verify that your pipeline is running.

Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.


Note
You can also view the job flows in the Amazon EMR console. The job flows spawned by AWS Data Pipeline on your behalf are displayed in the Amazon EMR console and billed to your AWS account in the same manner as job flows that you launch manually. You can tell which job flows were spawned by AWS Data Pipeline by looking at the name of the job flow. Those spawned by AWS Data Pipeline have a name formatted as follows: job-flow-identifier_@emr-cluster-name_launch-time. For more information, see View Job Flow Details in the Amazon Elastic MapReduce Developer Guide.

2. The Instance details: name of your pipeline page lists the status of each instance in your pipeline definition.
   Note
   If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.
3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the activity. You should receive an email about the successful completion of this task at the account you specified for receiving your Amazon SNS notification.
4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. To troubleshoot the failed or incomplete runs:
   a. Click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.
   b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure. For example, @failureReason = Resource not healthy terminated.
   c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
   d. In the Instance summary pane, click View attempt fields to see details of the fields associated with the selected attempt.
5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.


For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1. In the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

If you regularly run an Amazon EMR job flow to analyze web logs or perform analysis of scientific data, you can use AWS Data Pipeline to manage your Amazon EMR job flows. With AWS Data Pipeline, you can specify preconditions that must be met before the job flow is launched (for example, ensuring that today's data has been uploaded to Amazon S3). The following tutorial walks you through launching a job flow that can serve as a model for a simple Amazon EMR-based pipeline, or as part of a more involved pipeline.

The following code is the pipeline definition file for a simple Amazon EMR job flow that runs a pre-existing Hadoop Streaming job provided by Amazon EMR. This sample application is called WordCount, and can also be run manually from the Amazon EMR console. In the following code, you should replace the Amazon S3 bucket location with the name of an Amazon S3 bucket that you own. You should also replace the start and end dates. To get job flows launching immediately, set startDateTime to a date one day in the past and endDateTime to one day in the future. AWS Data Pipeline then starts launching the "past due" job flows immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.

{
  "objects": [
    {
      "id": "Hourly",
      "type": "Schedule",
      "startDateTime": "2012-11-19T07:48:00",
      "endDateTime": "2012-11-21T07:48:00",
      "period": "1 hours"
    },
    {
      "id": "MyCluster",
      "type": "EmrCluster",
      "masterInstanceType": "m1.small",
      "schedule": {
        "ref": "Hourly"
      }
    },
    {
      "id": "MyEmrActivity",
      "type": "EmrActivity",
      "schedule": {
        "ref": "Hourly"
      },
      "runsOn": {
        "ref": "MyCluster"
      },
      "step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://myawsbucket/wordcount/output/#{@scheduledStartTime},-mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,-reducer,aggregate"
    }
  ]
}

This pipeline has three objects:

• Hourly, which represents the schedule of the work. You can set a schedule as one of the fields on an activity. When you do, the activity runs according to that schedule, or in this case, hourly.

• MyCluster, which represents the set of Amazon EC2 instances used to run the job flow. You can specify the size and number of EC2 instances to run as the cluster. If you do not specify the number of instances, the job flow launches with two: a master node and a task node. You can add additional configurations to the cluster, such as bootstrap actions to load additional software onto the Amazon EMR-provided AMI. (See the sketch after this list for an example of the instance fields.)

• MyEmrActivity, which represents the computation to process with the job flow. Amazon EMR supports several types of job flows, including streaming, Cascading, and Scripted Hive. The runsOn field refers back to MyCluster, using that as the specification for the underpinnings of the job flow.
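For example, to size the cluster yourself rather than accept the two-node default, you could add the instance fields that the import tutorial later in this guide uses on its EmrCluster object. This is an illustrative sketch only; the instance types and count are placeholders:

{
  "id": "MyCluster",
  "type": "EmrCluster",
  "masterInstanceType": "m1.small",
  "instanceCoreType": "m1.xlarge",
  "instanceCoreCount": "2",
  "schedule": { "ref": "Hourly" }
}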

To create a pipeline that launches an Amazon EMR job flow

1. Open a terminal window in the directory where you've installed the AWS Data Pipeline CLI. For more information about how to install the CLI, see Install the Command Line Interface (p. 15).

2. Create a new pipeline.

   ./datapipeline --credentials ./credentials.json --create MyEmrPipeline

   When the pipeline is created, AWS Data Pipeline returns a success message and an identifier for the pipeline.

   Pipeline with name 'MyEmrPipeline' and id 'df-07634391Y0GRTUD0SP0' created.

3. Add the JSON definition to the pipeline. This gives AWS Data Pipeline the business logic it needs to manage your data.

   ./datapipeline --credentials ./credentials.json --put MyEmrPipelineDefinition.df --id df-07634391Y0GRTUD0SP0

   The following message is an example of a successfully uploaded pipeline.

   State of pipeline id 'df-07634391Y0GRTUD0SP0' is currently 'PENDING'

4. Activate the pipeline.

   ./datapipeline --credentials ./credentials.json --activate --id df-07634391Y0GRTUD0SP0

   If the pipeline definition is valid, the preceding --put command uploads the business logic and the --activate command activates the pipeline. If the pipeline is invalid, AWS Data Pipeline returns an error code indicating what the problems are.

5. Wait until the pipeline has had time to start running, then verify the pipeline's operation.

   ./datapipeline --credentials ./credentials.json --list-runs --id df-07634391Y0GRTUD0SP0

   This returns information about the runs initiated by the pipeline, such as the following.

   State of pipeline id 'df-07634391Y0GRTUD0SP0' is currently 'SCHEDULED'
   The --list-runs command is fetching the last 4 days of pipeline runs.
   If this takes too long, use --help for how to specify a different interval with --start-interval or --schedule-interval.

        Name             Scheduled Start        Status               ID                                    Started                Ended
   -----------------------------------------------------------------------------------------------------------------------------------------------
     1.  MyCluster       2012-11-19T07:48:00    FINISHED             @MyCluster_2012-11-19T07:48:00        2012-11-20T22:29:33    2012-11-20T22:40:46
     2.  MyEmrActivity   2012-11-19T07:48:00    FINISHED             @MyEmrActivity_2012-11-19T07:48:00    2012-11-20T22:29:31    2012-11-20T22:38:43
     3.  MyCluster       2012-11-19T08:03:00    RUNNING              @MyCluster_2012-11-19T08:03:00        2012-11-20T22:34:32
     4.  MyEmrActivity   2012-11-19T08:03:00    RUNNING              @MyEmrActivity_2012-11-19T08:03:00    2012-11-20T22:34:31
     5.  MyCluster       2012-11-19T08:18:00    CREATING             @MyCluster_2012-11-19T08:18:00        2012-11-20T22:39:31
     6.  MyEmrActivity   2012-11-19T08:18:00    WAITING_FOR_RUNNER   @MyEmrActivity_2012-11-19T08:18:00    2012-11-20T22:39:30

   All times are listed in UTC and all command line input is treated as UTC.
   Total of 6 pipeline runs shown from pipeline named 'MyEmrPipeline' where --start-interval 2012-11-16T22:41:32,2012-11-20T22:41:32

You can view job flows launched by AWS Data Pipeline in the Amazon EMR console. The job flows spawned by AWS Data Pipeline on your behalf are displayed in the Amazon EMR console and billed to your AWS Account in the same manner as job flows that you launch manually.

To check the progress of job flows launched by AWS Data Pipeline

1. Look at the name of the job flow to tell which job flows were spawned by AWS Data Pipeline. Those spawned by AWS Data Pipeline have a name formatted as follows: <job-flow-identifier>_@<emr-cluster-name>_<launch-time>.
2. Click the Bootstrap Actions tab to display the bootstrap action that AWS Data Pipeline uses to install the AWS Data Pipeline Task Agent on the Amazon EMR clusters that it launches.
3. After one of the runs is complete, navigate to the Amazon S3 console and check that the time-stamped output folder exists and contains the expected results of the job flow.


Tutorial: Import/Export Data in Amazon DynamoDB With Amazon EMR and Hive

This is the first of a two-part tutorial that demonstrates how to bring together multiple AWS features to solve real-world problems in a scalable way through a common scenario: moving schema-less data in and out of Amazon DynamoDB using Amazon EMR and Hive. Complete part one before you move on to part two. This tutorial involves the following concepts and procedures:

• Using the AWS Data Pipeline console and command-line interface (CLI) to create and configure pipelines

• Creating and configuring Amazon DynamoDB tables

• Creating and allocating work to Amazon EMR clusters

• Querying and processing data with Hive scripts

• Storing and accessing data using Amazon S3

Part One: Import Data into Amazon DynamoDB

Topics

Before You Begin... (p. 70)

Create an Amazon SNS Topic (p. 73)

Create an Amazon S3 Bucket (p. 74)

Using the AWS Data Pipeline Console (p. 74)

Using the Command Line Interface (p. 81)

The first part of this tutorial explains how to define an AWS Data Pipeline pipeline to retrieve data from a tab-delimited file in Amazon S3 to populate an Amazon DynamoDB table, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work. The first part of the tutorial involves the following steps:

1. Create an Amazon DynamoDB table to store the data

2. Create and configure the pipeline definition objects

3. Upload your pipeline definition

4. Verify your results

Before You Begin...

Be sure you've completed the following steps.

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).

• Set up the AWS Data Pipeline tools and interface you plan on using. For more information on the interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).

• Create an Amazon S3 bucket as a data source. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Create an Amazon DynamoDB table to store data as defined by the following procedure.

Be aware of the following:

• Imports may overwrite data in your Amazon DynamoDB table. When you import data from Amazon S3, the import may overwrite items in your Amazon DynamoDB table. Make sure that you are importing the right data and into the right table. Be careful not to accidentally set up a recurring import pipeline that will import the same data multiple times.

• Exports may overwrite data in your Amazon S3 bucket. When you export data to Amazon S3, you may overwrite previous exports if you write to the same bucket path. The default behavior of the Export DynamoDB to S3 template appends the job's scheduled time to the Amazon S3 bucket path, which helps you avoid this problem.

• Import and export jobs consume some of your Amazon DynamoDB table's provisioned throughput capacity. This section explains how to schedule an import or export job using Amazon EMR. The Amazon EMR cluster consumes some read capacity during exports or write capacity during imports. You can control the percentage of the provisioned capacity that the import/export jobs consume with the settings MyImportJob.myDynamoDBWriteThroughputRatio and MyExportJob.myDynamoDBReadThroughputRatio (see the sketch after this list). Be aware that these settings determine how much capacity to consume at the beginning of the import/export process and do not adapt in real time if you change your table's provisioned capacity in the middle of the process.

• Be aware of the costs. AWS Data Pipeline manages the import/export process for you, but you still pay for the underlying AWS services that are being used. The import and export pipelines create Amazon EMR clusters to read and write data, and there are per-instance charges for each node in the cluster. You can read more about the details of Amazon EMR Pricing. The default cluster configuration is one m1.small instance master node and one m1.xlarge instance task node, though you can change this configuration in the pipeline definition. There are also charges for AWS Data Pipeline. For more information, see AWS Data Pipeline Pricing and Amazon S3 Pricing.
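In the JSON pipeline definition shown later in this section, the write setting surfaces as the dynamoDBWritePercent field on the import activity. As a rough, abbreviated sketch (an excerpt only, with a placeholder value), a lower ratio such as 0.25 is intended to hold the import to about a quarter of the table's provisioned write throughput:

{
  "id": "MyImportJob",
  "type": "EmrActivity",
  "dynamoDBOutputTable": "MyTable",
  "dynamoDBWritePercent": "0.25"
}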

Create an Amazon DynamoDB Table

This section explains how to create an Amazon DynamoDB table that is a prerequisite for this tutorial. For more information, see Working with Tables in Amazon DynamoDB in the Amazon DynamoDB Developer Guide.

Note
If you already have an Amazon DynamoDB table, you can skip this procedure.


To create an Amazon DynamoDB table

1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.
2. Click Create Table.
3. On the Create Table / Primary Key page, enter a name (for example, MyTable) in the Table Name box.
   Note
   Your table name must be unique.
4. In the Primary Key section, for the Primary Key Type radio button, select Hash.
5. In the Hash Attribute Name field, select Number and enter Id in the text box.
6. Click Continue.
7. On the Create Table / Provisioned Throughput Capacity page, in the Read Capacity Units box, enter 5.
8. In the Write Capacity Units box, enter 5.
   Note
   In this example, we use read and write capacity unit values of five because the sample input data is small. You may need a larger value depending on the size of your actual input data set. For more information, see Provisioned Throughput in Amazon DynamoDB in the Amazon DynamoDB Developer Guide.
9. Click Continue.
10. On the Create Table / Throughput Alarms page, in the Send notification to box, enter your email address.


Create an Amazon SNS Topic

This section explains how to create an Amazon SNS topic and subscribe to receive notifications from AWS Data Pipeline regarding the status of your pipeline components. For more information, see Create a Topic in the Amazon SNS Getting Started Guide.

Note
If you already have an Amazon SNS topic ARN to which you have subscribed, you can skip this procedure.

To create an Amazon SNS topic

1. Sign in to the AWS Management Console and open the Amazon SNS console.
2. Click Create New Topic.
3. In the Topic Name field, type your topic name, such as my-example-topic, and select Create Topic.
4. Note the value from the Topic ARN field, which should be similar in format to this example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.

To create an Amazon SNS subscription

1. Sign in to the AWS Management Console and open the Amazon SNS console.
2. In the navigation pane, select your Amazon SNS topic and click Create New Subscription.
3. In the Protocol field, choose Email.
4. In the Endpoint field, type your email address and select Subscribe.

Note
You must accept the subscription confirmation email to begin receiving Amazon SNS notifications at the email address you specify.


Create an Amazon S3 Bucket

This section explains how to create an Amazon S3 bucket as a storage location for your input and output files related to this tutorial. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.

Note
If you already have an Amazon S3 bucket configured with write permissions, you can skip this procedure.

To create an Amazon S3 bucket

1. Sign in to the AWS Management Console and open the Amazon S3 console.
2. Click Create Bucket.
3. In the Bucket Name field, type your bucket name, such as my-example-bucket, and select Create.
4. In the Buckets pane, select your new bucket and select Permissions.
5. Ensure that all user accounts that you want to access these files appear in the Grantee list.

Using the AWS Data Pipeline Console

Topics

Start Import from the Amazon DynamoDB Console (p. 74)

Create the Pipeline Definition using the AWS Data Pipeline Console (p. 75)

Create and Configure the Pipeline from a Template (p. 76)

Complete the Data Nodes (p. 76)

Complete the Resources (p. 77)

Complete the Activity (p. 78)

Complete the Notifications (p. 78)

Validate and Save Your Pipeline (p. 78)

Verify your Pipeline Definition (p. 79)

Activate your Pipeline (p. 79)

Monitor the Progress of Your Pipeline Runs (p. 80)

[Optional] Delete your Pipeline (p. 81)

The following topics explain how to define an AWS Data Pipeline pipeline to retrieve data from a tab-delimited file using the AWS Data Pipeline console.

Start Import from the Amazon DynamoDB Console

You can begin the Amazon DynamoDB import operation from within the Amazon DynamoDB console.

To start the data import

1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.
2. On the Tables screen, click your Amazon DynamoDB table and click the Import Table button.
3. On the Import Table screen, read the walkthrough and check the I have read the walkthrough box, then select Build a Pipeline. This opens the AWS Data Pipeline console so that you can choose a template to import the Amazon DynamoDB table data.


Create the Pipeline Definition using the AWS Data Pipeline Console

To create the new pipeline

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console, or arrive at the AWS Data Pipeline console through the Build a Pipeline button in the Amazon DynamoDB console.
2. Click Create new pipeline.
3. On the Create a New Pipeline page:
   a. In the Pipeline Name box, enter a name (for example, CopyMyS3Data).
   b. In Pipeline Description, enter a description.
   c. Leave the Select Schedule Type: button set to the default type, Time Series Style Scheduling, for this tutorial. Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
   d. Leave the Role boxes set to their default values for this tutorial, which are DataPipelineDefaultRole for the role and DataPipelineDefaultResourceRole for the resource role.
      Note
      If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.
   e. Click Create a new Pipeline.


Create and Configure the Pipeline from a Template

On the Pipeline screen, click Templates and select Export S3 to DynamoDB. The AWS Data Pipeline console pre-populates a pipeline definition template with the base objects necessary to import data from Amazon S3.

Review the template and complete the missing fields. You start by choosing the schedule and frequency by which you want your data import operation to run.

To complete the schedule

• On the Pipeline screen, click Schedules.
   a. In the ImportSchedule section, set Period to 1 Hours.
   b. Set Start Date Time using the calendar to the current date, such as 2012-12-18, and the time to 00:00:00 UTC.
   c. In the Add an optional field .. box, select End Date Time.
   d. Set End Date Time using the calendar to the following day, such as 2012-12-19, and the time to 00:00:00 UTC.
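In the underlying pipeline definition, these settings correspond to a Schedule object. The following is a rough sketch of this one-day, hourly window, using the example dates above and the Schedule fields shown in the command line section of this tutorial:

{
  "id": "ImportSchedule",
  "type": "Schedule",
  "startDateTime": "2012-12-18T00:00:00",
  "endDateTime": "2012-12-19T00:00:00",
  "period": "1 hours"
}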

Complete the Data Nodes

Next, you complete the data node objects in your pipeline definition template.

To complete the Amazon DynamoDB data node

1. On the Pipeline: name of your pipeline page, select DataNodes.
2. In the DataNodes pane:
   a. Enter the Name; for example, DynamoDB.
   b. In the MyDynamoDBData section, in the Table Name box, type the name of the Amazon DynamoDB table where you want to store the output data; for example, MyTable.

To complete the Amazon S3 data node

• In the DataNodes pane, in the MyS3Data section, in the Directory Path field, type a valid Amazon S3 directory path for the location of your source data; for example, s3://elasticmapreduce/samples/Store/ProductCatalog. This sample file is a fictional product catalog that is pre-populated with delimited data for demonstration purposes.

Complete the Resources

Next, you complete the resources that will run the data import activities. Many of the fields are auto-populated by the template; you only need to complete the empty fields.

To complete the resources

• On the Pipeline page, select Resources.
• In the Emr Log Uri box, type the path where to store Amazon EMR debugging logs, using the Amazon S3 bucket that you configured in part one of this tutorial; for example, s3://my-test-bucket/emr_debug_logs.


Complete the Activity

Next, you complete the activity that represents the steps to perform in your data import operation.

To complete the activity

1. On the Pipeline: name of your pipeline page, select Activities.
2. In the MyImportJob section, review the default options already provided. You are not required to manually configure any options in this section.

Complete the Notifications

Next, configure the SNS notification action AWS Data Pipeline must perform depending on the outcome of the activity.

To configure the SNS success, failure, and late notification action

1. On the Pipeline: name of your pipeline page, in the right pane, click Others.
2. In the Others pane:
   a. In the LateSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example, arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.
   b. In the FailureSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example, arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.
   c. In the SuccessSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example, arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.
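These three notifications are wired to the import activity through the onSuccess, onFail, and onLateAction references, as the pipeline definition JSON in the command line section of this tutorial shows. An abbreviated excerpt:

{
  "id": "MyImportJob",
  "type": "EmrActivity",
  "onSuccess": { "ref": "SuccessSnsAlarm" },
  "onFail": { "ref": "FailureSnsAlarm" },
  "onLateAction": { "ref": "LateSnsAlarm" }
}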

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.

If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete but you still get a validation error message, you must fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1. On the Pipeline: name of your pipeline page, click Save Pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message.
3. If you get an error message, click Close and then, in the right pane, click Errors.
4. The Errors pane lists the objects failing validation. Click the plus (+) sign next to the object names and look for an error message in red.
5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.
6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.
7. Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.

Verify your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition

1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.
2. On the List Pipelines page, check whether your newly created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.
3. Click the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.
4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.
5. Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.


To activate your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2. In the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens up, confirming the activation.
3. Click Close.

Next, verify that your pipeline is running.

Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.
2. The Instance details: name of your pipeline page lists the status of each object in your pipeline definition.
   Note
   If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.
3. If the Status column of all the objects in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity. You should receive an email about the successful completion of this task at the account you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify whether the data was copied.
4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. To troubleshoot the failed or incomplete instance runs:
   a. Click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.
   b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure; for example, @failureReason = Resource not healthy terminated.
   c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
   d. In the Instance summary pane, click View attempt fields to see details of the fields associated with the selected attempt.


5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.

For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1. In the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

Topics

Define the Import Pipeline in JSON Format (p. 82)

Schedule (p. 84)


Amazon S3 Data Node (p. 84)

Precondition (p. 85)

Amazon EMR Cluster (p. 86)

Amazon EMR Activity (p. 86)

Upload the Pipeline Definition (p. 88)

Activate the Pipeline (p. 89)

Verify the Pipeline Status (p. 89)

Verify Data Import (p. 90)

The following topics explain how to perform the steps in this tutorial using the AWS Data Pipeline CLI.

Define the Import Pipeline in JSON Format

This example pipeline definition shows how to use AWS Data Pipeline to retrieve data from a file in Amazon S3 to populate an Amazon DynamoDB table, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work. Additionally, this pipeline sends Amazon SNS notifications if the pipeline succeeds, fails, or runs late. This is the full pipeline definition JSON file, followed by an explanation of each of its sections.

Note
We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.

{
  "objects": [
    {
      "id": "MySchedule",
      "type": "Schedule",
      "startDateTime": "2012-11-22T00:00:00",
      "endDateTime": "2012-11-23T00:00:00",
      "period": "1 day"
    },
    {
      "id": "MyS3Data",
      "type": "S3DataNode",
      "schedule": {
        "ref": "MySchedule"
      },
      "filePath": "s3://input_bucket/ProductCatalog",
      "precondition": {
        "ref": "InputReady"
      }
    },
    {
      "id": "InputReady",
      "type": "S3PrefixNotEmpty",
      "role": "test-role",
      "s3Prefix": "#{node.filePath}"
    },
    {
      "id": "ImportCluster",
      "type": "EmrCluster",
      "masterInstanceType": "m1.small",
      "instanceCoreType": "m1.xlarge",
      "instanceCoreCount": "1",
      "schedule": {
        "ref": "MySchedule"
      },
      "enableDebugging": "true",
      "emrLogUri": "s3://test_bucket/emr_logs"
    },
    {
      "id": "MyImportJob",
      "type": "EmrActivity",
      "dynamoDBOutputTable": "MyTable",
      "dynamoDBWritePercent": "1.00",
      "s3MyS3Data": "#{input.path}",
      "lateAfterTimeout": "12 hours",
      "attemptTimeout": "24 hours",
      "maximumRetries": "0",
      "input": {
        "ref": "MyS3Data"
      },
      "runsOn": {
        "ref": "ImportCluster"
      },
      "schedule": {
        "ref": "MySchedule"
      },
      "onSuccess": {
        "ref": "SuccessSnsAlarm"
      },
      "onFail": {
        "ref": "FailureSnsAlarm"
      },
      "onLateAction": {
        "ref": "LateSnsAlarm"
      },
      "step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoDBTableFromS3,-d,DYNAMODB_OUTPUT_TABLE=#{dynamoDBOutputTable},-d,S3_INPUT_BUCKET=#{s3MyS3Data},-d,DYNAMODB_WRITE_PERCENT=#{dynamoDBWritePercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
    },
    {
      "id": "SuccessSnsAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:286192228708:mysnsnotify",
      "role": "test-role",
      "subject": "DynamoDB table '#{node.dynamoDBOutputTable}' import succeeded",
      "message": "DynamoDB table '#{node.dynamoDBOutputTable}' import from S3 bucket '#{node.s3MyS3Data}' succeeded at #{node.@actualEndTime}. JobId: #{node.id}"
    },
    {
      "id": "LateSnsAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:286192228708:mysnsnotify",
      "role": "test-role",
      "subject": "DynamoDB table '#{node.dynamoDBOutputTable}' import is taking a long time!",
      "message": "DynamoDB table '#{node.dynamoDBOutputTable}' import from S3 bucket '#{node.s3MyS3Data}' has exceeded the late warning period '#{node.lateAfterTimeout}'. JobId: #{node.id}"
    },
    {
      "id": "FailureSnsAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:286192228708:mysnsnotify",
      "role": "test-role",
      "subject": "DynamoDB table '#{node.dynamoDBOutputTable}' import failed!",
      "message": "DynamoDB table '#{node.dynamoDBOutputTable}' import from S3 bucket '#{node.s3MyS3Data}' failed. JobId: #{node.id}. Error: #{node.errorMessage}."
    }
  ]
}

Schedule

The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule, and you may have more than one. The Schedule component is defined by the following fields:

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-22T00:00:00",
  "endDateTime": "2012-11-23T00:00:00",
  "period": "1 day"
},

Note
In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.

Name
The user-defined name for the pipeline schedule, which is a label for your reference only.
Type
The pipeline component type, which is Schedule.
startDateTime
The date/time (in UTC format) that you want the task to begin.
endDateTime
The date/time (in UTC format) that you want the task to stop.
period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to be 1 day so that the pipeline copy operation can only run one time.
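For example, an hourly period over the same one-day window also satisfies this divisibility rule (24 even periods) and would schedule 24 runs instead of one. A sketch reusing the dates above:

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-22T00:00:00",
  "endDateTime": "2012-11-23T00:00:00",
  "period": "1 hours"
}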

Amazon S3 Data Node

Next, the S3DataNode pipeline component defines a location for the input file; in this case, a tab-delimited file in an Amazon S3 bucket location. The input S3DataNode component is defined by the following fields:

{
  "id": "MyS3Data",
  "type": "S3DataNode",
  "schedule": {
    "ref": "MySchedule"
  },
  "filePath": "s3://input_bucket/ProductCatalog",
  "precondition": {
    "ref": "InputReady"
  }
},

Name
The user-defined name for the input location (a label for your reference only).
Type
The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled "MySchedule".
Path
The path to the data associated with the data node. This path contains a sample product catalog input file that we use for this scenario. The syntax for a data node is determined by its type. For example, a data node for a file in Amazon S3 follows a different syntax than the syntax appropriate for a database table.
Precondition
A reference to a precondition that must evaluate as true for the pipeline to consider the data node to be valid. The precondition itself is defined later in the pipeline definition file.
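The console template earlier in this tutorial exposes the input location through a Directory Path box rather than a single file. In JSON, a directory is typically referenced with a directoryPath field instead of filePath; the following sketch assumes that field name (it does not appear in the example above) and reuses the sample catalog location from the console steps:

{
  "id": "MyS3Data",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "directoryPath": "s3://elasticmapreduce/samples/Store/ProductCatalog",
  "precondition": { "ref": "InputReady" }
}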

Precondition

Next, the precondition defines a condition that must be true for the pipeline to use the S3DataNode associated with this precondition. The precondition is defined by the following fields:

{
  "id": "InputReady",
  "type": "S3PrefixNotEmpty",
  "role": "test-role",
  "s3Prefix": "#{node.filePath}"
},

Name
The user-defined name for the precondition (a label for your reference only).
Type
The type of the precondition is S3PrefixNotEmpty, which checks an Amazon S3 prefix to ensure that it is not empty.
Role
The IAM role that provides the permissions necessary to access the S3DataNode.
S3Prefix
The Amazon S3 prefix to check for emptiness. This field uses an expression, #{node.filePath}, populated from the referring component, which in this example is the S3DataNode that refers to this precondition.


Amazon EMR Cluster

Next, the EmrCluster pipeline component defines an Amazon EMR cluster that processes and moves the data in this tutorial. The EmrCluster component is defined by the following fields:

{
  "id": "ImportCluster",
  "type": "EmrCluster",
  "masterInstanceType": "m1.small",
  "instanceCoreType": "m1.xlarge",
  "instanceCoreCount": "1",
  "schedule": {
    "ref": "MySchedule"
  },
  "enableDebugging": "true",
  "emrLogUri": "s3://test_bucket/emr_logs"
},

Name
The user-defined name for the Amazon EMR cluster (a label for your reference only).
Type
The computational resource type, which is an Amazon EMR cluster. For more information, see Overview of Amazon EMR in the Amazon EMR Developer Guide.
masterInstanceType
The type of Amazon EC2 instance to use as the master node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 documentation.
instanceCoreType
The type of Amazon EC2 instance to use as the core node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 documentation.
instanceCoreCount
The number of core Amazon EC2 instances to use in the Amazon EMR cluster.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled "MySchedule".
enableDebugging
Indicates whether to create detailed debug logs for the Amazon EMR job flow.
emrLogUri
Specifies an Amazon S3 location to store the Amazon EMR job flow debug logs if you enabled debugging with the enableDebugging field.

Amazon EMR Activity

Next, the EmrActivity pipeline component brings together the schedule, resources, and data nodes to define the work to perform, the conditions under which to do the work, and the actions to perform when certain events occur. The EmrActivity component is defined by the following fields:

{
  "id": "MyImportJob",
  "type": "EmrActivity",
  "dynamoDBOutputTable": "MyTable",
  "dynamoDBWritePercent": "1.00",
  "s3MyS3Data": "#{input.path}",
  "lateAfterTimeout": "12 hours",
  "attemptTimeout": "24 hours",
  "maximumRetries": "0",
  "input": {
    "ref": "MyS3Data"
  },
  "runsOn": {
    "ref": "ImportCluster"
  },
  "schedule": {
    "ref": "MySchedule"
  },
  "onSuccess": {
    "ref": "SuccessSnsAlarm"
  },
  "onFail": {
    "ref": "FailureSnsAlarm"
  },
  "onLateAction": {
    "ref": "LateSnsAlarm"
  },
  "step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoDBTableFromS3,-d,DYNAMODB_OUTPUT_TABLE=#{dynamoDBOutputTable},-d,S3_INPUT_BUCKET=#{s3MyS3Data},-d,DYNAMODB_WRITE_PERCENT=#{dynamoDBWritePercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
},

Name
The user-defined name for the Amazon EMR activity (a label for your reference only).
Type
The EmrActivity pipeline component type, which creates an Amazon EMR job flow to perform the defined work. For more information, see Overview of Amazon EMR in the Amazon EMR Developer Guide.
dynamoDBOutputTable
The Amazon DynamoDB table where the Amazon EMR job flow writes the output of the Hive script.
dynamoDBWritePercent
Sets the rate of write operations to keep your Amazon DynamoDB provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusive. For more information, see Hive Options in the Amazon EMR Developer Guide.
s3MyS3Data
An expression that refers to the Amazon S3 location path of the input data defined by the S3DataNode labeled "MyS3Data".
lateAfterTimeout
The amount of time, after the schedule start time, that the activity can wait to start before AWS Data Pipeline considers it late.
attemptTimeout
The amount of time, after the schedule start time, that the activity has to complete before AWS Data Pipeline considers it as failed.
maximumRetries
The maximum number of times that AWS Data Pipeline retries the activity.
input
The Amazon S3 location path of the input data defined by the S3DataNode labeled "MyS3Data".
runsOn
A reference to the computational resource that will run the activity; in this case, an EmrCluster labeled "ImportCluster".
schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled "MySchedule".
onSuccess
A reference to the action to perform when the activity is successful. In this case, it is to send an Amazon SNS notification.
onFail
A reference to the action to perform when the activity fails. In this case, it is to send an Amazon SNS notification.
onLateAction
A reference to the action to perform when the activity is late. In this case, it is to send an Amazon SNS notification.
step
Defines the steps for the Amazon EMR job flow to perform. This step calls a Hive script named importDynamoDBTableFromS3 that is provided by Amazon EMR and is specifically designed to move data from Amazon S3 into Amazon DynamoDB. To perform more complex data transformation tasks, you would customize this Hive script and provide its name and path here. For more information about sample Hive scripts that show how to perform data transformation tasks, see Contextual Advertising using Apache Hive and Amazon EMR in AWS Articles and Tutorials.
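For example, to run your own transformation instead of the provided import script, you could point the -f argument at a Hive script stored in your own bucket. In the following excerpt, s3://my-example-bucket/hive/my-transform.q is a hypothetical script location; the surrounding script-runner and hive-script arguments follow the same pattern as the step above:

{
  "step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://my-example-bucket/hive/my-transform.q,-d,DYNAMODB_OUTPUT_TABLE=#{dynamoDBOutputTable},-d,S3_INPUT_BUCKET=#{s3MyS3Data}"
}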

Upload the Pipeline Definition

You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information,

see Install the Command Line Interface (p. 15)

To upload your pipeline definition, use the following command.

On Linux/Unix/Mac OS:

./datapipeline -–create pipeline_name -–put pipeline_file

On Windows: ruby datapipeline -–create pipeline_name -–put pipeline_file

Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.

If your pipeline validates successfully, you receive the following message:

Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline

definition pipeline_file.json uploaded.

Note

For more information about any errors returned by the –create command or other commands,

see Troubleshoot AWS Data Pipeline (p. 128)

.

Ensure that your pipeline appears in the pipeline list by using the following command.

On Linux/Unix/Mac OS:


./datapipeline --list-pipelines

On Windows: ruby datapipeline --list-pipelines

The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-AKIAIOSFODNN7EXAMPLE.

Activate the Pipeline

You must activate the pipeline, by using the --activate command-line parameter, before it will begin performing work. Use the following command.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

On Windows: ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

Verify the Pipeline Status

View the status of your pipeline and its components, along with its activity attempts and retries with the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

On Windows: ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.

Note

It is important to note the difference between the Scheduled Start date/time and the Started time.

It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.

Note

AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.

Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status.

Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with their own final status.

Verify Data Import

Next, verify that the data import occurred successfully using the Amazon DynamoDB console to inspect the data in the table.

To verify the data import

1.

Sign in to the AWS Management Console and open the Amazon DynamoDB console .

2.

On the Tables screen, click your Amazon DynamoDB table and click the Explore Table button.

3.

On the Browse Items tab, columns that correspond to the data input file should display, such as Id, Price, and ProductCategory. This indicates that the import operation from the file to the Amazon DynamoDB table occurred successfully.

Part Two: Export Data from Amazon DynamoDB

Topics

Before You Begin ... (p. 91)

Using the AWS Data Pipeline Console (p. 92)

Using the Command Line Interface (p. 98)

This is the second of a two-part tutorial that demonstrates how to bring together multiple AWS features to solve real-world problems in a scalable way through a common scenario: moving schema-less data in and out of Amazon DynamoDB using Amazon EMR and Hive. This tutorial involves the following concepts and procedures:


• Using the AWS Data Pipeline console and command-line interface (CLI) to create and configure pipelines

• Creating and configuring Amazon DynamoDB tables

• Creating and allocating work to Amazon EMR clusters

• Querying and processing data with Hive scripts

• Storing and accessing data using Amazon S3

Before You Begin ...

You must complete part one of this tutorial to ensure that your Amazon DynamoDB table contains the necessary data to perform the steps in this section. For more information, see Part One: Import Data into Amazon DynamoDB (p. 69).

Additionally, be sure you've completed the following steps:

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).

• Set up the AWS Data Pipeline tools and interface you plan on using. For more information about interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).

• Create an Amazon S3 bucket as a data output location. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Ensure that you have the Amazon DynamoDB table that was created and populated with data in part one of this tutorial. This table will be your data source for part two of the tutorial. For more information, see Part One: Import Data into Amazon DynamoDB (p. 69).

Be aware of the following:

• Imports may overwrite data in your Amazon DynamoDB table. When you import data from Amazon S3, the import may overwrite items in your Amazon DynamoDB table. Make sure that you are importing the right data and into the right table. Be careful not to accidentally set up a recurring import pipeline that will import the same data multiple times.

• Exports may overwrite data in your Amazon S3 bucket. When you export data to Amazon S3, you may overwrite previous exports if you write to the same bucket path. The default behavior of the Export DynamoDB to S3 template will append the job’s scheduled time to the Amazon S3 bucket path, which will help you avoid this problem.

• Import and export jobs will consume some of your Amazon DynamoDB table’s provisioned throughput capacity. This section explains how to schedule an import or export job using Amazon EMR. The Amazon EMR cluster will consume some read capacity during exports or write capacity during imports. You can control the percentage of the provisioned capacity that the import/export jobs consume with the settings MyImportJob.myDynamoDBWriteThroughputRatio and MyExportJob.myDynamoDBReadThroughputRatio (see the sketch after this list). Be aware that these settings determine how much capacity to consume at the beginning of the import/export process and will not adapt in real time if you change your table’s provisioned capacity in the middle of the process.

• Be aware of the costs. AWS Data Pipeline manages the import/export process for you, but you still pay for the underlying AWS services that are being used. The import and export pipelines will create Amazon EMR clusters to read and write data, and there are per-instance charges for each node in the cluster. You can read more about the details of Amazon EMR Pricing. The default cluster configuration is one m1.small instance master node and one m1.xlarge instance task node, though you can change this configuration in the pipeline definition. There are also charges for AWS Data Pipeline. For more information, see AWS Data Pipeline Pricing and Amazon S3 Pricing.
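If you work with a JSON pipeline definition instead of the console template, the corresponding throughput controls appear as fields on the EmrActivity objects. The following fragment is only an illustrative sketch; the field names come from the pipeline definitions shown later in this guide, and the 0.5 value is an assumed example rather than a recommended setting.

On an import activity (for example, MyImportJob):

"dynamoDBWritePercent": "0.5"

On an export activity (for example, MyExportJob):

"dynamoDBReadPercent": "0.25"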


Using the AWS Data Pipeline Console

Topics

Start Export from the Amazon DynamoDB Console (p. 92)

Create the Pipeline Definition using the AWS Data Pipeline Console (p. 93)

Create and Configure the Pipeline from a Template (p. 93)

Complete the Data Nodes (p. 94)

Complete the Resources (p. 95)

Complete the Activity (p. 95)

Complete the Notifications (p. 96)

Validate and Save Your Pipeline (p. 96)

Verify your Pipeline Definition (p. 96)

Activate your Pipeline (p. 97)

Monitor the Progress of Your Pipeline Runs (p. 97)

[Optional] Delete your Pipeline (p. 98)

The following topics explain how to perform the steps in part two of this tutorial using the AWS Data Pipeline console.

Start Export from the Amazon DynamoDB Console

You can begin the Amazon DynamoDB export operation from within the Amazon DynamoDB console.

To start the data export

1.

Sign in to the AWS Management Console and open the Amazon DynamoDB console .

2.

On the Tables screen, click your Amazon DynamoDB table and click the Export Table button.

3.

On the Import / Export Table screen, select Build a Pipeline. This opens the AWS Data Pipeline console so that you can choose a template to export the Amazon DynamoDB table data.

Create the Pipeline Definition using the AWS Data Pipeline Console

To create the new pipeline

1.

Sign in to the AWS Management Console and open the AWS Data Pipeline console or arrive at the

AWS Data Pipeline console through the Build a Pipeline button in the Amazon DynamoDB console.

2.

Click Create new pipeline.

3.

On the Create a New Pipeline page: a.

In the Pipeline Name box, enter a name (for example, CopyMyS3Data).

b.

In Pipeline Description, enter a description.

c.

Leave the Select Schedule Type button set to the default, Time Series Style Scheduling, for this tutorial. The schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

d.

Leave the Role boxes set to their default values for this tutorial, which are DataPipelineDefaultRole for the role and DataPipelineDefaultResourceRole for the resource role.

Note

If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.

e.

Click Create a new Pipeline.

Create and Configure the Pipeline from a Template

On the Pipeline screen, click Templates and select Export DynamoDB to S3. The AWS Data Pipeline console pre-populates a pipeline definition template with the base objects necessary to export data from Amazon DynamoDB.


Review the template and complete the missing fields. You start by choosing the schedule and frequency by which you want your data export operation to run.

To complete the schedule

• On the Pipeline screen, click Schedules.

a.

In the DefaultSchedule1 section, set Name to ExportSchedule.

b.

Set Period to 1 Hours.

c.

Set Start Date Time using the calendar to the current date, such as 2012-12-18, and the time to 00:00:00 UTC.

d.

In the Add an optional field .. box, select End Date Time.

e.

Set End Date Time using the calendar to the following day, such as 2012-12-19, and the time to 00:00:00 UTC.
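For reference, the schedule you just completed corresponds roughly to the following Schedule object in a JSON pipeline definition. This is a sketch only; the id and the dates mirror the console values entered above, and your own dates will differ.

{
  "id": "ExportSchedule",
  "type": "Schedule",
  "startDateTime": "2012-12-18T00:00:00",
  "endDateTime": "2012-12-19T00:00:00",
  "period": "1 hour"
}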

Complete the Data Nodes

Next, you complete the data node objects in your pipeline definition template.

To complete the Amazon DynamoDB data node

1.

On the Pipeline:

name of your pipeline

page, select DataNodes.

2.

In the DataNodes pane, in the Table Name box, type the name of the Amazon DynamoDB table that you created in part one of this tutorial; for example: MyTable.


To complete the Amazon S3 data node

• In the MyS3Data section, in the Directory Path field, type the path to the files where you want the Amazon DynamoDB table data to be written, which is the Amazon S3 bucket that you configured in part one of this tutorial. For example: s3://mybucket/output/MyTable.
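If you prefer to see how these console entries map to pipeline objects, the following is a rough JSON sketch of the two data nodes. The object ids and the field names (tableName, directoryPath) are assumptions based on the console labels above rather than an excerpt from the template, so treat this as illustrative only.

{
  "id": "MyDynamoDBData",
  "type": "DynamoDBDataNode",
  "schedule": { "ref": "ExportSchedule" },
  "tableName": "MyTable"
},
{
  "id": "MyS3Data",
  "type": "S3DataNode",
  "schedule": { "ref": "ExportSchedule" },
  "directoryPath": "s3://mybucket/output/MyTable"
}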

Complete the Resources

Next, you complete the resources that will run the data export activities. Many of the fields are auto-populated by the template. You only need to complete the empty fields.

To complete the resources

• On the Pipeline page, select Resources.

• In the Emr Log Uri box, type the path where you want to store the Amazon EMR debugging logs, using the Amazon S3 bucket that you configured in part one of this tutorial; for example: s3://mybucket/emr_debug_logs.

Complete the Activity

Next, you complete the activity that represents the steps to perform in your data export operation.

To complete the activity

1.

On the Pipeline:

name of your pipeline

page, select Activities.

2.

In the MyExportJob section, review the default options already provided. You are not required to manually configure any options in this section.


Complete the Notifications

Next, configure the SNS notification action AWS Data Pipeline must perform depending on the outcome of the activity.

To configure the SNS success, failure, and late notification action

1.

On the Pipeline:

name of your pipeline

page, in the right pane, click Others.

2.

In the Others pane:

a.

In the LateSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.

b.

In the FailureSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.

c.

In the SuccessSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.

If your pipeline is incomplete or is incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete, and you are getting the validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1.

On the Pipeline:

name of your pipeline

page, click Save Pipeline.

2.

AWS Data Pipeline validates your pipeline definition and returns either success or the error message.

3.

If you get an error message, click Close and then, in the right pane, click Errors.

4.

The Errors pane lists the objects failing validation.

Click the plus (+) sign next to the object names and look for an error message in red.

5.

When you see the error message, click the specific object pane where you see the error and fix it.

For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.

6.

After you've fixed the errors listed in the Errors pane, click Save Pipeline.

7.

Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.

Verify your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.


To verify your pipeline definition

1.

On the Pipeline:

name of your pipeline

page, click Back to list of pipelines.

2.

On the List Pipelines page, check if your newly-created pipeline is listed.

AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition.

The Status column in the row listing your pipeline should show PENDING.

3.

Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.

4.

In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.

5.

Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline

1.

On the List Pipelines page, in the Details column of your pipeline, click View pipeline.

2.

In the Pipeline:

name of your pipeline

page, click Activate.

Next, verify if your pipeline is running.

Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline

1.

On the List Pipelines page, in the Details column of your pipeline, click View instance details.

2.

The Instance details:

name of your pipeline

page lists the status of each object in your pipeline definition.

Note

If you do not see runs listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date, or click the Start (in UTC) date box and change it to an earlier date. Then click Update.

3.

If the Status column of all the objects in your pipeline indicates FINISHED, your pipeline has successfully completed the export activity. You should receive an email about the successful completion of this task, to the account you specified for receiving your Amazon SNS notification.

You can also check your Amazon S3 data target bucket to verify if the data was copied.

4.

If the Status column of any of your objects indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.

a.

To troubleshoot the failed or the incomplete runs

Click the triangle next to a run; the Instance summary panel opens to show the details of the selected run.

b.

Click View instance fields to see additional details of the run. If the status of your selected run is FAILED, the additional details box has an entry indicating the reason for failure; for example: @failureReason = Resource not healthy terminated.


c.

You can use the information in the Instance summary panel and the View instance fields box to troubleshoot issues with your failed pipeline.

For more information about the status listed for the runs, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

Important

Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing . If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1.

In the List Pipelines page, click the check box next to your pipeline.

2.

Click Delete.

3.

In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

Topics

Define the Export Pipeline in JSON Format (p. 98)

Schedule (p. 100)

Amazon S3 Data Node (p. 101)

Amazon EMR Cluster (p. 102)

Amazon EMR Activity (p. 102)

Upload the Pipeline Definition (p. 104)

Activate the Pipeline (p. 105)

Verify the Pipeline Status (p. 105)

Verify Data Export (p. 106)

The following topics explain how to perform the steps in this tutorial using the AWS Data Pipeline CLI.

Define the Export Pipeline in JSON Format

This example pipeline definition shows how to use AWS Data Pipeline to retrieve data from an Amazon DynamoDB table to populate a tab-delimited file in Amazon S3, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work.


Additionally, this pipeline will send Amazon SNS notifications if the pipeline succeeds, fails, or runs late.

This is the full pipeline definition JSON file followed by an explanation for each of its sections.

Note

We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.

{

"objects": [

{

"id": "MySchedule",

"type": "Schedule",

"startDateTime": "2012-11-22T00:00:00",

"endDateTime":"2012-11-23T00:00:00",

"period": "1 day"

},

{

"id": "MyS3Data",

"type": "S3DataNode",

"schedule": {

"ref": "MySchedule"

},

"filePath": "s3://output_bucket/ProductCatalog"

},

{

"id": "ExportCluster",

"type": "EmrCluster",

"masterInstanceType": "m1.small",

"instanceCoreType": "m1.xlarge",

"instanceCoreCount": "1",

"schedule": {

"ref": "MySchedule"

},

"enableDebugging": "true",

"emrLogUri": "s3://test_bucket/emr_logs"

},

{

"id": "MyExportJob",

"type": "EmrActivity",

"dynamoDBInputTable": "MyTable",

"dynamoDBReadPercent": "0.25",

"s3OutputBucket": "#{output.path}",

"lateAfterTimeout": "12 hours",

"attemptTimeout": "24 hours",

"maximumRetries": "0",

"output": {

"ref": "MyS3Data"

},

"runsOn": {

"ref": "ExportCluster"

},

"schedule": {

"ref": "MySchedule"

},

"onSuccess": {

"ref": "SuccessSnsAlarm"

},

"onFail": {


"ref": "FailureSnsAlarm"

},

"onLateAction": {

"ref": "LateSnsAlarm"

},

"step": "s3://elasticmapreduce/libs/script-runner/script-run ner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hiveversions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/exportDynamoD

BTableToS3,-d,DYNAMODB_INPUT_TABLE=#{dynamoDBInputTable},-d,S3_OUTPUT_BUCK

ET=#{s3OutputBucket}/#{format(@actualStartTime,'YYYY-MM-dd_hh.mm')},-d,DY

NAMODB_READ_PERCENT=#{dynamoDBReadPercent},-d,DYNAMODB_ENDPOINT=dynamodb.useast-1.amazonaws.com"

},

{

"id": "SuccessSnsAlarm",

"type": "SnsAlarm",

"topicArn": "arn:aws:sns:us-east-1:286198878708:mysnsnotify",

"role": "test-role",

"subject": "DynamoDB table '#{node.dynamoDBInputTable}' export succeeded",

"message": "DynamoDB table '#{node.dynamoDBInputTable}' export to S3 bucket '#{node.s3OutputBucket}' succeeded at #{node.@actualEndTime}. JobId:

#{node.id}"

},

{

"id": "LateSnsAlarm",

"type": "SnsAlarm",

"topicArn": "arn:aws:sns:us-east-1:286198878708:mysnsnotify",

"role": "test-role",

"subject": "DynamoDB table '#{node.dynamoDBInputTable}' export is taking

a long time!",

"message": "DynamoDB table '#{node.dynamoDBInputTable}' export to S3 bucket '#{node.s3OutputBucket}' has exceeded the late warning period

'#{node.lateAfterTimeout}'. JobId: #{node.id}"

},

{

"id": "FailureSnsAlarm",

"type": "SnsAlarm",

"topicArn": "arn:aws:sns:us-east-1:286198878708:mysnsnotify",

"role": "test-role",

"subject": "DynamoDB table '#{node.dynamoDBInputTable}' export failed!",

"message": "DynamoDB table '#{node.dynamoDBInputTable}' export to S3 bucket '#{node.s3OutputBucket}' failed. JobId: #{node.id}. Error: #{node.er rorMessage}."

}

]

}

Schedule

The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule and you may have more than one.

The Schedule component is defined by the following fields:


{

"id": "MySchedule",

"type": "Schedule",

"startDateTime": "2012-11-22T00:00:00",

"endDateTime":"2012-11-23T00:00:00",

"period": "1 day"

},

Note

In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.

Name

The user-defined name for the pipeline schedule, which is a label for your reference only.

Type

The pipeline component type, which is Schedule.

startDateTime

The date/time (in UTC format) that you want the task to begin.

endDateTime

The date/time (in UTC format) that you want the task to stop.

period

The time period that you want to pass between task attempts, even if the task occurs only one time.

The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline export operation can only run one time.
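As a variation (not part of this tutorial's definition), setting the period to 1 hour with the same start and end times would schedule 24 runs, one for each hour between startDateTime and endDateTime:

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-22T00:00:00",
  "endDateTime": "2012-11-23T00:00:00",
  "period": "1 hour"
}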

Amazon S3 Data Node

Next, the S3DataNode pipeline component defines a location for the output file; in this case a tab-delimited file in an Amazon S3 bucket location. The output S3DataNode component is defined by the following fields:

{

"id": "MyS3Data",

"type": "S3DataNode",

"schedule": {

"ref": "MySchedule"

},

"filePath": "s3://output_bucket/ProductCatalog"

},

Name

The user-defined name for the output location (a label for your reference only).

Type

The pipeline component type, which is "S3DataNode" to match the data output location, in an Amazon

S3 bucket.

Schedule

A reference to the schedule component that we created in the preceding lines of the JSON file labeled

“MySchedule”.

Path

The path to the data associated with the data node. This path is an empty Amazon S3 location where a tab-delimited output file will be written that has the contents of a sample product catalog in an Amazon DynamoDB table. The syntax for a data node is determined by its type. For example, a data node for a file in Amazon S3 follows a different syntax than a data node for a database table.

Amazon EMR Cluster

Next, the EmrCluster pipeline component defines an Amazon EMR cluster that processes and moves the data in this tutorial. The EmrCluster component is defined by the following fields:

{

"id": "ImportCluster",

"type": "EmrCluster",

"masterInstanceType": "m1.small",

"instanceCoreType": "m1.xlarge",

"instanceCoreCount": "1",

"schedule": {

"ref": "MySchedule"

},

"enableDebugging": "true",

"emrLogUri": "s3://test_bucket/emr_logs"

},

Name

The user-defined name for the Amazon EMR cluster (a label for your reference only).

Type

The computational resource type, which is an Amazon EMR cluster. For more information, see Overview of Amazon EMR in the Amazon EMR Developer Guide.

masterInstanceType

The type of Amazon EC2 instance to use as the master node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 Documentation.

instanceCoreType

The type of Amazon EC2 instance to use as the core node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 Documentation.

instanceCoreCount

The number of core Amazon EC2 instances to use in the Amazon EMR cluster.

Schedule

A reference to the schedule component that we created in the preceding lines of the JSON file, labeled “MySchedule”.

enableDebugging

Indicates whether to create detailed debug logs for the Amazon EMR job flow.

emrLogUri

Specifies an Amazon S3 location to store the Amazon EMR job flow debug logs if you enabled debugging with the previously-mentioned enableDebugging field.

Amazon EMR Activity

Next, the EmrActivity pipeline component brings together the schedule, resources, and data nodes to define the work to perform, the conditions under which to do the work, and the actions to perform when certain events occur. The EmrActivity component is defined by the following fields:

{

"id": "MyExportJob",

"type": "EmrActivity",


"dynamoDBInputTable": "MyTable",

"dynamoDBReadPercent": "0.25",

"s3OutputBucket": "#{output.path}",

"lateAfterTimeout": "12 hours",

"attemptTimeout": "24 hours",

"maximumRetries": "0",

"output": {

"ref": "MyS3Data"

},

"runsOn": {

"ref": "ExportCluster"

},

"schedule": {

"ref": "ExportPeriod"

},

"onSuccess": {

"ref": "SuccessSnsAlarm"

},

"onFail": {

"ref": "FailureSnsAlarm"

},

"onLateAction": {

"ref": "LateSnsAlarm"

},

"step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elast icmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,-args,-f,s3://elasticmapreduce/libs/hive/dynamodb/exportDynamoDBTableToS3,d,DYNAMODB_INPUT_TABLE=#{dynamoDBInputTable},-d,S3_OUTPUT_BUCKET=#{s3OutputBuck et}/#{format(@actualStartTime,'YYYY-MM-dd_hh.mm')},-d,DYNAMODB_READ_PERCENT=#{dy namoDBReadPercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"

},

Name

The user-defined name for the Amazon EMR activity (a label for your reference only).

Type

The EmrActivity pipeline component type, which creates an Amazon EMR job flow to perform the defined work. For more information, see Overview of Amazon EMR in the Amazon EMR Developer Guide.

dynamoDBInputTable

The Amazon DynamoDB table that the Amazon EMR job flow reads as the input for the Hive script.

dynamoDBReadPercent

Sets the rate of read operations to keep your Amazon DynamoDB provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusively. For more information, see Hive Options in the Amazon EMR Developer Guide.

s3OutputBucket

An expression that refers to the Amazon S3 location path for the output file defined by the S3DataNode labeled "MyS3Data".

lateAfterTimeout

The amount of time, after the schedule start time, that the activity can wait to start before AWS Data Pipeline considers it late.

attemptTimeout

The amount of time, after the schedule start time, that the activity has to complete before AWS Data Pipeline considers it failed.

maximumRetries

The maximum number of times that AWS Data Pipeline retries the activity.

output

The Amazon S3 location path for the output data, defined by the S3DataNode labeled "MyS3Data".

runsOn

A reference to the computational resource that will run the activity; in this case, an EmrCluster labeled "ExportCluster".

schedule

A reference to the schedule component that we created in the preceding lines of the JSON file, labeled “MySchedule”.

onSuccess

A reference to the action to perform when the activity is successful. In this case, it is to send an Amazon SNS notification.

onFail

A reference to the action to perform when the activity fails. In this case, it is to send an Amazon SNS notification.

onLateAction

A reference to the action to perform when the activity is late. In this case, it is to send an Amazon SNS notification.

step

Defines the steps for the EMR job flow to perform. This step calls a Hive script named exportDynamoDBTableToS3 that is provided by Amazon EMR and is specifically designed to move data from Amazon DynamoDB to Amazon S3. To perform more complex data transformation tasks, you would customize this Hive script and provide its name and path here. For more information about sample Hive scripts that show how to perform data transformation tasks, see Contextual Advertising using Apache Hive and Amazon EMR in AWS Articles and Tutorials.
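As with the import job, the step string is easier to inspect with one comma-separated argument per line. The following is the same value shown above, expanded only for readability; keep it as a single comma-separated string in the JSON file.

s3://elasticmapreduce/libs/script-runner/script-runner.jar
s3://elasticmapreduce/libs/hive/hive-script
--run-hive-script
--hive-versions
latest
--args
-f
s3://elasticmapreduce/libs/hive/dynamodb/exportDynamoDBTableToS3
-d
DYNAMODB_INPUT_TABLE=#{dynamoDBInputTable}
-d
S3_OUTPUT_BUCKET=#{s3OutputBucket}/#{format(@actualStartTime,'YYYY-MM-dd_hh.mm')}
-d
DYNAMODB_READ_PERCENT=#{dynamoDBReadPercent}
-d
DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com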

Upload the Pipeline Definition

You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).

To upload your pipeline definition, use the following command.

On Linux/Unix/Mac OS:

./datapipeline --create pipeline_name --put pipeline_file

On Windows: ruby datapipeline --create pipeline_name --put pipeline_file

Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.

If your pipeline validates successfully, you receive the following message:

Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline definition pipeline_file.json uploaded.

Note

For more information about any errors returned by the --create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).

Ensure that your pipeline appears in the pipeline list by using the following command.


On Linux/Unix/Mac OS:

./datapipeline --list-pipelines

On Windows: ruby datapipeline --list-pipelines

The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-AKIAIOSFODNN7EXAMPLE.

Activate the Pipeline

You must activate the pipeline, by using the --activate command-line parameter, before it will begin performing work. Use the following command.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

On Windows: ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

Verify the Pipeline Status

View the status of your pipeline and its components, along with its activity attempts and retries with the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

On Windows: ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.

Note

It is important to note the difference between the Scheduled Start date/time and the Started time.

It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.

Note

AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.

Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status.

Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with their own final status.

Verify Data Export

Next, verify that the data export occurred successfully by viewing the output file contents.

To view the export file contents

1.

Sign in to the AWS Management Console and open the Amazon S3 console .

2.

On the Buckets pane, click the Amazon S3 bucket that contains your file output (the example pipeline uses the output path s3://output_bucket/ProductCatalog) and open the output file with your preferred text editor. The output file name is an identifier value with no extension, such as this example: ae10f955-fb2f-4790-9b11-fbfea01a871e_000000.

3.

Using your preferred text editor, view the contents of the output file and ensure that there is delimited data that corresponds to the Amazon DynamoDB source table, such as Id, Price, and ProductCategory. This indicates that the export operation from Amazon DynamoDB to the output file occurred successfully.

Tutorial: Run a Shell Command to Process MySQL Table

This tutorial walks you through the process of creating a data pipeline that uses a script stored in an Amazon S3 bucket to process a MySQL table, writes the output to a comma-separated values (CSV) file in an Amazon S3 bucket, and then sends an Amazon SNS notification after the task completes successfully. You will use the Amazon EC2 instance resource provided by AWS Data Pipeline for this shell command activity.

The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each pipeline object. For more information on pipeline definitions, see Pipeline Definition (p. 2).

This tutorial uses the following objects to create a pipeline definition:

Activity

The activity that AWS Data Pipeline must perform for this pipeline. This tutorial uses the ShellCommandActivity to process the data in the MySQL table and write the output to a CSV file.

Schedule

The start date, time, and the duration for this activity. You can optionally specify the end date and time.

Resource

The resource that AWS Data Pipeline must use to perform this activity. This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to run a command for processing the data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates the instance after the task finishes.

DataNodes

The input and output nodes for this pipeline. This tutorial uses two input nodes and one output node. The first input node is the MySQLDataNode that contains the MySQL table. The second input node is the S3DataNode that contains the script. The output node is the S3DataNode for storing the CSV file.

Action

The action that AWS Data Pipeline must take when the specified conditions are met. This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the email address you specify after the task finishes successfully.


For more information about the additional objects and fields supported by the shell command activity, see ShellCommandActivity (p. 176).

The following steps outline how to create a data pipeline to run a script stored in an Amazon S3 bucket.

1. Create your pipeline definition

2. Create and configure the pipeline definition objects

3. Validate and save your pipeline definition

4. Verify that your pipeline definition is saved

5. Activate your pipeline

6. Monitor the progress of your pipeline

7. [Optional] Delete your pipeline

Before you begin ...

Be sure you've completed the following steps.

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).

• Set up the AWS Data Pipeline tools and interface you plan on using. For more information on interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).

• Create and launch a MySQL database instance as a data source. For more information, see Launch a DB Instance in the Amazon Relational Database Service (RDS) Getting Started Guide.

Note

Make a note of the user name and the password you used for creating the MySQL instance.

After you've launched your MySQL database instance, make a note of the instance's endpoint.

You will need all this information in this tutorial.

• Connect to your MySQL database instance, create a table, and then add test data values to the newly-created table.

For more information, see Create a Table in the MySQL documentation.

• Create an Amazon S3 bucket as a source for the script. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Create a script to read the data in the MySQL table, process the data, and then write the results to a CSV file; a minimal example script is sketched after this list. The script must run on an Amazon EC2 Linux instance.

Note

The AWS Data Pipeline computational resources (Amazon EMR job flow and Amazon EC2 instance) are not supported on Windows in this release.

• Upload your script to your Amazon S3 bucket. For more information, see Add an Object to a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Create another Amazon S3 bucket as a data target.

• Create an Amazon SNS topic for sending email notifications and make a note of the topic Amazon Resource Name (ARN). For more information on creating an Amazon SNS topic, see Create a Topic in the Amazon Simple Notification Service Getting Started Guide.


• [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your own IAM role policy and trust relationships, follow the instructions described in Granting Permissions to Pipelines with IAM (p. 21).
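The following is a minimal sketch of such a script. It assumes a hypothetical database name (mydb) along with the example endpoint, user name, password, and table name used later in this tutorial, and it simply converts the mysql client's tab-delimited output to comma-separated values. A real script would handle quoting, credentials, and error checking more carefully.

#!/bin/bash
# Minimal example processing script for this tutorial.
# All connection values below are placeholders; substitute your own.
# Reads every row from the source table and writes it as CSV to a local file.
mysql --host=mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com \
      --user=my_username --password=my_password \
      --batch --skip-column-names \
      --execute='SELECT * FROM `mysql-input-table`' mydb \
  | tr '\t' ',' > /tmp/mysql-output.csv

How the resulting file reaches the output Amazon S3 bucket depends on how you wire the activity's output data node; this sketch covers only the processing step.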

Note

Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier .

Using the AWS Data Pipeline Console

Topics

Create and Configure the Pipeline Definition Objects (p. 109)

Validate and Save Your Pipeline (p. 112)

Verify your Pipeline Definition (p. 113)

Activate your Pipeline (p. 113)

Monitor the Progress of Your Pipeline Runs (p. 114)

[Optional] Delete your Pipeline (p. 115)

The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.

To create your pipeline definition

1.

Sign in to the AWS Management Console and open the AWS Data Pipeline console .

2.

Click Create Pipeline.

3.

On the Create a New Pipeline page:

a.

In the Pipeline Name box, enter a name (for example, RunDailyScript).

b.

In Pipeline Description, enter a description.

c.

Leave the Select Schedule Type: button set to the default type for this tutorial.

Note

Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

d.

Leave the Role boxes set to their default values for this tutorial.

Note

If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.

e.

Click Create a new pipeline.

Create and Configure the Pipeline Definition Objects

Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.


1.

On the Pipeline:

name of your pipeline

page, click Add activity.

2.

In the Activities pane:

a.

Enter the name of the activity; for example, run-my-script.

b.

In the Type box, select ShellCommandActivity.

c.

In the Schedule box, select Create new: Schedule.

d.

In the Add an optional field .. box, select Script Uri.

e.

In the Script Uri box, enter the path to your uploaded script; for example, s3://my-script/myscript.txt.

f.

In the Add an optional field .. box, select Input.

g.

In the Input box, select Create new: DataNode.

h.

In the Add an optional field .. box, select Output.

i.

In the Output box, select Create new: DataNode.

j.

In the Add an optional field .. box, select RunsOn.

k.

In the Runs On box, select Create new: Resource.

l.

In the Add an optional field .. box, select On Success.

m. In the On Success box, select Create new: Action.

n.

In the left pane, separate the icons by dragging them apart.

You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline will use to perform the shell command activity.

The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connections between the various objects.
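For comparison with the CLI-based tutorials, the activity you just defined corresponds roughly to a ShellCommandActivity object like the following in a JSON pipeline definition. This is only a sketch that assumes the referenced objects are named as shown in this tutorial; the console generates its own identifiers.

{
  "id": "run-my-script",
  "type": "ShellCommandActivity",
  "schedule": { "ref": "run-mysql-script-schedule" },
  "scriptUri": "s3://my-script/myscript.txt",
  "input": { "ref": "MySQLTableInput" },
  "output": { "ref": "MySQLScriptOutput" },
  "runsOn": { "ref": "RunScriptInstance" },
  "onSuccess": { "ref": "RunDailyScriptNotice" }
}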

Next, configure run date and time for your pipeline.

To configure run date and time for your pipeline

1.

On the Pipeline:

name of your pipeline

page, in the right pane, click Schedules.

2.

In the Schedules pane: a.

Enter a schedule name for this activity (for example, run-mysql-script-schedule).

b.

In the Type box, select Schedule.


c.

In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.

Note

AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only.

d.

In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).

e.

[Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.

To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.

Next, configure the input and the output data nodes for your pipeline.

To configure the input and output data nodes of your pipeline

1.

On the Pipeline:

name of your pipeline

page, in the right pane, click DataNodes.

2.

In the DataNodes pane:

a.

In the DefaultDataNode1 Name box, enter the name for your MySQL data source node (for example, MySQLTableInput).

b.

In the Type box, select MySQLDataNode.

c.

In the Connection String box, enter the endpoint of your MySQL database instance (for example, mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com).

d.

In the Table box, enter the name of the source database table (for example, mysql-input-table).

e.

In the Schedule box, select run-mysql-script-schedule.

f.

In the *Password box, enter the password you used when you created your MySQL database instance.

g.

In the Username box, enter the user name you used when you created your MySQL database instance.

h.

In the DefaultDataNode2 Name box, enter the name for the data target node for your CSV file (for example, MySQLScriptOutput).

i.

In the Type box, select S3DataNode.

j.

In the Schedule box, select run-mysql-script-schedule.

k.

In the Add an optional field .. box, select File Path.

l.

In the File Path box, enter the path to your Amazon S3 bucket followed by the name of your CSV file (for example, s3://my-data-pipeline-output/name of your csv file).
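The two data nodes configured above map roughly to the following JSON objects. This is a sketch based on the console labels in this procedure (Connection String, Table, Username, *Password, File Path); the credentials and output.csv file name are placeholders, so substitute your own endpoint, credentials, and bucket path.

{
  "id": "MySQLTableInput",
  "type": "MySQLDataNode",
  "schedule": { "ref": "run-mysql-script-schedule" },
  "connectionString": "mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com",
  "table": "mysql-input-table",
  "username": "my_username",
  "*password": "my_password"
},
{
  "id": "MySQLScriptOutput",
  "type": "S3DataNode",
  "schedule": { "ref": "run-mysql-script-schedule" },
  "filePath": "s3://my-data-pipeline-output/output.csv"
}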

Next, configure the resource AWS Data Pipeline must use to run your script.

To configure the resource

1.

On the Pipeline:

name of your pipeline

page, in the right pane, click Resources.

2.

In the Resources pane:

a.

In the Name box, enter the name for your resource (for example, RunScriptInstance).


b.

In the Type box, select Ec2Resource.

c.

Leave the Resource Role and Role boxes set to the default values for this tutorial.

d.

In the Schedule box, select run-mysql-script-schedule.
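Expressed as JSON, the resource above is roughly equivalent to the following Ec2Resource object. This is a sketch; the role fields correspond to the default Resource Role and Role boxes left unchanged in this step.

{
  "id": "RunScriptInstance",
  "type": "Ec2Resource",
  "schedule": { "ref": "run-mysql-script-schedule" },
  "role": "DataPipelineDefaultRole",
  "resourceRole": "DataPipelineDefaultResourceRole"
}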

Next, configure the SNS notification action AWS Data Pipeline must perform after your script runs successfully.

To configure the SNS notification action

1.

On the Pipeline:

name of your pipeline

page, in the right pane, click Others.

2.

In the Others pane:

a.

In the DefaultAction1 Name box, enter the name for your Amazon SNS notification (for example, RunDailyScriptNotice).

b.

In the Type box, select SnsAlarm.

c.

In the Topic Arn box, enter the ARN of your Amazon SNS topic.

d.

In the Subject box, enter the subject line for your notification.

e.

In the Message box, enter the message content.

f.

Leave the entry in the Role box set to default.

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.

If your pipeline is incomplete or is incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete, and you are getting the validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1.

On the Pipeline:

name of your pipeline

page, click Save Pipeline.

2.

AWS Data Pipeline validates your pipeline definition and returns either success or the error message.

3.

If you get an error message, click Close and then, in the right pane, click Errors.

4.

The Errors pane lists the objects failing validation.

Click the plus (+) sign next to the object names and look for an error message in red.

5.

When you see the error message, click the specific object pane where you see the error and fix it.

For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.

6.

After you've fixed the errors listed in the Errors pane, click Save Pipeline.

7.

Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.


Verify your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition

1.

On the Pipeline:

name of your pipeline

page, click Back to list of pipelines.

2.

On the List Pipelines page, check if your newly-created pipeline is listed.

AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition.

The Status column in the row listing your pipeline should show PENDING.

3.

Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.

4.

In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.

5.

Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline

1.

On the List Pipelines page, in the Details column of your pipeline, click View pipeline.

2.

In the Pipeline:

name of your pipeline

page, click Activate.


A confirmation dialog box opens, confirming the activation.

3.

Click Close.

Next, verify if your pipeline is running.

Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline

1.

On the List Pipelines page, in the Details column of your pipeline, click View instance details.

2.

The Instance details:

name of your pipeline

page lists the status of each instance in your pipeline definition.

Note

If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date, or click the Start (in UTC) date box and change it to an earlier date. Then click Update.

3.

If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the shell command activity. You should receive an email about the successful completion of this task, to the account you specified for receiving your Amazon SNS notification.

You can also check your Amazon S3 data target bucket to verify if the data was processed.

4.

If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.

a.

To troubleshoot the failed or the incomplete instance runs

Click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.

b.

Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure; for example: @failureReason = Resource not healthy terminated.

c.

In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.

d.

In the Instance summary pane, click View attempt fields to see details of fields associated with the selected attempt.


5.

To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.

For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

Important

Your pipeline is running and incurring charges. For more information, see AWS Data Pipeline pricing . If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1.

In the List Pipelines page, click the check box next to your pipeline.

2.

Click Delete.

3.

In the confirmation dialog box, click Delete to confirm the delete request.


Manage Pipelines

Topics

Using AWS Data Pipeline Console (p. 116)

Using the Command Line Interface (p. 121)

You can use either the AWS Data Pipeline console or the AWS Data Pipeline command line interface (CLI) to view the details of your pipeline or to delete your pipeline.

Using AWS Data Pipeline Console

Topics

View pipeline definition (p. 116)

View details of each instance in an active pipeline (p. 117)

Modify pipeline definition (p. 119)

Delete a Pipeline (p. 121)

With the AWS Data Pipeline console, you can:

• View the pipeline definition of any pipeline associated with your account

• View the details of each instance in your pipeline and use the information to troubleshoot a failed instance run

• Modify pipeline definition

• Delete pipeline

The following sections walk you through the steps for managing your pipeline. Before you begin, be sure that you have at least one pipeline associated with your account, have access to the AWS Management Console, and have opened the AWS Data Pipeline console at https://console.aws.amazon.com/datapipeline/.

View pipeline definition

If you are signed in and have opened the AWS Data Pipeline console , your screen shows a list of pipelines associated with your account.


The Status column in the pipeline listing displays the current state of your pipelines. A pipeline is SCHEDULED if the pipeline definition has passed validation and is activated, is currently running, or has completed its run. A pipeline is PENDING if the pipeline definition is incomplete or might have failed the validation step that all pipelines go through before being saved.

If you want to modify or complete your pipeline definition, see Modify pipeline definition (p. 119).

To view the pipeline definition of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details (or click View pipeline, if PENDING).
2. If your pipeline is SCHEDULED:
   a. On the Instance details: name of your pipeline page, click View pipeline.
   b. The Pipeline: name of your pipeline [The pipeline is active.] page opens. This is your pipeline definition page. As indicated in the title of the page, this pipeline is active.

3. To view the object definitions in the pipeline definition, on the Pipeline: name of your pipeline page, click the object icons in the design pane. The corresponding object pane opens on the right panel.
4. You can also click the object panes on the right panel to view the objects and the associated fields.
5. If your pipeline definition graph does not fit in the design pane, use the pan buttons on the right side of the design pane to slide the canvas.
6. Click Back to list of pipelines to return to the List Pipelines page.

View details of each instance in an active pipeline

If you are signed in and have opened the AWS Data Pipeline console, you see a list of the pipelines associated with your account.


The Status column in the pipeline listing displays the current state of your pipelines. Your pipeline is active if the status is SCHEDULED. A pipeline is in the SCHEDULED state if the pipeline definition has passed validation and is activated, is currently running, or has completed its run. You can view the pipeline definition, the runs list, and the details of each run of an active pipeline. For information about modifying an active pipeline, see Modify pipeline definition (p. 119).

To retrieve the details of your active pipeline

1. On the List Pipelines page, identify your active pipeline, and then click the small triangle next to the pipeline ID.
2. In the Pipeline summary pane, click View fields to see additional information about your pipeline definition.
3. Click Close to close the View fields box, and then click the triangle of your active pipeline again to close the Pipeline summary pane.
4. In the row that lists your active pipeline, click View instance details.
5. The Instance details: name of your pipeline page lists all the instances of your active pipeline.
   Note: If you do not see the list of instances, click the End (in UTC) date box, change it to a later date, and then click Update.
6. You can also use the Filter Object, Start, or End date-time fields to filter the number of instances returned based on either their current status or the date range in which they were launched. Filtering the results is useful because, depending on the pipeline age and scheduling, the instance run history can be very large.
7. If the Status column of all the runs in your pipeline displays the FINISHED state, your pipeline has successfully completed running.


   If the Status column of any of your runs indicates a status other than FINISHED, your pipeline is either running, waiting for some precondition to be met, or has failed.
8. Click the triangle next to an instance to show the details of the selected instance.
9. In the Instance summary pane, click View instance fields to see details of the fields associated with the selected instance. If the status of your selected instance is FAILED, the additional details box has an entry indicating the reason for failure, for example, @failureReason = Resource not healthy terminated.
10. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
11. In the Instance summary pane, click View attempt fields to see details of the fields associated with the selected attempt.
12. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.
13. You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline. For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).
14. Click Back to list of pipelines to return to the List Pipelines page.

Modify pipeline definition

If your pipeline is in a PENDING state, either your pipeline definition is incomplete or your pipeline might have failed the validation step that all pipelines go through before saving. If your pipeline is active, you may need to change some aspect of it. However, if you are modifying the pipeline definition of an active pipeline, you must keep in mind the following rules:

• You cannot change the Default object.

• You cannot change the schedule of an object.

• You cannot change the dependencies between objects.

• You cannot add, delete, or modify reference fields for existing objects; only non-reference fields can be changed.

• New objects cannot reference a previously existing object in their output field; they can reference existing objects only in their input field.

Follow the steps in this section to either complete or modify your pipeline definition.


To modify your pipeline definition

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details (or click View pipeline, if PENDING).
2. If your pipeline is SCHEDULED:
   a. On the Instance details: name of your pipeline page, click View pipeline.
   b. The Pipeline: name of your pipeline [The pipeline is active.] page opens. This is your pipeline definition page. As indicated in the title of the page, this pipeline is active.

3. To complete or modify your pipeline definition:
   a. On the Pipeline: name of your pipeline page, click the object panes in the right side panel and finish defining the objects and fields of your pipeline definition.
      Note: If you are modifying an active pipeline, some fields are grayed out and inactive. You cannot modify those fields.
   b. Skip the next step and follow the steps to validate and save your pipeline definition.
4. To find and fix errors in your pipeline definition:
   a. On the Pipeline: name of your pipeline page, click the Errors pane. The Errors pane lists the objects of your pipeline that failed validation.
   b. Click the plus (+) sign next to the object names and look for an error message in red.
   c. Click the object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.

To validate and save your pipeline definition

1. Click Save Pipeline. AWS Data Pipeline validates your pipeline definition and returns either a success message or an error message.
2. If you get an Error! message, click Close and then, on the right side panel, click Errors to see the objects that did not pass validation. Fix the errors and save again. Repeat this step until your pipeline definition passes validation.


Activate and verify your pipeline

1. After you've saved your pipeline definition with no validation errors, click Activate.
2. To verify that your pipeline definition has been activated, click Back to list of pipelines.
3. On the List Pipelines page, check that your newly created pipeline is listed and that the Status column displays SCHEDULED.

Delete a Pipeline

When you no longer require a pipeline, such as a pipeline created during application testing, you should delete it to remove it from active use. Deleting a pipeline puts it into a deleting state. When the pipeline is in the deleted state, its pipeline definition and run history are gone. Therefore, you can no longer perform operations on the pipeline, including describing it.

You can't restore a pipeline after you delete it, so be sure that you won’t need the pipeline in the future before you delete it.

To delete your pipeline

1. On the List Pipelines page, select the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

Topics

Install the AWS Data Pipeline Command-Line Client (p. 122)

Command-Line Syntax (p. 122)

Setting Credentials for the AWS Data Pipeline Command Line Interface (p. 122)

List Pipelines (p. 124)

Create a New Pipeline (p. 124)

Retrieve Pipeline Details (p. 124)

View Pipeline Versions (p. 125)

Modify a Pipeline (p. 126)

Delete a Pipeline (p. 126)

The AWS Data Pipeline Command-Line Client (CLI) is one of three ways to interact with AWS Data Pipeline. The other two are using the AWS Data Pipeline console (a graphical user interface) and calling the APIs (calling functions in the AWS Data Pipeline SDK). For more information, see What is AWS Data Pipeline? (p. 1).


Install the AWS Data Pipeline Command-Line Client

Install the AWS Data Pipeline Command-Line Client (CLI) as described in Install the Command Line Interface (p. 15).

Command-Line Syntax

Use the AWS Data Pipeline CLI from your operating system's command-line prompt and type the CLI tool name "datapipeline" followed by one or more parameters. However, the syntax of the command is different on Linux/Unix/Mac OS compared to Windows. Linux/Unix/Mac users must use the "./" prefix for the CLI command and Windows users must specify "ruby" before the CLI command. For example, to view the CLI help text on Linux/Unix/Mac, the syntax is:

./datapipeline --help

However, to perform the same action on Windows, the syntax of the command is:

ruby datapipeline --help

Other than the prefix, the AWS Data Pipeline CLI syntax is the same between operating systems.

Note

For brevity, we do not list all the operating system syntax permutations for each example in this documentation. Instead, we simply refer to the commands like the following example.

datapipeline --help

Setting Credentials for the AWS Data Pipeline Command Line Interface

In order to connect to the AWS Data Pipeline web service to process your commands, the CLI needs the account details of an AWS account that has permissions to create and/or manage data pipelines. There are three ways to pass your credentials into the CLI:

• Implicitly, using a JSON file.

• Explicitly, by specifying a JSON file at the command line.

• Explicitly, by specifying credentials using a series of command-line options.

To Set Your Credentials Implicitly with a JSON File

• The easiest and most common way is implicitly, by creating a JSON file named credentials.json in either your home directory or the directory where the CLI is installed. For example, when you use the CLI on Windows, the folder might be c:\datapipeline-cli\amazon\datapipeline. When you do this, the CLI loads the credentials implicitly and you do not need to specify any credential information at the command line. Verify the credentials file syntax using the following example JSON file, where you replace the example access-id and private-key values with your own:

{
  "access-id": "AKIAIOSFODNN7EXAMPLE",
  "private-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  "endpoint": "datapipeline.us-east-1.amazonaws.com",
  "port": "443",
  "use-ssl": "true",
  "region": "us-east-1",
  "log-uri": "s3://myawsbucket/logfiles"
}

After setting your credentials file, test the CLI using the following command, which uses implicit credentials to call the CLI and display a list of all the data pipelines those credentials can access.

datapipeline --list-pipelines

To Set Your Credentials Explicitly with a JSON File

• The next easiest way to pass in credentials is explicitly, using the --credentials option to specify the location of a JSON file. With this method, you’ll have to add the --credentials option to each command-line call. This can be useful if you are SSHing into a machine to run the command-line client remotely, or if you are testing various sets of credentials. For information about what to include in the JSON file, go to How to Format the JSON File.

For example, the following command explicitly uses the credentials stored in a JSON file to call the CLI and display a list of all the data pipelines those credentials can access.

datapipeline --credentials /my-directory/my-credentials.json --list-pipelines

To Set Your Credentials Using Command-Line Options

• The final way to pass in credentials is to specify them using a series of options at the command line. This is the most verbose way to pass in credentials, but may be useful if you are scripting the CLI and want the flexibility of changing credential information without having to edit a JSON file. The options you'll use in this scenario are --access-key, --secret-key, and --endpoint.

For example, the following command explicitly uses credentials specified at the command line to call the CLI and display a list of all the data pipelines those credentials can access.

datapipeline --access-key my-access-key-id --secret-key my-secret-access-key --endpoint datapipeline.us-east-1.amazonaws.com --list-pipelines

In the preceding command, my-access-key-id would be replaced with your AWS Access Key ID, my-secret-access-key would be replaced with your AWS Secret Access Key, and --endpoint specifies the endpoint that the CLI should use when contacting the AWS Data Pipeline web service. For more information about how to locate your AWS security credentials, go to Locating Your AWS Security Credentials.

Note

Because you are passing in your credentials at every command-line call, you may wish to take additional security precautions to ensure the privacy of your command-line calls, such as clearing auto-complete when you are done with your terminal session. You should also avoid storing scripts that contain your credentials in an unsecured location.


List Pipelines

A simple example that also helps you confirm that your credentials are set correctly is to get a list of the currently running pipelines using the --list-pipelines command. This command returns the names and identifiers of all pipelines that you have permission to access.

datapipeline --list-pipelines

This is how you get the IDs of the pipelines that you want to work with using the CLI, because many commands require you to specify the pipeline ID using the --id parameter.

For more information, see --list-pipelines (p. 221).

Create a New Pipeline

The first step to create a new data pipeline is to define your data activities and their data dependencies using a pipeline definition file. The syntax and usage of the pipeline definition is described in Pipeline Definition Language Reference in the AWS Data Pipeline Developer's Guide.

Once you’ve written the details of the new pipeline using JSON syntax, save them to a text file with the extension .json. You’ll then specify this pipeline definition file as part of the input when creating the new pipeline.

After creating your pipeline definition file, you can create a new pipeline by calling the --create (p. 217) action of the AWS Data Pipeline CLI, as shown below.

datapipeline --create my-pipeline --put my-pipeline-file.json

If you leave off the --put option, as shown following, AWS Data Pipeline creates an empty pipeline. You can then use a subsequent --put call to attach a pipeline definition to the empty pipeline.

datapipeline --create pipeline_name

The --put parameter does not activate a pipeline by default. You must explicitly activate a pipeline before it will begin doing work, using the --activate command and specifying a pipeline ID as shown below.

datapipeline --activate --id pipeline_id

For more information about creating pipelines, see the --create (p. 217) and --put (p. 224) actions.

Retrieve Pipeline Details

Using the CLI, you can retrieve all the information about a pipeline, which includes the pipeline definition and the run attempt history of the pipeline components.

Retrieving the Pipeline Definition

To get the complete pipeline definition, use the --get command. The pipeline objects are returned in alphabetical order, not in the order they had in the pipeline definition file that you uploaded, and the slots for each object are also returned in alphabetical order.

You can specify an output file to receive the pipeline definition, but the default is to print the information to standard output (which is typically your terminal screen).


The following example prints the pipeline definition to a file named output.txt.

datapipeline --get --file output.txt --id df-00627471SOVYZEXAMPLE

The following example prints the pipeline definition to standard output (stdout).

datapipeline --get --id df-00627471SOVYZEXAMPLE

It's a good idea to retrieve the pipeline definition before you submit modifications, because it’s possible that another user or process changed the pipeline definition after you last worked with it. By downloading a copy of the current definition and using that as the basis for your modifications, you can be sure that you are working with the most recent pipeline definition.

It’s also a good idea to retrieve the pipeline definition again after you modify it to ensure that the update was successful.

For more information, see --get, --g (p. 219).

Retrieving the Pipeline Run History

To retrieve a history of the times that a pipeline has run, use the --list-runs command. This command has options that you can use to filter the number of runs returned based on either their current status or the date range in which they were launched. Filtering the results is useful because, depending on the pipeline's age and scheduling, the run history can be very large.

This example shows how to retrieve information for all runs.

datapipeline --list-runs --id df-00627471SOVYZEXAMPLE

This example shows how to retrieve information for all runs that have completed.

datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --status finished

This example shows how to retrieve information for all runs launched in the specified time frame.

datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --start-interval "2011-09-02", "2011-09-11"

For more information, see --list-runs (p. 222).

View Pipeline Versions

There are two versions of a pipeline that you can view with the CLI. The "active" version is the version of a pipeline that is currently running. The "latest" version is created when a user edits a running pipeline; it starts as a copy of the "active" version until you edit it. When you upload the edited pipeline, it becomes the "active" version and the previous "active" version is no longer accessible. A new "latest" version is created if you edit the pipeline again, repeating the previously described cycle.

To retrieve a specific version of a pipeline, use the --get command with the --version option, specifying the version name of the pipeline. For example, the following command retrieves the "active" version of a pipeline.

datapipeline --get --version active --id df-00627471SOVYZEXAMPLE
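To compare the running pipeline with your in-progress edits, you can retrieve the other version in the same way. The following is a minimal sketch that assumes the version name "latest" is accepted by the --version option in the same way that "active" is; the pipeline ID is the example ID used above.

datapipeline --get --version latest --id df-00627471SOVYZEXAMPLE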


For more information, see --get, --g (p. 219).

Modify a Pipeline

After you've created a pipeline, you may need to change some aspect of it. To do this, get the current pipeline definition and save it to a file, update the pipeline definition file, and then upload the updated pipeline definition to AWS Data Pipeline using the --put command.

The following rules apply when you modify a pipeline definition:

• You cannot change the Default object.

• You cannot change the schedule of an object.

• You cannot change the dependencies between objects.

• You cannot add, delete, or modify reference fields for existing objects; only non-reference fields can be changed.

• New objects cannot reference a previously existing object in their output field; they can reference existing objects only in their input field.

It's a good idea to retrieve the pipeline definition before you submit modifications, because it’s possible that another user or process changed the pipeline definition after you last worked with it. By downloading a copy of the current definition and using that as the basis for your modifications, you can be sure that you are working with the most recent pipeline definition.

The following example prints the pipeline definition to a file named output.txt.

datapipeline --get --file output.txt --id df-00627471SOVYZEXAMPLE

Update your pipeline definition file and save it as my-updated-file.txt. The following example uploads the updated pipeline definition.

datapipeline --put my-updated-file.txt --id df-00627471SOVYZEXAMPLE

You can retrieve the pipeline definition using --get to ensure that the update was successful.

When you use --put to replace the pipeline definition file, the previous pipeline definition is completely replaced. Currently, there is no way to change only a portion of a pipeline, such as a single object; you must include all previously defined objects in the updated pipeline definition.

For more information, see --put (p. 224) and --get, --g (p. 219).

Delete a Pipeline

When you no longer require a pipeline, such as a pipeline created during application testing, you should delete it to remove it from active use. Deleting a pipeline puts it into a deleting state. When the pipeline is in the deleted state, its pipeline definition and run history are gone. Therefore, you can no longer perform operations on the pipeline, including describing it.

You can't restore a pipeline after you delete it, so be sure that you won’t need the pipeline in the future before you delete it.

To delete a pipeline, use the --delete command, specifying the identifier of the pipeline. For example, the following command deletes a pipeline.

datapipeline --delete --id df-00627471SOVYZEXAMPLE


For more information, see --delete (p. 218).


Troubleshoot AWS Data Pipeline

Topics

Proactively Monitor Your Pipeline (p. 128)

Verify Your Pipeline Status (p. 129)

Interpret Pipeline Status Details (p. 129)

Error Log Locations (p. 130)

AWS Data Pipeline Problems and Solutions (p. 131)

When you have a problem with AWS Data Pipeline, the most common symptom is that a pipeline won't run. Because there are several possible causes, this topic explains how to track the status of your AWS Data Pipeline pipelines, get notifications when problems occur, and gather more information. After you have enough information to narrow the list of potential problems, this topic guides you to solutions. To get the most benefit from these troubleshooting steps and scenarios, you should use the console or CLI to gather the required information.

Proactively Monitor Your Pipeline

The best way to detect problems is to monitor your pipelines proactively from the start. You can configure pipeline components to inform you of certain situations or events, such as when a pipeline component fails or doesn't begin by its scheduled start time. AWS Data Pipeline makes it easy to configure notifications using Amazon SNS.

Using the AWS Data Pipeline CLI, you can configure a pipeline component to send Amazon SNS notifications on failures. Add the following code to your pipeline definition JSON file. This example also demonstrates how to use the AWS Data Pipeline expression language to insert details about the specific execution attempt denoted by the #{node.interval.start} and #{node.interval.end} variables:

Note

You must create an Amazon SNS topic to use for the Topic ARN value in the following example.

For more information, see the Create a Topic documentation at http://docs.aws.amazon.com/sns/latest/gsg/CreateTopic.html.

{
  "id" : "FailureNotify",
  "type" : "SnsAlarm",
  "subject" : "Failed to run pipeline component",
  "message": "Error for interval #{node.interval.start}..#{node.interval.end}.",
  "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
},

You must also associate the notification with the pipeline component that you want to monitor, as shown by the following example. In this example, using the onFail action, the component sends a notification if a file doesn't exist in an Amazon S3 bucket:

{
  "id": "S3Data",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filePath": "s3://mys3bucket/file.txt",
  "precondition": {"ref":"ExampleCondition"},
  "onFail": {"ref":"FailureNotify"}
},

Using the steps mentioned in the previous examples, you can also use the onLateAction and onSuccess pipeline component fields to notify you when a pipeline component has not started on time or has succeeded, respectively. You should configure notifications for any critical tasks in a pipeline. If you add notifications to the Default object in a pipeline, they automatically apply to all components in that pipeline. Pipeline components get the ability to send notifications through their IAM roles; do not modify the default IAM roles unless your situation demands it, otherwise notifications may not work.
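For example, the following is a minimal sketch of a Default object that attaches failure and late-start notifications to every component in the pipeline. It assumes that the SnsAlarm objects FailureNotify and LateNotify are defined elsewhere in the same pipeline definition; LateNotify is a hypothetical name, and treating onLateAction as a reference field like onFail is an assumption based on the description above.

{
  "id" : "Default",
  "onFail" : {"ref" : "FailureNotify"},
  "onLateAction" : {"ref" : "LateNotify"}
},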

Verify Your Pipeline Status

When you notice a problem with a pipeline, check the status of your pipeline components and look for error messages using the console or the CLI.

To locate your pipeline ID using the CLI, run this command:

ruby datapipeline --list-pipelines

After you have the pipeline ID, view the status of the pipeline components using the CLI. In this example, replace the example pipeline ID with your own:

ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

On the list of pipeline components, look at the status column of each component and pay special attention to any components that indicate a status of FAILED, WAITING_FOR_RUNNER, or CANCELLED.

Additionally, look at the Scheduled Start column and match it with a corresponding value for the Actual Start column to ensure that the tasks occur with the timing that you expect.
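To narrow the output to problem components, you can filter the run list by status, using the --status option shown earlier in this guide for finished runs. The following is a minimal sketch that assumes the example pipeline ID above and that "failed" is accepted as a status value for the filter.

ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE --status failed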

Interpret Pipeline Status Details

The various status levels displayed in the AWS Data Pipeline console and CLI indicate the condition of a pipeline and its components. Pipelines have a SCHEDULED status if they have passed validation and are ready, currently performing work, or done with their work. PENDING status means the pipeline is not able to perform work for some reason; for example, the pipeline definition might be incomplete or might have failed the validation step that all pipelines go through before activation. The pipeline status is simply an overview of a pipeline; to see more information, view the status of individual pipeline components.

You can do this by clicking through a pipeline in the console or retrieving pipeline component details using the CLI.

A pipeline component has the following available status values:

CHECKING_PRECONDITIONS

The component is checking to ensure that all its default and user-configured preconditions are met before performing its work.

WAITING_FOR_RUNNER
The component is waiting for its worker client to retrieve it as a work item. The component and worker client relationship is controlled by the runsOn or workerGroup field defined by that component.

CREATING

The component or resource is in the process of being started, such as an Amazon EC2 instance.

VALIDATING

The pipeline definition is in the process of being validated by AWS Data Pipeline.

RUNNING

The resource is running and ready to receive work.

CANCELLED
The component was pre-emptively cancelled by a user or AWS Data Pipeline before it could run. This can happen automatically when a failure occurs in a different component or resource that this component depends on.

PAUSED

The component has been paused and is not currently performing work.

FINISHED

The component has completed its assigned work.

SHUTTING_DOWN

The resource is shutting down after successfully performing its defined work.

FAILED

The component or resource encountered an error and stopped working. When a component or resource fails, it can cause cancellations and failures to cascade to other components that depend on it.

Error Log Locations

This section explains the various logs that AWS Data Pipeline writes that you can use to determine the source of certain failures and errors.

Task Runner Logs

Task Runner writes a log file named TaskRunner.log on the local computer, in the <your home directory>/output/logs directory (or the output/logs directory under the location where you extracted the AWS Data Pipeline CLI tools). In this directory, Task Runner also creates several nested directories that are named after the pipeline ID that it ran, with subdirectories for the year, month, day, and attempt number, in the format <pipeline ID>/<year>/<month>/<day>/<pipeline object attempt ID_Attempt=X>. In these folders, Task Runner writes three files:

<Pipeline Attempt ID>_Attempt_<number>_main.log.gz - This archive logs the step-by-step execution of Task Runner work items (both succeeded and failed) along with any error messages that were generated.

<Pipeline Attempt ID>_Attempt_<number>_stderr.log.gz - This archive logs only error messages that occurred while Task Runner processed tasks.


<Pipeline Attempt ID>_Attempt_<number>_stdout.log.gz - This log provides any standard output text if provided by certain tasks.
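Because these files are gzip archives, a quick way to scan them for problems is with zgrep. The following is a minimal sketch, assuming the default log location described above and an example pipeline ID; adjust the path and pipeline ID to match your own installation.

cd <your home directory>/output/logs
zgrep -i "error" df-EXAMPLE_PIPELINE_ID/*/*/*/*/*_stderr.log.gz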

Pipeline Logs

You can configure pipelines to create log files in a location that you specify. For example, you can use the Default object in a pipeline to cause all pipeline components to use that log location by default (you can override this by configuring a log location in a specific pipeline component).

To configure the log location using the AWS Data Pipeline CLI in a pipeline JSON file, begin your pipeline file with the following text:

{ "objects": [

{

"id":"Default",

"logUri":"s3://mys3bucket/error_logs"

},

...

After you configure a pipeline log directory, Task Runner creates a copy of the logs in your directory, with the same formatting and file names described in the previous section about Task Runner logs.

AWS Data Pipeline Problems and Solutions

This topic provides various symptoms of AWS Data Pipeline problems and the recommended steps to solve them.

Pipeline Stuck in Pending Status

A pipeline that appears stuck in the PENDING status indicates a fundamental error in the pipeline definition. Ensure that you did not receive any errors when you submitted your pipeline using the AWS Data Pipeline CLI or when you attempted to save or activate your pipeline using the AWS Data Pipeline console. Additionally, check that your pipeline has a valid definition.

To view your pipeline definition on the screen using the CLI:

ruby datapipeline --get --id df-EXAMPLE_PIPELINE_ID

Ensure that the pipeline definition is complete: check your closing braces, verify required commas, check for missing references, and look for other syntax errors. It is best to use a text editor that can visually validate the syntax of JSON files.
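One lightweight way to catch basic JSON syntax problems before submitting, assuming Python is available on your machine, is to run the pipeline definition file through Python's built-in JSON pretty-printer, which reports the location of the first syntax error it finds. The file name here is a placeholder for your own pipeline definition file.

python -m json.tool my-pipeline-file.json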

Pipeline Component Stuck in Waiting for Runner Status

If your pipeline is in the SCHEDULED state and one or more tasks appear stuck in the WAITING_FOR_RUNNER state, ensure that you set a valid value for either the runsOn or workerGroup fields for those tasks. If both values are empty or missing, the task cannot start because there is no association between the task and a worker to perform the tasks. In this situation, you've defined work but haven't defined what computer will do that work. If applicable, verify that the workerGroup value assigned to the pipeline component is exactly the same name and case as the workerGroup value that you configured for Task Runner.

Another potential cause of this problem is that the endpoint and access key provided to Task Runner are not the same as those used by the AWS Data Pipeline console or the computer where the AWS Data Pipeline CLI tools are installed. You might have created new pipelines with no visible errors, but Task Runner polls the wrong location due to the difference in credentials, or polls the correct location with insufficient permissions to identify and run the work specified by the pipeline definition.
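As a concrete check for the workerGroup case described above, the workerGroup string on the stuck component must match, character for character, the worker group that your Task Runner polls. The following is a minimal sketch of the relevant fields; the ids, the activity type, and the schedule reference are hypothetical.

{
  "id" : "MyCopyActivity",
  "type" : "CopyActivity",
  "schedule" : {"ref" : "MySchedule"},
  "workerGroup" : "myWorkerGroup"
},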

Pipeline Component Stuck in Checking Preconditions Status

If your pipeline is in the SCHEDULED state and one or more tasks appear stuck in the CHECKING_PRECONDITIONS state, make sure your pipeline's initial preconditions have been met. If the preconditions of the first object in the logic chain are not met, none of the objects that depend on that first object will be able to move out of the CHECKING_PRECONDITIONS state.

For example, consider the following excerpt from a pipeline definition. In this case, the InputData object has a precondition 'Ready' specifying that the data must exist before the InputData object is complete. If the data does not exist, the InputData object remains in the CHECKING_PRECONDITIONS state, waiting for the data specified by the path field to become available. Any objects that depend on InputData likewise remain in a CHECKING_PRECONDITIONS state waiting for the InputData object to reach the FINISHED state.

{
  "id": "InputData",
  "type": "S3DataNode",
  "filePath": "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
  "schedule":{"ref":"MySchedule"},
  "precondition": "Ready"
},
{
  "id": "Ready",
  "type": "Exists"
...

Also, check that your objects have the proper permissions to access the data. In the preceding example, if the information in the credentials field did not have permissions to access the data specified in the path field, the InputData object would get stuck in a CHECKING_PRECONDITIONS state because it cannot access the data specified by the path field, even if that data exists.

Run Doesn't Start When Scheduled

Check that you have properly specified the dates in your schedule objects and that the startDateTime and endDateTime values are in UTC format, such as in the following example:

{
  "id": "MySchedule",
  "startDateTime": "2012-11-12T19:30:00",
  "endDateTime":"2012-11-12T20:30:00",
  "period": "1 Hour",
  "type": "Schedule"
},


Pipeline Components Run in Wrong Order

You might notice that your pipeline components are starting and ending in the wrong order, or in a different sequence than you expect. It is important to understand that pipeline components can start running simultaneously if their preconditions are met at start-up time. In other words, pipeline components do not execute sequentially by default; if you need a specific execution order, you must control the execution order with preconditions and dependsOn fields. Verify that the dependsOn field is populated with a reference to the correct prerequisite pipeline components, and that all the necessary pointers between components are present to achieve the order you require.
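For example, the following minimal sketch forces 'ActivityB' to wait for 'ActivityA' by populating its dependsOn field with a reference. The activity ids, commands, and the 'MySchedule' reference are hypothetical names used only for illustration.

{
  "id" : "ActivityA",
  "type" : "ShellCommandActivity",
  "schedule" : {"ref" : "MySchedule"},
  "command" : "echo step one"
},
{
  "id" : "ActivityB",
  "type" : "ShellCommandActivity",
  "schedule" : {"ref" : "MySchedule"},
  "command" : "echo step two",
  "dependsOn" : {"ref" : "ActivityA"}
},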

EMR Cluster Fails With Error: The security token included in the request is invalid

Verify your IAM roles, policies, and trust relationships as described in Granting Permissions to Pipelines with IAM (p. 21).

Insufficient Permissions to Access Resources

Permissions that you set on IAM roles determine whether AWS Data Pipeline can access your EMR clusters and EC2 instances to run your pipelines. Additionally, IAM provides the concept of trust relationships that go further to allow creation of resources on your behalf. For example, when you create a pipeline that uses an EC2 instance to run a command to move data, AWS Data Pipeline can provision this EC2 instance for you. If you encounter problems, especially those involving resources that you can access manually but AWS Data Pipeline cannot, verify your IAM roles, policies, and trust relationships as described in Granting Permissions to Pipelines with IAM (p. 21).

Creating a Pipeline Causes a Security Token Error

You receive the following error when you try to create a pipeline:

Failed to create pipeline with 'pipeline_name'. Error: UnrecognizedClientException - The security token included in the request is invalid.

Cannot See Pipeline Details in the Console

The AWS Data Pipeline console pipeline filter applies to the scheduled start date for a pipeline, without regard to when the pipeline was submitted. It is possible to submit a new pipeline using a scheduled start date that occurs in the past, which the default date filter may not show. To see the pipeline details, change your date filter to ensure that the scheduled pipeline start date fits within the date range filter.

Error in remote runner Status Code: 404, AWS Service: Amazon S3

This error means that Task Runner could not access your files in Amazon S3. Verify that:

• Your credentials are set correctly

• The Amazon S3 bucket that you are trying to access exists

• You are authorized to access the Amazon S3 bucket


Access Denied - Not Authorized to Perform Function datapipeline:

In the Task Runner logs, you may see an error that is similar to the following:

• ERROR Status Code: 403

• AWS Service: DataPipeline

• AWS Error Code: AccessDenied

• AWS Error Message: User: arn:aws:sts::XXXXXXXXXXXX:federated-user/i-XXXXXXXX is not authorized to perform: datapipeline:PollForWork.

Note

In this error message, PollForWork may be replaced with the names of other AWS Data Pipeline permissions.

This error message indicates that the IAM role you specified needs additional permissions to interact with AWS Data Pipeline. Ensure that your IAM role policy contains the following lines, where PollForWork is replaced with the name of the permission you want to add (use * to grant all permissions):

{
  "Action": [ "datapipeline:PollForWork" ],
  "Effect": "Allow",
  "Resource": ["*"]
}


Pipeline Definition Files

Topics

Creating Pipeline Definition Files (p. 135)

Example Pipeline Definitions (p. 139)

Simple Data Types (p. 153)

Expression Evaluation (p. 155)

Objects (p. 161)

The AWS Data Pipeline web service receives a pipeline definition file as input. This file specifies objects for the data nodes, activities, schedules, and computational resources for the pipeline.

Creating Pipeline Definition Files

To create a pipeline definition file, you can use either the AWS Data Pipeline console interface or a text editor that supports saving files using the UTF-8 file format.

This topic describes creating a pipeline definition file using a text editor.

Topics

Prerequisites (p. 135)

General Structure of a Pipeline Definition File (p. 136)

Pipeline Objects (p. 136)

Pipeline fields (p. 136)

User-Defined Fields (p. 137)

Expressions (p. 138)

Saving the Pipeline Definition File (p. 139)

Prerequisites

Before you create your pipeline definition file, you should determine the following:

• Objectives and tasks you need to accomplish

• Location and format of your source data (data nodes) and how often you update them


• Calculations or changes to the data (activities) you need

• Dependencies and checks (preconditions) that indicate when tasks are ready to run

• Frequency (schedule) you need for the pipeline to run

• Validation tests to confirm your data reached the destination

• How you want to be notified about success and failure

• Performance, volume, and runtime goals that suggest using other AWS services like EMR to process your data

General Structure of a Pipeline Definition File

The first step in pipeline creation is to compose pipeline definition objects in a pipeline definition file. The following example illustrates the general structure of a pipeline definition file. This file defines two objects, which are delimited by '{' and '}', and separated by a comma. The first object defines two name-value pairs, known as fields. The second object defines three fields.

{
  "objects" : [
    {
      "name1" : "value1",
      "name2" : "value2"
    },
    {
      "name1" : "value3",
      "name3" : "value4",
      "name4" : "value5"
    }
  ]
}

Pipeline Objects

When creating a pipeline definition file, you must select the types of pipeline objects that you'll need, add them to the pipeline definition file, and then add the appropriate fields. For more information about pipeline objects, see Objects (p. 161).

For example, you could create a pipeline definition object for an input data node and another for the output data node. Then create another pipeline definition object for an activity, such as processing the input data using Amazon EMR.
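The following is a minimal sketch of that arrangement, using an input and an output S3DataNode plus a CopyActivity that references them; the ids and paths are hypothetical, and a Schedule object named 'MySchedule' is assumed to be defined elsewhere in the same file.

{
  "id" : "MyInputNode",
  "type" : "S3DataNode",
  "schedule" : {"ref" : "MySchedule"},
  "filePath" : "s3://mybucket/input/data.csv"
},
{
  "id" : "MyOutputNode",
  "type" : "S3DataNode",
  "schedule" : {"ref" : "MySchedule"},
  "filePath" : "s3://mybucket/output/data.csv"
},
{
  "id" : "MyCopyActivity",
  "type" : "CopyActivity",
  "schedule" : {"ref" : "MySchedule"},
  "input" : {"ref" : "MyInputNode"},
  "output" : {"ref" : "MyOutputNode"}
}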

Pipeline fields

After you know which object types to include in your pipeline definition file, you add fields to the definition of each pipeline object. Field names are enclosed in quotes, and are separated from field values by a space, a colon, and a space, as shown in the following example.

"name" : "value"

The field value can be a text string, a reference to another object, a function call, an expression, or an ordered list of any of the preceding types. For more information about the types of data that can be used for field values, see Simple Data Types (p. 153). For more information about functions that you can use to evaluate field values, see Expression Evaluation (p. 155).


Fields are limited to 2048 characters. Objects can be 20 KB in size, which means that you can't add many large fields to an object.

Each pipeline object must contain the following fields: id and type, as shown in the following example. Other fields may also be required based on the object type. Select a value for id that's meaningful to you and is unique within the pipeline definition. The value for type specifies the type of the object. Specify one of the supported pipeline definition object types, which are listed in the topic Objects (p. 161).

{
  "id": "MyCopyToS3",
  "type": "CopyActivity"
}

For more information about the required and optional fields for each object, see the documentation for the object.

To include fields from one object in another object, use the parent field with a reference to the object. For example, object "B" includes its fields, "B1" and "B2", plus the fields from object "A", "A1" and "A2".

{
  "id" : "A",
  "A1" : "value",
  "A2" : "value"
},
{
  "id" : "B",
  "parent" : {"ref" : "A"},
  "B1" : "value",
  "B2" : "value"
}

You can define common fields in an object named "Default". These fields are automatically included in every object in the pipeline definition file that doesn't explicitly set its parent field to reference a different object.

{
  "id" : "Default",
  "onFail" : {"ref" : "FailureNotification"},
  "maximumRetries" : "3",
  "workerGroup" : "myWorkerGroup"
}

User-Defined Fields

You can create user-defined or custom fields on your pipeline components and refer to them with expressions. The following example shows a custom field named "myCustomField" and a custom reference field named "my_customFieldReference" added to an S3DataNode:

{
  "id": "S3DataInput",
  "type": "S3DataNode",
  "schedule": {"ref": "TheSchedule"},
  "filePath": "s3://bucket_name",
  "myCustomField": "This is a custom value in a custom field.",
  "my_customFieldReference": {"ref":"AnotherPipelineComponent"}
},

A custom field must have a name prefixed with the word "my" in all lower-case letters, followed by a capital letter or an underscore character, as in the preceding example "myCustomField". A user-defined field can be either a string value or a reference to another pipeline component, as shown by the preceding example "my_customFieldReference".

Note

On user-defined fields, AWS Data Pipeline only checks for valid references to other pipeline components, not any custom field string values that you add.

Expressions

Expressions enable you to share a value across related objects. Expressions are processed by the AWS Data Pipeline web service at runtime, ensuring that all expressions are substituted with the value of the expression.

Expressions are delimited by "#{" and "}". You can use an expression in any pipeline definition object where a string is legal.

The following expression calls one of the AWS Data Pipeline functions. For more information, see Expression Evaluation (p. 155).

#{format(myDateTime,'YYYY-MM-dd hh:mm:ss')}

Referencing Fields and Objects

To reference a field on the current object in an expression, use the node keyword. This keyword is available with alarm and precondition objects.

In the following example, the path field references the id field in the same object to form a file name. The value of path evaluates to "s3://mybucket/ExampleDataNode.csv".

{
  "id" : "ExampleDataNode",
  "type" : "S3DataNode",
  "schedule" : {"ref" : "ExampleSchedule"},
  "filePath" : "s3://mybucket/#{node.id}.csv",
  "precondition" : {"ref" : "ExampleCondition"},
  "onFail" : {"ref" : "FailureNotify"}
}

You can use an expression to reference objects that include another object, such as an alarm or precondition object, using the node keyword. For example, the precondition object "ExampleCondition" is referenced by the previously described "ExampleDataNode" object, so "ExampleCondition" can reference field values of "ExampleDataNode" using the node keyword. In the following example, the value of path evaluates to "s3://mybucket/ExampleDataNode.csv".

{
  "id" : "ExampleCondition",
  "type" : "Exists"
}


Note

You can create pipelines that have dependencies, such as tasks in your pipeline that depend on the work of other systems or tasks. If your pipeline requires certain resources, add those dependencies to the pipeline using preconditions that you associate with data nodes and tasks so your pipelines are easier to debug and more resilient. Additionally, keep your dependencies within a single pipeline when possible, because cross-pipeline troubleshooting is difficult.

As another example, you can use an expression to refer to the date and time range created by a Schedule object. For example, the message field uses the @scheduledStartTime and @scheduledEndTime runtime fields from the Schedule object that is referenced by the data node or activity that references this object in its onFail field.

{
  "id" : "FailureNotify",
  "type" : "SnsAlarm",
  "subject" : "Failed to run pipeline component",
  "message": "Error for interval #{node.@scheduledStartTime}..#{node.@scheduledEndTime}.",
  "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
},

Saving the Pipeline Definition File

After you have completed the pipeline definition objects in your pipeline definition file, save the pipeline definition file using UTF-8 encoding.

You must submit the pipeline definition file to the AWS Data Pipeline web service. There are two primary ways to submit a pipeline definition file, using the AWS Data Pipeline command line interface or using the AWS Data Pipeline console.
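With the CLI, the submission typically looks like the following sketch, which reuses the --create, --put, and --activate commands documented earlier in this guide; the pipeline name, file name, and pipeline ID are placeholders that you replace with your own values.

datapipeline --create my-pipeline --put my-pipeline-file.json
datapipeline --activate --id pipeline_id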

Example Pipeline Definitions

This section contains a collection of example pipelines that you can quickly use for various scenarios, once you are familiar with AWS Data Pipeline. For more detailed, step-by-step instructions for creating and using pipelines, we recommend that you read one or more of the detailed tutorials available in this guide, for example Tutorial: Copy Data From a MySQL Table to Amazon S3 (p. 40) and Tutorial: Import/Export Data in Amazon DynamoDB With Amazon EMR and Hive (p. 69).

Copy SQL Data to a CSV File in Amazon S3

This example pipeline definition shows how to set a precondition or dependency on the existence of data gathered within a specific hour of time before copying the data (rows) from a table in a SQL database to a CSV (comma-separated values) file in an Amazon S3 bucket. The prerequisites and steps listed in this example pipeline definition are based on a MySQL database and table created using Amazon RDS.

Prerequisites

To set up and test this example pipeline definition, see Get Started with Amazon RDS to complete the following steps:

1. Sign up for Amazon RDS.
2. Launch a MySQL DB instance.
3. Authorize access.
4. Connect to the MySQL DB instance.

Note

The database name in this example, mydatabase, is the same as the one in the Amazon RDS Getting Started Guide.

After you connect to your MySQL DB instance, use the MySQL command line client to:

1. Use your test MySQL database and create a table named adEvents.

USE myDatabase;
CREATE TABLE IF NOT EXISTS adEvents (eventTime DATETIME, eventId INT, siteName VARCHAR(100));

2. Insert test data values into the newly created table named adEvents.

INSERT INTO adEvents (eventTime, eventId, siteName) values ('2012-06-17 10:00:00', 100, 'Sports');
INSERT INTO adEvents (eventTime, eventId, siteName) values ('2012-06-17 10:00:00', 200, 'News');
INSERT INTO adEvents (eventTime, eventId, siteName) values ('2012-06-17 10:00:00', 300, 'Finance');

Example Pipeline Definition

The eight pipeline definition objects in this example are defined with the objectives of having:

• A default object with values that can be used by all subsequent objects

• A precondition object that resolves to true when the data exists in a referencing object

• A schedule object that specifies beginning and ending dates and duration or period of time in a referencing object

• A data input or source object with MySQL connection information, query string, and referencing to the precondition and schedule objects

• A data output or destination object pointing to your specified Amazon S3 bucket

• An activity object for copying from MySQL to Amazon S3

• Amazon SNS notification objects used for signalling success and failure of a referencing activity object

{
  "objects" : [
    {
      "id" : "Default",
      "onFail" : {"ref" : "FailureNotify"},
      "onSuccess" : {"ref" : "SuccessNotify"},
      "maximumRetries" : "3",
      "workerGroup" : "myWorkerGroup"
    },
    {
      "id" : "Ready",
      "type" : "Exists"
    },
    {
      "id" : "CopyPeriod",
      "type" : "Schedule",
      "startDateTime" : "2012-06-13T10:00:00",
      "endDateTime" : "2012-06-13T11:00:00",
      "period" : "1 hour"
    },
    {
      "id" : "SqlTable",
      "type" : "MySqlDataNode",
      "schedule" : {"ref" : "CopyPeriod"},
      "table" : "adEvents",
      "username": "user_name",
      "*password": "my_password",
      "connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
      "selectQuery" : "select * from #{table} where eventTime >= '#{@scheduledStartTime.format('YYYY-MM-dd HH:mm:ss')}' and eventTime < '#{@scheduledEndTime.format('YYYY-MM-dd HH:mm:ss')}'",
      "precondition" : {"ref" : "Ready"}
    },
    {
      "id" : "OutputData",
      "type" : "S3DataNode",
      "schedule" : {"ref" : "CopyPeriod"},
      "filePath" : "s3://S3BucketNameHere/#{@scheduledStartTime}.csv"
    },
    {
      "id" : "mySqlToS3",
      "type" : "CopyActivity",
      "schedule" : {"ref" : "CopyPeriod"},
      "input" : {"ref" : "SqlTable"},
      "output" : {"ref" : "OutputData"},
      "onSuccess" : {"ref" : "SuccessNotify"}
    },
    {
      "id" : "SuccessNotify",
      "type" : "SnsAlarm",
      "subject" : "Pipeline component succeeded",
      "message": "Success for interval #{node.@scheduledStartTime}..#{node.@scheduledEndTime}.",
      "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
    },
    {
      "id" : "FailureNotify",
      "type" : "SnsAlarm",
      "subject" : "Failed to run pipeline component",
      "message": "Error for interval #{node.@scheduledStartTime}..#{node.@scheduledEndTime}.",
      "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
    }
  ]
}

Launch an Amazon EMR Job Flow

This example pipeline definition provisions an Amazon EMR cluster and runs a job step on that cluster one time a day, governed by the existence of the specified Amazon S3 path.


Example Pipeline Definition

The value for workerGroup should match the value that you specified for Task Runner. Replace myOutputPath and myLogPath with your own values.

{
  "objects" : [
    {
      "id" : "Default",
      "onFail" : {"ref" : "FailureNotify"},
      "maximumRetries" : "3",
      "workerGroup" : "myWorkerGroup"
    },
    {
      "id" : "Daily",
      "type" : "Schedule",
      "period" : "1 day",
      "startDateTime" : "2012-06-26T00:00:00",
      "endDateTime" : "2012-06-27T00:00:00"
    },
    {
      "id" : "InputData",
      "type" : "S3DataNode",
      "schedule" : {"ref" : "Daily"},
      "filePath" : "s3://myBucket/#{@scheduledEndTime.format('YYYY-MM-dd')}",
      "precondition" : {"ref" : "Ready"}
    },
    {
      "id" : "Ready",
      "type" : "S3DirectoryNotEmpty",
      "prefix" : "#{node.filePath}"
    },
    {
      "id" : "MyCluster",
      "type" : "EmrCluster",
      "masterInstanceType" : "m1.small",
      "schedule" : {"ref" : "Daily"},
      "enableDebugging" : "true",
      "logUri": "s3://myLogPath/logs"
    },
    {
      "id" : "MyEmrActivity",
      "type" : "EmrActivity",
      "input" : {"ref" : "InputData"},
      "schedule" : {"ref" : "Daily"},
      "onSuccess" : "SuccessNotify",
      "runsOn" : {"ref" : "MyCluster"},
      "preStepCommand" : "echo Starting #{id} for day #{@scheduledStartTime} >> /tmp/stepCommand.txt",
      "postStepCommand" : "echo Ending #{id} for day #{@scheduledStartTime} >> /tmp/stepCommand.txt",
      "step" : "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://myOutputPath/wordcount/output/,-mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,-reducer,aggregate"
    },
    {
      "id" : "SuccessNotify",
      "type" : "SnsAlarm",
      "subject" : "Pipeline component succeeded",
      "message": "Success for interval #{node.@scheduledStartTime}..#{node.@scheduledEndTime}.",
      "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
    },
    {
      "id" : "FailureNotify",
      "type" : "SnsAlarm",
      "subject" : "Failed to run pipeline component",
      "message": "Error for interval #{node.@scheduledStartTime}..#{node.@scheduledEndTime}.",
      "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
    }
  ]
}

Run a Script on a Schedule

This example pipeline definition runs an arbitrary command line script on a 'time-based' schedule. This is a 'time-based' schedule and not a 'dependency-based' schedule, in the sense that 'MyProcess' is scheduled to run based on clock time, not based on the availability of data sources that are inputs to 'MyProcess'. The schedule object 'Period' in this case defines a schedule used by activity 'MyProcess' such that 'MyProcess' is scheduled to execute every hour, beginning at startDateTime. The interval could instead be minutes, hours, days, weeks, or months, by changing the period field on object 'Period'.
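For example, a minimal variant of the 'Period' object that would run 'MyProcess' once a day instead of every hour might look like the following sketch; the dates are illustrative only.

{
  "id" : "Period",
  "type" : "Schedule",
  "period" : "1 day",
  "startDateTime" : "2012-01-13T20:00:00",
  "endDateTime" : "2012-01-20T20:00:00"
},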

Note

When a schedule's startDateTime

is in the past, AWS Data Pipeline backfills your pipeline and begins scheduling runs immediately beginning at startDateTime

. For testing/development, use a relatively short interval for startDateTime

..

endDateTime

. If not, AWS Data Pipeline attempts to queue up and schedule all runs of your pipeline for that interval.
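For example, a minimal Schedule for testing might limit the window to a single recent hour with a short period. This is a sketch only; the ID and dates are illustrative:

{
  "id" : "TestPeriod",
  "type" : "Schedule",
  "period" : "15 minutes",
  "startDateTime" : "2012-01-13T20:00:00",
  "endDateTime" : "2012-01-13T21:00:00"
}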

Example Pipeline Definition

{

"objects" : [

{

"id" : "Default",

"onFail" : {"ref" : "FailureNotify"},

"maximumRetries" : "3",

"workerGroup" : "myWorkerGroup"

},

{

"id" : "Period",

"type" : "Schedule",

"period" : "1 hour",

"startDateTime" : "2012-01-13T20:00:00",

"endDateTime" : "2012-01-13T21:00:00"

},

{

"id" : "MyProcess",

"type" : "ShellCommandActivity",

"onSuccess" : {"ref" : "SuccessNotify"},


"command" : "/home/myScriptPath/myScript.sh #{@scheduledStartTime}

#{@scheduledEndTime}",

"schedule": {"ref" : "Period"},

"stderr" : "/tmp/stderr:#{@scheduledStartTime}",

"stdout" : "/tmp/stdout:#{@scheduledStartTime}"

},

{

"id" : "SuccessNotify",

"type" : "SnsAlarm",

"subject" : "Pipeline component succeeded",

"message": "Success for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",

"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"

},

{

"id" : "FailureNotify",

"type" : "SnsAlarm",

"subject" : "Failed to run pipeline component",

"message": "Error for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",

"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"

}

]

}

Chain Multiple Activities and Roll Up Data

This example pipeline definition demonstrates the following:

• Chaining multiple activities in a graph of dependencies based on inputs and outputs.

• Rolling up data from smaller granularities (such as 15 minute buckets) into a larger granularity (such as 1 hour buckets).

This pipeline defines a schedule named 'CopyPeriod', which describes 15 minute time intervals originating at UTC time 2012-01-17T00:00:00 and a schedule named 'HourlyPeriod', which describes 1 hour time intervals originating at UTC time 2012-01-17T00:00:00.

'InputData' describes files of this form:

• s3://myBucket/demo/2012-01-17T00:00:00.csv

• s3://myBucket/demo/2012-01-17T00:15:00.csv

• s3://myBucket/demo/2012-01-17T00:30:00.csv

• s3://myBucket/demo/2012-01-17T00:45:00.csv

• s3://myBucket/demo/2012-01-17T01:00:00.csv

Every 15-minute interval (specified by @scheduledStartTime..@scheduledEndTime), activity 'CopyMinuteData' checks for the Amazon S3 file s3://myBucket/demo/#{@scheduledStartTime}.csv and, when it is found, copies the file to s3://myBucket/demo/#{@scheduledEndTime}.csv, per the definition of output object 'OutputMinuteData'.

Similarly, for every hour's worth of 'OutputMinuteData' Amazon S3 files found to exist (four 15-minute files in this case), activity 'CopyHourlyData' runs and writes the output to an hourly file defined by the expression s3://myBucket/demo/hourly/#{@scheduledEndTime}.csv in 'HourlyData'.


Finally, when the Amazon S3 file described by s3://myBucket/demo/hourly/#{@scheduledEndTime}.csv in 'HourlyData' is found to exist, AWS Data Pipeline runs the script described by activity 'ShellOut'.

Example Pipeline Definition

{

"objects" " [

{

"id" : "Default",

"onFail" : {"ref" : "FailureNotify"},

"maximumRetries" : "3",

"workerGroup" : "myWorkerGroup"

},

{

"id" : "CopyPeriod",

"type" : "Schedule",

"period" : "15 minutes",

"startDateTime" : "2012-01-17T00:00:00",

"endDateTime" : "2012-01-17T02:00:00"

},

{

"id" : "InputData",

"type" : "S3DataNode",

"schedule" : {"ref" : "CopyPeriod"},

"filePath" : "s3://myBucket/demo/#{@scheduledStartTime}.csv",

"precondition" : {"ref" : "Ready"}

},

{

"id" : "OutputMinuteData",

"type" : "S3DataNode",

"schedule" : {"ref" : "CopyPeriod"},

"filePath" : "s3://myBucket/demo/#{@scheduledEndTime}.csv"

},

{

"id" : "Ready",

"type" : "Exists",

},

{

"id" : "CopyMinuteData",

"type" : "CopyActivity",

"schedule" : {"ref" : "CopyPeriod"},

"input" : {"ref" : "InputData"},

"output" : {"ref" : "OutputMinuteData"}

},

{

"id" : "HourlyPeriod",

"type" : "Schedule",

"period" : "1 hour",

"startDateTime" : "2012-01-17T00:00:00",

"endDateTime" : "2012-01-17T02:00:00"

},

{

"id" : "CopyHourlyData",

"type" : "CopyActivity",

"schedule" : {"ref" : "HourlyPeriod"},

"input" : {"ref" : "OutputMinuteData"},

"output" : {"ref" : "HourlyData"}


},

{

"id" : "HourlyData",

"type" : "S3DataNode",

"schedule" : {"ref" : "HourlyPeriod"},

"filePath" : "s3://myBucket/demo/hourly/#{@scheduledEndTime}.csv"

},

{

"id" : "ShellOut",

"type" : "ShellCommandActivity",

"input" : {"ref" : "HourlyData"},

"command" : "/home/userName/xxx.sh #{@scheduledStartTime} #{@scheduledEnd

Time}",

"schedule" : {"ref" : "HourlyPeriod"},

"stderr" : "/tmp/stderr:#{@scheduledStartTime}",

"stdout" : "/tmp/stdout:#{@scheduledStartTime}",

"onSuccess" : {"ref" : "SuccessNotify"}

},

{

"id" : "SuccessNotify",

"type" : "SnsAlarm",

"subject" : "Pipeline component succeeded",

"message": "Success for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",

"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"

},

{

"id" : "FailureNotify",

"type" : "SnsAlarm",

"subject" : "Failed to run pipeline component",

"message": "Error for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",

"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"

}

]

}

Copy Data from Amazon S3 to MySQL

This example pipeline definition automatically creates an Amazon EC2 instance that copies the specified data from a CSV file in Amazon S3 into a MySQL database table. For simplicity, the structure of the example MySQL insert statement assumes that you have a CSV input file with two columns of data that you are writing into a MySQL database table that has two matching columns of the appropriate data type.

If your data has a different shape, modify the MySQL statement to account for additional data columns or different data types.
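For example, a three-column version of the insert statement might look like the following sketch; the column names are illustrative:

"insertQuery": "insert into #{table} (column1_name, column2_name, column3_name) values (?, ?, ?);"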

Example Pipeline Definition

{

"objects": [

{

"id": "Default",

"logUri": "s3://testbucket/error_log",

"schedule": {

"ref": "MySchedule"


}

},

{

"id": "MySchedule",

"type": "Schedule",

"startDateTime": "2012-11-26T00:00:00",

"endDateTime": "2012-11-27T00:00:00",

"period": "1 day"

},

{

"id": "MyS3Input",

"filePath": "s3://testbucket/input_data_file.csv",

"type": "S3DataNode"

},

{

"id": "MyCopyActivity",

"input": {

"ref": "MyS3Input"

},

"output": {

"ref": "MyDatabaseNode"

},

"type": "CopyActivity",

"runsOn": {

"ref": "MyEC2Resource"

}

},

{

"id": "MyEC2Resource",

"type": "Ec2Resource",

"actionOnTaskFailure": "terminate",

"actionOnResourceFailure": "retryAll",

"maximumRetries": "1",

"role": "test-role",

"resourceRole": "test-role",

"instanceType": "m1.medium",

"instanceCount": "1",

"securityGroups": [

"test-group",

"default"

],

"keyPair": "test-pair"

},

{

"id": "MyDatabaseNode",

"type": "MySqlDataNode",

"table": "table_name",

"username": "

user_name

",

"*password": "

my_password

",

"connectionString": "jdbc:mysql:/

/mysqlinstance

-rds.example.us-east-

1.rds.amazonaws.com:3306/

database_name

",

"insertQuery": "insert into #{table} (column1_ name, column2_name) values

(?, ?);"

}

]

}


This example has the following fields defined in the MySqlDataNode:

id

User-defined identifier for the MySQL database, which is a label for your reference only.

type

MySqlDataNode type that matches the kind of location for our data, which is an Amazon RDS instance using MySQL in this example.

table

Name of the database table that contains the data to copy. Replace table_name with the name of your database table.

username

User name of the database account that has sufficient permission to write data to the database table. Replace user_name with the name of your user account.

*password

Password for the database account with the asterisk prefix to indicate that AWS Data Pipeline must encrypt the password value. Replace my_password with the correct password for your user account.

connectionString

JDBC connection string for CopyActivity to connect to the database.

insertQuery

A valid SQL statement that inserts the copied data into the database table. Note that #{table} is a variable that re-uses the table name provided by the table field in the preceding lines of the JSON file.

Extract Apache Web Log Data from Amazon S3 using Hive

This example pipeline definition automatically creates an Amazon EMR cluster to extract data from Apache web logs in Amazon S3 to a CSV file in Amazon S3 using Hive.

Example Pipeline Definition

{

"objects": [

{

"startDateTime": "2012-05-04T00:00:00",

"id": "MyEmrResourcePeriod",

"period": "1 day",

"type": "Schedule",

"endDateTime": "2012-05-05T00:00:00"

},

{

"id": "MyHiveActivity",

"type": "HiveActivity",

"schedule": {

"ref": "MyEmrResourcePeriod"

},

"runsOn": {

"ref": "MyEmrResource"

},

"input": {

"ref": "MyInputData"

},

"output": {


"ref": "MyOutputData"

},

"hiveScript": "INSERT OVERWRITE TABLE ${output1} select host,user,time,request,status,size from ${input1};"

},

{

"schedule": {

"ref": "MyEmrResourcePeriod"

},

"masterInstanceType": "m1.small",

"coreInstanceType": "m1.small",

"enableDebugging": "true",

"keyPair": "test-pair",

"id": "MyEmrResource",

"coreInstanceCount": "1",

"actionOnTaskFailure": "continue",

"maximumRetries": "1",

"type": "EmrCluster",

"actionOnResourceFailure": "retryAll",

"terminateAfter": "10 hour"

},

{

"id": "MyInputData",

"type": "S3DataNode",

"schedule": {

"ref": "MyEmrResourcePeriod"

},

"directoryPath": "s3://test-hive/input-access-logs",

"dataFormat": {

"ref": "MyInputDataType"

}

},

{

"id": "MyOutputData",

"type": "S3DataNode",

"schedule": {

"ref": "MyEmrResourcePeriod"

},

"directoryPath": "s3://test-hive/output-access-logs",

"dataFormat": {

"ref": "MyOutputDataType"

}

},

{

"id": "MyOutputDataType",

"type": "Custom",

"columnSeparator": "\t",

"recordSeparator": "\n",

"column": [

"host STRING",

"user STRING",

"time STRING",

"request STRING",

"status STRING",

"size STRING"

]

},

{


"id": "MyInputDataType",

"type": "RegEx",

"inputRegEx": "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^

\"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^

\"]*|\"[^\"]*\"))?",

"outputFormat": "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s",

"column": [

"host STRING",

"identity STRING",

"user STRING",

"time STRING",

"request STRING",

"status STRING",

"size STRING",

"referer STRING",

"agent STRING"

]

}

]

}

Extract Amazon S3 Data (CSV/TSV) to Amazon S3 using Hive

This example pipeline definition creates an Amazon EMR cluster to extract comma-delimited (CSV) data from Amazon S3 to a CSV file in Amazon S3 using Hive.

Note

You can accommodate tab-delimited (TSV) data files similarly to how this sample demonstrates using comma-delimited (CSV) files, if you change the MyInputDataType and MyOutputDataType type field to "TSV" instead of "CSV".

Example Pipeline Definition

{

"objects": [

{

"startDateTime": "2012-05-04T00:00:00",

"id": "MyEmrResourcePeriod",

"period": "1 day",

"type": "Schedule",

"endDateTime": "2012-05-05T00:00:00"

},

{

"id": "MyHiveActivity",

"maximumRetries": "10",

"type": "HiveActivity",

"schedule": {

"ref": "MyEmrResourcePeriod"

},

"runsOn": {

"ref": "MyEmrResource"

},

"input": {

"ref": "MyInputData"


},

"output": {

"ref": "MyOutputData"

},

"hiveScript": "INSERT OVERWRITE TABLE ${output1} select * from ${input1};"

},

{

"schedule": {

"ref": "MyEmrResourcePeriod"

},

"masterInstanceType": "m1.small",

"coreInstanceType": "m1.small",

"enableDebugging": "true",

"keyPair": "test-pair",

"id": "MyEmrResource",

"coreInstanceCount": "1",

"actionOnTaskFailure": "continue",

"maximumRetries": "2",

"type": "EmrCluster",

"actionOnResourceFailure": "retryAll",

"terminateAfter": "10 hour"

},

{

"id": "MyInputData",

"type": "S3DataNode",

"schedule": {

"ref": "MyEmrResourcePeriod"

},

"directoryPath": "s3://test-hive/input",

"dataFormat": {

"ref": "MyInputDataType"

}

},

{

"id": "MyOutputData",

"type": "S3DataNode",

"schedule": {

"ref": "MyEmrResourcePeriod"

},

"directoryPath": "s3://test-hive/output",

"dataFormat": {

"ref": "MyOutputDataType"

}

},

{

"id": "MyOutputDataType",

"type": "CSV",

"column": [

"Name STRING",

"Age STRING",

"Surname STRING"

]

},

{

"id": "MyInputDataType",

"type": "CSV",

"column": [


"Name STRING",

"Age STRING",

"Surname STRING"

]

}

]

}

Extract Amazon S3 Data (Custom Format) to Amazon S3 using Hive

This example pipeline definition creates an Amazon EMR cluster to extract data from Amazon S3 with Hive, using a custom file format specified by the columnSeparator and recordSeparator fields.

Example Pipeline Definition

{

"objects": [

{

"startDateTime": "2012-05-04T00:00:00",

"id": "MyEmrResourcePeriod",

"period": "1 day",

"type": "Schedule",

"endDateTime": "2012-05-05T00:00:00"

},

{

"id": "MyHiveActivity",

"type": "HiveActivity",

"schedule": {

"ref": "MyEmrResourcePeriod"

},

"runsOn": {

"ref": "MyEmrResource"

},

"input": {

"ref": "MyInputData"

},

"output": {

"ref": "MyOutputData"

},

"hiveScript": "INSERT OVERWRITE TABLE ${output1} select * from ${input1};"

},

{

"schedule": {

"ref": "MyEmrResourcePeriod"

},

"masterInstanceType": "m1.small",

"coreInstanceType": "m1.small",

"enableDebugging": "true",

"keyPair": "test-pair",

"id": "MyEmrResource",

"coreInstanceCount": "1",

"actionOnTaskFailure": "continue",


"maximumRetries": "1",

"type": "EmrCluster",

"actionOnResourceFailure": "retryAll",

"terminateAfter": "10 hour"

},

{

"id": "MyInputData",

"type": "S3DataNode",

"schedule": {

"ref": "MyEmrResourcePeriod"

},

"directoryPath": "s3://test-hive/input",

"dataFormat": {

"ref": "MyInputDataType"

}

},

{

"id": "MyOutputData",

"type": "S3DataNode",

"schedule": {

"ref": "MyEmrResourcePeriod"

},

"directoryPath": "s3://test-hive/output-custom",

"dataFormat": {

"ref": "MyOutputDataType"

}

},

{

"id": "MyOutputDataType",

"type": "Custom",

"columnSeparator": ",",

"recordSeparator": "\n",

"column": [

"Name STRING",

"Age STRING",

"Surname STRING"

]

},

{

"id": "MyInputDataType",

"type": "Custom",

"columnSeparator": ",",

"recordSeparator": "\n",

"column": [

"Name STRING",

"Age STRING",

"Surname STRING"

]

}

]

}

Simple Data Types

The following types of data can be set as field values.


Topics

DateTime (p. 154)

Numeric (p. 154)

Expression Evaluation (p. 154)

Object References (p. 154)

Period (p. 154)

String (p. 155)

DateTime

AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only. The following example sets the startDateTime field of a Schedule object to 1/15/2012, 11:59 p.m., in the UTC/GMT time zone.

"startDateTime" : "2012-01-15T23:59:00"

Numeric

AWS Data Pipeline supports both integers and floating-point values.

Expression Evaluation

AWS Data Pipeline provides a set of functions that you can use to calculate the value of a field. For more information about these functions, see Expression Evaluation (p. 155). The following example uses the makeDate function to set the startDateTime field of a Schedule object to "2011-05-24T0:00:00" GMT/UTC.

"startDateTime" : "makeDate(2011,5,24)"

Object References

An object in the pipeline definition. This can either be the current object, the name of an object defined elsewhere in the pipeline, or an object that lists the current object in a field, referenced by the node keyword. For more information about node, see Referencing Fields and Objects (p. 138). For more information about the pipeline object types, see Objects (p. 161).
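For example, an explicit reference uses the ref keyword, and a precondition can refer back to the object that uses it through the node keyword. Both forms appear in the sample pipeline definitions earlier in this guide:

"schedule" : {"ref" : "CopyPeriod"},
"prefix" : "#{node.filePath}"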

Period

Indicates how often a scheduled event should run. It's expressed in the format "N [years | months | weeks | days | hours | minutes]", where N is a positive integer value. The minimum period is 15 minutes and the maximum period is 3 years.

The following example sets the period field of the Schedule object to 3 hours. This creates a schedule that runs every three hours.

"period" : "3 hours"


String

Standard string values. Strings must be surrounded by double quotes ("). You can use the backslash character (\) to escape characters in a string. Multiline strings are not supported.

The following are examples of valid string values for the id field.

"id" : "My Data Object"

"id" : "My \"Data\" Object"

Strings can also contain expressions that evaluate to string values. These are inserted into the string, and are delimited with "#{" and "}". The following example uses an expression to insert the name of the current object into a path.

"filePath" : "s3://myBucket/#{name}.csv"

For more information about using expressions, see Referencing Fields and Objects (p. 138) and Expression Evaluation (p. 155).

Expression Evaluation

The following functions are provided by AWS Data Pipeline. You can use them to evaluate field values.

Topics

Mathematical Functions (p. 155)

String Functions (p. 156)

Date and Time Functions (p. 156)

Mathematical Functions

The following functions are available for working with numerical values.

+
Addition.
Example: #{1 + 2}
Result: 3

-
Subtraction.
Example: #{1 - 2}
Result: -1

*
Multiplication.
Example: #{1 * 2}
Result: 2

/
Division. If you divide two integers, the result is truncated.
Example: #{1 / 2}
Result: 0
Example: #{1.0 / 2}
Result: .5

^
Exponent.
Example: #{2 ^ 2}
Result: 4.0

String Functions

The following functions are available for working with string values.

+
Concatenation. Non-string values are first converted to strings.
Example: #{"hel" + "lo"}
Result: "hello"

Date and Time Functions

The following functions are available for working with DateTime values. For the examples, the value of myDateTime is May 24, 2011 @ 5:10 pm GMT.

int minute(DateTime myDateTime)
Gets the minute of the DateTime value as an integer.
Example: #{minute(myDateTime)}
Result: 10

int hour(DateTime myDateTime)
Gets the hour of the DateTime value as an integer.
Example: #{hour(myDateTime)}
Result: 17

int day(DateTime myDateTime)
Gets the day of the DateTime value as an integer.
Example: #{day(myDateTime)}
Result: 24

int dayOfYear(DateTime myDateTime)
Gets the day of the year of the DateTime value as an integer.
Example: #{dayOfYear(myDateTime)}
Result: 144

int month(DateTime myDateTime)
Gets the month of the DateTime value as an integer.
Example: #{month(myDateTime)}
Result: 5

int year(DateTime myDateTime)
Gets the year of the DateTime value as an integer.
Example: #{year(myDateTime)}
Result: 2011

String format(DateTime myDateTime,String format)
Creates a String object that is the result of converting the specified DateTime using the specified format string.
Example: #{format(myDateTime,'YYYY-MM-dd hh:mm:ss z')}
Result: "2011-05-24T17:10:00 UTC"

DateTime inTimeZone(DateTime myDateTime,String zone)
Creates a DateTime object with the same date and time, but in the specified time zone, and taking daylight savings time into account. For more information about time zones, see http://joda-time.sourceforge.net/timezones.html.
Example: #{inTimeZone(myDateTime,'America/Los_Angeles')}
Result: "2011-05-24T10:10:00 America/Los_Angeles"

DateTime makeDate(int year,int month,int day)
Creates a DateTime object, in UTC, with the specified year, month, and day, at midnight.
Example: #{makeDate(2011,5,24)}
Result: "2011-05-24T0:00:00z"

DateTime makeDateTime(int year,int month,int day,int hour,int minute)
Creates a DateTime object, in UTC, with the specified year, month, day, hour, and minute.
Example: #{makeDateTime(2011,5,24,14,21)}
Result: "2011-05-24T14:21:00z"

DateTime midnight(DateTime myDateTime)
Creates a DateTime object for the next midnight, relative to the specified DateTime.
Example: #{midnight(myDateTime)}
Result: "2011-05-24T0:00:00z"

DateTime yesterday(DateTime myDateTime)
Creates a DateTime object for the previous day, relative to the specified DateTime. The result is the same as minusDays(1).
Example: #{yesterday(myDateTime)}
Result: "2011-05-23T17:10:00z"

DateTime sunday(DateTime myDateTime)
Creates a DateTime object for the previous Sunday, relative to the specified DateTime. If the specified DateTime is a Sunday, the result is the specified DateTime.
Example: #{sunday(myDateTime)}
Result: "2011-05-22 17:10:00 UTC"

DateTime firstOfMonth(DateTime myDateTime)
Creates a DateTime object for the start of the month in the specified DateTime.
Example: #{firstOfMonth(myDateTime)}
Result: "2011-05-01T17:10:00z"

DateTime minusMinutes(DateTime myDateTime,int minutesToSub)
Creates a DateTime object that is the result of subtracting the specified number of minutes from the specified DateTime.
Example: #{minusMinutes(myDateTime,1)}
Result: "2011-05-24T17:09:00z"

DateTime minusHours(DateTime myDateTime,int hoursToSub)
Creates a DateTime object that is the result of subtracting the specified number of hours from the specified DateTime.
Example: #{minusHours(myDateTime,1)}
Result: "2011-05-24T16:10:00z"

DateTime minusDays(DateTime myDateTime,int daysToSub)
Creates a DateTime object that is the result of subtracting the specified number of days from the specified DateTime.
Example: #{minusDays(myDateTime,1)}
Result: "2011-05-23T17:10:00z"

DateTime minusWeeks(DateTime myDateTime,int weeksToSub)
Creates a DateTime object that is the result of subtracting the specified number of weeks from the specified DateTime.
Example: #{minusWeeks(myDateTime,1)}
Result: "2011-05-17T17:10:00z"

DateTime minusMonths(DateTime myDateTime,int monthsToSub)
Creates a DateTime object that is the result of subtracting the specified number of months from the specified DateTime.
Example: #{minusMonths(myDateTime,1)}
Result: "2011-04-24T17:10:00z"

DateTime minusYears(DateTime myDateTime,int yearsToSub)
Creates a DateTime object that is the result of subtracting the specified number of years from the specified DateTime.
Example: #{minusYears(myDateTime,1)}
Result: "2010-05-24T17:10:00z"

DateTime plusMinutes(DateTime myDateTime,int minutesToAdd)
Creates a DateTime object that is the result of adding the specified number of minutes to the specified DateTime.
Example: #{plusMinutes(myDateTime,1)}
Result: "2011-05-24T17:11:00z"

DateTime plusHours(DateTime myDateTime,int hoursToAdd)
Creates a DateTime object that is the result of adding the specified number of hours to the specified DateTime.
Example: #{plusHours(myDateTime,1)}
Result: "2011-05-24T18:10:00z"

DateTime plusDays(DateTime myDateTime,int daysToAdd)
Creates a DateTime object that is the result of adding the specified number of days to the specified DateTime.
Example: #{plusDays(myDateTime,1)}
Result: "2011-05-25T17:10:00z"

DateTime plusWeeks(DateTime myDateTime,int weeksToAdd)
Creates a DateTime object that is the result of adding the specified number of weeks to the specified DateTime.
Example: #{plusWeeks(myDateTime,1)}
Result: "2011-05-31T17:10:00z"

DateTime plusMonths(DateTime myDateTime,int monthsToAdd)
Creates a DateTime object that is the result of adding the specified number of months to the specified DateTime.
Example: #{plusMonths(myDateTime,1)}
Result: "2011-06-24T17:10:00z"

DateTime plusYears(DateTime myDateTime,int yearsToAdd)
Creates a DateTime object that is the result of adding the specified number of years to the specified DateTime.
Example: #{plusYears(myDateTime,1)}
Result: "2012-05-24T17:10:00z"

Objects

This section describes the objects that you can use in your pipeline definition file.

Object Categories

The following is a list of AWS Data Pipeline objects by category.

Schedule

Schedule (p. 163)

Data node

S3DataNode (p. 165)

MySqlDataNode (p. 169)

Activity

ShellCommandActivity (p. 176)


CopyActivity (p. 180)

EmrActivity (p. 184)

Precondition

ShellCommandPrecondition (p. 192)

Exists (p. 194)

RdsSqlPrecondition (p. 203)

DynamoDBTableExists (p. 204)

DynamoDBDataExists (p. 204)

Computational resource

EmrCluster (p. 209)

Alarm

SnsAlarm (p. 213)

Object Hierarchy

The following is the object hierarchy for AWS Data Pipeline.

Important

You can only create objects of the types that are listed in the previous section.


Schedule

Defines the timing of a scheduled event, such as when an activity runs.

Note

When a schedule's startDateTime is in the past, AWS Data Pipeline will backfill your pipeline and begin scheduling runs immediately, starting at startDateTime. For testing and development, use a relatively short interval for startDateTime..endDateTime. Otherwise, AWS Data Pipeline attempts to queue up and schedule all runs of your pipeline for that interval.


Syntax

The following slots are included in all objects.

id
The ID of the object. IDs must be unique within a pipeline definition. Type: String. Required: Yes.

name
The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. Type: String. Required: No.

type
The type of object. Use one of the predefined AWS Data Pipeline object types. Type: String. Required: Yes.

parent
The parent of the object. Type: String. Required: No.

@sphere
The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. Type: String (read-only). Required: No.

This object includes the following fields.

startDateTime
The date and time to start the scheduled runs. Type: String, in DateTime or DateTimeWithZone format. Required: Yes.

endDateTime
The date and time to end the scheduled runs. The default behavior is to schedule runs until the pipeline is shut down. Type: String, in DateTime or DateTimeWithZone format. Required: No.

period
How often the pipeline should run. The format is "N [minutes|hours|days|weeks|months]", where N is a number followed by one of the time specifiers. For example, "15 minutes" runs the pipeline every 15 minutes. The minimum period is 15 minutes and the maximum period is 3 years. Type: String. Required: Yes.

@scheduledStartTime
The date and time that the scheduled run actually started. This value is added to the object by the schedule. By convention, activities treat the start as inclusive. Type: DateTime (read-only). Required: No.

@scheduledEndTime
The date and time that the scheduled run actually ended. This value is added to the object by the schedule. By convention, activities treat the end as exclusive. Type: DateTime (read-only). Required: No.


Example

The following is an example of this object type. It defines a schedule of every hour starting at 00:00:00 hours on 2012-09-01 and ending at 00:00:00 hours on 2012-10-01. The first period ends at 01:00:00 on 2012-09-01.

{

"id" : "Hourly",

"type" : "Schedule",

"period" : "1 hours",

"startDateTime" : "2012-09-01T00:00:00",

"endDateTime" : "2012-10-01T00:00:00"

}

S3DataNode

Defines a data node using Amazon S3.

Note

When you use an S3DataNode as input to a CopyActivity, only CSV data format is supported.

Syntax

The following slots are included in all objects.

id
The ID of the object. IDs must be unique within a pipeline definition. Type: String. Required: Yes.

name
The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. Type: String. Required: No.

type
The type of object. Use one of the predefined AWS Data Pipeline object types. Type: String. Required: Yes.

parent
The parent of the object. Type: String. Required: No.

@sphere
The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. Type: String (read-only). Required: No.

This object includes the following fields.

filePath
The path to the object in Amazon S3 as a URI, for example: s3://my-bucket/my-key-for-file. Type: String. Required: No.

directoryPath
Amazon S3 directory path as a URI: s3://my-bucket/my-key-for-directory. Type: String. Required: No.

compression
The type of compression for the data described by the S3DataNode. none is no compression and gzip is compressed with the gzip algorithm. This field is only supported when you use S3DataNode with a CopyActivity. Type: String. Required: No.

dataFormat
The format of the data described by the S3DataNode. This field is only supported when you use S3DataNode with a HiveActivity. Type: String. Required: Yes.
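For example, an S3DataNode that describes a gzip-compressed file for use as CopyActivity input might look like the following sketch; the bucket, key, and schedule names are illustrative:

{
  "id" : "CompressedInput",
  "type" : "S3DataNode",
  "schedule" : {"ref" : "CopyPeriod"},
  "filePath" : "s3://myBucket/demo/input.csv.gz",
  "compression" : "gzip"
}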

This object includes the following slots from the DataNode object.

onFail
An action to run when the current instance fails. Type: SnsAlarm (p. 213) object reference. Required: No.

onSuccess
An email alarm to use when the object's run succeeds. Type: SnsAlarm (p. 213) object reference. Required: No.

precondition
A condition that must be met for the data node to be valid. To specify multiple conditions, add multiple precondition slots. A data node is not ready until all its conditions are met. Type: Object reference. Required: No.

schedule
A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional. Type: Schedule (p. 163) object reference. Required: Yes.

scheduleType
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries". Required: No.

This object includes the following slots from RunnableObject.

workerGroup
The worker group. This is used for routing tasks. Type: String. Required: No.

retryDelay
The timeout duration between two retry attempts. The default is 10 minutes. Type: Period. Required: Yes.

maximumRetries
The maximum number of times to retry the action. Type: Integer. Required: No.

onLateNotify
The email alarm to use when the object's run is late. Type: SnsAlarm (p. 213) object reference. Required: No.

onLateKill
Indicates whether all pending or unscheduled tasks should be killed if they are late. Type: Boolean. Required: No.

lateAfterTimeout
The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. Type: Period. Required: No.

reportProgressTimeout
The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. Type: Period. Required: No.

attemptTimeout
The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. Type: Period. Required: No.

logUri
The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. Type: String. Required: No.

@reportProgressTime
The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. Type: DateTime. Required: No.

@activeInstances
Record of the currently scheduled instance objects. Type: Schedulable object reference. Required: No.

@lastRun
The last run of the object. This is a runtime slot. Type: DateTime (read-only). Required: No.

@scheduledPhysicalObjects
The currently scheduled instance objects. This is a runtime slot. Type: Object reference (read-only). Required: No.

@status
The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. Type: String (read-only). Required: No.

@triesLeft
The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. Type: Integer (read-only). Required: No.

@actualStartTime
The date and time that the scheduled run actually started. This is a runtime slot. Type: DateTime (read-only). Required: No.

@actualEndTime
The date and time that the scheduled run actually ended. This is a runtime slot. Type: DateTime (read-only). Required: No.

errorCode
If the object failed, the error code. This is a runtime slot. Type: String (read-only). Required: No.

errorMessage
If the object failed, the error message. This is a runtime slot. Type: String (read-only). Required: No.

errorStackTrace
If the object failed, the error stack trace. Type: String (read-only). Required: No.

@scheduledStartTime
The date and time that the run was scheduled to start. Type: DateTime. Required: No.

@scheduledEndTime
The date and time that the run was scheduled to end. Type: DateTime. Required: No.

@componentParent
The component from which this instance is created. Type: Object reference (read-only). Required: No.

@headAttempt
The latest attempt on the given instance. Type: Object reference (read-only). Required: No.

@resource
The resource instance on which the given activity/precondition attempt is being run. Type: Object reference (read-only). Required: No.

activityStatus
The status most recently reported from the activity. Type: String. Required: No.

This object includes the following slots from SchedulableObject.

schedule
A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. Type: Schedule (p. 163) object reference. Required: No.

scheduleType
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries". Required: No.

runsOn
The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. Type: Resource object reference. Required: No.

Example

The following is an example of this object type. This object references another object that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object.


{

"id" : "OutputData",

"type" : "S3DataNode",

"schedule" : {"ref" : "CopyPeriod"},

"filePath" : "s3://myBucket/#{@scheduledStartTime}.csv"

}

See Also

MySqlDataNode (p. 169)

MySqlDataNode

Defines a data node using MySQL.

Syntax

The following slots are included in all objects.

id
The ID of the object. IDs must be unique within a pipeline definition. Type: String. Required: Yes.

name
The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. Type: String. Required: No.

type
The type of object. Use one of the predefined AWS Data Pipeline object types. Type: String. Required: Yes.

parent
The parent of the object. Type: String. Required: No.

@sphere
The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. Type: String (read-only). Required: No.

This object includes the following slots from the SqlDataNode object.

table
The name of the table in the MySQL database. To specify multiple tables, add multiple table slots. Type: String. Required: Yes.

connectionString
The JDBC connection string to access the database. Type: String. Required: No.

selectQuery
A SQL statement to fetch data from the table. Type: String. Required: No.

insertQuery
A SQL statement to insert data into the table. Type: String. Required: No.


This object includes the following slots from the DataNode object.

onFail
An action to run when the current instance fails. Type: SnsAlarm (p. 213) object reference. Required: No.

onSuccess
An email alarm to use when the object's run succeeds. Type: SnsAlarm (p. 213) object reference. Required: No.

precondition
A condition that must be met for the data node to be valid. To specify multiple conditions, add multiple precondition slots. A data node is not ready until all its conditions are met. Type: Object reference. Required: No.

schedule
A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional. Type: Schedule (p. 163) object reference. Required: Yes.

scheduleType
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries". Required: No.

This object includes the following slots from RunnableObject: workerGroup, retryDelay, maximumRetries, onLateNotify, onLateKill, lateAfterTimeout, reportProgressTimeout, attemptTimeout, logUri, @reportProgressTime, @activeInstances, @lastRun, @scheduledPhysicalObjects, @status, @triesLeft, @actualStartTime, @actualEndTime, errorCode, errorMessage, errorStackTrace, @scheduledStartTime, @scheduledEndTime, @componentParent, @headAttempt, @resource, and activityStatus. These slots are described under S3DataNode (p. 165).

Example

The following is an example of this object type. This object references two other objects that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object and Ready is a precondition object.

{

"id" : "Sql Table",

"type" : "MySqlDataNode",

"schedule" : {"ref" : "CopyPeriod"},

"table" : "adEvents",

"username": "

user_name

",

"*password": "

my_password

",

"connectionString": "jdbc:mysql:/

/mysqlinstance

-rds.example.us-east-

1.rds.amazonaws.com:3306/

database_name

",

"selectQuery" : "select * from #{table} where eventTime >= '#{@scheduledStart

Time.format('YYYY-MM-dd HH:mm:ss')}' and eventTime < '#{@scheduledEnd

Time.format('YYYY-MM-dd HH:mm:ss')}'",

"precondition" : {"ref" : "Ready"}

}

See Also

S3DataNode (p. 165)

DynamoDBDataNode

Defines a data node using Amazon DynamoDB, which is specified as an input to a HiveActivity or EmrActivity.

Note

The DynamoDBDataNode does not support the Exists precondition.

Syntax

The following slots are included in all objects.

id
The ID of the object. IDs must be unique within a pipeline definition. Type: String. Required: Yes.

name
The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. Type: String. Required: No.

type
The type of object. Use one of the predefined AWS Data Pipeline object types. Type: String. Required: Yes.

parent
The parent of the object. Type: String. Required: No.

@sphere
The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. Type: String (read-only). Required: No.

This object includes the following fields.

tableName
The DynamoDB table. Type: String. Required: Yes.

This object includes the following slots from the DataNode object.

onFail
An action to run when the current instance fails. Type: SnsAlarm (p. 213) object reference. Required: No.

onSuccess
An email alarm to use when the object's run succeeds. Type: SnsAlarm (p. 213) object reference. Required: No.

precondition
A condition that must be met for the data node to be valid. To specify multiple conditions, add multiple precondition slots. A data node is not ready until all its conditions are met. Type: Object reference. Required: No.

schedule
A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional. Type: Schedule (p. 163) object reference. Required: Yes.

scheduleType
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries". Required: No.


This object includes the following slots from RunnableObject: workerGroup, retryDelay, maximumRetries, onLateNotify, onLateKill, lateAfterTimeout, reportProgressTimeout, attemptTimeout, logUri, @reportProgressTime, @activeInstances, @lastRun, @scheduledPhysicalObjects, @status, @triesLeft, @actualStartTime, @actualEndTime, errorCode, errorMessage, errorStackTrace, @scheduledStartTime, @scheduledEndTime, @componentParent, @headAttempt, @resource, and activityStatus. These slots are described under S3DataNode (p. 165).

This object includes the following slots from SchedulableObject.

schedule
A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. Type: Schedule (p. 163) object reference. Required: No.

scheduleType
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries". Required: No.

runsOn
The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. Type: Resource object reference. Required: No.


Example

The following is an example of this object type. This object references two other objects that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object and Ready is a precondition object.

{

"id" : "MyDynamoDBTable",

"type" : "DynamoDBDataNode",

"schedule" : {"ref" : "CopyPeriod"},

"tableName" : "adEvents",

"precondition" : {"ref" : "Ready"}

}

ShellCommandActivity

Runs a command or script. You can use ShellCommandActivity to run time-series or cron-like scheduled tasks.

When the stage field is set to true and used with an S3DataNode, ShellCommandActivity supports the concept of staging data, which means that you can move data from Amazon S3 to a stage location, such as Amazon EC2 or your local environment, perform work on the data using scripts and the ShellCommandActivity, and move it back to Amazon S3. In this case, when your shell command is connected to an input S3DataNode, your shell scripts operate directly on the data using ${input1}, ${input2}, and so on, referring to the ShellCommandActivity input fields. Similarly, output from the shell command can be staged in an output directory to be automatically pushed to Amazon S3, referred to by ${output1}, ${output2}, and so on. These expressions can pass as command-line arguments to the shell command for you to use in data transformation logic.
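For example, a staged shell command might look like the following sketch, based on the fields described below; the command, data node IDs, and schedule are illustrative:

{
  "id" : "TransformData",
  "type" : "ShellCommandActivity",
  "schedule" : {"ref" : "CopyPeriod"},
  "input" : {"ref" : "StagedInputData"},
  "output" : {"ref" : "StagedOutputData"},
  "stage" : "true",
  "command" : "grep -v '^#' ${input1} > ${output1}"
}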

Syntax

The following slots are included in all objects.

id
The ID of the object. IDs must be unique within a pipeline definition. Type: String. Required: Yes.

name
The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. Type: String. Required: No.

type
The type of object. Use one of the predefined AWS Data Pipeline object types. Type: String. Required: Yes.

parent
The parent of the object. Type: String. Required: No.

@sphere
The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. Type: String (read-only). Required: No.


This object includes the following fields.

command
The command to run. This value and any associated parameters must function in the environment from which you are running the Task Runner. Type: String. Required: Yes.

stdout
The file that receives redirected output from the command that is run. Type: String. Required: No.

stderr
The file that receives redirected system error messages from the command that is run. Type: String. Required: No.

input
The input data source. To specify multiple data sources, add multiple input fields. Type: Data node object reference. Required: No.

output
The location for the output. To specify multiple locations, add multiple output fields. Type: Data node object reference. Required: No.

scriptUri
An Amazon S3 URI path for a file to download and run as a shell command. Only one scriptUri or command field should be present. Type: A valid S3 URI. Required: No.

stage
Determines whether staging is enabled and allows your shell commands to have access to the staged-data variables, such as ${input1} and ${output1}. Type: Boolean. Required: No.

Note

You must specify a command value or a scriptUri value, but you do not need to specify both.
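For example, the same kind of activity could instead point at a script stored in Amazon S3. This is a sketch only; the bucket, script name, and schedule are illustrative:

{
  "id" : "RunMyScript",
  "type" : "ShellCommandActivity",
  "schedule" : {"ref" : "CopyPeriod"},
  "scriptUri" : "s3://myBucket/scripts/myScript.sh"
}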

This object includes the following slots from the Activity object.

onFail
An action to run when the current instance fails. Type: SnsAlarm (p. 213) object reference. Required: No.

onSuccess
An email alarm to use when the object's run succeeds. Type: SnsAlarm (p. 213) object reference. Required: No.

precondition
A condition that must be met before the object can run. To specify multiple conditions, add multiple precondition slots. The activity cannot run until all its conditions are met. Type: Object reference. Required: No.

schedule
A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional. Type: Schedule (p. 163) object reference. Required: Yes.


This object includes the following slots from RunnableObject: workerGroup, retryDelay, maximumRetries, onLateNotify, onLateKill, lateAfterTimeout, reportProgressTimeout, attemptTimeout, logUri, @reportProgressTime, @activeInstances, @lastRun, @scheduledPhysicalObjects, @status, @triesLeft, @actualStartTime, @actualEndTime, errorCode, errorMessage, errorStackTrace, @scheduledStartTime, @scheduledEndTime, @componentParent, @headAttempt, @resource, and activityStatus. These slots are described under S3DataNode (p. 165).

This object includes the following slots from SchedulableObject.

schedule - A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. (Type: Schedule (p. 163) object reference; Required: No)
scheduleType - Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries". (Required: No)
runsOn - The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. (Type: Resource object reference; Required: No)


Example

The following is an example of this object type.

{
  "id" : "CreateDirectory",
  "type" : "ShellCommandActivity",
  "command" : "mkdir new-directory"
}
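The following sketch illustrates how scriptUri, stage, and the staged-data variables described above might be combined; the script location, data node references, and field values are placeholders, not required settings.

{
  "id" : "TransformStagedData",
  "type" : "ShellCommandActivity",
  "scriptUri" : "s3://my-example-bucket/scripts/transform.sh",
  "stage" : "true",
  "input" : {"ref" : "MyInputDataNode"},
  "output" : {"ref" : "MyOutputDataNode"}
}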

See Also

CopyActivity (p. 180)

EmrActivity (p. 184)

CopyActivity

Copies data from one location to another. The copy operation is performed record by record.

Important

When you use an S3DataNode as input for CopyActivity, you can only use a Unix/Linux variant of the CSV data file format, which means that CopyActivity has specific limitations to its CSV support:

• The separator must be the "," (comma) character.

• The records will not be quoted.

• The default escape character will be ASCII value 92 (backslash).

• The end of record identifier will be ASCII value 10 (or "\n").

Warning

Windows-based systems typically use a different end of record character sequence: a carriage return and line feed together (ASCII value 13 and ASCII value 10). You must accommodate this difference using an additional mechanism, such as a pre-copy script to modify the input data, to ensure that CopyActivity can properly detect the end of a record; otherwise, the CopyActivity will fail repeatedly.

Warning

Additionally, you may encounter repeated CopyActivity failures if you supply compressed data files as input, but do not specify this using the compression field. In this case, CopyActivity will not properly detect the end of record character and the operation will fail.
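One way to accommodate Windows-formatted input, as the warning above suggests, is to run a ShellCommandActivity ahead of the copy that strips carriage returns. The following is an illustrative sketch only; the file paths are placeholders, and in practice the command would point at wherever your input data is actually staged.

{
  "id" : "NormalizeLineEndings",
  "type" : "ShellCommandActivity",
  "command" : "tr -d '\\r' < /tmp/windows-input.csv > /tmp/unix-input.csv"
}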

Syntax

The following slots are included in all objects.

id - The ID of the object. IDs must be unique within a pipeline definition. (Type: String; Required: Yes)
name - The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String; Required: No)
type - The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String; Required: Yes)
parent - The parent of the object. (Type: String; Required: No)
@sphere - The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String, read-only; Required: No)

This object includes the following fields.

input - The input data source. To specify multiple data sources, add multiple input fields. (Type: Data node object reference; Required: Yes)
output - The location for the output. To specify multiple locations, add multiple output fields. (Type: Data node object reference; Required: Yes)

This object includes the following slots from the Activity object.

onFail - An action to run when the current instance fails. (Type: SnsAlarm (p. 213) object reference; Required: No)
onSuccess - An email alarm to use when the object's run succeeds. (Type: SnsAlarm (p. 213) object reference; Required: No)
precondition - A condition that must be met before the object can run. To specify multiple conditions, add multiple precondition slots. The activity cannot run until all its conditions are met. (Type: Object reference; Required: No)
schedule - A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional. (Type: Schedule (p. 163) object reference; Required: Yes)


This object includes the following slots from RunnableObject.

workerGroup - The worker group. This is used for routing tasks. (Type: String; Required: No)
retryDelay - The timeout duration between two retry attempts. The default is 10 minutes. (Type: Period; Required: Yes)
maximumRetries - The maximum number of times to retry the action. (Type: Integer; Required: No)
onLateNotify - The email alarm to use when the object's run is late. (Type: SnsAlarm (p. 213) object reference; Required: No)
onLateKill - Indicates whether all pending or unscheduled tasks should be killed if they are late. (Type: Boolean; Required: No)
lateAfterTimeout - The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. (Type: Period; Required: No)
reportProgressTimeout - The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. (Type: Period; Required: No)
attemptTimeout - The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. (Type: Period; Required: No)
logUri - The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. (Type: String; Required: No)
@reportProgressTime - The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. (Type: DateTime; Required: No)
@activeInstances - Record of the currently scheduled instance objects. (Type: Schedulable object reference; Required: No)
@lastRun - The last run of the object. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@scheduledPhysicalObjects - The currently scheduled instance objects. This is a runtime slot. (Type: Object reference, read-only; Required: No)
@status - The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. (Type: String, read-only; Required: No)
@triesLeft - The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. (Type: Integer, read-only; Required: No)
@actualStartTime - The date and time that the scheduled run actually started. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@actualEndTime - The date and time that the scheduled run actually ended. This is a runtime slot. (Type: DateTime, read-only; Required: No)
errorCode - If the object failed, the error code. This is a runtime slot. (Type: String, read-only; Required: No)
errorMessage - If the object failed, the error message. This is a runtime slot. (Type: String, read-only; Required: No)
errorStackTrace - If the object failed, the error stack trace. (Type: String, read-only; Required: No)
@scheduledStartTime - The date and time that the run was scheduled to start. (Type: DateTime; Required: No)
@scheduledEndTime - The date and time that the run was scheduled to end. (Type: DateTime; Required: No)
@componentParent - The component from which this instance is created. (Type: Object reference, read-only; Required: No)
@headAttempt - The latest attempt on the given instance. (Type: Object reference, read-only; Required: No)
@resource - The resource instance on which the given activity/precondition attempt is being run. (Type: Object reference, read-only; Required: No)
activityStatus - The status most recently reported from the activity. (Type: String; Required: No)

This object includes the following slots from SchedulableObject.

schedule - A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. (Type: Schedule (p. 163) object reference; Required: No)
scheduleType - Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries". (Required: No)
runsOn - The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. (Type: Resource object reference; Required: No)


Example

The following is an example of this object type. This object references three other objects that you would define in the same pipeline definition file. CopyPeriod is a Schedule object, and InputData and OutputData are data node objects.

{
  "id" : "S3ToS3Copy",
  "type" : "CopyActivity",
  "schedule" : {"ref" : "CopyPeriod"},
  "input" : {"ref" : "InputData"},
  "output" : {"ref" : "OutputData"}
}
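For illustration, the two referenced data nodes might be defined as S3DataNode objects similar to the following sketch, patterned on the Exists example (p. 194). The bucket and file paths are placeholders, and CopyPeriod is the Schedule object referenced above, assumed to be defined elsewhere in the same pipeline.

{
  "id" : "InputData",
  "type" : "S3DataNode",
  "schedule" : {"ref" : "CopyPeriod"},
  "filePath" : "s3://my-example-bucket/input/data.csv"
},
{
  "id" : "OutputData",
  "type" : "S3DataNode",
  "schedule" : {"ref" : "CopyPeriod"},
  "filePath" : "s3://my-example-bucket/output/data.csv"
}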

See Also

ShellCommandActivity (p. 176)

EmrActivity (p. 184)

EmrActivity

Runs an Amazon EMR job flow.

Syntax

The following slots are included in all objects.

id - The ID of the object. IDs must be unique within a pipeline definition. (Type: String; Required: Yes)
name - The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String; Required: No)
type - The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String; Required: Yes)
parent - The parent of the object. (Type: String; Required: No)
@sphere - The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String, read-only; Required: No)

This object includes the following fields.

runsOn - Details about the Amazon EMR job flow. (Type: EmrCluster (p. 209) object reference; Required: Yes)
step - One or more steps for the job flow to run. To specify multiple steps, up to 255, add multiple step fields. (Type: String; Required: Yes)
preStepCommand - Shell scripts to be run before any steps are run. To specify multiple scripts, up to 255, add multiple preStepCommand fields. (Type: String; Required: No)
postStepCommand - Shell scripts to be run after all steps are finished. To specify multiple scripts, up to 255, add multiple postStepCommand fields. (Type: String; Required: No)
input - The input data source. To specify multiple data sources, add multiple input fields. (Type: Data node object reference; Required: No)
output - The location for the output. To specify multiple locations, add multiple output fields. (Type: Data node object reference; Required: No)

This object includes the following slots from the Activity object.

onFail - An action to run when the current instance fails. (Type: SnsAlarm (p. 213) object reference; Required: No)
onSuccess - An email alarm to use when the object's run succeeds. (Type: SnsAlarm (p. 213) object reference; Required: No)
precondition - A condition that must be met before the object can run. To specify multiple conditions, add multiple precondition slots. The activity cannot run until all its conditions are met. (Type: Object reference; Required: No)
schedule - A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional. (Type: Schedule (p. 163) object reference; Required: Yes)

This object includes the following slots from RunnableObject.

workerGroup - The worker group. This is used for routing tasks. (Type: String; Required: No)
retryDelay - The timeout duration between two retry attempts. The default is 10 minutes. (Type: Period; Required: Yes)
maximumRetries - The maximum number of times to retry the action. (Type: Integer; Required: No)
onLateNotify - The email alarm to use when the object's run is late. (Type: SnsAlarm (p. 213) object reference; Required: No)
onLateKill - Indicates whether all pending or unscheduled tasks should be killed if they are late. (Type: Boolean; Required: No)
lateAfterTimeout - The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. (Type: Period; Required: No)
reportProgressTimeout - The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. (Type: Period; Required: No)
attemptTimeout - The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. (Type: Period; Required: No)
logUri - The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. (Type: String; Required: No)
@reportProgressTime - The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. (Type: DateTime; Required: No)
@activeInstances - Record of the currently scheduled instance objects. (Type: Schedulable object reference; Required: No)
@lastRun - The last run of the object. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@scheduledPhysicalObjects - The currently scheduled instance objects. This is a runtime slot. (Type: Object reference, read-only; Required: No)
@status - The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. (Type: String, read-only; Required: No)
@triesLeft - The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. (Type: Integer, read-only; Required: No)
@actualStartTime - The date and time that the scheduled run actually started. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@actualEndTime - The date and time that the scheduled run actually ended. This is a runtime slot. (Type: DateTime, read-only; Required: No)
errorCode - If the object failed, the error code. This is a runtime slot. (Type: String, read-only; Required: No)
errorMessage - If the object failed, the error message. This is a runtime slot. (Type: String, read-only; Required: No)
errorStackTrace - If the object failed, the error stack trace. (Type: String, read-only; Required: No)
@scheduledStartTime - The date and time that the run was scheduled to start. (Type: DateTime; Required: No)
@scheduledEndTime - The date and time that the run was scheduled to end. (Type: DateTime; Required: No)
@componentParent - The component from which this instance is created. (Type: Object reference, read-only; Required: No)
@headAttempt - The latest attempt on the given instance. (Type: Object reference, read-only; Required: No)
@resource - The resource instance on which the given activity/precondition attempt is being run. (Type: Object reference, read-only; Required: No)
activityStatus - The status most recently reported from the activity. (Type: String; Required: No)

This object includes the following slots from SchedulableObject.

schedule - A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. (Type: Schedule (p. 163) object reference; Required: No)
scheduleType - Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries". (Required: No)
runsOn - The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. (Type: Resource object reference; Required: No)

Example

The following is an example of this object type. This object references three other objects that you would define in the same pipeline definition file. MyEmrCluster is an EmrCluster object, and MyS3Input and MyS3Output are S3DataNode objects.

{
  "id" : "MyEmrActivity",
  "type" : "EmrActivity",
  "runsOn" : {"ref" : "MyEmrCluster"},
  "preStepCommand" : "scp remoteFiles localFiles",
  "step" : "s3://myBucket/myPath/myStep.jar,firstArg,secondArg",
  "step" : "s3://myBucket/myPath/myOtherStep.jar,anotherArg",
  "postStepCommand" : "scp localFiles remoteFiles",
  "input" : {"ref" : "MyS3Input"},
  "output" : {"ref" : "MyS3Output"}
}
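For context, the MyEmrCluster object that this activity runs on might be defined with a sketch like the following, using fields described in the EmrCluster section (p. 209); every value shown is a placeholder you would replace with your own settings.

{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
  "masterInstanceType" : "m1.small",
  "coreInstanceType" : "m1.small",
  "coreInstanceCount" : "2",
  "keyPair" : "my-key-pair",
  "logUri" : "s3://my-example-bucket/emr-logs"
}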

See Also

ShellCommandActivity (p. 176)

CopyActivity (p. 180)

EmrCluster (p. 209)

HiveActivity

Runs a Hive query on an Amazon EMR cluster. HiveActivity makes it easier to set up an EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to execute on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input slots in the HiveActivity. For Amazon S3 inputs, the dataFormat field is used to create the Hive column names. For MySQL (Amazon RDS) inputs, the column names for the SQL query are used to create the Hive column names.

Syntax

The following slots are included in all objects.

id - The ID of the object. IDs must be unique within a pipeline definition. (Type: String; Required: Yes)
name - The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String; Required: No)
type - The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String; Required: Yes)
parent - The parent of the object. (Type: String; Required: No)
@sphere - The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String, read-only; Required: No)

This object includes the following fields.

scriptUri - The location of the Hive script to run (for example, s3://script location). (Type: String; Required: No)
hiveScript - The Hive script to run. (Type: String; Required: No)

Note

You must specify either a hiveScript value or a scriptUri value, but you do not need to specify both.

This object includes the following slots from the Activity object.

onFail - An action to run when the current instance fails. (Type: SnsAlarm (p. 213) object reference; Required: No)
onSuccess - An email alarm to use when the object's run succeeds. (Type: SnsAlarm (p. 213) object reference; Required: No)
precondition - A condition that must be met before the object can run. To specify multiple conditions, add multiple precondition slots. The activity cannot run until all its conditions are met. (Type: Object reference; Required: No)
schedule - A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional. (Type: Schedule (p. 163) object reference; Required: Yes)

This object includes the following slots from RunnableObject.

workerGroup - The worker group. This is used for routing tasks. (Type: String; Required: No)
retryDelay - The timeout duration between two retry attempts. The default is 10 minutes. (Type: Period; Required: Yes)
maximumRetries - The maximum number of times to retry the action. (Type: Integer; Required: No)
onLateNotify - The email alarm to use when the object's run is late. (Type: SnsAlarm (p. 213) object reference; Required: No)
onLateKill - Indicates whether all pending or unscheduled tasks should be killed if they are late. (Type: Boolean; Required: No)
lateAfterTimeout - The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. (Type: Period; Required: No)
reportProgressTimeout - The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. (Type: Period; Required: No)
attemptTimeout - The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. (Type: Period; Required: No)
logUri - The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. (Type: String; Required: No)
@reportProgressTime - The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. (Type: DateTime; Required: No)
@activeInstances - Record of the currently scheduled instance objects. (Type: Schedulable object reference; Required: No)
@lastRun - The last run of the object. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@scheduledPhysicalObjects - The currently scheduled instance objects. This is a runtime slot. (Type: Object reference, read-only; Required: No)
@status - The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. (Type: String, read-only; Required: No)
@triesLeft - The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. (Type: Integer, read-only; Required: No)
@actualStartTime - The date and time that the scheduled run actually started. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@actualEndTime - The date and time that the scheduled run actually ended. This is a runtime slot. (Type: DateTime, read-only; Required: No)
errorCode - If the object failed, the error code. This is a runtime slot. (Type: String, read-only; Required: No)
errorMessage - If the object failed, the error message. This is a runtime slot. (Type: String, read-only; Required: No)
errorStackTrace - If the object failed, the error stack trace. (Type: String, read-only; Required: No)
@scheduledStartTime - The date and time that the run was scheduled to start. (Type: DateTime; Required: No)
@scheduledEndTime - The date and time that the run was scheduled to end. (Type: DateTime; Required: No)
@componentParent - The component from which this instance is created. (Type: Object reference, read-only; Required: No)
@headAttempt - The latest attempt on the given instance. (Type: Object reference, read-only; Required: No)
@resource - The resource instance on which the given activity/precondition attempt is being run. (Type: Object reference, read-only; Required: No)
activityStatus - The status most recently reported from the activity. (Type: String; Required: No)

This object includes the following slots from SchedulableObject.

schedule - A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. (Type: Schedule (p. 163) object reference; Required: No)
scheduleType - Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries". (Required: No)
runsOn - The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. (Type: Resource object reference; Required: No)

Example

The following is an example of this object type. This object references three other objects that you would define in the same pipeline definition file. CopyPeriod is a Schedule object, and InputData and OutputData are data node objects.

{
  "id" : "MyHiveActivity",
  "type" : "HiveActivity",
  "schedule" : {"ref" : "CopyPeriod"},
  "hiveScript" : "INSERT OVERWRITE TABLE ${output1} SELECT * FROM ${input1};",
  "input" : {"ref" : "InputData"},
  "output" : {"ref" : "OutputData"}
}

See Also

ShellCommandActivity (p. 176)

EmrActivity (p. 184)


ShellCommandPrecondition

A Unix/Linux shell command that can be executed as a precondition.

Syntax

The following slots are included in all objects.

id - The ID of the object. IDs must be unique within a pipeline definition. (Type: String; Required: Yes)
name - The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String; Required: No)
type - The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String; Required: Yes)
parent - The parent of the object. (Type: String; Required: No)
@sphere - The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String, read-only; Required: No)

This object includes the following fields.

command - The command to run. This value and any associated parameters must function in the environment from which you are running the Task Runner. (Type: String; Required: Yes)
scriptUri - An Amazon S3 URI path for a file to download and run as a shell command. Only one scriptUri or command field should be present. (Type: A valid S3 URI; Required: No)

This object includes the following slots from the Precondition object.

preconditionMaximumRetries - Specifies the maximum number of times that a precondition is retried. (Type: Integer; Required: No)
node - The activity or data node for which this precondition is being checked. This is a runtime slot. (Type: Object reference, read-only; Required: No)


This object includes the following slots from RunnableObject.

workerGroup - The worker group. This is used for routing tasks. (Type: String; Required: No)
retryDelay - The timeout duration between two retry attempts. The default is 10 minutes. (Type: Period; Required: Yes)
maximumRetries - The maximum number of times to retry the action. (Type: Integer; Required: No)
onLateNotify - The email alarm to use when the object's run is late. (Type: SnsAlarm (p. 213) object reference; Required: No)
onLateKill - Indicates whether all pending or unscheduled tasks should be killed if they are late. (Type: Boolean; Required: No)
lateAfterTimeout - The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. (Type: Period; Required: No)
reportProgressTimeout - The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. (Type: Period; Required: No)
attemptTimeout - The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. (Type: Period; Required: No)
logUri - The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. (Type: String; Required: No)
@reportProgressTime - The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. (Type: DateTime; Required: No)
@activeInstances - Record of the currently scheduled instance objects. (Type: Schedulable object reference; Required: No)
@lastRun - The last run of the object. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@scheduledPhysicalObjects - The currently scheduled instance objects. This is a runtime slot. (Type: Object reference, read-only; Required: No)
@status - The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. (Type: String, read-only; Required: No)
@triesLeft - The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. (Type: Integer, read-only; Required: No)
@actualStartTime - The date and time that the scheduled run actually started. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@actualEndTime - The date and time that the scheduled run actually ended. This is a runtime slot. (Type: DateTime, read-only; Required: No)
errorCode - If the object failed, the error code. This is a runtime slot. (Type: String, read-only; Required: No)
errorMessage - If the object failed, the error message. This is a runtime slot. (Type: String, read-only; Required: No)
errorStackTrace - If the object failed, the error stack trace. (Type: String, read-only; Required: No)
@scheduledStartTime - The date and time that the run was scheduled to start. (Type: DateTime; Required: No)
@scheduledEndTime - The date and time that the run was scheduled to end. (Type: DateTime; Required: No)
@componentParent - The component from which this instance is created. (Type: Object reference, read-only; Required: No)
@headAttempt - The latest attempt on the given instance. (Type: Object reference, read-only; Required: No)
@resource - The resource instance on which the given activity/precondition attempt is being run. (Type: Object reference, read-only; Required: No)
activityStatus - The status most recently reported from the activity. (Type: String; Required: No)

Example

The following is an example of this object type.

{
  "id" : "VerifyDataReadiness",
  "type" : "ShellCommandPrecondition",
  "command" : "perl check-data-ready.pl"
}
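Alternatively, the same kind of precondition could download its check script from Amazon S3 by using scriptUri instead of command, as in the following illustrative sketch; the bucket and script name are placeholders.

{
  "id" : "VerifyDataReadinessFromS3",
  "type" : "ShellCommandPrecondition",
  "scriptUri" : "s3://my-example-bucket/scripts/check-data-ready.sh"
}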

See Also

ShellCommandActivity (p. 176)

Exists (p. 194)

Exists

Checks whether a data node object exists.


Syntax

The following slots are included in all objects.

id - The ID of the object. IDs must be unique within a pipeline definition. (Type: String; Required: Yes)
name - The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String; Required: No)
type - The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String; Required: Yes)
parent - The parent of the object. (Type: String; Required: No)
@sphere - The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String, read-only; Required: No)

This object includes the following slots from the Precondition object.

node - The activity or data node for which this precondition is being checked. This is a runtime slot. (Type: Object reference, read-only; Required: No)
preconditionMaximumRetries - Specifies the maximum number of times that a precondition is retried. (Type: Integer; Required: No)

This object includes the following slots from RunnableObject.

workerGroup - The worker group. This is used for routing tasks. (Type: String; Required: No)
retryDelay - The timeout duration between two retry attempts. The default is 10 minutes. (Type: Period; Required: Yes)
maximumRetries - The maximum number of times to retry the action. (Type: Integer; Required: No)
onLateNotify - The email alarm to use when the object's run is late. (Type: SnsAlarm (p. 213) object reference; Required: No)
onLateKill - Indicates whether all pending or unscheduled tasks should be killed if they are late. (Type: Boolean; Required: No)
lateAfterTimeout - The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. (Type: Period; Required: No)
reportProgressTimeout - The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. (Type: Period; Required: No)
attemptTimeout - The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. (Type: Period; Required: No)
logUri - The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. (Type: String; Required: No)
@reportProgressTime - The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. (Type: DateTime; Required: No)
@activeInstances - Record of the currently scheduled instance objects. (Type: Schedulable object reference; Required: No)
@lastRun - The last run of the object. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@scheduledPhysicalObjects - The currently scheduled instance objects. This is a runtime slot. (Type: Object reference, read-only; Required: No)
@status - The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. (Type: String, read-only; Required: No)
@triesLeft - The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. (Type: Integer, read-only; Required: No)
@actualStartTime - The date and time that the scheduled run actually started. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@actualEndTime - The date and time that the scheduled run actually ended. This is a runtime slot. (Type: DateTime, read-only; Required: No)
errorCode - If the object failed, the error code. This is a runtime slot. (Type: String, read-only; Required: No)
errorMessage - If the object failed, the error message. This is a runtime slot. (Type: String, read-only; Required: No)
errorStackTrace - If the object failed, the error stack trace. (Type: String, read-only; Required: No)
@scheduledStartTime - The date and time that the run was scheduled to start. (Type: DateTime; Required: No)
@scheduledEndTime - The date and time that the run was scheduled to end. (Type: DateTime; Required: No)
@componentParent - The component from which this instance is created. (Type: Object reference, read-only; Required: No)
@headAttempt - The latest attempt on the given instance. (Type: Object reference, read-only; Required: No)
@resource - The resource instance on which the given activity/precondition attempt is being run. (Type: Object reference, read-only; Required: No)
activityStatus - The status most recently reported from the activity. (Type: String; Required: No)

Example

The following is an example of this object type. The InputData object references this object, Ready, plus another object that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object.

{
  "id" : "InputData",
  "type" : "S3DataNode",
  "schedule" : {"ref" : "CopyPeriod"},
  "filePath" : "s3://test/InputData/#{@scheduledStartTime.format('YYYY-MM-ddhh:mm')}.csv",
  "precondition" : {"ref" : "Ready"}
},
{
  "id" : "Ready",
  "type" : "Exists"
}

See Also

ShellCommandPrecondition (p. 192)

S3KeyExists

Checks whether a key exists in an Amazon S3 data node.

Syntax

The following slots are included in all objects.

id - The ID of the object. IDs must be unique within a pipeline definition. (Type: String; Required: Yes)
name - The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String; Required: No)
type - The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String; Required: Yes)
parent - The parent of the object. (Type: String; Required: No)
@sphere - The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String, read-only; Required: No)

This object includes the following slots from the Precondition object.

node - The activity or data node for which this precondition is being checked. This is a runtime slot. (Type: Object reference, read-only; Required: No)
preconditionMaximumRetries - Specifies the maximum number of times that a precondition is retried. (Type: Integer; Required: No)

This object includes the following fields.

s3Key - Amazon S3 key to check for existence. (Type: String; Required: Yes)

This object includes the following slots from RunnableObject.

workerGroup - The worker group. This is used for routing tasks. (Type: String; Required: No)
retryDelay - The timeout duration between two retry attempts. The default is 10 minutes. (Type: Period; Required: Yes)
maximumRetries - The maximum number of times to retry the action. (Type: Integer; Required: No)
onLateNotify - The email alarm to use when the object's run is late. (Type: SnsAlarm (p. 213) object reference; Required: No)
onLateKill - Indicates whether all pending or unscheduled tasks should be killed if they are late. (Type: Boolean; Required: No)
lateAfterTimeout - The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. (Type: Period; Required: No)
reportProgressTimeout - The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. (Type: Period; Required: No)
attemptTimeout - The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. (Type: Period; Required: No)
logUri - The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. (Type: String; Required: No)
@reportProgressTime - The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. (Type: DateTime; Required: No)
@activeInstances - Record of the currently scheduled instance objects. (Type: Schedulable object reference; Required: No)
@lastRun - The last run of the object. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@scheduledPhysicalObjects - The currently scheduled instance objects. This is a runtime slot. (Type: Object reference, read-only; Required: No)
@status - The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. (Type: String, read-only; Required: No)
@triesLeft - The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. (Type: Integer, read-only; Required: No)
@actualStartTime - The date and time that the scheduled run actually started. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@actualEndTime - The date and time that the scheduled run actually ended. This is a runtime slot. (Type: DateTime, read-only; Required: No)
errorCode - If the object failed, the error code. This is a runtime slot. (Type: String, read-only; Required: No)
errorMessage - If the object failed, the error message. This is a runtime slot. (Type: String, read-only; Required: No)
errorStackTrace - If the object failed, the error stack trace. (Type: String, read-only; Required: No)
@scheduledStartTime - The date and time that the run was scheduled to start. (Type: DateTime; Required: No)
@scheduledEndTime - The date and time that the run was scheduled to end. (Type: DateTime; Required: No)
@componentParent - The component from which this instance is created. (Type: Object reference, read-only; Required: No)
@headAttempt - The latest attempt on the given instance. (Type: Object reference, read-only; Required: No)
@resource - The resource instance on which the given activity/precondition attempt is being run. (Type: Object reference, read-only; Required: No)
activityStatus - The status most recently reported from the activity. (Type: String; Required: No)
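Example

The following is an illustrative sketch of this object type; the Amazon S3 key shown is a placeholder that you would replace with a key in your own bucket.

{
  "id" : "InputKeyReady",
  "type" : "S3KeyExists",
  "s3Key" : "s3://my-example-bucket/InputData/ready.csv"
}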

See Also

ShellCommandPrecondition (p. 192)

S3PrefixNotEmpty

A precondition to check that the Amazon S3 objects with the given prefix (represented as a URI) are present.

Syntax

The following slots are included in all objects.

id - The ID of the object. IDs must be unique within a pipeline definition. (Type: String; Required: Yes)
name - The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String; Required: No)
type - The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String; Required: Yes)
parent - The parent of the object. (Type: String; Required: No)
@sphere - The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String, read-only; Required: No)

This object includes the following slots from the Precondition object.

preconditionMaximumRetries - Specifies the maximum number of times that a precondition is retried. (Type: Integer; Required: No)
node - The activity or data node for which this precondition is being checked. This is a runtime slot. (Type: Object reference, read-only; Required: No)

This object includes the following fields.

s3Prefix - The Amazon S3 prefix to check for existence of objects. (Type: String; Required: Yes)

This object includes the following slots from RunnableObject.

workerGroup - The worker group. This is used for routing tasks. (Type: String; Required: No)
retryDelay - The timeout duration between two retry attempts. The default is 10 minutes. (Type: Period; Required: Yes)
maximumRetries - The maximum number of times to retry the action. (Type: Integer; Required: No)
onLateNotify - The email alarm to use when the object's run is late. (Type: SnsAlarm (p. 213) object reference; Required: No)
onLateKill - Indicates whether all pending or unscheduled tasks should be killed if they are late. (Type: Boolean; Required: No)
lateAfterTimeout - The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. (Type: Period; Required: No)
reportProgressTimeout - The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. (Type: Period; Required: No)
attemptTimeout - The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. (Type: Period; Required: No)
logUri - The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. (Type: String; Required: No)
@reportProgressTime - The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. (Type: DateTime; Required: No)
@activeInstances - Record of the currently scheduled instance objects. (Type: Schedulable object reference; Required: No)
@lastRun - The last run of the object. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@scheduledPhysicalObjects - The currently scheduled instance objects. This is a runtime slot. (Type: Object reference, read-only; Required: No)
@status - The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. (Type: String, read-only; Required: No)
@triesLeft - The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. (Type: Integer, read-only; Required: No)
@actualStartTime - The date and time that the scheduled run actually started. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@actualEndTime - The date and time that the scheduled run actually ended. This is a runtime slot. (Type: DateTime, read-only; Required: No)
errorCode - If the object failed, the error code. This is a runtime slot. (Type: String, read-only; Required: No)
errorMessage - If the object failed, the error message. This is a runtime slot. (Type: String, read-only; Required: No)
errorStackTrace - If the object failed, the error stack trace. (Type: String, read-only; Required: No)
@scheduledStartTime - The date and time that the run was scheduled to start. (Type: DateTime; Required: No)
@scheduledEndTime - The date and time that the run was scheduled to end. (Type: DateTime; Required: No)
@componentParent - The component from which this instance is created. (Type: Object reference, read-only; Required: No)
@headAttempt - The latest attempt on the given instance. (Type: Object reference, read-only; Required: No)
@resource - The resource instance on which the given activity/precondition attempt is being run. (Type: Object reference, read-only; Required: No)
activityStatus - The status most recently reported from the activity. (Type: String; Required: No)

Example

The following is an example of this object type using required, optional, and expression fields.

{
  "id": "InputReady",
  "type": "S3PrefixNotEmpty",
  "role": "test-role",
  "s3Prefix": "#{node.filePath}"
}
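For illustration, the #{node.filePath} expression above resolves against the data node that this precondition is attached to through its precondition slot; a sketch of such a data node follows, patterned on the Exists example (p. 194). The bucket, path, and schedule reference are placeholders.

{
  "id": "InputData",
  "type": "S3DataNode",
  "schedule": {"ref": "CopyPeriod"},
  "filePath": "s3://my-example-bucket/InputData/",
  "precondition": {"ref": "InputReady"}
}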

See Also

ShellCommandPrecondition (p. 192)

RdsSqlPrecondition

A precondition that executes a query to verify the readiness of data within Amazon RDS. Specified conditions are combined by a logical AND operation.

Important

You must grant permissions to Task Runner to access Amazon RDS when using an RdsSqlPrecondition, as described in Grant Amazon RDS Permissions to Task Runner (p. 23).

Syntax

The following slots are included in all objects.

id - The ID of the object. IDs must be unique within a pipeline definition. (Type: String; Required: Yes)
name - The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String; Required: No)
type - The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String; Required: Yes)
parent - The parent of the object. (Type: String; Required: No)
@sphere - The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String, read-only; Required: No)

This object includes the following fields.

query - A valid SQL query that should return one row with one column (scalar value). (Type: String; Required: Yes)
username - The user name to connect with. (Type: String; Required: Yes)
*password - The password to connect with. The asterisk instructs AWS Data Pipeline to encrypt the password. (Type: String; Required: Yes)
rdsInstanceId - The InstanceId to connect to. (Type: String; Required: Yes)
database - The logical database to connect to. (Type: String; Required: Yes)
equalTo - This precondition is true if the value returned by the query is equal to this value. (Type: Integer; Required: No)
lessThan - This precondition is true if the value returned by the query is less than this value. (Type: Integer; Required: No)
greaterThan - This precondition is true if the value returned by the query is greater than this value. (Type: Integer; Required: No)
isTrue - This precondition is true if the Boolean value returned by the query is equal to true. (Type: Boolean; Required: No)
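Example

The following is an illustrative sketch of this object type; the instance ID, database, credentials, table, and query are placeholders, and the precondition passes when the query returns a count greater than zero.

{
  "id" : "TodaysRowsExist",
  "type" : "RdsSqlPrecondition",
  "rdsInstanceId" : "my-rds-instance",
  "database" : "mydatabase",
  "username" : "myuser",
  "*password" : "mypassword",
  "query" : "SELECT COUNT(*) FROM orders WHERE order_date = CURDATE()",
  "greaterThan" : "0"
}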

DynamoDBTableExists

A precondition to check that the Amazon DynamoDB table exists.

Syntax

The following slots are included in all objects.

id - The ID of the object. IDs must be unique within a pipeline definition. (Type: String; Required: Yes)
name - The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String; Required: No)
type - The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String; Required: Yes)
parent - The parent of the object. (Type: String; Required: No)
@sphere - The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String, read-only; Required: No)

This object includes the following fields.

tableName - The Amazon DynamoDB table to check. (Type: String; Required: Yes)

DynamoDBDataExists

A precondition to check that data exists in an Amazon DynamoDB table.


Syntax

The following slots are included in all objects.

id - The ID of the object. IDs must be unique within a pipeline definition. (Type: String; Required: Yes)
name - The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String; Required: No)
type - The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String; Required: Yes)
parent - The parent of the object. (Type: String; Required: No)
@sphere - The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String, read-only; Required: No)

This object includes the following fields.

tableName - The Amazon DynamoDB table to check. (Type: String; Required: Yes)
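Example

The following is an illustrative sketch of this object type; the table name is a placeholder that you would replace with your own Amazon DynamoDB table.

{
  "id" : "MyTableHasData",
  "type" : "DynamoDBDataExists",
  "tableName" : "MyDynamoDBTable"
}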

Ec2Resource

Represents the configuration of an Amazon EC2 instance resource pool.

Syntax

The following slots are included in all objects.

id - The ID of the object. IDs must be unique within a pipeline definition. (Type: String; Required: Yes)
name - The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String; Required: No)
type - The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String; Required: Yes)
parent - The parent of the object. (Type: String; Required: No)
@sphere - The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String, read-only; Required: No)


This object includes the following fields.

instanceType - The type of EC2 instance to use for the resource pool. The default value is m1.small. (Type: String; Required: No)
instanceCount - The number of instances to use for the resource pool. The default value is 1. (Type: Integer; Required: No)
minInstanceCount - The minimum number of EC2 instances for the pool. The default value is 1. (Type: Integer; Required: No)
securityGroups - The EC2 security group to use for the instances in the resource pool. (Type: String; Required: No)
imageId - The AMI version to use for the EC2 instances. The default value is ami-1624987f, which we recommend using. For more information, see Amazon Machine Images (AMIs). (Type: String; Required: No)
keyPair - The Amazon EC2 key pair to use to log onto the EC2 instance. The default action is not to attach a key pair to the EC2 instance. (Type: String; Required: No)
role - The IAM role to use to create the EC2 instance. (Type: String; Required: Yes)
resourceRole - The IAM role to use to control the resources that the EC2 instance can access. (Type: String; Required: Yes)

This object includes the following slots from the Resource object.

terminateAfter - The number of hours to wait before terminating the resource. (Type: Period; Required: Yes)
@resourceId - The unique identifier for the resource. (Type: Period; Required: Yes)
@resourceStatus - The current status of the resource, such as checking_preconditions, creating, shutting_down, running, failed, timed_out, cancelled, or paused. (Type: String; Required: No)
@failureReason - The reason for the failure to create the resource. (Type: String; Required: No)
@resourceCreationTime - The time when this resource was created. (Type: DateTime; Required: No)

This object includes the following slots from RunnableObject.

workerGroup - The worker group. This is used for routing tasks. (Type: String; Required: No)
retryDelay - The timeout duration between two retry attempts. The default is 10 minutes. (Type: Period; Required: Yes)
maximumRetries - The maximum number of times to retry the action. (Type: Integer; Required: No)
onLateNotify - The email alarm to use when the object's run is late. (Type: SnsAlarm (p. 213) object reference; Required: No)
onLateKill - Indicates whether all pending or unscheduled tasks should be killed if they are late. (Type: Boolean; Required: No)
lateAfterTimeout - The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. (Type: Period; Required: No)
reportProgressTimeout - The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. (Type: Period; Required: No)
attemptTimeout - The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. (Type: Period; Required: No)
logUri - The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. (Type: String; Required: No)
@reportProgressTime - The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. (Type: DateTime; Required: No)
@activeInstances - Record of the currently scheduled instance objects. (Type: Schedulable object reference; Required: No)
@lastRun - The last run of the object. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@scheduledPhysicalObjects - The currently scheduled instance objects. This is a runtime slot. (Type: Object reference, read-only; Required: No)
@status - The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. (Type: String, read-only; Required: No)
@triesLeft - The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. (Type: Integer, read-only; Required: No)
@actualStartTime - The date and time that the scheduled run actually started. This is a runtime slot. (Type: DateTime, read-only; Required: No)
@actualEndTime - The date and time that the scheduled run actually ended. This is a runtime slot. (Type: DateTime, read-only; Required: No)
errorCode - If the object failed, the error code. This is a runtime slot. (Type: String, read-only; Required: No)
errorMessage - If the object failed, the error message. This is a runtime slot. (Type: String, read-only; Required: No)
errorStackTrace - If the object failed, the error stack trace. (Type: String, read-only; Required: No)
@scheduledStartTime - The date and time that the run was scheduled to start. (Type: DateTime; Required: No)
@scheduledEndTime - The date and time that the run was scheduled to end. (Type: DateTime; Required: No)
@componentParent - The component from which this instance is created. (Type: Object reference, read-only; Required: No)
@headAttempt - The latest attempt on the given instance. (Type: Object reference, read-only; Required: No)
@resource - The resource instance on which the given activity/precondition attempt is being run. (Type: Object reference, read-only; Required: No)
activityStatus - The status most recently reported from the activity. (Type: String; Required: No)

This object includes the following slots from SchedulableObject.

schedule - A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. (Type: Schedule (p. 163) object reference; Required: No)
scheduleType - Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries". (Required: No)
runsOn - The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. (Type: Resource object reference; Required: No)

Example

The following is an example of this object type. It launches an EC2 instance and shows some optional fields set.


{

"id": "MyEC2Resource",

"type": "Ec2Resource",

"actionOnTaskFailure": "terminate",

"actionOnResourceFailure": "retryAll",

"maximumRetries": "1",

"role": "test-role",

"resourceRole": "test-role",

"instanceType": "m1.medium",

"instanceCount": "1",

"securityGroups": [

"test-group",

"default"

],

"keyPair": "test-pair"

}

EmrCluster

Represents the configuration of an Amazon EMR job flow. This object is used by EmrActivity (p. 184) to launch a job flow.

Syntax

The following slots are included in all objects.

id: The ID of the object. IDs must be unique within a pipeline definition. Type: String. Required: Yes.

name: The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. Type: String. Required: No.

type: The type of object. Use one of the predefined AWS Data Pipeline object types. Type: String. Required: Yes.

parent: The parent of the object. Type: String. Required: No.

@sphere: The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. Type: String (read-only). Required: No.

This object includes the following fields.

coreInstanceType: The type of EC2 instance to use for core nodes. The default value is m1.small. Type: String. Required: No.

masterInstanceType: The type of EC2 instance to use for the master node. The default value is m1.small. Type: String. Required: No.

taskInstanceType: The type of EC2 instance to use for task nodes. Type: String. Required: No.

coreInstanceCount: The number of core nodes to use for the job flow. The default value is 1. Type: String. Required: No.

taskInstanceCount: The number of task nodes to use for the job flow. The default value is 1. Type: String. Required: No.

keyPair: The Amazon EC2 key pair to use to log onto the master node of the job flow. The default action is not to attach a key pair to the job flow. Type: String. Required: No.

hadoopVersion: The version of Hadoop to use in the job flow. The default value is 0.20. For more information about the Hadoop versions supported by Amazon EMR, see Supported Hadoop Versions. Type: String. Required: No.

bootstrapAction: An action to run when the job flow starts. You can specify comma-separated arguments. To specify multiple actions, up to 255, add multiple bootstrapAction fields. The default behavior is to start the job flow without any bootstrap actions. Type: String array. Required: No.

enableDebugging: Enables debugging on the job flow. Type: String. Required: No.

logUri: The location in Amazon S3 to store log files from the job flow. Type: String. Required: No.

This object includes the following slots from the Resource object.

terminateAfter: The number of hours to wait before terminating the resource. Type: Period. Required: Yes.

@resourceId: The unique identifier for the resource. Type: String. Required: Yes.

@resourceStatus: The current status of the resource, such as checking_preconditions, creating, shutting_down, running, failed, timed_out, cancelled, or paused. Type: String. Required: No.

@failureReason: The reason for the failure to create the resource. Type: String. Required: No.

@resourceCreationTime: The time when this resource was created. Type: DateTime. Required: No.

This object includes the following slots from RunnableObject.

workerGroup: The worker group. This is used for routing tasks. Type: String. Required: No.

retryDelay: The timeout duration between two retry attempts. The default is 10 minutes. Type: Period. Required: Yes.

maximumRetries: The maximum number of times to retry the action. Type: Integer. Required: No.

onLateNotify: The email alarm to use when the object's run is late. Type: SnsAlarm (p. 213) object reference. Required: No.

onLateKill: Indicates whether all pending or unscheduled tasks should be killed if they are late. Type: Boolean. Required: No.

lateAfterTimeout: The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. Type: Period. Required: No.

reportProgressTimeout: The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. Type: Period. Required: No.

attemptTimeout: The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. Type: Period. Required: No.

logUri: The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. Type: String. Required: No.

@reportProgressTime: The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. Type: DateTime (read-only). Required: No.

@activeInstances: Record of the currently scheduled instance objects. Type: Schedulable object reference. Required: No.

@lastRun: The last run of the object. This is a runtime slot. Type: DateTime (read-only). Required: No.

@scheduledPhysicalObjects: The currently scheduled instance objects. This is a runtime slot. Type: Object reference (read-only). Required: No.

@status: The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. Type: String (read-only). Required: No.

@triesLeft: The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. Type: Integer (read-only). Required: No.

@actualStartTime: The date and time that the scheduled run actually started. This is a runtime slot. Type: DateTime (read-only). Required: No.

@actualEndTime: The date and time that the scheduled run actually ended. This is a runtime slot. Type: DateTime (read-only). Required: No.

errorCode: If the object failed, the error code. This is a runtime slot. Type: String (read-only). Required: No.

errorMessage: If the object failed, the error message. This is a runtime slot. Type: String (read-only). Required: No.

errorStackTrace: If the object failed, the error stack trace. Type: String (read-only). Required: No.

@scheduledStartTime: The date and time that the run was scheduled to start. Type: DateTime. Required: No.

@scheduledEndTime: The date and time that the run was scheduled to end. Type: DateTime. Required: No.

@componentParent: The component from which this instance is created. Type: Object reference (read-only). Required: No.

@headAttempt: The latest attempt on the given instance. Type: Object reference (read-only). Required: No.

@resource: The resource instance on which the given activity/precondition attempt is being run. Type: Object reference (read-only). Required: No.

activityStatus: The status most recently reported from the activity. Type: String. Required: No.

This object includes the following slots from SchedulableObject.

schedule: A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. Type: Schedule (p. 163) object reference. Required: No.

scheduleType: Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time series style scheduling means instances are scheduled at the end of each interval, and cron style scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries". Required: No.

runsOn: The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. Type: Resource object reference. Required: No.

Example

The following is an example of this object type. It launches an Amazon EMR job flow using AMI version 1.0 and Hadoop 0.20.

{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
  "hadoopVersion" : "0.20",
  "keyPair" : "myKeyPair",
  "masterInstanceType" : "m1.xlarge",
  "coreInstanceType" : "m1.small",
  "coreInstanceCount" : "10",
  "taskInstanceType" : "m1.small",
  "taskInstanceCount": "10",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-hadoop,arg1,arg2,arg3",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-otherstuff,arg1,arg2"
}
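For context, an activity runs on this cluster by referring to it in its runsOn field. The fragment below is illustrative only: the activity id is made up, and the other fields an EmrActivity requires (such as its schedule and step) are omitted here.

{
  "id" : "MyEmrActivityExample",
  "type" : "EmrActivity",
  "runsOn" : { "ref" : "MyEmrCluster" }
}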

See Also

EmrActivity (p. 184)

SnsAlarm

Sends an Amazon SNS notification message when an activity fails or finishes successfully.

Syntax

The following slots are included in all objects.

id: The ID of the object. IDs must be unique within a pipeline definition. Type: String. Required: Yes.

name: The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. Type: String. Required: No.

type: The type of object. Use one of the predefined AWS Data Pipeline object types. Type: String. Required: Yes.

parent: The parent of the object. Type: String. Required: No.

@sphere: The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. Type: String (read-only). Required: No.

This object includes the following fields.

subject: The subject line of the Amazon SNS notification message. Type: String. Required: Yes.

message: The body text of the Amazon SNS notification. Type: String. Required: Yes.

topicArn: The destination Amazon SNS topic ARN for the message. Type: String. Required: Yes.

This object includes the following slots from the Action object.

node: The node for which this action is being performed. This is a runtime slot. Type: Object reference (read-only). Required: No.

Example

The following is an example of this object type. The values for node.input and node.output come from the data node or activity that references this object in its onSuccess field.

{

"id" : "SuccessNotify",

"type" : "SnsAlarm",

"topicArn" : "arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic",

"subject" : "COPY SUCCESS: #{node.@scheduledStartTime}",

"message" : "Files were copied from #{node.input} to #{node.output}."

}


Command Line Reference

Before you read this section, you should be familiar with Using the Command Line Interface (p. 121).

This section is a detailed reference of the AWS Data Pipeline command line interface (CLI) commands and parameters to interact with AWS Data Pipeline.

You can combine commands on a single command line. Commands are processed from left to right. You can use the --create and --id commands anywhere on the command line, but not together, and not more than once.

--cancel

Description

Cancels one or more specified objects from within a pipeline that is either currently running or ran previously. To see the status of the canceled pipeline object, use --list-runs.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --cancel object_id --id pipeline_id [Common Options]

On Windows:

ruby datapipeline --cancel object_id --id pipeline_id [Common Options]


Options

object_id: The identifier of the object to cancel. You can specify the name of a single object, or a comma-separated list of object identifiers. Example: o-06198791C436IEXAMPLE. Required: Yes.

--id pipeline_id: The identifier of the pipeline. Example: --id df-00627471SOVYZEXAMPLE. Required: Yes.

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229)

.

Output

None.

Examples

The following example demonstrates how to list the objects of a previously run or currently running pipeline.

Next, the example cancels an object of the pipeline. Finally, the example lists the results of the canceled object.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE

./datapipeline --id df-00627471SOVYZEXAMPLE --cancel o-06198791C436IEXAMPLE

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE

On Windows:

ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE
ruby datapipeline --id df-00627471SOVYZEXAMPLE --cancel o-06198791C436IEXAMPLE
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE

Related Commands

--delete (p. 218)

--list-pipelines (p. 221)

--list-runs (p. 222)


--create

Description

Creates a data pipeline with the specified name, but does not activate the pipeline.

There is a limit of 20 pipelines per AWS account.

To specify a pipeline definition file when you create the pipeline, use this command with the --put (p. 224)

command.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --create name [Common Options]

On Windows:

ruby datapipeline --create name [Common Options]

Options

name: The name of the pipeline. Example: my-pipeline. Required: Yes.

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229)

.

Output

Pipeline with name 'name' and id 'df-xxxxxxxxxxxxxxxxxxxx' created.

df-xxxxxxxxxxxxxxxxxxxx

The identifier of the newly created pipeline (df-xxxxxxxxxxxxxxxxxxxx). You must specify this identifier with the --id command whenever you issue a command that operates on the corresponding pipeline.

Examples

The following example creates the first pipeline without specifying a pipeline definition file, and creates the second pipeline with a pipeline definition file.

On Linux/Unix/Mac OS:


./datapipeline --create my-first-pipeline

./datapipeline --create my-second-pipeline --put my-pipeline-file.json

On Windows:

ruby datapipeline --create my-first-pipeline
ruby datapipeline --create my-second-pipeline --put my-pipeline-file.json

Related Commands

--delete (p. 218)

--list-pipelines (p. 221)

--put (p. 224)

--delete

Description

Stops the specified data pipeline, and cancels its future runs.

This command removes the pipeline definition file and run history. This action is irreversible; you can't restart a deleted pipeline.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --delete --id pipeline_id [Common Options]

On Windows:

ruby datapipeline --delete --id pipeline_id [Common Options]

Options

--id pipeline_id: The identifier of the pipeline. Example: --id df-00627471SOVYZEXAMPLE. Required: Yes.

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229)

.


Output

State of pipeline id 'df-xxxxxxxxxxxxxxxxxxxx' is currently 'state'

Deleted pipeline 'df-xxxxxxxxxxxxxxxxxxxx'

A message indicating that the pipeline was successfully deleted.

Examples

The following example deletes the pipeline with the identifier df-00627471SOVYZEXAMPLE

.

On Linux/Unix/Mac OS:

./datapipeline --delete --id df-00627471SOVYZEXAMPLE

On Windows: ruby datapipeline --delete --id df-00627471SOVYZEXAMPLE

Related Commands

--create (p. 217)

--list-pipelines (p. 221)

--get, --g

Description

Gets the pipeline definition file for the specified data pipeline and saves it to a file. If no file is specified, the file contents are written to standard output.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --get pipeline_definition_file --id pipeline_id --version pipeline_version [Common Options]

On Windows:

ruby datapipeline --get pipeline_definition_file --id pipeline_id --version pipeline_version [Common Options]


Options

--id pipeline_id: The identifier of the pipeline. Example: --id df-00627471SOVYZEXAMPLE. Required: Yes.

pipeline_definition_file: The full path to the output file that receives the pipeline definition. Default: standard output. Example: my-pipeline.json. Required: No.

--version pipeline_version: The version name of the pipeline. Example: --version active. Example: --version latest. Required: No.

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229)

.

Output

If an output file is specified, the output is a pipeline definition file; otherwise, the contents of the pipeline definition are written to standard output.

Examples

The first command writes the pipeline definition to standard output (usually the terminal screen), and the second command writes the pipeline definition to the file my-pipeline.json.

On Linux/Unix/Mac OS:

./datapipeline --get --id df-00627471SOVYZEXAMPLE

./datapipeline --get my-pipeline.json --id df-00627471SOVYZEXAMPLE

On Windows:

ruby datapipeline --get --id df-00627471SOVYZEXAMPLE
ruby datapipeline --get my-pipeline.json --id df-00627471SOVYZEXAMPLE

Related Commands

--create (p. 217)

--put (p. 224)


--help, --h

Description

Displays information about the commands provided by the CLI.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --help

On Windows: ruby datapipeline --help

Options

None.

Output

A list of the commands used by the CLI, printed to standard output (typically the terminal window).

--list-pipelines

Description

Lists the pipelines that you have permission to access.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --list-pipelines

On Windows: ruby datapipeline --list-pipelines

Options

None.


Related Commands

--create (p. 217)

--list-runs (p. 222)

--list-runs

Description

Lists the times the specified pipeline has run. You can optionally filter the complete list of results to include only the runs you are interested in.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id pipeline_id [filter] [Common Options]

On Windows:

ruby datapipeline --list-runs --id pipeline_id [filter] [Common Options]

Options

--id pipeline_id: The identifier of the pipeline. Required: Yes.

--status code: Filters the list to include only runs in the specified statuses. The valid statuses are: waiting, pending, cancelled, running, finished, failed, waiting_for_runner, and checking_preconditions. You can combine statuses as a comma-separated list. Example: --status running. Example: --status pending,checking_preconditions. Required: No.

--failed: Filters the list to include only runs in the failed state that started during the last 2 days and were scheduled to end within the last 15 days. Required: No.

--running: Filters the list to include only runs in the running state that started during the last 2 days and were scheduled to end within the last 15 days. Required: No.

--start-interval date1,date2: Filters the list to include only runs that started within the specified interval. Required: No.

--schedule-interval date1,date2: Filters the list to include only runs that are scheduled to start within the specified interval. Required: No.

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229)

.

Output

A list of the times the specified pipeline has run and the status of each run. You can filter this list by the options you specify when you run the command.

Examples

The first command lists all the runs for the specified pipeline. The other commands show how to filter the complete list of runs using different options.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --status PENDING

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --start-interval 2011-11-29T06:07:21,2011-12-06T06:07:21
./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --schedule-interval 2011-11-29T06:07:21,2011-12-06T06:07:21

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --failed

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --running

On Windows:

ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --status PENDING
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --start-interval 2011-11-29T06:07:21,2011-12-06T06:07:21
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --schedule-interval 2011-11-29T06:07:21,2011-12-06T06:07:21
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --failed
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --running


Related Commands

--list-pipelines (p. 221)

--put

Description

Uploads a pipeline definition file to AWS Data Pipeline for a new or existing pipeline, but does not activate the pipeline. Use the --activate parameter in a separate command when you want the pipeline to begin.

To specify a pipeline definition file at the time that you create the pipeline, use this command with the

--create (p. 217) command.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --put pipeline_definition_file --id pipeline_id [Common Options]

On Windows:

ruby datapipeline --put pipeline_definition_file --id pipeline_id [Common Options]

Options

pipeline_definition_file: The name of the pipeline definition file. Example: pipeline-definition-file.json. Required: Yes.

--id pipeline_id: The identifier of the pipeline. You must specify the identifier of the pipeline when updating an existing pipeline with a new pipeline definition file. Example: --id df-00627471SOVYZEXAMPLE. Required: Conditional.

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229)

.

Output

A response indicating that the new definition was successfully loaded, or, in the case where you are also using the --create (p. 217) command, an indication that the new pipeline was successfully created.


Examples

The following examples show how to use --put to create a new pipeline (example one) and how to use --put and --id to add a definition file to a pipeline (example two) or update a preexisting pipeline definition file of a pipeline (example three).

On Linux/Unix/Mac OS:

./datapipeline --create my-pipeline --put my-pipeline-definition.json

./datapipeline --id df-00627471SOVYZEXAMPLE --put a-pipeline-definition.json

./datapipeline --id df-00627471SOVYZEXAMPLE --put my-updated-pipeline-definition.json

On Windows:

ruby datapipeline --create my-pipeline --put my-pipeline-definition.json
ruby datapipeline --id df-00627471SOVYZEXAMPLE --put a-pipeline-definition.json
ruby datapipeline --id df-00627471SOVYZEXAMPLE --put my-updated-pipeline-definition.json

Related Commands

--create (p. 217)

--get, --g (p. 219)

--activate

Description

Starts a new or existing pipeline.

To specify a pipeline definition file at the time that you create the pipeline, use this command with the

--create (p. 217) command.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --activate --id pipeline_id [Common Options]

On Windows:

ruby datapipeline --activate --id pipeline_id [Common Options]


Options

--id pipeline_id: The identifier of the pipeline to activate. Example: --id df-00627471SOVYZEXAMPLE. Required: Conditional.

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229)

.

Output

A response indicating that the pipeline was successfully activated.

Examples

The following example activates the pipeline with the identifier df-00627471SOVYZEXAMPLE.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-00627471SOVYZEXAMPLE

On Windows:

ruby datapipeline --activate --id df-00627471SOVYZEXAMPLE

Related Commands

--create (p. 217)


--get, --g (p. 219)

--rerun

Description

Reruns one or more specified objects from within a pipeline that is either currently running or has previously run. Resets the retry count of the object and then runs the object. It also tries to cancel the current attempt if an attempt is running.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --rerun object_id --id pipeline_id [Common Options]

On Windows:

ruby datapipeline --rerun object_id --id pipeline_id [Common Options]

Note: object_id can be a comma-separated list.

Options

object_id: The identifier of the object. Example: o-06198791C436IEXAMPLE. Required: Yes.

--id pipeline_id: The identifier of the pipeline. Example: --id df-00627471SOVYZEXAMPLE. Required: Yes.

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229)

.

Output

None. To see the status of the object set to rerun, use

--list-runs

.

Examples

Reruns the specified object in the indicated pipeline.

On Linux/Unix/Mac OS:


./datapipeline --rerun o-06198791C436IEXAMPLE --id df-00627471SOVYZEXAMPLE

On Windows: ruby datapipeline --rerun o-06198791C436IEXAMPLE --id df-00627471SOVYZEXAMPLE

Related Commands

--list-runs (p. 222)

--list-pipelines (p. 221)

--validate

Description

Validates the pipeline definition for correct syntax. Also performs additional checks, such as a check for circular dependencies.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --validate pipeline_definition_file

On Windows:

ruby datapipeline --validate pipeline_definition_file

Options

pipeline_definition_file: The full path to the pipeline definition file to validate. Example: my-pipeline.json. Required: Yes.
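Examples

The following example, added here for illustration, validates a hypothetical definition file named my-pipeline.json.

On Linux/Unix/Mac OS:

./datapipeline --validate my-pipeline.json

On Windows:

ruby datapipeline --validate my-pipeline.json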


Common Options for AWS Data Pipeline Commands

The following set of options is accepted by most of the commands described in this guide; an illustrative invocation follows the list.

--access-key aws_access_key: The access key ID associated with your AWS account. If you specify --access-key, you must also specify --secret-key. This option is required if you aren't using a JSON credentials file (see --credentials). Example: --access-key AKIAIOSFODNN7EXAMPLE. For more information, see Setting Credentials for the AWS Data Pipeline Command Line Interface. Required: Conditional.

--credentials json_file: The location of the JSON file with your AWS credentials. You don't need to set this option if the JSON file is named credentials.json and it exists in either your user home directory or the directory where the AWS Data Pipeline CLI is installed; the CLI automatically finds the JSON file if it exists in either location. If you specify a credentials file (either using this option or by including credentials.json in one of its two supported locations), you don't need to use the --access-key and --secret-key options. Example: TBD. For more information, see Setting Credentials for the AWS Data Pipeline Command Line Interface. Required: Conditional.

--endpoint url: The URL of the AWS Data Pipeline endpoint that the CLI should use to contact the web service. If you specify an endpoint both in a JSON file and with this command line option, the CLI ignores the endpoint set with this command line option. Example: TBD.

--id pipeline_id: Use the specified pipeline identifier. Example: --id df-00627471SOVYZEXAMPLE.

--limit limit: The field limit for the pagination of objects. Example: TBD.

--secret-key aws_secret_key: The secret access key associated with your AWS account. If you specify --secret-key, you must also specify --access-key. This option is required if you aren't using a JSON credentials file (see --credentials). Example: --secret-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY. For more information, see Setting Credentials for the AWS Data Pipeline Command Line Interface. Required: Conditional.

--timeout seconds: The number of seconds for the AWS Data Pipeline client to wait before timing out the HTTP connection to the AWS Data Pipeline web service. Example: --timeout 120. Required: No.

--t, --trace: Prints detailed debugging output. Required: No.

--v, --verbose: Prints verbose output. This is useful for debugging. Required: No.
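For illustration only, common options are simply appended to whatever command you run; the endpoint URL and timeout value below are arbitrary examples, not recommended settings.

On Linux/Unix/Mac OS:

./datapipeline --list-pipelines --endpoint https://datapipeline.us-east-1.amazonaws.com --timeout 120 --verbose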


Program AWS Data Pipeline

Topics

Make an HTTP Request to AWS Data Pipeline (p. 231)

Actions in AWS Data Pipeline (p. 234)

Make an HTTP Request to AWS Data Pipeline

If you don't use one of the AWS SDKs, you can perform AWS Data Pipeline operations over HTTP using the POST request method. The POST method requires you to specify the operation in the header of the request and provide the data for the operation in JSON format in the body of the request.

HTTP Header Contents

AWS Data Pipeline requires the following information in the header of an HTTP request:

host

The AWS Data Pipeline endpoint. For information about endpoints, see Regions and Endpoints .

x-amz-date

You must provide the time stamp in either the HTTP Date header or the AWS x-amz-date header. (Some HTTP client libraries don't let you set the Date header.) When an x-amz-date header is present, the system ignores any Date header during the request authentication.

The date must be specified in one of the following three formats, as specified in the HTTP/1.1 RFC:

• Sun, 06 Nov 1994 08:49:37 GMT (RFC 822, updated by RFC 1123)

• Sunday, 06-Nov-94 08:49:37 GMT (RFC 850, obsoleted by RFC 1036)

• Sun Nov 6 08:49:37 1994 (ANSI C asctime() format)

Authorization

The set of authorization parameters that AWS uses to ensure the validity and authenticity of the request. For more information about constructing this header, go to Signature Version

4 Signing Process .

x-amz-target

The destination service of the request and the operation for the data, in the format:

<<serviceName>>_<<API version>>.<<operationName>>

For example,

DataPipeline_20121129.ActivatePipeline

content-type

Specifies JSON and the version. For example,

Content-Type: application/x-amz-json-1.0

The following is an example header for an HTTP request to activate a pipeline.


POST / HTTP/1.1

host: datapipeline.us-east-1.amazonaws.com

x-amz-date: Mon, 12 Nov 2012 17:49:52 GMT
x-amz-target: DataPipeline_20121129.ActivatePipeline

Authorization: AuthParams

Content-Type: application/x-amz-json-1.1

Content-Length: 39

Connection: Keep-Alive

HTTP Body Content

The body of an HTTP request contains the data for the operation specified in the header of the HTTP request. The data must be formatted according to the JSON data schema for each AWS Data Pipeline

API. The AWS Data Pipeline JSON data schema defines the types of data and parameters (such as comparison operators and enumeration constants) available for each operation.

Format the Body of an HTTP request

Use the JSON data format to convey data values and data structure, simultaneously. Elements can be nested within other elements by using bracket notation. The following example shows a request for putting a pipeline definition consisting of three objects and their corresponding slots.

{"pipelineId": "df-06372391ZG65EXAMPLE",

"pipelineObjects":

[

{"id": "Default",

"name": "Default",

"slots":

[

{"key": "workerGroup",

"stringValue": "MyWorkerGroup"}

]

},

{"id": "Schedule",

"name": "Schedule",

"slots":

[

{"key": "startDateTime",

"stringValue": "2012-09-25T17:00:00"},

{"key": "type",

"stringValue": "Schedule"},

{"key": "period",

"stringValue": "1 hour"},

{"key": "endDateTime",

"stringValue": "2012-09-25T18:00:00"}

]

},

{"id": "SayHello",

"name": "SayHello",

"slots":

[

{"key": "type",


"stringValue": "ShellCommandActivity"},

{"key": "command",

"stringValue": "echo hello"},

{"key": "parent",

"refValue": "Default"},

{"key": "schedule",

"refValue": "Schedule"}

]

}

]

}

Handle the HTTP Response

Here are some important headers in the HTTP response, and how you should handle them in your application:

HTTP/1.1—This header is followed by a status code. A code value of 200 indicates a successful operation. Any other value indicates an error.

x-amzn-RequestId—This header contains a request ID that you can use if you need to troubleshoot a request with AWS Data Pipeline. An example of a request ID is

K2QH8DNOU907N97FNA2GDLL8OBVV4KQNSO5AEMVJF66Q9ASUAAJG.

x-amz-crc32—AWS Data Pipeline calculates a CRC32 checksum of the HTTP payload and returns this checksum in the x-amz-crc32 header. We recommend that you compute your own CRC32 checksum on the client side and compare it with the x-amz-crc32 header; if the checksums do not match, it might indicate that the data was corrupted in transit. If this happens, you should retry your request.

AWS SDK users do not need to manually perform this verification, because the SDKs compute the checksum of each reply from AWS Data Pipeline and automatically retry if a mismatch is detected.
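As a minimal sketch of the client-side check described above, the following Python uses only the standard library; the response_body (bytes) and headers (a dict-like object) are assumed to come from whatever HTTP client you use and are not part of any AWS library.

import zlib

def crc32_matches(response_body, headers):
    """Return True when a locally computed CRC32 of the HTTP payload
    matches the x-amz-crc32 response header, or when the header is absent."""
    reported = headers.get("x-amz-crc32")
    if reported is None:
        return True
    computed = zlib.crc32(response_body) & 0xFFFFFFFF
    return computed == int(reported)

# If crc32_matches(...) returns False, retry the request.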

Sample AWS Data Pipeline JSON Request and Response

The following example shows a request that creates a new pipeline, followed by the AWS Data Pipeline response, which includes the pipeline identifier of the newly created pipeline.

HTTP POST Request

POST / HTTP/1.1

host: datapipeline.us-east-1.amazonaws.com

x-amz-date: Mon, 12 Nov 2012 17:49:52 GMT
x-amz-target: DataPipeline_20121129.CreatePipeline

Authorization: AuthParams

Content-Type: application/x-amz-json-1.1

Content-Length: 50

Connection: Keep-Alive

{"name": "MyPipeline",

"uniqueId": "12345ABCDEFG"}


AWS Data Pipeline Response

HTTP/1.1 200
x-amzn-RequestId: b16911ce-0774-11e2-af6f-6bc7a6be60d9
x-amz-crc32: 2215946753

Content-Type: application/x-amz-json-1.0

Content-Length: 2

Date: Mon, 16 Jan 2012 17:50:53 GMT

{"pipelineId": "df-06372391ZG65EXAMPLE"}

Actions in AWS Data Pipeline

• ActivatePipeline

• CreatePipeline

• DeletePipeline

• DescribeObjects

• DescribePipelines

• GetPipelineDefinition

• ListPipelines

• PollForTask

• PutPipelineDefinition

• QueryObjects

• ReportTaskProgress

• SetStatus

• SetTaskStatus

• ValidatePipelineDefinition


AWS Task Runner Reference

Topics

Install Task Runner (p. 235)

Start Task Runner (p. 235)

Verify Task Runner (p. 236)

Setting Credentials for Task Runner (p. 236)

Task Runner Threading (p. 236)

Long Running Preconditions (p. 236)

Task Runner Configuration Options (p. 237)

Task Runner is a task agent application that polls AWS Data Pipeline for scheduled tasks and executes them on Amazon EC2 instances, Amazon EMR clusters, or other computational resources, reporting status as it does so. Depending on your application, you may choose to:

• Have AWS Data Pipeline install and manage one or more Task Runner applications for you on computational resources managed by the web service. In this case, you do not need to install or configure Task Runner.

• Manually install and configure Task Runner on a computational resource such as a long-running EC2 instance or a physical server. To do so, use the following procedures.

• Manually install and configure a custom task agent instead of Task Runner. The procedures for doing so will depend on the implementation of the custom task agent.

Install Task Runner

To install Task Runner, download TaskRunner-1.0.jar from Task Runner download and copy it into a folder. Additionally, download mysql-connector-java-5.1.18-bin.jar from http://dev.mysql.com/usingmysql/java/ and copy it into the same folder where you install Task Runner.

Start Task Runner

In a new command prompt window that is set to the directory where you installed Task Runner, start Task Runner with the following command.


Warning

If you close the terminal window, or interrupt the command with CTRL+C, Task Runner stops, which halts the pipeline runs.

java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=myWorkerGroup

The --config option points to your credentials file. The --workerGroup option specifies the name of your worker group, which must be the same value as specified in your pipeline for tasks to be processed.

When Task Runner is active, it prints the path to where log files are written in the terminal window. The following is an example.

Logging to /myComputerName/.../dist/output/logs

Verify Task Runner

The easiest way to verify that Task Runner is working is to check whether it is writing log files. The log files are stored in the directory where you started Task Runner.

When you check the logs, make sure that you are checking logs for the current date and time. Task Runner creates a new log file each hour, where the hour from midnight to 1 AM is 00, so the format of the log file name is TaskRunner.log.YYYY-MM-DD-HH, where HH runs from 00 to 23, in UTC. To save storage space, any log files older than eight hours are compressed with GZip.
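For example, with an illustrative date, the log file covering 17:00 to 18:00 UTC on 12 November 2012 would be named TaskRunner.log.2012-11-12-17.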

Setting Credentials for Task Runner

In order to connect to the AWS Data Pipeline web service to process your commands, configure Task Runner with an AWS account that has permissions to create and/or manage data pipelines.

To Set Your Credentials Implicitly with a JSON File

• Create a JSON file named credentials.json in the directory where you installed Task Runner. For information about what to include in the JSON file, see Create a Credentials File (p. 18).

Task Runner Threading

Task Runner activities and preconditions are single threaded, which means each thread handles one work item at a time. By default, Task Runner has one activity thread and two precondition threads. If you are installing Task Runner and expect it to handle more work than that at one time, increase the --activities and --preconditions values, as shown in the example that follows.
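For illustration only, the following launch raises both limits. The thread counts are arbitrary, and the option=value form simply mirrors the --workerGroup example shown earlier; check --help for the exact syntax your Task Runner version expects.

java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=myWorkerGroup --activities=3 --preconditions=5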

Long Running Preconditions

For performance reasons, pipeline retry logic for preconditions happens in Task Runner, and AWS Data Pipeline supplies Task Runner with preconditions only once per 30-minute period. Task Runner honors the retryDelay field that you define on preconditions. You can configure the preconditionTimeout slot to limit the precondition retry period.


Task Runner Configuration Options

These are the configuration options available from the command line when you launch Task Runner.

--help: Displays command line help.

--config: Path and file name of your credentials.json file.

--accessId: The AWS access ID for Task Runner to use when making requests.

--secretKey: The AWS secret key for Task Runner to use when making requests.

--endpoint: The AWS Data Pipeline service endpoint to use.

--workerGroup: The name of the worker group that Task Runner retrieves work for.

--output: The Task Runner directory for output files.

--log: The Task Runner directory for local log files. If it is not absolute, it is relative to the output directory. Default is 'logs'.

--staging: The Task Runner directory for staging files. If it is not absolute, it is relative to the output directory. Default is 'staging'.

--temp: The Task Runner directory for temporary files. If it is not absolute, it is relative to the output directory. Default is 'tmp'.

--activities: The number of activity threads to run simultaneously. Defaults to 1.

--preconditions: The number of precondition threads to run simultaneously. Defaults to 2.

--pcSuffix: The suffix to use for preconditions. Defaults to "precondition".


Web Service Limits

To ensure there is capacity for all users of the AWS Data Pipeline service, the web service imposes limits on the amount of resources you can allocate and the rate at which you can allocate them.

Account Limits

The following limits apply to a single AWS account. If you require additional capacity, you can contact

Amazon Web Services to increase your capacity.

Number of pipelines: 20. Adjustable: Yes.

Number of pipeline components per pipeline: 50. Adjustable: Yes.

Number of fields per pipeline component: 50. Adjustable: Yes.

Number of UTF8 bytes per field name or identifier: 256. Adjustable: Yes.

Number of UTF8 bytes per field: 10,240. Adjustable: No.

Number of UTF8 bytes per pipeline component: 15,360, including the names of fields. Adjustable: No.

Rate of creation of an instance from a pipeline component: 1 per 5 minutes. Adjustable: No.

Number of running instances of a pipeline component: 5. Adjustable: Yes.

Retries of a pipeline activity: 5 per task. Adjustable: No.

Minimum delay between retry attempts: 2 minutes. Adjustable: No.

Minimum scheduling interval: 15 minutes. Adjustable: No.

Maximum number of rollups into a single object: 32. Adjustable: No.

Web Service Call Limits

AWS Data Pipeline limits the rate at which you can call the web service API. These limits also apply to AWS Data Pipeline agents that call the web service API on your behalf, such as the console, CLI, and Task Runner.

The following limits apply to a single AWS account. This means the total usage on the account, including that by IAM users, cannot exceed these limits.

The burst rate lets you save up web service calls during periods of inactivity and expend them all in a short amount of time. For example, CreatePipeline has a regular rate of 1 call each 5 seconds. If you don't call the service for 30 seconds, you will have 6 calls saved up. You could then call the web service 6 times in a second. Because this is below the burst limit and keeps your average calls at the regular rate limit, your calls are not throttled.

If you exceed the rate limit and the burst limit, your web service call fails and returns a throttling exception.

The default implementation of a worker, Task Runner, automatically retries API calls that fail with a throttling exception, with a back off so that subsequent attempts to call the API occur at increasingly longer intervals. If you write a worker, we recommend that you implement similar retry logic.
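A minimal sketch of such retry logic follows. It is written in Python; the call_api callable, the delay values, and the ThrottlingError exception name are illustrative assumptions, not part of the AWS Data Pipeline API.

import random
import time

class ThrottlingError(Exception):
    """Placeholder for whatever throttling exception your client surfaces."""

def call_with_backoff(call_api, max_attempts=5, base_delay=1.0):
    """Retry a web service call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call_api()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise
            # Sleep roughly 1, 2, 4, 8... seconds, with a little jitter.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))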

These limits are applied against an individual AWS account.

ActivatePipeline: Regular rate limit: 1 call per 5 seconds. Burst limit: 10 calls.

CreatePipeline: Regular rate limit: 1 call per 5 seconds. Burst limit: 10 calls.

DeletePipeline: Regular rate limit: 1 call per 5 seconds. Burst limit: 10 calls.

DescribeObjects: Regular rate limit: 1 call per 2 seconds. Burst limit: 10 calls.

DescribePipelines: Regular rate limit: 1 call per 5 seconds. Burst limit: 10 calls.

GetPipelineDefinition: Regular rate limit: 1 call per 5 seconds. Burst limit: 10 calls.

PollForTask: Regular rate limit: 1 call per 2 seconds. Burst limit: 10 calls.

ListPipelines: Regular rate limit: 1 call per 5 seconds. Burst limit: 10 calls.

PutPipelineDefinition: Regular rate limit: 1 call per 5 seconds. Burst limit: 10 calls.

QueryObjects: Regular rate limit: 1 call per 2 seconds. Burst limit: 10 calls.

ReportProgress: Regular rate limit: 1 call per 2 seconds. Burst limit: 10 calls.

SetTaskStatus: Regular rate limit: 2 calls per second. Burst limit: 10 calls.

SetStatus: Regular rate limit: 1 call per 5 seconds. Burst limit: 10 calls.

ReportTaskRunnerHeartbeat: Regular rate limit: 1 call per 5 seconds. Burst limit: 10 calls.

ValidatePipelineDefinition: Regular rate limit: 1 call per 5 seconds. Burst limit: 10 calls.

Scaling Considerations

AWS Data Pipeline scales to accommodate a huge number of concurrent tasks and you can configure it to automatically create the resources necessary to handle large workloads. These automatically-created resources are under your control and count against your AWS account resource limits. For example, if you configure AWS Data Pipeline to automatically create a 20-node EMR cluster to process data and your AWS account has an EC2 instance limit set to 20, you may inadvertently exhaust your available backfill resources. As a result, consider these resource restrictions in your design or increase your account limits accordingly.

If you require additional capacity, you can contact Amazon Web Services to increase your capacity.


AWS Data Pipeline Resources


The following table lists related resources to help you use AWS Data Pipeline.

AWS Data Pipeline API Reference: Describes AWS Data Pipeline operations, errors, and data structures.

AWS Data Pipeline Technical FAQ: Covers the top 20 questions developers ask about this product.

Release Notes: Provide a high-level overview of the current release. They specifically note any new features, corrections, and known issues.

AWS Developer Resource Center: A central starting point to find documentation, code samples, release notes, and other information to help you build innovative applications with AWS.

AWS Management Console: The AWS Data Pipeline console.

Discussion Forums: A community-based forum for developers to discuss technical questions related to Amazon Web Services.

AWS Support Center: The home page for AWS Technical Support, including access to our Developer Forums, Technical FAQs, Service Status page, and Premium Support.

AWS Premium Support: The primary web page for information about AWS Premium Support, a one-on-one, fast-response support channel to help you build and run applications on AWS Infrastructure Services.

AWS Data Pipeline Product Information: The primary web page for information about AWS Data Pipeline.

Contact Us: A form for questions about your AWS account, including billing.

Terms of Use: Detailed information about the copyright and trademark usage at Amazon.com and other topics.


Document History

This documentation is associated with the 2012-10-29 version of AWS Data Pipeline.

Change: Guide revision. Description: This release is the initial release of the AWS Data Pipeline Developer Guide. Release date: 20 December 2012.
