AWS Data Pipeline
Developer Guide
API Version 2012-10-29
Amazon Web Services
What is AWS Data Pipeline?
AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks.
For example, you can use AWS Data Pipeline to archive your web server's logs to Amazon Simple Storage Service (Amazon S3) each day and then run a weekly Amazon Elastic MapReduce (Amazon EMR) job flow over those logs to generate traffic reports.
In this example, AWS Data Pipeline would schedule the daily tasks to copy data and the weekly task to launch the Amazon EMR job flow. AWS Data Pipeline would also ensure that Amazon EMR waits for the final day's data to be uploaded to Amazon S3 before it begins its analysis, even if there is an unforeseen delay in uploading the logs.
AWS Data Pipeline handles the ambiguities of real-world data management. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up.
How Does AWS Data Pipeline Work?
Three main components of AWS Data Pipeline work together to manage your data:
• Pipeline definition specifies the business logic of your data management. For more information, see Pipeline Definition Files (p. 135).
• The AWS Data Pipeline web service interprets the pipeline definition and assigns tasks to workers to move and transform data.
• Task runners poll the AWS Data Pipeline web service for tasks and then perform those tasks. In the previous example, Task Runner would copy log files to Amazon S3 and launch Amazon EMR job flows. Task Runner is installed and runs automatically on resources created by your pipeline definitions. You can write a custom task runner application, or you can use the Task Runner application that is provided by AWS Data Pipeline. For more information, see Task Runners (p. 5).
The following illustration shows how these components work together. If the pipeline definition supports nonserialized tasks, AWS Data Pipeline can manage tasks for multiple task runners working in parallel.
Pipeline Definition
A pipeline definition is how you communicate your business logic to AWS Data Pipeline. It contains the following information:
• Names, locations, and formats of your data sources.
• Activities that transform the data.
• The schedule for those activities.
• Resources that run your activities and preconditions.
• Preconditions that must be satisfied before the activities can be scheduled.
• Ways to alert you with status updates as pipeline execution proceeds.
From your pipeline definition, AWS Data Pipeline determines the tasks that will occur, schedules them, and assigns them to task runners. If a task is not completed successfully, AWS Data Pipeline retries the task according to your instructions and, if necessary, reassigns it to another task runner. If the task fails repeatedly, you can configure the pipeline to notify you.
For example, in your pipeline definition, you might specify that in 2013, log files generated by your application will be archived each month to an Amazon S3 bucket. AWS Data Pipeline would then create 12 tasks, each copying over a month's worth of data, regardless of whether the month contains 28, 29, 30, or 31 days.
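To make this concrete, a pipeline definition along these lines could be sketched as the following JSON excerpt. The object names, bucket path, and schedule values here are hypothetical illustrations, not taken from this guide:

```json
{
  "objects": [
    {
      "id": "MonthlySchedule",
      "type": "Schedule",
      "startDateTime": "2013-01-01T00:00:00",
      "endDateTime": "2014-01-01T00:00:00",
      "period": "1 month"
    },
    {
      "id": "MyArchivedLogs",
      "type": "S3DataNode",
      "schedule": {"ref": "MonthlySchedule"},
      "directoryPath": "s3://myBucket/archived-logs"
    }
  ]
}
```

Because the schedule period is "1 month", AWS Data Pipeline derives one task per month from this single definition, which is how the 12 tasks in the example above arise.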
You can create a pipeline definition in the following ways:
• Graphically, by using the AWS Data Pipeline console.
• Textually, by writing a JSON file in the format used by the command line interface.
• Programmatically, by calling the web service with either one of the AWS SDKs or the AWS Data Pipeline API.
A pipeline definition can contain the following types of components:
Data Node
The location of input data for a task or the location where output data is to be stored. The following data locations are currently supported:
• Amazon S3 bucket
• MySQL database
• Amazon DynamoDB
• Local data node

Activity
An interaction with the data. The following activities are currently supported:
• Copy to a new location
• Launch an Amazon EMR job flow
• Run a Bash script from the command line (requires a UNIX environment to run the script)
• Run a database query
• Run a Hive activity

Precondition
A conditional statement that must be true before an action can run. The following preconditions are currently supported:
• A command-line Bash script was successfully completed (requires a UNIX environment to run the script)
• Data exists
• A specific time or a time interval relative to another event has been reached
• An Amazon S3 location contains data
• An Amazon RDS or Amazon DynamoDB table exists

Schedule
Any or all of the following:
• The time that an action should start
• The time that an action should stop
• How often the action should run

Resource
A resource that can analyze or modify data. The following computational resources are currently supported:
• Amazon EMR job flow
• Amazon EC2 instance

Action
A behavior that is triggered when specified conditions are met, such as the failure of an activity. The following actions are currently supported:
• Amazon SNS notification
• Terminate action
For more information, see Pipeline Definition Files (p. 135).
Lifecycle of a Pipeline
After you create a pipeline definition, you create a pipeline and then add your pipeline definition to it. Your pipeline must be validated. After you have a valid pipeline definition, you can activate it. At that point, the pipeline runs and schedules tasks. When you are done with your pipeline, you can delete it.
The complete lifecycle of a pipeline is shown in the following illustration.
Task Runners
A task runner is an application that polls AWS Data Pipeline for tasks and then performs those tasks.
You can either use Task Runner as provided by AWS Data Pipeline, or create a custom Task Runner application.
Task Runner
Task Runner is a default implementation of a task runner that is provided by AWS Data Pipeline. When Task Runner is installed and configured, it polls AWS Data Pipeline for tasks associated with pipelines that you have activated. When a task is assigned to Task Runner, it performs that task and reports its status back to AWS Data Pipeline. If your workflow requires non-default behavior, you'll need to implement that functionality in a custom task runner.
There are three ways you can use Task Runner to process your pipeline:
• AWS Data Pipeline installs Task Runner for you on resources that are launched and managed by the web service.
• You install Task Runner on a computational resource that you manage, such as a long-running Amazon EC2 instance or an on-premises server.
• You modify the Task Runner code to create a custom Task Runner, which you then install on a computational resource that you manage.
Task Runner on AWS Data Pipeline-Managed Resources
When a resource is launched and managed by AWS Data Pipeline, the web service automatically installs Task Runner on that resource to process tasks in the pipeline. You specify a computational resource (either an Amazon EC2 instance or an Amazon EMR job flow) for the runsOn field of an activity object. When AWS Data Pipeline launches this resource, it installs Task Runner on that resource and configures it to process all activity objects that have their runsOn field set to that resource. When AWS Data Pipeline terminates the resource, the Task Runner logs are published to an Amazon S3 location before the resource shuts down.
For example, suppose you use the EmrActivity action in a pipeline and specify an EmrCluster object in the runsOn field. When AWS Data Pipeline processes that activity, it launches an Amazon EMR job flow and uses a bootstrap step to install Task Runner onto the master node. This Task Runner then processes the tasks for activities that have their runsOn field set to that EmrCluster object. The following excerpt from a pipeline definition shows this relationship between the two objects.
{
  "id" : "MyEmrActivity",
  "name" : "Work to perform on my data",
  "type" : "EmrActivity",
  "runsOn" : {"ref" : "MyEmrCluster"},
  "preStepCommand" : "scp remoteFiles localFiles",
  "step" : "s3://myBucket/myPath/myStep.jar,firstArg,secondArg",
  "step" : "s3://myBucket/myPath/myOtherStep.jar,anotherArg",
  "postStepCommand" : "scp localFiles remoteFiles",
  "input" : {"ref" : "MyS3Input"},
  "output" : {"ref" : "MyS3Output"}
},
{
  "id" : "MyEmrCluster",
  "name" : "EMR cluster to perform the work",
  "type" : "EmrCluster",
  "hadoopVersion" : "0.20",
  "keypair" : "myKeyPair",
  "masterInstanceType" : "m1.xlarge",
  "coreInstanceType" : "m1.small",
  "coreInstanceCount" : "10",
  "instanceTaskType" : "m1.small",
  "instanceTaskCount" : "10",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-hadoop,arg1,arg2,arg3",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-otherstuff,arg1,arg2"
}
If you have multiple AWS Data Pipeline-managed resources in a pipeline, Task Runner is installed on each of them, and they all poll AWS Data Pipeline for tasks to process.
Task Runner on User-Managed Resources
You can install Task Runner on computational resources that you manage, such as a long-running Amazon EC2 instance or a physical server. This approach can be useful when, for example, you want to use AWS Data Pipeline to process data that is stored inside your organization's firewall. By installing Task Runner on a server in the local network, you can access the local database securely and then poll AWS Data Pipeline for the next task to run. When AWS Data Pipeline ends processing or deletes the pipeline, the Task Runner instance remains running on your computational resource until you manually shut it down. Similarly, the Task Runner logs persist after pipeline execution is complete.
You download Task Runner, which is in Java Archive (JAR) format, and install it on your computational resource. To connect Task Runner to the pipeline activities it should process, add a workerGroup field to the activity object, and configure Task Runner to poll for that worker group value. You do this by passing the worker group string as a parameter (for example, --workerGroup=wg-12345) when you run the Task Runner JAR file.
{
"id" : "MyStoredProcedureActivity",
"type" : "StoredProcedureActivity",
"workerGroup" : "wg-12345",
"command" : "mkdir new-directory"
}
Custom Task Runner
If your data management requires behavior other than the default behavior provided by Task Runner, you need to create a custom task runner. Because Task Runner is an open-source application, you can use it as the basis for creating your custom implementation.
After you write the custom task runner, you install it on a computational resource that you own, such as a long-running EC2 instance or a physical server inside your organization's firewall. To connect your custom task runner to the pipeline activities it should process, add a workerGroup field to the activity object, and configure your custom task runner to poll for that worker group value.
For example, suppose you use the ShellCommandActivity action in a pipeline and specify a value for the workerGroup field. When AWS Data Pipeline processes that activity, it passes the task to a task runner that polls the web service for work in that worker group. The following excerpt from a pipeline definition shows how to configure the workerGroup field.
{
"id" : "CreateDirectory",
"type" : "ShellCommandActivity",
"workerGroup" : "wg-67890",
"command" : "mkdir new-directory"
}
When you create a custom task runner, you have complete control over how your pipeline activities are processed. The only requirement is that you communicate with AWS Data Pipeline as follows:
• Poll for tasks—Your task runner should poll AWS Data Pipeline for tasks to process by calling the PollForTask API. If tasks are ready in the work queue, PollForTask returns a response immediately. If no tasks are available in the queue, PollForTask uses long polling and holds the connection open for up to 90 seconds, during which time any newly scheduled tasks are handed to the task runner. Your task runner should not call PollForTask again on the same worker group until it receives a response, and this may take up to 90 seconds.
• Report progress—Your task runner should report its progress to AWS Data Pipeline by calling the ReportTaskProgress API each minute. If a task runner does not report its status after 5 minutes, and then every 20 minutes afterwards (configurable), AWS Data Pipeline assumes the task runner is unable to process the task and assigns the task in a subsequent response to PollForTask.
• Signal completion of a task—Your task runner should inform AWS Data Pipeline of the outcome when it completes a task by calling the SetTaskStatus API. The task runner calls this action regardless of whether the task was successful. The task runner does not need to call SetTaskStatus for tasks canceled by AWS Data Pipeline.
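As a sketch of this protocol, the following Python function shows the poll/report/complete loop a custom task runner might implement. The client object is injected as an assumption so the loop can be exercised without the real service; its method names (poll_for_task, report_task_progress, set_task_status) mirror the API actions above in boto3 style, and the FINISHED/FAILED status strings are illustrative.

```python
def run_once(client, worker_group):
    """Poll for one task, process it, and report the outcome.

    Returns the taskId that was processed, or None if the long
    poll returned no work (in which case the runner simply polls
    again).
    """
    # Long poll: may block up to ~90 seconds on the real service.
    response = client.poll_for_task(workerGroup=worker_group)
    task = response.get("taskObject")
    if task is None:
        return None  # long poll timed out with no work; poll again later
    task_id = task["taskId"]
    try:
        # ... perform the actual work for this task here ...
        # A real runner would call report_task_progress once a minute
        # while the work is running; shown once here for brevity.
        client.report_task_progress(taskId=task_id)
        client.set_task_status(taskId=task_id, taskStatus="FINISHED")
    except Exception as err:
        # Report failure regardless; the service decides on retries.
        client.set_task_status(taskId=task_id, taskStatus="FAILED",
                               errorMessage=str(err))
    return task_id
```

With a real boto3 "datapipeline" client this same loop would run unchanged inside a `while True:` wrapper; the injected client makes the control flow itself easy to test.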
Pipeline Components, Instances, and Attempts
There are three types of items associated with a scheduled pipeline:
• Pipeline Components — Pipeline components represent the business logic of the pipeline and are represented by the different sections of a pipeline definition. Pipeline components specify the data sources, activities, schedule, and preconditions of the workflow. They can inherit properties from parent components. Relationships among components are defined by reference. Pipeline components define the rules of data management; they are not a to-do list.
• Instances — When AWS Data Pipeline runs a pipeline, it compiles the pipeline components to create a set of actionable instances. Each instance contains all the information needed to perform a specific task. The complete set of instances is the to-do list of the pipeline. AWS Data Pipeline hands the instances out to task runners to process.
• Attempts — To provide robust data management, AWS Data Pipeline retries a failed operation. It continues to do so until the task reaches the maximum number of allowed retry attempts. Attempt objects track the various attempts, results, and failure reasons, if applicable. Essentially, an attempt is an instance with a retry counter.
Note
Retrying failed tasks is an important part of a fault tolerance strategy, and AWS Data Pipeline pipeline definitions provide conditions and thresholds to control retries. However, too many retries can delay detection of an unrecoverable failure, because AWS Data Pipeline does not report failure until it has exhausted all the retries that you specify. The extra retries may also accrue additional charges if they run on AWS resources. As a result, carefully consider when it is appropriate to change the AWS Data Pipeline default settings that control retries and related behavior.
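For illustration, a retry threshold is typically set on the activity object itself. The following excerpt is a hypothetical sketch; the maximumRetries and retryDelay field names and values are assumptions for illustration, not taken from this section:

```json
{
  "id" : "MyCopyActivity",
  "type" : "CopyActivity",
  "input" : {"ref" : "MyS3Input"},
  "output" : {"ref" : "MyS3Output"},
  "maximumRetries" : "2",
  "retryDelay" : "10 minutes"
}
```

With a setting like this, the instance for this activity would produce at most three attempt objects (the original run plus two retries) before AWS Data Pipeline reports failure.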
Lifecycle of a Pipeline Task
The following diagram illustrates how AWS Data Pipeline and a task runner interact to process a scheduled task.
Get Set Up for AWS Data Pipeline
There are several ways you can interact with AWS Data Pipeline:
• Console — a graphical interface you can use to create and manage pipelines. With it, you fill out web forms to specify the configuration details of your pipeline components. The AWS Data Pipeline console provides several templates, which are pre-configured pipelines for common scenarios. As you build your pipeline, a graphical representation of the components appears on the design pane, and the arrows between the components indicate how they are connected. Using the console is the easiest way to get started with AWS Data Pipeline. It creates the pipeline definition for you, and no JSON or programming knowledge is required. The console is available online at https://console.aws.amazon.com/datapipeline/. For more information about accessing the console, see Access the Console (p. 12).
• Command Line Interface (CLI) — an application you run on your local machine to connect to AWS Data Pipeline and create and manage pipelines. With it, you issue commands in a terminal window and pass in JSON files that specify the pipeline definition. Using the CLI is the best option if you prefer working from a command line. For more information, see Install the Command Line Interface (p. 15).
• Software Development Kit (SDK) — AWS provides an SDK with functions that call AWS Data Pipeline to create and manage pipelines. With it, you can write applications that automate the process of creating and managing pipelines. Using the SDK is the best option if you want to extend or customize the functionality of AWS Data Pipeline. You can download the AWS SDK for Java from http://aws.amazon.com/sdkforjava/ .
• Web Service API — AWS provides a low-level interface that you can use to call the web service directly using JSON. Using the API is the best option if you want to create a custom SDK that calls AWS Data Pipeline. For more information, see the AWS Data Pipeline API Reference.
In addition, there is the Task Runner application, which is a default implementation of a task runner. Depending on the requirements of your data management, you may need to install Task Runner on a computational resource such as a long-running Amazon EC2 instance or a physical server. For more information about when to install Task Runner, see Task Runner (p. 5). For more information about how to install Task Runner, see Deploy and Configure Task Runner (p. 19).
Access the Console
The AWS Data Pipeline console enables you to do the following:
• Create, save, and activate your pipeline
• View the details of all the pipelines associated with your account
• Modify your pipeline
• Delete your pipeline
You must have an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. When you create an AWS account, AWS automatically signs up the account for all AWS services, including AWS Data Pipeline. With AWS Data Pipeline, you pay only for what you use. For more information about AWS Data Pipeline usage rates, see AWS Data Pipeline.
If you have an AWS account already, skip to the next step. If you don't have an AWS account, use the following procedure to create one.
To create an AWS account
1. Go to AWS and click Sign Up Now.
2. Follow the on-screen instructions. Part of the sign-up process involves receiving a phone call and entering a PIN using the phone keypad.
To access the console
1. Sign in to the AWS Management Console and open the AWS Data Pipeline console at https://console.aws.amazon.com/datapipeline/.
2. If your account doesn't already have data pipelines, the console displays an introductory screen that prompts you to create your first pipeline. This screen also provides an overview of the process for creating a pipeline, and links to relevant documentation and resources. Click Create Pipeline to create your pipeline.
If you already have pipelines associated with your account, the console displays the page listing all the pipelines associated with your account. Click Create New Pipeline to create your pipeline.
Where Do I Go Now?
You are now ready to start creating your pipelines. For more information about creating a pipeline, see the following tutorials:
• Tutorial: Copy CSV Data from Amazon S3 to Amazon S3 (p. 25)
• Tutorial: Copy Data From a MySQL Table to Amazon S3 (p. 40)
• Tutorial: Launch an Amazon EMR Job Flow (p. 56)
• Tutorial: Run a Shell Command to Process MySQL Table (p. 107)
Install the Command Line Interface
The AWS Data Pipeline command line interface (CLI) is a tool you can use to create and manage pipelines from a terminal window. It is written in Ruby and makes calls to the web service on your behalf.
Topics
• Install Ruby (p. 15)
• Install the RubyGems package management framework (p. 15)
• Install Prerequisite Ruby Gems (p. 16)
• Install the AWS Data Pipeline CLI (p. 17)
• Locate your AWS Credentials (p. 17)
• Create a Credentials File (p. 18)
• Verify the CLI (p. 18)
Install Ruby
The AWS Data Pipeline CLI requires Ruby 1.8.7. Some operating systems, such as Mac OS, come with Ruby pre-installed.
To verify the Ruby installation and version
• To check whether Ruby is installed, and which version, run the following command in a terminal window. If Ruby is installed, this command displays its version information.
ruby -v
If you don’t have Ruby 1.8.7 installed, use the following procedure to install it.
To install Ruby on Linux/Unix/Mac OS
• Download Ruby from http://www.ruby-lang.org/en/downloads/ and follow the installation instructions for your version of Linux/Unix/Mac OS.
Install the RubyGems package management framework
The AWS Data Pipeline CLI requires a version of RubyGems that is compatible with Ruby 1.8.7.
To verify the RubyGems installation and version
• To check whether RubyGems is installed, run the following command from a terminal window. If RubyGems is installed, this command displays its version information.
gem -v
If you don’t have RubyGems installed, or have a version not compatible with Ruby 1.8.7, you need to download and install RubyGems before you can install the AWS Data Pipeline CLI.
To install RubyGems on Linux/Unix/Mac OS
1. Download RubyGems from http://rubyforge.org/frs/?group_id=126.
2. Install RubyGems using the following command.
sudo ruby setup.rb
Install Prerequisite Ruby Gems
The AWS Data Pipeline CLI requires Ruby 1.8.7 or greater, a compatible version of RubyGems, and the following Ruby gems:
• json (version 1.4 or greater)
• uuidtools (version 2.1 or greater)
• httparty (version 0.7 or greater)
• bigdecimal (version 1.0 or greater)
• nokogiri (version 1.4.4 or greater)
The following topics describe how to install the AWS Data Pipeline CLI and the Ruby environment it requires.
Use the following procedures to ensure that each of the gems listed above is installed.
To verify whether a gem is installed
• To check whether a gem is installed, run the following command from a terminal window. For example, if 'uuidtools' is installed, this command displays the name and version of the 'uuidtools' RubyGem.
gem search 'uuidtools'
If you don’t have 'uuidtools' installed, then you need to install it before you can install the AWS Data Pipeline CLI.
To install 'uuidtools' on Windows/Linux/Unix/Mac OS
• Install 'uuidtools' using the following command.
sudo gem install uuidtools
Install the AWS Data Pipeline CLI
After you have verified the installation of your Ruby environment, you’re ready to install the AWS Data Pipeline CLI.
To install the AWS Data Pipeline CLI on Windows/Linux/Unix/Mac OS
1. Download datapipeline-cli.zip from https://s3.amazonaws.com/datapipeline-us-east-1/software/latest/DataPipelineCLI/.
2. Unzip the compressed file. For example, on Linux/Unix/Mac OS use the following command:
unzip datapipeline-cli.zip
This uncompresses the CLI and supporting code into a new directory called dp-cli.
3. If you add the new directory, dp-cli, to your PATH variable, you can use the CLI without specifying the complete path. In this guide, we assume that you've updated your PATH variable, or that you run the CLI from the directory where it is installed.
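For example, updating PATH for the current session could look like the following. The path shown assumes you unzipped the CLI into your home directory; adjust it to wherever dp-cli actually lives, and add the line to your shell profile to make the change permanent.

```shell
# Append the dp-cli directory to PATH for this shell session only.
export PATH="$PATH:$HOME/dp-cli"
```

After this, `datapipeline` can be run from any directory in that session.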
Locate your AWS Credentials
When you create an AWS account, AWS assigns you an access key ID and a secret access key. AWS uses these credentials to identify you when you interact with a web service. You need these keys for the next step of the CLI installation process.
Note
Your secret access key is a shared secret between you and AWS. Keep this key secret; AWS uses these credentials to identify you and to bill you for the AWS services that you use. Never include your secret access key in your requests to AWS, and never email it to anyone, even if a request appears to originate from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your secret access key.
The following procedure explains how to locate your access key ID and secret access key in the AWS Management Console.
To view your AWS access credentials
1. Go to the Amazon Web Services website at http://aws.amazon.com.
2. Click My Account/Console, and then click Security Credentials.
3. Under Your Account, click Security Credentials.
4. In the spaces provided, type your user name and password, and then click Sign in using our secure server.
5. Under Access Credentials, on the Access Keys tab, your access key ID is displayed. To view your secret key, under Secret Access Key, click Show.
Make a note of your access key ID and your secret access key; you will use them in the next section.
Create a Credentials File
When you request services from AWS Data Pipeline, you must pass your credentials with the request so that AWS can properly authenticate and bill you. The command line interface obtains your credentials from a JSON document called a credentials file, which is stored in your home directory, ~/. Using a credentials file is the simplest way to make your AWS credentials available to the AWS Data Pipeline CLI.
The credentials file contains the following name-value pairs.
• comment — An optional comment within the credentials file.
• access-id — The access key ID for your AWS account.
• private-key — The secret access key for your AWS account.
• endpoint — The endpoint for AWS Data Pipeline in the region where you are making requests.
• log-uri — The location of the Amazon S3 bucket where AWS Data Pipeline writes log files.
In the following example credentials file, AKIAIOSFODNN7EXAMPLE represents an access key ID, and wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY represents the corresponding secret access key. The value of log-uri specifies the location of your Amazon S3 bucket and the path to the log files for actions performed by the AWS Data Pipeline web service on behalf of your pipeline.
{
"access-id": "AKIAIOSFODNN7EXAMPLE",
"private-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
"endpoint": "datapipeline.us-east-1.amazonaws.com",
"port": "443",
"use-ssl": "true",
"region": "us-east-1",
"log-uri": "s3://myawsbucket/logfiles"
}
After you replace the values for the access-id, private-key, and log-uri fields with the appropriate information, save the file as credentials.json in your home directory, ~/.
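As a quick sanity check before running the CLI, you can confirm the file parses as JSON and contains the required fields. This small Python sketch assumes the field names listed above; the helper name itself is ours, not part of the CLI.

```python
import json

# Fields the AWS Data Pipeline CLI expects in credentials.json,
# per the table above ("comment" is optional and not checked).
REQUIRED_FIELDS = {"access-id", "private-key", "endpoint", "log-uri"}

def missing_credential_fields(path):
    """Return the set of required fields absent from the credentials file."""
    with open(path) as f:
        creds = json.load(f)
    return REQUIRED_FIELDS - set(creds)
```

An empty set means every required field is present; a json.JSONDecodeError means the file is not valid JSON at all.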
Verify the CLI
To verify that the command line interface (CLI) is installed, use the following command.
datapipeline --help
If the CLI is installed correctly, this command displays the list of commands for the CLI.
Deploy and Configure Task Runner
Task Runner is an application that polls AWS Data Pipeline for scheduled tasks and processes the tasks assigned to it by the web service, reporting status as it does so.
Depending on your application, you may choose to:
• Have AWS Data Pipeline install and manage one or more Task Runner applications for you on computational resources managed by the web service. In this case, you do not need to install or configure Task Runner.
• Manually install and configure Task Runner on a computational resource such as a long-running Amazon EC2 instance or a physical server. To do so, use the following procedures.
• Manually install and configure a custom task runner instead of Task Runner. The procedures for doing so depend on the implementation of the custom task runner.
For more information about Task Runner and when and where it should be configured, see Task Runners (p. 5).
Note
You can only install Task Runner on Linux, UNIX, or Mac OS. Task Runner is not supported on the Windows operating system.
Install Java
Task Runner requires Java version 1.6 or later. To determine whether Java is installed, and the version that is running, use the following command:
java -version
If you do not have Java 1.6 or later installed on your computer, you can download the latest version from http://www.oracle.com/technetwork/java/index.html.
Install Task Runner
To install Task Runner, download TaskRunner-1.0.jar from the Task Runner download page and copy it into a folder. Additionally, download mysql-connector-java-5.1.18-bin.jar from http://dev.mysql.com/usingmysql/java/ and copy it into the same folder where you installed Task Runner.
Start Task Runner
In a new command prompt window that is set to the directory where you installed Task Runner, start Task Runner with the following command. The --config option points to your credentials file. The --workerGroup option specifies the name of your worker group, which must be the same value as specified in your pipeline for tasks to be processed.
java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=myWorkerGroup
When Task Runner is active, it prints the path to where log files are written in the terminal window. The following is an example.
Logging to /myComputerName/.../dist/output/logs
Warning
If you close the terminal window, or interrupt the command with CTRL+C, Task Runner stops, which halts the pipeline runs.
Verify Task Runner
The easiest way to verify that Task Runner is working is to check whether it is writing log files. The log files are stored in the directory where you started Task Runner.
When you check the logs, make sure that you are checking logs for the current date and time. Task Runner creates a new log file each hour, where the hour from midnight to 1 AM is 00. So the format of the log file name is TaskRunner.log.YYYY-MM-DD-HH, where HH runs from 00 to 23, in UTC. To save storage space, any log files older than eight hours are compressed with GZip.
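Given that naming scheme, a quick way to compute which log file Task Runner should currently be writing is a one-line strftime call; this is a sketch, and the helper name is ours:

```python
from datetime import datetime, timezone

def current_log_file(now=None):
    """Return the Task Runner log file name for the given UTC time.

    Follows the TaskRunner.log.YYYY-MM-DD-HH scheme described above,
    where the hour field is zero-padded (00 through 23).
    """
    now = now or datetime.now(timezone.utc)
    return now.strftime("TaskRunner.log.%Y-%m-%d-%H")
```

For example, at five minutes past midnight UTC on 2012-10-29, this returns TaskRunner.log.2012-10-29-00, which is the file to tail when checking that Task Runner is alive.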
Install the AWS SDK
The easiest way to write applications that interact with AWS Data Pipeline, or to implement a custom task runner, is to use one of the AWS SDKs. The AWS SDKs provide functionality that simplifies calling the web service APIs from your preferred programming environment. For more information about the programming languages and development environments that have AWS SDK support, see the AWS SDK listings.
If you are not writing programs that interact with AWS Data Pipeline, you do not need to install any of the AWS SDKs. You can create and run pipelines using the console or command-line interface.
This guide provides examples of programming AWS Data Pipeline using Java. The following are examples of how to download and install the AWS SDK for Java.
To install the AWS SDK for Java using Eclipse
• Install the AWS Toolkit for Eclipse.
Eclipse is a popular Java development environment. The AWS Toolkit for Eclipse installs the latest version of the AWS SDK for Java. From Eclipse, you can easily modify, build, and run any of the samples included in the SDK.
To install the AWS SDK for Java
• If you are using a Java development environment other than Eclipse, download and install the AWS SDK for Java.
Granting Permissions to Pipelines with IAM
In AWS Data Pipeline, IAM roles determine what your pipeline can access and the actions it can perform.
Additionally, when your pipeline creates a resource, such as an Amazon EC2 instance, IAM roles determine the resources and actions permitted to that instance. When you create a pipeline, you specify one IAM role that governs your pipeline and another IAM role that governs your pipeline's resources (referred to as a "resource role"); these can be the same role. Carefully consider the minimum permissions necessary for your pipeline to perform work and define the IAM roles accordingly.
It is important to note that even a modest pipeline might need access to resources and actions in various areas of AWS, for example:
• Accessing files in Amazon S3
• Creating and managing Amazon EMR clusters
• Creating and managing Amazon EC2 instances
• Accessing data in Amazon RDS or Amazon DynamoDB
• Sending notifications using Amazon SNS
When you use the AWS Data Pipeline console, you can choose a pre-defined, default IAM role and resource role or create a new one to suit your needs. However, when using the AWS Data Pipeline CLI, you must create a new IAM role and apply a policy to it yourself, for which you can use the following example policy. For more information about how to create a new IAM role and apply a policy to it, see
Managing IAM Policies in the Using IAM guide.
Warning
Carefully review and restrict permissions in the following example policy to only the resources that your pipeline requires.
{
"Statement": [
{
"Action": [
"s3:*"
],
"Effect": "Allow",
"Resource": [
"*"
]
},
{
"Action": [
"ec2:DescribeInstances",
"ec2:RunInstances",
"ec2:StartInstances",
"ec2:StopInstances",
"ec2:TerminateInstances"
],
"Effect": "Allow",
"Resource": [
"*"
]
},
{
"Action": [
"elasticmapreduce:*"
],
"Effect": "Allow",
"Resource": [
"*"
]
},
{
"Action": [
"dynamodb:*"
],
"Effect": "Allow",
"Resource": [
"*"
]
},
{
"Action": [
"rds:DescribeDBInstances",
"rds:DescribeDBSecurityGroups"
],
"Effect": "Allow",
"Resource": [
"*"
]
},
{
"Action": [
"sns:GetTopicAttributes",
"sns:ListTopics",
"sns:Publish",
"sns:Subscribe",
"sns:Unsubscribe"
],
"Effect": "Allow",
"Resource": [
"*"
]
},
{
"Action": [
"iam:PassRole"
],
"Effect": "Allow",
"Resource": [
"*"
]
},
{
"Action": [
"datapipeline:*"
],
"Effect": "Allow",
"Resource": [
"*"
]
}
]
}
After you define a role and apply its policy, you define a trusted entities list, which indicates the entities or services that are permitted to use your new role. You can use the following IAM trust relationship definition to allow AWS Data Pipeline and Amazon EC2 to use your new pipeline and resource roles. For more information about editing IAM trust relationships, see Modifying a Role in the Using IAM guide.
{
"Version": "2008-10-17",
"Statement": [
{
"Sid": "",
"Effect": "Allow",
"Principal": {
"Service": [
"ec2.amazonaws.com",
"datapipeline.amazonaws.com"
]
},
"Action": "sts:AssumeRole"
}
]
}
Grant Amazon RDS Permissions to Task Runner
Amazon RDS allows you to control access to your DB Instances using database security groups (DB
Security Groups). A DB Security Group acts like a firewall controlling network access to your DB Instance.
By default, network access is turned off to your DB Instances. You must modify your DB Security Groups to let Task Runner access your Amazon RDS instances. Task Runner gains Amazon RDS access from the instance on which it runs, so the accounts and security groups that you add to your Amazon RDS instance depend on where you install Task Runner.
To grant permissions to Task Runner
1. Sign in to the AWS Management Console and open the Amazon RDS console.
2. In the Amazon RDS: My DB Security Groups pane, click your Amazon RDS instance. In the DB Security Group pane, under Connection Type, select EC2 Security Group. Configure the fields in the EC2 Security Group pane as described below:
For a Task Runner running on an EC2 resource:
• AWS Account Id: Your AccountId
• EC2 Security Group: Your Security Group Name
For a Task Runner running on an EMR resource:
• AWS Account Id: Your AccountId
• EC2 Security Group: ElasticMapReduce-master
• AWS Account Id: Your AccountId
• EC2 Security Group: ElasticMapReduce-slave
For a Task Runner running in your local environment (on-premise):
• CIDR: the IP address range of your on-premise machine, or of your firewall if your on-premise computer is behind a firewall.
To allow connections from an RdsSqlPrecondition:
• AWS Account Id: 793385162516
• EC2 Security Group: DataPipeline
Tutorial: Copy CSV Data from Amazon S3 to Amazon S3
After you read What is AWS Data Pipeline? (p. 1)
and decide you want to use AWS Data Pipeline to automate the movement and transformation of your data, it is time to get started with creating data pipelines. To help you make sense of how AWS Data Pipeline works, let’s walk through a simple task.
This tutorial walks you through the process of creating a data pipeline to copy data from one Amazon S3 bucket to another and then send an Amazon SNS notification after the copy activity completes successfully.
You use the Amazon EC2 instance resource managed by AWS Data Pipeline for this copy activity.
Important
This tutorial does not employ the Amazon S3 API for high speed data transfer between Amazon
S3 buckets. It is intended only for demonstration purposes to help new customers understand how to create a simple pipeline and the related concepts. For advanced information about data transfer using Amazon S3, see Working with Buckets in the Amazon S3 Developer Guide.
The first step in pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each object. For more information, see
.
This tutorial uses the following objects to create a pipeline definition:
Activity
The activity that AWS Data Pipeline performs for this pipeline. This tutorial uses the CopyActivity object to copy CSV data from one Amazon S3 bucket to another.
Important
There are distinct limitations regarding the CSV file format with CopyActivity and S3DataNode. For more information, see CopyActivity (p. 180).
Schedule
The start date, time, and the recurrence for this activity. You can optionally specify the end date and time.
Resource
The resource that AWS Data Pipeline must use to perform this activity. This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to copy data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates the instance after the task finishes.
DataNodes
The input and output nodes for this pipeline. This tutorial uses S3DataNode for both the input and output nodes.
Action
The action that AWS Data Pipeline must take when the specified conditions are met. This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the email address you specify after the task finishes successfully. You must subscribe to the Amazon SNS topic ARN to receive the notifications.
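In a JSON pipeline definition, the SnsAlarm action is expressed as an object like the following sketch. The topic ARN, subject, message, and role values here are placeholders that you replace with your own, and the activity references this object through its onSuccess field.

```json
{
  "id": "CopyDataNotice",
  "type": "SnsAlarm",
  "topicArn": "arn:aws:sns:us-east-1:111122223333:my-example-topic",
  "subject": "Copy succeeded",
  "message": "The S3 copy activity finished successfully.",
  "role": "test-role"
}
```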
The following steps outline how to create a data pipeline to copy data from one Amazon S3 bucket to another Amazon S3 bucket.
1. Create your pipeline definition
2. Validate and save your pipeline definition
3. Activate your pipeline
4. Monitor the progress of your pipeline
5. [Optional] Delete your pipeline
Before You Begin...
Be sure you've completed the following steps.
• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console.
• Set up the AWS Data Pipeline tools and interface you plan to use.
• Create an Amazon S3 bucket as a data source.
For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started
Guide.
• Upload your data to your Amazon S3 bucket.
For more information, see Add an Object to a Bucket in the Amazon Simple Storage Service Getting
Started Guide.
• Create another Amazon S3 bucket as a data target.
• Create an Amazon SNS topic for sending email notification and make a note of the topic Amazon
Resource Name (ARN). For more information, see Create a Topic in the Amazon Simple Notification
Service Getting Started Guide.
• [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your own IAM role policy and trust relationships, follow the instructions
described in Granting Permissions to Pipelines with IAM (p. 21) .
Note
Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier .
Using the AWS Data Pipeline Console
Topics
• Create and Configure the Pipeline Definition Objects (p. 27)
• Validate and Save Your Pipeline (p. 30)
• Verify your Pipeline Definition (p. 30)
• Activate your Pipeline (p. 31)
• Monitor the Progress of Your Pipeline Runs (p. 31)
• [Optional] Delete your Pipeline (p. 33)
The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.
To create your pipeline definition
1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.
2. Click Create Pipeline.
3. On the Create a New Pipeline page:
a. In the Pipeline Name box, enter a name (for example, CopyMyS3Data).
b. In Pipeline Description, enter a description.
c. Leave the Select Schedule Type button set to the default type for this tutorial.
Note
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning or the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
d. Leave the Role boxes set to their default values for this tutorial.
Note
If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.
e. Click Create a new pipeline.
Create and Configure the Pipeline Definition Objects
Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.
1. On the Pipeline: name of your pipeline page, select Add activity.
2. In the Activities pane:
a. Enter the name of the activity (for example, copy-myS3-data).
b. In the Type box, select CopyActivity.
c. In the Input box, select Create new: DataNode.
d. In the Output box, select Create new: DataNode.
e. In the Schedule box, select Create new: Schedule.
f. In the Add an optional field... box, select RunsOn.
g. In the Runs On box, select Create new: Resource.
h. In the Add an optional field... box, select On Success.
i. In the On Success box, select Create new: Action.
j. In the left pane, separate the icons by dragging them apart.
You've now defined your pipeline by specifying the objects AWS Data Pipeline uses to perform the copy activity.
The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connections between the various objects.
Next, configure the run date and time for your pipeline.
To configure the run date and time for your pipeline
1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.
2. In the Schedules pane:
a. Enter a schedule name for this activity (for example, copy-myS3-data-schedule).
b. In the Type box, select Schedule.
c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.
Note
AWS Data Pipeline supports dates and times expressed in "YYYY-MM-DDTHH:MM:SS" format, in UTC/GMT only.
d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).
e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.
To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS
Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS
Data Pipeline launch its first job flow.
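If you want to see what a valid Start Date Time looks like, the required "YYYY-MM-DDTHH:MM:SS" string can be generated from the shell. This is a sketch that assumes GNU date; the one-day offset mirrors the backfill trick described above.

```shell
# Current time in the UTC format the console accepts.
date -u +%Y-%m-%dT%H:%M:%S

# One day in the past, to trigger immediate backfill (GNU date syntax).
date -u -d "1 day ago" +%Y-%m-%dT%H:%M:%S
```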
Next, configure the input and the output data nodes for your pipeline.
API Version 2012-10-29
28
AWS Data Pipeline Developer Guide
Create and Configure the Pipeline Definition Objects
To configure the input and output data nodes of your pipeline
1. On the Pipeline: name of your pipeline page, in the right pane, click DataNodes.
2. In the DataNodes pane:
a. In the DefaultDataNode1 Name box, enter the name for your input node (for example, MyS3Input). In this tutorial, your input node is the Amazon S3 data source bucket.
b. In the Type box, select S3DataNode.
c. In the Schedule box, select copy-myS3-data-schedule.
d. In the Add an optional field... box, select File Path.
e. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-input/name of your data file).
f. In the DefaultDataNode2 Name box, enter the name for your output node (for example, MyS3Output). In this tutorial, your output node is the Amazon S3 data target bucket.
g. In the Type box, select S3DataNode.
h. In the Schedule box, select copy-myS3-data-schedule.
i. In the Add an optional field... box, select File Path.
j. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your data file).
Next, configure the resource AWS Data Pipeline must use to perform the copy activity.
To configure the resource
1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.
2. In the Resources pane:
a. In the Name box, enter the name for your resource (for example, CopyDataInstance).
b. In the Type box, select Ec2Resource.
c. In the Schedule box, select copy-myS3-data-schedule.
d. Leave the Role and Resource Role boxes set to their default values for this tutorial.
Note
If you have created your own IAM roles, you can select them now.
Next, configure the SNS notification action AWS Data Pipeline must perform after the copy activity finishes successfully.
To configure the SNS notification action
1. On the Pipeline: name of your pipeline page, in the right pane, click Others.
2. In the Others pane:
a. In the DefaultAction1 Name box, enter the name for your Amazon SNS notification (for example, CopyDataNotice).
b. In the Type box, select SnsAlarm.
c. In the Topic Arn box, enter the ARN of your Amazon SNS topic.
d. In the Message box, enter the message content.
e. In the Subject box, enter the subject line for your notification.
f. Leave the Role box set to the default value for this tutorial.
You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.
Validate and Save Your Pipeline
You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.
If your pipeline is incomplete or incorrect, AWS Data Pipeline returns a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you still get a validation error message, you must fix the errors in the pipeline definition before activating your pipeline.
To validate and save your pipeline
1. On the Pipeline: name of your pipeline page, click Save Pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message. If you get an error message, click Close and then, in the right pane, click Errors.
3. The Errors pane lists the objects that failed validation. Click the plus (+) sign next to the object names and look for an error message in red.
4. When you see an error message, go to the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.
5. After you've fixed the errors listed in the Errors pane, click Save Pipeline.
6. Repeat the process until your pipeline is validated.
Next, verify that your pipeline definition has been saved.
Verify your Pipeline Definition
It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.
To verify your pipeline definition
1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.
2. On the List Pipelines page, check whether your newly created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.
3. Click the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.
4. In the Pipeline Summary panel, click View all fields to see the configuration of your pipeline definition.
5. Click Close.
Next, activate your pipeline.
Activate your Pipeline
You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.
To activate your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2. On the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens to confirm the activation.
3. Click Close.
Next, verify that your pipeline is running.
Monitor the Progress of Your Pipeline Runs
To monitor the progress of your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.
2. The Instance details: name of your pipeline page lists the status of each instance.
Note
If you do not see runs listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date, or click the Start (in UTC) date box and change it to an earlier date. Then click Update.
3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity. You should receive an email about the successful completion of this task at the account you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify that the data was copied.
4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. To troubleshoot failed or incomplete instance runs:
a. Click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.
b. In the Instance summary panel, click View instance fields to see details of the fields associated with the selected instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure; for example, @failureReason = Resource not healthy terminated.
c. In the Instance summary panel, in the Select attempt for this instance box, select the attempt number.
d. In the Instance summary panel, click View attempt fields to see details of the fields associated with the selected attempt.
5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.
You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.
For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about common problems and solutions, see Pipeline Problems and Solutions (p. 131).
Important
Your pipeline is running and incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring AWS Data Pipeline usage charges, delete your pipeline.
[Optional] Delete your Pipeline
Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.
To delete your pipeline
1. On the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.
Using the Command Line Interface
Topics
• Define a Pipeline in JSON Format (p. 33)
• Upload the Pipeline Definition (p. 38)
• Activate the Pipeline (p. 39)
• Verify the Pipeline Status (p. 39)
The following topics explain how to use the AWS Data Pipeline CLI to create and use pipelines to copy data from one Amazon S3 bucket to another. In this example, we perform the following steps:
• Create a pipeline definition using the CLI in JSON format
• Create the necessary IAM roles and define a policy and trust relationships
• Upload the pipeline definition using the AWS Data Pipeline CLI tools
• Monitor the progress of the pipeline
Define a Pipeline in JSON Format
This example scenario shows how to use JSON pipeline definitions and the AWS Data Pipeline CLI to schedule copying data between two Amazon S3 buckets at a specific time interval. This is the full pipeline definition JSON file followed by an explanation for each of its sections.
Note
We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.
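That syntax check can also be scripted. The following is a minimal sketch using Python's built-in json.tool module; the file name my-pipeline.json and its skeleton contents are placeholders for your real definition file.

```shell
# Write a trivial pipeline skeleton for illustration, then syntax-check it.
# Substitute your real definition file for my-pipeline.json.
cat > my-pipeline.json <<'EOF'
{ "objects": [ { "id": "MySchedule", "type": "Schedule" } ] }
EOF

# json.tool exits non-zero and reports the offending location on a syntax error.
python3 -m json.tool my-pipeline.json > /dev/null && echo "JSON OK"
```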
{
  "objects": [
    {
      "id": "MySchedule",
      "type": "Schedule",
      "startDateTime": "2012-11-25T00:00:00",
      "endDateTime": "2012-11-26T00:00:00",
      "period": "1 day"
    },
    {
      "id": "S3Input",
      "type": "S3DataNode",
      "schedule": { "ref": "MySchedule" },
      "filePath": "s3://testbucket/file.txt"
    },
    {
      "id": "S3Output",
      "type": "S3DataNode",
      "schedule": { "ref": "MySchedule" },
      "filePath": "s3://testbucket/file-copy.txt"
    },
    {
      "id": "MyEC2Resource",
      "type": "Ec2Resource",
      "schedule": { "ref": "MySchedule" },
      "actionOnTaskFailure": "terminate",
      "actionOnResourceFailure": "retryAll",
      "maximumRetries": "1",
      "role": "test-role",
      "resourceRole": "test-role",
      "instanceType": "m1.medium",
      "instanceCount": "1",
      "securityGroups": [ "test-group", "default" ],
      "keyPair": "test-pair"
    },
    {
      "id": "MyCopyActivity",
      "type": "CopyActivity",
      "runsOn": { "ref": "MyEC2Resource" },
      "input": { "ref": "S3Input" },
      "output": { "ref": "S3Output" },
      "schedule": { "ref": "MySchedule" }
    }
  ]
}
Schedule
The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule and you may have more than one.
The Schedule component is defined by the following fields:
{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-25T00:00:00",
  "endDateTime": "2012-11-26T00:00:00",
  "period": "1 day"
},
Note
In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.
Name
The user-defined name for the pipeline schedule, which is a label for your reference only.
Type
The pipeline component type, which is Schedule.
startDateTime
The date/time (in UTC format) that you want the task to begin.
endDateTime
The date/time (in UTC format) that you want the task to stop.
period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline copy operation runs only one time.
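This divisibility rule is easy to check by hand. The following sketch, which assumes GNU date, uses the example schedule's window and period:

```shell
# Schedule window from the example definition.
START=$(date -u -d "2012-11-25T00:00:00" +%s)
END=$(date -u -d "2012-11-26T00:00:00" +%s)
PERIOD=$(( 24 * 3600 ))   # "1 day", expressed in seconds

# The period must divide the window with no remainder.
if [ $(( (END - START) % PERIOD )) -eq 0 ]; then
  echo "OK: $(( (END - START) / PERIOD )) scheduled run(s)"
else
  echo "Invalid: period does not evenly divide the window"
fi
```

With a one-day window and a one-day period, this reports a single scheduled run.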
Amazon S3 Data Nodes
Next, the input S3DataNode pipeline component defines a location for the input files; in this case, an
Amazon S3 bucket location. The input S3DataNode component is defined by the following fields:
{
  "id": "S3Input",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filePath": "s3://testbucket/file.txt"
},
Name
The user-defined name for the input location (a label for your reference only).
Type
The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled
“MySchedule”.
Path
The path to the data associated with the data node. The syntax for a data node is determined by its type. For example, a data node for a database follows a different syntax that is appropriate for a database table.
Next, the output S3DataNode component defines the output destination location for the data. It follows the same format as the input S3DataNode component, except for the name of the component and a different path to indicate the target file.
{
  "id": "S3Output",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filePath": "s3://testbucket/file-copy.txt"
},
Resource
This is a definition of the computational resource that performs the copy operation. In this example, AWS Data Pipeline should automatically create an EC2 instance to perform the copy task and terminate the resource after the task completes. The fields defined here control the creation and function of the Amazon EC2 instance that does the work. The Ec2Resource component is defined by the following fields:
{
  "id": "MyEC2Resource",
  "type": "Ec2Resource",
  "schedule": { "ref": "MySchedule" },
  "actionOnTaskFailure": "terminate",
  "actionOnResourceFailure": "retryAll",
  "maximumRetries": "1",
  "role": "test-role",
  "resourceRole": "test-role",
  "instanceType": "m1.medium",
  "instanceCount": "1",
  "securityGroups": [ "test-group", "default" ],
  "keyPair": "test-pair"
},
Name
The user-defined name for the resource, which is a label for your reference only.
Type
The type of computational resource to perform work; in this case, an Amazon EC2 instance. There are other resource types available, such as an EmrCluster type.
Schedule
The schedule on which to create this computational resource.
actionOnTaskFailure
The action to take on the Amazon EC2 instance if the task fails. In this case, the instance should terminate so that each failed attempt to copy data does not leave behind idle, abandoned Amazon EC2 instances with no work to perform; such instances would require manual termination by an administrator.
actionOnResourceFailure
The action to perform if the resource is not created successfully. In this case, retry the creation of an
Amazon EC2 instance until it is successful.
maximumRetries
The number of times to retry the creation of this computational resource before marking the resource as a failure and stopping any further creation attempts. This setting works in conjunction with the actionOnResourceFailure
field.
Role
The IAM role of the account that accesses resources, such as accessing an Amazon S3 bucket to retrieve data.
resourceRole
The IAM role of the account that creates resources, such as creating and configuring an Amazon EC2 instance on your behalf. Role and ResourceRole can be the same role, but specifying them separately provides greater granularity in your security configuration.
instanceType
The size of the Amazon EC2 instance to create. Ensure that you set the size of EC2 instance that best matches the load of the work that you want to perform with AWS Data Pipeline. In this case, we set an m1.medium EC2 instance. For more information about the different instance types and when to use each one, see the Amazon EC2 Instance Types topic at http://aws.amazon.com/ec2/instance-types/.
instanceCount
The number of Amazon EC2 instances in the computational resource pool to service any pipeline components depending on this resource.
securityGroups
The Amazon EC2 security groups to grant access to this EC2 instance. As the example shows, you can provide more than one group at a time (test-group and default).
keyPair
The name of the SSH public/private key pair to log in to the Amazon EC2 instance. For more information, see Amazon EC2 Key Pairs .
Activity
The last section in the JSON file is the definition of the activity that represents the work to perform. This example uses CopyActivity to copy data from a file in an Amazon S3 bucket to another file. The CopyActivity component is defined by the following fields:
{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "runsOn": { "ref": "MyEC2Resource" },
  "input": { "ref": "S3Input" },
  "output": { "ref": "S3Output" },
  "schedule": { "ref": "MySchedule" }
}
Name
The user-defined name for the activity, which is a label for your reference only.
Type
The type of activity to perform; in this case, CopyActivity.
runsOn
The computational resource that performs the work that this activity defines. In this example, we provide a reference to the Amazon EC2 instance defined previously. Using the runsOn field causes AWS Data Pipeline to create the EC2 instance for you. The runsOn field indicates that the resource exists in the AWS infrastructure, while a workerGroup value indicates that you want to use your own on-premises resources to perform the work.
Schedule
The schedule on which to run this activity.
Input
The location of the data to copy.
Output
The target location for the copied data.
Upload the Pipeline Definition
You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).
To upload your pipeline definition, use the following command.
On Linux/Unix/Mac OS:
./datapipeline --create pipeline_name --put pipeline_file
On Windows:
ruby datapipeline --create pipeline_name --put pipeline_file
Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.
If your pipeline validates successfully, you receive the following message:
Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline
definition pipeline_file.json uploaded.
Note
For more information about any errors returned by the --create command or other commands,
see Troubleshoot AWS Data Pipeline (p. 128)
.
Ensure that your pipeline appears in the pipeline list by using the following command.
On Linux/Unix/Mac OS:
./datapipeline --list-pipelines
On Windows: ruby datapipeline --list-pipelines
The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-AKIAIOSFODNN7EXAMPLE.
Activate the Pipeline
You must activate the pipeline by using the --activate command-line parameter before it begins performing work. Use the following command.
On Linux/Unix/Mac OS:
./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE
On Windows:
ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE
Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.
Verify the Pipeline Status
View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.
On Linux/Unix/Mac OS:
./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE
On Windows:
ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE
Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.
The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.
Note
It is important to note the difference between the Scheduled Start date/time and the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) can be later because of problems or delays with preconditions, dependencies, failures, or retries.
Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity would have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.
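The backfill behavior described above amounts to simple counting: the number of past-due runs is the number of whole periods between the Scheduled Start and the current time. The following is a minimal sketch of that arithmetic, for illustration only; it is not AWS Data Pipeline's scheduler:

```python
from datetime import datetime, timedelta

def backfill_runs(scheduled_start, now, period):
    """Count the past-due runs that would be launched back-to-back."""
    if now <= scheduled_start:
        return 0  # nothing to backfill; the pipeline starts on schedule
    return (now - scheduled_start) // period

# A daily pipeline whose Scheduled Start is five days in the past
runs = backfill_runs(datetime(2012, 11, 20), datetime(2012, 11, 25),
                     timedelta(days=1))
# → 5 runs executed immediately, then the pipeline resumes its daily period
```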
Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status.
Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with its own final status.
Tutorial: Copy Data From a MySQL
Table to Amazon S3
Topics
• Using the AWS Data Pipeline Console (p. 42)
• Using the Command Line Interface (p. 48)
This tutorial walks you through the process of creating a data pipeline to copy data (rows) from a table in a MySQL database to a CSV (comma-separated values) file in an Amazon S3 bucket, and then send an Amazon SNS notification after the copy activity completes successfully. You use the Amazon EC2 instance resource provided by AWS Data Pipeline for this copy activity.
The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each pipeline object.
This tutorial uses the following objects to create a pipeline definition:
Activity
The activity that AWS Data Pipeline must perform for this pipeline. This tutorial uses CopyActivity to copy data from a MySQL table to an Amazon S3 bucket.
Schedule
The start date, time, and the duration for this activity. You can optionally specify the end date and time.
Resource
The resource that AWS Data Pipeline must use to perform this activity. This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to copy data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates the instance after the task finishes.
DataNodes
The input and output nodes for this pipeline. This tutorial uses MySQLDataNode for the source data and S3DataNode for the target data.
Action
The action that AWS Data Pipeline must take when the specified conditions are met. This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the email address you specify after the task finishes successfully.
For information about the additional objects and fields supported by the copy activity, see
The following steps outline how to create a data pipeline to copy data from a MySQL table to an Amazon S3 bucket.
1. Create your pipeline definition
2. Create and configure the pipeline definition objects
3. Validate and save your pipeline definition
4. Verify that your pipeline definition is saved
5. Activate your pipeline
6. Monitor the progress of your pipeline
7. [Optional] Delete your pipeline
Before You Begin ...
Be sure you've completed the following steps.
• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console.
• Set up the AWS Data Pipeline tools and interface you plan to use.
• Create an Amazon S3 bucket as a data source.
For more information, see Create a Bucket in Amazon Simple Storage Service Getting Started Guide.
• Create and launch a MySQL database instance as a data source.
For more information, see Launch a DB Instance in the Amazon Relational Database Service (RDS) Getting Started Guide.
Note
Make a note of the user name and the password you used for creating the MySQL instance.
After you've launched your MySQL database instance, make a note of the instance's endpoint.
You will need all this information in this tutorial.
• Connect to your MySQL database instance, create a table, and then add test data values to the newly created table.
For more information, go to Create a Table in the MySQL documentation.
• Create an Amazon SNS topic for sending email notifications and make a note of the topic's Amazon Resource Name (ARN). For more information, go to Create a Topic in the Amazon Simple Notification Service Getting Started Guide.
• [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your own IAM role policy and trust relationships, follow the instructions in Granting Permissions to Pipelines with IAM (p. 21).
Note
Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.
Using the AWS Data Pipeline Console
Topics
• Create and Configure the Pipeline Definition Objects (p. 42)
• Validate and Save Your Pipeline (p. 45)
• Verify Your Pipeline Definition (p. 45)
• Activate your Pipeline (p. 46)
• Monitor the Progress of Your Pipeline Runs (p. 47)
• [Optional] Delete your Pipeline (p. 48)
The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.
To create your pipeline definition
1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.
2. Click Create Pipeline.
3. On the Create a New Pipeline page:
a. In the Pipeline Name box, enter a name (for example, CopyMySQLData).
b. In Pipeline Description, enter a description.
c. Leave the Select Schedule Type button set to the default type for this tutorial.
Note
The schedule type lets you specify whether the objects in your pipeline definition should be scheduled at the beginning or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
d. Leave the Role boxes set to their default values for this tutorial.
Note
If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.
e. Click Create a new pipeline.
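The schedule-type distinction above can be sketched with simple date arithmetic; the dates here are illustrative only:

```python
from datetime import datetime, timedelta

period = timedelta(days=1)
interval_start = datetime(2012, 11, 25)

# Cron Style Scheduling: the instance is scheduled at the beginning of the interval
cron_style = interval_start
# Time Series Style Scheduling: the instance is scheduled at the end of the interval
time_series_style = interval_start + period
```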
Create and Configure the Pipeline Definition Objects
Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.
1. On the Pipeline: name of your pipeline page, click Add activity.
2. In the Activities pane:
a. Enter the name of the activity (for example, copy-mysql-data).
b. In the Type box, select CopyActivity.
c. In the Input box, select Create new: DataNode.
d. In the Schedule box, select Create new: Schedule.
e. In the Output box, select Create new: DataNode.
f. In the Add an optional field .. box, select RunsOn.
g. In the Runs On box, select Create new: Resource.
h. In the Add an optional field .. box, select On Success.
i. In the On Success box, select Create new: Action.
j. In the left pane, separate the icons by dragging them apart.
You have finished defining your pipeline by specifying the objects that AWS Data Pipeline uses to perform the copy activity.
The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connections between the various objects.
Next, configure the run date and time for your pipeline.
To configure the run date and time for your pipeline
1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.
2. In the Schedules pane:
a. Enter a schedule name for this activity (for example, copy-mysql-data-schedule).
b. In the Type box, select Schedule.
c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.
Note
AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format, in UTC/GMT only.
d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).
e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.
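The required "YYYY-MM-DDTHH:MM:SS" format maps directly onto a standard strftime pattern; a small Python sketch:

```python
from datetime import datetime, timezone

# Build a UTC start time and render it in the format the console expects
start = datetime(2012, 11, 25, 0, 0, 0, tzinfo=timezone.utc)
formatted = start.strftime("%Y-%m-%dT%H:%M:%S")
# → "2012-11-25T00:00:00"
```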
To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.
Next, configure the input and output data nodes for your pipeline.
To configure the input and output data nodes of your pipeline
1. On the Pipeline: name of your pipeline page, in the right pane, click DataNodes.
2. In the DataNodes pane:
a. In the DefaultDataNode1 Name box, enter the name for your input node (for example, MySQLInput). In this tutorial, your input node is the Amazon RDS MySQL instance you just created.
b. In the Type box, select MySQLDataNode.
c. In the Username box, enter the user name you used when you created your MySQL database instance.
d. In the Connection String box, enter the endpoint of your MySQL database instance (for example, mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com).
e. In the *Password box, enter the password you used when you created your MySQL database instance.
f. In the Table box, enter the name of the source MySQL database table (for example, input-table).
g. In the Schedule box, select copy-mysql-data-schedule.
h. In the DefaultDataNode2 Name box, enter the name for your output node (for example, MyS3Output). In this tutorial, your output node is the Amazon S3 data target bucket.
i. In the Type box, select S3DataNode.
j. In the Schedule box, select copy-mysql-data-schedule.
k. In the Add an optional field .. box, select File Path.
l. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your csv file).
Next, configure the resource AWS Data Pipeline must use to perform the copy activity.
To configure the resource
1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.
2. In the Resources pane:
a. In the Name box, enter the name for your resource (for example, CopyDataInstance).
b. In the Type box, select Ec2Resource.
c. In the Schedule box, select copy-mysql-data-schedule.
Next, configure the Amazon SNS notification action that AWS Data Pipeline must perform after the copy activity finishes successfully.
To configure the SNS notification action
1. On the Pipeline: name of your pipeline page, in the right pane, click Others.
2. In the Others pane:
a. In the DefaultAction1 Name box, enter the name for your Amazon SNS notification (for example, CopyDataNotice).
b. In the Type box, select SnsAlarm.
c. In the Message box, enter the message content.
d. Leave the entry in the Role box set to the default value.
e. In the Topic Arn box, enter the ARN of your Amazon SNS topic.
f. In the Subject box, enter the subject line for your notification.
You have now completed all the steps required to create your pipeline definition. Next, validate and save your pipeline.
Validate and Save Your Pipeline
You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.
If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete but you get a validation error message, you must fix the errors in the pipeline definition before activating your pipeline.
To validate and save your pipeline
1. On the Pipeline: name of your pipeline page, click Save Pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message.
3. If you get an error message, click Close and then, in the right pane, click Errors.
4. The Errors pane lists the objects that failed validation. Click the plus (+) sign next to the object names and look for an error message in red.
5. When you see an error message, go to the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.
6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.
7. Repeat the process until your pipeline validates successfully.
Next, verify that your pipeline definition was saved.
Verify Your Pipeline Definition
It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.
To verify your pipeline definition
1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.
2. On the List Pipelines page, check whether your newly created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition.
The Status column in the row listing your pipeline should show PENDING.
3. Click the triangle icon next to your pipeline. The Pipeline summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.
4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.
5. Click Close.
Next, activate your pipeline.
Activate your Pipeline
You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.
To activate your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2. On the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens confirming the activation.
3. Click Close.
Next, verify that your pipeline is running.
Monitor the Progress of Your Pipeline Runs
To monitor the progress of your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.
2. The Instance details: name of your pipeline page lists the status of each instance.
Note
If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date, or click the Start (in UTC) date box and change it to an earlier date. Then click Update.
3. If the Status column of all the objects in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity. You should receive an email about the successful completion of this task at the account you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify that the data was copied.
4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. To troubleshoot failed or incomplete instance runs:
a. Click the triangle next to an instance. The Instance summary panel opens to show the details of the selected instance.
b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure; for example, @failureReason = Resource not healthy terminated.
c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
d. In the Instance summary pane, click View attempt fields to see details of fields associated with the selected attempt.
5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.
You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.
For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting, see Pipeline Problems and Solutions (p. 131).
Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring AWS Data Pipeline usage charges, delete your pipeline.
[Optional] Delete your Pipeline
Deleting your pipeline deletes the pipeline definition, including all associated objects. You stop incurring charges as soon as your pipeline is deleted.
To delete your pipeline
1. On the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.
Using the Command Line Interface
Topics
• Define a Pipeline in JSON Format (p. 49)
• Upload the Pipeline Definition (p. 54)
• Verify the Pipeline Status (p. 55)
The following topics explain how to use the AWS Data Pipeline CLI to create a pipeline to copy data from a MySQL table to a file in an Amazon S3 bucket. In this example, we perform the following steps:
• Create a pipeline definition using the CLI in JSON format
• Create the necessary IAM roles and define a policy and trust relationships
• Upload the pipeline definition using the AWS Data Pipeline CLI tools
• Monitor the progress of the pipeline
To complete the steps in this example, you need a MySQL database instance with a table that contains data. To create a MySQL database using Amazon RDS, see Get Started with Amazon RDS (http://docs.aws.amazon.com/AmazonRDS/latest/GettingStartedGuide/Welcome.html). After you have an Amazon RDS instance, see the MySQL documentation to create a table (http://dev.mysql.com/doc/refman/5.5/en//creating-tables.html).
Define a Pipeline in JSON Format
This example scenario shows how to use JSON pipeline definitions and the AWS Data Pipeline CLI to copy data (rows) from a table in a MySQL database to a CSV (comma-separated values) file in an Amazon
S3 bucket at a specified time interval. This is the full pipeline definition JSON file followed by an explanation for each of its sections.
Note
We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.
{
  "objects": [
    {
      "id": "MySchedule",
      "type": "Schedule",
      "startDateTime": "2012-11-25T00:00:00",
      "endDateTime": "2012-11-26T00:00:00",
      "period": "1 day"
    },
    {
      "id": "MySQLInput",
      "type": "MySqlDataNode",
      "schedule": { "ref": "MySchedule" },
      "table": "table_name",
      "username": "user_name",
      "*password": "my_password",
      "connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
      "selectQuery": "select * from #{table}"
    },
    {
      "id": "S3Output",
      "type": "S3DataNode",
      "filePath": "s3://testbucket/output/output_file.csv",
      "schedule": { "ref": "MySchedule" }
    },
    {
      "id": "MyEC2Resource",
      "type": "Ec2Resource",
      "schedule": { "ref": "MySchedule" },
      "actionOnTaskFailure": "terminate",
      "actionOnResourceFailure": "retryAll",
      "maximumRetries": "1",
      "role": "test-role",
      "resourceRole": "test-role",
      "instanceType": "m1.medium",
      "instanceCount": "1",
      "securityGroups": [ "test-group", "default" ],
      "keyPair": "test-pair"
    },
    {
      "id": "MyCopyActivity",
      "type": "CopyActivity",
      "runsOn": { "ref": "MyEC2Resource" },
      "input": { "ref": "MySQLInput" },
      "output": { "ref": "S3Output" },
      "schedule": { "ref": "MySchedule" }
    }
  ]
}
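Before uploading a definition file like the one above, you can catch basic syntax mistakes locally. This sketch only checks that the text is well-formed JSON and that each component has an id and a type; it is not the full validation that AWS Data Pipeline performs when you upload the definition:

```python
import json

def check_definition(text):
    """Minimal local sanity check for a pipeline definition file."""
    definition = json.loads(text)  # raises ValueError on malformed JSON
    for obj in definition["objects"]:
        if "id" not in obj or "type" not in obj:
            raise ValueError("every pipeline component needs an id and a type")
    return definition

# Usage: check_definition(open("pipeline.json").read())  # hypothetical file name
```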
Schedule
The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule and you may have more than one.
The Schedule component is defined by the following fields:
{
"id": "MySchedule",
"type": "Schedule",
"startDateTime": "2012-11-25T00:00:00",
"endDateTime": "2012-11-26T00:00:00",
"period": "1 day"
},
Note
In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.
Name
The user-defined name for the pipeline schedule, which is a label for your reference only.
Type
The pipeline component type, which is Schedule.
startDateTime
The date/time (in UTC format) that you want the task to begin.
endDateTime
The date/time (in UTC format) that you want the task to stop.
period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline copy operation runs only one time.
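The divisibility requirement can be checked with simple date arithmetic; using the start and end dates from the example definition:

```python
from datetime import datetime, timedelta

start = datetime(2012, 11, 25)   # startDateTime from the example
end = datetime(2012, 11, 26)     # endDateTime from the example
period = timedelta(days=1)       # "1 day"

interval = end - start
# The period must evenly divide the time between startDateTime and endDateTime
assert interval % period == timedelta(0)
num_runs = interval // period
# → 1: the copy operation runs exactly once
```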
MySQL Data Node
Next, the input MySqlDataNode pipeline component defines a location for the input data; in this case, an
Amazon RDS instance. The input MySqlDataNode component is defined by the following fields:
{
  "id": "MySQLInput",
  "type": "MySqlDataNode",
  "schedule": { "ref": "MySchedule" },
  "table": "table_name",
  "username": "user_name",
  "*password": "my_password",
  "connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
  "selectQuery": "select * from #{table}"
},
Name
The user-defined name for the MySQL database, which is a label for your reference only.
Type
The MySqlDataNode type, which is an Amazon RDS instance using MySQL in this example.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled
“MySchedule”.
Table
The name of the database table that contains the data to copy. Replace table_name with the name of your database table.
Username
The user name of the database account that has sufficient permission to retrieve data from the database table. Replace user_name with the name of your user account.
Password
The password for the database account with the asterisk prefix to indicate that AWS Data Pipeline must encrypt the password value. Replace my_password with the correct password for your user account.
connectionString
The JDBC connection string for the CopyActivity object to connect to the database.
selectQuery
A valid SQL SELECT query that specifies which data to copy from the database table. Note that #{table} is an expression that reuses the table name provided by the "table" field in the preceding lines of the JSON file.
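The substitution idea behind #{table} can be sketched as follows; this is an illustration of the templating concept only, not AWS Data Pipeline's actual expression evaluator:

```python
import re

def expand(expression, fields):
    """Replace #{name} references with the values of the named fields."""
    return re.sub(r"#\{(\w+)\}", lambda m: fields[m.group(1)], expression)

fields = {"table": "table_name"}
query = expand("select * from #{table}", fields)
# → "select * from table_name"
```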
Amazon S3 Data Node
Next, the S3Output pipeline component defines a location for the output file; in this case a CSV file in an
S3 bucket location. The output S3DataNode component is defined by the following fields:
{
"id": "S3Output",
"type": "S3DataNode",
"filePath": "s3://testbucket/output/output_file.csv",
"schedule":{"ref":"MySchedule"}
},
Name
The user-defined name for the output location (a label for your reference only).
Type
The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.
Path
The path to the data associated with the data node. The syntax for a data node is determined by its type. For example, a data node for a database follows a different syntax that is appropriate for a database table.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled
“MySchedule”.
Resource
This is a definition of the computational resource that performs the copy operation. In this example, AWS
Data Pipeline should automatically create an EC2 instance to perform the copy task and terminate the resource after the task completes. The fields defined here control the creation and function of the Amazon
EC2 instance that does the work. The EC2Resource is defined by the following fields:
{
"id": "MyEC2Resource",
"type": "Ec2Resource",
"schedule": {"ref": "MySchedule"},
"actionOnTaskFailure": "terminate",
"actionOnResourceFailure": "retryAll",
"maximumRetries": "1",
"role" : "test-role",
"resourceRole": "test-role",
"instanceType": "m1.medium",
"instanceCount": "1",
"securityGroups": [ "test-group", "default" ],
"keyPair": "test-pair"
},
Name
The user-defined name for the resource, which is a label for your reference only.
Type
The type of computational resource to perform work; in this case, an Amazon EC2 instance. There are other resource types available, such as an EmrCluster type.
Schedule
The schedule on which to create this computational resource.
actionOnTaskFailure
The action to take on the Amazon EC2 instance if the task fails. In this case, the instance should terminate so that each failed attempt to copy data does not leave behind idle, abandoned Amazon
EC2 instances with no work to perform. These instances require manual termination by an administrator.
actionOnResourceFailure
The action to perform if the resource is not created successfully. In this case, retry the creation of an
Amazon EC2 instance until it is successful.
maximumRetries
The number of times to retry the creation of this computational resource before marking the resource as a failure and stopping any further creation attempts. This setting works in conjunction with the actionOnResourceFailure
field.
API Version 2012-10-29
52
AWS Data Pipeline Developer Guide
Activity
Role
The IAM role of the account that accesses resources, such as accessing an Amazon S3 bucket to retrieve data.
resourceRole
The IAM role of the account that creates resources, such as creating and configuring an Amazon EC2 instance on your behalf. Role and ResourceRole can be the same role, but specifying them separately provides greater granularity in your security configuration.
instanceType
The size of the Amazon EC2 instance to create. Ensure that you set the appropriate size of EC2 instance that best matches the load of the work that you want to perform with AWS Data Pipeline.
In this case, we set an m1.medium EC2 instance. For more information about the different instance types and when to use each one, see the Amazon EC2 Instance Types topic at http://aws.amazon.com/ec2/instance-types/.
instanceCount
The number of Amazon EC2 instances in the computational resource pool to service any pipeline components depending on this resource.
securityGroups
The Amazon EC2 security groups to grant access to this EC2 instance. As the example shows, you can provide more than one group at a time (test-group and default).
keyPair
The name of the SSH public/private key pair to log in to the Amazon EC2 instance. For more information, see Amazon EC2 Key Pairs.
Activity
The last section in the JSON file is the definition of the activity that represents the work to perform. In this case, we use a CopyActivity component to copy data from a MySQL table to a file in an Amazon S3 bucket.
The CopyActivity component is defined by the following fields:
{
"id": "MyCopyActivity",
"type": "CopyActivity",
"runsOn":{"ref":"MyEC2Resource"},
"input": {"ref": "MySQLInput"},
"output": {"ref": "S3Output"},
"schedule":{"ref":"MySchedule"}
}
Name
The user-defined name for the activity, which is a label for your reference only.
Type
The type of activity to perform, such as CopyActivity.
runsOn
The computational resource that performs the work that this activity defines. In this example, we provide a reference to the EC2 instance defined previously. Using the runsOn field causes AWS
Data Pipeline to create the EC2 instance for you. The runsOn field indicates that the resource exists in the AWS infrastructure, while the workerGroup value indicates that you want to use your own on-premises resources to perform the work.
Schedule
The schedule on which to run this activity.
Input
The location of the data to copy.
Output
The location of the target data.
Upload the Pipeline Definition
You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).
To upload your pipeline definition, use the following command.
On Linux/Unix/Mac OS:
./datapipeline --create pipeline_name --put pipeline_file
On Windows:
ruby datapipeline --create pipeline_name --put pipeline_file
Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.
If your pipeline validates successfully, you receive the following message:
Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline
definition pipeline_file.json uploaded.
Note
For more information about any errors returned by the --create command or other commands,
see Troubleshoot AWS Data Pipeline (p. 128).
Ensure that your pipeline appears in the pipeline list by using the following command.
On Linux/Unix/Mac OS:
./datapipeline --list-pipelines
On Windows: ruby datapipeline --list-pipelines
The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier with the format df-AKIAIOSFODNN7EXAMPLE.
Activate the Pipeline
You must activate the pipeline by using the --activate command-line parameter before it begins performing work. Use the following command.
On Linux/Unix/Mac OS:
./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE
On Windows: ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE
Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.
Verify the Pipeline Status
View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.
On Linux/Unix/Mac OS:
./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE
On Windows: ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE
Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.
The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.
Note
It is important to note the difference between the Scheduled Start date/time and the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.
Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run had it started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.
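The number of back-to-back backfill runs is roughly the elapsed time since the Scheduled Start divided by the period. A quick sketch (the timestamps are illustrative and GNU date is assumed):

```shell
# Estimate how many "past due" runs would be backfilled for an hourly pipeline
# whose Scheduled Start is 5.5 hours in the past. Timestamps are illustrative.
start=$(date -u -d '2012-11-19T07:00:00' +%s)
now=$(date -u -d '2012-11-19T12:30:00' +%s)   # pretend this is the current time
period=3600                                   # 1-hour period, in seconds
echo $(( (now - start) / period ))            # → 5 past-due runs
```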
Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status.
Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with its own final status.
Tutorial: Launch an Amazon EMR Job Flow
If you regularly run an Amazon EMR job flow, such as to analyze web logs or perform analysis of scientific data, you can use AWS Data Pipeline to manage your Amazon EMR job flows. With AWS Data Pipeline, you can specify preconditions that must be met before the job flow is launched (for example, ensuring that today's data has been uploaded to Amazon S3), a schedule for repeatedly running the job flow, and the cluster configuration to use for the job flow. The following tutorial walks you through launching a simple job flow as an example. This can be used as a model for a simple Amazon EMR-based pipeline, or as part of a more involved pipeline.
This tutorial walks you through the process of creating a data pipeline for a simple Amazon EMR job flow to run a pre-existing Hadoop Streaming job provided by Amazon EMR, and then send an Amazon SNS notification after the task completes successfully. You will use the Amazon EMR cluster resource provided by AWS Data Pipeline for this task. This sample application is called WordCount, and can also be run manually from the Amazon EMR console.
The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each pipeline object.
This tutorial uses the following objects to create a pipeline definition:
Activity
The activity that AWS Data Pipeline must perform for this pipeline. This tutorial uses EmrActivity to run a pre-existing Hadoop Streaming job provided by Amazon EMR.
Schedule
Start date, time, and the duration for this activity. You can optionally specify the end date and time.
Resource
The resource that AWS Data Pipeline must use to perform this activity. This tutorial uses EmrCluster, a set of Amazon EC2 instances provided by AWS Data Pipeline, to run the job flow. AWS Data Pipeline automatically launches the Amazon EMR cluster and then terminates the cluster after the task finishes.
Action
The action that AWS Data Pipeline must take when the specified conditions are met.
This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the email address you specify, after the task finishes successfully.
The following steps outline how to create a data pipeline to launch an Amazon EMR job flow.
1. Create your pipeline definition
2. Create and configure the pipeline definition objects
3. Validate and save your pipeline definition
4. Verify that your pipeline definition is saved
5. Activate your pipeline
6. Monitor the progress of your pipeline
7. [Optional] Delete your pipeline
Before You Begin ...
Be sure you've completed the following steps.
• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console.
• Set up the AWS Data Pipeline tools and interface you plan on using.
• Create an Amazon SNS topic for sending email notification and make a note of the topic Amazon Resource Name (ARN). For more information, see Create a Topic in the Amazon Simple Notification Service Getting Started Guide.
Note
Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.
Using the AWS Data Pipeline Console
Topics
• Create and Configure the Pipeline Definition Objects (p. 58)
• Validate and Save Your Pipeline (p. 60)
• Verify Your Pipeline Definition (p. 60)
• Activate your Pipeline (p. 61)
• Monitor the Progress of Your Pipeline Runs (p. 61)
• [Optional] Delete your Pipeline (p. 63)
The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.
To create your pipeline definition
1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.
2. Click Create Pipeline.
3. On the Create a New Pipeline page:
a. In the Pipeline Name box, enter a name (for example, MyEmrJob).
b. In Pipeline Description, enter a description.
c. Leave the Select Schedule Type button set to the default type for this tutorial.
Note
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
d. Leave the Role boxes set to their default values for this tutorial.
e. Click Create a new pipeline.
Create and Configure the Pipeline Definition Objects
Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.
1. On the Pipeline: name of your pipeline page, select Add activity.
2. In the Activities pane:
a. Enter the name of the activity (for example, my-emr-job).
b. In the Type box, select EmrActivity.
c. In the Step box, enter:
/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,\
s3n://elasticmapreduce/samples/wordcount/input,-output,\
s3://myawsbucket/wordcount/output/#{@scheduledStartTime},\
-mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,-reducer,aggregate
d. In the Schedule box, select Create new: Schedule.
e. In the Add an optional field .. box, select Runs On.
f. In the Runs On box, select Create new: EmrCluster.
g. In the Add an optional field .. box, select On Success.
h. In the On Success box, select Create new: Action.
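The Step value in step c packs the hadoop-streaming arguments into a single comma-separated string, with commas taking the place of spaces. Splitting on commas recovers the equivalent command-line arguments, as this sketch shows:

```shell
# The Step string from the tutorial; splitting on commas yields the individual
# arguments that Amazon EMR passes to hadoop-streaming.
step='/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://myawsbucket/wordcount/output/#{@scheduledStartTime},-mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,-reducer,aggregate'
echo "$step" | tr ',' '\n'
```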
You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline will use to launch an Amazon EMR job flow. The Pipeline: name of your pipeline pane shows a single activity icon for this pipeline.
Next, configure the run date and time for your pipeline.
To configure the run date and time for your pipeline
1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.
2. In the Schedules pane:
a. Enter a schedule name for this activity (for example, my-emr-job-schedule).
b. In the Type box, select Schedule.
c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.
Note
AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only.
d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).
e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.
To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline will then start launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.
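Since the console accepts dates only in the "YYYY-MM-DDTHH:MM:SS" UTC shape, a shell one-liner can print a correctly formatted value for "one day in the past" (a sketch; GNU date is assumed):

```shell
# Print the current UTC time minus one day, in the YYYY-MM-DDTHH:MM:SS shape
# that the Start Date Time box expects.
date -u -d 'yesterday' +%Y-%m-%dT%H:%M:%S
```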
Next, configure the resource AWS Data Pipeline must use to perform the Amazon EMR job.
To configure the resource
1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.
2. In the Resources pane:
a. In the Name box, enter the name for your EMR cluster (for example, MyEmrCluster).
b. Leave the Type box set to the default value.
c. In the Schedule box, select my-emr-job-schedule.
Next, configure the SNS notification action AWS Data Pipeline must perform after the Amazon EMR job finishes successfully.
To configure the SNS notification action
1. On the Pipeline: name of your pipeline page, in the right pane, click Others.
2. In the Others pane:
a. In the DefaultAction1 Name box, enter the name for your Amazon SNS notification (for example, EmrJobNotice).
b. In the Type box, select SnsAlarm.
c. In the Message box, enter the message content.
d. Leave the entry in the Role box set to default.
e. In the Subject box, enter the subject line for your notification.
f. In the Topic Arn box, enter the ARN of your Amazon SNS topic.
You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.
Validate and Save Your Pipeline
You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.
If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you are getting the validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline.
To validate and save your pipeline
1. On the Pipeline: name of your pipeline page, click Save Pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message.
3. If you get an error message, click Close and then, in the right pane, click Errors.
4. The Errors pane lists the objects failing validation. Click the plus (+) sign next to the object names and look for an error message in red.
5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the Schedules object, click the Schedules pane to fix the error.
6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.
7. Repeat the process until your pipeline is validated.
Next, verify that your pipeline definition has been saved.
Verify Your Pipeline Definition
It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.
To verify your pipeline definition
1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.
2. On the List Pipelines page, check whether your newly created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.
3. Click the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.
4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.
5. Click Close.
Next, activate your pipeline.
Activate your Pipeline
You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.
To activate your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2. On the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens, confirming the activation.
3. Click Close.
Next, verify that your pipeline is running.
Monitor the Progress of Your Pipeline Runs
To monitor the progress of your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.
Note
You can also view the job flows in the Amazon EMR console. The job flows spawned by AWS Data Pipeline on your behalf are displayed in the Amazon EMR console and billed to your AWS account in the same manner as job flows that you launch manually. You can tell which job flows were spawned by AWS Data Pipeline by looking at the name of the job flow. Those spawned by AWS Data Pipeline have a name formatted as follows: job-flow-identifier_@emr-cluster-name_launch-time. For more information, see View Job Flow Details in the Amazon Elastic MapReduce Developer Guide.
2. The Instance details: name of your pipeline page lists the status of each instance in your pipeline definition.
Note
If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.
3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the activity. You should receive an email about the successful completion of this task at the account you specified for receiving your Amazon SNS notification.
4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. To troubleshoot failed or incomplete runs:
a. Click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.
b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure; for example, @failureReason = Resource not healthy terminated.
c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
d. In the Instance summary pane, click View attempt fields to see details of fields associated with the selected attempt.
5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.
You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.
For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about common problems and their solutions, see Pipeline Problems and Solutions (p. 131).
Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring AWS Data Pipeline usage charges, delete your pipeline.
[Optional] Delete your Pipeline
Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.
To delete your pipeline
1. On the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.
Using the Command Line Interface
If you regularly run an Amazon EMR job flow to analyze web logs or perform analysis of scientific data, you can use AWS Data Pipeline to manage your Amazon EMR job flows. With AWS Data Pipeline, you can specify preconditions that must be met before the job flow is launched (for example, ensuring that today's data has been uploaded to Amazon S3). The following tutorial walks you through launching a job flow that can serve as a model for a simple Amazon EMR-based pipeline, or as part of a more involved pipeline.
The following code is the pipeline definition file for a simple Amazon EMR job flow that runs a pre-existing Hadoop streaming job provided by Amazon EMR. This sample application is called WordCount, and can also be run manually from the Amazon EMR console. In the following code, you should replace the Amazon S3 bucket location with the name of an Amazon S3 bucket that you own. You should also replace the start and end dates. To get job flows launching immediately, set startDateTime to a date one day in the past and endDateTime to one day in the future. AWS Data Pipeline then starts launching the "past due" job flows immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.
{
"objects": [
{
"id": "Hourly",
"type": "Schedule",
"startDateTime": "2012-11-19T07:48:00",
"endDateTime": "2012-11-21T07:48:00",
"period": "1 hours"
},
{
"id": "MyCluster",
"type": "EmrCluster",
"masterInstanceType": "m1.small",
"schedule": {
"ref": "Hourly"
}
},
{
"id": "MyEmrActivity",
"type": "EmrActivity",
"schedule": {
"ref": "Hourly"
},
"runsOn": {
"ref": "MyCluster"
},
"step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://myawsbucket/wordcount/output/#{@scheduledStartTime},-mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,-reducer,aggregate"
}
]
}
This pipeline has three objects:
• Hourly, which represents the schedule of the work. You can set a schedule as one of the fields on an object. When you do, the object runs according to that schedule, or in this case, hourly.
• MyCluster, which represents the set of Amazon EC2 instances used to run the job flow. You can specify the size and number of EC2 instances to run as the cluster. If you do not specify the number of instances, the job flow launches with two, a master node and a task node. You can add additional configurations to the cluster, such as bootstrap actions to load additional software onto the Amazon EMR-provided AMI.
• MyEmrActivity, which represents the computation to process with the job flow. Amazon EMR supports several types of job flows, including streaming, Cascading, and Scripted Hive. The runsOn field refers back to MyCluster, using that as the specification for the underpinnings of the job flow.
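To keep startDateTime one day in the past and endDateTime one day in the future as suggested above, you could stamp fresh values into the definition file before each upload. A sketch (the file here is a minimal stand-in for the full definition; GNU date and sed are assumed):

```shell
# Minimal stand-in for the definition file, with the dates used in the tutorial.
cat > MyEmrPipelineDefinition.df <<'EOF'
{"objects": [{"id": "Hourly", "type": "Schedule",
  "startDateTime": "2012-11-19T07:48:00",
  "endDateTime": "2012-11-21T07:48:00",
  "period": "1 hours"}]}
EOF
# Replace them with "one day ago" and "one day from now", in UTC.
start=$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%S)
end=$(date -u -d '1 day' +%Y-%m-%dT%H:%M:%S)
sed -e "s/2012-11-19T07:48:00/$start/" -e "s/2012-11-21T07:48:00/$end/" \
    MyEmrPipelineDefinition.df > MyEmrPipelineDefinition.stamped.df
grep DateTime MyEmrPipelineDefinition.stamped.df
```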
To create a pipeline that launches an Amazon EMR job flow
1. Open a terminal window in the directory where you've installed the AWS Data Pipeline CLI. For more information about how to install the CLI, see Install the Command Line Interface (p. 15).
2. Create a new pipeline.
./datapipeline --credentials ./credentials.json --create MyEmrPipeline
When the pipeline is created, AWS Data Pipeline returns a success message and an identifier for the pipeline.
Pipeline with name 'MyEmrPipeline' and id 'df-07634391Y0GRTUD0SP0' created.
3. Add the JSON definition to the pipeline. This gives AWS Data Pipeline the business logic it needs to manage your data.
./datapipeline --credentials ./credentials.json --put MyEmrPipelineDefinition.df --id df-07634391Y0GRTUD0SP0
The following message is an example of a successfully uploaded pipeline.
State of pipeline id 'df-07634391Y0GRTUD0SP0' is currently 'PENDING'
4. Activate the pipeline.
./datapipeline --credentials ./credentials.json --activate --id df-07634391Y0GRTUD0SP0
If the pipeline definition is valid, this command activates the pipeline using the business logic uploaded by the previous --put command. If the pipeline is invalid, AWS Data Pipeline returns an error code indicating what the problems are.
5. Wait until the pipeline has had time to start running, then verify the pipeline's operation.
./datapipeline --credentials ./credentials.json --list-runs --id df-07634391Y0GRTUD0SP0
This returns information about the runs initiated by the pipeline, such as the following.
State of pipeline id 'df-07634391Y0GRTUD0SP0' is currently 'SCHEDULED'
The --list-runs command is fetching the last 4 days of pipeline runs.
If this takes too long, use --help for how to specify a different interval with --start-interval or --schedule-interval.
Name                                   Scheduled Start        Status
ID                                     Started                Ended
---------------------------------------------------------------------------
1.  MyCluster                          2012-11-19T07:48:00    FINISHED
    @MyCluster_2012-11-19T07:48:00     2012-11-20T22:29:33    2012-11-20T22:40:46

2.  MyEmrActivity                      2012-11-19T07:48:00    FINISHED
    @MyEmrActivity_2012-11-19T07:48:00 2012-11-20T22:29:31    2012-11-20T22:38:43

3.  MyCluster                          2012-11-19T08:03:00    RUNNING
    @MyCluster_2012-11-19T08:03:00     2012-11-20T22:34:32

4.  MyEmrActivity                      2012-11-19T08:03:00    RUNNING
    @MyEmrActivity_2012-11-19T08:03:00 2012-11-20T22:34:31

5.  MyCluster                          2012-11-19T08:18:00    CREATING
    @MyCluster_2012-11-19T08:18:00     2012-11-20T22:39:31

6.  MyEmrActivity                      2012-11-19T08:18:00    WAITING_FOR_RUNNER
    @MyEmrActivity_2012-11-19T08:18:00 2012-11-20T22:39:30

All times are listed in UTC and all command line input is treated as UTC.
Total of 6 pipeline runs shown from pipeline named 'MyEmrPipeline' where
--start-interval 2012-11-16T22:41:32,2012-11-20T22:41:32
You can view job flows launched by AWS Data Pipeline in the Amazon EMR console. The job flows spawned by AWS Data Pipeline on your behalf are displayed in the Amazon EMR console and billed to your AWS Account in the same manner as job flows that you launch manually.
To check the progress of job flows launched by AWS Data Pipeline
1. Look at the name of the job flow to tell which job flows were spawned by AWS Data Pipeline. Those spawned by AWS Data Pipeline have a name formatted as follows: <job-flow-identifier>_@<emr-cluster-name>_<launch-time>.
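Because the naming convention is fixed, its components can be recovered with shell parameter expansion. A sketch with a made-up identifier:

```shell
# A job flow name in the format AWS Data Pipeline uses; the identifier is made up.
name='df-0123456EXAMPLE_@MyCluster_2012-11-20T22:27:32'
echo "${name%%_@*}"   # job-flow identifier: df-0123456EXAMPLE
echo "${name##*_}"    # launch time: 2012-11-20T22:27:32
```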
2. Click the Bootstrap Actions tab to display the bootstrap action that AWS Data Pipeline uses to install the AWS Data Pipeline Task Agent on the Amazon EMR clusters that it launches.
3. After one of the runs is complete, navigate to the Amazon S3 console and check that the time-stamped output folder exists and contains the expected results of the job flow.
Tutorial: Import/Export Data in Amazon DynamoDB With Amazon EMR and Hive
This is the first of a two-part tutorial that demonstrates how to bring together multiple AWS features to solve real-world problems in a scalable way through a common scenario: moving schema-less data in and out of Amazon DynamoDB using Amazon EMR and Hive. Complete part one before you move on to part two. This tutorial involves the following concepts and procedures:
• Using the AWS Data Pipeline console and command-line interface (CLI) to create and configure pipelines
• Creating and configuring Amazon DynamoDB tables
• Creating and allocating work to Amazon EMR clusters
• Querying and processing data with Hive scripts
• Storing and accessing data using Amazon S3
Part One: Import Data into Amazon DynamoDB
Topics
• Create an Amazon SNS Topic (p. 73)
• Create an Amazon S3 Bucket (p. 74)
• Using the AWS Data Pipeline Console (p. 74)
• Using the Command Line Interface (p. 81)
The first part of this tutorial explains how to define an AWS Data Pipeline pipeline to retrieve data from a tab-delimited file in Amazon S3 to populate an Amazon DynamoDB table, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work. The first part of the tutorial involves the following steps:
1. Create an Amazon DynamoDB table to store the data
2. Create and configure the pipeline definition objects
3. Upload your pipeline definition
4. Verify your results
Before You Begin...
Be sure you've completed the following steps.
• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console.
• Set up the AWS Data Pipeline tools and interface you plan on using.
• Create an Amazon S3 bucket as a data source. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.
• Create an Amazon DynamoDB table to store data as defined by the following procedure.
Be aware of the following:
• Imports may overwrite data in your Amazon DynamoDB table. When you import data from Amazon S3, the import may overwrite items in your Amazon DynamoDB table. Make sure that you are importing the right data into the right table. Be careful not to accidentally set up a recurring import pipeline that will import the same data multiple times.
• Exports may overwrite data in your Amazon S3 bucket. When you export data to Amazon S3, you may overwrite previous exports if you write to the same bucket path. The default behavior of the Export DynamoDB to S3 template will append the job's scheduled time to the Amazon S3 bucket path, which helps you avoid this problem.
• Import and export jobs will consume some of your Amazon DynamoDB table's provisioned throughput capacity. This section explains how to schedule an import or export job using Amazon EMR. The Amazon EMR cluster will consume some read capacity during exports or write capacity during imports. You can control the percentage of the provisioned capacity that the import/export jobs consume with the settings MyImportJob.myDynamoDBWriteThroughputRatio and MyExportJob.myDynamoDBReadThroughputRatio. Be aware that these settings determine how much capacity to consume at the beginning of the import/export process and will not adapt in real time if you change your table's provisioned capacity in the middle of the process.
• Be aware of the costs. AWS Data Pipeline manages the import/export process for you, but you still pay for the underlying AWS services that are being used. The import and export pipelines will create Amazon EMR clusters to read and write data, and there are per-instance charges for each node in the cluster. You can read more about the details in Amazon EMR Pricing. The default cluster configuration is one m1.small instance master node and one m1.xlarge instance task node, though you can change this configuration in the pipeline definition. There are also charges for AWS Data Pipeline. For more information, see AWS Data Pipeline Pricing and Amazon S3 Pricing.
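As a back-of-envelope check on the throughput point above: the capacity an import consumes is roughly the table's provisioned write units multiplied by the ratio setting. The values in this sketch are hypothetical:

```shell
# Estimate write capacity consumed by an import, given provisioned write units
# and a hypothetical myDynamoDBWriteThroughputRatio of 0.25.
provisioned_writes=5
ratio=0.25
awk -v p="$provisioned_writes" -v r="$ratio" \
    'BEGIN { printf "%.2f write units consumed\n", p * r }'
```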
Create an Amazon DynamoDB Table
This section explains how to create an Amazon DynamoDB table that is a prerequisite for this tutorial.
For more information, see Working with Tables in Amazon DynamoDB in the Amazon DynamoDB Developer Guide.
Note
If you already have an Amazon DynamoDB table, you can skip this procedure.
To create an Amazon DynamoDB table
1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.
2. Click Create Table.
3. On the Create Table / Primary Key page, enter a name (for example, MyTable) in the Table Name box.
Note
Your table name must be unique.
4. In the Primary Key section, for the Primary Key Type radio button, select Hash.
5. In the Hash Attribute Name field, select Number and enter Id in the text box.
6. Click Continue.
7. On the Create Table / Provisioned Throughput Capacity page, in the Read Capacity Units box, enter 5.
8. In the Write Capacity Units box, enter 5.
Note
In this example, we use read and write capacity unit values of five because the sample input data is small. You may need a larger value depending on the size of your actual input data set. For more information, see Provisioned Throughput in Amazon DynamoDB in the Amazon DynamoDB Developer Guide.
9. Click Continue.
10. On the Create Table / Throughput Alarms page, in the Send notification to box, enter your email address.
Create an Amazon SNS Topic
This section explains how to create an Amazon SNS topic and subscribe to receive notifications from AWS Data Pipeline regarding the status of your pipeline components. For more information, see Create a Topic in the Amazon SNS Getting Started Guide.
Note
If you already have an Amazon SNS topic ARN to which you have subscribed, you can skip this procedure.
To create an Amazon SNS topic
1. Sign in to the AWS Management Console and open the Amazon SNS console.
2. Click Create New Topic.
3. In the Topic Name field, type your topic name, such as my-example-topic, and select Create Topic.
4. Note the value from the Topic ARN field, which should be similar in format to this example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.
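The topic ARN is colon-delimited, so its components (service, region, account, topic name) can be pulled apart in the shell. A sketch using the example ARN from step 4:

```shell
# Split the example topic ARN into its components.
arn='arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic'
IFS=':' read -r _ partition service region account topic <<< "$arn"
echo "service=$service region=$region topic=$topic"
```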
To create an Amazon SNS subscription
1. Sign in to the AWS Management Console and open the Amazon SNS console.
2. In the navigation pane, select your Amazon SNS topic and click Create New Subscription.
3. In the Protocol field, choose Email.
4. In the Endpoint field, type your email address and select Subscribe.
Note
You must accept the subscription confirmation email to begin receiving Amazon SNS notifications at the email address you specify.
Create an Amazon S3 Bucket
This section explains how to create an Amazon S3 bucket as a storage location for the input and output files related to this tutorial. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.
Note
If you already have an Amazon S3 bucket configured with write permissions, you can skip this procedure.
To create an Amazon S3 bucket
1. Sign in to the AWS Management Console and open the Amazon S3 console.
2. Click Create Bucket.
3. In the Bucket Name field, type your bucket name, such as my-example-bucket, and select Create.
4. In the Buckets pane, select your new bucket and select Permissions.
5. Ensure that all user accounts that you want to access these files appear in the Grantee list.
Using the AWS Data Pipeline Console
Topics
• Start Import from the Amazon DynamoDB Console (p. 74)
• Create the Pipeline Definition using the AWS Data Pipeline Console (p. 75)
• Create and Configure the Pipeline from a Template (p. 76)
• Complete the Data Nodes (p. 76)
• Complete the Resources (p. 77)
• Complete the Activity (p. 78)
• Complete the Notifications (p. 78)
• Validate and Save Your Pipeline (p. 78)
• Verify your Pipeline Definition (p. 79)
• Activate your Pipeline (p. 79)
• Monitor the Progress of Your Pipeline Runs (p. 80)
• [Optional] Delete your Pipeline (p. 81)
The following topics explain how to define an AWS Data Pipeline pipeline to retrieve data from a tab-delimited file using the AWS Data Pipeline console.
Start Import from the Amazon DynamoDB Console
You can begin the Amazon DynamoDB import operation from within the Amazon DynamoDB console.
To start the data import
1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.
2. On the Tables screen, click your Amazon DynamoDB table and click the Import Table button.
3. On the Import Table screen, read the walkthrough and check the I have read the walkthrough box, then select Build a Pipeline. This opens the AWS Data Pipeline console so that you can choose a template to import the Amazon DynamoDB table data.
Create the Pipeline Definition using the AWS Data Pipeline Console
To create the new pipeline
1. Sign in to the AWS Management Console and open the AWS Data Pipeline console, or arrive at the AWS Data Pipeline console through the Build a Pipeline button in the Amazon DynamoDB console.
2. Click Create new pipeline.
3. On the Create a New Pipeline page:
   a. In the Pipeline Name box, enter a name (for example, CopyMyS3Data).
   b. In Pipeline Description, enter a description.
   c. Leave the Select Schedule Type: button set to the default, Time Series Style Scheduling, for this tutorial. The schedule type specifies whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
   d. Leave the Role boxes set to their default values for this tutorial, which are DataPipelineDefaultRole for the role and DataPipelineDefaultResourceRole for the resource role.
   Note
   If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.
   e. Click Create a new Pipeline.
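The two schedule types described in step c differ only in which end of the interval an instance's timestamp refers to. A minimal sketch of the idea (plain Python; the instance_times helper is ours for illustration, not part of AWS Data Pipeline):

```python
from datetime import datetime, timedelta

def instance_times(start, end, period, style):
    """Return the timestamps attached to scheduled instances.

    'cron' style stamps an instance at the beginning of each interval;
    'timeseries' style stamps it at the end of each interval.
    """
    times = []
    t = start
    while t + period <= end:
        times.append(t if style == "cron" else t + period)
        t += period
    return times

day = timedelta(days=1)
start = datetime(2012, 11, 22)
end = datetime(2012, 11, 24)

# Cron Style Scheduling: first instance stamped 2012-11-22 00:00:00
print(instance_times(start, end, day, "cron")[0])
# Time Series Style Scheduling: first instance stamped 2012-11-23 00:00:00
print(instance_times(start, end, day, "timeseries")[0])
```

Either way the same work runs over the same intervals; only the nominal schedule time of each instance differs.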
Create and Configure the Pipeline from a Template
On the Pipeline screen, click Templates and select Export S3 to DynamoDB. The AWS Data Pipeline console pre-populates a pipeline definition template with the base objects necessary to import data from Amazon S3.
Review the template and complete the missing fields. You start by choosing the schedule and frequency by which you want your data import operation to run.
To complete the schedule
• On the Pipeline screen, click Schedules.
  a. In the ImportSchedule section, set Period to 1 Hours.
  b. Set Start Date Time using the calendar to the current date, such as 2012-12-18, and the time to 00:00:00 UTC.
  c. In the Add an optional field .. box, select End Date Time.
  d. Set End Date Time using the calendar to the following day, such as 2012-12-19, and the time to 00:00:00 UTC.
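As a sanity check on these schedule values: a 1-hour period between a start of 2012-12-18 00:00:00 UTC and an end of 2012-12-19 00:00:00 UTC yields 24 scheduled instances. A quick sketch of the arithmetic (plain Python; illustrative only, not an AWS Data Pipeline API):

```python
from datetime import datetime, timedelta

start = datetime(2012, 12, 18, 0, 0, 0)   # Start Date Time (UTC)
end = datetime(2012, 12, 19, 0, 0, 0)     # End Date Time (UTC)
period = timedelta(hours=1)               # Period: 1 Hours

# Number of scheduled instances is the number of whole periods
# that fit between the start and end times.
instances = int((end - start) / period)
print(instances)  # 24
```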
Complete the Data Nodes
Next, you complete the data node objects in your pipeline definition template.
To complete the Amazon DynamoDB data node
1. On the Pipeline: name of your pipeline page, select DataNodes.
2. In the DataNodes pane:
   a. Enter the Name; for example: DynamoDB.
   b. In the MyDynamoDBData section, in the Table Name box, type the name of the Amazon DynamoDB table where you want to store the output data; for example: MyTable.
To complete the Amazon S3 data node
• In the DataNodes pane, in the MyS3Data section, in the Directory Path field, type a valid Amazon S3 directory path for the location of your source data; for example, s3://elasticmapreduce/samples/Store/ProductCatalog. This sample file is a fictional product catalog that is pre-populated with delimited data for demonstration purposes.
Complete the Resources
Next, you complete the resources that will run the data import activities. Many of the fields are auto-populated by the template; you only need to complete the empty fields.
To complete the resources
• On the Pipeline page, select Resources.
• In the Emr Log Uri box, type the path where you want to store Amazon EMR debugging logs, using the Amazon S3 bucket that you configured in part one of this tutorial; for example: s3://my-test-bucket/emr_debug_logs.
Complete the Activity
Next, you complete the activity that represents the steps to perform in your data import operation.
To complete the activity
1. On the Pipeline: name of your pipeline page, select Activities.
2. In the MyImportJob section, review the default options already provided. You are not required to manually configure any options in this section.
Complete the Notifications
Next, configure the Amazon SNS notification actions that AWS Data Pipeline must perform depending on the outcome of the activity.
To configure the SNS success, failure, and late notification actions
1. On the Pipeline: name of your pipeline page, in the right pane, click Others.
2. In the Others pane:
   a. In the LateSnsAlarm section, in the Topic Arn box, enter the ARN of the Amazon SNS topic that you created earlier in this tutorial; for example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.
   b. In the FailureSnsAlarm section, in the Topic Arn box, enter the same Amazon SNS topic ARN.
   c. In the SuccessSnsAlarm section, in the Topic Arn box, enter the same Amazon SNS topic ARN.
You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.
Validate and Save Your Pipeline
You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.
If your pipeline is incomplete or incorrect, AWS Data Pipeline raises a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you still get a validation error message, you must fix the errors in the pipeline definition before activating your pipeline.
To validate and save your pipeline
1. On the Pipeline: name of your pipeline page, click Save Pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message.
3. If you get an error message, click Close and then, in the right pane, click Errors.
4. The Errors pane lists the objects failing validation. Click the plus (+) sign next to the object names and look for an error message in red.
5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.
6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.
7. Repeat the process until your pipeline is validated.
Next, verify that your pipeline definition has been saved.
Verify your Pipeline Definition
It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.
To verify your pipeline definition
1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.
2. On the List Pipelines page, check whether your newly created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.
3. Click the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only zeros at this point.
4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.
5. Click Close.
Next, activate your pipeline.
Activate your Pipeline
You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.
To activate your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2. On the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens, confirming the activation.
3. Click Close.
Next, verify that your pipeline is running.
Monitor the Progress of Your Pipeline Runs
To monitor the progress of your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.
2. The Instance details: name of your pipeline page lists the status of each object in your pipeline definition.
Note
If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date, or click the Start (in UTC) date box and change it to an earlier date. Then click Update.
3. If the Status column of all the objects in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity. You should receive an email about the successful completion of this task at the account you specified for receiving your Amazon SNS notifications. You can also check your Amazon S3 data target bucket to verify that the data was copied.
4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. To troubleshoot failed or incomplete instance runs:
   a. Click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.
   b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure; for example: @failureReason = Resource not healthy terminated.
   c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
   d. In the Instance summary pane, click View attempt fields to see details of the fields associated with the selected attempt.
5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.
You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.
For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about common problems and their solutions, see Pipeline Problems and Solutions (p. 131).
Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring AWS Data Pipeline usage charges, delete your pipeline.
[Optional] Delete your Pipeline
Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.
To delete your pipeline
1. On the List Pipelines page, select the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.
Using the Command Line Interface
Topics
• Define the Import Pipeline in JSON Format (p. 82)
• Schedule (p. 84)
• Amazon S3 Data Node (p. 84)
• Precondition (p. 85)
• Amazon EMR Cluster (p. 86)
• Amazon EMR Activity (p. 86)
• Upload the Pipeline Definition (p. 88)
• Activate the Pipeline (p. 89)
• Verify the Pipeline Status (p. 89)
• Verify Data Import (p. 90)
The following topics explain how to perform the steps in this tutorial using the AWS Data Pipeline CLI.
Define the Import Pipeline in JSON Format
This example pipeline definition shows how to use AWS Data Pipeline to retrieve data from a file in Amazon S3 to populate an Amazon DynamoDB table, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work. Additionally, this pipeline sends Amazon SNS notifications if the pipeline succeeds, fails, or runs late. This is the full pipeline definition JSON file, followed by an explanation of each of its sections.
Note
We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.
{
"objects": [
{
"id": "MySchedule",
"type": "Schedule",
"startDateTime": "2012-11-22T00:00:00",
"endDateTime":"2012-11-23T00:00:00",
"period": "1 day"
},
{
"id": "MyS3Data",
"type": "S3DataNode",
"schedule": {
"ref": "MySchedule"
},
"filePath": "s3://input_bucket/ProductCatalog",
"precondition": {
"ref": "InputReady"
}
},
{
"id": "InputReady",
"type": "S3PrefixNotEmpty",
"role": "test-role",
"s3Prefix": "#{node.filePath}"
},
{
"id": "ImportCluster",
"type": "EmrCluster",
"masterInstanceType": "m1.small",
"instanceCoreType": "m1.xlarge",
"instanceCoreCount": "1",
"schedule": {
"ref": "MySchedule"
},
"enableDebugging": "true",
"emrLogUri": "s3://test_bucket/emr_logs"
},
{
"id": "MyImportJob",
"type": "EmrActivity",
"dynamoDBOutputTable": "MyTable",
"dynamoDBWritePercent": "1.00",
"s3MyS3Data": "#{input.path}",
"lateAfterTimeout": "12 hours",
"attemptTimeout": "24 hours",
"maximumRetries": "0",
"input": {
"ref": "MyS3Data"
},
"runsOn": {
"ref": "ImportCluster"
},
"schedule": {
"ref": "MySchedule"
},
"onSuccess": {
"ref": "SuccessSnsAlarm"
},
"onFail": {
"ref": "FailureSnsAlarm"
},
"onLateAction": {
"ref": "LateSnsAlarm"
},
"step": "s3://elasticmapreduce/libs/script-runner/script-run ner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hiveversions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoD
BTableFromS3,-d,DYNAMODB_OUTPUT_TABLE=#{dynamoDBOutputTable},-d,S3_INPUT_BUCK
ET=#{s3MyS3Data},-d,DYNAMODB_WRITE_PERCENT=#{dynamoDBWritePercent},-d,DYNAMODB_EN
DPOINT=dynamodb.us-east-1.amazonaws.com"
},
{
"id": "SuccessSnsAlarm",
"type": "SnsAlarm",
"topicArn": "arn:aws:sns:us-east-1:286192228708:mysnsnotify",
"role": "test-role",
"subject": "DynamoDB table '#{node.dynamoDBOutputTable}' import succeeded",
"message": "DynamoDB table '#{node.dynamoDBOutputTable}' import from S3 bucket '#{node.s3MyS3Data}' succeeded at #{node.@actualEndTime}. JobId:
#{node.id}"
},
{
"id": "LateSnsAlarm",
"type": "SnsAlarm",
"topicArn": "arn:aws:sns:us-east-1:286192228708:mysnsnotify",
"role": "test-role",
"subject": "DynamoDB table '#{node.dynamoDBOutputTable}' import is taking
a long time!",
"message": "DynamoDB table '#{node.dynamoDBOutputTable}' import from S3 bucket '#{node.s3MyS3Data}' has exceeded the late warning period
'#{node.lateAfterTimeout}'. JobId: #{node.id}"
},
{
"id": "FailureSnsAlarm",
"type": "SnsAlarm",
"topicArn": "arn:aws:sns:us-east-1:286192228708:mysnsnotify",
"role": "test-role",
"subject": "DynamoDB table '#{node.dynamoDBOutputTable}' import failed!",
"message": "DynamoDB table '#{node.dynamoDBOutputTable}' import from S3 bucket '#{node.s3MyS3Data}' failed. JobId: #{node.id}. Error: #{node.errorMes sage}."
}
]
}
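As the Note above suggests, it is worth checking the file's JSON syntax before uploading it. One way to do that locally is with a short script (plain Python standard library; the check_pipeline_syntax helper and the sample string are ours for illustration, not the tutorial file or an AWS Data Pipeline API):

```python
import json

def check_pipeline_syntax(text):
    """Parse pipeline-definition JSON and confirm the expected top-level shape."""
    definition = json.loads(text)  # raises ValueError on malformed JSON
    objects = definition["objects"]
    # Every pipeline component needs at least an id and a type.
    for obj in objects:
        assert "id" in obj and "type" in obj, "component missing id/type"
    return len(objects)

sample = '{"objects": [{"id": "MySchedule", "type": "Schedule", "period": "1 day"}]}'
print(check_pipeline_syntax(sample))  # 1
```

This catches only malformed JSON and missing id/type fields; AWS Data Pipeline performs the full validation when you upload the definition.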
Schedule
The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components include a reference to a schedule, and a pipeline may define more than one schedule.
The Schedule component is defined by the following fields:
{
"id": "MySchedule",
"type": "Schedule",
"startDateTime": "2012-11-22T00:00:00",
"endDateTime":"2012-11-23T00:00:00",
"period": "1 day"
},
Note
In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.
Name
The user-defined name for the pipeline schedule, which is a label for your reference only.
Type
The pipeline component type, which is Schedule.
startDateTime
The date/time (in UTC format) that you want the task to begin.
endDateTime
The date/time (in UTC format) that you want the task to stop.
period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline copy operation runs only one time.
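The evenly-divides requirement can be checked mechanically against the values in this schedule. A sketch (plain Python; illustrative only, not an AWS Data Pipeline API):

```python
from datetime import datetime, timedelta

start = datetime(2012, 11, 22)  # startDateTime
end = datetime(2012, 11, 23)    # endDateTime
period = timedelta(days=1)      # period: "1 day"

# divmod on timedeltas gives the whole number of periods and the leftover.
runs, remainder = divmod(end - start, period)
print(runs)                        # 1 scheduled run
print(remainder == timedelta(0))   # True: the period evenly divides the window
```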
Amazon S3 Data Node
Next, the S3DataNode pipeline component defines a location for the input file; in this case a tab-delimited file in an Amazon S3 bucket location. The input S3DataNode component is defined by the following fields:
{
"id": "MyS3Data",
"type": "S3DataNode",
"schedule": {
"ref": "MySchedule"
},
"filePath": "s3://input_bucket/ProductCatalog",
"precondition": {
"ref": "InputReady"
}
},
Name
The user-defined name for the input location (a label for your reference only).
Type
The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled
“MySchedule”.
Path
The path to the data associated with the data node. This path contains a sample product catalog input file that we use for this scenario. The path syntax for a data node is determined by its type; for example, a data node for a file in Amazon S3 uses a different syntax than a data node for a database table.
Precondition
A reference to a precondition that must evaluate as true for the pipeline to consider the data node to be valid. The precondition itself is defined later in the pipeline definition file.
Precondition
Next, the precondition defines a condition that must be true for the pipeline to use the S3DataNode associated with this precondition. The precondition is defined by the following fields:
{
"id": "InputReady",
"type": "S3PrefixNotEmpty",
"role": "test-role",
"s3Prefix": "#{node.filePath}"
},
Name
The user-defined name for the precondition (a label for your reference only).
Type
The type of the precondition is S3PrefixNotEmpty, which checks an Amazon S3 prefix to ensure that it is not empty.
Role
The IAM role that provides the permissions necessary to access the S3DataNode.
S3Prefix
The Amazon S3 prefix to check for emptiness. This field uses the expression #{node.filePath}, which is populated from the referring component; in this example, that is the S3DataNode that refers to this precondition.
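Conceptually, a #{node.<field>} expression behaves like a template substitution against the fields of the referring node. A rough sketch of the idea (plain Python; AWS Data Pipeline's real expression evaluator is far richer than this, and the evaluate helper is ours):

```python
import re

def evaluate(expression, node):
    """Replace #{node.<field>} references with values from the referring node."""
    def lookup(match):
        field = match.group(1)
        return str(node[field])
    return re.sub(r"#\{node\.(\w+)\}", lookup, expression)

# The S3DataNode that refers to the precondition supplies its own fields.
s3_node = {"id": "MyS3Data", "filePath": "s3://input_bucket/ProductCatalog"}
print(evaluate("#{node.filePath}", s3_node))  # s3://input_bucket/ProductCatalog
```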
Amazon EMR Cluster
Next, the EmrCluster pipeline component defines an Amazon EMR cluster that processes and moves the data in this tutorial. The EmrCluster component is defined by the following fields:
{
"id": "ImportCluster",
"type": "EmrCluster",
"masterInstanceType": "m1.small",
"instanceCoreType": "m1.xlarge",
"instanceCoreCount": "1",
"schedule": {
"ref": "MySchedule"
},
"enableDebugging": "true",
"emrLogUri": "s3://test_bucket/emr_logs"
},
Name
The user-defined name for the Amazon EMR cluster (a label for your reference only).
Type
The computational resource type, which is an Amazon EMR cluster. For more information, see
Overview of Amazon EMR in the Amazon EMR Developer Guide.
masterInstanceType
The type of Amazon EC2 instance to use as the master node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 Documentation.
instanceCoreType
The type of Amazon EC2 instance to use as the core node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 Documentation.
instanceCoreCount
The number of core Amazon EC2 instances to use in the Amazon EMR cluster.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled
“MySchedule”.
enableDebugging
Indicates whether to create detailed debug logs for the Amazon EMR job flow.
emrLogUri
Specifies an Amazon S3 location to store the Amazon EMR job flow debug logs if you enabled debugging with the previously mentioned enableDebugging field.
Amazon EMR Activity
Next, the EmrActivity pipeline component brings together the schedule, resources, and data nodes to define the work to perform, the conditions under which to do the work, and the actions to perform when certain events occur. The EmrActivity component is defined by the following fields:
{
"id": "MyImportJob",
"type": "EmrActivity",
"dynamoDBOutputTable": "MyTable",
"dynamoDBWritePercent": "1.00",
"s3MyS3Data": "#{input.path}",
"lateAfterTimeout": "12 hours",
"attemptTimeout": "24 hours",
"maximumRetries": "0",
"input": {
"ref": "MyS3Data"
},
"runsOn": {
"ref": "ImportCluster"
},
"schedule": {
"ref": "MySchedule"
},
"onSuccess": {
"ref": "SuccessSnsAlarm"
},
"onFail": {
"ref": "FailureSnsAlarm"
},
"onLateAction": {
"ref": "LateSnsAlarm"
},
"step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elast icmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,-args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoDBTableFromS3,d,DYNAMODB_OUTPUT_TABLE=#{dynamoDBOutputTable},-d,S3_INPUT_BUCKET=#{s3MyS3Data},d,DYNAMODB_WRITE_PERCENT=#{dynamoDBWritePercent},-d,DYNAMODB_ENDPOINT=dy namodb.us-east-1.amazonaws.com"
},
Name
The user-defined name for the Amazon EMR activity (a label for your reference only).
Type
The EmrActivity pipeline component type, which creates an Amazon EMR job flow to perform the defined work. For more information, see Overview of Amazon EMR in the Amazon EMR Developer
Guide.
dynamoDBOutputTable
The Amazon DynamoDB table where the Amazon EMR job flow writes the output of the Hive script.
dynamoDBWritePercent
Sets the rate of write operations to keep your Amazon DynamoDB provisioned throughput rate in the allocated range for your table. The value must be between 0.1 and 1.5, inclusive.
For more information, see Hive Options in Amazon EMR Developer Guide.
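For a concrete sense of the ratio: part one of this tutorial provisions the table with 5 write capacity units, so a ratio of 1.00 lets the job target roughly the full 5 units per second. A sketch of the arithmetic (plain Python; illustrative only):

```python
provisioned_write_units = 5       # the table's provisioned write capacity
dynamodb_write_percent = 1.00     # must be between 0.1 and 1.5, inclusive

assert 0.1 <= dynamodb_write_percent <= 1.5

# Approximate write rate the import job targets, in capacity units per second.
target_write_rate = provisioned_write_units * dynamodb_write_percent
print(target_write_rate)  # 5.0
```

Lowering the ratio leaves headroom for other applications writing to the same table during the import.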
s3MyS3Data
An expression that refers to the Amazon S3 location path of the input data defined by the S3DataNode labeled "MyS3Data".
lateAfterTimeout
The amount of time, after the schedule start time, that the activity can wait to start before AWS Data Pipeline considers it late.
attemptTimeout
The amount of time, after the schedule start time, that the activity has to complete before AWS Data Pipeline considers it failed.
maximumRetries
The maximum number of times that AWS Data Pipeline retries the activity.
input
The Amazon S3 location path of the input data defined by the S3DataNode labeled "MyS3Data".
runsOn
A reference to the computational resource that will run the activity; in this case, an EmrCluster labeled
"ImportCluster".
schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled
“MySchedule”.
onSuccess
A reference to the action to perform when the activity is successful. In this case, it is to send an
Amazon SNS notification.
onFail
A reference to the action to perform when the activity fails. In this case, it is to send an Amazon SNS notification.
onLateAction
A reference to the action to perform when the activity is late. In this case, it is to send an Amazon
SNS notification.
step
Defines the steps for the EMR job flow to perform. This step calls a Hive script named importDynamoDBTableFromS3 that is provided by Amazon EMR and is specifically designed to move data from Amazon S3 into Amazon DynamoDB. To perform more complex data transformation tasks, you would customize this Hive script and provide its name and path here. For more information about sample Hive scripts that show how to perform data transformation tasks, see Contextual
Advertising using Apache Hive and Amazon EMR in AWS Articles and Tutorials.
Upload the Pipeline Definition
You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information,
see Install the Command Line Interface (p. 15)
To upload your pipeline definition, use the following command.
On Linux/Unix/Mac OS:
./datapipeline --create pipeline_name --put pipeline_file
On Windows:
ruby datapipeline --create pipeline_name --put pipeline_file
Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.
If your pipeline validates successfully, you receive the following message:
Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline definition pipeline_file.json uploaded.
Note
For more information about any errors returned by the --create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).
Ensure that your pipeline appears in the pipeline list by using the following command.
On Linux/Unix/Mac OS:
./datapipeline --list-pipelines
On Windows:
ruby datapipeline --list-pipelines
The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-AKIAIOSFODNN7EXAMPLE.
Activate the Pipeline
You must activate the pipeline by using the --activate command-line parameter before it begins performing work. Use the following command.
On Linux/Unix/Mac OS:
./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE
On Windows:
ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE
Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.
Verify the Pipeline Status
View the status of your pipeline and its components, along with its activity attempts and retries with the following command.
On Linux/Unix/Mac OS:
./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE
On Windows:
ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE
Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.
The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.
Note
It is important to note the difference between the Scheduled Start date/time vs. the Started time.
It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.
Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled
Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data
Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.
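For example, if you activate a daily pipeline whose Scheduled Start is three days in the past, AWS Data Pipeline immediately runs the three missed instances back-to-back before settling into the daily rhythm. A sketch of that catch-up count (plain Python; the dates are hypothetical and for illustration only):

```python
from datetime import datetime, timedelta

scheduled_start = datetime(2012, 11, 22)   # Scheduled Start, in the past
activated_at = datetime(2012, 11, 25)      # pipeline activated three days later
period = timedelta(days=1)

# Backfill: the number of periods that have already elapsed at activation,
# each of which AWS Data Pipeline runs immediately, back-to-back.
missed_runs = int((activated_at - scheduled_start) / period)
print(missed_runs)  # 3
```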
Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status.
Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with its own final status.
Verify Data Import
Next, verify that the data import occurred successfully using the Amazon DynamoDB console to inspect the data in the table.
To verify the data import
1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.
2. On the Tables screen, click your Amazon DynamoDB table and click the Explore Table button.
3. On the Browse Items tab, columns that correspond to the data input file, such as Id, Price, and ProductCategory, should display. This indicates that the import operation from the file to the Amazon DynamoDB table was successful.
Part Two: Export Data from Amazon DynamoDB
Topics
• Before You Begin ... (p. 91)
• Using the AWS Data Pipeline Console (p. 92)
• Using the Command Line Interface (p. 98)
This is the second of a two-part tutorial that demonstrates how to bring together multiple AWS features to solve real-world problems in a scalable way through a common scenario: moving schema-less data in and out of Amazon DynamoDB using Amazon EMR and Hive. This tutorial involves the following concepts and procedures:
• Using the AWS Data Pipeline console and command-line interface (CLI) to create and configure pipelines
• Creating and configuring Amazon DynamoDB tables
• Creating and allocating work to Amazon EMR clusters
• Querying and processing data with Hive scripts
• Storing and accessing data using Amazon S3
Before You Begin ...
You must complete part one of this tutorial to ensure that your Amazon DynamoDB table contains the necessary data to perform the steps in this section. For more information, see Part One: Import Data into Amazon DynamoDB (p. 69).
Additionally, be sure you've completed the following steps:
• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console.
• Set up the AWS Data Pipeline tools and interface you plan on using.
• Create an Amazon S3 bucket as a data output location.
For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started
Guide.
• Ensure that you have the Amazon DynamoDB table that was created and populated with data in part one of this tutorial. This table will be your data source for part two of the tutorial. For more information, see Part One: Import Data into Amazon DynamoDB (p. 69).
Be aware of the following:
• Imports may overwrite data in your Amazon DynamoDB table. When you import data from Amazon
S3, the import may overwrite items in your Amazon DynamoDB table. Make sure that you are importing the right data and into the right table. Be careful not to accidentally set up a recurring import pipeline that will import the same data multiple times.
• Exports may overwrite data in your Amazon S3 bucket. When you export data to Amazon S3, you may overwrite previous exports if you write to the same bucket path. The default behavior of the Export
DynamoDB to S3 template will append the job’s scheduled time to the Amazon S3 bucket path, which will help you avoid this problem.
• Import and Export jobs will consume some of your Amazon DynamoDB table’s provisioned throughput capacity. This section explains how to schedule an import or export job using Amazon EMR. The
Amazon EMR cluster will consume some read capacity during exports or write capacity during imports.
You can control the percentage of the provisioned capacity that the import/export jobs consume with the settings MyImportJob.myDynamoDBWriteThroughputRatio and MyExportJob.myDynamoDBReadThroughputRatio. Be aware that these settings determine how much capacity to consume at the beginning of the import/export process and will not adapt in real time if you change your table's provisioned capacity in the middle of the process.
• Be aware of the costs. AWS Data Pipeline manages the import/export process for you, but you still pay for the underlying AWS services that are being used. The import and export pipelines will create Amazon
EMR clusters to read and write data and there are per-instance charges for each node in the cluster.
You can read more about the details in Amazon EMR Pricing. The default cluster configuration is one m1.small instance master node and one m1.xlarge instance task node, though you can change this configuration in the pipeline definition. There are also charges for AWS Data Pipeline. For more information, see AWS Data Pipeline Pricing and Amazon S3 Pricing.
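As a rough sketch of how the throughput ratio settings translate into consumed capacity: `capacity_for_job` is an invented name for illustration only; the real calculation happens inside the Amazon EMR job, not in user code.

```python
def capacity_for_job(provisioned_units, throughput_ratio):
    """Estimate the capacity units an import/export job will target.

    The ratio is read once when the job starts; changing the table's
    provisioned capacity mid-run does not change this figure.
    """
    return provisioned_units * throughput_ratio

# A table with 100 provisioned read units and a read throughput ratio
# of 0.25 leaves roughly 75 units for other consumers during the export.
export_share = capacity_for_job(100, 0.25)
print(export_share)        # 25.0
print(100 - export_share)  # 75.0
```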
Using the AWS Data Pipeline Console
Topics
•
Start Export from the Amazon DynamoDB Console (p. 92)
•
Create the Pipeline Definition using the AWS Data Pipeline Console (p. 93)
•
Create and Configure the Pipeline from a Template (p. 93)
•
Complete the Data Nodes (p. 94)
•
Complete the Resources (p. 95)
•
Complete the Activity (p. 95)
•
Complete the Notifications (p. 96)
•
Validate and Save Your Pipeline (p. 96)
•
Verify your Pipeline Definition (p. 96)
•
Activate your Pipeline (p. 97)
•
Monitor the Progress of Your Pipeline Runs (p. 97)
•
[Optional] Delete your Pipeline (p. 98)
The following topics explain how to perform the steps in part two of this tutorial using the AWS Data
Pipeline console.
Start Export from the Amazon DynamoDB Console
You can begin the Amazon DynamoDB export operation from within the Amazon DynamoDB console.
To start the data export
1.
Sign in to the AWS Management Console and open the Amazon DynamoDB console .
2.
On the Tables screen, click your Amazon DynamoDB table and click the Export Table button.
3.
On the Import / Export Table screen, select Build a Pipeline. This opens the AWS Data Pipeline console so that you can choose a template to export the Amazon DynamoDB table data.
Create the Pipeline Definition using the AWS Data Pipeline Console
To create the new pipeline
1.
Sign in to the AWS Management Console and open the AWS Data Pipeline console or arrive at the
AWS Data Pipeline console through the Build a Pipeline button in the Amazon DynamoDB console.
2.
Click Create new pipeline.
3.
On the Create a New Pipeline page: a.
In the Pipeline Name box, enter a name (for example,
CopyMyS3Data
).
b.
In Pipeline Description, enter a description.
c.
Leave the Select Schedule Type button set to the default, Time Series Style Scheduling, for this tutorial. Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style
Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
d.
Leave the Role boxes set to their default values for this tutorial, which are
DataPipelineDefaultRole for the role and DataPipelineDefaultResourceRole for the resource role.
Note
If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.
e.
Click Create a new Pipeline.
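The practical difference between the two schedule types can be sketched as follows. `instance_times` is a hypothetical helper that mimics the behavior described above; it is not part of AWS Data Pipeline or any AWS SDK.

```python
from datetime import datetime, timedelta

def instance_times(start, period, count, style):
    """Sketch of when pipeline instances are scheduled under each style.

    'cron' schedules each instance at the BEGINNING of its interval;
    'timeseries' schedules it at the END of the interval.
    """
    begins = [start + i * period for i in range(count)]
    if style == "cron":
        return begins
    return [t + period for t in begins]

start = datetime(2012, 12, 18)
day = timedelta(days=1)
print(instance_times(start, day, 2, "cron")[0])        # 2012-12-18 00:00:00
print(instance_times(start, day, 2, "timeseries")[0])  # 2012-12-19 00:00:00
```

In other words, for the same daily interval, a cron-style instance runs at the start of December 18 while a time-series-style instance runs a full period later, once the interval's data is complete.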
Create and Configure the Pipeline from a Template
On the Pipeline screen, click Templates and select Export DynamoDB to S3. The AWS Data Pipeline console pre-populates a pipeline definition template with the base objects necessary to export data from
Amazon DynamoDB, as shown in the following screen.
API Version 2012-10-29
93
AWS Data Pipeline Developer Guide
Using the AWS Data Pipeline Console
Review the template and complete the missing fields. You start by choosing the schedule and frequency by which you want your data export operation to run.
To complete the schedule
• On the Pipeline screen, click Schedules.
a.
In the DefaultSchedule1 section, set Name to
ExportSchedule
.
b.
Set Period to 1 Hours.
c.
Set Start Date Time using the calendar to the current date, such as
2012-12-18
and the time to
00:00:00 UTC
.
d.
In the Add an optional field .. box, select End Date Time.
e.
Set End Date Time using the calendar to the following day, such as
2012-12-19
and the time to
00:00:00 UTC
.
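With the schedule configured above (a one-hour period over a single day), the number of scheduled runs follows directly from the arithmetic. This Python fragment only illustrates that arithmetic; it is not an AWS API.

```python
from datetime import datetime, timedelta

start = datetime(2012, 12, 18, 0, 0)   # Start Date Time from step c
end = datetime(2012, 12, 19, 0, 0)     # End Date Time from step e
period = timedelta(hours=1)            # Period from step b

runs = (end - start) // period
print(runs)  # 24 -> the export activity is scheduled 24 times
```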
Complete the Data Nodes
Next, you complete the data node objects in your pipeline definition template.
To complete the Amazon DynamoDB data node
1.
On the Pipeline:
name of your pipeline
page, select DataNodes.
2.
In the DataNodes pane, in the Table Name box, type the name of the Amazon DynamoDB table that you created in part one of this tutorial; for example:
MyTable
.
To complete the Amazon S3 data node
• In the MyS3Data section, in the Directory Path field, type the path to the files where you want the
Amazon DynamoDB table data to be written, which is the Amazon S3 bucket that you configured in part one of this tutorial. For example:
s3://mybucket/output/MyTable
.
Complete the Resources
Next, you complete the resources that will run the data export activity. Many of the fields are auto-populated by the template, as shown in the following screen. You only need to complete the empty fields.
To complete the resources
• On the Pipeline page, select Resources.
• In the Emr Log Uri box, type the path where you want to store the Amazon EMR debugging logs, using the Amazon
S3 bucket that you configured in part one of this tutorial; for example:
s3://mybucket/emr_debug_logs
.
Complete the Activity
Next, you complete the activity that represents the steps to perform in your data export operation.
To complete the activity
1.
On the Pipeline:
name of your pipeline
page, select Activities.
2.
In the MyExportJob section, review the default options already provided. You are not required to manually configure any options in this section.
Complete the Notifications
Next, configure the Amazon SNS notification actions that AWS Data Pipeline must perform depending on the outcome of the activity.
To configure the SNS success, failure, and late notification action
1.
On the Pipeline:
name of your pipeline
page, in the right pane, click Others.
2.
In the Others pane: a.
In the LateSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example:
arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic
.
b.
In the FailureSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example:
arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic
.
c.
In the SuccessSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example:
arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic
.
You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.
Validate and Save Your Pipeline
You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.
If your pipeline is incomplete or is incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete, and you are getting the validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline.
To validate and save your pipeline
1.
On the Pipeline:
name of your pipeline
page, click Save Pipeline.
2.
AWS Data Pipeline validates your pipeline definition and returns either success or the error message.
3.
If you get an error message, click Close and then, in the right pane, click Errors.
4.
The Errors pane lists the objects failing validation.
Click the plus (+) sign next to the object names and look for an error message in red.
5.
When you see the error message, click the specific object pane where you see the error and fix it.
For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.
6.
After you've fixed the errors listed in the Errors pane, click Save Pipeline.
7.
Repeat the process until your pipeline is validated.
Next, verify that your pipeline definition has been saved.
Verify your Pipeline Definition
It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.
To verify your pipeline definition
1.
On the Pipeline:
name of your pipeline
page, click Back to list of pipelines.
2.
On the List Pipelines page, check if your newly-created pipeline is listed.
AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition.
The Status column in the row listing your pipeline should show PENDING.
3.
Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.
4.
In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.
5.
Click Close.
Next, activate your pipeline.
Activate your Pipeline
You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.
To activate your pipeline
1.
On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2.
In the Pipeline:
name of your pipeline
page, click Activate.
Next, verify that your pipeline is running.
Monitor the Progress of Your Pipeline Runs
To monitor the progress of your pipeline
1.
On the List Pipelines page, in the Details column of your pipeline, click View instance details.
2.
The Instance details:
name of your pipeline
page lists the status of each object in your pipeline definition.
Note
If you do not see runs listed, depending on when your pipeline was scheduled, either click
End (in UTC) date box and change it to a later date or click Start (in UTC) date box and change it to an earlier date. Then click Update.
3.
If the Status column of all the objects in your pipeline indicates FINISHED, your pipeline has successfully completed the export activity. You should receive an email about the successful completion of this task, sent to the account you specified for receiving your Amazon SNS notification.
You can also check your Amazon S3 data target bucket to verify that the data was copied.
4.
If the Status column of any of your objects indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.
a.
To troubleshoot the failed or incomplete runs, click the triangle next to a run; the Instance summary panel opens to show the details of the selected run.
b.
Click View instance fields to see additional details of the run. If the status of your selected run is FAILED, the additional details box has an entry indicating the reason for failure; for example:
@failureReason = Resource not healthy terminated
.
c.
You can use the information in the Instance summary panel and the View instance fields box to troubleshoot issues with your failed pipeline.
For more information about the status listed for the runs, see Details (p. 129). For more information about troubleshooting the failed or incomplete runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).
Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline Pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.
[Optional] Delete your Pipeline
Deleting your pipeline deletes the pipeline definition including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.
To delete your pipeline
1.
In the List Pipelines page, click the check box next to your pipeline.
2.
Click Delete.
3.
In the confirmation dialog box, click Delete to confirm the delete request.
Using the Command Line Interface
Topics
•
Define the Export Pipeline in JSON Format (p. 98)
•
Schedule (p. 100)
•
Amazon S3 Data Node (p. 101)
•
Amazon EMR Cluster (p. 102)
•
Amazon EMR Activity (p. 102)
•
Upload the Pipeline Definition (p. 104)
•
Activate the Pipeline (p. 105)
•
Verify the Pipeline Status (p. 105)
•
Verify Data Export (p. 106)
The following topics explain how to perform the steps in this tutorial using the AWS Data Pipeline CLI.
Define the Export Pipeline in JSON Format
This example pipeline definition shows how to use AWS Data Pipeline to retrieve data from an Amazon
DynamoDB table to populate a tab-delimited file in Amazon S3, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work.
Additionally, this pipeline will send Amazon SNS notifications if the pipeline succeeds, fails, or runs late.
This is the full pipeline definition JSON file followed by an explanation for each of its sections.
Note
We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.
{
"objects": [
{
"id": "MySchedule",
"type": "Schedule",
"startDateTime": "2012-11-22T00:00:00",
"endDateTime":"2012-11-23T00:00:00",
"period": "1 day"
},
{
"id": "MyS3Data",
"type": "S3DataNode",
"schedule": {
"ref": "MySchedule"
},
"filePath": "s3://output_bucket/ProductCatalog"
},
{
"id": "ExportCluster",
"type": "EmrCluster",
"masterInstanceType": "m1.small",
"instanceCoreType": "m1.xlarge",
"instanceCoreCount": "1",
"schedule": {
"ref": "MySchedule"
},
"enableDebugging": "true",
"emrLogUri": "s3://test_bucket/emr_logs"
},
{
"id": "MyExportJob",
"type": "EmrActivity",
"dynamoDBInputTable": "MyTable",
"dynamoDBReadPercent": "0.25",
"s3OutputBucket": "#{output.path}",
"lateAfterTimeout": "12 hours",
"attemptTimeout": "24 hours",
"maximumRetries": "0",
"output": {
"ref": "MyS3Data"
},
"runsOn": {
"ref": "ExportCluster"
},
"schedule": {
"ref": "MySchedule"
},
"onSuccess": {
"ref": "SuccessSnsAlarm"
},
"onFail": {
"ref": "FailureSnsAlarm"
},
"onLateAction": {
"ref": "LateSnsAlarm"
},
"step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/exportDynamoDBTableToS3,-d,DYNAMODB_INPUT_TABLE=#{dynamoDBInputTable},-d,S3_OUTPUT_BUCKET=#{s3OutputBucket}/#{format(@actualStartTime,'YYYY-MM-dd_hh.mm')},-d,DYNAMODB_READ_PERCENT=#{dynamoDBReadPercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
},
{
"id": "SuccessSnsAlarm",
"type": "SnsAlarm",
"topicArn": "arn:aws:sns:us-east-1:286198878708:mysnsnotify",
"role": "test-role",
"subject": "DynamoDB table '#{node.dynamoDBInputTable}' export succeeded",
"message": "DynamoDB table '#{node.dynamoDBInputTable}' export to S3 bucket '#{node.s3OutputBucket}' succeeded at #{node.@actualEndTime}. JobId:
#{node.id}"
},
{
"id": "LateSnsAlarm",
"type": "SnsAlarm",
"topicArn": "arn:aws:sns:us-east-1:286198878708:mysnsnotify",
"role": "test-role",
"subject": "DynamoDB table '#{node.dynamoDBInputTable}' export is taking
a long time!",
"message": "DynamoDB table '#{node.dynamoDBInputTable}' export to S3 bucket '#{node.s3OutputBucket}' has exceeded the late warning period
'#{node.lateAfterTimeout}'. JobId: #{node.id}"
},
{
"id": "FailureSnsAlarm",
"type": "SnsAlarm",
"topicArn": "arn:aws:sns:us-east-1:286198878708:mysnsnotify",
"role": "test-role",
"subject": "DynamoDB table '#{node.dynamoDBInputTable}' export failed!",
"message": "DynamoDB table '#{node.dynamoDBInputTable}' export to S3 bucket '#{node.s3OutputBucket}' failed. JobId: #{node.id}. Error: #{node.errorMessage}."
}
]
}
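Before uploading a definition like the one above, it can be handy to check that every `{"ref": ...}` value points at a defined object id, since a dangling reference will fail validation. `unresolved_refs` below is a homegrown sketch, not part of the AWS CLI tools; the sample definition is deliberately broken to show the check firing.

```python
import json

def unresolved_refs(pipeline_json):
    """Return any {"ref": ...} values that don't name a defined object id."""
    doc = json.loads(pipeline_json)
    ids = {obj["id"] for obj in doc["objects"]}
    missing = set()

    def walk(node):
        if isinstance(node, dict):
            if set(node) == {"ref"}:
                if node["ref"] not in ids:
                    missing.add(node["ref"])
            else:
                for value in node.values():
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(doc)
    return missing

# Minimal definition that references an EmrCluster it never defines.
sample = """{"objects": [
  {"id": "MySchedule", "type": "Schedule"},
  {"id": "MyS3Data", "type": "S3DataNode",
   "schedule": {"ref": "MySchedule"}},
  {"id": "MyExportJob", "type": "EmrActivity",
   "runsOn": {"ref": "ExportCluster"}}
]}"""

print(unresolved_refs(sample))  # {'ExportCluster'}
```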
Schedule
The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule and you may have more than one.
The Schedule component is defined by the following fields:
{
"id": "MySchedule",
"type": "Schedule",
"startDateTime": "2012-11-22T00:00:00",
"endDateTime":"2012-11-23T00:00:00",
"period": "1 day"
},
Note
In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.
Name
The user-defined name for the pipeline schedule, which is a label for your reference only.
Type
The pipeline component type, which is Schedule.
startDateTime
The date/time (in UTC format) that you want the task to begin.
endDateTime
The date/time (in UTC format) that you want the task to stop.
period
The time period that you want to pass between task attempts, even if the task occurs only one time.
The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline export operation runs only one time.
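The divisibility rule for the period can be sketched in a few lines of Python; `schedule_is_valid` is an illustrative helper, not part of any AWS SDK.

```python
from datetime import datetime, timedelta

def schedule_is_valid(start, end, period):
    """The period must evenly divide the time between start and end."""
    return (end - start) % period == timedelta(0)

start = datetime(2012, 11, 22)
end = datetime(2012, 11, 23)
print(schedule_is_valid(start, end, timedelta(days=1)))   # True: exactly one run
print(schedule_is_valid(start, end, timedelta(hours=7)))  # False: 7 hours does not divide 24 hours
```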
Amazon S3 Data Node
Next, the S3DataNode pipeline component defines a location for the output file; in this case a tab-delimited file in an Amazon S3 bucket location. The output S3DataNode component is defined by the following fields:
{
"id": "MyS3Data",
"type": "S3DataNode",
"schedule": {
"ref": "MySchedule"
},
"filePath": "s3://output_bucket/ProductCatalog"
},
Name
The user-defined name for the output location (a label for your reference only).
Type
The pipeline component type, which is "S3DataNode" to match the data output location, in an Amazon
S3 bucket.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled
“MySchedule”.
Path
The path to the data associated with the data node. This path is an empty Amazon S3 location where a tab-delimited output file will be written that has the contents of a sample product catalog in an
API Version 2012-10-29
101
AWS Data Pipeline Developer Guide
Using the Command Line Interface
Amazon DynamoDB table. The syntax for a data node is determined by its type. For example, a data node for a file in Amazon S3 follows a different syntax than a data node for a database table.
Amazon EMR Cluster
Next, the EmrCluster pipeline component defines an Amazon EMR cluster that processes and moves the data in this tutorial. The EmrCluster component is defined by the following fields:
{
"id": "ExportCluster",
"type": "EmrCluster",
"masterInstanceType": "m1.small",
"instanceCoreType": "m1.xlarge",
"instanceCoreCount": "1",
"schedule": {
"ref": "MySchedule"
},
"enableDebugging": "true",
"emrLogUri": "s3://test_bucket/emr_logs"
},
Name
The user-defined name for the Amazon EMR cluster (a label for your reference only).
Type
The computational resource type, which is an Amazon EMR cluster. For more information, see
Overview of Amazon EMR in the Amazon EMR Developer Guide.
masterInstanceType
The type of Amazon EC2 instance to use as the master node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 Documentation.
instanceCoreType
The type of Amazon EC2 instance to use as the core node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 Documentation.
instanceCoreCount
The number of core Amazon EC2 instances to use in the Amazon EMR cluster.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled
“MySchedule”.
enableDebugging
Indicates whether to create detailed debug logs for the Amazon EMR job flow.
emrLogUri
Specifies an Amazon S3 location to store the Amazon EMR job flow debug logs if you enabled debugging with the previously-mentioned enableDebugging
field.
Amazon EMR Activity
Next, the EmrActivity pipeline component brings together the schedule, resources, and data nodes to define the work to perform, the conditions under which to do the work, and the actions to perform when certain events occur. The EmrActivity component is defined by the following fields:
{
"id": "MyExportJob",
"type": "EmrActivity",
"dynamoDBInputTable": "MyTable",
"dynamoDBReadPercent": "0.25",
"s3OutputBucket": "#{output.path}",
"lateAfterTimeout": "12 hours",
"attemptTimeout": "24 hours",
"maximumRetries": "0",
"output": {
"ref": "MyS3Data"
},
"runsOn": {
"ref": "ExportCluster"
},
"schedule": {
"ref": "MySchedule"
},
"onSuccess": {
"ref": "SuccessSnsAlarm"
},
"onFail": {
"ref": "FailureSnsAlarm"
},
"onLateAction": {
"ref": "LateSnsAlarm"
},
"step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/exportDynamoDBTableToS3,-d,DYNAMODB_INPUT_TABLE=#{dynamoDBInputTable},-d,S3_OUTPUT_BUCKET=#{s3OutputBucket}/#{format(@actualStartTime,'YYYY-MM-dd_hh.mm')},-d,DYNAMODB_READ_PERCENT=#{dynamoDBReadPercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
},
Name
The user-defined name for the Amazon EMR activity (a label for your reference only).
Type
The EmrActivity pipeline component type, which creates an Amazon EMR job flow to perform the defined work. For more information, see Overview of Amazon EMR in the Amazon EMR Developer
Guide.
dynamoDBInputTable
The Amazon DynamoDB table that the Amazon EMR job flow reads as the input for the Hive script.
dynamoDBReadPercent
Set the rate of read operations to keep your Amazon DynamoDB provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusive. For more information, see Hive Options in the Amazon EMR Developer Guide.
s3OutputBucket
An expression that refers to the Amazon S3 location path for the output file defined by the S3DataNode labeled "MyS3Data".
lateAfterTimeout
The amount of time, after the schedule start time, that the activity can wait to start before AWS Data
Pipeline considers it late.
attemptTimeout
The amount of time, after the schedule start time, that the activity has to complete before AWS Data
Pipeline considers it as failed.
maximumRetries
The maximum number of times that AWS Data Pipeline retries the activity.
output
The Amazon S3 location path of the output data defined by the S3DataNode labeled "MyS3Data".
runsOn
A reference to the computational resource that will run the activity; in this case, an EmrCluster labeled
"ExportCluster".
schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled
“MySchedule”.
onSuccess
A reference to the action to perform when the activity is successful. In this case, it is to send an
Amazon SNS notification.
onFail
A reference to the action to perform when the activity fails. In this case, it is to send an Amazon SNS notification.
onLateAction
A reference to the action to perform when the activity is late. In this case, it is to send an Amazon
SNS notification.
step
Defines the steps for the EMR job flow to perform. This step calls a Hive script named exportDynamoDBTableToS3 that is provided by Amazon EMR and is specifically designed to move data from Amazon DynamoDB to Amazon S3. To perform more complex data transformation tasks, you would customize this Hive script and provide its name and path here. For more information about sample Hive scripts that show how to perform data transformation tasks, see Contextual Advertising using Apache Hive and Amazon EMR in AWS Articles and Tutorials.
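The `#{...}` expressions in the step field are expanded by AWS Data Pipeline before the step is handed to Amazon EMR. The sketch below mimics that substitution for two of the variables; `expand` is an illustrative stand-in for behavior that actually happens server-side.

```python
# Hypothetical sketch of how #{...} expressions in the step field are
# filled in from the activity's own fields; the real expansion is done
# by the AWS Data Pipeline service, not by user code.
def expand(template, fields):
    out = template
    for name, value in fields.items():
        out = out.replace("#{%s}" % name, value)
    return out

step = ("-d,DYNAMODB_INPUT_TABLE=#{dynamoDBInputTable},"
        "-d,DYNAMODB_READ_PERCENT=#{dynamoDBReadPercent}")
expanded = expand(step, {"dynamoDBInputTable": "MyTable",
                         "dynamoDBReadPercent": "0.25"})
print(expanded)  # -d,DYNAMODB_INPUT_TABLE=MyTable,-d,DYNAMODB_READ_PERCENT=0.25
```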
Upload the Pipeline Definition
You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information,
see Install the Command Line Interface (p. 15).
To upload your pipeline definition, use the following command.
On Linux/Unix/Mac OS:
./datapipeline --create pipeline_name --put pipeline_file
On Windows: ruby datapipeline --create pipeline_name --put pipeline_file
Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.
If your pipeline validates successfully, you receive the following message:
Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline
definition pipeline_file.json uploaded.
Note
For more information about any errors returned by the --create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).
Ensure that your pipeline appears in the pipeline list by using the following command.
On Linux/Unix/Mac OS:
./datapipeline --list-pipelines
On Windows: ruby datapipeline --list-pipelines
The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-AKIAIOSFODNN7EXAMPLE
.
Activate the Pipeline
You must activate the pipeline, by using the --activate command-line parameter, before it will begin performing work. Use the following command.
On Linux/Unix/Mac OS:
./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE
On Windows: ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE
Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.
Verify the Pipeline Status
View the status of your pipeline and its components, along with its activity attempts and retries with the following command.
On Linux/Unix/Mac OS:
./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE
On Windows: ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE
Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.
The --list-runs command displays a list of pipeline components and details such as Name, Scheduled
Start, Status, ID, Started, and Ended.
Note
It is important to note the difference between the Scheduled Start date/time vs. the Started time.
It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.
Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline
components the number of times the activity should have run if it had started on the Scheduled
Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data
Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.
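The number of back-to-back backfill runs follows directly from the elapsed time and the period. `backfill_runs` below only illustrates that arithmetic; the actual backfill logic lives inside AWS Data Pipeline.

```python
from datetime import datetime, timedelta

def backfill_runs(scheduled_start, now, period):
    """Illustration only: how many past runs are executed back-to-back
    when the Scheduled Start date/time is in the past."""
    if now <= scheduled_start:
        return 0
    return (now - scheduled_start) // period

# A daily pipeline whose scheduled start is three days in the past
# is immediately run three times to catch up.
print(backfill_runs(datetime(2012, 11, 22), datetime(2012, 11, 25), timedelta(days=1)))  # 3
```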
Successful pipeline runs are indicated by all the activities in your pipeline reporting the
FINISHED
status.
Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as
Amazon EC2 instances, may show the
SHUTTING_DOWN
status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with their own final status.
Verify Data Export
Next, verify that the data export occurred successfully by viewing the output file contents.
To view the export file contents
1.
Sign in to the AWS Management Console and open the Amazon S3 console.
2.
On the Buckets pane, click the Amazon S3 bucket that contains your file output (the example pipeline uses the output path s3://output_bucket/ProductCatalog
) and open the output file with your preferred text editor. The output file name is an identifier value with no extension, such as this example: ae10f955-fb2f-4790-9b11-fbfea01a871e_000000
.
3.
Using your preferred text editor, view the contents of the output file and ensure that there is delimited data that corresponds to the Amazon DynamoDB source table, such as Id, Price, ProductCategory, as shown in the following screen. This indicates that the export operation from Amazon DynamoDB to the output file occurred successfully.
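A row of the tab-delimited output can be pulled apart with an ordinary string split; the sample values below are invented for illustration and are not from an actual export file.

```python
# Invented sample row; the real file holds one tab-delimited row per item,
# with columns corresponding to the table's attributes.
sample_row = "205\t500\tBike"

item_id, price, category = sample_row.split("\t")
print(item_id, price, category)  # 205 500 Bike
```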
Tutorial: Run a Shell Command to Process MySQL Table
This tutorial walks you through the process of creating a data pipeline to use a script stored in an Amazon S3 bucket to process a MySQL table, write the output to a comma-separated values (CSV) file in an Amazon S3 bucket, and then send an Amazon SNS notification after the task completes successfully. You will use the Amazon EC2 instance resource provided by AWS Data Pipeline for this shell command activity.
The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each pipeline object. For more information about pipeline definitions, see Pipeline Definition Files (p. 135).
This tutorial uses the following objects to create a pipeline definition:
Activity
The activity that AWS Data Pipeline must perform for this pipeline. This tutorial uses ShellCommandActivity to process the data in the MySQL table and write the output to a CSV file.
Schedule
The start date, time, and the duration for this activity. You can optionally specify the end date and time.
Resource
The resource that AWS Data Pipeline must use to perform this activity. This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to run a command for processing the data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates it after the task finishes.
DataNodes
Input and output nodes for this pipeline. This tutorial uses two input nodes and one output node. The first input node is the MySQLDataNode that contains the MySQL table; the second input node is the S3DataNode that contains the script. The output node is the S3DataNode for storing the CSV file.
Action
The action that AWS Data Pipeline must take when the specified conditions are met. This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the email address you specify after the task finishes successfully.
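The same objects can also be expressed in a pipeline definition file. The following JSON is only a sketch of how the tutorial's objects might reference one another; all IDs, paths, credentials, and the topic ARN are placeholder values, the script location is given here via the scriptUri field, and the exact field names are described in the pipeline definition reference.

```json
{
  "objects": [
    {"id": "MySchedule", "type": "Schedule",
     "startDateTime": "2012-11-01T00:00:00", "period": "1 day"},
    {"id": "MySQLTableInput", "type": "MySQLDataNode",
     "schedule": {"ref": "MySchedule"},
     "connectionString": "mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com",
     "table": "mysql-input-table",
     "username": "my-username", "*password": "my-password"},
    {"id": "MySQLScriptOutput", "type": "S3DataNode",
     "schedule": {"ref": "MySchedule"},
     "filePath": "s3://my-data-pipeline-output/output.csv"},
    {"id": "RunScriptInstance", "type": "Ec2Resource",
     "schedule": {"ref": "MySchedule"}},
    {"id": "RunDailyScriptNotice", "type": "SnsAlarm",
     "topicArn": "arn:aws:sns:us-east-1:123456789012:my-topic",
     "subject": "Script finished",
     "message": "The daily script completed successfully."},
    {"id": "run-my-script", "type": "ShellCommandActivity",
     "schedule": {"ref": "MySchedule"},
     "scriptUri": "s3://my-script/myscript.txt",
     "input": {"ref": "MySQLTableInput"},
     "output": {"ref": "MySQLScriptOutput"},
     "runsOn": {"ref": "RunScriptInstance"},
     "onSuccess": {"ref": "RunDailyScriptNotice"}}
  ]
}
```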
For more information about the additional objects and fields supported by the shell command activity, see .
The following steps outline how to create a data pipeline to run a script stored in an Amazon S3 bucket.
1. Create your pipeline definition
2. Create and configure the pipeline definition objects
3. Validate and save your pipeline definition
4. Verify that your pipeline definition is saved
5. Activate your pipeline
6. Monitor the progress of your pipeline
7. [Optional] Delete your pipeline
Before you begin ...
Be sure you've completed the following steps.
• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see .
• Set up the AWS Data Pipeline tools and interface you plan on using. For more information on interfaces, see .
• Create and launch a MySQL database instance as a data source.
For more information, see Launch a DB Instance in the Amazon Relational Database Service (RDS) Getting Started Guide.
Note
Make a note of the user name and the password you used when creating the MySQL instance. After you've launched your MySQL database instance, make a note of the instance's endpoint. You will need all this information in this tutorial.
• Connect to your MySQL database instance, create a table, and then add test data values to the newly-created table.
For more information, see Create a Table in the MySQL documentation.
• Create an Amazon S3 bucket as a source for the script.
For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.
• Create a script to read the data in the MySQL table, process the data, and then write the results to a CSV file. The script must run on an Amazon EC2 Linux instance.
Note
The AWS Data Pipeline computational resources (Amazon EMR job flow and Amazon EC2 instance) are not supported on Windows in this release.
• Upload your script to your Amazon S3 bucket.
For more information, see Add an Object to a Bucket in the Amazon Simple Storage Service Getting Started Guide.
• Create another Amazon S3 bucket as a data target.
• Create an Amazon SNS topic for sending email notifications and make a note of the topic Amazon Resource Name (ARN). For more information on creating an Amazon SNS topic, see Create a Topic in the Amazon Simple Notification Service Getting Started Guide.
• [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your own IAM role policy and trust relationships, follow the instructions described in Granting Permissions to Pipelines with IAM (p. 21).
Note
Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.
Using the AWS Data Pipeline Console
Topics
• Create and Configure the Pipeline Definition Objects (p. 109)
• Validate and Save Your Pipeline (p. 112)
• Verify your Pipeline Definition (p. 113)
• Activate your Pipeline (p. 113)
• Monitor the Progress of Your Pipeline Runs (p. 114)
• [Optional] Delete your Pipeline (p. 115)
The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.
To create your pipeline definition
1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.
2. Click Create Pipeline.
3. On the Create a New Pipeline page:
   a. In the Pipeline Name box, enter a name (for example, RunDailyScript).
   b. In Pipeline Description, enter a description.
   c. Leave the Select Schedule Type button set to the default type for this tutorial.
      Note
      The schedule type lets you specify whether the objects in your pipeline definition should be scheduled at the beginning or the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
   d. Leave the Role boxes set to their default values for this tutorial.
      Note
      If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.
   e. Click Create a new pipeline.
Create and Configure the Pipeline Definition Objects
Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.
1. On the Pipeline: name of your pipeline page, click Add activity.
2. In the Activities pane:
   a. Enter the name of the activity; for example, run-my-script.
   b. In the Type box, select ShellCommandActivity.
   c. In the Schedule box, select Create new: Schedule.
   d. In the Add an optional field .. box, select Script Uri.
   e. In the Script Uri box, enter the path to your uploaded script; for example, s3://my-script/myscript.txt.
   f. In the Add an optional field .. box, select Input.
   g. In the Input box, select Create new: DataNode.
   h. In the Add an optional field .. box, select Output.
   i. In the Output box, select Create new: DataNode.
   j. In the Add an optional field .. box, select RunsOn.
   k. In the Runs On box, select Create new: Resource.
   l. In the Add an optional field .. box, select On Success.
   m. In the On Success box, select Create new: Action.
   n. In the left pane, separate the icons by dragging them apart.
You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline will use to perform the shell command activity.
The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connections between the various objects.
Next, configure run date and time for your pipeline.
To configure run date and time for your pipeline
1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.
2. In the Schedules pane:
   a. Enter a schedule name for this activity (for example, run-mysql-script-schedule).
   b. In the Type box, select Schedule.
   c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.
      Note
      AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only.
   d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).
   e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.
To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.
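As a sketch of generating such a start time (assuming GNU date, as on most Linux systems), the following produces yesterday's date and time in the UTC format AWS Data Pipeline expects:

```shell
# Produce a timestamp one day in the past in the "YYYY-MM-DDTHH:MM:SS"
# UTC format AWS Data Pipeline requires. The -d option is GNU-specific;
# on BSD/Mac OS, the equivalent would be: date -u -v-1d +%Y-%m-%dT%H:%M:%S
START=$(date -u -d "1 day ago" +%Y-%m-%dT%H:%M:%S)
echo "$START"
```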
Next, configure the input and the output data nodes for your pipeline.
To configure the input and output data nodes of your pipeline
1. On the Pipeline: name of your pipeline page, in the right pane, click DataNodes.
2. In the DataNodes pane:
   a. In the DefaultDataNode1 Name box, enter the name for your MySQL data source node (for example, MySQLTableInput).
   b. In the Type box, select MySQLDataNode.
   c. In the Connection String box, enter the endpoint of your MySQL database instance (for example, mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com).
   d. In the Table box, enter the name of the source database table (for example, mysql-input-table).
   e. In the Schedule box, select run-mysql-script-schedule.
   f. In the *Password box, enter the password you used when you created your MySQL database instance.
   g. In the Username box, enter the user name you used when you created your MySQL database instance.
   h. In the DefaultDataNode2 Name box, enter the name for the data target node for your CSV file (for example, MySQLScriptOutput).
   i. In the Type box, select S3DataNode.
   j. In the Schedule box, select run-mysql-script-schedule.
   k. In the Add an optional field .. box, select File Path.
   l. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your csv file).
Next, configure the resource AWS Data Pipeline must use to run your script.
To configure the resource
1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.
2. In the Resources pane:
   a. In the Name box, enter the name for your resource (for example, RunScriptInstance).
   b. In the Type box, select Ec2Resource.
   c. Leave the Resource Role and Role boxes set to the default values for this tutorial.
   d. In the Schedule box, select run-mysql-script-schedule.
Next, configure the SNS notification action AWS Data Pipeline must perform after your script runs successfully.
To configure the SNS notification action
1. On the Pipeline: name of your pipeline page, in the right pane, click Others.
2. In the Others pane:
   a. In the DefaultAction1 Name box, enter the name for your Amazon SNS notification (for example, RunDailyScriptNotice).
   b. In the Type box, select SnsAlarm.
   c. In the Topic Arn box, enter the ARN of your Amazon SNS topic.
   d. In the Subject box, enter the subject line for your notification.
   e. In the Message box, enter the message content.
   f. Leave the entry in the Role box set to the default.
You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.
Validate and Save Your Pipeline
You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in it. If your pipeline is incomplete or incorrect, AWS Data Pipeline returns a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you are still getting a validation error message, you must fix the errors in the pipeline definition before activating your pipeline.
To validate and save your pipeline
1. On the Pipeline: name of your pipeline page, click Save Pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message.
3. If you get an error message, click Close and then, in the right pane, click Errors.
4. The Errors pane lists the objects that failed validation. Click the plus (+) sign next to the object names and look for an error message in red.
5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.
6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.
7. Repeat the process until your pipeline passes validation.
Next, verify that your pipeline definition has been saved.
Verify your Pipeline Definition
It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.
To verify your pipeline definition
1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.
2. On the List Pipelines page, check whether your newly-created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.
3. Click the triangle icon next to your pipeline. The Pipeline summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.
4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.
5. Click Close.
Next, activate your pipeline.
Activate your Pipeline
You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.
To activate your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2. On the Pipeline: name of your pipeline page, click Activate.
A confirmation dialog box opens, confirming the activation.
3. Click Close.
Next, verify that your pipeline is running.
Monitor the Progress of Your Pipeline Runs
To monitor the progress of your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.
2. The Instance details: name of your pipeline page lists the status of each instance in your pipeline definition.
   Note
   If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.
3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the activity. You should receive an email about the successful completion of this task at the account you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify that the data was processed.
4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. To troubleshoot the failed or incomplete instance runs:
   a. Click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.
   b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure. For example, @failureReason = Resource not healthy terminated.
   c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
   d. In the Instance summary pane, click View attempt fields to see details of the fields associated with the selected attempt.
5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.
You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.
For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting failed or incomplete instance runs, see Data Pipeline Problems and Solutions (p. 131).
Important
Your pipeline is running and incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring AWS Data Pipeline usage charges, delete your pipeline.
[Optional] Delete your Pipeline
Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.
To delete your pipeline
1. On the List Pipelines page, select the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.
Manage Pipelines
Topics
• Using AWS Data Pipeline Console (p. 116)
• Using the Command Line Interface (p. 121)
You can use either the AWS Data Pipeline console or the AWS Data Pipeline command line interface (CLI) to view the details of your pipeline or to delete your pipeline.
Using AWS Data Pipeline Console
Topics
• View pipeline definition (p. 116)
• View details of each instance in an active pipeline (p. 117)
• Modify pipeline definition (p. 119)
With the AWS Data Pipeline console, you can:
• View the pipeline definition of any pipeline associated with your account
• View the details of each instance in your pipeline and use the information to troubleshoot a failed instance run
• Modify pipeline definition
• Delete pipeline
The following sections walk you through the steps for managing your pipeline. Before you begin, be sure that you have at least one pipeline associated with your account, have access to the AWS Management Console, and have opened the AWS Data Pipeline console at https://console.aws.amazon.com/datapipeline/.
View pipeline definition
If you are signed in and have opened the AWS Data Pipeline console, your screen shows a list of pipelines associated with your account.
The Status column in the pipeline listing displays the current state of your pipelines. A pipeline is SCHEDULED if the pipeline definition has passed validation and is activated, whether it is currently running or has completed its run. A pipeline is PENDING if the pipeline definition is incomplete or has failed the validation step that all pipelines go through before being saved.
If you want to modify or complete your pipeline definition, see Modify pipeline definition (p. 119).
To view the pipeline definition of your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View instance details (or click View pipeline, if PENDING).
2. If your pipeline is SCHEDULED:
   a. On the Instance details: name of your pipeline page, click View pipeline.
   b. The Pipeline: name of your pipeline [The pipeline is active.] page opens. This is your pipeline definition page. As indicated in the title of the page, this pipeline is active.
3. To view the pipeline object definitions, on the Pipeline: name of your pipeline page, click the object icons in the design pane. The corresponding object pane on the right panel opens.
4. You can also click the object panes on the right panel to view the objects and the associated fields.
5. If your pipeline definition graph does not fit in the design pane, use the pan buttons on the right side of the design pane to slide the canvas.
6. Click Back to list of pipelines to get back to the List Pipelines page.
View details of each instance in an active pipeline
If you are signed in and have opened the AWS Data Pipeline console, the List Pipelines page shows the pipelines associated with your account.
The Status column in the pipeline listing displays the current state of your pipelines. Your pipeline is active if the status is SCHEDULED. A pipeline is in the SCHEDULED state if the pipeline definition has passed validation and is activated, whether it is currently running or has completed its run. You can view the pipeline definition, the runs list, and the details of each run of an active pipeline. For information on modifying an active pipeline, see Modify pipeline definition (p. 119).
To retrieve the details of your active pipeline
1. On the List Pipelines page, identify your active pipeline, and then click the small triangle next to the pipeline ID.
2. In the Pipeline summary pane, click View fields to see additional information on your pipeline definition.
3. Click Close to close the View fields box, and then click the triangle of your active pipeline again to close the Pipeline summary pane.
4. In the row that lists your active pipeline, click View instance details.
5. The Instance details: name of your pipeline page lists all the instances of your active pipeline.
   Note
   If you do not see the list of instances, click the End (in UTC) date box, change it to a later date, and then click Update.
6. You can also use the Filter Object, Start, or End date-time fields to filter the instances returned based on either their current status or the date range in which they were launched. Filtering the results is useful because, depending on the pipeline age and scheduling, the instance run history can be very large.
7. If the Status column of all the runs in your pipeline displays the FINISHED state, your pipeline has successfully completed running.
If the Status column of any one of your runs indicates a status other than FINISHED, your pipeline is either running, waiting for some precondition to be met, or has failed.
8. Click the triangle next to an instance to show the details of the selected instance.
9. In the Instance summary pane, click View instance fields to see details of the fields associated with the selected instance. If the status of your selected instance is FAILED, the additional details box has an entry indicating the reason for failure. For example, @failureReason = Resource not healthy terminated.
10. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
11. In the Instance summary pane, click View attempt fields to see details of the fields associated with the selected attempt.
12. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.
13. You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.
For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see Data Pipeline Problems and Solutions (p. 131).
14. Click Back to list of pipelines to get back to the List Pipelines page.
Modify pipeline definition
If your pipeline is in the PENDING state, either your pipeline definition is incomplete or your pipeline failed the validation step that all pipelines go through before being saved. If your pipeline is active, you may need to change some aspect of it. However, when you modify the pipeline definition of an active pipeline, you must keep in mind the following rules:
• You cannot change the Default objects.
• You cannot change the schedule of an object.
• You cannot change the dependencies between objects.
• You cannot add, delete, or modify reference fields for existing objects; only non-reference fields are allowed.
• New objects cannot reference a previously existing object in the output field; only the input field is allowed.
Follow the steps in this section to either complete or modify your pipeline definition.
To modify your pipeline definition
1. On the List Pipelines page, in the Details column of your pipeline, click View instance details (or click View pipeline, if PENDING).
2. If your pipeline is SCHEDULED:
   a. On the Instance details: name of your pipeline page, click View pipeline.
   b. The Pipeline: name of your pipeline [The pipeline is active.] page opens. This is your pipeline definition page. As indicated in the title of the page, this pipeline is active.
3. To complete or modify your pipeline definition:
   a. On the Pipeline: name of your pipeline page, click the object panes in the right side panel and complete defining the objects and fields of your pipeline definition.
      Note
      If you are modifying an active pipeline, some fields are grayed out and inactive. You cannot modify those fields.
   b. Skip the next step and follow the steps to validate and save your pipeline definition.
4. To edit your pipeline definition:
   a. On the Pipeline: name of your pipeline page, click the Errors pane. The Errors pane lists the objects of your pipeline that failed validation.
   b. Click the plus (+) sign next to the object names and look for an error message in red.
   c. Click the object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.
To validate and save your pipeline definition
1. Click Save Pipeline. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message.
2. If you get an error message, click Close and then, on the right side panel, click Errors to see the objects that did not pass validation. Fix the errors and save. Repeat this step until your pipeline definition passes validation.
Activate and verify your pipeline
1. After you've saved your pipeline definition with no validation errors, click Activate.
2. To verify that your pipeline definition has been activated, click Back to list of pipelines.
3. On the List Pipelines page, check whether your newly-created pipeline is listed and the Status column displays SCHEDULED.
Delete a Pipeline
When you no longer require a pipeline, such as a pipeline created during application testing, you should delete it to remove it from active use. Deleting a pipeline puts it into a deleting state. When the pipeline is in the deleted state, its pipeline definition and run history are gone. Therefore, you can no longer perform operations on the pipeline, including describing it.
You can't restore a pipeline after you delete it, so be sure that you won’t need the pipeline in the future before you delete it.
To delete your pipeline
1. On the List Pipelines page, select the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.
Using the Command Line Interface
Topics
• Install the AWS Data Pipeline Command-Line Client (p. 122)
• Setting Credentials for the AWS Data Pipeline Command Line Interface (p. 122)
• Create a New Pipeline (p. 124)
• Retrieve Pipeline Details (p. 124)
• View Pipeline Versions (p. 125)
The AWS Data Pipeline Command-Line Client (CLI) is one of three ways to interact with AWS Data Pipeline. The other two are using the AWS Data Pipeline console (a graphical user interface) and calling the web service API directly.
Install the AWS Data Pipeline Command-Line Client
Install the AWS Data Pipeline Command-Line Client (CLI) as described in Install the Command Line Interface.
Command-Line Syntax
Use the AWS Data Pipeline CLI from your operating system's command-line prompt by typing the CLI tool name "datapipeline" followed by one or more parameters. However, the syntax of the command differs between Linux/Unix/Mac OS and Windows: Linux/Unix/Mac users must use the "./" prefix for the CLI command, and Windows users must specify "ruby" before the CLI command. For example, to view the CLI help text on Linux/Unix/Mac, the syntax is:
./datapipeline --help
To perform the same action on Windows, the syntax is:
ruby datapipeline --help
Other than the prefix, the AWS Data Pipeline CLI syntax is the same between operating systems.
Note
For brevity, we do not list all the operating-system syntax permutations for each example in this documentation. Instead, we refer to commands as in the following example.
datapipeline --help
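If you script against the CLI on more than one platform, a small wrapper can hide the prefix difference. This is only a sketch, assuming the CLI is installed in the current directory:

```shell
# Pick the invocation prefix for the datapipeline CLI based on the OS:
# Linux and Mac OS use the "./" prefix, while Windows prepends "ruby".
case "$(uname -s 2>/dev/null || echo Windows)" in
  Linux|Darwin) CLI="./datapipeline" ;;
  *)            CLI="ruby datapipeline" ;;
esac
echo "$CLI --help"
```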
Setting Credentials for the AWS Data Pipeline Command Line Interface
In order to connect to the AWS Data Pipeline web service to process your commands, the CLI needs the account details of an AWS account that has permissions to create and/or manage data pipelines. There are three ways to pass your credentials into the CLI:
• Implicitly, using a JSON file.
• Explicitly, by specifying a JSON file at the command line.
• Explicitly, by specifying credentials using a series of command-line options.
To Set Your Credentials Implicitly with a JSON File
• The easiest and most common way is implicitly, by creating a JSON file named credentials.json in either your home directory or the directory where the CLI is installed. For example, when you use the CLI on Windows, the folder may be c:\datapipeline-cli\amazon\datapipeline. When you do this, the CLI loads the credentials implicitly and you do not need to specify any credential information at the command line. Verify the credentials file syntax using the following example JSON file, where you replace the example access-id and private-key values with your own:
{
  "access-id": "AKIAIOSFODNN7EXAMPLE",
  "private-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  "endpoint": "datapipeline.us-east-1.amazonaws.com",
  "port": "443",
  "use-ssl": "true",
  "region": "us-east-1",
  "log-uri": "s3://myawsbucket/logfiles"
}
After setting your credentials file, test the CLI using the following command, which uses implicit credentials to call the CLI and display a list of all the data pipelines those credentials can access.
datapipeline --list-pipelines
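Before running the CLI, you can also confirm that your credentials.json is well-formed JSON. The following sketch writes the example file to a temporary directory and parses it; the key values are the documentation's placeholders, and in practice the file lives in your home directory or the CLI install directory:

```shell
# Write the example credentials file to a temporary directory
# (placeholder keys -- substitute your own AWS credentials).
DIR=$(mktemp -d)
cat > "$DIR/credentials.json" <<'EOF'
{
  "access-id": "AKIAIOSFODNN7EXAMPLE",
  "private-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  "endpoint": "datapipeline.us-east-1.amazonaws.com",
  "port": "443",
  "use-ssl": "true",
  "region": "us-east-1",
  "log-uri": "s3://myawsbucket/logfiles"
}
EOF

# A malformed file fails to parse here, before the CLI ever sees it.
python3 -m json.tool "$DIR/credentials.json" > /dev/null && echo "credentials.json OK"
```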
To Set Your Credentials Explicitly with a JSON File
• The next easiest way to pass in credentials is explicitly, using the --credentials option to specify the location of a JSON file. With this method, you’ll have to add the --credentials option to each command-line call. This can be useful if you are SSHing into a machine to run the command-line client remotely, or if you are testing various sets of credentials. For information about what to include in the JSON file, go to How to Format the JSON File.
For example, the following command explicitly uses the credentials stored in a JSON file to call the
CLI and display a list of all the data pipelines those credentials can access.
datapipeline --credentials /my-directory/my-credentials.json --list-pipelines
To Set Your Credentials Using Command-Line Options
• The final way to pass in credentials is to specify them using a series of options at the command line. This is the most verbose way to pass in credentials, but may be useful if you are scripting the CLI and want the flexibility of changing credential information without having to edit a JSON file. The options you'll use in this scenario are --access-key, --secret-key, and --endpoint.
For example, the following command explicitly uses credentials specified at the command line to call the CLI and display a list of all the data pipelines those credentials can access.
datapipeline --access-key my-access-key-id --secret-key my-secret-access-key --endpoint datapipeline.us-east-1.amazonaws.com --list-pipelines
In the preceding command, my-access-key-id would be replaced with your AWS Access Key ID, my-secret-access-key with your AWS Secret Access Key, and --endpoint would specify the endpoint the CLI should use when contacting the AWS Data Pipeline web service. For more information about how to locate your AWS security credentials, go to Locating Your AWS Security Credentials.
Note
Because you are passing in your credentials at every command-line call, you may wish to take additional security precautions to ensure the privacy of your command-line calls, such as clearing auto-complete when you are done with your terminal session. You also should not store this script in an unsecured file.
List Pipelines
A simple example that also helps you confirm that your credentials are set correctly is to get a list of the currently running pipelines using the --list-pipelines command. This command returns the names and identifiers of all pipelines that you have permission to access.
datapipeline --list-pipelines
This is how you get the ID of pipelines that you want to work with using the CLI, because many commands require you to specify the pipeline ID using the --id parameter.
For more information, see --list-pipelines (p. 221) .
Create a New Pipeline
The first step to create a new data pipeline is to define your data activities and their data dependencies using a pipeline definition file. The syntax and usage of the pipeline definition is described in Pipeline
Definition Language Reference in the AWS Data Pipeline Developer’s Guide.
Once you’ve written the details of the new pipeline using JSON syntax, save them to a text file with the extension .json. You’ll then specify this pipeline definition file as part of the input when creating the new pipeline.
After creating your pipeline definition file, you can create a new pipeline by calling the --create action of the AWS Data Pipeline CLI, as shown below.
datapipeline --create my-pipeline --put my-pipeline-file.json
If you leave off the --put option, as shown following, AWS Data Pipeline creates an empty pipeline. You can then use a subsequent --put call to attach a pipeline definition to the empty pipeline.
datapipeline --create pipeline_name
The --put parameter does not activate a pipeline by default. You must explicitly activate a pipeline before it will begin doing work, using the --activate command and specifying a pipeline ID as shown below.
datapipeline --activate --id pipeline_id
For more information about creating pipelines, see the --create and --put actions.
Retrieve Pipeline Details
Using the CLI, you can retrieve all the information about a pipeline, which includes the pipeline definition and the run attempt history of the pipeline components.
Retrieving the Pipeline Definition
To get the complete pipeline definition, use the --get command. The pipeline objects are returned in alphabetical order, not in the order they had in the pipeline definition file that you uploaded, and the slots for each object are also returned in alphabetical order.
You can specify an output file to receive the pipeline definition, but the default is to print the information to standard output (which is typically your terminal screen).
The following example prints the pipeline definition to a file named output.txt.
datapipeline --get --file output.txt --id df-00627471SOVYZEXAMPLE
The following example prints the pipeline definition to standard output (stdout).
datapipeline --get --id df-00627471SOVYZEXAMPLE
It's a good idea to retrieve the pipeline definition before you submit modifications, because it’s possible that another user or process changed the pipeline definition after you last worked with it. By downloading a copy of the current definition and using that as the basis for your modifications, you can be sure that you are working with the most recent pipeline definition.
It’s also a good idea to retrieve the pipeline definition again after you modify it to ensure that the update was successful.
For more information, see --get, --g (p. 219).
Retrieving the Pipeline Run History
To retrieve a history of the times that a pipeline has run, use the --list-runs command. This command has options that you can use to filter the number of runs returned based on either their current status or the date-range in which they were launched. Filtering the results is useful because, depending on the pipeline's age and scheduling, the run history can be very large.
This example shows how to retrieve information for all runs.
datapipeline --list-runs --id df-00627471SOVYZEXAMPLE
This example shows how to retrieve information for all runs that have completed.
datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --status finished
This example shows how to retrieve information for all runs launched in the specified time frame.
datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --start-interval "2011-09-02", "2011-09-11"
For more information, see --list-runs (p. 222).
View Pipeline Versions
There are two versions of a pipeline that you can view with the CLI. There is the "active" version, which is the version of the pipeline that is currently running. There is also the "latest" version, which is created as a copy of the "active" version when you edit a running pipeline.
When you upload the edited pipeline, it becomes the "active" version and the previous "active" version is no longer accessible. A new "latest" version is created if you edit the pipeline again, repeating the previously described cycle.
To retrieve a specific version of a pipeline, use the --version option with the --get command, specifying the version name of the pipeline. For example, the following command retrieves the "active" version of a pipeline.
datapipeline --get --version active --id df-00627471SOVYZEXAMPLE
For more information, see --get, --g (p. 219).
Modify a Pipeline
After you’ve created a pipeline, you may need to change some aspect of it. To do this, get the current pipeline definition and save it to a file, update the pipeline definition file, and upload the updated pipeline definition to AWS Data Pipeline using the --put command.
The following rules apply when you modify a pipeline definition:
• You cannot change the Default object
• You cannot change the schedule of an object
• You cannot change the dependencies between objects
• You cannot add, delete, or modify reference fields for existing objects; only non-reference fields can be changed
• A new object cannot reference a previously existing object in its output field; only input field references are allowed
It's a good idea to retrieve the pipeline definition before you submit modifications, because it’s possible that another user or process changed the pipeline definition after you last worked with it. By downloading a copy of the current definition and using that as the basis for your modifications, you can be sure that you are working with the most recent pipeline definition.
The following example prints the pipeline definition to a file named output.txt.
datapipeline --get --file output.txt --id df-00627471SOVYZEXAMPLE
Update your pipeline definition file and save it as my-updated-file.txt. The following example uploads the updated pipeline definition.
datapipeline --put my-updated-file.txt --id df-00627471SOVYZEXAMPLE
You can retrieve the pipeline definition using --get to ensure that the update was successful.
When you use --put to replace the pipeline definition file, the previous pipeline definition is completely replaced. Currently there is no way to change only a portion, such as a single object, of a pipeline; you must include all previously defined objects in the updated pipeline definition.
For more information, see --put (p. 224).
Delete a Pipeline
When you no longer require a pipeline, such as a pipeline created during application testing, you should delete it to remove it from active use. Deleting a pipeline puts it into a deleting state until the deletion completes. When the pipeline is in the deleted state, its pipeline definition and run history are gone, and you can no longer perform operations on the pipeline, including describing it.
You can't restore a pipeline after you delete it, so be sure that you won’t need the pipeline in the future before you delete it.
To delete a pipeline, use the --delete command, specifying the identifier of the pipeline. For example, the following command deletes a pipeline.
datapipeline --delete --id df-00627471SOVYZEXAMPLE
For more information, see --delete (p. 218)
.
Troubleshoot AWS Data Pipeline
Topics
• Proactively Monitor Your Pipeline (p. 128)
• Verify Your Pipeline Status (p. 129)
• Interpret Pipeline Status Details (p. 129)
• Error Log Locations (p. 130)
• AWS Data Pipeline Problems and Solutions (p. 131)
When you have a problem while using AWS Data Pipeline, the most common symptom is that a pipeline won't run. Since there are several possible causes, this topic explains how to track the status of your AWS Data Pipeline pipelines, get notifications when problems occur, and gather more information. After you have enough information to narrow the list of potential problems, this topic guides you to solutions. To get the most benefit from these troubleshooting steps and scenarios, you should use the console or CLI to gather the required information.
Proactively Monitor Your Pipeline
The best way to detect problems is to proactively monitor your pipelines from the start. You can configure pipeline components to inform you of certain situations or events, such as when a pipeline component fails or doesn't begin by its scheduled start time. AWS Data Pipeline makes it easy to configure notifications using Amazon SNS.
Using the AWS Data Pipeline CLI, you can configure a pipeline component to send Amazon SNS notifications on failures. Add the following code to your pipeline definition JSON file. This example also demonstrates how to use the AWS Data Pipeline expression language to insert details about the specific execution attempt denoted by the #{node.interval.start} and #{node.interval.end} variables:
Note
You must create an Amazon SNS topic to use for the Topic ARN value in the following example.
For more information, see the Create a Topic documentation at http://docs.aws.amazon.com/sns/latest/gsg/CreateTopic.html.
{
"id" : "FailureNotify",
"type" : "SnsAlarm",
"subject" : "Failed to run pipeline component",
"message": "Error for interval #{node.interval.start}..#{node.interval.end}.",
"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
},
You must also associate the notification with the pipeline component that you want to monitor, as shown in the following example. In this example, using the onFail action, the component sends a notification if a file doesn't exist in an Amazon S3 bucket:
{
"id": "S3Data",
"type": "S3DataNode",
"schedule": { "ref": "MySchedule" },
"filePath": "s3://mys3bucket/file.txt",
"precondition": {"ref":"ExampleCondition"},
"onFail": {"ref":"FailureNotify"}
},
Using the steps in the previous examples, you can also use the onLateAction and onSuccess pipeline component fields to notify you when a pipeline component has not been scheduled on time or has succeeded, respectively. You should configure notifications for any critical tasks in a pipeline; if you add notifications to the Default object in a pipeline, they automatically apply to all components in that pipeline. Pipeline components get the ability to send notifications through their IAM roles. Do not modify the default IAM roles unless your situation demands it; otherwise, notifications may not work.
Verify Your Pipeline Status
When you notice a problem with a pipeline, check the status of your pipeline components using the console or CLI and look for error messages.
To locate your pipeline ID using the CLI, run this command:
datapipeline --list-pipelines
After you have the pipeline ID, view the status of the pipeline components using the CLI. In this example, replace the example pipeline ID with your own:
datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE
On the list of pipeline components, look at the status column of each component and pay special attention to any components that indicate a status of FAILED, WAITING_FOR_RUNNER, or CANCELLED.
Additionally, look at the Scheduled Start column and match it with a corresponding value for the Actual Start column to ensure that the tasks occur with the timing that you expect.
Interpret Pipeline Status Details
The various status levels displayed in the AWS Data Pipeline console and CLI indicate the condition of a pipeline and its components. Pipelines have a SCHEDULED status if they have passed validation and are ready, currently performing work, or done with their work. PENDING status means the pipeline is not able to perform work for some reason; for example, the pipeline definition might be incomplete or might
have failed the validation step that all pipelines go through before activation. The pipeline status is simply an overview of a pipeline; to see more information, view the status of individual pipeline components.
You can do this by clicking through a pipeline in the console or retrieving pipeline component details using the CLI.
A pipeline component has the following available status values:
CHECKING_PRECONDITIONS
The component is checking to ensure that all its default and user-configured preconditions are met before performing its work.
WAITING_FOR_RUNNER
The component is waiting for its worker client to retrieve it as a work item. The component and worker client relationship is controlled by the runsOn or workerGroup field defined by that component.
CREATING
The component or resource is in the process of being started, such as an Amazon EC2 instance.
VALIDATING
The pipeline definition is in the process of being validated by AWS Data Pipeline.
RUNNING
The resource is running and ready to receive work.
CANCELLED
The component was pre-emptively cancelled by a user or AWS Data Pipeline before it could run.
This can happen automatically when a failure occurs in a different component or resource that this component depends on.
PAUSED
The component has been paused and is not currently performing work.
FINISHED
The component has completed its assigned work.
SHUTTING_DOWN
The resource is shutting down after successfully performing its defined work.
FAILED
The component or resource encountered an error and stopped working. When a component or resource fails, it can cause cancellations and failures to cascade to other components that depend on it.
Error Log Locations
This section explains the various logs that AWS Data Pipeline writes that you can use to determine the source of certain failures and errors.
Task Runner Logs
Task Runner writes a log file named TaskRunner.log on the local computer where it runs, in the <AmazonDataPipeline_location>/output/logs directory, where <AmazonDataPipeline_location> is the directory where you extracted the AWS Data Pipeline CLI tools. In this directory, Task Runner also creates several nested directories that are named after the pipeline ID that it ran, with subdirectories for the year, month, day, and attempt number in the format <pipeline ID>/<year>/<month>/<day>/<pipeline object attempt ID_Attempt=X>. In these folders, Task Runner writes three files:
• <Pipeline Attempt ID>_Attempt_<number>_main.log.gz - This archive logs the step-by-step execution of Task Runner work items (both succeeded and failed) along with any error messages that were generated.
• <Pipeline Attempt ID>_Attempt_<number>_stderr.log.gz - This archive logs only error messages that occurred while Task Runner processed tasks.
• <Pipeline Attempt ID>_Attempt_<number>_stdout.log.gz - This log provides any standard output text if provided by certain tasks.
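The directory layout above can be sketched as a small path builder. This is a hypothetical helper for illustration only; Task Runner itself creates these directories, and you should confirm the exact attempt-directory naming against your own log output.

```python
def task_runner_log_dir(base_dir, pipeline_id, year, month, day, attempt_dir):
    """Build the nested log path described above:
    <base>/output/logs/<pipeline ID>/<year>/<month>/<day>/<attempt dir>.

    Hypothetical helper; Task Runner creates these directories itself.
    """
    return "/".join([base_dir, "output", "logs",
                     pipeline_id, year, month, day, attempt_dir])
```

A helper like this is handy when scripting log collection across many attempts of the same pipeline.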
Pipeline Logs
You can configure pipelines to create log files in a location, such as in the following example where you use the Default object in a pipeline to cause all pipeline components to use that log location by default (you can override this by configuring a log location in a specific pipeline component).
To configure the log location using the AWS Data Pipeline CLI in a pipeline JSON file, begin your pipeline file with the following text:
{ "objects": [
{
"id":"Default",
"logUri":"s3://mys3bucket/error_logs"
},
...
After you configure a pipeline log directory, Task Runner creates a copy of the logs in your directory, with the same formatting and file names described in the previous section about Task Runner logs.
AWS Data Pipeline Problems and Solutions
This topic provides various symptoms of AWS Data Pipeline problems and the recommended steps to solve them.
Pipeline Stuck in Pending Status
A pipeline that appears stuck in the PENDING status indicates a fundamental error in the pipeline definition.
Ensure that you did not receive any errors when you submitted your pipeline using the AWS Data Pipeline
CLI or when you attempted to save or activate your pipeline using the AWS Data Pipeline console.
Additionally, check that your pipeline has a valid definition.
To view your pipeline definition on the screen using the CLI:
datapipeline --get --id df-EXAMPLE_PIPELINE_ID
Ensure that the pipeline definition is complete, check your closing braces, verify required commas, and check for missing references and other syntax errors. It is best to use a text editor that can visually validate the syntax of JSON files.
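As a complement to a JSON-aware editor, a basic structural check can be scripted before you submit a definition. This is a minimal sketch, not part of the AWS Data Pipeline CLI; it only verifies that the file parses and that each object carries the id and type fields this guide describes (the Default object is exempted from the type check).

```python
import json

def precheck_pipeline(text):
    """Report basic structural problems in a pipeline definition file.

    Illustrative pre-submission check only; the service performs the
    authoritative validation when you submit the pipeline.
    """
    try:
        definition = json.loads(text)
    except ValueError as e:
        return [f"not valid JSON: {e}"]
    objects = definition.get("objects")
    if not isinstance(objects, list):
        return ['missing top-level "objects" list']
    errors = []
    for i, obj in enumerate(objects):
        for field in ("id", "type"):
            # The Default object carries only shared fields, so skip its type.
            if obj.get("id") != "Default" and field not in obj:
                errors.append(f"object {i} is missing '{field}'")
    return errors
```

Running this over a definition file before calling --put catches the brace, comma, and missing-field mistakes that most often leave a pipeline stuck in PENDING.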
Pipeline Component Stuck in Waiting for Runner Status
If your pipeline is in the SCHEDULED state and one or more tasks appear stuck in the
WAITING_FOR_RUNNER state, ensure that you set a valid value for either the runsOn or workerGroup fields for those tasks. If both values are empty or missing, the task cannot start because there is no association between the task and a worker to perform the tasks. In this situation, you've defined work but haven't defined what computer will do that work. If applicable, verify that the workerGroup value assigned
to the pipeline component is exactly the same name and case as the workerGroup value that you configured for Task Runner.
Another potential cause of this problem is that the endpoint and access key provided to Task Runner are not the same as those used with the AWS Data Pipeline console or the computer where the AWS Data Pipeline CLI tools are installed. You might have created new pipelines with no visible errors, but Task Runner polls the wrong location due to the difference in credentials, or polls the correct location with insufficient permissions to identify and run the work specified by the pipeline definition.
Pipeline Component Stuck in Checking Preconditions Status
If your pipeline is in the SCHEDULED state and one or more tasks appear stuck in the
CHECKING_PRECONDITIONS state, make sure your pipeline's initial preconditions have been met. If the preconditions of the first object in the logic chain are not met, none of the objects that depend on that first object will be able to move out of the CHECKING_PRECONDITIONS state.
For example, consider the following excerpt from a pipeline definition. In this case, the InputData object has a precondition 'Ready' specifying that the data must exist before the InputData object is complete. If the data does not exist, the InputData object remains in the CHECKING_PRECONDITIONS state, waiting for the data specified by the path field to become available. Any objects that depend on InputData likewise remain in a CHECKING_PRECONDITIONS state waiting for the InputData object to reach the FINISHED state.
{
"id": "InputData",
"type": "S3DataNode",
"filePath": "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
"schedule":{"ref":"MySchedule"},
"precondition": "Ready"
},
{
"id": "Ready",
"type": "Exists"
...
Also, check that your objects have the proper permissions to access the data. In the preceding example, if the information in the credentials field did not have permissions to access the data specified in the path field, the InputData object would get stuck in a CHECKING_PRECONDITIONS state because it cannot access the data specified by the path field, even if that data exists.
Run Doesn't Start When Scheduled
Check that you have properly specified the dates in your schedule objects and that the startDateTime and endDateTime values are in UTC format, such as in the following example:
{
"id": "MySchedule",
"startDateTime": "2012-11-12T19:30:00",
"endDateTime":"2012-11-12T20:30:00",
"period": "1 Hour",
"type": "Schedule"
},
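A quick sanity check of a Schedule object's dates can be scripted. This is an illustrative sketch only: it handles the zone-less UTC timestamp format shown above and just the "N Hour(s)" period form used in the example; the service itself validates schedules.

```python
from datetime import datetime, timedelta

def check_schedule(schedule):
    """Sanity-check a Schedule object's dates as described above.

    Illustrative sketch: parses the zone-less UTC timestamp format and
    only "N Hour(s)" periods; anything else raises ValueError.
    """
    fmt = "%Y-%m-%dT%H:%M:%S"  # UTC, no zone suffix, per the example above
    start = datetime.strptime(schedule["startDateTime"], fmt)
    end = datetime.strptime(schedule["endDateTime"], fmt)
    count, unit = schedule["period"].split()
    if unit.rstrip("s").lower() != "hour":
        raise ValueError("only hourly periods handled in this sketch")
    period = timedelta(hours=int(count))
    # The window must move forward and fit at least one period.
    return end > start and (end - start) >= period
```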
Pipeline Components Run in Wrong Order
You might notice that the start and end times for your pipeline components are running in the wrong order, or in a different sequence than you expect. It is important to understand that pipeline components can start running simultaneously if their preconditions are met at start-up time. In other words, pipeline components do not execute sequentially by default; if you need a specific execution order, you must control the execution order with preconditions and dependsOn fields. Verify that you are using the dependsOn field populated with a reference to the correct prerequisite pipeline components, and that all the necessary pointers between components are present to achieve the order you require.
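One way to reason about the ordering described above is to treat dependsOn references as a dependency graph and compute a valid start order. This is an illustrative sketch, not how AWS Data Pipeline schedules work internally; it assumes a single dependsOn reference per component for simplicity (the service supports richer dependency forms).

```python
from collections import deque

def execution_order(components):
    """Return one valid start order implied by dependsOn references.

    Components are dicts like {"id": ..., "dependsOn": {"ref": ...}},
    mirroring the field names used in this guide. Sketch only: absent
    dependency links, components are free to start in any order.
    """
    deps = {}
    for comp in components:
        ref = comp.get("dependsOn", {}).get("ref")
        deps[comp["id"]] = {ref} if ref else set()
    order = []
    ready = deque(sorted(cid for cid, d in deps.items() if not d))
    while ready:
        cid = ready.popleft()
        order.append(cid)
        # Release any component whose last remaining dependency just ran.
        for other, d in sorted(deps.items()):
            if cid in d:
                d.remove(cid)
                if not d and other not in order and other not in ready:
                    ready.append(other)
    return order
```

If a component never appears in the returned order, its dependency chain contains a missing or circular reference, which is exactly the kind of pointer problem to look for here.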
EMR Cluster Fails With Error: The security token included in the request is invalid
Insufficient Permissions to Access Resources
Permissions that you set on IAM roles determine whether AWS Data Pipeline can access your EMR clusters and EC2 instances to run your pipelines. Additionally, IAM provides the concept of trust relationships that go further to allow creation of resources on your behalf. For example, when you create a pipeline that uses an EC2 instance to run a command to move data, AWS Data Pipeline can provision this EC2 instance for you. If you encounter problems, especially those involving resources that you can access manually but AWS Data Pipeline cannot, verify your IAM roles, policies, and trust relationships as described in
Granting Permissions to Pipelines with IAM (p. 21) .
Creating a Pipeline Causes a Security Token Error
You receive the following error when you try to create a pipeline:
Failed to create pipeline with 'pipeline_name'. Error: UnrecognizedClientException - The security token included in the request is invalid.
Cannot See Pipeline Details in the Console
The AWS Data Pipeline console pipeline filter applies to the scheduled start date for a pipeline, without regard to when the pipeline was submitted. It is possible to submit a new pipeline using a scheduled start date that occurs in the past, which the default date filter may not show. To see the pipeline details, change your date filter to ensure that the scheduled pipeline start date fits within the date range filter.
Error in remote runner Status Code: 404, AWS
Service: Amazon S3
This error means that Task Runner could not access your files in Amazon S3. Verify that:
• Your credentials are set correctly
• The Amazon S3 bucket that you are trying to access exists
• You are authorized to access the Amazon S3 bucket
Access Denied - Not Authorized to Perform Function datapipeline:
In the Task Runner logs, you may see an error that is similar to the following:
• ERROR Status Code: 403
• AWS Service: DataPipeline
• AWS Error Code: AccessDenied
• AWS Error Message: User: arn:aws:sts::XXXXXXXXXXXX:federated-user/i-XXXXXXXX is not authorized to perform: datapipeline:PollForWork.
Note
In this error message, PollForWork may be replaced with the names of other AWS Data Pipeline permissions.
This error message indicates that the IAM role you specified needs additional permissions necessary to interact with AWS Data Pipeline. Ensure that your IAM role policy contains the following lines, where
PollForWork is replaced with the name of the permission you want to add (use * to grant all permissions):
{
"Action": [ "datapipeline:PollForWork" ],
"Effect": "Allow",
"Resource": ["*"]
}
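To see how the wildcard form mentioned above behaves, a toy matcher can be sketched. This is not how IAM evaluates policies (real evaluation also considers Deny statements, resources, and conditions); it only illustrates Action matching with shell-style wildcards.

```python
import fnmatch

def policy_allows(statement, action):
    """Check whether a single Allow statement covers an action.

    Toy matcher for illustration only: real IAM evaluation also applies
    Deny statements, Resource constraints, and Conditions.
    """
    if statement.get("Effect") != "Allow":
        return False
    return any(fnmatch.fnmatch(action, pattern)
               for pattern in statement.get("Action", []))
```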
Pipeline Definition Files
Topics
• Creating Pipeline Definition Files (p. 135)
• Example Pipeline Definitions (p. 139)
• Expression Evaluation (p. 155)
The AWS Data Pipeline web service receives a pipeline definition file as input. This file specifies objects for the data nodes, activities, schedules, and computational resources for the pipeline.
Creating Pipeline Definition Files
To create a pipeline definition file, you can use either the AWS Data Pipeline console interface or a text editor that supports saving files using the UTF-8 file format.
This topic describes creating a pipeline definition file using a text editor.
Topics
• Prerequisites (p. 135)
• General Structure of a Pipeline Definition File (p. 136)
• Pipeline Objects (p. 136)
• Pipeline Fields (p. 136)
• User-Defined Fields (p. 137)
• Expressions (p. 138)
• Saving the Pipeline Definition File (p. 139)
Prerequisites
Before you create your pipeline definition file, you should determine the following:
• Objectives and tasks you need to accomplish
• Location and format of your source data (data nodes) and how often you update them
• Calculations or changes to the data (activities) you need
• Dependencies and checks (preconditions) that indicate when tasks are ready to run
• Frequency (schedule) you need for the pipeline to run
• Validation tests to confirm your data reached the destination
• How you want to be notified about success and failure
• Performance, volume, and runtime goals that suggest using other AWS services like EMR to process your data
General Structure of a Pipeline Definition File
The first step in pipeline creation is to compose pipeline definition objects in a pipeline definition file. The following example illustrates the general structure of a pipeline definition file. This file defines two objects, which are delimited by '{' and '}', and separated by a comma. The first object defines two name-value pairs, known as fields. The second object defines three fields.
{
"objects" : [
{
"name1" : "value1",
"name2" : "value2"
},
{
"name1" : "value3",
"name3" : "value4",
"name4" : "value5"
}
]
}
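Because a pipeline definition file is ordinary JSON, the structure above can be loaded with any JSON parser. This short sketch just confirms the two objects and their field counts.

```python
import json

# The general-structure example from above, verbatim.
definition = """
{
  "objects" : [
    { "name1" : "value1", "name2" : "value2" },
    { "name1" : "value3", "name3" : "value4", "name4" : "value5" }
  ]
}
"""

parsed = json.loads(definition)      # a pipeline definition is plain JSON
objects = parsed["objects"]          # the top-level list of pipeline objects
field_counts = [len(obj) for obj in objects]
```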
Pipeline Objects
When creating a pipeline definition file, you must select the types of pipeline objects that you'll need, add them to the pipeline definition file, and then add the appropriate fields.
For example, you could create a pipeline definition object for an input data node and another for the output data node. Then create another pipeline definition object for an activity, such as processing the input data using Amazon EMR.
Pipeline Fields
After you know which object types to include in your pipeline definition file, you add fields to the definition of each pipeline object. Field names are enclosed in quotes, and are separated from field values by a space, a colon, and a space, as shown in the following example.
"name" : "value"
The field value can be a text string, a reference to another object, a function call, an expression, or an ordered list of any of the preceding types. For more information about the types of data that can be used for field values, and about the functions that you can use to evaluate them, see Expression Evaluation (p. 155).
Fields are limited to 2048 characters. Objects can be 20 KB in size, which means that you can't add many large fields to an object.
Each pipeline object must contain the following fields: id and type, as shown in the following example. Other fields may also be required based on the object type. Select a value for id that's meaningful to you, and is unique within the pipeline definition. The value for type specifies the type of the object.
{
"id": "MyCopyToS3",
"type": "CopyActivity"
}
For more information about the required and optional fields for each object, see the documentation for the object.
To include fields from one object in another object, use the parent field with a reference to the object.
For example, object "B" includes its fields, "B1" and "B2", plus the fields from object "A", "A1" and "A2".
{
"id" : "A",
"A1" : "value",
"A2" : "value"
},
{
"id" : "B",
"parent" : {"ref" : "A"},
"B1" : "value",
"B2" : "value"
}
You can define common fields in an object named "Default". These fields are automatically included in every object in the pipeline definition file that doesn't explicitly set its parent field to reference a different object.
{
"id" : "Default",
"onFail" : {"ref" : "FailureNotification"},
"maximumRetries" : "3",
"workerGroup" : "myWorkerGroup"
}
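The parent and Default inheritance described above can be sketched as a field merge. This helper is illustrative only; the real merge is performed by the AWS Data Pipeline web service, and it assumes (not stated in this guide) that a child's own fields override inherited ones.

```python
def effective_fields(obj, objects_by_id):
    """Compute the fields an object ends up with after parent/Default
    inheritance, as described above.

    Sketch only: the service performs the real merge; here the child's
    own fields win over inherited ones, which is an assumption.
    """
    parent_ref = obj.get("parent", {}).get("ref")
    # Objects with no explicit parent implicitly inherit from "Default".
    if parent_ref is None and obj["id"] != "Default" and "Default" in objects_by_id:
        parent_ref = "Default"
    fields = {}
    if parent_ref:
        fields.update(effective_fields(objects_by_id[parent_ref], objects_by_id))
    fields.update({k: v for k, v in obj.items() if k not in ("id", "parent")})
    return fields
```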
User-Defined Fields
You can create user-defined or custom fields on your pipeline components and refer to them with expressions. The following example shows a custom field named "myCustomField" and a reference field named "my_customFieldReference" added to an S3DataNode:
{
"id": "S3DataInput",
"type": "S3DataNode",
"schedule": {"ref": "TheSchedule"},
"filePath": "s3://bucket_name",
"myCustomField": "This is a custom value in a custom field.",
"my_customFieldReference": {"ref":"AnotherPipelineComponent"}
},
A custom field must have a name prefixed with the word "my" in all lower-case letters, followed by a capital letter or underscore character, such as "myCustomField" in the preceding example. A user-defined field can be either a string value or a reference to another pipeline component, as shown by "my_customFieldReference" in the preceding example.
Note
On user-defined fields, AWS Data Pipeline only checks for valid references to other pipeline components, not any custom field string values that you add.
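The naming rule above can be captured in a small check. This is one reading of the rule, for illustration; the service performs its own validation of user-defined fields.

```python
import re

# One reading of the rule above: "my" in lower case, then a capital
# letter or an underscore. Illustrative only; the service validates
# user-defined field names itself.
CUSTOM_FIELD = re.compile(r"^my[A-Z_]")

def is_custom_field_name(name):
    return bool(CUSTOM_FIELD.match(name))
```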
Expressions
Expressions enable you to share a value across related objects. Expressions are processed by the AWS
Data Pipeline web service at runtime, ensuring that all expressions are substituted with the value of the expression.
Expressions are delimited by "#{" and "}". You can use an expression in any pipeline definition object where a string is legal.
The following expression calls one of the AWS Data Pipeline functions. For more information, see Expression Evaluation (p. 155).
#{format(myDateTime,'YYYY-MM-dd hh:mm:ss')}
Referencing Fields and Objects
To reference a field on the current object in an expression, use the node keyword. This keyword is available with alarm and precondition objects.
In the following example, the path field references the id field in the same object to form a file name. The value of path evaluates to "s3://mybucket/ExampleDataNode.csv".
{
"id" : "ExampleDataNode",
"type" : "S3DataNode",
"schedule" : {"ref" : "ExampleSchedule"},
"filePath" : "s3://mybucket/#{node.filePath}.csv",
"precondition" : {"ref" : "ExampleCondition"},
"onFail" : {"ref" : "FailureNotify"}
}
You can use an expression to reference objects that include another object, such as an alarm or precondition object, using the node keyword. For example, the precondition object "ExampleCondition" is referenced by the previously described "ExampleDataNode" object, so "ExampleCondition" can reference field values of "ExampleDataNode" using the node keyword. In the following example, the value of path evaluates to "s3://mybucket/ExampleDataNode.csv".
{
"id" : "ExampleCondition",
"type" : "Exists"
}
Note
You can create pipelines that have dependencies, such as tasks in your pipeline that depend on the work of other systems or tasks. If your pipeline requires certain resources, add those dependencies to the pipeline using preconditions that you associate with data nodes and tasks so your pipelines are easier to debug and more resilient. Additionally, keep your dependencies within a single pipeline when possible, because cross-pipeline troubleshooting is difficult.
As another example, you can use an expression to refer to the date and time range created by a Schedule object. In the following example, the message field uses the @scheduledStartTime and @scheduledEndTime runtime fields from the Schedule object that is referenced by the data node or activity that references this object in its onFail field.
{
"id" : "FailureNotify",
"type" : "SnsAlarm",
"subject" : "Failed to run pipeline component",
"message": "Error for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",
"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
},
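One way to picture the node keyword: the alarm's message template is evaluated against the runtime fields of the object that referenced the alarm. The sketch below is a toy illustration under that assumption; the runtime field values are hypothetical:

```python
import re

def render_message(template: str, parent_runtime_fields: dict) -> str:
    """Resolve #{node.@field} markers using the referencing object's runtime fields."""
    return re.sub(
        r"#\{node\.(@\w+)\}",
        lambda m: parent_runtime_fields[m.group(1)],
        template,
    )

# Hypothetical runtime fields of the data node that referenced FailureNotify.
runtime = {"@scheduledStartTime": "2012-06-13T10:00:00",
           "@scheduledEndTime": "2012-06-13T11:00:00"}

template = "Error for interval #{node.@scheduledStartTime}..#{node.@scheduledEndTime}."
print(render_message(template, runtime))
# Error for interval 2012-06-13T10:00:00..2012-06-13T11:00:00.
```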
Saving the Pipeline Definition File
After you have completed the pipeline definition objects in your pipeline definition file, save the pipeline definition file using UTF-8 encoding.
You must submit the pipeline definition file to the AWS Data Pipeline web service. There are two primary ways to submit a pipeline definition file: the AWS Data Pipeline command line interface and the AWS Data Pipeline console.
Example Pipeline Definitions
This section contains a collection of example pipelines that you can use for various scenarios once you are familiar with AWS Data Pipeline. For more detailed, step-by-step instructions for creating and using pipelines, we recommend that you read one or more of the detailed tutorials in this guide, for example Tutorial: Copy Data From a MySQL Table to Amazon S3 (p. 40) and Import/Export Data in Amazon DynamoDB With Amazon EMR and Hive (p. 69).
Copy SQL Data to a CSV File in Amazon S3
This example pipeline definition shows how to set a precondition, or dependency, on the existence of data gathered within a specific hour before copying the rows from a table in a SQL database to a CSV (comma-separated values) file in an Amazon S3 bucket. The prerequisites and steps in this example are based on a MySQL database and table created using Amazon RDS.
Prerequisites
To set up and test this example pipeline definition, see Get Started with Amazon RDS to complete the following steps:
1. Sign up for Amazon RDS.
2. Launch a MySQL DB instance.
3. Authorize access.
4. Connect to the MySQL DB instance.
Note
The database name in this example, mydatabase, is the same as the one in the Amazon RDS Getting Started Guide.
After you connect to your MySQL DB instance, use the MySQL command line client to do the following:
1. Use your test MySQL database and create a table named adEvents.

USE myDatabase;
CREATE TABLE IF NOT EXISTS adEvents (eventTime DATETIME, eventId INT, siteName VARCHAR(100));
2. Insert test data values into the newly created table named adEvents.

INSERT INTO adEvents (eventTime, eventId, siteName) values ('2012-06-17 10:00:00', 100, 'Sports');
INSERT INTO adEvents (eventTime, eventId, siteName) values ('2012-06-17 10:00:00', 200, 'News');
INSERT INTO adEvents (eventTime, eventId, siteName) values ('2012-06-17 10:00:00', 300, 'Finance');
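To see the kind of hourly-window query that the pipeline's selectQuery performs against this table, here is a sketch using Python's sqlite3 in place of MySQL. The table layout follows the CREATE TABLE above; the literal window bounds stand in for #{@scheduledStartTime} and #{@scheduledEndTime}:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adEvents (eventTime TEXT, eventId INT, siteName TEXT)")
rows = [("2012-06-17 10:00:00", 100, "Sports"),
        ("2012-06-17 10:00:00", 200, "News"),
        ("2012-06-17 11:30:00", 300, "Finance")]  # this row falls outside the window
conn.executemany("INSERT INTO adEvents VALUES (?, ?, ?)", rows)

# Same shape as the pipeline's selectQuery: a half-open hourly window.
window = conn.execute(
    "SELECT eventId FROM adEvents WHERE eventTime >= ? AND eventTime < ?",
    ("2012-06-17 10:00:00", "2012-06-17 11:00:00"),
).fetchall()
print(window)  # [(100,), (200,)]
```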
Example Pipeline Definition
The eight pipeline definition objects in this example have the following purposes:
• A default object with values that can be used by all subsequent objects
• A precondition object that resolves to true when the data exists in a referencing object
• A schedule object that specifies beginning and ending dates and duration or period of time in a referencing object
• A data input or source object with MySQL connection information, query string, and referencing to the precondition and schedule objects
• A data output or destination object pointing to your specified Amazon S3 bucket
• An activity object for copying from MySQL to Amazon S3
• Amazon SNS notification objects used for signaling success and failure of a referencing activity object
{
"objects" : [
{
"id" : "Default",
"onFail" : {"ref" : "FailureNotify"},
"onSuccess" : {"ref" : "SuccessNotify"},
"maximumRetries" : "3",
"workerGroup" : "myWorkerGroup"
},
{
"id" : "Ready",
"type" : "Exists"
},
{
"id" : "CopyPeriod",
"type" : "Schedule",
"startDateTime" : "2012-06-13T10:00:00",
"endDateTime" : "2012-06-13T11:00:00",
"period" : "1 hour"
},
{
"id" : "SqlTable",
"type" : "MySqlDataNode",
"schedule" : {"ref" : "CopyPeriod"},
"table" : "adEvents",
"username": "
user_name
",
"*password": "
my_password
",
"connectionString": "jdbc:mysql:/
/mysqlinstance
-rds.example.us-east-
1.rds.amazonaws.com:3306/
database_name
",
"selectQuery" : "select * from #{table} where eventTime >= '#{@scheduled
StartTime.format('YYYY-MM-dd HH:mm:ss')}' and eventTime < '#{@scheduledEnd
Time.format('YYYY-MM-dd HH:mm:ss')}'",
"precondition" : {"ref" : "Ready"}
},
{
"id" : "OutputData",
"type" : "S3DataNode",
"schedule" : {"ref" : "CopyPeriod"},
"filePath" : "s3://S3BucketNameHere/#{@scheduledStartTime}.csv"
},
{
"id" : "mySqlToS3",
"type" : "CopyActivity",
"schedule" : {"ref" : "CopyPeriod"},
"input" : {"ref" : "SqlTable"},
"output" : {"ref" : "OutputData"},
"onSuccess" : {"ref" : "SuccessNotify"}
},
{
"id" : "SuccessNotify",
"type" : "SnsAlarm",
"subject" : "Pipeline component succeeded",
"message": "Success for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",
"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
},
{
"id" : "FailureNotify",
"type" : "SnsAlarm",
"subject" : "Failed to run pipeline component",
"message": "Error for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",
"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
}
]
}
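CopyActivity's MySQL-to-Amazon S3 copy amounts to writing query rows out as CSV. A rough local equivalent with Python's csv module; the rows are hypothetical values of the kind the selectQuery above would return:

```python
import csv
import io

# Rows as they might come back from the selectQuery.
rows = [("2012-06-17 10:00:00", 100, "Sports"),
        ("2012-06-17 10:00:00", 200, "News")]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)

# This CSV text is what would be uploaded to the S3DataNode's filePath.
print(buffer.getvalue())
```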
Launch an Amazon EMR Job Flow
This example pipeline definition provisions an Amazon EMR cluster and runs a job step on that cluster one time a day, governed by the existence of the specified Amazon S3 path.
Example Pipeline Definition
The value for workerGroup
should match the value that you specified for Task Runner.
Replace myOutputPath
, myLogPath
.
{
"objects" : [
{
"id" : "Default",
"onFail" : {"ref" : "FailureNotify"},
"maximumRetries" : "3",
"workerGroup: "myWorkerGroup"
},
{
"id" : "Daily",
"type" : "Schedule",
"period" : "1 day",
"startDateTime: "2012-06-26T00:00:00",
"endDateTime" : "2012-06-27T00:00:00"
},
{
"id" : "InputData",
"type" : "S3DataNode",
"schedule" : {"ref" : "Daily"},
"filePath" : "s3://myBucket/#{@scheduledEndTime.format('YYYY-MM-dd')}",
"precondition" : {"ref" : "Ready"}
},
{
"id" : "Ready",
"type" : "S3DirectoryNotEmpty",
"prefix" : "#{node.filePath}",
},
{
"id" : "MyCluster",
"type" : "EmrCluster",
"masterInstanceType" : "m1.small",
"schedule" : {"ref" : "Daily"},
"enableDebugging" : "true",
"logUri": "s3://myLogPath/logs"
},
{
"id" : "MyEmrActivity",
"type" : "EmrActivity",
"input" : {"ref" : "InputData"},
"schedule" : {"ref" : "Daily"},
"onSuccess" : "SuccessNotify",
"runsOn" : {"ref" : "MyCluster"},
"preStepCommand" : "echo Starting #{id} for day #{@scheduledStartTime}
>> /tmp/stepCommand.txt",
"postStepCommand" : "echo Ending #{id} for day #{@scheduledStartTime} >>
/tmp/stepCommand.txt",
"step" : "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-in put,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://myOutputPath/word count/output/,-mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,reducer,aggregate"
},
{
"id" : "SuccessNotify",
"type" : "SnsAlarm",
"subject" : "Pipeline component succeeded",
"message": "Success for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",
"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
},
{
"id" : "FailureNotify",
"type" : "SnsAlarm",
"subject" : "Failed to run pipeline component",
"message": "Error for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",
"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
}
]
}
Run a Script on a Schedule
This example pipeline definition runs an arbitrary command line script on a time-based schedule. It is a time-based schedule rather than a dependency-based schedule: 'MyProcess' is scheduled to run based on clock time, not on the availability of data sources that are inputs to 'MyProcess'. The schedule object 'Period' defines a schedule for activity 'MyProcess' such that 'MyProcess' executes every hour, beginning at startDateTime. The interval could instead be minutes, hours, days, weeks, or months by changing the period field of the 'Period' object.
Note
When a schedule's startDateTime is in the past, AWS Data Pipeline backfills your pipeline and begins scheduling runs immediately, beginning at startDateTime. For testing and development, use a relatively short startDateTime..endDateTime interval. Otherwise, AWS Data Pipeline attempts to queue up and schedule all runs of your pipeline for that interval.
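The backfill behavior is easy to reason about: the number of runs AWS Data Pipeline queues is roughly the interval divided by the period. A quick sketch of that arithmetic, using the example's dates:

```python
from datetime import datetime, timedelta

def scheduled_runs(start: datetime, end: datetime, period: timedelta) -> int:
    """Number of period-sized runs that fit in the half-open interval [start, end)."""
    return int((end - start) / period)

# The one-hour window from the example yields a single run...
print(scheduled_runs(datetime(2012, 1, 13, 20), datetime(2012, 1, 13, 21),
                     timedelta(hours=1)))  # 1

# ...while a month-old startDateTime would backfill hundreds of hourly runs.
print(scheduled_runs(datetime(2012, 1, 1), datetime(2012, 1, 31),
                     timedelta(hours=1)))  # 720
```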
Example Pipeline Definition
{
"objects" : [
{
"id" : "Default",
"onFail" : {"ref" : "FailureNotify"},
"maximumRetries" : "3",
"workerGroup" : "myWorkerGroup"
},
{
"id" : "Period",
"type" : "Schedule",
"period" : "1 hour",
"startDateTime" : "2012-01-13T20:00:00",
"endDateTime" : "2012-01-13T21:00:00"
},
{
"id" : "MyProcess",
"type" : "ShellCommandActivity",
"onSuccess" : {"ref" : "SuccessNotify"},
"command" : "/home/myScriptPath/myScript.sh #{@scheduledStartTime}
#{@scheduledEndTime}",
"schedule": {"ref" : "Period"},
"stderr" : "/tmp/stderr:#{@scheduledStartTime}",
"stdout" : "/tmp/stdout:#{@scheduledStartTime}"
},
{
"id" : "SuccessNotify",
"type" : "SnsAlarm",
"subject" : "Pipeline component succeeded",
"message": "Success for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",
"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
},
{
"id" : "FailureNotify",
"type" : "SnsAlarm",
"subject" : "Failed to run pipeline component",
"message": "Error for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",
"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
}
]
}
Chain Multiple Activities and Roll Up Data
This example pipeline definition demonstrates the following:
• Chaining multiple activities in a graph of dependencies based on inputs and outputs.
• Rolling up data from smaller granularities (such as 15 minute buckets) into a larger granularity (such as 1 hour buckets).
This pipeline defines a schedule named 'CopyPeriod', which describes 15 minute time intervals originating at UTC time 2012-01-17T00:00:00 and a schedule named 'HourlyPeriod', which describes 1 hour time intervals originating at UTC time 2012-01-17T00:00:00.
'InputData' describes files of this form:
• s3://myBucket/demo/2012-01-17T00:00:00.csv
• s3://myBucket/demo/2012-01-17T00:15:00.csv
• s3://myBucket/demo/2012-01-17T00:30:00.csv
• s3://myBucket/demo/2012-01-17T00:45:00.csv
• s3://myBucket/demo/2012-01-17T01:00:00.csv
Every 15 minute interval (specified by @scheduledStartTime..@scheduledEndTime), activity 'CopyMinuteData' checks for the Amazon S3 file s3://myBucket/demo/#{@scheduledStartTime}.csv and, when it is found, copies the file to s3://myBucket/demo/#{@scheduledEndTime}.csv, per the definition of output object 'OutputMinuteData'.
Similarly, for every hour's worth of 'OutputMinuteData' Amazon S3 files found to exist (four 15-minute files in this case), activity 'CopyHourlyData' runs and writes the output to an hourly file defined by the expression s3://myBucket/demo/hourly/#{@scheduledEndTime}.csv in 'HourlyData'.
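The 15-minute-to-hourly roll-up can be pictured as grouping the four quarter-hour files under the hour that contains them. A toy sketch of that bucketing; the file names follow the 'InputData' pattern above:

```python
from collections import defaultdict
from datetime import datetime

minute_files = [
    "2012-01-17T00:00:00.csv",
    "2012-01-17T00:15:00.csv",
    "2012-01-17T00:30:00.csv",
    "2012-01-17T00:45:00.csv",
    "2012-01-17T01:00:00.csv",
]

# Group each 15-minute file under the hour that contains it.
hourly = defaultdict(list)
for name in minute_files:
    stamp = datetime.strptime(name[:-4], "%Y-%m-%dT%H:%M:%S")  # strip ".csv"
    hourly[stamp.replace(minute=0, second=0)].append(name)

print(len(hourly[datetime(2012, 1, 17, 0)]))  # 4 quarter-hour files in hour 00
```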
Finally, when the Amazon S3 file described by s3://myBucket/demo/hourly/#{@scheduledEndTime}.csv in 'HourlyData' is found to exist, AWS Data Pipeline runs the script described by activity 'ShellOut'.
Example Pipeline Definition
{
"objects" " [
{
"id" : "Default",
"onFail" : {"ref" : "FailureNotify"},
"maximumRetries" : "3",
"workerGroup" : "myWorkerGroup"
},
{
"id" : "CopyPeriod",
"type" : "Schedule",
"period" : "15 minutes",
"startDateTime" : "2012-01-17T00:00:00",
"endDateTime" : "2012-01-17T02:00:00"
},
{
"id" : "InputData",
"type" : "S3DataNode",
"schedule" : {"ref" : "CopyPeriod"},
"filePath" : "s3://myBucket/demo/#{@scheduledStartTime}.csv",
"precondition" : {"ref" : "Ready"}
},
{
"id" : "OutputMinuteData",
"type" : "S3DataNode",
"schedule" : {"ref" : "CopyPeriod"},
"filePath" : "s3://myBucket/demo/#{@scheduledEndTime}.csv"
},
{
"id" : "Ready",
"type" : "Exists",
},
{
"id" : "CopyMinuteData",
"type" : "CopyActivity",
"schedule" : {"ref" : "CopyPeriod"},
"input" : {"ref" : "InputData"},
"output" : {"ref" : "OutputMinuteData"}
},
{
"id" : "HourlyPeriod",
"type" : "Schedule",
"period" : "1 hour",
"startDateTime" : "2012-01-17T00:00:00",
"endDateTime" : "2012-01-17T02:00:00"
},
{
"id" : "CopyHourlyData",
"type" : "CopyActivity",
"schedule" : {"ref" : "HourlyPeriod"},
"input" : {"ref" : "OutputMinuteData"},
"output" : {"ref" : "HourlyData"}
},
{
"id" : "HourlyData",
"type" : "S3DataNode",
"schedule" : {"ref" : "HourlyPeriod"},
"filePath" : "s3://myBucket/demo/hourly/#{@scheduledEndTime}.csv"
},
{
"id" : "ShellOut",
"type" : "ShellCommandActivity",
"input" : {"ref" : "HourlyData"},
"command" : "/home/userName/xxx.sh #{@scheduledStartTime} #{@scheduledEnd
Time}",
"schedule" : {"ref" : "HourlyPeriod"},
"stderr" : "/tmp/stderr:#{@scheduledStartTime}",
"stdout" : "/tmp/stdout:#{@scheduledStartTime}",
"onSuccess" : {"ref" : "SuccessNotify"}
},
{
"id" : "SuccessNotify",
"type" : "SnsAlarm",
"subject" : "Pipeline component succeeded",
"message": "Success for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",
"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
},
{
"id" : "FailureNotify",
"type" : "SnsAlarm",
"subject" : "Failed to run pipeline component",
"message": "Error for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.",
"topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"
}
]
}
Copy Data from Amazon S3 to MySQL
This example pipeline definition automatically creates an Amazon EC2 instance that copies the specified data from a CSV file in Amazon S3 into a MySQL database table. For simplicity, the structure of the example MySQL insert statement assumes that you have a CSV input file with two columns of data that you are writing into a MySQL database table with two matching columns of the appropriate data type. If your data has a different shape, modify the MySQL statement to include additional data columns or data types.
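The two-column insert pattern the example describes looks like the following sketch, with Python's sqlite3 standing in for the MySQL table. The table and column names follow the placeholders in the definition below, and the CSV contents are hypothetical:

```python
import csv
import io
import sqlite3

# A hypothetical two-column CSV input, as it might sit in Amazon S3.
csv_text = "alpha,1\nbeta,2\n"
rows = list(csv.reader(io.StringIO(csv_text)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (column1_name TEXT, column2_name TEXT)")

# Same shape as the pipeline's insertQuery: positional (?, ?) placeholders,
# one per CSV column.
conn.executemany(
    "insert into table_name (column1_name, column2_name) values (?, ?);", rows)

print(conn.execute("SELECT COUNT(*) FROM table_name").fetchone()[0])  # 2
```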
Example Pipeline Definition
{
"objects": [
{
"id": "Default",
"logUri": "s3://testbucket/error_log",
"schedule": {
"ref": "MySchedule"
}
},
{
"id": "MySchedule",
"type": "Schedule",
"startDateTime": "2012-11-26T00:00:00",
"endDateTime": "2012-11-27T00:00:00",
"period": "1 day"
},
{
"id": "MyS3Input",
"filePath": "s3://testbucket/input_data_file.csv",
"type": "S3DataNode"
},
{
"id": "MyCopyActivity",
"input": {
"ref": "MyS3Input"
},
"output": {
"ref": "MyDatabaseNode"
},
"type": "CopyActivity",
"runsOn": {
"ref": "MyEC2Resource"
}
},
{
"id": "MyEC2Resource",
"type": "Ec2Resource",
"actionOnTaskFailure": "terminate",
"actionOnResourceFailure": "retryAll",
"maximumRetries": "1",
"role": "test-role",
"resourceRole": "test-role",
"instanceType": "m1.medium",
"instanceCount": "1",
"securityGroups": [
"test-group",
"default"
],
"keyPair": "test-pair"
},
{
"id": "MyDatabaseNode",
"type": "MySqlDataNode",
"table": "table_name",
"username": "
user_name
",
"*password": "
my_password
",
"connectionString": "jdbc:mysql:/
/mysqlinstance
-rds.example.us-east-
1.rds.amazonaws.com:3306/
database_name
",
"insertQuery": "insert into #{table} (column1_ name, column2_name) values
(?, ?);"
}
]
}
This example has the following fields defined in the MySqlDataNode:
id
User-defined identifier for the MySQL database, which is a label for your reference only.
type
MySqlDataNode type that matches the kind of location for our data, which is an Amazon RDS instance using MySQL in this example.
table
Name of the database table that contains the data to copy. Replace table_name with the name of your database table.
username
User name of the database account that has sufficient permission to retrieve data from the database table. Replace user_name with the name of your user account.
*password
Password for the database account with the asterisk prefix to indicate that AWS Data Pipeline must encrypt the password value. Replace my_password with the correct password for your user account.
connectionString
JDBC connection string for CopyActivity to connect to the database.
insertQuery
A valid SQL INSERT statement that specifies how to write the copied data into the database table. Note that #{table} is a variable that reuses the table name provided by the "table" field in the preceding lines of the JSON file.
Extract Apache Web Log Data from Amazon S3 using Hive
This example pipeline definition automatically creates an Amazon EMR cluster to extract data from Apache web logs in Amazon S3 to a CSV file in Amazon S3 using Hive.
Example Pipeline Definition
{
"objects": [
{
"startDateTime": "2012-05-04T00:00:00",
"id": "MyEmrResourcePeriod",
"period": "1 day",
"type": "Schedule",
"endDateTime": "2012-05-05T00:00:00"
},
{
"id": "MyHiveActivity",
"type": "HiveActivity",
"schedule": {
"ref": "MyEmrResourcePeriod"
},
"runsOn": {
"ref": "MyEmrResource"
},
"input": {
"ref": "MyInputData"
},
"output": {
"ref": "MyOutputData"
},
"hiveScript": "INSERT OVERWRITE TABLE ${output1} select host,user,time,request,status,size from ${input1};"
},
{
"schedule": {
"ref": "MyEmrResourcePeriod"
},
"masterInstanceType": "m1.small",
"coreInstanceType": "m1.small",
"enableDebugging": "true",
"keyPair": "test-pair",
"id": "MyEmrResource",
"coreInstanceCount": "1",
"actionOnTaskFailure": "continue",
"maximumRetries": "1",
"type": "EmrCluster",
"actionOnResourceFailure": "retryAll",
"terminateAfter": "10 hour"
},
{
"id": "MyInputData",
"type": "S3DataNode",
"schedule": {
"ref": "MyEmrResourcePeriod"
},
"directoryPath": "s3://test-hive/input-access-logs",
"dataFormat": {
"ref": "MyInputDataType"
}
},
{
"id": "MyOutputData",
"type": "S3DataNode",
"schedule": {
"ref": "MyEmrResourcePeriod"
},
"directoryPath": "s3://test-hive/output-access-logs",
"dataFormat": {
"ref": "MyOutputDataType"
}
},
{
"id": "MyOutputDataType",
"type": "Custom",
"columnSeparator": "\t",
"recordSeparator": "\n",
"column": [
"host STRING",
"user STRING",
"time STRING",
"request STRING",
"status STRING",
"size STRING"
]
},
{
"id": "MyInputDataType",
"type": "RegEx",
"inputRegEx": "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^
\"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^
\"]*|\"[^\"]*\"))?",
"outputFormat": "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s",
"column": [
"host STRING",
"identity STRING",
"user STRING",
"time STRING",
"request STRING",
"status STRING",
"size STRING",
"referer STRING",
"agent STRING"
]
}
]
}
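The RegEx data format above is a matcher for Apache access-log lines. The sketch below applies the same pattern with Python's re module; the sample log line is my own, not taken from the guide:

```python
import re

# Same pattern as the pipeline's inputRegEx, written as raw Python strings.
APACHE_LOG = (r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*")'
              r' (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?')

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://example.com/start.html" "Mozilla/4.08"')

match = re.match(APACHE_LOG, line)
# Capture groups line up with the "column" list: host, identity, user, time,
# request, status, size, referer, agent.
print(match.group(1))  # 127.0.0.1
print(match.group(6))  # 200
```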
Extract Amazon S3 Data (CSV/TSV) to Amazon S3 using Hive
This example pipeline definition creates an Amazon EMR cluster to extract data from a CSV file in Amazon S3 to another CSV file in Amazon S3 using Hive.
Note
You can accommodate tab-delimited (TSV) data files similarly to how this sample demonstrates using comma-delimited (CSV) files, if you change the MyInputDataType and MyOutputDataType type fields to "TSV" instead of "CSV".
Example Pipeline Definition
{
"objects": [
{
"startDateTime": "2012-05-04T00:00:00",
"id": "MyEmrResourcePeriod",
"period": "1 day",
"type": "Schedule",
"endDateTime": "2012-05-05T00:00:00"
},
{
"id": "MyHiveActivity",
"maximumRetries": "10",
"type": "HiveActivity",
"schedule": {
"ref": "MyEmrResourcePeriod"
},
"runsOn": {
"ref": "MyEmrResource"
},
"input": {
"ref": "MyInputData"
},
"output": {
"ref": "MyOutputData"
},
"hiveScript": "INSERT OVERWRITE TABLE ${output1} select * from ${input1};"
},
{
"schedule": {
"ref": "MyEmrResourcePeriod"
},
"masterInstanceType": "m1.small",
"coreInstanceType": "m1.small",
"enableDebugging": "true",
"keyPair": "test-pair",
"id": "MyEmrResource",
"coreInstanceCount": "1",
"actionOnTaskFailure": "continue",
"maximumRetries": "2",
"type": "EmrCluster",
"actionOnResourceFailure": "retryAll",
"terminateAfter": "10 hour"
},
{
"id": "MyInputData",
"type": "S3DataNode",
"schedule": {
"ref": "MyEmrResourcePeriod"
},
"directoryPath": "s3://test-hive/input",
"dataFormat": {
"ref": "MyInputDataType"
}
},
{
"id": "MyOutputData",
"type": "S3DataNode",
"schedule": {
"ref": "MyEmrResourcePeriod"
},
"directoryPath": "s3://test-hive/output",
"dataFormat": {
"ref": "MyOutputDataType"
}
},
{
"id": "MyOutputDataType",
"type": "CSV",
"column": [
"Name STRING",
"Age STRING",
"Surname STRING"
]
},
{
"id": "MyInputDataType",
"type": "CSV",
"column": [
"Name STRING",
"Age STRING",
"Surname STRING"
]
}
]
}
Extract Amazon S3 Data (Custom Format) to Amazon S3 using Hive
This example pipeline definition creates an Amazon EMR cluster to extract data from Amazon S3 with Hive, using a custom file format specified by the columnSeparator and recordSeparator fields.
Example Pipeline Definition
{
"objects": [
{
"startDateTime": "2012-05-04T00:00:00",
"id": "MyEmrResourcePeriod",
"period": "1 day",
"type": "Schedule",
"endDateTime": "2012-05-05T00:00:00"
},
{
"id": "MyHiveActivity",
"type": "HiveActivity",
"schedule": {
"ref": "MyEmrResourcePeriod"
},
"runsOn": {
"ref": "MyEmrResource"
},
"input": {
"ref": "MyInputData"
},
"output": {
"ref": "MyOutputData"
},
"hiveScript": "INSERT OVERWRITE TABLE ${output1} select * from ${input1};"
},
{
"schedule": {
"ref": "MyEmrResourcePeriod"
},
"masterInstanceType": "m1.small",
"coreInstanceType": "m1.small",
"enableDebugging": "true",
"keyPair": "test-pair",
"id": "MyEmrResource",
"coreInstanceCount": "1",
"actionOnTaskFailure": "continue",
"maximumRetries": "1",
"type": "EmrCluster",
"actionOnResourceFailure": "retryAll",
"terminateAfter": "10 hour"
},
{
"id": "MyInputData",
"type": "S3DataNode",
"schedule": {
"ref": "MyEmrResourcePeriod"
},
"directoryPath": "s3://test-hive/input",
"dataFormat": {
"ref": "MyInputDataType"
}
},
{
"id": "MyOutputData",
"type": "S3DataNode",
"schedule": {
"ref": "MyEmrResourcePeriod"
},
"directoryPath": "s3://test-hive/output-custom",
"dataFormat": {
"ref": "MyOutputDataType"
}
},
{
"id": "MyOutputDataType",
"type": "Custom",
"columnSeparator": ",",
"recordSeparator": "\n",
"column": [
"Name STRING",
"Age STRING",
"Surname STRING"
]
},
{
"id": "MyInputDataType",
"type": "Custom",
"columnSeparator": ",",
"recordSeparator": "\n",
"column": [
"Name STRING",
"Age STRING",
"Surname STRING"
]
}
]
}
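A Custom data format is just a record separator and a column separator. Parsing with the example's "\n" and "," values can be sketched as follows; the sample data rows are hypothetical:

```python
record_separator = "\n"
column_separator = ","

# Hypothetical file contents matching the Name/Age/Surname columns.
data = "Alice,30,Smith\nBob,25,Jones"

# Split into records, then split each record into columns.
records = [record.split(column_separator)
           for record in data.split(record_separator)]
print(records)  # [['Alice', '30', 'Smith'], ['Bob', '25', 'Jones']]
```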
Simple Data Types
The following types of data can be set as field values.
Topics
• DateTime
• Numeric
• Expression Evaluation (p. 154)
• Object References
• Period
• String
DateTime
AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only. The following example sets the startDateTime field of a Schedule object to 1/15/2012, 11:59 p.m., in the UTC/GMT time zone.

"startDateTime" : "2012-01-15T23:59:00"
Numeric
AWS Data Pipeline supports both integers and floating-point values.
Expression Evaluation
AWS Data Pipeline provides a set of functions that you can use to calculate the value of a field. For more information about these functions, see
Expression Evaluation (p. 155)
. The following example uses the makeDate
function to set the startDateTime
field of a
Schedule
object to
"2011-05-24T0:00:00"
GMT/UTC.
"startDateTime" : "makeDate(2011,5,24)"
Object References
An object in the pipeline definition. This can be the current object, the name of an object defined elsewhere in the pipeline, or an object that lists the current object in a field, referenced by the node keyword. For more information about node, see Referencing Fields and Objects (p. 138). For more information about the pipeline object types, see
Period
Indicates how often a scheduled event should run. It's expressed in the format "N [years | months | weeks | days | hours | minutes]", where N is a positive integer value.
The minimum period is 15 minutes and the maximum period is 3 years.
The following example sets the period field of the Schedule object to 3 hours. This creates a schedule that runs every three hours.
"period" : "3 hours"
String
Standard string values. Strings must be surrounded by double quotes ("). You can use the backslash character (\) to escape characters in a string. Multiline strings are not supported.
The following examples show valid string values for the id field.

"id" : "My Data Object"
"id" : "My \"Data\" Object"
Strings can also contain expressions that evaluate to string values. These are inserted into the string and are delimited with "#{" and "}". The following example uses an expression to insert the name of the current object into a path.

"filePath" : "s3://myBucket/#{name}.csv"

For more information about using expressions, see Referencing Fields and Objects (p. 138) and Expression Evaluation (p. 155).
Expression Evaluation
The following functions are provided by AWS Data Pipeline. You can use them to evaluate field values.
Topics
• Mathematical Functions (p. 155)
• String Functions
• Date and Time Functions (p. 156)
Mathematical Functions
The following functions are available for working with numerical values.

Function: +
Description: Addition.
Example: #{1 + 2}
Result: 3

Function: -
Description: Subtraction.
Example: #{1 - 2}
Result: -1

Function: *
Description: Multiplication.
Example: #{1 * 2}
Result: 2
Function: /
Description: Division. If you divide two integers, the result is truncated.
Example: #{1 / 2}, Result: 0
Example: #{1.0 / 2}, Result: .5

Function: ^
Description: Exponent.
Example: #{2 ^ 2}
Result: 4.0
String Functions
The following functions are available for working with string values.

Function: +
Description: Concatenation. Non-string values are first converted to strings.
Example: #{"hel" + "lo"}
Result: "hello"
Date and Time Functions
The following functions are available for working with DateTime values. For the examples, the value of myDateTime is May 24, 2011 @ 5:10 pm GMT.

Function: int minute(DateTime myDateTime)
Description: Gets the minute of the DateTime value as an integer.
Example: #{minute(myDateTime)}
Result: 10

Function: int hour(DateTime myDateTime)
Description: Gets the hour of the DateTime value as an integer.
Example: #{hour(myDateTime)}
Result: 17
Function: int day(DateTime myDateTime)
Description: Gets the day of the DateTime value as an integer.
Example: #{day(myDateTime)}
Result: 24

Function: int dayOfYear(DateTime myDateTime)
Description: Gets the day of the year of the DateTime value as an integer.
Example: #{dayOfYear(myDateTime)}
Result: 144

Function: int month(DateTime myDateTime)
Description: Gets the month of the DateTime value as an integer.
Example: #{month(myDateTime)}
Result: 5

Function: int year(DateTime myDateTime)
Description: Gets the year of the DateTime value as an integer.
Example: #{year(myDateTime)}
Result: 2011

Function: String format(DateTime myDateTime,String format)
Description: Creates a String object that is the result of converting the specified DateTime using the specified format string.
Example: #{format(myDateTime,'YYYY-MM-dd hh:mm:ss z')}
Result: "2011-05-24T17:10:00 UTC"

Function: DateTime inTimeZone(DateTime myDateTime,String zone)
Description: Creates a DateTime object with the same date and time, but in the specified time zone, taking daylight saving time into account. For more information about time zones, see http://joda-time.sourceforge.net/timezones.html.
Example: #{inTimeZone(myDateTime,'America/Los_Angeles')}
Result: "2011-05-24T10:10:00 America/Los_Angeles"
Function: DateTime makeDate(int year,int month,int day)
Description: Creates a DateTime object, in UTC, with the specified year, month, and day, at midnight.
Example: #{makeDate(2011,5,24)}
Result: "2011-05-24T0:00:00z"

Function: DateTime makeDateTime(int year,int month,int day,int hour,int minute)
Description: Creates a DateTime object, in UTC, with the specified year, month, day, hour, and minute.
Example: #{makeDateTime(2011,5,24,14,21)}
Result: "2011-05-24T14:21:00z"

Function: DateTime midnight(DateTime myDateTime)
Description: Creates a DateTime object for the next midnight, relative to the specified DateTime.
Example: #{midnight(myDateTime)}
Result: "2011-05-24T0:00:00z"

Function: DateTime yesterday(DateTime myDateTime)
Description: Creates a DateTime object for the previous day, relative to the specified DateTime. The result is the same as minusDays(1).
Example: #{yesterday(myDateTime)}
Result: "2011-05-23T17:10:00z"

Function: DateTime sunday(DateTime myDateTime)
Description: Creates a DateTime object for the previous Sunday, relative to the specified DateTime. If the specified DateTime is a Sunday, the result is the specified DateTime.
Example: #{sunday(myDateTime)}
Result: "2011-05-22 17:10:00 UTC"
Function: DateTime firstOfMonth(DateTime myDateTime)
Description: Creates a DateTime object for the start of the month in the specified DateTime.
Example: #{firstOfMonth(myDateTime)}
Result: "2011-05-01T17:10:00z"

Function: DateTime minusMinutes(DateTime myDateTime,int minutesToSub)
Description: Creates a DateTime object that is the result of subtracting the specified number of minutes from the specified DateTime.
Example: #{minusMinutes(myDateTime,1)}
Result: "2011-05-24T17:09:00z"

Function: DateTime minusHours(DateTime myDateTime,int hoursToSub)
Description: Creates a DateTime object that is the result of subtracting the specified number of hours from the specified DateTime.
Example: #{minusHours(myDateTime,1)}
Result: "2011-05-24T16:10:00z"

Function: DateTime minusDays(DateTime myDateTime,int daysToSub)
Description: Creates a DateTime object that is the result of subtracting the specified number of days from the specified DateTime.
Example: #{minusDays(myDateTime,1)}
Result: "2011-05-23T17:10:00z"

Function: DateTime minusWeeks(DateTime myDateTime,int weeksToSub)
Description: Creates a DateTime object that is the result of subtracting the specified number of weeks from the specified DateTime.
Example: #{minusWeeks(myDateTime,1)}
Result: "2011-05-17T17:10:00z"
DateTime minusMonths(DateTime myDateTime,int monthsToSub)
Creates a DateTime object that is the result of subtracting the specified number of months from the specified DateTime.
Example: #{minusMonths(myDateTime,1)}
Result: "2011-04-24T17:10:00z"

DateTime minusYears(DateTime myDateTime,int yearsToSub)
Creates a DateTime object that is the result of subtracting the specified number of years from the specified DateTime.
Example: #{minusYears(myDateTime,1)}
Result: "2010-05-24T17:10:00z"

DateTime plusMinutes(DateTime myDateTime,int minutesToAdd)
Creates a DateTime object that is the result of adding the specified number of minutes to the specified DateTime.
Example: #{plusMinutes(myDateTime,1)}
Result: "2011-05-24T17:11:00z"

DateTime plusHours(DateTime myDateTime,int hoursToAdd)
Creates a DateTime object that is the result of adding the specified number of hours to the specified DateTime.
Example: #{plusHours(myDateTime,1)}
Result: "2011-05-24T18:10:00z"

DateTime plusDays(DateTime myDateTime,int daysToAdd)
Creates a DateTime object that is the result of adding the specified number of days to the specified DateTime.
Example: #{plusDays(myDateTime,1)}
Result: "2011-05-25T17:10:00z"
DateTime plusWeeks(DateTime myDateTime,int weeksToAdd)
Creates a DateTime object that is the result of adding the specified number of weeks to the specified DateTime.
Example: #{plusWeeks(myDateTime,1)}
Result: "2011-05-31T17:10:00z"

DateTime plusMonths(DateTime myDateTime,int monthsToAdd)
Creates a DateTime object that is the result of adding the specified number of months to the specified DateTime.
Example: #{plusMonths(myDateTime,1)}
Result: "2011-06-24T17:10:00z"

DateTime plusYears(DateTime myDateTime,int yearsToAdd)
Creates a DateTime object that is the result of adding the specified number of years to the specified DateTime.
Example: #{plusYears(myDateTime,1)}
Result: "2012-05-24T17:10:00z"
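These functions are typically combined inside slot expressions. As an illustrative sketch (the bucket name and key layout are hypothetical, not from this guide), an S3DataNode could reference the previous day's data like this:

```json
{
  "id" : "YesterdayData",
  "type" : "S3DataNode",
  "filePath" : "s3://example-bucket/logs/#{minusDays(@scheduledStartTime,1)}.csv"
}
```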
Objects
This section describes the objects that you can use in your pipeline definition file.
Object Categories
The following is a list of AWS Data Pipeline objects by category.
Schedule
• Schedule
Data node
• S3DataNode
• MySqlDataNode
• DynamoDBDataNode
Activity
• ShellCommandActivity
• CopyActivity
Precondition
• ShellCommandPrecondition (p. 192)
Computational resource
Alarm
Object Hierarchy
The following is the object hierarchy for AWS Data Pipeline.
Important
You can only create objects of the types that are listed in the previous section.
Schedule
Defines the timing of a scheduled event, such as when an activity runs.
Note
When a schedule's startDateTime is in the past, AWS Data Pipeline backfills your pipeline and begins scheduling runs immediately, starting at startDateTime. For testing and development, use a relatively short startDateTime..endDateTime interval; otherwise, AWS Data Pipeline attempts to queue up and schedule all runs of your pipeline for that interval.
Syntax
The following slots are included in all objects.
id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following fields.
startDateTime (String, in DateTime or DateTimeWithZone format; required): The date and time to start the scheduled runs.
endDateTime (String, in DateTime or DateTimeWithZone format; optional): The date and time to end the scheduled runs. The default behavior is to schedule runs until the pipeline is shut down.
period (String; required): How often the pipeline should run. The format is "N [minutes|hours|days|weeks|months]", where N is a number followed by one of the time specifiers. For example, "15 minutes" runs the pipeline every 15 minutes. The minimum period is 15 minutes and the maximum period is 3 years.
@scheduledStartTime (DateTime, read-only): The date and time that the scheduled run actually started. This value is added to the object by the schedule. By convention, activities treat the start as inclusive.
@scheduledEndTime (DateTime, read-only): The date and time that the scheduled run actually ended. This value is added to the object by the schedule. By convention, activities treat the end as exclusive.
Example
The following is an example of this object type. It defines a schedule of every hour starting at 00:00:00 hours on 2012-09-01 and ending at 00:00:00 hours on 2012-10-01. The first period ends at 01:00:00 on
2012-09-01.
{
"id" : "Hourly",
"type" : "Schedule",
"period" : "1 hours",
"startDateTime" : "2012-09-01T00:00:00",
"endDateTime" : "2012-10-01T00:00:00"
}
S3DataNode
Defines a data node using Amazon S3.
Note
When you use an S3DataNode as input to a CopyActivity, only CSV data format is supported.
Syntax
The following slots are included in all objects.
id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following fields.
filePath (String; optional): The path to the object in Amazon S3 as a URI, for example: s3://my-bucket/my-key-for-file.
directoryPath (String; optional): Amazon S3 directory path as a URI: s3://my-bucket/my-key-for-directory.
compression (String; optional): The type of compression for the data described by the S3DataNode. none is no compression and gzip is compressed with the gzip algorithm. This field is only supported when you use S3DataNode with a CopyActivity.
dataFormat (String; required): The format of the data described by the S3DataNode. This field is only supported when you use S3DataNode with a HiveActivity.
This object includes the following slots from the DataNode object.
onFail (Object reference; optional): An action to run when the current instance fails.
onSuccess (Object reference; optional): An email alarm to use when the object's run succeeds.
precondition (Object reference; optional): A condition that must be met for the data node to be valid. To specify multiple conditions, add multiple precondition slots. A data node is not ready until all its conditions are met.
schedule (Object reference; required): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional.
scheduleType (String; optional): Specifies whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time-series style scheduling means instances are scheduled at the end of each interval, and cron-style scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries".
This object includes the following slots from RunnableObject.
workerGroup (String; optional): The worker group. This is used for routing tasks.
retryDelay (Period; required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer; optional): The maximum number of times to retry the action.
onLateNotify (Object reference; optional): The email alarm to use when the object's run is late.
onLateKill (Boolean; optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period; optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period; optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period; optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String; optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only): If the object failed, the error stack trace. This is a runtime slot.
@scheduledStartTime (DateTime): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only): The component from which this instance is created.
@headAttempt (Object reference, read-only): The latest attempt on the given instance.
@resource (Object reference, read-only): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String; optional): The status most recently reported from the activity.
This object includes the following slots from SchedulableObject.
schedule (Object reference; optional): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object.
scheduleType (String; optional): Specifies whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time-series style scheduling means instances are scheduled at the end of each interval, and cron-style scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries".
runsOn (Resource object reference; optional): The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster.
Example
The following is an example of this object type. This object references another object that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object.
{
"id" : "OutputData",
"type" : "S3DataNode",
"schedule" : {"ref" : "CopyPeriod"},
"filePath" : "s3://myBucket/#{@scheduledStartTime}.csv"
}
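When the data node refers to a set of files rather than a single object, directoryPath can be used instead of filePath. The following variant is an illustrative sketch (the directory layout is hypothetical; CopyPeriod is assumed to be the same Schedule object as above) that also shows the compression field, which applies when the node is used with a CopyActivity:

```json
{
  "id" : "InputData",
  "type" : "S3DataNode",
  "schedule" : {"ref" : "CopyPeriod"},
  "directoryPath" : "s3://myBucket/input/",
  "compression" : "gzip"
}
```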
MySqlDataNode
Defines a data node using MySQL.
Syntax
The following slots are included in all objects.
id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following slots from the SqlDataNode object.
table (String; required): The name of the table in the MySQL database. To specify multiple tables, add multiple table slots.
connectionString (String; optional): The JDBC connection string to access the database.
selectQuery (String; optional): A SQL statement to fetch data from the table.
insertQuery (String; optional): A SQL statement to insert data into the table.
This object includes the following slots from the DataNode object.
onFail (Object reference; optional): An action to run when the current instance fails.
onSuccess (Object reference; optional): An email alarm to use when the object's run succeeds.
precondition (Object reference; optional): A condition that must be met for the data node to be valid. To specify multiple conditions, add multiple precondition slots. A data node is not ready until all its conditions are met.
schedule (Object reference; required): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional.
scheduleType (String; optional): Specifies whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time-series style scheduling means instances are scheduled at the end of each interval, and cron-style scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries".
This object includes the following slots from RunnableObject.
workerGroup (String; optional): The worker group. This is used for routing tasks.
retryDelay (Period; required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer; optional): The maximum number of times to retry the action.
onLateNotify (Object reference; optional): The email alarm to use when the object's run is late.
onLateKill (Boolean; optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period; optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period; optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period; optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String; optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only): If the object failed, the error stack trace. This is a runtime slot.
@scheduledStartTime (DateTime): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only): The component from which this instance is created.
@headAttempt (Object reference, read-only): The latest attempt on the given instance.
@resource (Object reference, read-only): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String; optional): The status most recently reported from the activity.
Example
The following is an example of this object type. This object references two other objects that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object and Ready is a precondition object.
{
  "id" : "Sql Table",
  "type" : "MySqlDataNode",
  "schedule" : {"ref" : "CopyPeriod"},
  "table" : "adEvents",
  "username": "user_name",
  "*password": "my_password",
  "connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
  "selectQuery" : "select * from #{table} where eventTime >= '#{@scheduledStartTime.format('YYYY-MM-dd HH:mm:ss')}' and eventTime < '#{@scheduledEndTime.format('YYYY-MM-dd HH:mm:ss')}'",
  "precondition" : {"ref" : "Ready"}
}
DynamoDBDataNode
Defines a data node using Amazon DynamoDB, which is specified as an input to a HiveActivity or
EMRActivity.
Note
The DynamoDBDataNode does not support the Exists precondition.
Syntax
The following slots are included in all objects.
id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following fields.
tableName (String; required): The DynamoDB table.
This object includes the following slots from the DataNode object.
onFail (Object reference; optional): An action to run when the current instance fails.
onSuccess (Object reference; optional): An email alarm to use when the object's run succeeds.
precondition (Object reference; optional): A condition that must be met for the data node to be valid. To specify multiple conditions, add multiple precondition slots. A data node is not ready until all its conditions are met.
schedule (Object reference; required): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional.
scheduleType (String; optional): Specifies whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time-series style scheduling means instances are scheduled at the end of each interval, and cron-style scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries".
This object includes the following slots from RunnableObject.
workerGroup (String; optional): The worker group. This is used for routing tasks.
retryDelay (Period; required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer; optional): The maximum number of times to retry the action.
onLateNotify (Object reference; optional): The email alarm to use when the object's run is late.
onLateKill (Boolean; optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period; optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period; optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period; optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String; optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only): If the object failed, the error stack trace. This is a runtime slot.
@scheduledStartTime (DateTime): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only): The component from which this instance is created.
@headAttempt (Object reference, read-only): The latest attempt on the given instance.
@resource (Object reference, read-only): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String; optional): The status most recently reported from the activity.
This object includes the following slots from SchedulableObject.
schedule (Object reference; optional): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object.
scheduleType (String; optional): Specifies whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time-series style scheduling means instances are scheduled at the end of each interval, and cron-style scheduling means instances are scheduled at the beginning of each interval. Allowed values are "cron" or "timeseries". Defaults to "timeseries".
runsOn (Resource object reference; optional): The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster.
Example
The following is an example of this object type. This object references two other objects that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object and Ready is a precondition object.
{
"id" : "MyDynamoDBTable",
"type" : "DynamoDBDataNode",
"schedule" : {"ref" : "CopyPeriod"},
"tableName" : "adEvents",
"precondition" : {"ref" : "Ready"}
}
ShellCommandActivity
Runs a command or script. You can use ShellCommandActivity to run time-series or cron-like scheduled tasks.
When the stage field is set to true and used with an S3DataNode, ShellCommandActivity supports the concept of staging data: you can move data from Amazon S3 to a staging location, such as Amazon EC2 or your local environment, perform work on the data using scripts and the ShellCommandActivity, and move it back to Amazon S3. In this case, when your shell command is connected to an input S3DataNode, your shell scripts can operate directly on the data using ${input1}, ${input2}, and so on, referring to the ShellCommandActivity input fields. Similarly, output from the shell command can be staged in an output directory to be automatically pushed to Amazon S3, referred to by ${output1}, ${output2}, and so on. These expressions can be passed as command-line arguments to the shell command for you to use in data transformation logic.
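The staging pattern described above can be sketched as follows. This is an illustrative example, not from this guide: MyInput and MyOutput are hypothetical S3DataNode objects assumed to be defined elsewhere in the same pipeline definition, and the command simply strips comment lines from the staged input:

```json
{
  "id" : "TransformData",
  "type" : "ShellCommandActivity",
  "stage" : "true",
  "input" : {"ref" : "MyInput"},
  "output" : {"ref" : "MyOutput"},
  "command" : "grep -v '^#' ${input1} > ${output1}"
}
```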
Syntax
The following slots are included in all objects.
id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following fields.
command (String; required): The command to run. This value and any associated parameters must function in the environment from which you are running the Task Runner.
stdout (String; optional): The file that receives redirected output from the command that is run.
stderr (String; optional): The file that receives redirected system error messages from the command that is run.
input (Data node object reference; optional): The input data source. To specify multiple data sources, add multiple input fields.
output (Data node object reference; optional): The location for the output. To specify multiple locations, add multiple output fields.
scriptUri (A valid S3 URI; optional): An Amazon S3 URI path for a file to download and run as a shell command. Only one scriptUri or command field should be present.
stage (Boolean; optional): Determines whether staging is enabled, which allows your shell commands to have access to the staged-data variables, such as ${input1}, ${output1}, and so on.
Note
You must specify either a command value or a scriptUri value; you do not need to specify both.
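As an illustrative sketch using scriptUri instead of command (the bucket and script name are hypothetical, not from this guide):

```json
{
  "id" : "RunScript",
  "type" : "ShellCommandActivity",
  "scriptUri" : "s3://example-bucket/scripts/process.sh"
}
```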
This object includes the following slots from the Activity object.
onFail (Object reference; optional): An action to run when the current instance fails.
onSuccess (Object reference; optional): An email alarm to use when the object's run succeeds.
precondition (Object reference; optional): A condition that must be met before the object can run. To specify multiple conditions, add multiple precondition slots. The activity cannot run until all its conditions are met.
schedule (Object reference; required): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional.
This object includes the following slots from RunnableObject.
workerGroup (String; optional): The worker group. This is used for routing tasks.
retryDelay (Period; required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer; optional): The maximum number of times to retry the action.
onLateNotify (Object reference; optional): The email alarm to use when the object's run is late.
onLateKill (Boolean; optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period; optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period; optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period; optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String; optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only): If the object failed, the error stack trace. This is a runtime slot.
@scheduledStartTime (DateTime): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only): The component from which this instance is created.
@headAttempt (Object reference, read-only): The latest attempt on the given instance.
@resource (Object reference, read-only): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String; optional): The status most recently reported from the activity.
This object includes the following slots from SchedulableObject.
• schedule: A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. (Type: object reference. Required: No.)
• scheduleType: Specifies whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means that instances are scheduled at the end of each interval, and Cron Style Scheduling means that instances are scheduled at the beginning of each interval. Allowed values are "cron" and "timeseries"; the default is "timeseries". (Required: No.)
• runsOn: The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. (Type: Resource object reference. Required: No.)
Example
The following is an example of this object type.
{
"id" : "CreateDirectory",
"type" : "ShellCommandActivity",
"command" : "mkdir new-directory"
}
CopyActivity
Copies data from one location to another. The copy operation is performed record by record.
Important
When you use an S3DataNode as input for CopyActivity, you can use only a Unix/Linux variant of the CSV data file format, which means that CopyActivity has specific limitations to its CSV support:
• The separator must be the "," (comma) character.
• The records are not quoted.
• The default escape character is ASCII value 92 (backslash).
• The end-of-record identifier is ASCII value 10 (or "\n").
Warning
Windows-based systems typically use a different end-of-record character sequence: a carriage return and line feed together (ASCII value 13 and ASCII value 10). You must accommodate this difference with an additional mechanism, such as a pre-copy script that modifies the input data, so that CopyActivity can properly detect the end of a record; otherwise, the CopyActivity fails repeatedly.
Warning
Additionally, you may encounter repeated CopyActivity failures if you supply compressed data files as input but do not declare this using the compression field. In this case, CopyActivity does not properly detect the end-of-record character and the operation fails.
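As an illustration of the pre-copy mechanism mentioned above, the following sketch normalizes Windows CRLF record endings to the LF endings that CopyActivity expects. The function name, and the idea of running it over the data before upload, are assumptions for illustration only; they are not part of AWS Data Pipeline.

```python
def normalize_record_endings(data: bytes) -> bytes:
    """Replace Windows CRLF (ASCII 13 + 10) record endings with LF (ASCII 10)."""
    return data.replace(b"\r\n", b"\n")

# Example: a two-record CSV file produced on a Windows-based system.
windows_csv = b"host,requests\r\nweb-01,42\r\n"
unix_csv = normalize_record_endings(windows_csv)
```

A script like this would run against each input file before it is uploaded to Amazon S3, so that every record ends in a single LF character.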
Syntax
The following slots are included in all objects.
• id: The ID of the object. IDs must be unique within a pipeline definition. (Type: String. Required: Yes.)
• name: The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String. Required: No.)
• type: The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String. Required: Yes.)
• parent: The parent of the object. (Type: String. Required: No.)
• @sphere: The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String (read-only). Required: No.)
This object includes the following fields.
• input: The input data source. To specify multiple data sources, add multiple input fields. (Type: Data node object reference. Required: Yes.)
• output: The location for the output. To specify multiple locations, add multiple output fields. (Type: Data node object reference. Required: Yes.)
This object includes the following slots from the Activity object.
• onFail: An action to run when the current instance fails. (Type: object reference. Required: No.)
• onSuccess: An email alarm to use when the object's run succeeds. (Type: object reference. Required: No.)
• precondition: A condition that must be met before the object can run. To specify multiple conditions, add multiple precondition slots. The activity cannot run until all its conditions are met. (Type: object reference. Required: No.)
• schedule: A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional. (Type: object reference. Required: Yes.)
This object includes the following slots from RunnableObject.
• workerGroup: The worker group. This is used for routing tasks. (Type: String. Required: No.)
• retryDelay: The timeout duration between two retry attempts. The default is 10 minutes. (Type: Period. Required: Yes.)
• maximumRetries: The maximum number of times to retry the action. (Type: Integer. Required: No.)
• onLateNotify: The email alarm to use when the object's run is late. (Type: object reference. Required: No.)
• onLateKill: Indicates whether all pending or unscheduled tasks should be killed if they are late. (Type: Boolean. Required: No.)
• lateAfterTimeout: The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. (Type: Period. Required: No.)
• reportProgressTimeout: The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. (Type: Period. Required: No.)
• attemptTimeout: The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. (Type: Period. Required: No.)
• logUri: The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. (Type: String. Required: No.)
• @reportProgressTime: The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. (Type: DateTime (read-only). Required: No.)
• @activeInstances: Record of the currently scheduled instance objects. (Type: Schedulable object reference. Required: No.)
• @lastRun: The last run of the object. This is a runtime slot. (Type: DateTime (read-only). Required: No.)
• @scheduledPhysicalObjects: The currently scheduled instance objects. This is a runtime slot. (Type: Object reference (read-only). Required: No.)
• @status: The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. (Type: String (read-only). Required: No.)
• @triesLeft: The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. (Type: Integer (read-only). Required: No.)
• @actualStartTime: The date and time that the scheduled run actually started. This is a runtime slot. (Type: DateTime (read-only). Required: No.)
• @actualEndTime: The date and time that the scheduled run actually ended. This is a runtime slot. (Type: DateTime (read-only). Required: No.)
• errorCode: If the object failed, the error code. This is a runtime slot. (Type: String (read-only). Required: No.)
• errorMessage: If the object failed, the error message. This is a runtime slot. (Type: String (read-only). Required: No.)
• errorStackTrace: If the object failed, the error stack trace. This is a runtime slot. (Type: String (read-only). Required: No.)
• @scheduledStartTime: The date and time that the run was scheduled to start. (Type: DateTime. Required: No.)
• @scheduledEndTime: The date and time that the run was scheduled to end. (Type: DateTime. Required: No.)
• @componentParent: The component from which this instance is created. (Type: Object reference (read-only). Required: No.)
• @headAttempt: The latest attempt on the given instance. (Type: Object reference (read-only). Required: No.)
• @resource: The resource instance on which the given activity/precondition attempt is being run. (Type: Object reference (read-only). Required: No.)
• activityStatus: The status most recently reported from the activity. (Type: String. Required: No.)
This object includes the following slots from SchedulableObject.
• schedule: A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. (Type: object reference. Required: No.)
• scheduleType: Specifies whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means that instances are scheduled at the end of each interval, and Cron Style Scheduling means that instances are scheduled at the beginning of each interval. Allowed values are "cron" and "timeseries"; the default is "timeseries". (Required: No.)
• runsOn: The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. (Type: Resource object reference. Required: No.)
Example
The following is an example of this object type. This object references three other objects that you would define in the same pipeline definition file. CopyPeriod is a Schedule object, and InputData and OutputData are data node objects.
{
"id" : "S3ToS3Copy",
"type" : "CopyActivity",
"schedule" : {"ref" : "CopyPeriod"},
"input" : {"ref" : "InputData"},
"output" : {"ref" : "OutputData"}
}
EmrActivity
Runs an Amazon EMR job flow.
Syntax
The following slots are included in all objects.
• id: The ID of the object. IDs must be unique within a pipeline definition. (Type: String. Required: Yes.)
• name: The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String. Required: No.)
• type: The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String. Required: Yes.)
• parent: The parent of the object. (Type: String. Required: No.)
• @sphere: The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String (read-only). Required: No.)
This object includes the following fields.
• runsOn: Details about the Amazon EMR job flow. (Type: object reference. Required: Yes.)
• step: One or more steps for the job flow to run. To specify multiple steps, up to 255, add multiple step fields. (Type: String. Required: Yes.)
• preStepCommand: Shell scripts to be run before any steps are run. To specify multiple scripts, up to 255, add multiple preStepCommand fields. (Type: String. Required: No.)
• postStepCommand: Shell scripts to be run after all steps are finished. To specify multiple scripts, up to 255, add multiple postStepCommand fields. (Type: String. Required: No.)
• input: The input data source. To specify multiple data sources, add multiple input fields. (Type: Data node object reference. Required: No.)
• output: The location for the output. To specify multiple locations, add multiple output fields. (Type: Data node object reference. Required: No.)
This object includes the following slots from the Activity object.
• onFail: An action to run when the current instance fails. (Type: object reference. Required: No.)
• onSuccess: An email alarm to use when the object's run succeeds. (Type: object reference. Required: No.)
• precondition: A condition that must be met before the object can run. To specify multiple conditions, add multiple precondition slots. The activity cannot run until all its conditions are met. (Type: object reference. Required: No.)
• schedule: A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional. (Type: object reference. Required: Yes.)
This object includes the following slots from RunnableObject.
• workerGroup: The worker group. This is used for routing tasks. (Type: String. Required: No.)
• retryDelay: The timeout duration between two retry attempts. The default is 10 minutes. (Type: Period. Required: Yes.)
• maximumRetries: The maximum number of times to retry the action. (Type: Integer. Required: No.)
• onLateNotify: The email alarm to use when the object's run is late. (Type: object reference. Required: No.)
• onLateKill: Indicates whether all pending or unscheduled tasks should be killed if they are late. (Type: Boolean. Required: No.)
• lateAfterTimeout: The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. (Type: Period. Required: No.)
• reportProgressTimeout: The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. (Type: Period. Required: No.)
• attemptTimeout: The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. (Type: Period. Required: No.)
• logUri: The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. (Type: String. Required: No.)
• @reportProgressTime: The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. (Type: DateTime (read-only). Required: No.)
• @activeInstances: Record of the currently scheduled instance objects. (Type: Schedulable object reference. Required: No.)
• @lastRun: The last run of the object. This is a runtime slot. (Type: DateTime (read-only). Required: No.)
• @scheduledPhysicalObjects: The currently scheduled instance objects. This is a runtime slot. (Type: Object reference (read-only). Required: No.)
• @status: The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. (Type: String (read-only). Required: No.)
• @triesLeft: The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. (Type: Integer (read-only). Required: No.)
• @actualStartTime: The date and time that the scheduled run actually started. This is a runtime slot. (Type: DateTime (read-only). Required: No.)
• @actualEndTime: The date and time that the scheduled run actually ended. This is a runtime slot. (Type: DateTime (read-only). Required: No.)
• errorCode: If the object failed, the error code. This is a runtime slot. (Type: String (read-only). Required: No.)
• errorMessage: If the object failed, the error message. This is a runtime slot. (Type: String (read-only). Required: No.)
• errorStackTrace: If the object failed, the error stack trace. This is a runtime slot. (Type: String (read-only). Required: No.)
• @scheduledStartTime: The date and time that the run was scheduled to start. (Type: DateTime. Required: No.)
• @scheduledEndTime: The date and time that the run was scheduled to end. (Type: DateTime. Required: No.)
• @componentParent: The component from which this instance is created. (Type: Object reference (read-only). Required: No.)
• @headAttempt: The latest attempt on the given instance. (Type: Object reference (read-only). Required: No.)
• @resource: The resource instance on which the given activity/precondition attempt is being run. (Type: Object reference (read-only). Required: No.)
• activityStatus: The status most recently reported from the activity. (Type: String. Required: No.)
This object includes the following slots from SchedulableObject.
• schedule: A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. (Type: object reference. Required: No.)
• scheduleType: Specifies whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means that instances are scheduled at the end of each interval, and Cron Style Scheduling means that instances are scheduled at the beginning of each interval. Allowed values are "cron" and "timeseries"; the default is "timeseries". (Required: No.)
• runsOn: The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. (Type: Resource object reference. Required: No.)
Example
The following is an example of this object type. This object references three other objects that you would define in the same pipeline definition file. MyEmrCluster is an EmrCluster object, and MyS3Input and MyS3Output are S3DataNode objects.
{
  "id" : "MyEmrActivity",
  "type" : "EmrActivity",
  "runsOn" : {"ref" : "MyEmrCluster"},
  "preStepCommand" : "scp remoteFiles localFiles",
  "step" : "s3://myBucket/myPath/myStep.jar,firstArg,secondArg",
  "step" : "s3://myBucket/myPath/myOtherStep.jar,anotherArg",
  "postStepCommand" : "scp localFiles remoteFiles",
  "input" : {"ref" : "MyS3Input"},
  "output" : {"ref" : "MyS3Output"}
}
HiveActivity
Runs a Hive query on an Amazon EMR cluster. HiveActivity makes it easier to set up an EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables named ${input1}, ${input2}, and so on, based on the input slots in the HiveActivity. For Amazon S3 inputs, the dataFormat field is used to create the Hive column names. For MySQL (Amazon RDS) inputs, the column names for the SQL query are used to create the Hive column names.
Syntax
The following slots are included in all objects.
• id: The ID of the object. IDs must be unique within a pipeline definition. (Type: String. Required: Yes.)
• name: The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String. Required: No.)
• type: The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String. Required: Yes.)
• parent: The parent of the object. (Type: String. Required: No.)
• @sphere: The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String (read-only). Required: No.)
This object includes the following fields.
• scriptUri: The location of the Hive script to run, for example: s3://script location. (Type: String. Required: No.)
• hiveScript: The Hive script to run. (Type: String. Required: No.)
Note
You must specify a hiveScript value or a scriptUri value, but you do not need to specify both.
This object includes the following slots from the Activity object.
• onFail: An action to run when the current instance fails. (Type: object reference. Required: No.)
• onSuccess: An email alarm to use when the object's run succeeds. (Type: object reference. Required: No.)
• precondition: A condition that must be met before the object can run. To specify multiple conditions, add multiple precondition slots. The activity cannot run until all its conditions are met. (Type: object reference. Required: No.)
• schedule: A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional. (Type: object reference. Required: Yes.)
This object includes the following slots from RunnableObject.
• workerGroup: The worker group. This is used for routing tasks. (Type: String. Required: No.)
• retryDelay: The timeout duration between two retry attempts. The default is 10 minutes. (Type: Period. Required: Yes.)
• maximumRetries: The maximum number of times to retry the action. (Type: Integer. Required: No.)
• onLateNotify: The email alarm to use when the object's run is late. (Type: object reference. Required: No.)
• onLateKill: Indicates whether all pending or unscheduled tasks should be killed if they are late. (Type: Boolean. Required: No.)
• lateAfterTimeout: The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. (Type: Period. Required: No.)
• reportProgressTimeout: The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. (Type: Period. Required: No.)
• attemptTimeout: The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. (Type: Period. Required: No.)
• logUri: The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. (Type: String. Required: No.)
• @reportProgressTime: The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. (Type: DateTime (read-only). Required: No.)
• @activeInstances: Record of the currently scheduled instance objects. (Type: Schedulable object reference. Required: No.)
• @lastRun: The last run of the object. This is a runtime slot. (Type: DateTime (read-only). Required: No.)
• @scheduledPhysicalObjects: The currently scheduled instance objects. This is a runtime slot. (Type: Object reference (read-only). Required: No.)
• @status: The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. (Type: String (read-only). Required: No.)
• @triesLeft: The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. (Type: Integer (read-only). Required: No.)
• @actualStartTime: The date and time that the scheduled run actually started. This is a runtime slot. (Type: DateTime (read-only). Required: No.)
• @actualEndTime: The date and time that the scheduled run actually ended. This is a runtime slot. (Type: DateTime (read-only). Required: No.)
• errorCode: If the object failed, the error code. This is a runtime slot. (Type: String (read-only). Required: No.)
• errorMessage: If the object failed, the error message. This is a runtime slot. (Type: String (read-only). Required: No.)
• errorStackTrace: If the object failed, the error stack trace. This is a runtime slot. (Type: String (read-only). Required: No.)
• @scheduledStartTime: The date and time that the run was scheduled to start. (Type: DateTime. Required: No.)
• @scheduledEndTime: The date and time that the run was scheduled to end. (Type: DateTime. Required: No.)
• @componentParent: The component from which this instance is created. (Type: Object reference (read-only). Required: No.)
• @headAttempt: The latest attempt on the given instance. (Type: Object reference (read-only). Required: No.)
• @resource: The resource instance on which the given activity/precondition attempt is being run. (Type: Object reference (read-only). Required: No.)
• activityStatus: The status most recently reported from the activity. (Type: String. Required: No.)
This object includes the following slots from SchedulableObject.
• schedule: A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. (Type: object reference. Required: No.)
• scheduleType: Specifies whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means that instances are scheduled at the end of each interval, and Cron Style Scheduling means that instances are scheduled at the beginning of each interval. Allowed values are "cron" and "timeseries"; the default is "timeseries". (Required: No.)
• runsOn: The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. (Type: Resource object reference. Required: No.)
Example
The following is an example of this object type. This object references two other objects that you would define in the same pipeline definition file. CopyPeriod is a Schedule object and MyEmrCluster is an EmrCluster object.
{
  "id" : "MyHiveActivity",
  "type" : "HiveActivity",
  "schedule" : {"ref" : "CopyPeriod"},
  "runsOn" : {"ref" : "MyEmrCluster"},
  "hiveScript" : "show tables;"
}
ShellCommandPrecondition
A Unix/Linux shell command that can be executed as a precondition.
Syntax
The following slots are included in all objects.
• id: The ID of the object. IDs must be unique within a pipeline definition. (Type: String. Required: Yes.)
• name: The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String. Required: No.)
• type: The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String. Required: Yes.)
• parent: The parent of the object. (Type: String. Required: No.)
• @sphere: The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String (read-only). Required: No.)
This object includes the following fields.
• command: The command to run. This value and any associated parameters must function in the environment from which you are running the Task Runner. (Type: String. Required: Yes.)
• scriptUri: An Amazon S3 URI path for a file to download and run as a shell command. Only one scriptUri or command field should be present. (Type: A valid S3 URI. Required: No.)
This object includes the following slots from the Precondition object.
• preconditionMaximumRetries: Specifies the maximum number of times that a precondition is retried. (Type: Integer. Required: No.)
• node: The activity or data node for which this precondition is being checked. This is a runtime slot. (Type: Object reference (read-only). Required: No.)
This object includes the following slots from RunnableObject.
• workerGroup: The worker group. This is used for routing tasks. (Type: String. Required: No.)
• retryDelay: The timeout duration between two retry attempts. The default is 10 minutes. (Type: Period. Required: Yes.)
• maximumRetries: The maximum number of times to retry the action. (Type: Integer. Required: No.)
• onLateNotify: The email alarm to use when the object's run is late. (Type: object reference. Required: No.)
• onLateKill: Indicates whether all pending or unscheduled tasks should be killed if they are late. (Type: Boolean. Required: No.)
• lateAfterTimeout: The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late. (Type: Period. Required: No.)
• reportProgressTimeout: The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried. (Type: Period. Required: No.)
• attemptTimeout: The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken. (Type: Period. Required: No.)
• logUri: The location in Amazon S3 to store log files generated by Task Runner when performing work for this object. (Type: String. Required: No.)
• @reportProgressTime: The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API. (Type: DateTime (read-only). Required: No.)
• @activeInstances: Record of the currently scheduled instance objects. (Type: Schedulable object reference. Required: No.)
• @lastRun: The last run of the object. This is a runtime slot. (Type: DateTime (read-only). Required: No.)
• @scheduledPhysicalObjects: The currently scheduled instance objects. This is a runtime slot. (Type: Object reference (read-only). Required: No.)
• @status: The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed. (Type: String (read-only). Required: No.)
• @triesLeft: The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot. (Type: Integer (read-only). Required: No.)
• @actualStartTime: The date and time that the scheduled run actually started. This is a runtime slot. (Type: DateTime (read-only). Required: No.)
• @actualEndTime: The date and time that the scheduled run actually ended. This is a runtime slot. (Type: DateTime (read-only). Required: No.)
• errorCode: If the object failed, the error code. This is a runtime slot. (Type: String (read-only). Required: No.)
• errorMessage: If the object failed, the error message. This is a runtime slot. (Type: String (read-only). Required: No.)
• errorStackTrace: If the object failed, the error stack trace. This is a runtime slot. (Type: String (read-only). Required: No.)
• @scheduledStartTime: The date and time that the run was scheduled to start. (Type: DateTime. Required: No.)
• @scheduledEndTime: The date and time that the run was scheduled to end. (Type: DateTime. Required: No.)
• @componentParent: The component from which this instance is created. (Type: Object reference (read-only). Required: No.)
• @headAttempt: The latest attempt on the given instance. (Type: Object reference (read-only). Required: No.)
• @resource: The resource instance on which the given activity/precondition attempt is being run. (Type: Object reference (read-only). Required: No.)
• activityStatus: The status most recently reported from the activity. (Type: String. Required: No.)
Example
The following is an example of this object type.
{
"id" : "VerifyDataReadiness",
"type" : "ShellCommandPrecondition",
"command" : "perl check-data-ready.pl"
}
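The check-data-ready.pl script in the example is not shown in this guide. The following Python sketch is a hypothetical stand-in that illustrates the contract such a command follows: the process exit status tells Task Runner whether the precondition is satisfied (status 0) or not (any nonzero status). The file path and the specific readiness test are illustrative assumptions.

```python
import os

def data_ready(path):
    """Readiness test (an illustrative assumption): the input file
    exists and is non-empty."""
    return os.path.exists(path) and os.path.getsize(path) > 0

def main(argv):
    """Return the process exit status that Task Runner sees:
    0 when the precondition is satisfied, 1 when it is not yet satisfied."""
    path = argv[1] if len(argv) > 1 else "input.csv"
    return 0 if data_ready(path) else 1

# As a standalone script, this would end with:
#   import sys; sys.exit(main(sys.argv))
```

Because the precondition can be retried (see maximumRetries and retryDelay above), the command only needs to report the current state; it does not need to wait for the data itself.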
Exists
Checks whether a data node object exists.
Syntax
The following slots are included in all objects.
• id: The ID of the object. IDs must be unique within a pipeline definition. (Type: String. Required: Yes.)
• name: The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id. (Type: String. Required: No.)
• type: The type of object. Use one of the predefined AWS Data Pipeline object types. (Type: String. Required: Yes.)
• parent: The parent of the object. (Type: String. Required: No.)
• @sphere: The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt. (Type: String (read-only). Required: No.)
This object includes the following slots from the Precondition object.
• preconditionMaximumRetries: Specifies the maximum number of times that a precondition is retried. (Type: Integer. Required: No.)
• node: The activity or data node for which this precondition is being checked. This is a runtime slot. (Type: Object reference (read-only). Required: No.)
This object includes the following slots from
RunnableObject
.
Name
workerGroup retryDelay maximumRetries onLateNotify
Description
The worker group. This is used for routing tasks.
The timeout duration between two retry attempts. The default is 10 minutes.
Type
String
Period
The maximum number of times to retry the action.
Integer
The email alarm to use when the object's run is late.
object reference onLateKill
Indicates whether all pending or unscheduled tasks should be killed if they are late.
Boolean lateAfterTimeout
The period in which the object run must start.
If the activity does not start within the scheduled start time plus this time interval, it is considered late.
Period
Required
No
Yes
No
No
No
No
API Version 2012-10-29
195
AWS Data Pipeline Developer Guide
Exists
reportProgressTimeout (Period; optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period; optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String; optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime; optional): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference; optional): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only; optional): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only; optional): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only; optional): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only; optional): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only; optional): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only; optional): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only; optional): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only; optional): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only; optional): If the object failed, the error stack trace.
@scheduledStartTime (DateTime; optional): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime; optional): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only; optional): The component from which this instance is created.
@headAttempt (Object reference, read-only; optional): The latest attempt on the given instance.
@resource (Object reference, read-only; optional): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String; optional): The status most recently reported from the activity.
Example
The following is an example of this object type. The InputData object references this object, Ready, plus another object that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object.
{
  "id" : "InputData",
  "type" : "S3DataNode",
  "schedule" : {"ref" : "CopyPeriod"},
  "filePath" : "s3://test/InputData/#{@scheduledStartTime.format('YYYY-MM-dd-hh:mm')}.csv",
  "precondition" : {"ref" : "Ready"}
},
{
  "id" : "Ready",
  "type" : "Exists"
}
See Also
• ShellCommandPrecondition (p. 192)
S3KeyExists
Checks whether a key exists in an Amazon S3 data node.
Syntax
The following slots are included in all objects.

id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only; optional): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following slots from the Precondition object.

node (Object reference, read-only; optional): The activity or data node for which this precondition is being checked. This is a runtime slot.
preconditionMaximumRetries (Integer; optional): Specifies the maximum number of times that a precondition is retried.
This object includes the following fields.

s3Key (String; required): The Amazon S3 key to check for existence.
This object includes the following slots from RunnableObject.

workerGroup (String; optional): The worker group. This is used for routing tasks.
retryDelay (Period; required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer; optional): The maximum number of times to retry the action.
onLateNotify (Object reference; optional): The email alarm to use when the object's run is late.
onLateKill (Boolean; optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period; optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period; optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period; optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String; optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime; optional): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference; optional): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only; optional): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only; optional): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only; optional): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only; optional): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only; optional): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only; optional): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only; optional): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only; optional): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only; optional): If the object failed, the error stack trace.
@scheduledStartTime (DateTime; optional): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime; optional): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only; optional): The component from which this instance is created.
@headAttempt (Object reference, read-only; optional): The latest attempt on the given instance.
@resource (Object reference, read-only; optional): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String; optional): The status most recently reported from the activity.
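Example
The following is a sketch of how this object type might be defined. The identifier ReadyKey and the Amazon S3 bucket and key names are illustrative values, not values defined elsewhere in this guide.

{
  "id" : "ReadyKey",
  "type" : "S3KeyExists",
  "s3Key" : "s3://example-bucket/InputData/ready.trigger"
}

A data node or activity would then reference ReadyKey in its precondition field, in the same way that the Exists example references the Ready object.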
See Also
• ShellCommandPrecondition (p. 192)
S3PrefixNotEmpty
A precondition to check that the Amazon S3 objects with the given prefix (represented as a URI) are present.
Syntax
The following slots are included in all objects.

id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only; optional): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following slots from the Precondition object.

preconditionMaximumRetries (Integer; optional): Specifies the maximum number of times that a precondition is retried.
node (Object reference, read-only; optional): The activity or data node for which this precondition is being checked. This is a runtime slot.

This object includes the following fields.

s3Prefix (String; required): The Amazon S3 prefix to check for existence of objects.
This object includes the following slots from RunnableObject.

workerGroup (String; optional): The worker group. This is used for routing tasks.
retryDelay (Period; required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer; optional): The maximum number of times to retry the action.
onLateNotify (Object reference; optional): The email alarm to use when the object's run is late.
onLateKill (Boolean; optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period; optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period; optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period; optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String; optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime; optional): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference; optional): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only; optional): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only; optional): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only; optional): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only; optional): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only; optional): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only; optional): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only; optional): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only; optional): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only; optional): If the object failed, the error stack trace.
@scheduledStartTime (DateTime; optional): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime; optional): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only; optional): The component from which this instance is created.
@headAttempt (Object reference, read-only; optional): The latest attempt on the given instance.
@resource (Object reference, read-only; optional): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String; optional): The status most recently reported from the activity.
Example
The following is an example of this object type using required, optional, and expression fields.

{
  "id": "InputReady",
  "type": "S3PrefixNotEmpty",
  "role": "test-role",
  "s3Prefix": "#{node.filePath}"
}
See Also
• ShellCommandPrecondition (p. 192)
RdsSqlPrecondition
A precondition that executes a query to verify the readiness of data within Amazon RDS. Specified conditions are combined by a logical AND operation.
Important
You must grant Task Runner permission to access Amazon RDS when using an RdsSqlPrecondition, as described in Grant Amazon RDS Permissions to Task Runner (p. 23).
Syntax
The following slots are included in all objects.

id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only; optional): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following fields.

query (String; required): A valid SQL query that should return one row with one column (scalar value).
username (String; required): The user name to connect with.
*password (String; required): The password to connect with. The asterisk instructs AWS Data Pipeline to encrypt the password.
rdsInstanceId (String; required): The InstanceId to connect to.
database (String; required): The logical database to connect to.
equalTo (Integer; optional): This precondition is true if the value returned by the query is equal to this value.
lessThan (Integer; optional): This precondition is true if the value returned by the query is less than this value.
greaterThan (Integer; optional): This precondition is true if the value returned by the query is greater than this value.
isTrue (Boolean; optional): This precondition is true if the Boolean value returned by the query is equal to true.
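Example
The following is a sketch of how this object type might be defined. The identifier, the connection values, and the row-count query are illustrative values, not values defined elsewhere in this guide.

{
  "id" : "OrdersReady",
  "type" : "RdsSqlPrecondition",
  "rdsInstanceId" : "my-rds-instance",
  "database" : "mydb",
  "username" : "test-user",
  "*password" : "my-password",
  "query" : "SELECT COUNT(*) FROM orders",
  "greaterThan" : "0"
}

This precondition is satisfied when the scalar value returned by the query is greater than 0.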
DynamoDBTableExists
A precondition to check that an Amazon DynamoDB table exists.
Syntax
The following slots are included in all objects.

id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only; optional): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following fields.

tableName (String; required): The Amazon DynamoDB table to check.
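Example
The following is a sketch of how this object type might be defined. The identifier and table name are illustrative values, not values defined elsewhere in this guide.

{
  "id" : "TableReady",
  "type" : "DynamoDBTableExists",
  "tableName" : "example-table"
}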
DynamoDBDataExists
A precondition to check that data exists in an Amazon DynamoDB table.
Syntax
The following slots are included in all objects.

id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only; optional): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following fields.

tableName (String; required): The Amazon DynamoDB table to check.
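Example
The following is a sketch of how this object type might be defined. The identifier and table name are illustrative values, not values defined elsewhere in this guide.

{
  "id" : "DataInTableReady",
  "type" : "DynamoDBDataExists",
  "tableName" : "example-table"
}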
Ec2Resource
Represents the configuration of an Amazon EC2 instance resource pool that performs the work defined by a pipeline activity.
Syntax
The following slots are included in all objects.

id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only; optional): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following fields.

instanceType (String; optional): The type of EC2 instance to use for the resource pool. The default value is m1.small.
instanceCount (Integer; required): The number of instances to use for the resource pool. The default value is 1.
minInstanceCount (Integer; required): The minimum number of EC2 instances for the pool. The default value is 1.
securityGroups (String; optional): The EC2 security group to use for the instances in the resource pool.
imageId (String; optional): The AMI version to use for the EC2 instances. The default value is ami-1624987f, which we recommend using. For more information, see Amazon Machine Images (AMIs).
keyPair (String; optional): The Amazon EC2 key pair to use to log onto the EC2 instance. The default action is not to attach a key pair to the EC2 instance.
role (String; optional): The IAM role to use to create the EC2 instance.
resourceRole (String; optional): The IAM role to use to control the resources that the EC2 instance can access.

This object includes the following slots from the Resource object.

terminateAfter (Period; required): The number of hours to wait before terminating the resource.
@resourceId (String; required): The unique identifier for the resource.
@resourceStatus (String; optional): The current status of the resource, such as checking_preconditions, creating, shutting_down, running, failed, timed_out, cancelled, or paused.
@failureReason (String; optional): The reason for the failure to create the resource.
@resourceCreationTime (DateTime; optional): The time when this resource was created.

This object includes the following slots from RunnableObject.

workerGroup (String; optional): The worker group. This is used for routing tasks.
retryDelay (Period; required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer; optional): The maximum number of times to retry the action.
onLateNotify (Object reference; optional): The email alarm to use when the object's run is late.
onLateKill (Boolean; optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period; optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period; optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period; optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String; optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime; optional): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference; optional): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only; optional): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only; optional): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only; optional): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only; optional): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only; optional): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only; optional): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only; optional): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only; optional): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only; optional): If the object failed, the error stack trace.
@scheduledStartTime (DateTime; optional): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime; optional): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only; optional): The component from which this instance is created.
@headAttempt (Object reference, read-only; optional): The latest attempt on the given instance.
@resource (Object reference, read-only; optional): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String; optional): The status most recently reported from the activity.
This object includes the following slots from SchedulableObject.

schedule (Object reference; optional): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object.
scheduleType (String; optional): Specifies whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time-series-style scheduling means that instances are scheduled at the end of each interval, and cron-style scheduling means that instances are scheduled at the beginning of each interval. Allowed values are "cron" and "timeseries". The default value is "timeseries".
runsOn (Resource object reference; optional): The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster.
Example
The following is an example of this object type. It launches an EC2 instance and shows some optional fields set.
{
  "id": "MyEC2Resource",
  "type": "Ec2Resource",
  "actionOnTaskFailure": "terminate",
  "actionOnResourceFailure": "retryAll",
  "maximumRetries": "1",
  "role": "test-role",
  "resourceRole": "test-role",
  "instanceType": "m1.medium",
  "instanceCount": "1",
  "securityGroups": [
    "test-group",
    "default"
  ],
  "keyPair": "test-pair"
}
EmrCluster
Represents the configuration of an Amazon EMR job flow. This object is used to launch a job flow.
Syntax
The following slots are included in all objects.

id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only; optional): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following fields.

coreInstanceType (String; optional): The type of EC2 instance to use for core nodes. The default value is m1.small.
masterInstanceType (String; optional): The type of EC2 instance to use for the master node. The default value is m1.small.
taskInstanceType (String; optional): The type of EC2 instance to use for task nodes.
coreInstanceCount (String; optional): The number of core nodes to use for the job flow. The default value is 1.
taskInstanceCount (String; optional): The number of task nodes to use for the job flow. The default value is 1.
keyPair (String; optional): The Amazon EC2 key pair to use to log onto the master node of the job flow. The default action is not to attach a key pair to the job flow.
hadoopVersion (String; optional): The version of Hadoop to use in the job flow. The default value is 0.20. For more information about the Hadoop versions supported by Amazon EMR, see Supported Hadoop Versions.
bootstrapAction (String array; optional): An action to run when the job flow starts. You can specify comma-separated arguments. To specify multiple actions, up to 255, add multiple bootstrapAction fields. The default behavior is to start the job flow without any bootstrap actions.
enableDebugging (String; optional): Enables debugging on the job flow.
logUri (String; optional): The location in Amazon S3 to store log files from the job flow.

This object includes the following slots from the Resource object.

terminateAfter (Period; required): The number of hours to wait before terminating the resource.
@resourceId (String; required): The unique identifier for the resource.
@resourceStatus (String; optional): The current status of the resource, such as checking_preconditions, creating, shutting_down, running, failed, timed_out, cancelled, or paused.
@failureReason (String; optional): The reason for the failure to create the resource.
@resourceCreationTime (DateTime; optional): The time when this resource was created.
This object includes the following slots from RunnableObject.

workerGroup (String; optional): The worker group. This is used for routing tasks.
retryDelay (Period; required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer; optional): The maximum number of times to retry the action.
onLateNotify (Object reference; optional): The email alarm to use when the object's run is late.
onLateKill (Boolean; optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period; optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period; optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period; optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String; optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime; optional): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference; optional): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only; optional): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only; optional): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only; optional): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only; optional): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only; optional): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only; optional): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only; optional): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only; optional): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only; optional): If the object failed, the error stack trace.
@scheduledStartTime (DateTime; optional): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime; optional): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only; optional): The component from which this instance is created.
@headAttempt (Object reference, read-only; optional): The latest attempt on the given instance.
@resource (Object reference, read-only; optional): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String; optional): The status most recently reported from the activity.
This object includes the following slots from SchedulableObject.

schedule (Object reference; optional): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object.
scheduleType (String; optional): Specifies whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time-series-style scheduling means that instances are scheduled at the end of each interval, and cron-style scheduling means that instances are scheduled at the beginning of each interval. Allowed values are "cron" and "timeseries". The default value is "timeseries".
runsOn (Resource object reference; optional): The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster.
Example
The following is an example of this object type. It launches an Amazon EMR job flow using AMI version 1.0 and Hadoop 0.20.

{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
  "hadoopVersion" : "0.20",
  "keyPair" : "myKeyPair",
  "masterInstanceType" : "m1.xlarge",
  "coreInstanceType" : "m1.small",
  "coreInstanceCount" : "10",
  "taskInstanceType" : "m1.small",
  "taskInstanceCount": "10",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-hadoop,arg1,arg2,arg3",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-otherstuff,arg1,arg2"
}
See Also
•
SnsAlarm
Sends an Amazon SNS notification message when an activity fails or finishes successfully.
Syntax
The following slots are included in all objects.

id (String; required): The ID of the object. IDs must be unique within a pipeline definition.
name (String; optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String; required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String; optional): The parent of the object.
@sphere (String, read-only; optional): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.
This object includes the following fields.

subject (String; required): The subject line of the Amazon SNS notification message.
message (String; required): The body text of the Amazon SNS notification.
topicArn (String; required): The destination Amazon SNS topic ARN for the message.
This object includes the following slots from the Action object.

node (Object reference, read-only; optional): The node for which this action is being performed. This is a runtime slot.
Example
The following is an example of this object type. The values for node.input
and node.output
come from the data node or activity that references this object in its onSuccess
field.
{
"id" : "SuccessNotify",
"type" : "SnsAlarm",
"topicArn" : "arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic",
"subject" : "COPY SUCCESS: #{node.@scheduledStartTime}",
"message" : "Files were copied from #{node.input} to #{node.output}."
}
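Expressions such as #{node.input} are resolved by AWS Data Pipeline at runtime from the referencing object's slots. As an illustrative sketch only (not the service's actual implementation), the substitution can be modeled in a few lines of Python; the slot values below are hypothetical:

```python
import re

def resolve(template, slots):
    # Replace each #{name} reference with its value from the slot table;
    # unknown references are left untouched.
    return re.sub(r"#\{([^}]+)\}",
                  lambda m: str(slots.get(m.group(1), m.group(0))),
                  template)

message = resolve(
    "Files were copied from #{node.input} to #{node.output}.",
    {"node.input": "s3://example-bucket/in",     # hypothetical values
     "node.output": "s3://example-bucket/out"})
```

This is only a mental model for how runtime slot references are filled in before the notification is sent.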
Command Line Reference
Before you read this section, you should be familiar with Using the Command Line Interface (p. 121).
This section is a detailed reference of the AWS Data Pipeline command line interface (CLI) commands and parameters to interact with AWS Data Pipeline.
You can combine commands on a single command line. Commands are processed from left to right. You can use the --create and --id commands anywhere on the command line, but not together, and not more than once.
--cancel
Description
Cancels one or more specified objects from within a pipeline that is either currently running or ran previously.
To see the status of the canceled pipeline object, use --list-runs.
Syntax
On Linux/Unix/Mac OS:
./datapipeline --cancel object_id --id pipeline_id [Common Options]
On Windows:
ruby datapipeline --cancel object_id --id pipeline_id [Common Options]
Options
• object_id: The identifier of the object to cancel. You can specify the identifier of a single object, or a comma-separated list of object identifiers. Example: o-06198791C436IEXAMPLE. Required: Yes.
• --id pipeline_id: The identifier of the pipeline. Example: --id df-00627471SOVYZEXAMPLE. Required: Yes.
Common Options
For more information, see Common Options for AWS Data Pipeline Commands (p. 229).
Output
None.
Examples
The following example demonstrates how to list the objects of a previously run or currently running pipeline.
Next, the example cancels an object of the pipeline. Finally, the example lists the results of the canceled object.
On Linux/Unix/Mac OS:
./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE
./datapipeline --id df-00627471SOVYZEXAMPLE --cancel o-06198791C436IEXAMPLE
./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE
On Windows:
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE
ruby datapipeline --id df-00627471SOVYZEXAMPLE --cancel o-06198791C436IEXAMPLE
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE
--create
Description
Creates a data pipeline with the specified name, but does not activate the pipeline.
There is a limit of 20 pipelines per AWS account.
To specify a pipeline definition file at the time that you create the pipeline, use this command with the --put command.
Syntax
On Linux/Unix/Mac OS:
./datapipeline --create name [Common Options]
On Windows:
ruby datapipeline --create name [Common Options]
Options
• name: The name of the pipeline. Example: my-pipeline. Required: Yes.
Common Options
For more information, see Common Options for AWS Data Pipeline Commands (p. 229).
Output
Pipeline with name 'name' and id 'df-xxxxxxxxxxxxxxxxxxxx' created.
df-xxxxxxxxxxxxxxxxxxxx

The identifier of the newly created pipeline (df-xxxxxxxxxxxxxxxxxxxx). You must specify this identifier with the --id command whenever you issue a command that operates on the corresponding pipeline.
Examples
The following example creates the first pipeline without specifying a pipeline definition file, and creates the second pipeline with a pipeline definition file.
On Linux/Unix/Mac OS:
./datapipeline --create my-first-pipeline
./datapipeline --create my-second-pipeline --put my-pipeline-file.json
On Windows:
ruby datapipeline --create my-first-pipeline
ruby datapipeline --create my-second-pipeline --put my-pipeline-file.json
--delete
Description
Stops the specified data pipeline, and cancels its future runs.
This command removes the pipeline definition file and run history. This action is irreversible; you can't restart a deleted pipeline.
Syntax
On Linux/Unix/Mac OS:
./datapipeline --delete --id pipeline_id [Common Options]
On Windows:
ruby datapipeline --delete --id pipeline_id [Common Options]
Options
• --id pipeline_id: The identifier of the pipeline. Example: --id df-00627471SOVYZEXAMPLE. Required: Yes.
Common Options
For more information, see Common Options for AWS Data Pipeline Commands (p. 229).
Output
State of pipeline id 'df-xxxxxxxxxxxxxxxxxxxx' is currently 'state'
Deleted pipeline 'df-xxxxxxxxxxxxxxxxxxxx'
A message indicating that the pipeline was successfully deleted.
Examples
The following example deletes the pipeline with the identifier df-00627471SOVYZEXAMPLE.
On Linux/Unix/Mac OS:
./datapipeline --delete --id df-00627471SOVYZEXAMPLE
On Windows:
ruby datapipeline --delete --id df-00627471SOVYZEXAMPLE
--get, --g
Description
Gets the pipeline definition file for the specified data pipeline and saves it to a file. If no file is specified, the file contents are written to standard output.
Syntax
On Linux/Unix/Mac OS:
./datapipeline --get pipeline_definition_file --id pipeline_id --version pipeline_version [Common Options]
On Windows:
ruby datapipeline --get pipeline_definition_file --id pipeline_id --version pipeline_version [Common Options]
Options
• --id pipeline_id: The identifier of the pipeline. Example: --id df-00627471SOVYZEXAMPLE. Required: Yes.
• pipeline_definition_file: The full path to the output file that receives the pipeline definition. Default: standard output. Example: my-pipeline.json. Required: No.
• --version pipeline_version: The version of the pipeline to retrieve. Example: --version active. Example: --version latest. Required: No.
Common Options
For more information, see Common Options for AWS Data Pipeline Commands (p. 229).
Output
If an output file is specified, the output is a pipeline definition file; otherwise, the contents of the pipeline definition are written to standard output.
Examples
The first command writes the pipeline definition to standard output (usually the terminal screen), and the second command writes the pipeline definition to the file my-pipeline.json.
On Linux/Unix/Mac OS:
./datapipeline --get --id df-00627471SOVYZEXAMPLE
./datapipeline --get my-pipeline.json --id df-00627471SOVYZEXAMPLE
On Windows:
ruby datapipeline --get --id df-00627471SOVYZEXAMPLE
ruby datapipeline --get my-pipeline.json --id df-00627471SOVYZEXAMPLE
--help, --h
Description
Displays information about the commands provided by the CLI.
Syntax
On Linux/Unix/Mac OS:
./datapipeline --help
On Windows:
ruby datapipeline --help
Options
None.
Output
A list of the commands used by the CLI, printed to standard output (typically the terminal window).
--list-pipelines
Description
Lists the pipelines that you have permission to access.
Syntax
On Linux/Unix/Mac OS:
./datapipeline --list-pipelines
On Windows:
ruby datapipeline --list-pipelines
Options
None.
--list-runs
Description
Lists the times the specified pipeline has run. You can optionally filter the complete list of results to include only the runs you are interested in.
Syntax
On Linux/Unix/Mac OS:
./datapipeline --list-runs --id pipeline_id [filter] [Common Options]
On Windows:
ruby datapipeline --list-runs --id pipeline_id [filter] [Common Options]
Options
• --id pipeline_id: The identifier of the pipeline. Required: Yes.
• --status code: Filters the list to include only runs with the specified status. The valid statuses are: waiting, pending, cancelled, running, finished, failed, waiting_for_runner, and checking_preconditions. Example: --status running. You can combine statuses as a comma-separated list. Example: --status pending,checking_preconditions. Required: No.
• --failed: Filters the list to include only runs in the failed state that started during the last 2 days and were scheduled to end within the last 15 days. Required: No.
• --running: Filters the list to include only runs in the running state that started during the last 2 days and were scheduled to end within the last 15 days. Required: No.
• --start-interval date1,date2: Filters the list to include only runs that started within the specified interval. Required: No.
• --schedule-interval date1,date2: Filters the list to include only runs that are scheduled to start within the specified interval. Required: No.
Common Options
For more information, see Common Options for AWS Data Pipeline Commands (p. 229).
Output
A list of the times the specified pipeline has run and the status of each run. You can filter this list by the options you specify when you run the command.
Examples
The first command lists all the runs for the specified pipeline. The other commands show how to filter the complete list of runs using different options.
On Linux/Unix/Mac OS:
./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE
./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --status PENDING
./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --start-interval 2011-11-29T06:07:21,2011-12-06T06:07:21
./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --schedule-interval 2011-11-29T06:07:21,2011-12-06T06:07:21
./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --failed
./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --running
On Windows:
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --status PENDING
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --start-interval 2011-11-29T06:07:21,2011-12-06T06:07:21
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --schedule-interval 2011-11-29T06:07:21,2011-12-06T06:07:21
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --failed
ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --running
--put
Description
Uploads a pipeline definition file to AWS Data Pipeline for a new or existing pipeline, but does not activate the pipeline. Use the --activate parameter in a separate command when you want the pipeline to begin.
To specify a pipeline definition file at the time that you create the pipeline, use this command with the --create command.
Syntax
On Linux/Unix/Mac OS:
./datapipeline --put pipeline_definition_file --id pipeline_id [Common Options]
On Windows:
ruby datapipeline --put pipeline_definition_file --id pipeline_id [Common Options]
Options
• pipeline_definition_file: The name of the pipeline definition file. Example: pipeline-definition-file.json. Required: Yes.
• --id pipeline_id: The identifier of the pipeline. You must specify the identifier of the pipeline when updating an existing pipeline with a new pipeline definition file. Example: --id df-00627471SOVYZEXAMPLE. Required: Conditional.
Common Options
For more information, see Common Options for AWS Data Pipeline Commands (p. 229).
Output
A response indicating that the new definition was successfully loaded, or, in the case where you are also using the --activate command, an indication that the pipeline was successfully activated.
Examples
The following examples show how to use --put to create a new pipeline (example one) and how to use --put and --id to add a definition file to a pipeline (example two) or update a preexisting pipeline definition file of a pipeline (example three).
On Linux/Unix/Mac OS:
./datapipeline --create my-pipeline --put my-pipeline-definition.json
./datapipeline --id df-00627471SOVYZEXAMPLE --put a-pipeline-definition.json
./datapipeline --id df-00627471SOVYZEXAMPLE --put my-updated-pipeline-definition.json
On Windows:
ruby datapipeline --create my-pipeline --put my-pipeline-definition.json
ruby datapipeline --id df-00627471SOVYZEXAMPLE --put a-pipeline-definition.json
ruby datapipeline --id df-00627471SOVYZEXAMPLE --put my-updated-pipeline-definition.json
--activate
Description
Starts a new or existing pipeline.
To upload a pipeline definition file at the time that you activate the pipeline, use this command with the --put command.
Syntax
On Linux/Unix/Mac OS:
./datapipeline --activate --id pipeline_id [Common Options]
On Windows:
ruby datapipeline --activate --id pipeline_id [Common Options]
Options
• --id pipeline_id: The identifier of the pipeline. Example: --id df-00627471SOVYZEXAMPLE. Required: Yes.
Common Options
For more information, see Common Options for AWS Data Pipeline Commands (p. 229).
Output
A response indicating that the pipeline was successfully activated.
Examples
The following example activates the pipeline with the identifier df-00627471SOVYZEXAMPLE.
On Linux/Unix/Mac OS:
./datapipeline --activate --id df-00627471SOVYZEXAMPLE
On Windows:
ruby datapipeline --activate --id df-00627471SOVYZEXAMPLE
--rerun
Description
Reruns one or more specified objects from within a pipeline that is either currently running or has previously run. Resets the retry count of the object and then runs the object. It also tries to cancel the current attempt if an attempt is running.
Syntax
On Linux/Unix/Mac OS:
./datapipeline --rerun object_id --id pipeline_id [Common Options]
On Windows:
ruby datapipeline --rerun object_id --id pipeline_id [Common Options]
Note
object_id can be a comma-separated list.
Options
• object_id: The identifier of the object. Example: o-06198791C436IEXAMPLE. Required: Yes.
• --id pipeline_id: The identifier of the pipeline. Example: --id df-00627471SOVYZEXAMPLE. Required: Yes.
Common Options
For more information, see Common Options for AWS Data Pipeline Commands (p. 229).
Output
None. To see the status of the object set to rerun, use --list-runs.
Examples
Reruns the specified object in the indicated pipeline.
On Linux/Unix/Mac OS:
./datapipeline --rerun o-06198791C436IEXAMPLE --id df-00627471SOVYZEXAMPLE
On Windows:
ruby datapipeline --rerun o-06198791C436IEXAMPLE --id df-00627471SOVYZEXAMPLE
--validate
Description
Validates the pipeline definition for correct syntax. Also performs additional checks, such as a check for circular dependencies.
Syntax
On Linux/Unix/Mac OS:
./datapipeline --validate pipeline_definition_file
On Windows:
ruby datapipeline --validate pipeline_definition_file
Options
• pipeline_definition_file: The full path to the pipeline definition file to validate. Example: my-pipeline.json. Required: Yes.
Common Options for AWS Data Pipeline Commands
The following set of options is accepted by most of the commands described in this guide.

• --access-key aws_access_key: The access key ID associated with your AWS account. If you specify --access-key, you must also specify --secret-key. This option is required if you aren't using a JSON credentials file (see --credentials). Example: --access-key AKIAIOSFODNN7EXAMPLE. For more information, see Setting Credentials for the AWS Data Pipeline Command Line Interface. Required: Conditional.

• --credentials json_file: The location of the JSON file with your AWS credentials. You don't need to set this option if the JSON file is named credentials.json and it exists in either your user home directory or the directory where the AWS Data Pipeline CLI is installed; the CLI automatically finds the JSON file if it exists in either location. If you specify a credentials file (either using this option or by including credentials.json in one of its two supported locations), you don't need to use the --access-key and --secret-key options. Example: TBD. For more information, see Setting Credentials for the AWS Data Pipeline Command Line Interface. Required: Conditional.

• --endpoint url: The URL of the AWS Data Pipeline endpoint that the CLI should use to contact the web service. If you specify an endpoint both in a JSON file and with this command line option, the CLI ignores the endpoint set with this command line option. Example: TBD.

• --id pipeline_id: Use the specified pipeline identifier. Example: --id df-00627471SOVYZEXAMPLE.

• --limit limit: The field limit for the pagination of objects. Example: TBD.

• --secret-key aws_secret_key: The secret access key associated with your AWS account. If you specify --secret-key, you must also specify --access-key. This option is required if you aren't using a JSON credentials file (see --credentials). Example: --secret-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY. For more information, see Setting Credentials for the AWS Data Pipeline Command Line Interface. Required: Conditional.

• --timeout seconds: The number of seconds for the AWS Data Pipeline client to wait before timing out the HTTP connection to the AWS Data Pipeline web service. Example: --timeout 120. Required: No.

• --t, --trace: Prints detailed debugging output. Required: No.

• --v, --verbose: Prints verbose output. This is useful for debugging. Required: No.
Program AWS Data Pipeline
Topics
• Make an HTTP Request to AWS Data Pipeline (p. 231)
• Actions in AWS Data Pipeline (p. 234)
Make an HTTP Request to AWS Data Pipeline
If you don't use one of the AWS SDKs, you can perform AWS Data Pipeline operations over HTTP using the POST request method. The POST method requires you to specify the operation in the header of the request and provide the data for the operation in JSON format in the body of the request.
HTTP Header Contents
AWS Data Pipeline requires the following information in the header of an HTTP request:
• host — The AWS Data Pipeline endpoint. For information about endpoints, see Regions and Endpoints.
• x-amz-date — You must provide the time stamp in either the HTTP Date header or the AWS x-amz-date header. (Some HTTP client libraries don't let you set the Date header.) When an x-amz-date header is present, the system ignores any Date header during the request authentication. The date must be specified in one of the following three formats, as specified in the HTTP/1.1 RFC:
  • Sun, 06 Nov 1994 08:49:37 GMT (RFC 822, updated by RFC 1123)
  • Sunday, 06-Nov-94 08:49:37 GMT (RFC 850, obsoleted by RFC 1036)
  • Sun Nov 6 08:49:37 1994 (ANSI C asctime() format)
• Authorization — The set of authorization parameters that AWS uses to ensure the validity and authenticity of the request. For more information about constructing this header, go to Signature Version 4 Signing Process.
• x-amz-target — The destination service of the request and the operation for the data, in the format <<serviceName>>_<<API version>>.<<operationName>>. For example, DataPipeline_20121129.ActivatePipeline.
• content-type — Specifies JSON and the version. For example, Content-Type: application/x-amz-json-1.0.
The following is an example header for an HTTP request to activate a pipeline.
POST / HTTP/1.1
host: datapipeline.us-east-1.amazonaws.com
x-amz-date: Mon, 12 Nov 2012 17:49:52 GMT
x-amz-target: DataPipeline_20121129.ActivatePipeline
Authorization: AuthParams
Content-Type: application/x-amz-json-1.1
Content-Length: 39
Connection: Keep-Alive
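These headers can also be assembled programmatically. The following Python sketch builds the header dictionary for an ActivatePipeline call; the Authorization value is a placeholder, because a real request requires a computed Signature Version 4 signature:

```python
from email.utils import formatdate

def build_headers(region, operation, body):
    # Assemble the headers AWS Data Pipeline expects; the caller still has
    # to replace the Authorization placeholder with a real SigV4 value.
    return {
        "host": "datapipeline.%s.amazonaws.com" % region,
        "x-amz-date": formatdate(usegmt=True),            # RFC 1123 date
        "x-amz-target": "DataPipeline_20121129.%s" % operation,
        "Authorization": "AuthParams",                    # placeholder
        "Content-Type": "application/x-amz-json-1.1",
        "Content-Length": str(len(body)),
    }

headers = build_headers("us-east-1", "ActivatePipeline",
                        b'{"pipelineId": "df-06372391ZG65EXAMPLE"}')
```

This sketch only assembles the headers; it does not send the request or sign it.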
HTTP Body Content
The body of an HTTP request contains the data for the operation specified in the header of the HTTP request. The data must be formatted according to the JSON data schema for each AWS Data Pipeline API. The AWS Data Pipeline JSON data schema defines the types of data and parameters (such as comparison operators and enumeration constants) available for each operation.
Format the Body of an HTTP request
Use the JSON data format to convey data values and data structure simultaneously. Elements can be nested within other elements by using bracket notation. The following example shows a request for putting a pipeline definition consisting of three objects and their corresponding slots.
{
  "pipelineId": "df-06372391ZG65EXAMPLE",
  "pipelineObjects": [
    {
      "id": "Default",
      "name": "Default",
      "slots": [
        { "key": "workerGroup", "stringValue": "MyWorkerGroup" }
      ]
    },
    {
      "id": "Schedule",
      "name": "Schedule",
      "slots": [
        { "key": "startDateTime", "stringValue": "2012-09-25T17:00:00" },
        { "key": "type", "stringValue": "Schedule" },
        { "key": "period", "stringValue": "1 hour" },
        { "key": "endDateTime", "stringValue": "2012-09-25T18:00:00" }
      ]
    },
    {
      "id": "SayHello",
      "name": "SayHello",
      "slots": [
        { "key": "type", "stringValue": "ShellCommandActivity" },
        { "key": "command", "stringValue": "echo hello" },
        { "key": "parent", "refValue": "Default" },
        { "key": "schedule", "refValue": "Schedule" }
      ]
    }
  ]
}
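Rather than writing this JSON by hand, a client can assemble the request body programmatically. A minimal sketch, reusing values from the example above (the slot helper is a hypothetical convenience, not part of any AWS SDK):

```python
import json

def slot(key, value, ref=False):
    # Each slot is a key plus either a stringValue or a refValue.
    return {"key": key, ("refValue" if ref else "stringValue"): value}

body = {
    "pipelineId": "df-06372391ZG65EXAMPLE",
    "pipelineObjects": [
        {"id": "Default", "name": "Default",
         "slots": [slot("workerGroup", "MyWorkerGroup")]},
        {"id": "SayHello", "name": "SayHello",
         "slots": [slot("type", "ShellCommandActivity"),
                   slot("command", "echo hello"),
                   slot("parent", "Default", ref=True)]},
    ],
}
payload = json.dumps(body)   # the string sent as the HTTP request body
```

The serialized payload is what goes into the body of the PutPipelineDefinition POST request.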
Handle the HTTP Response
Here are some important headers in the HTTP response, and how you should handle them in your application:
• HTTP/1.1—This header is followed by a status code. A code value of 200 indicates a successful operation. Any other value indicates an error.
• x-amzn-RequestId—This header contains a request ID that you can use if you need to troubleshoot a request with AWS Data Pipeline. An example of a request ID is K2QH8DNOU907N97FNA2GDLL8OBVV4KQNSO5AEMVJF66Q9ASUAAJG.
• x-amz-crc32—AWS Data Pipeline calculates a CRC32 checksum of the HTTP payload and returns this checksum in the x-amz-crc32 header. We recommend that you compute your own CRC32 checksum on the client side and compare it with the x-amz-crc32 header; if the checksums do not match, it might indicate that the data was corrupted in transit. If this happens, you should retry your request.
AWS SDK users do not need to manually perform this verification, because the SDKs compute the checksum of each reply from AWS Data Pipeline and automatically retry if a mismatch is detected.
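For clients outside the SDKs, the checksum comparison is straightforward; a sketch in Python:

```python
import zlib

def crc32_matches(payload, header_value):
    # zlib.crc32 can return a signed value on older Python versions,
    # so mask to an unsigned 32-bit integer before comparing against
    # the decimal value carried in the x-amz-crc32 header.
    return (zlib.crc32(payload) & 0xFFFFFFFF) == int(header_value)

body = b'{"pipelineId": "df-06372391ZG65EXAMPLE"}'
expected = str(zlib.crc32(body) & 0xFFFFFFFF)   # what x-amz-crc32 would carry
```

If crc32_matches returns False, treat the response as corrupted in transit and retry the request.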
Sample AWS Data Pipeline JSON Request and Response
The following example shows a request for creating a new pipeline, followed by the AWS Data Pipeline response, which includes the pipeline identifier of the newly created pipeline.
HTTP POST Request
POST / HTTP/1.1
host: datapipeline.us-east-1.amazonaws.com
x-amz-date: Mon, 12 Nov 2012 17:49:52 GMT
x-amz-target: DataPipeline_20121129.CreatePipeline
Authorization: AuthParams
Content-Type: application/x-amz-json-1.1
Content-Length: 50
Connection: Keep-Alive

{"name": "MyPipeline",
 "uniqueId": "12345ABCDEFG"}
AWS Data Pipeline Response
HTTP/1.1 200
x-amzn-RequestId: b16911ce-0774-11e2-af6f-6bc7a6be60d9
x-amz-crc32: 2215946753
Content-Type: application/x-amz-json-1.0
Content-Length: 2
Date: Mon, 16 Jan 2012 17:50:53 GMT
{"pipelineId": "df-06372391ZG65EXAMPLE"}
Actions in AWS Data Pipeline
• ActivatePipeline
• CreatePipeline
• DeletePipeline
• DescribeObjects
• DescribePipelines
• GetPipelineDefinition
• ListPipelines
• PollForTask
• PutPipelineDefinition
• QueryObjects
• ReportTaskProgress
• SetStatus
• SetTaskStatus
• ValidatePipelineDefinition
AWS Task Runner Reference
Topics
• Install Task Runner (p. 235)
• Start Task Runner (p. 235)
• Verify Task Runner (p. 236)
• Setting Credentials for Task Runner (p. 236)
• Task Runner Threading (p. 236)
• Long Running Preconditions (p. 236)
• Task Runner Configuration Options (p. 237)
Task Runner is a task agent application that polls AWS Data Pipeline for scheduled tasks and executes them on Amazon EC2 instances, Amazon EMR clusters, or other computational resources, reporting status as it does so. Depending on your application, you may choose to:
• Have AWS Data Pipeline install and manage one or more Task Runner applications for you on computational resources managed by the web service. In this case, you do not need to install or configure Task Runner.
• Manually install and configure Task Runner on a computational resource such as a long-running EC2 instance or a physical server. To do so, use the following procedures.
• Manually install and configure a custom task agent instead of Task Runner. The procedures for doing so will depend on the implementation of the custom task agent.
Install Task Runner
To install Task Runner, download TaskRunner-1.0.jar from Task Runner download and copy it into a folder. Additionally, download mysql-connector-java-5.1.18-bin.jar from http://dev.mysql.com/usingmysql/java/ and copy it into the same folder where you install Task Runner.
Start Task Runner
In a new command prompt window that is set to the directory where you installed Task Runner, start Task Runner with the following command.
API Version 2012-10-29
235
AWS Data Pipeline Developer Guide
Verify Task Runner
Warning
If you close the terminal window, or interrupt the command with CTRL+C, Task Runner stops, which halts the pipeline runs.
java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=myWorkerGroup
The --config option points to your credentials file. The --workerGroup option specifies the name of your worker group, which must be the same value as specified in your pipeline for tasks to be processed.
When Task Runner is active, it prints the path to where log files are written in the terminal window. The following is an example.
Logging to /myComputerName/.../dist/output/logs
Verify Task Runner
The easiest way to verify that Task Runner is working is to check whether it is writing log files. The log files are stored in the directory where you started Task Runner.
When you check the logs, make sure that you are checking logs for the current date and time. Task Runner creates a new log file each hour, where the hour from midnight to 1am is 00. The format of the log file name is TaskRunner.log.YYYY-MM-DD-HH, where HH runs from 00 to 23, in UTC. To save storage space, any log files older than eight hours are compressed with GZip.
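The log file name for any given moment follows directly from that naming rule; a small sketch:

```python
from datetime import datetime, timezone

def log_file_name(when=None):
    # Task Runner rolls its log hourly, named in UTC with a zero-padded hour.
    when = when or datetime.now(timezone.utc)
    return when.strftime("TaskRunner.log.%Y-%m-%d-%H")
```

For example, a task run shortly after midnight UTC on November 12, 2012 would be logged to the file for hour 00 of that day.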
Setting Credentials for Task Runner
In order to connect to the AWS Data Pipeline web service to process your commands, configure Task Runner with an AWS account that has permissions to create and/or manage data pipelines.
To Set Your Credentials Implicitly with a JSON File
• Create a JSON file named credentials.json in the directory where you installed Task Runner. For information about what to include in the JSON file, see Create a Credentials File (p. 18).
Task Runner Threading
Task Runner activities and preconditions are single threaded. This means it can handle one work item per thread. By default, it has one activity thread and two precondition threads. If you are installing Task Runner and believe it will need to handle more than that at a single time, you need to increase the --activities and --preconditions values.
Long Running Preconditions
For performance reasons, pipeline retry logic for preconditions happens in Task Runner, and AWS Data Pipeline supplies Task Runner with preconditions only once per 30-minute period. Task Runner honors the retryDelay field that you define on preconditions. You can configure the preconditionTimeout slot to limit the precondition retry period.
API Version 2012-10-29
236
AWS Data Pipeline Developer Guide
Task Runner Configuration Options
Task Runner Configuration Options
These are the configuration options available from the command line when you launch Task Runner.
• --help: Displays command line help.
• --config: Path and file name of your credentials.json file.
• --accessId: The AWS access ID for Task Runner to use when making requests.
• --secretKey: The AWS secret key for Task Runner to use when making requests.
• --endpoint: The AWS Data Pipeline service endpoint to use.
• --workerGroup: The name of the worker group that Task Runner will retrieve work for.
• --output: The Task Runner directory for output files.
• --log: The Task Runner directory for local log files. If it is not absolute, it will be relative to output. Default is 'logs'.
• --staging: The Task Runner directory for staging files. If it is not absolute, it will be relative to output. Default is 'staging'.
• --temp: The Task Runner directory for temporary files. If it is not absolute, it will be relative to output. Default is 'tmp'.
• --activities: Number of activity threads to run simultaneously. Defaults to 1.
• --preconditions: Number of precondition threads to run simultaneously. Defaults to 2.
• --pcSuffix: The suffix to use for preconditions. Defaults to "precondition".
Web Service Limits
To ensure there is capacity for all users of the AWS Data Pipeline service, the web service imposes limits on the amount of resources you can allocate and the rate at which you can allocate them.
Account Limits
The following limits apply to a single AWS account. If you require additional capacity, you can contact Amazon Web Services to increase your capacity.

• Number of pipelines: 20 (adjustable: yes)
• Number of pipeline components per pipeline: 50 (adjustable: yes)
• Number of fields per pipeline component: 50 (adjustable: yes)
• Number of UTF8 bytes per field name or identifier: 256 (adjustable: yes)
• Number of UTF8 bytes per field: 10240 (adjustable: no)
• Number of UTF8 bytes per pipeline component: 15,360, including the names of fields (adjustable: no)
• Rate of creation of an instance from a pipeline component: 1 per 5 minutes (adjustable: no)
• Number of running instances of a pipeline component: 5 (adjustable: yes)
• Retries of a pipeline activity: 5 per task (adjustable: no)
• Minimum delay between retry attempts: 2 minutes (adjustable: no)
• Minimum scheduling interval: 15 minutes (adjustable: no)
• Maximum number of rollups into a single object: 32 (adjustable: no)
Web Service Call Limits
AWS Data Pipeline limits the rate at which you can call the web service API. These limits also apply to AWS Data Pipeline agents that call the web service API on your behalf, such as the console, CLI, and Task Runner.
The following limits apply to a single AWS account. This means the total usage on the account, including that by IAM users, cannot exceed these limits.
The burst rate lets you save up web service calls during periods of inactivity and expend them all in a short amount of time. For example, CreatePipeline has a regular rate of 1 call each 5 seconds. If you don't call the service for 30 seconds, you will have 6 calls saved up. You could then call the web service 6 times in a second. Because this is below the burst limit and keeps your average calls at the regular rate limit, your calls are not throttled.
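The save-up behavior described above is a classic token bucket. The following Python sketch is an illustration of the limiting scheme, not AWS's actual implementation; it models CreatePipeline's limits of a 10-call burst refilled at 1 call per 5 seconds:

```python
class TokenBucket:
    def __init__(self, burst=10, seconds_per_call=5.0):
        self.capacity = float(burst)          # burst limit
        self.tokens = float(burst)
        self.rate = 1.0 / seconds_per_call    # regular rate, in tokens/second
        self.last = 0.0

    def request(self, now):
        # Refill for the idle time since the last request, capped at the
        # burst limit, then spend one token if one is available.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket()
bucket.tokens = 0.0    # pretend the bucket was just drained at t=0
# 30 idle seconds save up 6 calls; a 7th call in the same second is refused
results = [bucket.request(30.0 + i * 0.01) for i in range(7)]
```

Under this model, six rapid calls succeed after the 30-second idle period and the seventh is throttled, matching the worked example above.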
If you exceed the rate limit and the burst limit, your web service call fails and returns a throttling exception.
The default implementation of a worker, Task Runner, automatically retries API calls that fail with a throttling exception, with a back off so that subsequent attempts to call the API occur at increasingly longer intervals. If you write a worker, we recommend that you implement similar retry logic.
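A worker's retry logic can be sketched as follows; ThrottlingError here is a stand-in for whatever exception your client maps the service's throttling response to:

```python
import time

class ThrottlingError(Exception):
    """Stand-in for the client's throttling exception."""

def call_with_backoff(fn, max_attempts=5, base_delay=0.01):
    # Retry on throttling, doubling the wait before each new attempt.
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ThrottlingError()   # throttled twice, then succeeds
    return "ok"

result = call_with_backoff(flaky)
```

In a real worker, base_delay would be on the order of seconds, and the delay cap and jitter would be tuned to the service's published rate limits.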
These limits are applied against an individual AWS account.
API                          Regular rate limit       Burst limit
ActivatePipeline             1 call per 5 seconds     10 calls
CreatePipeline               1 call per 5 seconds     10 calls
DeletePipeline               1 call per 5 seconds     10 calls
DescribeObjects              1 call per 2 seconds     10 calls
DescribePipelines            1 call per 5 seconds     10 calls
GetPipelineDefinition        1 call per 5 seconds     10 calls
PollForTask                  1 call per 2 seconds     10 calls
ListPipelines                1 call per 5 seconds     10 calls
PutPipelineDefinition        1 call per 5 seconds     10 calls
QueryObjects                 1 call per 2 seconds     10 calls
ReportProgress               1 call per 2 seconds     10 calls
SetTaskStatus                2 calls per second       10 calls
SetStatus                    1 call per 5 seconds     10 calls
ReportTaskRunnerHeartbeat    1 call per 5 seconds     10 calls
ValidatePipelineDefinition   1 call per 5 seconds     10 calls
Scaling Considerations
AWS Data Pipeline scales to accommodate a huge number of concurrent tasks, and you can configure it to automatically create the resources necessary to handle large workloads. These automatically created resources are under your control and count against your AWS account resource limits. For example, if you configure AWS Data Pipeline to automatically create a 20-node Amazon EMR cluster to process data and your AWS account has an EC2 instance limit of 20, you may inadvertently exhaust your available backfill resources. As a result, consider these resource restrictions in your design, or increase your account limits accordingly.
If you require additional capacity, you can contact Amazon Web Services to increase your capacity.
AWS Data Pipeline Resources
The following table lists related resources to help you use AWS Data Pipeline.

Resource                                Description
AWS Data Pipeline API Reference         Describes AWS Data Pipeline operations, errors, and data structures.
AWS Data Pipeline Technical FAQ         Covers the top 20 questions developers ask about this product.
Release Notes                           Provide a high-level overview of the current release. They specifically note any new features, corrections, and known issues.
AWS Developer Resource Center           A central starting point to find documentation, code samples, release notes, and other information to help you build innovative applications with AWS.
AWS Management Console                  The AWS Data Pipeline console.
Discussion Forums                       A community-based forum for developers to discuss technical questions related to Amazon Web Services.
AWS Support Center                      The home page for AWS Technical Support, including access to our Developer Forums, Technical FAQs, Service Status page, and Premium Support.
AWS Premium Support                     The primary web page for information about AWS Premium Support, a one-on-one, fast-response support channel to help you build and run applications on AWS Infrastructure Services.
AWS Data Pipeline Product Information   The primary web page for information about AWS Data Pipeline.
Contact Us                              A form for questions about your AWS account, including billing.
Terms of Use                            Detailed information about the copyright and trademark usage at Amazon.com and other topics.
Document History
This documentation is associated with the 2012-10-29 version of AWS Data Pipeline.
Change           Description                                                                  Release Date
Guide revision   This release is the initial release of the AWS Data Pipeline Developer       20 December 2012
                 Guide.