AWS Data Pipeline Developer Guide

Tutorial: Copy CSV Data from Amazon S3 to Amazon S3
After you read What is AWS Data Pipeline? (p. 1) and decide you want to use AWS Data Pipeline to automate the movement and transformation of your data, it is time to get started creating data pipelines. To help you make sense of how AWS Data Pipeline works, let's walk through a simple task.

This tutorial walks you through the process of creating a data pipeline to copy data from one Amazon S3 bucket to another and then send an Amazon SNS notification after the copy activity completes successfully. You use an Amazon EC2 instance resource managed by AWS Data Pipeline for this copy activity.
Important
This tutorial does not employ the Amazon S3 API for high-speed data transfer between Amazon S3 buckets. It is intended only for demonstration purposes, to help new customers understand how to create a simple pipeline and the related concepts. For advanced information about data transfer using Amazon S3, see Working with Buckets in the Amazon S3 Developer Guide.
The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each object. For more information, see .
This tutorial uses the following objects to create a pipeline definition:

Activity
The activity that AWS Data Pipeline performs for this pipeline. This tutorial uses the CopyActivity object to copy CSV data from one Amazon S3 bucket to another.
Important
There are distinct limitations regarding the CSV file format with CopyActivity and S3DataNode. For more information, see CopyActivity (p. 180).

Schedule
The start date, time, and the recurrence for this activity. You can optionally specify the end date and time.

Resource
The resource AWS Data Pipeline must use to perform this activity. This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to copy data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates it after the task finishes.

DataNodes
The input and output nodes for this pipeline. This tutorial uses S3DataNode for both the input and output nodes.

Action
The action AWS Data Pipeline must take when the specified conditions are met. This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the email address you specify after the task finishes successfully. You must subscribe to the Amazon SNS topic ARN to receive the notification.
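Because the SnsAlarm action does not appear in the JSON example later in this tutorial, the following is a minimal, hypothetical sketch of how such an action and the onSuccess reference on an activity might look in a pipeline definition. The topic ARN, names, and role are placeholders, and the activity is abbreviated to show only the onSuccess field.

{
  "id": "CopyDataNotice",
  "type": "SnsAlarm",
  "topicArn": "arn:aws:sns:us-east-1:111122223333:my-example-topic",
  "subject": "Copy succeeded",
  "message": "The CSV copy activity finished successfully.",
  "role": "test-role"
},
{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "onSuccess": {
    "ref": "CopyDataNotice"
  }
}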
The following steps outline how to create a data pipeline to copy data from one Amazon S3 bucket to another Amazon S3 bucket.
1. Create your pipeline definition
2. Validate and save your pipeline definition
3. Activate your pipeline
4. Monitor the progress of your pipeline
5. [Optional] Delete your pipeline
Before You Begin...
Be sure you've completed the following steps.
• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see .
• Set up the AWS Data Pipeline tools and interface you plan to use. For more information on interfaces, see .
• Create an Amazon S3 bucket as a data source. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.
• Upload your data to your Amazon S3 bucket. For more information, see Add an Object to a Bucket in the Amazon Simple Storage Service Getting Started Guide.
• Create another Amazon S3 bucket as a data target.
• Create an Amazon SNS topic for sending email notifications and make a note of the topic Amazon Resource Name (ARN). For more information, see Create a Topic in the Amazon Simple Notification Service Getting Started Guide.
• [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your own IAM role policy and trust relationships, follow the instructions described in Granting Permissions to Pipelines with IAM (p. 21).
Note
Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.
Using the AWS Data Pipeline Console
Topics
• Create and Configure the Pipeline Definition Objects (p. 27)
• Validate and Save Your Pipeline (p. 30)
• Verify your Pipeline Definition (p. 30)
• Activate your Pipeline (p. 31)
• Monitor the Progress of Your Pipeline Runs (p. 31)
• [Optional] Delete your Pipeline (p. 33)
The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.
To create your pipeline definition
1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.
2. Click Create Pipeline.
3. On the Create a New Pipeline page:
   a. In the Pipeline Name box, enter a name (for example, CopyMyS3Data).
   b. In Pipeline Description, enter a description.
   c. Leave the Select Schedule Type: button set to the default type for this tutorial.
      Note
      Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
   d. Leave the Role boxes set to their default values for this tutorial.
      Note
      If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.
   e. Click Create a new pipeline.
Create and Configure the Pipeline Definition Objects
Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.
1. On the Pipeline: name of your pipeline page, select Add activity.
2. In the Activities pane:
   a. Enter the name of the activity; for example, copy-myS3-data.
   b. In the Type box, select CopyActivity.
   c. In the Input box, select Create new: DataNode.
   d. In the Output box, select Create new: DataNode.
   e. In the Schedule box, select Create new: Schedule.
   f. In the Add an optional field... box, select RunsOn.
   g. In the Runs On box, select Create new: Resource.
   h. In the Add an optional field... box, select On Success.
   i. In the On Success box, select Create new: Action.
   j. In the left pane, separate the icons by dragging them apart.
You have now defined your pipeline by specifying the objects AWS Data Pipeline uses to perform the copy activity. The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connections between the various objects.
Next, configure the run date and time for your pipeline.
To configure the run date and time for your pipeline
1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.
2. In the Schedules pane:
   a. Enter a schedule name for this activity (for example, copy-myS3-data-schedule).
   b. In the Type box, select Schedule.
   c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.
      Note
      AWS Data Pipeline supports date and time expressed in "YYYY-MM-DDTHH:MM:SS" format, in UTC/GMT only.
   d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).
   e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and then enter the date and time.
To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.
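In JSON terms (the format used in the command line interface section later in this tutorial), the same effect comes from a Schedule whose startDateTime is in the past. The following is a hypothetical sketch: assuming, for illustration, that today is 2012-11-25, the 2012-11-24 run is treated as past due and is launched immediately as a backfill.

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-24T00:00:00",
  "endDateTime": "2012-11-27T00:00:00",
  "period": "1 day"
}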
Next, configure the input and the output data nodes for your pipeline.
To configure the input and output data nodes of your pipeline
1. On the Pipeline: name of your pipeline page, in the right pane, click DataNodes.
2. In the DataNodes pane:
   a. In the DefaultDataNode1 Name box, enter the name for your input node (for example, MyS3Input). In this tutorial, your input node is the Amazon S3 data source bucket.
   b. In the Type box, select S3DataNode.
   c. In the Schedule box, select copy-myS3-data-schedule.
   d. In the Add an optional field... box, select File Path.
   e. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-input/name of your data file).
   f. In the DefaultDataNode2 Name box, enter the name for your output node (for example, MyS3Output). In this tutorial, your output node is the Amazon S3 data target bucket.
   g. In the Type box, select S3DataNode.
   h. In the Schedule box, select copy-myS3-data-schedule.
   i. In the Add an optional field... box, select File Path.
   j. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your data file).
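For reference, the console entries above correspond roughly to two S3DataNode objects like the following in the underlying pipeline definition. This is only an illustrative sketch; the bucket names match the tutorial examples, and data.csv stands in for the name of your data file.

{
  "id": "MyS3Input",
  "type": "S3DataNode",
  "schedule": {
    "ref": "copy-myS3-data-schedule"
  },
  "filePath": "s3://my-data-pipeline-input/data.csv"
},
{
  "id": "MyS3Output",
  "type": "S3DataNode",
  "schedule": {
    "ref": "copy-myS3-data-schedule"
  },
  "filePath": "s3://my-data-pipeline-output/data.csv"
}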
Next, configure the resource AWS Data Pipeline must use to perform the copy activity.
To configure the resource
1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.
2. In the Resources pane:
   a. In the Name box, enter the name for your resource (for example, CopyDataInstance).
   b. In the Type box, select Ec2Resource.
   c. In the Schedule box, select copy-myS3-data-schedule.
   d. Leave the Role and Resource Role boxes set to their default values for this tutorial.
      Note
      If you have created your own IAM roles, you can select them now.
Next, configure the SNS notification action AWS Data Pipeline must perform after the copy activity finishes successfully.
To configure the SNS notification action
1. On the Pipeline: name of your pipeline page, in the right pane, click Others.
2. In the Others pane:
   a. In the DefaultAction1 Name box, enter the name for your Amazon SNS notification (for example, CopyDataNotice).
   b. In the Type box, select SnsAlarm.
   c. In the Topic Arn box, enter the ARN of your Amazon SNS topic.
   d. In the Message box, enter the message content.
   e. In the Subject box, enter the subject line for your notification.
   f. Leave the Role box set to the default value for this tutorial.
You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.
Validate and Save Your Pipeline
You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.
If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete but you are still getting a validation error message, you must fix the errors in the pipeline definition before activating your pipeline.
To validate and save your pipeline
1. On the Pipeline: name of your pipeline page, click Save Pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message. If you get an error message, click Close and then, in the right pane, click Errors.
3. The Errors pane lists the objects that failed validation. Click the plus (+) sign next to the object names and look for an error message in red.
4. When you see an error message, go to the specific object pane where the error is reported and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.
5. After you've fixed the errors listed in the Errors pane, click Save Pipeline.
6. Repeat the process until your pipeline validates successfully.
Next, verify that your pipeline definition has been saved.
Verify your Pipeline Definition
It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.
To verify your pipeline definition
1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.
2. On the List Pipelines page, check whether your newly created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.
3. Click the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.
4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.
5. Click Close.
Next, activate your pipeline.
Activate your Pipeline
You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.
To activate your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2. On the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens, confirming the activation.
3. Click Close.
Next, verify that your pipeline is running.
Monitor the Progress of Your Pipeline Runs
To monitor the progress of your pipeline
1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.
2. The Instance details: name of your pipeline page lists the status of each instance.
   Note
   If you do not see runs listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date, or click the Start (in UTC) date box and change it to an earlier date. Then click Update.
3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity. You should receive an email about the successful completion of this task at the address you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify that the data was copied.
4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.
   a. To troubleshoot failed or incomplete instance runs, click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.
   b. In the Instance summary pane, click View instance fields to see the details of the fields associated with the selected instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for the failure; for example, @failureReason = Resource not healthy terminated.
   c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
   d. In the Instance summary pane, click View attempt fields to see the details of the fields associated with the selected attempt.
5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.
You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.
For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about common problems and how to resolve them, see Pipeline Problems and Solutions (p. 131).
Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring AWS Data Pipeline usage charges, delete your pipeline.
[Optional] Delete your Pipeline
Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.
To delete your pipeline
1. On the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.
Using the Command Line Interface
Topics
• Define a Pipeline in JSON Format (p. 33)
• Upload the Pipeline Definition (p. 38)
• Activate the Pipeline (p. 39)
• Verify the Pipeline Status (p. 39)
The following topics explain how to use the AWS Data Pipeline CLI to create and use pipelines to copy data from one Amazon S3 bucket to another. In this example, we perform the following steps:
• Create a pipeline definition using the CLI in JSON format
• Create the necessary IAM roles and define a policy and trust relationships
• Upload the pipeline definition using the AWS Data Pipeline CLI tools
• Monitor the progress of the pipeline
Define a Pipeline in JSON Format
This example scenario shows how to use JSON pipeline definitions and the AWS Data Pipeline CLI to schedule copying data between two Amazon S3 buckets at a specific time interval. This is the full pipeline definition JSON file followed by an explanation for each of its sections.
Note
We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.
{
  "objects": [
    {
      "id": "MySchedule",
      "type": "Schedule",
      "startDateTime": "2012-11-25T00:00:00",
      "endDateTime": "2012-11-26T00:00:00",
      "period": "1 day"
    },
    {
      "id": "S3Input",
      "type": "S3DataNode",
      "schedule": {
        "ref": "MySchedule"
      },
      "filePath": "s3://testbucket/file.txt"
    },
    {
      "id": "S3Output",
      "type": "S3DataNode",
      "schedule": {
        "ref": "MySchedule"
      },
      "filePath": "s3://testbucket/file-copy.txt"
    },
    {
      "id": "MyEC2Resource",
      "type": "Ec2Resource",
      "schedule": {
        "ref": "MySchedule"
      },
      "actionOnTaskFailure": "terminate",
      "actionOnResourceFailure": "retryAll",
      "maximumRetries": "1",
      "role": "test-role",
      "resourceRole": "test-role",
      "instanceType": "m1.medium",
      "instanceCount": "1",
      "securityGroups": [
        "test-group",
        "default"
      ],
      "keyPair": "test-pair"
    },
    {
      "id": "MyCopyActivity",
      "type": "CopyActivity",
      "runsOn": {
        "ref": "MyEC2Resource"
      },
      "input": {
        "ref": "S3Input"
      },
      "output": {
        "ref": "S3Output"
      },
      "schedule": {
        "ref": "MySchedule"
      }
    }
  ]
}
Schedule
The example AWS Data Pipeline JSON file begins with a section that defines the schedule by which to copy the data. Many pipeline components include a reference to a schedule, and a pipeline may have more than one.
The Schedule component is defined by the following fields:
{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-25T00:00:00",
  "endDateTime": "2012-11-26T00:00:00",
  "period": "1 day"
},
Note
In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.
Name
The user-defined name for the pipeline schedule, which is a label for your reference only.
Type
The pipeline component type, which is Schedule.
startDateTime
The date/time (in UTC format) that you want the task to begin.
endDateTime
The date/time (in UTC format) that you want the task to stop.
period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline copy operation runs only one time.
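To make the divisibility requirement concrete, the following hypothetical variant of the same schedule uses a one-hour period; because one hour evenly divides the one-day window between startDateTime and endDateTime, this schedule would produce 24 runs instead of one.

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-25T00:00:00",
  "endDateTime": "2012-11-26T00:00:00",
  "period": "1 hour"
}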
Amazon S3 Data Nodes
Next, the input S3DataNode pipeline component defines a location for the input files; in this case, an Amazon S3 bucket location. The input S3DataNode component is defined by the following fields:

{
  "id": "S3Input",
  "type": "S3DataNode",
  "schedule": {
    "ref": "MySchedule"
  },
  "filePath": "s3://testbucket/file.txt"
},
Name
The user-defined name for the input location (a label for your reference only).
Type
The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.
Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled "MySchedule".
Path
The path to the data associated with the data node. The syntax for a data node is determined by its type. For example, a data node for a database follows a different syntax that is appropriate for a database table.
Next, the output S3DataNode component defines the output destination location for the data. It follows the same format as the input S3DataNode component, except for the name of the component and a different path to indicate the target file.
{
  "id": "S3Output",
  "type": "S3DataNode",
  "schedule": {
    "ref": "MySchedule"
  },
  "filePath": "s3://testbucket/file-copy.txt"
},
Resource
This is the definition of the computational resource that performs the copy operation. In this example, AWS Data Pipeline should automatically create an EC2 instance to perform the copy task and terminate the resource after the task completes. The fields defined here control the creation and function of the Amazon EC2 instance that does the work. The Ec2Resource is defined by the following fields:
{
  "id": "MyEC2Resource",
  "type": "Ec2Resource",
  "schedule": {
    "ref": "MySchedule"
  },
  "actionOnTaskFailure": "terminate",
  "actionOnResourceFailure": "retryAll",
  "maximumRetries": "1",
  "role": "test-role",
  "resourceRole": "test-role",
  "instanceType": "m1.medium",
  "instanceCount": "1",
  "securityGroups": [
    "test-group",
    "default"
  ],
  "keyPair": "test-pair"
},
Name
The user-defined name for the resource, which is a label for your reference only.
Type
The type of computational resource to perform work; in this case, an Amazon EC2 instance. There are other resource types available, such as an EmrCluster type (a minimal sketch appears after this field list).
Schedule
The schedule on which to create this computational resource.
actionOnTaskFailure
The action to take on the Amazon EC2 instance if the task fails. In this case, the instance should terminate so that each failed attempt to copy data does not leave behind idle, abandoned Amazon EC2 instances with no work to perform. These instances require manual termination by an administrator.
actionOnResourceFailure
The action to perform if the resource is not created successfully. In this case, retry the creation of an Amazon EC2 instance until it is successful.
maximumRetries
The number of times to retry the creation of this computational resource before marking the resource as a failure and stopping any further creation attempts. This setting works in conjunction with the actionOnResourceFailure field.
Role
The IAM role of the account that accesses resources, such as accessing an Amazon S3 bucket to retrieve data.
resourceRole
The IAM role of the account that creates resources, such as creating and configuring an Amazon EC2 instance on your behalf. Role and ResourceRole can be the same role, but specifying them separately provides greater granularity in your security configuration.
instanceType
The size of the Amazon EC2 instance to create. Ensure that you set the size of EC2 instance that best matches the load of the work that you want to perform with AWS Data Pipeline. In this case, we set an m1.medium EC2 instance. For more information about the different instance types and when to use each one, see the Amazon EC2 Instance Types topic at http://aws.amazon.com/ec2/instance-types/.
instanceCount
The number of Amazon EC2 instances in the computational resource pool to service any pipeline components depending on this resource.
securityGroups
The Amazon EC2 security groups to grant access to this EC2 instance. As the example shows, you can provide more than one group at a time (test-group and default).
keyPair
The name of the SSH public/private key pair used to log in to the Amazon EC2 instance. For more information, see Amazon EC2 Key Pairs.
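The Type field above notes that other resource types, such as EmrCluster, are available. The following is a minimal, hypothetical sketch of what an EmrCluster resource might look like; the instance types and count are illustrative placeholders, not values used in this tutorial.

{
  "id": "MyEmrCluster",
  "type": "EmrCluster",
  "schedule": {
    "ref": "MySchedule"
  },
  "masterInstanceType": "m1.medium",
  "coreInstanceType": "m1.medium",
  "coreInstanceCount": "2"
}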
Activity
The last section in the JSON file is the definition of the activity that represents the work to perform. This example uses CopyActivity to copy data from a file in an Amazon S3 bucket to another file. The CopyActivity component is defined by the following fields:
{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "runsOn": {
    "ref": "MyEC2Resource"
  },
  "input": {
    "ref": "S3Input"
  },
  "output": {
    "ref": "S3Output"
  },
  "schedule": {
    "ref": "MySchedule"
  }
}
Name
The user-defined name for the activity, which is a label for your reference only.
Type
The type of activity to perform; in this case, CopyActivity.
runsOn
The computational resource that performs the work that this activity defines. In this example, we provide a reference to the Amazon EC2 instance defined previously. Using the runsOn field causes AWS Data Pipeline to create the EC2 instance for you. The runsOn field indicates that the resource exists in the AWS infrastructure, while the workerGroup value indicates that you want to use your own on-premises resources to perform the work (see the sketch after this field list).
Schedule
The schedule on which to run this activity.
Input
The location of the data to copy.
Output
The target location for the copied data.
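As a point of comparison (not part of this tutorial), a CopyActivity that should run on your own on-premises resources would omit runsOn and specify a workerGroup instead. The following is a hypothetical sketch; the worker group name is a placeholder, and the other fields are unchanged from the example above.

{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "workerGroup": "my-onpremises-worker-group",
  "input": {
    "ref": "S3Input"
  },
  "output": {
    "ref": "S3Output"
  },
  "schedule": {
    "ref": "MySchedule"
  }
}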
Upload the Pipeline Definition
You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).
To upload your pipeline definition, use the following command.
On Linux/Unix/Mac OS:
./datapipeline --create pipeline_name --put pipeline_file
On Windows:
ruby datapipeline --create pipeline_name --put pipeline_file
Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.
If your pipeline validates successfully, you receive the following message:
Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline definition pipeline_file.json uploaded.
Note
For more information about any errors returned by the --create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).
Ensure that your pipeline appears in the pipeline list by using the following command.
On Linux/Unix/Mac OS:
./datapipeline --list-pipelines
On Windows:
ruby datapipeline --list-pipelines
The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-AKIAIOSFODNN7EXAMPLE.
Activate the Pipeline
You must activate the pipeline, using the --activate command-line parameter, before it begins performing work. Use the following command.
On Linux/Unix/Mac OS:
./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE
On Windows:
ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE
Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.
Verify the Pipeline Status
View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.
On Linux/Unix/Mac OS:
./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE
On Windows:
ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE
Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.
The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.
Note
It is important to note the difference between the Scheduled Start date/time and the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later because of problems or delays with preconditions, dependencies, failures, or retries.
Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.
Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status. Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with its own final status.