AWS Data Pipeline Developer Guide

Tutorial: Copy CSV Data from Amazon S3 to Amazon S3

After you read What is AWS Data Pipeline? (p. 1) and decide you want to use AWS Data Pipeline to automate the movement and transformation of your data, it is time to get started with creating data pipelines. To help you make sense of how AWS Data Pipeline works, let's walk through a simple task.

This tutorial walks you through the process of creating a data pipeline to copy data from one Amazon S3 bucket to another and then send an Amazon SNS notification after the copy activity completes successfully.

You use the Amazon EC2 instance resource managed by AWS Data Pipeline for this copy activity.

Important

This tutorial does not employ the Amazon S3 API for high-speed data transfer between Amazon S3 buckets. It is intended only for demonstration purposes to help new customers understand how to create a simple pipeline and the related concepts. For advanced information about data transfer using Amazon S3, see Working with Buckets in the Amazon S3 Developer Guide.

The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each object. For more information, see Pipeline Definition (p. 2).

This tutorial uses the following objects to create a pipeline definition:

Activity

The activity that AWS Data Pipeline performs for this pipeline.

This tutorial uses the CopyActivity object to copy CSV data from one Amazon S3 bucket to another.

Important

There are distinct limitations regarding the CSV file format with CopyActivity and S3DataNode. For more information, see CopyActivity (p. 180).

Schedule

The start date, time, and the recurrence for this activity. You can optionally specify the end date and time.

Resource

The resource AWS Data Pipeline must use to perform this activity.

This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to copy data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates the instance after the task finishes.


DataNodes

Input and output nodes for this pipeline.

This tutorial uses S3DataNode for both the input and output nodes.

Action

The action AWS Data Pipeline must take when the specified conditions are met.

This tutorial uses the SnsAlarm action to send Amazon SNS notifications to the email address you specify after the task finishes successfully. You must subscribe to the Amazon SNS topic ARN to receive the notifications.
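Taken together, these objects reference one another by their ids. The following skeleton is only a sketch of those relationships, with most required fields stripped out; the complete, working JSON definition for this scenario appears in the CLI section later in this tutorial, and the object ids shown here match that example.

{
  "objects": [
    { "id": "MySchedule", "type": "Schedule" },
    { "id": "S3Input", "type": "S3DataNode", "schedule": { "ref": "MySchedule" } },
    { "id": "S3Output", "type": "S3DataNode", "schedule": { "ref": "MySchedule" } },
    { "id": "MyEC2Resource", "type": "Ec2Resource", "schedule": { "ref": "MySchedule" } },
    { "id": "MyCopyActivity", "type": "CopyActivity",
      "input": { "ref": "S3Input" }, "output": { "ref": "S3Output" },
      "runsOn": { "ref": "MyEC2Resource" }, "schedule": { "ref": "MySchedule" } }
  ]
}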

The following steps outline how to create a data pipeline to copy data from one Amazon S3 bucket to another Amazon S3 bucket.

1. Create your pipeline definition

2. Validate and save your pipeline definition

3. Activate your pipeline

4. Monitor the progress of your pipeline

5. [Optional] Delete your pipeline

Before You Begin...

Be sure you've completed the following steps.

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).
• Set up the AWS Data Pipeline tools and interface you plan to use. For more information on the interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).
• Create an Amazon S3 bucket as a data source. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.
• Upload your data to your Amazon S3 bucket. For more information, see Add an Object to a Bucket in the Amazon Simple Storage Service Getting Started Guide.
• Create another Amazon S3 bucket as a data target.
• Create an Amazon SNS topic for sending email notifications and make a note of the topic Amazon Resource Name (ARN). For more information, see Create a Topic in the Amazon Simple Notification Service Getting Started Guide.
• [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your own IAM role policy and trust relationships, follow the instructions described in Granting Permissions to Pipelines with IAM (p. 21).

Note

Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.


Using the AWS Data Pipeline Console

Topics

Create and Configure the Pipeline Definition Objects (p. 27)

Validate and Save Your Pipeline (p. 30)

Verify your Pipeline Definition (p. 30)

Activate your Pipeline (p. 31)

Monitor the Progress of Your Pipeline Runs (p. 31)

[Optional] Delete your Pipeline (p. 33)

The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.

To create your pipeline definition

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.
2. Click Create Pipeline.
3. On the Create a New Pipeline page:
   a. In the Pipeline Name box, enter a name (for example, CopyMyS3Data).
   b. In Pipeline Description, enter a description.
   c. Leave the Select Schedule Type button set to the default type for this tutorial.

      Note
      The schedule type lets you specify whether the objects in your pipeline definition should be scheduled at the beginning or the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

   d. Leave the Role boxes set to their default values for this tutorial.

      Note
      If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.

   e. Click Create a new pipeline.

Create and Configure the Pipeline Definition Objects

Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.

1. On the Pipeline: name of your pipeline page, select Add activity.
2. In the Activities pane:
   a. Enter the name of the activity (for example, copy-myS3-data).
   b. In the Type box, select CopyActivity.
   c. In the Input box, select Create new: DataNode.
   d. In the Output box, select Create new: DataNode.
   e. In the Schedule box, select Create new: Schedule.


   f. In the Add an optional field... box, select RunsOn.
   g. In the Runs On box, select Create new: Resource.
   h. In the Add an optional field... box, select On Success.
   i. In the On Success box, select Create new: Action.
   j. In the left pane, separate the icons by dragging them apart.

You have now defined your pipeline by specifying the objects AWS Data Pipeline uses to perform the copy activity.

The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connections between the objects.

Next, configure the run date and time for your pipeline.

To configure the run date and time for your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.
2. In the Schedules pane:
   a. Enter a schedule name for this activity (for example, copy-myS3-data-schedule).
   b. In the Type box, select Schedule.
   c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.

      Note
      AWS Data Pipeline supports dates and times expressed in "YYYY-MM-DDTHH:MM:SS" format, in UTC/GMT only.

   d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).
   e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and then enter the date and time.

To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.
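In JSON form (the format used in the CLI section later in this tutorial), the schedule you just configured might look like the following sketch. The start date shown is only an illustrative past date chosen to trigger backfilling, and endDateTime is omitted because it is optional.

{
  "id": "copy-myS3-data-schedule",
  "type": "Schedule",
  "startDateTime": "2012-11-21T00:00:00",
  "period": "1 day"
}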

Next, configure the input and the output data nodes for your pipeline.


To configure the input and output data nodes of your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click DataNodes.
2. In the DataNodes pane:
   a. In the DefaultDataNode1 Name box, enter the name for your input node (for example, MyS3Input).

      In this tutorial, your input node is the Amazon S3 data source bucket.

   b. In the Type box, select S3DataNode.
   c. In the Schedule box, select copy-myS3-data-schedule.
   d. In the Add an optional field... box, select File Path.
   e. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-input/name of your data file).
   f. In the DefaultDataNode2 Name box, enter the name for your output node (for example, MyS3Output).

      In this tutorial, your output node is the Amazon S3 data target bucket.

   g. In the Type box, select S3DataNode.
   h. In the Schedule box, select copy-myS3-data-schedule.
   i. In the Add an optional field... box, select File Path.
   j. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your data file).
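For reference, the input data node you just configured corresponds to an S3DataNode object along the lines of the following sketch; the bucket and file names are placeholders for your own values, not part of this tutorial.

{
  "id": "MyS3Input",
  "type": "S3DataNode",
  "schedule": { "ref": "copy-myS3-data-schedule" },
  "filePath": "s3://my-data-pipeline-input/name-of-your-data-file"
}

The output node (MyS3Output) is identical except for its id and a filePath that points to your target bucket.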

Next, configure the resource AWS Data Pipeline must use to perform the copy activity.

To configure the resource

1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.
2. In the Resources pane:
   a. In the Name box, enter the name for your resource (for example, CopyDataInstance).
   b. In the Type box, select Ec2Resource.
   c. In the Schedule box, select copy-myS3-data-schedule.
   d. Leave the Role and Resource Role boxes set to their default values for this tutorial.

      Note
      If you have created your own IAM roles, you can select them now.

Next, configure the Amazon SNS notification action AWS Data Pipeline must perform after the copy activity finishes successfully.

To configure the SNS notification action

1. On the Pipeline: name of your pipeline page, in the right pane, click Others.
2. In the Others pane:
   a. In the DefaultAction1 Name box, enter the name for your Amazon SNS notification (for example, CopyDataNotice).
   b. In the Type box, select SnsAlarm.
   c. In the Topic Arn box, enter the ARN of your Amazon SNS topic.
   d. In the Message box, enter the message content.
   e. In the Subject box, enter the subject line for your notification.
   f. Leave the Role box set to the default value for this tutorial.

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition.

If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you still get a validation error message, you must fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1. On the Pipeline: name of your pipeline page, click Save Pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either a success message or an error message. If you get an error message, click Close and then, in the right pane, click Errors.
3. The Errors pane lists the objects that failed validation. Click the plus (+) sign next to the object names and look for an error message in red.
4. When you see an error message, click the specific object pane where the error appears and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.
5. After you've fixed the errors listed in the Errors pane, click Save Pipeline.
6. Repeat the process until your pipeline validates successfully.

Next, verify that your pipeline definition has been saved.

Verify your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition

1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.
2. On the List Pipelines page, check that your newly created pipeline is listed. AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition. The Status column in the row listing your pipeline should show PENDING.
3. Click the triangle icon next to your pipeline. The Pipeline summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.
4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.


5. Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.
2. On the Pipeline: name of your pipeline page, click Activate. A confirmation dialog box opens, confirming the activation.
3. Click Close.

Next, verify that your pipeline is running.

Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.


2. The Instance details: name of your pipeline page lists the status of each instance.

   Note
   If you do not see runs listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.

3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity. You should receive an email about the successful completion of this task, sent to the account you specified for receiving your Amazon SNS notification. You can also check your Amazon S3 data target bucket to verify that the data was copied.
4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed. To troubleshoot failed or incomplete instance runs:
   a. Click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.
   b. In the Instance summary pane, click View instance fields to see details of the fields associated with the selected instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for the failure; for example, @failureReason = Resource not healthy terminated.
   c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.
   d. In the Instance summary pane, click View attempt fields to see details of the fields associated with the selected attempt.
5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.

For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).


Important

Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1. On the List Pipelines page, click the check box next to your pipeline.
2. Click Delete.
3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

Topics

Define a Pipeline in JSON Format (p. 33)

Upload the Pipeline Definition (p. 38)

Activate the Pipeline (p. 39)

Verify the Pipeline Status (p. 39)

The following topics explain how to use the AWS Data Pipeline CLI to create and use pipelines to copy data from one Amazon S3 bucket to another. In this example, we perform the following steps:

• Create a pipeline definition using the CLI in JSON format

• Create the necessary IAM roles and define a policy and trust relationships

• Upload the pipeline definition using the AWS Data Pipeline CLI tools

• Monitor the progress of the pipeline

Define a Pipeline in JSON Format

This example scenario shows how to use JSON pipeline definitions and the AWS Data Pipeline CLI to schedule copying data between two Amazon S3 buckets at a specific time interval. The following is the full pipeline definition JSON file, followed by an explanation of each of its sections.

Note

We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.

{
  "objects": [
    {
      "id": "MySchedule",
      "type": "Schedule",
      "startDateTime": "2012-11-25T00:00:00",
      "endDateTime": "2012-11-26T00:00:00",
      "period": "1 day"
    },
    {
      "id": "S3Input",
      "type": "S3DataNode",
      "schedule": {
        "ref": "MySchedule"
      },
      "filePath": "s3://testbucket/file.txt"
    },
    {
      "id": "S3Output",
      "type": "S3DataNode",
      "schedule": {
        "ref": "MySchedule"
      },
      "filePath": "s3://testbucket/file-copy.txt"
    },
    {
      "id": "MyEC2Resource",
      "type": "Ec2Resource",
      "schedule": {
        "ref": "MySchedule"
      },
      "actionOnTaskFailure": "terminate",
      "actionOnResourceFailure": "retryAll",
      "maximumRetries": "1",
      "role": "test-role",
      "resourceRole": "test-role",
      "instanceType": "m1.medium",
      "instanceCount": "1",
      "securityGroups": [
        "test-group",
        "default"
      ],
      "keyPair": "test-pair"
    },
    {
      "id": "MyCopyActivity",
      "type": "CopyActivity",
      "runsOn": {
        "ref": "MyEC2Resource"
      },
      "input": {
        "ref": "S3Input"
      },
      "output": {
        "ref": "S3Output"
      },
      "schedule": {
        "ref": "MySchedule"
      }
    }
  ]
}

Schedule

The example AWS Data Pipeline JSON file begins with a section that defines the schedule by which to copy the data. Many pipeline components include a reference to a schedule, and you may define more than one schedule.

The Schedule component is defined by the following fields:

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-25T00:00:00",
  "endDateTime": "2012-11-26T00:00:00",
  "period": "1 day"
},

Note

In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.

Name

The user-defined name for the pipeline schedule, which is a label for your reference only.

Type

The pipeline component type, which is Schedule.

startDateTime

The date/time (in UTC format) that you want the task to begin.

endDateTime

The date/time (in UTC format) that you want the task to stop.

period

The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline copy operation runs only one time.
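As a sketch of how that divisibility rule plays out, shortening the period within the same one-day window schedules the copy more than once; for example, a hypothetical 12-hour period would produce two scheduled runs between the start and end times shown above.

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-25T00:00:00",
  "endDateTime": "2012-11-26T00:00:00",
  "period": "12 hours"
},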

Amazon S3 Data Nodes

Next, the input S3DataNode pipeline component defines a location for the input files; in this case, an Amazon S3 bucket location. The input S3DataNode component is defined by the following fields:

{
  "id": "S3Input",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filePath": "s3://testbucket/file.txt"
},

Name

The user-defined name for the input location (a label for your reference only).

Type

The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.

Schedule

A reference to the schedule component that we created in the preceding lines of the JSON file, labeled "MySchedule".

Path

The path to the data associated with the data node. The syntax for a data node is determined by its type. For example, a data node for a database follows a different syntax that is appropriate for a database table.

Next, the output S3DataNode component defines the output destination location for the data. It follows the same format as the input S3DataNode component, except for the name of the component and a different path to indicate the target file.

{
  "id": "S3Output",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filePath": "s3://testbucket/file-copy.txt"
},

Resource

This is a definition of the computational resource that performs the copy operation. In this example, AWS Data Pipeline should automatically create an EC2 instance to perform the copy task and terminate the resource after the task completes. The fields defined here control the creation and function of the Amazon EC2 instance that does the work. The Ec2Resource component is defined by the following fields:

{
  "id": "MyEC2Resource",
  "type": "Ec2Resource",
  "schedule": { "ref": "MySchedule" },
  "actionOnTaskFailure": "terminate",
  "actionOnResourceFailure": "retryAll",
  "maximumRetries": "1",
  "role": "test-role",
  "resourceRole": "test-role",
  "instanceType": "m1.medium",
  "instanceCount": "1",
  "securityGroups": [ "test-group", "default" ],
  "keyPair": "test-pair"
},

Name

The user-defined name for the resource, which is a label for your reference only.

Type

The type of computational resource to perform work; in this case, an Amazon EC2 instance. There are other resource types available, such as an EmrCluster type.

Schedule

The schedule on which to create this computational resource.

actionOnTaskFailure

The action to take on the Amazon EC2 instance if the task fails. In this case, the instance should terminate so that each failed attempt to copy data does not leave behind idle, abandoned Amazon EC2 instances with no work to perform. These instances require manual termination by an administrator.

actionOnResourceFailure

The action to perform if the resource is not created successfully. In this case, retry the creation of an Amazon EC2 instance until it is successful.

maximumRetries

The number of times to retry the creation of this computational resource before marking the resource as a failure and stopping any further creation attempts. This setting works in conjunction with the actionOnResourceFailure field.

Role

The IAM role of the account that accesses resources, such as accessing an Amazon S3 bucket to retrieve data.

resourceRole

The IAM role of the account that creates resources, such as creating and configuring an Amazon EC2 instance on your behalf. Role and ResourceRole can be the same role, but specifying them separately provides greater granularity in your security configuration.

instanceType

The size of the Amazon EC2 instance to create. Ensure that you set the size of EC2 instance that best matches the load of the work that you want to perform with AWS Data Pipeline. In this case, we set an m1.medium EC2 instance. For more information about the different instance types and when to use each one, see the Amazon EC2 Instance Types topic at http://aws.amazon.com/ec2/instance-types/.

instanceCount

The number of Amazon EC2 instances in the computational resource pool to service any pipeline components that depend on this resource.

securityGroups

The Amazon EC2 security groups to grant access to this EC2 instance. As the example shows, you can provide more than one group at a time (test-group and default).

keyPair

The name of the SSH public/private key pair to log in to the Amazon EC2 instance. For more information, see Amazon EC2 Key Pairs.

Activity

The last section in the JSON file is the definition of the activity that represents the work to perform. This example uses CopyActivity to copy data from a file in an Amazon S3 bucket to another file. The CopyActivity component is defined by the following fields:

{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "runsOn": { "ref": "MyEC2Resource" },
  "input": { "ref": "S3Input" },
  "output": { "ref": "S3Output" },
  "schedule": { "ref": "MySchedule" }
}

Name

The user-defined name for the activity, which is a label for your reference only.

Type

The type of activity to perform; in this case, CopyActivity.

runsOn

The computational resource that performs the work that this activity defines. In this example, we provide a reference to the Amazon EC2 instance defined previously. Using the runsOn field causes AWS Data Pipeline to create the EC2 instance for you. The runsOn field indicates that the resource exists in the AWS infrastructure, while a workerGroup value indicates that you want to use your own on-premises resources to perform the work.

Schedule

The schedule on which to run this activity.

Input

The location of the data to copy.

Output

The target location for the data.
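Note that this JSON example, unlike the console walkthrough earlier in this tutorial, does not attach an Amazon SNS notification to the activity. If you wanted to add one, a sketch along the following lines illustrates the idea: define an SnsAlarm object (the topic ARN below is a placeholder for your own topic) and reference it from the activity's onSuccess field.

{
  "id": "CopyDataNotice",
  "type": "SnsAlarm",
  "topicArn": "arn:aws:sns:us-east-1:111122223333:my-example-topic",
  "subject": "Copy completed",
  "message": "The S3 copy activity finished successfully.",
  "role": "test-role"
}

and, in the MyCopyActivity component:

"onSuccess": { "ref": "CopyDataNotice" }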

Upload the Pipeline Definition

You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).

To upload your pipeline definition, use the following command.

On Linux/Unix/Mac OS:

./datapipeline --create pipeline_name --put pipeline_file

On Windows:

ruby datapipeline --create pipeline_name --put pipeline_file

Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.

If your pipeline validates successfully, you receive the following message:

Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline definition pipeline_file.json uploaded.

Note

For more information about any errors returned by the --create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).

Ensure that your pipeline appears in the pipeline list by using the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-pipelines

On Windows:

ruby datapipeline --list-pipelines

The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-AKIAIOSFODNN7EXAMPLE.


Activate the Pipeline

You must activate the pipeline, by using the --activate command-line parameter, before it will begin performing work. Use the following command.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

Verify the Pipeline Status

View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.

Note

It is important to note the difference between the Scheduled Start date/time and the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.

Note

AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.

Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status.

Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with its own final status.
