Aster Client Guide

Loading Data

Aster Loader Tool

Here’s how we run Aster Loader, piping its input data through sed:

$ cat sampleData-3.tsv \
  | sed -e 's_\\_\\\\_g' \
  | ncluster_loader -h $QUEEN_IP -d my_db -U beehive -w beehive testo /dev/stdin

Loading tuples using node '192.168.28.100'.

3 tuples were successfully loaded into table 'test'.

Here are the result rows:

$ act -h $SYSMAN_IP -d my_db -U beehive -w beehive -c 'SELECT * FROM testo ORDER BY id;'

 id | string
----+---------------------------------------------------------------------
  1 | This is just a line.
  5 | How often do back-slash characters ('\') appear in your data?
  6 | And how often do you think they actually disappear: 1 \? 2 \? 3 \?
  7 | \W\a\y \t\o\o \o\f\t\e\n\! \! \!
(4 rows)
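
The effect of the sed filter can be checked on its own, without a cluster. The following is a minimal sketch: the function name escape_backslashes is ours and is not part of the Aster tools. It doubles every backslash in its input, presumably so that a loader that reads a backslash as an escape character ends up storing one literal backslash:

```shell
# Hypothetical helper (not an Aster command): double every backslash on stdin.
# The sed expression is the one used in the pipeline above; '_' is its delimiter.
escape_backslashes() {
    sed -e 's_\\_\\\\_g'
}

# printf '%s\n' prints its argument literally, so the input keeps its backslashes.
printf '%s\n' '\W\a\y \t\o\o' | escape_backslashes
# prints: \\W\\a\\y \\t\\o\\o
```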

Example with Error Logging

Use the same assumptions as in the previous example, and assume we will log malformed rows (that is, rows that the loader cannot interpret and therefore cannot load) to a table called “2010MarchSales_error_table,” tagging each error row with the label “2010MarchSalesErr”. At the end of the load attempt, the error data will also be copied to the file /home/ccrisp/2010MarchSales_error.txt. We’ll set a limit of 100 error rows; if more than 100 errors are encountered, the load will be cancelled.

To do this:

1 Create the custom error logging table: Run ACT as a user with table creation rights (for example, a user with the catalog_admin role) and type:

CREATE TABLE 2010MarchSales_error_table () INHERITS (nc_errortable_part);

2 Exit ACT, return to the command line, and type:

$ ./ncluster_loader -h 10.50.25.100 -w beehive -D "~" --el-enabled \
  --el-label 2010MarchSalesErr --el-limit 100 \
  --el-table 2010MarchSales_error_table \
  --el-errfile /home/ccrisp/2010MarchSales_error.txt \
  sales_fact 2010MarchSales_data.txt

For more information on logging malformed rows, see “Error Logging” on page 204.
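
If you run the same kind of load every month, the label, error table, and error file names can be derived from one prefix. The sketch below is only an illustration under that assumption: the function load_with_error_log and the LOADER_CMD variable are hypothetical names of ours, while the options, host, and paths are the ones from the example above.

```shell
# LOADER_CMD is a hypothetical override hook, e.g. for a dry run with a stub.
LOADER_CMD=${LOADER_CMD:-./ncluster_loader}

# load_with_error_log PREFIX TABLE DATAFILE   (hypothetical helper)
# Derives the error label, error table, and error file from PREFIX and runs
# the loader with error logging enabled and a limit of 100 error rows.
load_with_error_log() {
    prefix=$1 table=$2 datafile=$3
    "$LOADER_CMD" -h 10.50.25.100 -w beehive -D "~" --el-enabled \
        --el-label "${prefix}Err" --el-limit 100 \
        --el-table "${prefix}_error_table" \
        --el-errfile "/home/ccrisp/${prefix}_error.txt" \
        "$table" "$datafile"
}

# Example call, equivalent to the command in step 2:
#   load_with_error_log 2010MarchSales sales_fact 2010MarchSales_data.txt
```

The error table itself must still be created first, as in step 1.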

Hints for Successful Loading

Recommended Character Set Is UTF-8

The default character set for Aster databases is UTF-8, and the Aster team recommends that you load only UTF-8 formatted data when loading to char, varchar, and text columns.

For the tools you use to prepare files and to connect to Aster, make sure you have set the default character set to UTF-8. This is particularly important if you are loading data from a Windows-based machine. For example, if you will use an SSH client (e.g., PuTTY) to run ncluster_loader, make sure you set the SSH client’s default character set to UTF-8.

We recommend that, prior to loading, you convert your text files to UTF-8. For example, if you’re a Notepad++ user, you can use the command, “Convert to UTF8 without BOM.”
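
On a UNIX-like machine, the standard iconv utility can do the same check and conversion from the command line. This is a minimal sketch, not an Aster-provided procedure: the function names, the file names, and the assumed source encoding (ISO-8859-1) are illustrative, and you must know the real encoding of your file to convert it correctly.

```shell
# is_valid_utf8 FILE   (hypothetical helper)
# Succeeds only if FILE decodes cleanly as UTF-8; iconv fails on invalid bytes.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# to_utf8 SOURCE_ENCODING INPUT OUTPUT   (hypothetical helper)
# Re-encodes INPUT from SOURCE_ENCODING to UTF-8 and writes the result to OUTPUT.
to_utf8() {
    iconv -f "$1" -t UTF-8 "$2" > "$3"
}

# Example (file names and source encoding are assumptions):
#   is_valid_utf8 sales.txt || to_utf8 ISO-8859-1 sales.txt sales.utf8.txt
```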

Newline Character

Make sure your data file uses a consistent character to represent newlines. If the file uses \r\n for newlines, then it should not also use \n for newlines, and vice versa. If your file contains both UNIX-style \n newlines and Windows-style \r\n newlines, then you must clean the file before you try to load it. The UNIX command dos2unix can be useful for doing this.
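
Before loading, you can check which newline style a file actually uses. The sketch below relies only on standard UNIX tools; the function names are ours, and it assumes that the last line of the file is newline-terminated. Where dos2unix is not installed, strip_cr is a rough stand-in, with the caveat that it removes every \r byte, including any that appear inside field values.

```shell
# newline_style FILE   (hypothetical helper)
# Prints "lf", "crlf", "mixed", or "none" (file has no newline-terminated lines).
newline_style() {
    cr=$(printf '\r')
    total=$(( $(wc -l < "$1") ))              # lines terminated by \n
    crlf=$(grep -c "${cr}\$" "$1" || true)    # lines whose \n is preceded by \r
    if [ "$total" -eq 0 ]; then
        echo none
    elif [ "$crlf" -eq 0 ]; then
        echo lf
    elif [ "$crlf" -eq "$total" ]; then
        echo crlf
    else
        echo mixed
    fi
}

# strip_cr INPUT OUTPUT   (hypothetical helper)
# Writes INPUT to OUTPUT with every \r byte removed, leaving UNIX-style newlines.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}
```

A file that reports "mixed" is the case described above and should be cleaned before you load it.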

Multiple Loader Nodes

The Aster Loader Tool supports the use of many Aster Loader nodes. For most loading tasks, the queen is sufficient to handle all loading, but for high volume loading, you can add dedicated loader nodes to your cluster.

To use a loader node, you invoke one or more ncluster_loader instances that will load through that loader node. You may run many ncluster_loader sessions in parallel against one loader node, and you may use many loader nodes in parallel (with each node handling loads from a number of ncluster_loader instances).

To do this, you invoke each ncluster_loader instance with the -l (and optionally -f) argument to specify the loader node. The relevant flags are:

• as always, the --hostname option (-h) provides the queen IP address;

• the --loader flag (-l) provides the IP address of the desired loader node; and

• optionally, the --force-loader flag (-f) forces the use of the desired loader node.
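
As an illustration of running several sessions through one loader node, the sketch below starts one ncluster_loader instance per input file in the background and waits for all of them. The function names and the LOADER_CMD variable are hypothetical; the database name and credentials are the ones used in the earlier example, and only the -h and -l flags described above are used to select the nodes.

```shell
# LOADER_CMD is a hypothetical override hook, e.g. for a dry run with a stub.
LOADER_CMD=${LOADER_CMD:-ncluster_loader}

# load_via_node QUEEN_IP LOADER_IP TABLE FILE   (hypothetical helper)
# Runs one loader session that connects to the queen and loads through LOADER_IP.
load_via_node() {
    "$LOADER_CMD" -h "$1" -l "$2" -d my_db -U beehive -w beehive "$3" "$4"
}

# parallel_load QUEEN_IP LOADER_IP TABLE FILE...   (hypothetical helper)
# Starts one session per file through the same loader node, waits for all of
# them, and returns non-zero if any session failed.
parallel_load() {
    queen=$1 loader=$2 table=$3
    shift 3
    pids=
    for f in "$@"; do
        load_via_node "$queen" "$loader" "$table" "$f" &
        pids="$pids $!"
    done
    status=0
    for p in $pids; do
        wait "$p" || status=1
    done
    return "$status"
}

# Example (IP addresses and file names are assumptions):
#   parallel_load 10.50.25.100 10.50.25.101 sales_fact part1.txt part2.txt
```

To spread work over several loader nodes, call parallel_load once per node with a different loader IP address and a different set of files.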

Loading Parent Child Tables with Inheritance

The --auto-partition option is retained in order to support parent/child tables created with inheritance. It is not used when working with parent/child tables created with autopartitioning. Using --auto-partition instructs the Aster Loader Tool to automatically send each row to the right child table during loading. Each row is directed to a table according to the check constraints you have set up on the child tables.

For example, if you partition your data into daily child tables based on the contents of a timestamp column, each ultimate child table in your schema will have a CHECK constraint that specifies what value of timestamp may be loaded into that child table. When you load data, the autopartitioning feature will route each row to the appropriate child table, based on its timestamp value.

Use autopartitioning like this:

1 Set up the parent-child table schema in your database. On each ultimate child table, write a CHECK constraint that specifies what data may be loaded into that child table.

Notice! Aster Database does not detect overlapping constraints on peer child tables. As a result, the correct placement of a row during loading can be indeterminate.

Workaround: Take care that the constraints you define do not create overlapping logical partitions. A simple mistake would be to set up range constraints like this:

CHECK ( ymdh BETWEEN '2005-07-01' AND '2005-08-01' );
CHECK ( ymdh BETWEEN '2005-08-01' AND '2005-09-01' );

In this example, it is not clear in which partition the ymdh value '2005-08-01' resides, because BETWEEN includes both endpoints and the value therefore satisfies both constraints.

2 Prepare your data for loading:

a Your data input file can contain data values that will end up in many different child tables.

b To handle rows that do not fit your partitioning scheme, you can rely on the standard error logging feature of the Aster Loader Tool (see “Error Logging” on page 204) or create a check constraint that will catch rows that you do not want to include in your partitions.

3 Run the Aster Loader Tool with the -a or --auto-partition flag.
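
One common way to avoid the overlap described in step 1 is to write each range as a half-open interval, so that every boundary value belongs to exactly one partition. The sketch below generates such clauses from a list of boundaries. The function name partition_checks is ours, and it assumes that your CHECK constraints may use the >= and < comparison operators in place of BETWEEN.

```shell
# partition_checks COLUMN BOUNDARY1 BOUNDARY2 [BOUNDARY3 ...]   (hypothetical helper)
# Prints one CHECK clause per pair of consecutive boundaries. Each clause
# includes its lower boundary and excludes its upper boundary.
partition_checks() {
    column=$1 lower=$2
    shift 2
    for upper in "$@"; do
        printf "CHECK ( %s >= '%s' AND %s < '%s' );\n" \
            "$column" "$lower" "$column" "$upper"
        lower=$upper
    done
}

partition_checks ymdh 2005-07-01 2005-08-01 2005-09-01
# prints:
# CHECK ( ymdh >= '2005-07-01' AND ymdh < '2005-08-01' );
# CHECK ( ymdh >= '2005-08-01' AND ymdh < '2005-09-01' );
```

With these clauses, the ymdh value '2005-08-01' satisfies only the second constraint.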

Detecting UNIQUE and PRIMARY KEY Violations Before Loading

Detecting UNIQUE and PRIMARY KEY violations in the data to be loaded is not always straightforward. In many cases the source is not a database you can easily run a query on to detect non-unique keys. Some techniques you can use to detect these conditions in your source data:

• Build a version of the target table in the target database without a UNIQUE or PRIMARY KEY constraint, load the data, then run a “detect duplicates” query to find the problematic rows/keys. In some cases, loading only a sample of the data provides enough clues to find and fix the problem in the source data.

• Alternatively (using an ETL tool), use this “keyless” version of the target table as a staging/temp table: load the data into it, check for issues such as duplicate keys, and dump any offending rows to a second error table. If no issues are found, transfer the data to the final destination table.

• If the source is a database, then run the “detect duplicates” query there.
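
When the source is a flat file rather than a database, a quick check with standard UNIX tools can find repeated keys before any load is attempted. This is a minimal sketch under stated assumptions: the input is tab-delimited, the key is a single column, and no field contains an embedded tab or newline. The function name duplicate_keys is ours.

```shell
# duplicate_keys FILE [COLUMN]   (hypothetical helper)
# Prints each value that occurs more than once in the given tab-separated
# column (column 1 if COLUMN is omitted). Prints nothing if all keys are unique.
duplicate_keys() {
    cut -f "${2:-1}" "$1" | sort | uniq -d
}

# Example (the file name is taken from the earlier loading example):
#   duplicate_keys sampleData-3.tsv 1
```

Empty output means that no key value is repeated in that column.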

Using COPY with Columnar Tables

A loading operation using the Aster Loader Tool, COPY, or INSERT can be expensive when the following conditions exist:

• the target table uses columnar storage, AND

• the target table has many logical partitions, AND

• the loaded data matches many different logical partitions.
