StreamSets Data Collector Release Notes

October 13, 2016
New Features and Enhancements
We’re happy to announce a new version of StreamSets Data Collector. This version includes new features and enhancements in the following areas.
Support for the Confluent Schema Registry
The Confluent Schema Registry is a distributed storage layer for Avro schemas. You can configure Data Collector stages that process Avro data to work with the Confluent Schema Registry in the following ways:
Origins can look up Avro schemas in the Schema Registry by the specified schema ID or
subject. The Kafka Consumer origin can also look up the Avro schema ID embedded in each
Kafka message in the Schema Registry.
Destinations can look up Avro schemas in the Schema Registry by the specified schema ID or
subject. Or, destinations can register and store new Avro schemas in the Schema Registry. The
Kafka Producer destination can also embed the Avro schema ID in each message that it writes.
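The schema ID that the Kafka Producer destination can embed follows Confluent's wire format: a zero magic byte, the 4-byte big-endian schema ID, then the Avro-encoded payload. A minimal sketch of that framing in Python (no running registry required; the function names are illustrative, not part of the Data Collector API):

```python
import struct

def embed_schema_id(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro payload with the Confluent wire-format header:
    magic byte 0x00 followed by the 4-byte big-endian schema ID."""
    return struct.pack(">bI", 0, schema_id) + avro_payload

def extract_schema_id(message: bytes) -> int:
    """Read the schema ID back out of a wire-format message."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != 0:
        raise ValueError("not a Confluent wire-format message")
    return schema_id

msg = embed_schema_id(42, b"\x02a")
print(extract_schema_id(msg))  # 42
```

A consumer that sees this header can fetch the matching schema from the registry by ID, which is how the Kafka Consumer origin resolves embedded schema IDs.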
Installation and Configuration
New stage libraries.​ Data Collector now supports Elasticsearch version 2.4.
Install additional tarball libraries using the Data Collector user interface.​ If you install the
Data Collector core tarball, you can now use the Data Collector user interface to install individual
stage libraries.
Amazon Web Services Stages
Connect to Amazon Web Services through endpoints.​ In addition to connecting to the
standard Amazon Web Services regions, you can select Other for the region and then specify
the endpoint to connect to for the following stages:
○ Amazon S3 origin
○ Kinesis Consumer origin
○ Amazon S3 destination
○ Kinesis Firehose destination
○ Kinesis Producer destination
Renamed property for the Amazon S3 destination.​ The File Name Prefix property has been
renamed to the Object Name Prefix property.
New MapR FS origin​. Use the new MapR FS origin in a cluster mode pipeline to process files
stored on MapR FS.
Directory origin enhancement. ​The Directory origin can now create error records for delimited
data with more than the expected number of fields.
© 2016, ​StreamSets, Inc.​, ​Apache License, Version 2.0
Oracle CDC Client enhancement. ​Improved performance by processing data based on the
commit number in ascending order.
UDP Source and UDP to Kafka enhancement.​ To improve performance when reading
messages from UDP ports, you can configure the UDP Source and UDP to Kafka origins to use
multiple receiver threads for each port. Because the multi-threading requires native libraries, it is
available only when Data Collector runs on 64-bit Linux.
Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator enhancements.​ You can
now use the processors to create new records and to create list-map fields. In a pipeline that
processes the whole file data format, you can now use the processors to access whole file data
by creating an input stream to the file.
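A self-contained sketch of the new record-creation capability, written in Python to match the Jython Evaluator. The classes below are only stand-ins for the objects that Data Collector injects at runtime; the `sdcFunctions.createRecord` and `output.write` names are assumptions based on the feature description, not an exact copy of the stage API:

```python
# Stand-ins for the objects Data Collector provides inside the Jython
# Evaluator. In a real pipeline, delete these: sdcFunctions and output
# are injected by the stage itself (assumed API, for illustration only).

class _Record:
    """Stand-in for a Data Collector record."""
    def __init__(self, source_id):
        self.sourceId = source_id
        self.value = None

class _SdcFunctions:
    """Stand-in for the evaluator's sdcFunctions object."""
    def createRecord(self, source_id):
        return _Record(source_id)

class _Output:
    """Stand-in for the evaluator's output batch."""
    def __init__(self):
        self.records = []
    def write(self, record):
        self.records.append(record)

sdcFunctions, output = _SdcFunctions(), _Output()

# --- what the evaluator script body could look like ---
# Create a brand-new record instead of only transforming incoming ones,
# and give it a dict-like value (a list-map field in SDC terms).
new_record = sdcFunctions.createRecord('generated-id-1')
new_record.value = {'id': 1, 'name': 'example'}
output.write(new_record)

print(len(output.records))  # 1
```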
JDBC Lookup enhancement.​ To improve pipeline performance, you can configure the JDBC
Lookup processor to locally cache the lookup values returned from a database table.
Data Formats
Text data format enhancement. ​When writing text data, you can now specify the record
separator characters that you want to use.
Whole file data format enhancement.​ You can now use the Groovy Evaluator, JavaScript
Evaluator, and Jython Evaluator to access whole file data by creating an input stream to the file.
Cluster Mode
MapR support for cluster batch mode.​ Use the new MapR FS origin in a cluster mode
pipeline to process files from MapR FS.
Please feel free to check out the ​Documentation​ for this release.
You can upgrade previous versions of Data Collector to this version. For instructions on upgrading,
see the ​Upgrade Documentation​.
Fixed Issues
The following list describes some of the issues fixed in this release.
For the full list, click ​here​.
Add notification in the user interface and documentation for the Hive Metadata
processor that Hive table names are created with lowercase letters.
Memory leaks can occur when running thousands of pipelines because Data
Collector does not purge the running pipeline cache at regular intervals.
Expose additional functions for data rules in the Alert Text property.
The RPM installation is missing the root-lib folder.
Pipelines upgraded to SDC that include the XML Flattener processor
generate the following validation error:
CREATION_013 - Configuration value 'true' is not boolean, it is
a '{}'
When the Directory origin encounters a line with an unexpected number of columns,
it stops reading the rest of the file. Instead, it should generate an error record for the
malformed line and then continue reading the rest of the file.
Memory leaks can occur when Data Collector constantly evaluates different
expressions because the commons-el library maintains a cache of all expressions
that is not properly evicted.
When you stop a cluster mode pipeline, the _tmp file might not be renamed.
The Hadoop FS origin incorrectly lists MapR as an available stage library.
Known Issues
Please note the following known issues with this release.
For a full list of known issues, check out ​our JIRA​.
The Cassandra destination encounters problems connecting to a Cassandra cluster
because the Cassandra stage library directory contains mixed versions of netty JAR files.
Workaround:
1. Remove all netty* JAR files from the Cassandra stage library directory.
2. Download the netty-all-4.0.41.Final.jar file.
3. Add the netty-all-4.0.41.Final.jar file to the Cassandra stage library directory.
If you configure a UDP Source or UDP to Kafka origin to enable multithreading after
you have already run the pipeline with the option disabled, the following validation
error displays:
Multithreaded UDP server is not available on your platform.
Workaround: Restart Data Collector.
Data Collector cannot access Vault secrets stored in Hashicorp Vault.
In cluster mode, Data Collector does not generate log files for worker Data Collectors.
The MapR FS destination does not support impersonating an HDFS user. Instead, the
destination always connects to MapR FS as the user account that started the Data
Collector.
Data preview fails for pipelines that use the Dev Raw Data Source origin when you
refresh the data preview or run data preview with changes.
The XML Flattener processor fails to parse XML that contains whitespace after the
XML prolog.
Workaround: Use an Expression Evaluator or scripting processor to remove the
whitespace before using the XML Flattener.
The Hive Streaming destination using the MapR library cannot connect to a MapR
cluster that uses Kerberos or username/password login authentication.
The Field Renamer processor does not support quoting regex special characters in
field names. For example, if you specify a field name of ​/'tag|attr'​, the processor
interprets the pipe symbol (|) as the regex OR and cannot find the field.
Workaround: Manually quote the special character by wrapping it in \Q and \E, for example: ​/'tag\Q|\Eattr'​.
If you run Data Collector from Docker, you cannot shut down Data Collector by
running ​docker stop​ or pressing Ctrl+C from the Docker Quickstart Terminal.
Workaround: In the Data Collector console, click ​Administration​ > ​Shut Down​.
Using the following commands to shut down or restart Data Collector does not
properly complete the shutdown:
● service sdc stop
● service sdc restart
Workaround: In the Data Collector console, click ​Administration​ > ​Shut Down​ or
​Administration​ > ​Restart​.
Cluster streaming pipelines that run on YARN use the YARN user instead of the Data
Collector user to run executors.
When you upgrade Data Collector from the RPM package, the environment
configuration file in ​$SDC_DIST/libexec/​ is overwritten.
Workaround: Back up the file before you upgrade.
When a pipeline writes error records to Elasticsearch, the record header information
(error code, error message, and error stage) is not preserved.
If you configure a Kafka Producer destination to write one message per batch, and
then use a cluster pipeline to process that data from the Kafka cluster, the cluster
pipeline might encounter an out of memory error.
To process records larger than 1 MB, you must configure the
DataFactoryBuilder.OverRunLimit property. However, this property is not configurable
in the Data Collector configuration file, ​$SDC_CONF/​.
Workaround: Set the DataFactoryBuilder.OverRunLimit property in the
SDC_JAVA_OPTS environment variable in the Data Collector environment file in
​$SDC_DIST/libexec/​.
Set the property greater than the largest record you want to process. For example, to
process records up to 2 MB, set the property to 2097152 as follows:
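For example, the environment file entry could look like the following. The notes above omit the environment file name, and the export syntax shown here is an assumption about how that file defines SDC_JAVA_OPTS:

```shell
# Assumed form of the entry in the Data Collector environment file
# under $SDC_DIST/libexec/ (file name omitted in these notes).
# 2097152 bytes = 2 MB, the largest record to process.
export SDC_JAVA_OPTS="-DDataFactoryBuilder.OverRunLimit=2097152 ${SDC_JAVA_OPTS}"
```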
A cluster mode pipeline can hang with a ​CONNECT_ERROR status. ​This can be a
temporary connection problem that resolves, returning the pipeline to the RUNNING state.
If the problem is not temporary, you might need to manually edit the pipeline state file
to set the pipeline to STOPPED. Edit the file only after you confirm that the pipeline is
no longer running on the cluster or that the cluster has been decommissioned.
To manually change the pipeline state, edit the following file: ​$SDC_DATA​/runInfo/
<cluster pipeline name>/<revision>/
In the file, change CONNECT_ERROR to STOPPED and save the file.
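A sketch of that manual edit in Python. The directory layout here is simulated under a temp directory, and the state file name and JSON shape are assumptions (the notes above omit the file name), so verify both against your actual $SDC_DATA directory first:

```python
import json
import os
import tempfile

# Hypothetical paths for illustration: substitute your $SDC_DATA, pipeline
# name, revision, and the real state file name, which these notes omit.
state_dir = os.path.join(tempfile.gettempdir(), "runInfo", "demo_pipeline", "0")
os.makedirs(state_dir, exist_ok=True)
state_path = os.path.join(state_dir, "pipelineState.json")  # assumed name

# Simulate a pipeline stuck in CONNECT_ERROR (assumed JSON shape).
with open(state_path, "w") as f:
    json.dump({"status": "CONNECT_ERROR"}, f)

# The manual fix: change the status to STOPPED and save the file.
with open(state_path) as f:
    state = json.load(f)
state["status"] = "STOPPED"
with open(state_path, "w") as f:
    json.dump(state, f)

print(json.load(open(state_path))["status"])  # STOPPED
```

Only make this change after confirming that the pipeline is no longer running on the cluster.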
When using the Kafka Consumer or Kafka Producer on HDP 2.3 with Kerberos
enabled,​ set the Kafka broker configuration property​ to PLAINTEXT.
When enabling Kerberos, HDP 2.3 sets the property to PLAINTEXTSASL, which is not supported.
If the property is not set to PLAINTEXT, validation errors indicate a problem
connecting to Kafka when the pipeline starts.
At this time, writing error records to file is not supported for cluster mode pipelines.
Workaround: Write error records to Kafka or to an SDC RPC pipeline.
For cluster mode pipelines configured to stop on error or to stop upon reaching a
memory limit, the Data Collector cannot stop all worker pipelines as expected.
Workaround: To stop all pipelines, use the Stop icon in the Data Collector console.
Contact Information
For more information about StreamSets, visit our website.
To review the latest documentation or try out our tutorials, check out the following links:
User Guide
User Guide tutorial
GitHub tutorials
To report an issue, ask for help, or find out about our next meetup, check out our Community page.
For general inquiries, contact us by email.