Elassandra Documentation



Release v2.4.2-10

Vincent Royer

Sep 07, 2017

Contents

1 Architecture
    1.1 Concepts Mapping
    1.2 Durability
    1.3 Shards and Replica
    1.4 Write path
    1.5 Search path

2 Installation
    2.1 Tarball
    2.2 DEB package
        2.2.1 Import the GPG Key
        2.2.2 Install Elassandra from the APT repository
        2.2.3 Install extra tools
        2.2.4 Usage
    2.3 RPM package
        2.3.1 Setup the RPM repository
        2.3.2 Install Elassandra
        2.3.3 Install extra tools
        2.3.4 Usage
    2.4 Docker image
        2.4.1 Start an elassandra server instance
        2.4.2 Connect to Cassandra from an application in another Docker container
        2.4.3 Make a cluster
        2.4.4 Container shell access and viewing Cassandra logs
        2.4.5 Environment Variables
    2.5 Build from source

3 Configuration
    3.1 Directory Layout
    3.2 Configuration
    3.3 Logging configuration
    3.4 Multi datacenter configuration
    3.5 Elassandra Settings
    3.6 Sizing and tuning
        3.6.1 Write performance
        3.6.2 Search performance

4 Mapping
    4.1 Type mapping
    4.2 Bidirectional mapping
    4.3 Meta-Fields
    4.4 Mapping change with zero downtime
    4.5 Partitioned Index
    4.6 Object and Nested mapping
    4.7 Dynamic mapping of Cassandra Map
        4.7.1 Dynamic Template with Dynamic Mapping
    4.8 Parent-Child Relationship
    4.9 Indexing Cassandra static columns
    4.10 Elassandra as a JSON-REST Gateway
    4.11 Check Cassandra consistency with elasticsearch

5 Operations
    5.1 Indexing
    5.2 GETing
    5.3 Updates
    5.4 Searching
        5.4.1 Optimizing search requests
        5.4.2 Caching features
    5.5 Create, delete and rebuild index
    5.6 Open, close index
    5.7 Flush, refresh index
    5.8 Percolator
    5.9 Managing Elassandra nodes
    5.10 Backup and restore
        5.10.1 Restoring a snapshot
        5.10.2 Point in time recovery
        5.10.3 Restoring to a different cluster
    5.11 How to change the elassandra cluster name

6 Integration
    6.1 Integration with an existing cassandra cluster
        6.1.1 Rolling upgrade to elassandra
        6.1.2 Create a new elassandra datacenter
    6.2 Installing Elasticsearch plugins
    6.3 Running Kibana with Elassandra
    6.4 JDBC Driver sql4es + Elassandra
    6.5 Running Spark with Elassandra

7 Testing
    7.1 Testing environment
    7.2 Elassandra unit test

8 Breaking changes and limitations
    8.1 Deleting an index does not delete cassandra data
    8.2 Cannot index document with empty mapping
    8.3 Nested or Object types cannot be empty
    8.4 Document version is meaningless
    8.5 Index and type names
    8.6 Column names
    8.7 Null values
    8.8 Elasticsearch unsupported features
    8.9 Cassandra limitations

9 Indices and tables

Elassandra tightly integrates Elasticsearch in Cassandra.

Contents:


Chapter 1. Architecture

Elassandra tightly integrates elasticsearch within cassandra as a secondary index, allowing near-realtime search with all existing elasticsearch APIs, plugins and tools like Kibana.

When you index a document, the JSON document is stored as a row in a cassandra table and synchronously indexed in elasticsearch.


Concepts Mapping

Elasticsearch           Cassandra               Description
Cluster                 Virtual Datacenter      All nodes of a datacenter form an Elasticsearch cluster
Shard                   Node                    Each cassandra node is an elasticsearch shard for each indexed keyspace
Index                   Keyspace                An elasticsearch index is backed by a keyspace
Type                    Table                   Each elasticsearch document type is backed by a cassandra table
Document                Row                     An elasticsearch document is backed by a cassandra row
Field                   Cell                    Each indexed field is backed by a cassandra cell (row x column)
Object or nested field  User Defined Type       Automatically creates a User Defined Type to store an elasticsearch object

From an Elasticsearch perspective:

• An Elasticsearch cluster is a Cassandra virtual datacenter.

• Every Elassandra node is a master primary data node.

• Each node only indexes local data and acts as a primary local shard.

• Elasticsearch data is no longer stored in lucene indices, but in cassandra tables.

    – An Elasticsearch index is mapped to a cassandra keyspace,

    – An Elasticsearch document type is mapped to a cassandra table.

    – The Elasticsearch document _id is a string representation of the cassandra primary key.

• Elasticsearch discovery now relies on the cassandra gossip protocol. When a node joins or leaves the cluster, or when a schema change occurs, each node updates the node status and its local routing table.

• The Elasticsearch gateway now stores metadata in a cassandra table and in the cassandra schema. Metadata updates are played sequentially through a cassandra lightweight transaction. The metadata UUID is the cassandra hostId of the last modifier node.

• The Elasticsearch REST and java APIs remain unchanged.

• Logging is now based on logback, as in cassandra.

From a Cassandra perspective:

• Columns with an ElasticSecondaryIndex are indexed in Elasticsearch (see the sketch after this list).

• By default, Elasticsearch document fields are multivalued, so every field is backed by a list. A single valued document field can be mapped to a basic type by setting 'cql_collection: singleton' in the type mapping. See Elasticsearch document mapping for details.

• Nested documents are stored using cassandra User Defined Types or maps.

• Elasticsearch provides a JSON-REST API to cassandra, see Elasticsearch API.
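As an illustration, the custom secondary index behind this integration is plain CQL. A minimal sketch of the DDL Elassandra manages for an indexed table might look like the following (the twitter.tweet table is illustrative, and the empty column list assumes the Cassandra 3.x per-row custom index syntax; Elassandra normally creates this index itself when you define the Elasticsearch mapping):

CREATE CUSTOM INDEX elastic_tweet_idx ON twitter.tweet ()
USING 'org.elassandra.index.ExtendedElasticSecondaryIndex';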

Durability

All writes to a cassandra node are recorded both in a memory table and in a commit log. When a memtable flush occurs, it flushes the elasticsearch secondary index to disk. When restarting after a failure, cassandra replays commitlogs and re-indexes elasticsearch documents that were not flushed by elasticsearch. This is the reason why the elasticsearch translog is disabled in elassandra.


Shards and Replica

Unlike Elasticsearch, sharding depends on the number of nodes in the datacenter, and the number of replicas is defined by your keyspace Replication Factor. Elasticsearch numberOfShards is just information about the number of nodes.

• When adding a new elassandra node, the cassandra bootstrap process gets some token ranges from the existing ring and pulls the corresponding data. Pulled data are automatically indexed, and each node updates its routing table to distribute search requests according to the ring topology.

• When updating the Replication Factor, you will need to run a nodetool repair <keyspace> on the new node to effectively copy and index the data.

• If a node becomes unavailable, the routing table is updated on all nodes in order to route search requests to the available nodes. The current default strategy routes search requests to the primary token ranges' owner first, then to replica nodes if available. If some token ranges become unreachable, the cluster status is red, otherwise the cluster status is yellow.
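The resulting status can be checked with the standard Elasticsearch cluster health API, which is unchanged in Elassandra:

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'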

After starting a new Elassandra node, data and elasticsearch indices are distributed on 2 nodes (with no replication).

nodetool status twitter

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  156,9 KB   2       70,3%             74ae1629-0149-4e65-b790-cd25c7406675  RAC1
UN  127.0.0.2  129,01 KB  2       29,7%             e5df0651-8608-4590-92e1-4e523e4582b9  RAC2

The routing table now distributes search requests over the two elassandra nodes, covering 100% of the ring.

curl -XGET 'http://localhost:9200/_cluster/state/?pretty=true'

{
  "cluster_name" : "Test Cluster",
  "version" : 12,
  "master_node" : "74ae1629-0149-4e65-b790-cd25c7406675",
  "blocks" : { },
  "nodes" : {
    "74ae1629-0149-4e65-b790-cd25c7406675" : {
      "name" : "localhost",
      "status" : "ALIVE",
      "transport_address" : "inet[localhost/127.0.0.1:9300]",
      "attributes" : {
        "data" : "true",
        "rack" : "RAC1",
        "data_center" : "DC1",
        "master" : "true"
      }
    },
    "e5df0651-8608-4590-92e1-4e523e4582b9" : {
      "name" : "127.0.0.2",
      "status" : "ALIVE",
      "transport_address" : "inet[127.0.0.2/127.0.0.2:9300]",
      "attributes" : {
        "data" : "true",
        "rack" : "RAC2",
        "data_center" : "DC1",
        "master" : "true"
      }
    }
  },
  "metadata" : {
    "version" : 1,
    "uuid" : "e5df0651-8608-4590-92e1-4e523e4582b9",
    "templates" : { },
    "indices" : {
      "twitter" : {
        "state" : "open",
        "settings" : {
          "index" : {
            "creation_date" : "1440659762584",
            "uuid" : "fyqNMDfnRgeRE9KgTqxFWw",
            "number_of_replicas" : "1",
            "number_of_shards" : "1",
            "version" : {
              "created" : "1050299"
            }
          }
        },
        "mappings" : {
          "user" : {
            "properties" : {
              "name" : {
                "type" : "string"
              }
            }
          },
          "tweet" : {
            "properties" : {
              "message" : {
                "type" : "string"
              },
              "postDate" : {
                "format" : "dateOptionalTime",
                "type" : "date"
              },
              "user" : {
                "type" : "string"
              },
              "_token" : {
                "type" : "long"
              }
            }
          }
        },
        "aliases" : [ ]
      }
    }
  },
  "routing_table" : {
    "indices" : {
      "twitter" : {
        "shards" : {
          "0" : [ {
            "state" : "STARTED",
            "primary" : true,
            "node" : "74ae1629-0149-4e65-b790-cd25c7406675",
            "token_ranges" : [ "(-8879901672822909480,4094576844402756550]" ],
            "shard" : 0,
            "index" : "twitter"
          } ],
          "1" : [ {
            "state" : "STARTED",
            "primary" : true,
            "node" : "e5df0651-8608-4590-92e1-4e523e4582b9",
            "token_ranges" : [ "(-9223372036854775808,-8879901672822909480]", "(4094576844402756550,9223372036854775807]" ],
            "shard" : 1,
            "index" : "twitter"
          } ]
        }
      }
    }
  },
  "routing_nodes" : {
    "unassigned" : [ ],
    "nodes" : {
      "e5df0651-8608-4590-92e1-4e523e4582b9" : [ {
        "state" : "STARTED",
        "primary" : true,
        "node" : "e5df0651-8608-4590-92e1-4e523e4582b9",
        "token_ranges" : [ "(-9223372036854775808,-8879901672822909480]", "(4094576844402756550,9223372036854775807]" ],
        "shard" : 1,
        "index" : "twitter"
      } ],
      "74ae1629-0149-4e65-b790-cd25c7406675" : [ {
        "state" : "STARTED",
        "primary" : true,
        "node" : "74ae1629-0149-4e65-b790-cd25c7406675",
        "token_ranges" : [ "(-8879901672822909480,4094576844402756550]" ],
        "shard" : 0,
        "index" : "twitter"
      } ]
    }
  },
  "allocations" : [ ]
}

Internally, each node broadcasts its local shard status in the gossip application state X1 ( “twitter”:STARTED ) and its current metadata UUID/version in application state X2.

nodetool gossipinfo

127.0.0.2/127.0.0.2
  generation:1440659838
  heartbeat:396197
  DC:DC1
  NET_VERSION:8
  SEVERITY:1.3877787807814457E-17
  X1:{"twitter":3}
  X2:e5df0651-8608-4590-92e1-4e523e4582b9/1
  RELEASE_VERSION:2.1.8
  RACK:RAC2
  STATUS:NORMAL,-8879901672822909480
  SCHEMA:ce6febf4-571d-30d2-afeb-b8db9d578fd1
  INTERNAL_IP:127.0.0.2
  RPC_ADDRESS:127.0.0.2
  LOAD:131314.0
  HOST_ID:e5df0651-8608-4590-92e1-4e523e4582b9
localhost/127.0.0.1
  generation:1440659739
  heartbeat:396550
  DC:DC1
  NET_VERSION:8
  SEVERITY:2.220446049250313E-16
  X1:{"twitter":3}
  X2:e5df0651-8608-4590-92e1-4e523e4582b9/1
  RELEASE_VERSION:2.1.8
  RACK:RAC1
  STATUS:NORMAL,4318747828927358946
  SCHEMA:ce6febf4-571d-30d2-afeb-b8db9d578fd1
  RPC_ADDRESS:127.0.0.1
  INTERNAL_IP:127.0.0.1
  LOAD:154824.0
  HOST_ID:74ae1629-0149-4e65-b790-cd25c7406675

Write path

Write operations (Elasticsearch index, update, delete and bulk operations) are converted to CQL write requests managed by the coordinator node. The elasticsearch document _id is converted to the underlying primary key, and the corresponding row is stored on many nodes according to the Cassandra replication factor. Then, on each node hosting this row, an Elasticsearch document is indexed through a Cassandra custom secondary index. Every document includes a _token field used when searching.

At index time, every node directly generates lucene fields without any JSON parsing overhead, and lucene files do not contain any version number, because version-based concurrency management becomes meaningless in a multi-master database like Cassandra.

Search path

A search request is done in two phases. In the query phase, the coordinator node adds a token_ranges filter to the query and broadcasts a search request to all nodes. This token_ranges filter covers the whole Cassandra ring and avoids duplicate results. Then, in the fetch phase, the coordinator fetches the required fields by issuing a CQL request on the underlying Cassandra table, and builds the final JSON response.

Adding a token_ranges filter to the original Elasticsearch query introduces an overhead in the query phase, and the more vnodes you have, the more this overhead grows with many OR clauses. To mitigate this overhead, Elassandra provides a random search strategy requesting the minimum number of nodes needed to cover the whole Cassandra ring. For example, if you have a datacenter with four nodes and a replication factor of two, it will request only two nodes with simplified token_ranges filters (adjacent token ranges are automatically merged).

Additionally, as these token_ranges filters only change when the datacenter topology changes (for example when a node is down or when adding a new node), Elassandra introduces a token_range bitset cache for each lucene segment. With this cache, out of range documents are seen as deleted documents at the lucene segment layer for subsequent queries using the same token_range filter. This drastically improves search performance.

Finally, the CQL fetch overhead can be mitigated by using keys and rows Cassandra caching, eventually using the off-heap caching features of Cassandra.
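For example, assuming the search strategy is exposed through the index.search_strategy_class setting (following the index.<property_name> convention described in the Configuration chapter), the random strategy could be enabled per index like this:

curl -XPUT 'http://localhost:9200/twitter/_settings' -d '{
    "index.search_strategy_class" : "RandomSearchStrategy"
}'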


Chapter 2. Installation

There are a number of ways to install Elassandra: from the tarball, with the deb or rpm package, with a docker image, or even from source.

Elassandra is based on Cassandra and ElasticSearch, so it will be easier if you're already familiar with one of these technologies.

Tarball

Elassandra requires at least Java 8. Oracle JDK is the recommended version, but OpenJDK should work as well. You can check which version is installed on your computer:

$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Once java is correctly installed, download the Elassandra tarball:

wget https://github.com/strapdata/elassandra/releases/download/v2.4.2-10/elassandra-2.4.2.tar.gz

Then extract its content:

tar -xzf elassandra-2.4.2.tar.gz

Go to the extracted directory:

cd elassandra-2.4.2

If you need to, configure conf/cassandra.yaml (cluster name, listen address, snitch, ...), then start elassandra:

bin/cassandra -f -e

This starts an Elassandra instance in foreground, with ElasticSearch enabled. Afterwards your node is reachable on localhost on port 9042 (CQL) and 9200 (HTTP). Keep this terminal open and launch a new one.

To use cqlsh, we first need to install the Cassandra driver for python. Ensure python and pip are installed, then:

sudo pip install cassandra-driver

Now connect to the node with cqlsh:

bin/cqlsh

You should now be able to type CQL commands. See the CQL reference.

Also, we started Elassandra with ElasticSearch enabled (thanks to the -e option), so let's request the REST API:

curl -XGET http://localhost:9200/

You should get something like:

{
  "name" : "127.0.0.1",
  "cluster_name" : "Test Cluster",
  "cluster_uuid" : "7cb65cea-09c1-4d6a-a17a-24efb9eb7d2b",
  "version" : {
    "number" : "2.4.2",
    "build_hash" : "b0b4cb025cb8aa74538124a30a00b137419983a3",
    "build_timestamp" : "2017-04-19T13:11:11Z",
    "build_snapshot" : true,
    "lucene_version" : "5.5.2"
  },
  "tagline" : "You Know, for Search"
}

You're ready to play with Elassandra. For instance, try to index a document with the ElasticSearch API, then from cqlsh look for the keyspace/table/row automatically created. Cassandra now benefits from dynamic mapping!
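For example, index a document through the Elasticsearch API (index, type and field names are illustrative), then look at the generated keyspace and table from cqlsh:

curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{ "user": "vince", "message": "Hello Elassandra!" }'

cqlsh> SELECT * FROM twitter.tweet;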

In a production environment, it's better to modify some system settings, like disabling swap. This guide shows you how. On linux, consider installing jemalloc.

DEB package

Our packages are hosted on packagecloud.io. Elassandra can be downloaded using an APT repository.

Note: Elassandra requires Java 8 to be installed.

Import the GPG Key

Download and install the public signing key:

curl -L https://packagecloud.io/elassandra/latest/gpgkey | sudo apt-key add -

Install Elassandra from the APT repository

Ensure apt is able to use https:

sudo apt-get install apt-transport-https

Add the Elassandra repository to your source list:

echo "deb https://packagecloud.io/elassandra/latest/debian jessie main" | sudo tee -a /etc/apt/sources.list.d/elassandra.list

Update the apt cache and install Elassandra:

sudo apt-get update
sudo apt-get install elassandra

Warning: You should uninstall Cassandra prior to installing Elassandra because the two packages conflict.

Install extra tools

Also install Python, pip, and cassandra-driver:

sudo apt-get update && sudo apt-get install python python-pip
sudo pip install cassandra-driver

Usage

This package installs a systemd service named cassandra, but does not start or enable it. For those who don't have systemd, an init.d script is also provided.

To start elassandra using systemd, run:

sudo systemctl start cassandra

Files locations:

• /etc/cassandra : configurations

• /var/lib/cassandra: database storage

• /var/log/cassandra: logs

• /usr/share/cassandra: plugins, modules, cassandra.in.sh, lib...

RPM package

Our packages are hosted on packagecloud.io. Elassandra can be downloaded using an RPM repository.

Note: Elassandra requires Java 8 to be installed.


Setup the RPM repository

Create a file called elassandra.repo in the directory /etc/yum.repos.d/ (redhat) or /etc/zypp/repos.d/ (opensuse), containing:

[elassandra_latest]
name=Elassandra repository
baseurl=https://packagecloud.io/elassandra/latest/el/7/$basearch
type=rpm-md
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://packagecloud.io/elassandra/latest/gpgkey
autorefresh=1
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

Install Elassandra

Using yum:

sudo yum install elassandra

Warning: You should uninstall Cassandra prior to installing Elassandra because the two packages conflict.

Install extra tools

Also install Python, pip, and cassandra-driver:

sudo yum install python python-pip
sudo pip install cassandra-driver

Usage

This package installs a systemd service named cassandra, but does not start or enable it. For those who don't have systemd, an init.d script is also provided.

To start elassandra using systemd, run:

sudo systemctl start cassandra

Files locations:

• /etc/cassandra : configurations

• /var/lib/cassandra: database storage

• /var/log/cassandra: logs

• /usr/share/cassandra: plugins, modules, cassandra.in.sh, lib...


Docker image

We provide an image on docker hub:

docker pull strapdata/elassandra

This image is based on the official Cassandra image, whose documentation is also valid for Elassandra.

Start an elassandra server instance

Starting an Elassandra instance is simple:

docker run --name some-elassandra -d strapdata/elassandra:tag

...where some-elassandra is the name you want to assign to your container and tag is the tag specifying the Elassandra version you want. Default is latest.

Connect to Cassandra from an application in another Docker container

This image exposes the standard Cassandra ports and the HTTP ElasticSearch port (9200), so container linking makes the Elassandra instance available to other application containers. Start your application container like this in order to link it to the Elassandra container:

docker run --name some-app --link some-elassandra:elassandra -d app-that-uses-elassandra

Make a cluster

Using the environment variables documented below, there are two cluster scenarios: instances on the same machine and instances on separate machines. For the same machine, start the instance as described above. To start other instances, just tell each new node where the first is.

docker run --name some-elassandra2 -d -e CASSANDRA_SEEDS="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-elassandra)" elassandra

... where some-elassandra is the name of your original Elassandra container, taking advantage of docker inspect to get the IP address of the other container.

Or you may use the docker run --link option to tell the new node where the first is:

docker run --name some-elassandra2 -d --link some-elassandra:elassandra elassandra

For separate machines (ie, two VMs on a cloud provider), you need to tell Elassandra what IP address to advertise to the other nodes (since the address of the container is behind the docker bridge).

Assuming the first machine's IP address is 10.42.42.42 and the second's is 10.43.43.43, start the first with the exposed gossip port:

docker run --name some-elassandra -d -e CASSANDRA_BROADCAST_ADDRESS=10.42.42.42 -p 7000:7000 elassandra

Then start an Elassandra container on the second machine, with the exposed gossip port and seed pointing to the first machine:

docker run --name some-elassandra -d -e CASSANDRA_BROADCAST_ADDRESS=10.43.43.43 -p 7000:7000 -e CASSANDRA_SEEDS=10.42.42.42 elassandra

Container shell access and viewing Cassandra logs

The docker exec command allows you to run commands inside a Docker container. The following command line will give you a bash shell inside your elassandra container:

$ docker exec -it some-elassandra bash

The Cassandra Server log is available through Docker’s container log:

$ docker logs some-elassandra

Environment Variables

When you start the Elassandra image, you can adjust the configuration of the Elassandra instance by passing one or more environment variables on the docker run command line. We have already seen some of them.

CASSANDRA_LISTEN_ADDRESS
    This variable is for controlling which IP address to listen for incoming connections on. The default will set the listen_address option in cassandra.yaml to the IP address of the container as it starts. This default should work in most use cases.

CASSANDRA_BROADCAST_ADDRESS
    This variable is for controlling which IP address to advertise to other nodes. It will set the broadcast_address and broadcast_rpc_address options in cassandra.yaml.

CASSANDRA_RPC_ADDRESS
    This variable is for controlling which address to bind the thrift rpc server to. If you do not specify an address, the wildcard address (0.0.0.0) will be used. It will set the rpc_address option in cassandra.yaml.

CASSANDRA_START_RPC
    This variable is for controlling if the thrift rpc server is started. It will set the start_rpc option in cassandra.yaml. As ElasticSearch uses this port in Elassandra, it is set ON by default.

CASSANDRA_SEEDS
    This variable is the comma-separated list of IP addresses used by gossip for bootstrapping new nodes joining a cluster. It will set the seeds value of the seed_provider option in cassandra.yaml. The CASSANDRA_BROADCAST_ADDRESS will be added to the seeds passed in so that the server will talk to itself as well.

CASSANDRA_CLUSTER_NAME
    This variable sets the name of the cluster and must be the same for all nodes in the cluster. It will set the cluster_name option of cassandra.yaml.

CASSANDRA_NUM_TOKENS
    This variable sets the number of tokens for this node. It will set the num_tokens option of cassandra.yaml.

CASSANDRA_DC
    This variable sets the datacenter name of this node. It will set the dc option of cassandra-rackdc.properties.

CASSANDRA_RACK
    This variable sets the rack name of this node. It will set the rack option of cassandra-rackdc.properties.

CASSANDRA_ENDPOINT_SNITCH
    This variable sets the snitch implementation this node will use. It will set the endpoint_snitch option of cassandra.yaml.

CASSANDRA_DAEMON
    The Cassandra entry-point class: org.apache.cassandra.service.ElassandraDaemon to start with ElasticSearch enabled (the default), org.apache.cassandra.service.CassandraDaemon otherwise.
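Putting several of these together, a node could be started with a custom cluster name, datacenter, rack and token count (values are illustrative):

docker run --name some-elassandra -d \
    -e CASSANDRA_CLUSTER_NAME="Test Cluster" \
    -e CASSANDRA_DC=DC1 \
    -e CASSANDRA_RACK=RAC1 \
    -e CASSANDRA_NUM_TOKENS=16 \
    strapdata/elassandra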


Build from source

Requirements:

• Oracle JDK 1.8 or OpenJDK 8

• maven >= 3.5

Clone the Elassandra repository and the Cassandra sub-module:

git clone --recursive git@github.com:strapdata/elassandra.git
cd elassandra

Elassandra uses Maven for its build system. Simply run:

mvn clean package -DskipTests

It’s gonna take a while, you might go for a cup of tea.

If everything succeeds, tarballs will be built in:

distribution/tar/target/release/elasandra-2.4.2-SNAPSHOT.tar.gz
distribution/zip/target/release/elasandra-2.4.2-SNAPSHOT.zip

Then follow the instructions for tarball installation.


Chapter 3. Configuration

Directory Layout

Elassandra merges the cassandra and elasticsearch directories as follows:

• conf : Cassandra configuration directory + elasticsearch.yml default configuration file.

• bin : Cassandra scripts + elasticsearch plugin script.

• lib : Cassandra and elasticsearch jar dependencies.

• pylib : Cqlsh python library.

• tools : Cassandra tools.

• plugins : Elasticsearch plugins installation directory.

• modules : Elasticsearch modules directory.

• work : Elasticsearch working directory.

Elasticsearch paths are set according to the following environment variables and system properties:

• path.home : CASSANDRA_HOME environment variable, cassandra.home system property, or the current directory.

• path.conf : CASSANDRA_CONF environment variable, path.conf or path.home.

• path.data : cassandra_storagedir/data/elasticsearch.data, path.data system property or path.home/data/elasticsearch.data.

Configuration

Elasticsearch configuration relies on the cassandra configuration file conf/cassandra.yaml for the following parameters.

Cassandra                Elasticsearch                                  Description
cluster.name             cluster_name                                   Elasticsearch cluster name is mapped to the cassandra cluster name.
rpc_address              network.host, transport.host                   Elasticsearch network and transport bind addresses are set to the cassandra rpc listen address.
broadcast_rpc_address    network.publish_host, transport.publish_host   Elasticsearch network and transport publish addresses are set to the cassandra broadcast rpc address.

Node roles (master, primary, data) are automatically set by elassandra; a standard configuration should only set cluster_name and rpc_address in conf/cassandra.yaml.
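A minimal conf/cassandra.yaml sketch for these parameters (addresses are illustrative):

cluster_name: 'Test Cluster'
rpc_address: 192.168.1.10
broadcast_rpc_address: 192.168.1.10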

Caution: If you use the GossipingPropertyFileSnitch to configure your cassandra datacenter and rack properties in conf/cassandra-rackdc.properties, keep in mind that this snitch falls back to the PropertyFileSnitch when gossip is not enabled. So, when re-starting the first node, dead nodes can appear in the default DC and rack configured in conf/cassandra-topology.properties. This also breaks the replica placement strategy and the computation of the Elasticsearch routing tables. It is therefore strongly recommended to set the same default rack and datacenter in both conf/cassandra-topology.properties and conf/cassandra-rackdc.properties.

Logging configuration

The cassandra logs in logs/system.log include the elasticsearch logs, according to your conf/logback.conf settings. See cassandra logging configuration.

Per keyspace (or per table) logging levels can be configured using the logger name org.elassandra.index.ExtendedElasticSecondaryIndex.<keyspace>.<table>.

Multi datacenter configuration

By default, all elassandra datacenters share the same Elasticsearch cluster name and mapping. This mapping is stored in the elastic_admin keyspace.


If you want to manage distinct Elasticsearch clusters inside a cassandra cluster (when indexing different tables in different datacenters), you can set a datacenter.group in conf/elasticsearch.yml; all elassandra datacenters sharing the same datacenter group name will then share the same mapping. These elasticsearch clusters will be named <cluster_name>@<datacenter.group> and the mapping will be stored in a dedicated keyspace.table elastic_admin_<datacenter.group>.metadata.

All elastic_admin[_<datacenter.group>] keyspaces are configured with the NetworkTopologyStrategy (see data replication), where the replication factor is automatically set to the number of nodes in each datacenter. This ensures maximum availability for the elasticsearch metadata. When removing a node from an elassandra datacenter, you should manually decrease the elastic_admin[_<datacenter.group>] replication factor to the new number of nodes.
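For example, after shrinking a datacenter DC1 to two nodes, the adjustment could look like this (keyspace and datacenter names depend on your setup):

cqlsh> ALTER KEYSPACE elastic_admin WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 2};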

When a mapping change occurs, Elassandra updates the Elasticsearch metadata in elastic_admin[_<datacenter.group>].metadata within a lightweight transaction to avoid conflicts with concurrent updates. This transaction requires QUORUM available nodes, that is more than half the nodes of one or more datacenters depending on your datacenter.group configuration. It also involves cross-datacenter network latency for each mapping update.

Tip: Cassandra cross-datacenter writes are not sent directly to each replica; instead, they are sent to a single replica with a parameter telling that replica to forward to the other replicas in that datacenter; those replicas will respond directly to the original coordinator. This reduces network traffic between datacenters when having many replicas.

Elassandra Settings

Most of the settings can be set at various levels:

• As a system property, the default property is es.<property_name>

• At cluster level, the default setting is cluster.default_<property_name>

• At index level, the setting is index.<property_name>

• At table level, the setting is configured as _meta : { "<property_name>" : <value> } for a document type.

For example, drop_on_delete_index can be:

• set as a system property es.drop_on_delete_index for all created indices.

• set at the cluster level with the cluster.default_drop_on_delete_index dynamic setting,

• set at the index level with the index.drop_on_delete_index dynamic index setting,

• set at the Elasticsearch document type level with _meta : { "drop_on_delete_index" : true } in the document type mapping (see the sketch below).

When a setting is dynamic, it is relevant only at the index and cluster setting levels; the system and document type setting levels are immutable.
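As a sketch, here is how the same drop_on_delete_index setting could be applied at the first three levels, assuming the standard Elasticsearch settings endpoints (the index name is illustrative):

# system property, for all created indices
bin/cassandra -f -e -Des.drop_on_delete_index=true

# cluster level (dynamic)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "persistent" : { "cluster.default_drop_on_delete_index" : true }
}'

# index level (dynamic)
curl -XPUT 'http://localhost:9200/twitter/_settings' -d '{
    "index.drop_on_delete_index" : true
}'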


Each setting can be updated at the levels shown in parentheses:

secondary_index_class (cluster, index)
    Secondary index implementation class. This class must implement the org.apache.cassandra.index.Index interface.

search_strategy_class (cluster, index)
    Search strategy class. Available strategies are: PrimaryFirstSearchStrategy, which distributes search requests to all available nodes, and RandomSearchStrategy, which distributes search requests to a subset of available nodes covering the whole cassandra ring. This improves search performance when RF > 1.

mapping_update_timeout (cluster, system), default 30s
    Dynamic mapping update timeout.

include_node_id (type, index, cluster, system), default false
    If true, indexes the cassandra hostId in the _node field.

synchronous_refresh (type, index, cluster, system), default false
    If true, synchronously refreshes the elasticsearch index on each index update.

drop_on_delete_index (type, index, cluster, system), default false
    If true, drops the underlying cassandra tables and keyspace when deleting an index, thus emulating the Elasticsearch behaviour.

index_on_compaction (type, index, cluster, system), default false
    If true, documents modified while compacting Cassandra SSTables are indexed (removed columns or rows involve a read to reindex). This comes with a performance cost for both compactions and subsequent search requests because it generates lucene tombstones, but it allows documents to be updated when rows or columns expire.

snapshot_with_sstable (type, index, cluster, system), default false
    If true, snapshots the lucene files when snapshotting an SSTable.

token_ranges_bitset_cache (index, cluster, system), default true
    If true, caches the token_ranges filter result for each lucene segment.

token_ranges_query_expire (system)
    Defines how long a token_ranges filter query is cached in memory. When such a query is removed from the cache, the associated cached token_ranges bitsets are also removed for all lucene segments.

version_less_engine (index, cluster, system), default true
    If true, uses the optimized lucene VersionLessEngine (which no longer manages any document version), otherwise uses the standard Elasticsearch Engine.

precision_step (index, cluster, system), default 6
    Sets the lucene numeric precision step, see Lucene Numeric Range Query.

index_static_document (index, cluster, system), default false
    If true, indexes static documents (elasticsearch documents containing only static and partition key columns).

index_static_only (index, cluster, system), default false
    If true and index_static_document is true, indexes a document containing only the static and partition key columns.

index_static_columns (index, cluster, system), default false
    If true and index_static_only is false, indexes static columns in the elasticsearch documents, otherwise ignores static columns.

partition_function_class (type, index, cluster, system)
    Partition function implementation class. Available implementations are: MessageFormatPartitionFunction based on the java MessageFormat.format(), and StringPartitionFunction based on the java String.format().

Sizing and tuning

Basically, Elassandra requires more CPU than standalone Cassandra or Elasticsearch, and Elassandra write throughput should be about half the cassandra write throughput if you index all columns. If you only index a subset of columns, performance will be better.

Design recommendations:

• Increase the number of Elassandra nodes or use partitioned indices to keep shard size below 50Gb.

• Avoid huge wide rows; the write-lock on a wide row can dramatically affect write performance.

• Choose the right compaction strategy to fit your workload (see this blog post by Justin Cameron).

System recommendations:

• Turn swapping off.

• Configure less than half the total memory of your server and up to 30.5Gb. The minimum recommended DRAM for production deployments is 32Gb. If you are not aggregating on analyzed string fields, you can probably use less memory to improve the file system cache used by Doc Values (see this excellent blog post by Chris Earle).

• Set -Xms to the same value as -Xmx.

• Ensure JNA and jemalloc are correctly installed and enabled.

Write performance

• By default, Elasticsearch analyzes the input data of all fields in a special _all field. If you don't need it, disable it.

• By default, Elasticsearch shards are refreshed every second, making new documents visible for search within a second. If you don't need that, increase the refresh interval to more than a second, or even turn it off temporarily by setting the refresh interval to -1 (see the example after this list).

• Use the optimized version-less Lucene engine (the default) to reduce index size.

• Keep index_on_compaction disabled (the default is false) to avoid the Lucene segments merge overhead when compacting SSTables.

• Index partitioning may increase write throughput by writing to several Elasticsearch indices in parallel, but choose an efficient partition function implementation. For example, String.format() is much faster than MessageFormat.format().
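For example, the _all and refresh recommendations can both be applied when creating an index (index and type names are illustrative):

curl -XPUT 'http://localhost:9200/twitter' -d '{
    "settings" : { "index.refresh_interval" : "60s" },
    "mappings" : {
        "tweet" : { "_all" : { "enabled" : false }, "discover" : ".*" }
    }
}'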

Search performance

• Use 16 to 64 vnodes per node to reduce the complexity of the token_ranges filter.

• Use the random search strategy and increase the Cassandra replication factor to reduce the number of nodes required for a search request.

• Enable the token_ranges_bitset_cache. This cache computes the token ranges filter once per Lucene segment. Check the token range bitset cache statistics to ensure this caching is efficient.

• Enable Cassandra row caching to reduce the overhead introduced by fetching the requested fields from the underlying Cassandra table (see the example after this list).

• Enable Cassandra off-heap row caching in your Cassandra configuration.

• When possible, clean lucene tombstones (updated or deleted documents) and reduce the number of Lucene segments by forcing a merge.
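For example, row caching can be enabled per table from CQL (table name and values are illustrative):

cqlsh> ALTER TABLE twitter.tweet WITH caching = {'keys': 'ALL', 'rows_per_partition': '1000'};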


Chapter 4. Mapping

Basically, an Elasticsearch index is mapped to a cassandra keyspace, and a document type to a cassandra table.

Type mapping

Here is the mapping from Elasticsearch field basic types to CQL3 types:

Elasticsearch Types    CQL Types                   Comment
string                 text
date                   timestamp
integer, short, byte   int
long                   bigint
double                 double
float                  float
boolean                boolean
binary                 blob
ip                     inet                        Internet address
string                 uuid, timeuuid              Specific mapping (1)
geo_point              UDT geo_point or text       Built-In User Defined Type (2)
geo_shape              text                        Requires _source enabled (3)
object, nested         Custom User Defined Type

1. Existing Cassandra uuid and timeuuid columns are mapped to an Elasticsearch string, but such columns cannot be created through the elasticsearch mapping.

2. Existing Cassandra text columns containing a geohash string can be mapped to an Elasticsearch geo_point.

3. Geo shapes require _source to be enabled to store the original JSON document (default is disabled).

These parameters control the cassandra mapping.


cql_collection (list, set or singleton)
    Controls how a field of type X is mapped to a column list<X>, set<X> or X. Default is list because Elasticsearch fields are multivalued.

cql_struct (udt or map)
    Controls how an object or nested field is mapped to a User Defined Type or to a cassandra map<text,?>. Default is udt.

cql_mandatory (true or false)
    Elasticsearch indexes the full document. For partial CQL updates, this controls which fields should be read to index a full document from a row. Default is true, meaning that updates involve reading all missing fields.

cql_primary_key_order (integer)
    Field position in the primary key of the underlying cassandra table. Default is -1, meaning that the field is not part of the cassandra primary key.

cql_partition_key (true or false)
    When cql_primary_key_order >= 0, specifies whether the field is part of the cassandra partition key. Default is false, meaning that the field is not part of the cassandra partition key.

cql_udt_name (<table_name>_<field_name>)
    Specifies the Cassandra User Defined Type name to use to store an object (by default, the name is derived from the table and field names, with dots replaced by underscores).

For more information about cassandra collection types and compound primary keys, see CQL Collections and Compound keys.
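For example, a minimal sketch of a mapping driving the cassandra schema with these parameters (index, type and field names are illustrative); this would create a twitter.tweet table with PRIMARY KEY ((user), post_id) and single-valued columns:

curl -XPUT 'http://localhost:9200/twitter' -d '{
    "mappings" : {
        "tweet" : {
            "properties" : {
                "user"    : { "type" : "string", "cql_collection" : "singleton", "cql_primary_key_order" : 0, "cql_partition_key" : true },
                "post_id" : { "type" : "string", "cql_collection" : "singleton", "cql_primary_key_order" : 1 },
                "message" : { "type" : "string", "cql_collection" : "singleton" }
            }
        }
    }
}'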

Bidirectional mapping

Elassandra supports the Elasticsearch Indices API and automatically creates the underlying cassandra keyspaces and tables. For each Elasticsearch document type, a cassandra table is created to reflect the Elasticsearch mapping. However, deleting an index does not remove the underlying keyspace; it just removes the cassandra secondary indices associated to the mapped columns.

Additionally, with the new put mapping parameter discover, Elassandra creates or updates the Elasticsearch mapping for an existing cassandra table. Columns matching the provided regular expression are mapped as Elasticsearch fields. The following command creates the elasticsearch mapping for all columns starting with 'a' in the cassandra table my_keyspace.my_table, and sets a specific analyzer for the column name.

curl -XPUT "http://localhost:9200/my_keyspace/_mapping/my_table" -d '{
    "my_table" : {
        "discover" : "a.*",
        "properties" : {
            "name" : {
                "type" : "string",
                "index" : "analyzed"
            }
        }
    }
}'

By default, all text columns are mapped with "index":"not_analyzed".

Tip:
When creating the first Elasticsearch index for a given cassandra table, elassandra creates a custom CQL secondary index asynchronously for each mapped field when all shards are started. Cassandra then builds the index on all nodes for all existing data. Subsequent CQL inserts or updates are automatically indexed in Elasticsearch.

If you then add one or more Elasticsearch indices to an already indexed table, existing data are not automatically re-indexed, because cassandra has already indexed the existing data. Instead of re-inserting your data in the cassandra table, you may use the following command to force a cassandra index rebuild. It will re-index your cassandra table in all associated elasticsearch indices:

nodetool rebuild_index [--threads <N>] <keyspace_name> <table_name> elastic_<table_name>_idx

• column_name is any indexed column (or elasticsearch top-level document field).

• rebuild_index reindexes SSTables from disk, but not from MEMtables. In order to index the very last inserted documents, run nodetool flush <keyspace_name> before rebuilding your elasticsearch indices.

• When deleting an elasticsearch index, elasticsearch index files are removed from the data/elasticsearch.data directory, but cassandra secondary indices remain in the CQL schema until the last associated elasticsearch index is removed. Cassandra is acting as primary data storage, so keyspaces, tables and data are never removed when deleting an elasticsearch index.

Meta-Fields

The meaning of Elasticsearch meta-fields is slightly different in Elassandra:

• _index is the index name, mapped to the underlying cassandra keyspace name (dash [-] and dot [.] are automatically replaced by underscore [_]).

• _type is the document type name, mapped to the underlying cassandra table name (dash [-] and dot [.] are automatically replaced by underscore [_]).

• _id is a string representation of the primary key of the underlying cassandra table. A single field primary key is converted to a string, a compound primary key is converted to a JSON array.

• _source is the indexed JSON document. By default, _source is disabled in Elassandra, meaning that _source is rebuilt from the underlying cassandra columns. If _source is enabled (see Mapping _source field), Elassandra stores documents indexed with the Elasticsearch API in a dedicated Cassandra text column named _source. This allows retrieving the original JSON document for a GeoShape Query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-geo-shape-query.html).

• _routing is valued with a string representation of the partition key of the underlying cassandra table. A single partition key is converted to a string, a compound partition key is converted to a JSON array. Specifying _routing on get, index or delete operations is useless, since the partition key is included in _id. On search operations, Elassandra computes the cassandra token associated to _routing for the search type, and reduces the search to only the cassandra nodes hosting this token. (WARNING: Without any search types, Elassandra cannot compute the cassandra token and returns the error all shards failed.)

• _ttl and _timestamp are mapped to the cassandra TTL and WRITETIME. The returned _ttl and _timestamp for a document will be the ones of a regular cassandra column if there is one in the underlying table. Moreover, when indexing a document through the Elasticsearch API, all cassandra cells carry the same WRITETIME and TTL, but this could be different when upserting some cells using CQL.

• _parent is a string representation of the parent document primary key. If the parent document primary key is composite, this is a string representation of the columns defined by cql_parent_pk in the mapping. See Parent-Child Relationship.

• _token is a meta-field introduced by Elassandra, valued with token(<partition_key>).

• _node is a meta-field introduced by Elassandra, valued with the cassandra host id, allowing to check the datacenter consistency.
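As a sketch, since the cluster state shown earlier maps _token as a regular long field, a search could be restricted to a token range with a standard range query (the bounds are illustrative):

curl -XGET 'http://localhost:9200/twitter/_search?pretty=true' -d '{
    "query" : { "range" : { "_token" : { "gte" : -8879901672822909480, "lte" : 4094576844402756550 } } }
}'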


Mapping change with zero downtime

You can map several Elasticsearch indices with different mapping to the same cassandra keyspace. By default, an index is mapped to a keyspace with the same name, but you can specify a target keyspace in your index settings.

For example, you can create a new index twitter2 mapped to the cassandra keyspace twitter and set a mapping for type tweet associated to the existing cassandra table twitter.tweet.

curl -XPUT "http://localhost:9200/twitter2/" -d '{
    "settings" : { "keyspace" : "twitter" },
    "mappings" : {
        "tweet" : {
            "properties" : {
                "message" : { "type" : "string", "index" : "not_analyzed" },
                "post_date" : { "type" : "date", "format" : "yyyy-MM-dd" },
                "user" : { "type" : "string", "index" : "not_analyzed" },
                "size" : { "type" : "long" }
            }
        }
    }
}'

You can set a specific mapping for twitter2 and re-index existing data on each cassandra node with the following command (indices are named elastic_<tablename>).

nodetool rebuild_index [--threads <N>] twitter tweet elastic_tweet_idx

By default, rebuild_index uses only one thread, but Elassandra supports multi-threaded index rebuild with the new parameter --threads. The index name is elastic_<table_name>_<column_name>_idx, where column_name is any indexed column name. Once your twitter2 index is ready, set an alias twitter for twitter2 to switch from the old mapping to the new one, and delete the old twitter index.

curl -XPOST "http://localhost:9200/_aliases" -d '{ "actions" : [ { "add" : { "index" : "twitter2", "alias" : "twitter" } } ] }'
curl -XDELETE "http://localhost:9200/twitter"


Partitioned Index

Elasticsearch TTL support is deprecated since Elasticsearch 2.0 and the Elasticsearch TTLService is disabled in Elassandra. Rather than periodically looking for expired documents, Elassandra supports partitioned indices, allowing to manage per time-frame indices. Thus, old data can be removed by simply deleting old indices.

A partitioned index also allows indexing more than 2^31 documents on a node (2^31 is the lucene max documents per index).

An index partition function acts as a selector when many indices are associated to a cassandra table. A partition function is defined by 3 or more fields separated by a space character:

• Function name.

• Index name pattern.

• 1 to N document field names.

The target index name is the result of your partition function.

A partition function must implement the java interface org.elassandra.index.PartitionFunction. Two implementation classes are provided:

• StringFormatPartitionFunction (the default) based on the JDK function String.format(Locale locale, <pattern>, <arg1>, ...).

• MessageFormatPartitionFunction based on the JDK function MessageFormat.format(<pattern>, <arg1>, ...).

Index partition functions are stored in a map, so a given partition function is executed exactly once for all mapped indices. For example, the toYearIndex function generates the target index logs_<year> depending on the value of the date_field for each document (or row).

You can define each per-year index as follows, with the same index.partition_function for all logs_<year> indices.


All those indices will be mapped to the keyspace logs, and all columns of the table mylog automatically mapped to the document type mylog.

curl -XPUT "http://localhost:9200/logs_2016" -d '{
    "settings" : {
        "keyspace" : "logs",
        "index.partition_function" : "toYearIndex logs_{0,date,yyyy} date_field",
        "index.partition_function_class" : "MessageFormatPartitionFunction"
    },
    "mappings" : {
        "mylog" : { "discover" : ".*" }
    }
}'

Tip: When creating the first Elasticsearch index for a Cassandra table, Elassandra may create some Cassandra secondary indices. Only the first created secondary index triggers a compaction to index the existing data. So, if you create a partitioned index on a table already holding data, the index rebuild may start before all partitions are created, and some rows could be ignored if they match a not yet created partitioned index. To avoid this situation, create partitioned indices before injecting data, or rebuild the secondary index entirely.

Tip: Partition function is executed for each indexed document, so if write throughput is a concern, you should choose an efficient implementation class.

To remove an old index.

curl XDELETE "http://localhost:9200/logs_2013"

Cassandra TTL can be used in conjunction with partitioned indices to automatically remove rows during the normal cassandra compaction and repair processes when index_on_compaction is true, but this introduces a lucene merge overhead because documents are re-indexed when compacting. You can also use the DateTieredCompactionStrategy or the TimeWindowCompactionStrategy to improve the performance of time series-like workloads.
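For example, rows could be written with a 30-day TTL from CQL (table and column names are illustrative):

cqlsh> INSERT INTO logs.mylog (id, date_field, message) VALUES (1, '2016-07-01', 'log entry') USING TTL 2592000;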

Object and Nested mapping

By default, Elasticsearch Object or nested types are mapped to dynamically created Cassandra User Defined Types .

curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user" : {
        "name" : {
            "first_name" : "Vincent",
            "last_name" : "Royer"
        },
        "uid" : "12345"
    },
    "message" : "This is a tweet!"
}'

curl -XGET 'http://localhost:9200/twitter/tweet/1/_source'
{"message":["This is a tweet!"],"user":{"uid":["12345"],"name":[{"first_name":["Vincent"],"last_name":["Royer"]}]}}

The resulting cassandra user defined types and table.

cqlsh> DESCRIBE KEYSPACE twitter;

CREATE TYPE twitter.tweet_user (
    name frozen<list<frozen<tweet_user_name>>>,
    uid frozen<list<text>>
);

CREATE TYPE twitter.tweet_user_name (
    last_name frozen<list<text>>,
    first_name frozen<list<text>>
);

CREATE TABLE twitter.tweet (
    "_id" text PRIMARY KEY,
    message list<text>,
    user list<frozen<tweet_user>>
);

cqlsh> SELECT * FROM twitter.tweet;

 _id | message              | user
-----+----------------------+-----------------------------------------------------------------------------
   1 | ['This is a tweet!'] | [{name: [{last_name: ['Royer'], first_name: ['Vincent']}], uid: ['12345']}]

Dynamic mapping of Cassandra Map

A nested document can be mapped to a User Defined Type or to a CQL map. In the following example, the cassandra map is automatically mapped with cql_mandatory:true, so a partial CQL update causes a read of the whole map to re-index the document in the elasticsearch index.

cqlsh> CREATE KEYSPACE IF NOT EXISTS twitter WITH replication = { 'class': 'NetworkTopologyStrategy', 'dc1': '1' };
cqlsh> CREATE TABLE twitter.user (
    name text,
    attrs map<text,text>,
    PRIMARY KEY (name)
);
cqlsh> INSERT INTO twitter.user (name, attrs) VALUES ('bob', { 'email': '[email protected]', 'firstname': 'bob' });

Create the type mapping from the Cassandra table and search for the bob entry.

curl XPUT "http://localhost:9200/twitter/_mapping/user" d '{ "user" : { "discover"

˓→

: ".*" }}'

{ "acknowledged" :true} curl XGET 'http://localhost:9200/twitter/_mapping/user?pretty=true'

{

"twitter" : {

"mappings" : {

"user" : {

"properties" : {

"attrs" : {

"type" : "nested" ,

"cql_struct" : "map" ,

4.7. Dynamic mapping of Cassandra Map 31

Elassandra Documentation, Release v2.4.2-10

}

}

}

}

}

"cql_collection" : "singleton" ,

"properties" : {

"email" : {

"type" : "string"

},

"firstname" : {

"type" : "string"

}

},

}

"name" : {

"type" : "string" ,

"cql_collection" : "singleton" ,

"cql_partition_key" : true,

"cql_primary_key_order" : 0

} curl XGET "http://localhost:9200/twitter/user/bob?pretty=true"

{

"_index" : "twitter" ,

"_type" : "user" ,

"_id" : "bob" ,

"_version" : 0 ,

"found" : true,

"_source" :{ "name" : "bob" , "attrs" :{ "email" : "[email protected]" , "firstname" : "bob" }}

}

Now insert a new entry in the attrs map column and search for a nested field attrs.city:paris.

cqlsh> UPDATE twitter.user SET attrs = attrs + { 'city': 'paris' } WHERE name = 'bob';

curl -XGET "http://localhost:9200/twitter/_search?pretty=true" -d '{
  "query" : {
    "nested" : {
      "path" : "attrs",
      "query" : { "match" : { "attrs.city" : "paris" } }
    }
  }
}'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 2.3862944,
    "hits" : [ {
      "_index" : "twitter",
      "_type" : "user",
      "_id" : "bob",
      "_score" : 2.3862944,
      "_source" : { "attrs" : { "city" : "paris", "email" : "[email protected]", "firstname" : "bob" }, "name" : "bob" }
    } ]
  }
}

Dynamic Template with Dynamic Mapping

Dynamic templates can be used when creating a dynamic field from a Cassandra map.

"mappings" : {

"event_test" : {

"dynamic_templates" : [

{ "strings_template" : {

"match" : "strings.*" ,

"mapping" : {

"type" : "string" ,

"index" : "not_analyzed"

}

}}

],

"properties" : {

"id" : {

"type" : "string" ,

"index" : "not_analyzed" ,

"cql_collection" : "singleton" ,

"cql_partition_key" : true,

},

"cql_primary_key_order" : 0

"strings" : {

"type" : "object" ,

"cql_struct" : "map" ,

"cql_collection" : "singleton"

}

}

}

}

Then, a new entry key1 in the underlying Cassandra map will have the following mapping:

"mappings" : {

"event_test" : {

"dynamic_templates" : [ {

"strings_template" : {

"mapping" : {

"index" : "not_analyzed" ,

"type" : "string" ,

},

"doc_values" : true

}

} ],

"match" : "strings.*"

"properties" : {

"strings" : {

4.7. Dynamic mapping of Cassandra Map 33

Elassandra Documentation, Release v2.4.2-10

}

}

}

"cql_struct" : "map" ,

"cql_collection" : "singleton" ,

"type" : "nested" ,

"properties" : {

"key1" : {

"index" : "not_analyzed" ,

"type" : "string"

}

},

"id" : {

"index" : "not_analyzed" ,

"type" : "string" ,

"cql_partition_key" : true,

"cql_primary_key_order" : 0 ,

"cql_collection" : "singleton"

}

Note that because doc_values defaults to true for a not-analyzed field, it does not appear in the generated key1 mapping.

Parent-Child Relationship

Elassandra supports parent-child relationships when the parent and child documents are located on the same Cassandra node.

This condition is met:

• when running a single node cluster,

• when the keyspace replication factor equals the number of nodes, or

• when the parent and child documents share the same Cassandra partition key, as shown in the following example.

Create an index company (a Cassandra keyspace) and a Cassandra table, insert 2 rows, and map this table as document type employee.

cqlsh <<EOF
CREATE KEYSPACE IF NOT EXISTS company WITH replication = { 'class': 'NetworkTopologyStrategy', 'dc1': '1' };
CREATE TABLE company.employee (
    "_parent" text,
    "_id" text,
    name text,
    dob timestamp,
    hobby text,
    primary key (("_parent"), "_id")
);
INSERT INTO company.employee ("_parent","_id",name,dob,hobby) VALUES ('london','1','Alice Smith','1970-10-24','hiking');
INSERT INTO company.employee ("_parent","_id",name,dob,hobby) VALUES ('london','2','Alice Smith','1990-10-24','hiking');
EOF

curl -XPUT "http://$NODE:9200/company2" -d '{
  "mappings" : {
    "employee" : {
      "discover" : ".*",
      "_parent" : { "type" : "branch", "cql_parent_pk" : "branch" }
    }
  }
}'

curl -XPOST "http://127.0.0.1:9200/company/branch/_bulk" -d '
{ "index" : { "_id" : "london" }}
{ "district" : "London Westminster", "city" : "London", "country" : "UK" }
{ "index" : { "_id" : "liverpool" }}
{ "district" : "Liverpool Central", "city" : "Liverpool", "country" : "UK" }
{ "index" : { "_id" : "paris" }}
{ "district" : "Champs Élysées", "city" : "Paris", "country" : "France" }
'

Search for documents having child documents of type employee with a dob date greater than 1980.

curl XGET "http://$NODE:9200/company2/branch/_search?pretty=true" d '{

"query" : {

"has_child" : {

"type" : "employee" ,

"query" : {

"range" : {

"dob" : {

"gte" : "1980-01-01"

}

}

}

}

} '

}

Search for employee documents having a parent document where country matches UK.

curl XGET "http://$NODE:9200/company2/employee/_search?pretty=true" d '{

"query" : {

"has_parent" : {

"parent_type" : "branch" ,

"query" : {

"match" : { "country" : "UK"

}

}

}

} '

}

Indexing Cassandra static columns

When a Cassandra table has one or more clustering columns, a static column is shared by all rows having the same partition key.


A slight modification of the Cassandra code provides support for secondary indices on static columns, allowing searches on static column values (CQL search on static columns remains unsupported). Each time a static column is modified, a document containing the partition key and only the static columns is indexed in Elasticsearch. By default, static columns are not indexed with every wide row because any update on a static column would require re-indexing all the wide rows. However, you can request fields backed by a static column on any get/search request.

The following example demonstrates how to use static columns to store meta information of a time series.

curl XPUT "http://localhost:9200/test" d '{

"mappings" : {

"timeseries" : {

"properties" : {

"t" : {

"type" : "date" ,

"format" : "strict_date_optional_time||epoch_millis" ,

"cql_primary_key_order" : 1 ,

"cql_collection" : "singleton"

},

"meta" : {

"type" : "nested" ,

"cql_struct" : "map" ,

"cql_static_column" : true,

"cql_collection" : "singleton" ,

"include_in_parent" : true,

"index_static_document" : true,

"index_static_columns" : true,

"properties" : {

"region" : {

"type" : "string"

}

},

}

"v" : {

"type" : "double" ,

"cql_collection" : "singleton"

},

"m" : {

"type" : "string" ,

"cql_partition_key" : true,

"cql_primary_key_order" : 0 ,

"cql_collection" : "singleton"

}

}

}

36 Chapter 4. Mapping

Elassandra Documentation, Release v2.4.2-10

} '

} cqlsh << EOF

INSERT INTO test .

timeseries (m, t, v) VALUES ( 'server1-cpu' , '2016-04-10 13:30' , 10 );

INSERT INTO test .

timeseries (m, t, v) VALUES ( 'server1-cpu' , '2016-04-10 13:31' , 20 );

INSERT INTO test .

timeseries (m, t, v) VALUES ( 'server1-cpu' , '2016-04-10 13:32' , 15 );

INSERT INTO test .

timeseries (m, meta) VALUES ( 'server1-cpu' , { 'region' : 'west' } );

SELECT

*

FROM test .

timeseries;

EOF m | t | meta | v

-------------+-----------------------------+--------------------+---server1 cpu | 2016 04 10 11 : 30 : 00.000000

z | { 'region' : 'west' } | 10 server1 cpu | 2016 04 10 11 : 31 : 00.000000

z | { 'region' : 'west' } | 20 server1 cpu | 2016 04 10 11 : 32 : 00.000000

z | { 'region' : 'west' } | 15

Search for wide rows only where v=10 and fetch the meta.region field.

curl XGET "http://localhost:9200/test/timeseries/_search?pretty=true&q=v:10&fields=m,

˓→ t,v,meta.region,_source"

"hits" : [ {

"_index" : "test" ,

"_type" : "timeseries" ,

"_id" : "[\"server1-cpu\",1460287800000]" ,

"_score" : 1.9162908

,

"_routing" : "server1-cpu" ,

"_source" : {

"t" : "2016-04-10T11:30:00.000Z" ,

"v" : 10.0

,

"meta" : { "region" : "west" },

"m" : "server1-cpu"

},

"fields" : {

"meta.region" : [ "west" ],

"t" : [ "2016-04-10T11:30:00.000Z" ],

"m" : [ "server1-cpu" ],

"v" : [ 10.0

]

}

} ]

Search for rows where meta.region=west; this returns only a static document (i.e. a document containing the partition key and static columns) because index_static_document is true.

curl XGET "http://localhost:9200/test/timeseries/_search?pretty=true&q=meta.

˓→ region:west&fields=m,t,v,meta.region"

"hits" : {

"total" : 1 ,

"max_score" : 1.5108256

,

"hits" : [ {

"_index" : "test" ,

"_type" : "timeseries" ,

"_id" : "server1-cpu" ,

"_score" : 1.5108256

,

"_routing" : "server1-cpu" ,

"fields" : {

"m" : [ "server1-cpu" ],

4.9. Indexing Cassandra static columns 37

Elassandra Documentation, Release v2.4.2-10

}

} ]

"meta.region" : [ "west" ]

If needed, you can change the default behavior for a specific Cassandra table (or Elasticsearch document type) by using the following custom metadata:

• index_static_document controls whether static documents (i.e. documents containing the partition key and static columns) are indexed (default is false).

• index_static_only: if true, only static documents are indexed, with the partition key as _id and the static columns as fields.

• index_static_columns controls whether static columns are included in indexed documents (default is false).

Be careful: if index_static_document=false and index_static_only=true, no document is indexed at all.

In our example, with the following mapping, static columns are indexed in every document, allowing searches on them:

curl -XPUT "http://localhost:9200/test/_mapping/timeseries" -d '{
  "timeseries" : {
    "discover" : ".*",
    "_meta" : {
      "index_static_document" : true,
      "index_static_columns" : true
    }
  }
}'

Elassandra as a JSON-REST Gateway

When dynamic mapping is disabled and a mapping type has no indexed field, Elassandra nodes can act as a JSON-REST gateway for Cassandra to get, set or delete a Cassandra row without any indexing overhead. In this case, the mapping may be used to cast types or format date fields, as shown below.

CREATE TABLE twitter.tweet (
    "_id" text PRIMARY KEY,
    message list<text>,
    post_date list<timestamp>,
    size list<bigint>,
    user list<text>
);

curl -XPUT "http://$NODE:9200/twitter/" -d '{
  "settings" : { "index.mapper.dynamic" : false },
  "mappings" : {
    "tweet" : {
      "properties" : {
        "size" : { "type" : "long", "index" : "no" },
        "post_date" : { "type" : "date", "index" : "no", "format" : "strict_date_optional_time||epoch_millis" }
      }
    }
  }
}'


As a result, you can index, get or delete a Cassandra row, including any column of your Cassandra table:

curl -XPUT "http://localhost:9200/twitter/tweet/1?consistency=one" -d '{

"user" : "vince",

"post_date" : "2009-11-15T14:12:12",

"message" : "look at Elassandra !!",

"size": 50

}'

{"_index":"twitter","_type":"tweet","_id":"1","_version":1,"_shards":{"total":1,

˓→

"successful":1,"failed":0},"created":true}

$ curl -XGET "http://localhost:9200/twitter/tweet/1?pretty=true&fields=message,user,

{

˓→ size,post_date'

"_index" : "twitter",

"_type" : "tweet",

"_id" : "1",

"_version" : 1,

"found" : true,

"fields" : {

"size" : [ 50 ],

"post_date" : [ "2009-11-15T14:12:12.000Z" ],

"message" : [ "look at Elassandra !!" ],

"user" : [ "vince" ]

}

}

$ curl -XDELETE "http://localhost:9200/twitter/tweet/1?pretty=true'

{

"found" : true,

"_index" : "twitter",

"_type" : "tweet",

"_id" : "1",

"_version" : 0,

"_shards" : {

"total" : 1,

"successful" : 1,

"failed" : 0

}

}

Check Cassandra consistency with Elasticsearch

When index.include_node = true (default is false), the _node metafield, containing the Cassandra host id, is included in every indexed document. This allows distinguishing multiple copies of a document when the datacenter replication factor is greater than one. A token range aggregation then allows counting the number of documents for each token range and for each Cassandra node.

In the following example, we have 1000 accounts documents in a keyspace with RF=2 in a two-node datacenter, and each token range has the same number of documents on the two nodes.

curl XGET "http://$NODE:9200/accounts/_search?pretty=true&size=0" d '{

"aggs" : {

"tokens" : {

"token_range" : {

"field" : "_token"

4.11. Check Cassandra consistency with elasticsearch 39

Elassandra Documentation, Release v2.4.2-10

},

"aggs" : {

"nodes" : {

"terms" : { "field" : "_node" }

}

}

}

}

} '

{

"took" : 23 ,

"timed_out" : false,

"_shards" : {

"total" : 2 ,

"successful" : 2 ,

},

"failed" : 0

"hits" : {

"total" : 2000 ,

"max_score" : 0.0

,

},

"hits" : [ ]

"aggregations" : {

"tokens" : {

"buckets" : [ {

"key" : "(-9223372036854775807,-4215073831085397715]" ,

"from" : 9223372036854775807 ,

"from_as_string" : "-9223372036854775807" ,

"to" : 4215073831085397715 ,

"to_as_string" : "-4215073831085397715" ,

"doc_count" : 562 ,

"nodes" : {

"doc_count_error_upper_bound" : 0 ,

"sum_other_doc_count" : 0 ,

"buckets" : [ {

"key" : "528b78d3-fae9-49ae-969a-96668566f1c3" ,

"doc_count" : 281

}, {

"key" : "7f0b782e-5b75-409b-85e9-f5f96a75a7dc" ,

"doc_count" : 281

} ]

}

}, {

"key" : "(-4215073831085397714,7919694572960951318]" ,

"from" : 4215073831085397714 ,

"from_as_string" : "-4215073831085397714" ,

"to" : 7919694572960951318 ,

"to_as_string" : "7919694572960951318" ,

"doc_count" : 1268 ,

"nodes" : {

"doc_count_error_upper_bound" : 0 ,

"sum_other_doc_count" : 0 ,

"buckets" : [ {

"key" : "528b78d3-fae9-49ae-969a-96668566f1c3" ,

"doc_count" : 634

}, {

"key" : "7f0b782e-5b75-409b-85e9-f5f96a75a7dc" ,

"doc_count" : 634

40 Chapter 4. Mapping

Elassandra Documentation, Release v2.4.2-10

}

}

}

} ]

}

}, {

"key" : "(7919694572960951319,9223372036854775807]" ,

"from" : 7919694572960951319 ,

"from_as_string" : "7919694572960951319" ,

"to" : 9223372036854775807 ,

"to_as_string" : "9223372036854775807" ,

}

} ]

"doc_count" : 170 ,

"nodes" : {

"doc_count_error_upper_bound" : 0 ,

"sum_other_doc_count" : 0 ,

"buckets" : [ {

"key" : "528b78d3-fae9-49ae-969a-96668566f1c3" ,

"doc_count" : 85

}, {

"key" : "7f0b782e-5b75-409b-85e9-f5f96a75a7dc" ,

"doc_count" : 85

} ]

Of course, depending on your use case, you should add a filter to your query to ignore write operations occurring during the check.
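For instance, here is a hedged sketch of such a filtered check (the created_at field and its cut-off value are hypothetical); the token range aggregation is wrapped in a range query so that documents written after the check started are ignored:

curl -XGET "http://$NODE:9200/accounts/_search?pretty=true&size=0" -d '{
  "query" : {
    "range" : { "created_at" : { "lt" : "2017-09-01T00:00:00" } }
  },
  "aggs" : {
    "tokens" : {
      "token_range" : { "field" : "_token" },
      "aggs" : {
        "nodes" : { "terms" : { "field" : "_node" } }
      }
    }
  }
}'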


Operations

Indexing

Let's try and index some twitter-like information (demo from Elasticsearch). First, let's create a twitter user, and add some tweets (the twitter index will be created automatically; see automatic index and mapping creation in the Elasticsearch documentation):

curl -XPUT 'http://localhost:9200/twitter/user/kimchy' -d '{ "name" : "Shay Banon" }'

curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '
{
  "user" : "kimchy",
  "postDate" : "2009-11-15T13:12:00",
  "message" : "Trying out Elassandra, so far so good?"
}'

curl -XPUT 'http://localhost:9200/twitter/tweet/2' -d '
{
  "user" : "kimchy",
  "postDate" : "2009-11-15T14:12:12",
  "message" : "Another tweet, will it be indexed?"
}'

You now have two rows in the Cassandra twitter.tweet table.

cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 2.1.8 | CQL spec 3.2.0 | Native protocol v3]
Use HELP for help.
cqlsh> select * from twitter.tweet;

 _id | message                                    | postDate                     | user
-----+--------------------------------------------+------------------------------+------------
   2 | ['Another tweet, will it be indexed?']     | ['2009-11-15 15:12:12+0100'] | ['kimchy']
   1 | ['Trying out Elassandra, so far so good?'] | ['2009-11-15 14:12:00+0100'] | ['kimchy']

(2 rows)

Apache Cassandra is a column store that only supports upsert operations. This means that deleting a cell or a row involves the creation of a tombstone (inserting a null) kept until a later compaction removes both the obsolete data and the tombstone (see this blog about Cassandra tombstones).

By default, when using the Elasticsearch API to replace a document with a new one, Elassandra inserts a row corresponding to the new document, including null for unset fields. Without these nulls (cell tombstones), old fields not present in the new document would be kept at the Cassandra level as zombie cells.

Moreover, indexing with op_type=create (see Elasticsearch indexing at https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#operation-type) requires a Cassandra PAXOS transaction to check whether the document exists in the underlying datacenter. This comes with a useless performance cost if you use automatically generated document IDs (see Automatic ID generation), as this ID will be the Cassandra primary key.

Depending on the op_type and the document ID, CQL requests are issued as follows when indexing with the Elasticsearch API:

 op_type | Generated ID                | Provided ID                                             | Comment
---------+-----------------------------+---------------------------------------------------------+--------------------------------------------
 create  | INSERT INTO ... VALUES(...) | INSERT INTO ... VALUES(...) IF NOT EXISTS (1)           | Index a new document.
 index   | INSERT INTO ... VALUES(...) | DELETE FROM ... WHERE ...; INSERT INTO ... VALUES(...)  | Replace a document that may already exist.

(1) The IF NOT EXISTS comes with the cost of the PAXOS transaction. If you don't need to check the uniqueness of the provided ID, add the parameter check_unique_id=false.
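As a sketch (reusing the twitter example above with an arbitrary document ID), the PAXOS uniqueness check can be skipped for a provided ID when you already know it is unique:

curl -XPUT "http://localhost:9200/twitter/tweet/42?op_type=create&check_unique_id=false" -d '{
  "user" : "kimchy",
  "postDate" : "2009-11-15T15:00:00",
  "message" : "insert without the PAXOS uniqueness check"
}'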

GETing

Now, let's see if the information was added by GETting it:

curl -XGET 'http://localhost:9200/twitter/user/kimchy?pretty=true'
curl -XGET 'http://localhost:9200/twitter/tweet/1?pretty=true'
curl -XGET 'http://localhost:9200/twitter/tweet/2?pretty=true'

The Elasticsearch state now reflects the new twitter index. Because we are currently running on one node, the token_ranges routing attribute matches 100% of the ring, from Long.MIN_VALUE to Long.MAX_VALUE.

curl -XGET 'http://localhost:9200/_cluster/state/?pretty=true'
{
  "cluster_name" : "Test Cluster",
  "version" : 5,
  "master_node" : "74ae1629-0149-4e65-b790-cd25c7406675",
  "blocks" : { },
  "nodes" : {
    "74ae1629-0149-4e65-b790-cd25c7406675" : {
      "name" : "localhost",
      "status" : "ALIVE",
      "transport_address" : "inet[localhost/127.0.0.1:9300]",
      "attributes" : {
        "data" : "true",
        "rack" : "RAC1",
        "data_center" : "DC1",
        "master" : "true"
      }
    }
  },
  "metadata" : {
    "version" : 3,
    "uuid" : "74ae1629-0149-4e65-b790-cd25c7406675",
    "templates" : { },
    "indices" : {
      "twitter" : {
        "state" : "open",
        "settings" : {
          "index" : {
            "creation_date" : "1440659762584",
            "uuid" : "fyqNMDfnRgeRE9KgTqxFWw",
            "number_of_replicas" : "1",
            "number_of_shards" : "1",
            "version" : {
              "created" : "1050299"
            }
          }
        },
        "mappings" : {
          "user" : {
            "properties" : {
              "name" : {
                "type" : "string"
              }
            }
          },
          "tweet" : {
            "properties" : {
              "message" : {
                "type" : "string"
              },
              "postDate" : {
                "format" : "dateOptionalTime",
                "type" : "date"
              },
              "user" : {
                "type" : "string"
              }
            }
          }
        },
        "aliases" : [ ]
      }
    }
  },
  "routing_table" : {
    "indices" : {
      "twitter" : {
        "shards" : {
          "0" : [ {
            "state" : "STARTED",
            "primary" : true,
            "node" : "74ae1629-0149-4e65-b790-cd25c7406675",
            "token_ranges" : [ "(-9223372036854775808,9223372036854775807]" ],
            "shard" : 0,
            "index" : "twitter"
          } ]
        }
      }
    }
  },
  "routing_nodes" : {
    "unassigned" : [ ],
    "nodes" : {
      "74ae1629-0149-4e65-b790-cd25c7406675" : [ {
        "state" : "STARTED",
        "primary" : true,
        "node" : "74ae1629-0149-4e65-b790-cd25c7406675",
        "token_ranges" : [ "(-9223372036854775808,9223372036854775807]" ],
        "shard" : 0,
        "index" : "twitter"
      } ]
    }
  },
  "allocations" : [ ]
}

Updates

In Cassandra, an update is an upsert operation (if the row does not exist, it's an insert). Like Elasticsearch, Elassandra issues a GET operation before any update. Then, to keep the same semantics as Elasticsearch, update operations are converted to upserts with the ALL consistency level. Thus, later get operations are consistent. (You should consider the CQL UPDATE operation to avoid this performance cost.)

Scripted updates and upserts (scripted_upsert and doc_as_upsert) are also supported.
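As a minimal sketch based on the twitter example above, a partial update through the Elasticsearch API looks like this:

curl -XPOST 'http://localhost:9200/twitter/tweet/1/_update' -d '{
  "doc" : { "message" : "Updated message" }
}'

The equivalent CQL UPDATE avoids the extra GET and the ALL consistency level (message is a list<text> column in the underlying table):

cqlsh> UPDATE twitter.tweet SET message = ['Updated message'] WHERE "_id" = '1';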

Searching

Let's find all the tweets that kimchy posted:

curl -XGET 'http://localhost:9200/twitter/tweet/_search?q=user:kimchy&pretty=true'

We can also use the JSON query language Elasticsearch provides instead of a query string:

curl -XGET 'http://localhost:9200/twitter/tweet/_search?pretty=true' -d '
{
  "query" : {
    "match" : { "user" : "kimchy" }
  }
}'

To avoid duplicate results when the Cassandra replication factor is greater than one, Elassandra adds a token_ranges filter to every query distributed to all nodes. Because every document contains a _token field computed at index time, this ensures that a node only retrieves documents for the requested token ranges. The token_ranges parameter is a conjunction of Lucene NumericRangeQuery built from the Elasticsearch routing tables to cover the entire Cassandra ring.


curl XGET 'http://localhost:9200/twitter/tweet/_search?pretty=true&token_ranges=(0,

{

˓→

9223372036854775807)' d '

"query" : {

"match" : { "user" : "kimchy" }

}

} '

Of course, if the token range filter covers all ranges (Long.MIN_VALUE to Long.MAX_VALUE), Elassandra automatically removes the useless filter.

Finally, you can restrict a query to the coordinator node with the preference=_only_local parameter, for all token_ranges, as shown below:

curl -XGET 'http://localhost:9200/twitter/tweet/_search?pretty=true&preference=_only_local&token_ranges=' -d '
{
  "query" : {
    "match" : { "user" : "kimchy" }
  }
}'

Optimizing search requests

The search strategy

Elassandra supports various search strategies to distribute a search request over the Elasticsearch cluster. A search strategy is configured at index-level with the index.search_strategy_class parameter.

org.elassandra.cluster.routing.PrimaryFirstSearchStrategy (default)
    Searches on all alive nodes in the datacenter. All alive nodes respond for their primary token ranges, and for replica token ranges when some nodes are unavailable. This strategy is always used to build the routing table in the cluster state.

org.elassandra.cluster.routing.RandomSearchStrategy
    For each query, randomly distributes the search request to a minimum of nodes to reduce network traffic. For example, if your underlying keyspace replication factor is N, a search only involves 1/N of the nodes.

You can create an index with the RandomSearchStrategy as shown below.

curl XPUT "http://localhost:9200/twitter/" d '{

"settings" : {

"index.search_strategy_class" : "RandomSearchStrategy"

}

} '

Tip: When changing a keyspace replication factor, you can force an Elasticsearch routing table update by closing and re-opening all associated Elasticsearch indices. To troubleshoot search request routing, set the logging level to DEBUG for the class org.elassandra.cluster.routing in the conf/logback.xml file.


Caching features

Compared to Elasticsearch, Elassandra introduces a search overhead by adding a token ranges filter to each query and by fetching fields through a CQL request at the Cassandra layer. Both overheads can be mitigated by using the caching features.

Token Ranges Query Cache

Token ranges filters depend on the node or vnodes configuration, are quite stable, and are shared by all keyspaces having the same replication factor. These filters only change when the datacenter topology changes, for example when a node is temporarily down or when a node is added to the datacenter. So, Elassandra uses a cache to keep these queries (a conjunction of Lucene NumericRangeQuery) that are reused for every search request.

As a classic caching strategy, token_ranges_query_expire controls the expiration time of unused token ranges filter queries in memory. The default is 5 minutes.

Token Ranges Bitset Cache

When enabled, the token ranges bitset cache keeps in memory the results of the token range filter for each Lucene segment. This in-memory bitset, acting like the liveDocs Lucene tombstones mechanism, is then reused for subsequent Lucene search queries. For each Lucene segment, this document bitset is updated when the Lucene tombstones count increases (it's a bitwise AND between the actual Lucene tombstones and the token range filter result), or removed if the corresponding token ranges query is evicted from the token range query cache because it is unused.

You can enable the token range bitset cache at index level by setting index.token_ranges_bitset_cache to true (default is false), or configure its default value for newly created indices at cluster or system level.
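For example, a sketch of enabling it when creating an index:

curl -XPUT "http://localhost:9200/twitter/" -d '{
  "settings" : { "index.token_ranges_bitset_cache" : true }
}'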

You can also bypass this cache by adding token_ranges_bitset_cache=false to your search request:

curl -XGET "http://localhost:9200/twitter/_search?token_ranges_bitset_cache=false&q=*:*"

Finally, you can check the in-memory size of the token ranges bitset cache with the Elasticsearch stats API, and clear it when clearing the Elasticsearch query_cache:

curl -XGET "http://localhost:9200/_stats?pretty=true"
...
"segments" : {
  "count" : 3,
  "memory_in_bytes" : 26711,
  "terms_memory_in_bytes" : 23563,
  "stored_fields_memory_in_bytes" : 1032,
  "term_vectors_memory_in_bytes" : 0,
  "norms_memory_in_bytes" : 384,
  "doc_values_memory_in_bytes" : 1732,
  "index_writer_memory_in_bytes" : 0,
  "index_writer_max_memory_in_bytes" : 421108121,
  "version_map_memory_in_bytes" : 0,
  "fixed_bit_set_memory_in_bytes" : 0,
  "token_ranges_bit_set_memory_in_bytes" : 240
},
...


Cassandra Key and Row Cache

To improve CQL fetch request response time, Cassandra provides key and row caching features, configured for each Cassandra table as follows:

ALTER TABLE ... WITH caching = { 'keys' : 'ALL', 'rows_per_partition' : '1' };

To enable Cassandra row caching, set the row_cache_size_in_mb parameter in your conf/cassandra.yaml, and set row_cache_class_name: org.apache.cassandra.cache.OHCProvider to use off-heap memory.
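A minimal conf/cassandra.yaml sketch (the 64 MB size is an arbitrary example value):

# Enable the off-heap row cache
row_cache_size_in_mb: 64
row_cache_class_name: org.apache.cassandra.cache.OHCProvider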

Tip: Elasticsearch also provides a Lucene query cache, used for segments having more than 10k documents and for some frequent queries (queries issued more than 5 or 20 times, depending on the nature of the query). The shard request cache can also be enabled if the token range bitset cache is disabled.
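Assuming the standard Elasticsearch 2.x setting name index.requests.cache.enable, a sketch of enabling the shard request cache on an index:

curl -XPUT 'localhost:9200/twitter/_settings' -d '{
  "index.requests.cache.enable" : true
}'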

Create, delete and rebuild index

In order to create an Elasticsearch index from an existing Cassandra table, you can specify the underlying keyspace. In the following example, all columns but message are automatically mapped with the default mapping, and the message column is explicitly mapped with a custom mapping.

curl -XPUT 'http://localhost:9200/twitter_index' -d '{
  "settings" : { "keyspace" : "twitter" },
  "mappings" : {
    "tweet" : {
      "discover" : "^(?!message).*",
      "properties" : {
        "message" : { "type" : "string", "index" : "analyzed", "cql_collection" : "singleton" }
      }
    }
  }
}'

Deleting an Elasticsearch index does not remove any Cassandra data: it keeps the underlying Cassandra tables but removes the Elasticsearch index files.

curl -XDELETE 'http://localhost:9200/twitter_index'

To re-index your existing data, for example after a mapping change to index a new column, run a nodetool rebuild_index as follows:

nodetool rebuild_index [--threads <N>] <keyspace> <table> elastic_<table>_idx

Tip: By default, rebuild_index runs on a single thread. In order to improve re-indexing performance, Elassandra comes with a multi-threaded rebuild_index implementation. The --threads parameter allows specifying the number of threads dedicated to re-indexing a Cassandra table. The number of indexing threads should be tuned carefully to avoid CPU exhaustion. Moreover, indexing throughput is limited by locking at the Lucene level, but this limit can be exceeded by using a partitioned index involving many independent shards.


Alternatively, you can use the built-in rebuild action to rebuild the index on your whole Elasticsearch cluster at the same time. The num_threads parameter is optional (default is one), but you should take care about the load of your cluster in a production environment.

curl -XGET 'http://localhost:9200/twitter_index/_rebuild?num_threads=4'

Re-indexing existing data relies on the Cassandra compaction manager. You can trigger a Cassandra compaction when:

• creating the first Elasticsearch index on a Cassandra table with existing data,

• running a nodetool rebuild_index command,

• running a nodetool repair on a keyspace having indexed tables (a repair actually creates new SSTables, triggering the index build).

If the compaction manager is busy, the secondary index rebuild is added as a pending task and executed later. You can check the currently running compactions with nodetool compactionstats, and check pending compaction tasks with nodetool tpstats.

nodetool -h 52.43.156.196 compactionstats
pending tasks: 1
   id                                     compaction type         keyspace   table      completed   total       unit    progress
   052c70f0-8690-11e6-aa56-674c194215f6   Secondary index build   lastfm     playlist   66347424    330228366   bytes   20,09%
Active compaction remaining time : 0h00m00s

To stop a compaction task (including a rebuild_index task), you can either use nodetool stop or use the JMX management operation stopCompactionById (on MBean org.apache.cassandra.db.CompactionManager).
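For example (a sketch; INDEX_BUILD is one of the standard nodetool compaction types):

# Stop all running secondary index builds on this node
nodetool stop INDEX_BUILD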

Open, close, index

Open and close operations allow closing and opening an Elasticsearch index. Even if the Cassandra secondary index remains in the CQL schema while the index is closed, it has no overhead; it's just a dummy function call. Obviously, when several Elasticsearch indices are associated with the same Cassandra table, data are indexed in the opened indices, but not in the closed ones.

curl -XPOST 'localhost:9200/my_index/_close'
curl -XPOST 'localhost:9200/my_index/_open'

Warning: The Elasticsearch translog is disabled in Elassandra, so you might lose some indexed documents when closing an index if index.flush_on_close is false.

Flush, refresh index

A refresh makes all index updates performed since the last refresh available for search. By default, a refresh is scheduled every second. By design, setting refresh=true on an index operation has no effect with Elassandra, because write operations are converted to CQL queries and documents are indexed later by a custom secondary index. So, the per-index refresh interval should be set carefully according to your needs.

curl -XPOST 'localhost:9200/my_index/_refresh'
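The per-index refresh interval can be adjusted with the standard index settings API (a sketch; the 10s value is an arbitrary example):

curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index" : { "refresh_interval" : "10s" }
}'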


A flush basically writes a Lucene index to disk. Because the document _source is stored in the Cassandra table in Elassandra, it makes sense to execute a nodetool flush <keyspace> <table> to flush both the Cassandra Memtables to SSTables and the Lucene files for all associated Elasticsearch indices. Moreover, remember that a nodetool snapshot also involves a flush before creating a snapshot.

curl -XPOST 'localhost:9200/my_index/_flush'

Percolator

Elassandra supports the distributed percolator by storing percolation queries in a dedicated Cassandra table _percolator. As for documents, token range filtering applies to avoid duplicate query matching.

curl XPUT "localhost:9200/my_index" d '{

"mappings" : {

"my_type" : {

"properties" : {

"message" : { "type" : "string" },

"created_at" : { "type" : "date" }

}

}

} '

} curl XPUT "localhost:9200/my_index/.percolator/1" d '{

"query" : {

"match" : {

"message" : "bonsai tree"

}

}

} ' curl XPUT "localhost:9200/my_index/.percolator/2" d '{

"query" : {

"match" : {

"message" : "bonsai tree"

}

},

"priority" : "high"

} ' curl XPUT "localhost:9200/my_index/.percolator/3" d '{

"query" : {

"range" : {

"created_at" : {

"gte" : "2010-01-01T00:00:00" ,

"lte" : "2011-01-01T00:00:00"

}

}

},

"type" : "tweet" ,

"priority" : "high"

} '

Then search for matching queries.


curl -XGET 'localhost:9200/my_index/my_type/_percolate?pretty=true' -d '{
  "doc" : {
    "message" : "A new bonsai tree in the office"
  }
}'
{
  "took" : 4,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "total" : 2,
  "matches" : [ {
    "_index" : "my_index",
    "_id" : "2"
  }, {
    "_index" : "my_index",
    "_id" : "1"
  } ]
}

curl -XGET 'localhost:9200/my_index/my_type/_percolate?pretty=true' -d '{
  "doc" : {
    "message" : "A new bonsai tree in the office"
  },
  "filter" : {
    "term" : {
      "priority" : "high"
    }
  }
}'
{
  "took" : 4,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "total" : 1,
  "matches" : [ {
    "_index" : "my_index",
    "_id" : "2"
  } ]
}

Managing Elassandra nodes

You can add, remove or replace an Elassandra node using the same procedure as for Cassandra (see Adding nodes to an existing cluster). Even if it's technically possible, you should never bootstrap more than one node at a time.

During the bootstrap process, data pulled from existing nodes are automatically indexed by Elasticsearch on the new node, providing a kind of automatic Elasticsearch resharding. You can monitor and resume the Cassandra bootstrap process with the nodetool bootstrap command.

After the bootstrap successfully ends, you should clean up nodes to throw out any data that is no longer owned by that node, with a nodetool cleanup. Because a cleanup involves a delete-by-query in the Elasticsearch indices, it is recommended to smoothly schedule cleanups one at a time in your datacenter.
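For example (a sketch, assuming the Cassandra 2.2 nodetool sub-commands and the twitter keyspace from earlier examples):

# Resume an interrupted bootstrap on the joining node
nodetool bootstrap resume

# After the bootstrap, clean up a keyspace on each existing node, one node at a time
nodetool cleanup twitter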

Backup and restore

By design, Elassandra synchronously updates Elasticsearch indices on the Cassandra write path, and flushing a Cassandra table involves a flush of all associated Elasticsearch indices. Therefore, Elassandra can backup data by taking a snapshot of Cassandra SSTables and Elasticsearch Lucene files at the same time on each node, as follows:

1. nodetool snapshot --tag <snapshot_name> <keyspace_name>

2. For all indices associated to <keyspace_name>:

   cp -al $CASSANDRA_DATA/elasticsearch.data/<cluster_name>/nodes/0/indices/<index_name>/0/index/(_*|segment*) $CASSANDRA_DATA/elasticsearch.data/snapshots/<index_name>/<snapshot_name>/

Of course, rebuilding Elasticsearch indices after a Cassandra restore is another option.

Restoring a snapshot

Restoring Cassandra SSTables and Elasticsearch Lucene files allows recovering a keyspace and its associated Elasticsearch indices without stopping any node (but it is not intended to duplicate data to another virtual datacenter or cluster).

To perform a hot restore of a Cassandra keyspace and its Elasticsearch indices:

1. Close all Elasticsearch indices associated to the keyspace.

2. Truncate all Cassandra tables of the keyspace (because of delete operations later than the snapshot).

3. Restore the Cassandra tables with your snapshot on each node.

4. Restore the Elasticsearch snapshot on each node (if an ES index is open during nodetool refresh, this causes an Elasticsearch index rebuild by the compaction manager, usually 2 threads).

5. Load the restored SSTables with a nodetool refresh.

6. Open all indices associated to the keyspace.

Point in time recovery

Point-in-time recovery is intended to recover the data at any point in time. This requires a restore of the last available Cassandra and Elasticsearch snapshots before your recovery point, and then applying the commitlogs from that restore point to the recovery point. In this case, replaying the commitlogs on startup also re-indexes data in the Elasticsearch indices, ensuring consistency at the recovery point.

Of course, when stopping a production cluster is not possible, you should restore on a temporary cluster, make a full snapshot, and restore it on your production cluster as described by the hot restore procedure.

To perform a point-in-time recovery of a Cassandra keyspace and its Elasticsearch indices, for all nodes at the same time:

1. Stop all the datacenter nodes.

2. Restore the last Cassandra snapshot before the restore point, and the commitlogs from that point to the restore point.

3. Restore the last Elasticsearch snapshot before the restore point.

4. Restart your nodes.

Restoring to a different cluster

It is possible to restore a Cassandra keyspace and its associated Elasticsearch indices to another cluster.

1. On the target cluster, create the same Cassandra schema without any custom secondary indices.

2. From the source cluster, extract the mapping of your associated indices and apply it to your destination cluster. Your keyspace and indices should be open and empty at this step.

If you are restoring into a new cluster having the same number of nodes, configure it with the same token ranges (see https://docs.datastax.com/en/Cassandra/2.1/cassandra/operations/ops_snapshot_restore_new_cluster.html). In this case, you can restore from the Cassandra and Elasticsearch snapshots as described in steps 1, 3 and 4 of the snapshot restore procedure.

Otherwise, when the number of nodes and the token ranges of the source and destination clusters do not match, use sstableloader to restore your Cassandra snapshots (see https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsBulkloader_t.html). This approach is much more time- and IO-consuming because all rows are read from the SSTables and injected into the Cassandra cluster, causing a full Elasticsearch index rebuild.

How to change the elassandra cluster name

Because the cluster name is part of the Elasticsearch directory structure, managing snapshots with shell scripts could be a nightmare when the cluster name contains space characters. Therefore, it is recommended to avoid space characters in your Elassandra cluster name.

On all nodes:

1. In cqlsh, run UPDATE system.local SET cluster_name = '<new_cluster_name>' WHERE key='local';

2. Update the cluster_name parameter with the same value in your conf/cassandra.yaml.

3. Run nodetool flush system (this flushes your system keyspace to disk).

Then:

4. On one node only, change the primary key of your cluster metadata in the elastic_admin.metadata table, using cqlsh:

• COPY elastic_admin.metadata (cluster_name, metadata, owner, version) TO 'metadata.csv';

• Update the cluster name in the file metadata.csv (first field in the JSON document).

• COPY elastic_admin.metadata (cluster_name, metadata, owner, version) FROM 'metadata.csv';

• DELETE FROM elastic_admin.metadata WHERE cluster_name='<old_cluster_name>';

5. Stop all nodes in the cluster.

6. On all nodes, in your Cassandra data directory, move elasticsearch.data/<old_cluster_name> to elasticsearch.data/<new_cluster_name>.

7. Restart all nodes.

8. Check the cluster name in the Elasticsearch cluster state, and check that you can update the mapping.


Integration

Integration with an existing cassandra cluster

Elassandra includes a modified version of Cassandra 2.2, so all nodes of a cluster should run Elassandra binaries. However, you can start a node with or without the Elasticsearch support. Obviously, all nodes of a datacenter should run either Cassandra only, or Cassandra with Elasticsearch.

Rolling upgrade to elassandra

Before starting any Elassandra node with Elasticsearch enabled, do a rolling replace of the Cassandra binaries by the Elassandra ones. For each node:

• Install Elassandra.

• Replace the Elassandra configuration files by the ones from your existing cluster (cassandra.yaml and snitch configuration file).

• Stop your Cassandra node.

• Restart Cassandra (bin/cassandra) or Cassandra with Elasticsearch enabled (bin/cassandra -e).

Create a new elassandra datacenter

The overall procedure is similar to the Cassandra one described at https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_dc_to_cluster_t.html.

For each node in your new datacenter:

• Install Elassandra.

• Set auto_bootstrap: false in your conf/cassandra.yaml.

• Start cassandra-only nodes in your new datacenter and check that all nodes join the cluster.

bin/cassandra

• Restart all nodes in your new datacenter with Elasticsearch enabled. You should see started shards but empty indices.

bin/cassandra -e

• Set the replication factor of indexed keyspaces to one or more in your new datacenter.

• Pull data from your existing datacenter:

nodetool rebuild <source datacenter name>

After the rebuild on all your new nodes, you should see the same number of documents for each index in your new and existing datacenters.

• Set auto_bootstrap: true (the default value) in your conf/cassandra.yaml.

• Create a new Elasticsearch index or map some existing Cassandra tables.

Tip: If you need to replay this procedure for a node:

• stop your node,

• run nodetool removenode <id-of-node-to-remove>,

• clear its data, commitlogs and saved_caches directories.

Installing Elasticsearch plugins

Elasticsearch plugin installation remains unchanged; see the Elasticsearch plugin installation documentation.

bin/plugin install <url>

Running Kibana with Elassandra

Kibana version 4.6 can run with Elassandra, providing a visualization tool for Cassandra and Elasticsearch data.

If you want to load sample data from the Kibana Getting Started, apply the following changes to logstash.jsonl with a sed command:

s/logstash-2015.05.18/logstash_20150518/g
s/logstash-2015.05.19/logstash_20150519/g
s/logstash-2015.05.20/logstash_20150520/g
s/article:modified_time/articleModified_time/g
s/article:published_time/articlePublished_time/g
s/article:section/articleSection/g
s/article:tag/articleTag/g
s/og:type/ogType/g
s/og:title/ogTitle/g
s/og:description/ogDescription/g
s/og:site_name/ogSite_name/g
s/og:url/ogUrl/g
s/og:image:width/ogImageWidth/g
s/og:image:height/ogImageHeight/g
s/og:image/ogImage/g
s/twitter:title/twitterTitle/g
s/twitter:description/twitterDescription/g
s/twitter:card/twitterCard/g
s/twitter:image/twitterImage/g
s/twitter:site/twitterSite/g

JDBC Driver sql4es + Elassandra

The Elasticsearch JDBC driver sql4es can be used with Elassandra. Here is a code example:

Class.forName("nl.anchormen.sql4es.jdbc.ESDriver");
Connection con = DriverManager.getConnection("jdbc:sql4es://localhost:9300/twitter?cluster.name=Test%20Cluster");
Statement st = con.createStatement();
ResultSet rs = st.executeQuery("SELECT user,avg(size),count(*) FROM tweet GROUP BY user");
ResultSetMetaData rsmd = rs.getMetaData();
int nrCols = rsmd.getColumnCount();
while (rs.next()) {
    for (int i = 1; i <= nrCols; i++) {
        System.out.println(rs.getObject(i));
    }
}
rs.close();
con.close();

Running Spark with Elassandra

A modified version of the elasticsearch-hadoop connector is available for Elassandra at https://github.com/vroyer/elasticsearch-hadoop. This connector works with Spark as described in the Elasticsearch documentation available at https://www.elastic.co/guide/en/elasticsearch/hadoop/current/index.html.

For example, in order to submit a Spark job in client mode:

bin/spark-submit --driver-class-path <yourpath>/elasticsearch-spark_2.10-2.2.0.jar --master spark://<sparkmaster>:7077 --deploy-mode client <application.jar>


Testing

Elasticsearch comes with a testing framework based on JUnit and RandomizedRunner, provided by the randomizedtesting project. Most of these tests work with Elassandra to ensure compatibility between Elasticsearch and Elassandra.

Testing environment

By default, JUnit creates one instance of each test class and executes each @Test method in parallel, in many threads. Because Cassandra uses many static variables, concurrent testing is not possible, so each test is executed sequentially (using a semaphore to serialize tests) on a single-node Elassandra cluster listening on localhost (see ESSingleNodeTestCase at https://github.com/strapdata/elassandra/blob/master/core/src/test/java/org/elasticsearch/test/ESSingleNodeTestCase.java). Test configuration is located in src/test/resources/conf; data and logs are generated in target/tests/.

Between each test, all indices (and underlying keyspaces and tables) are removed to have idempotent tests and avoid conflicts on index names. The system settings es.synchronous_refresh and es.drop_on_delete_index are set to true in the parent pom.xml.

Finally, the testing framework randomizes the locale settings representing a specific geographical, political, or cultural region, but Apache Cassandra does not support such settings because string manipulations are implemented with the default locale settings (see CASSANDRA-12334). For example, String.format("SELECT %s FROM ...", ...) is computed as String.format(Locale.getDefault(), "SELECT %s FROM ...", ...), producing errors for some locale settings. As a workaround, a javassist byte-code manipulation in the Ant build step adds a Locale.ROOT argument to such method calls in all Cassandra classes.

Elassandra unit test

Elassandra unit tests allow using both the Elasticsearch API and CQL requests, as shown in the following sample.

public class ParentChildTests extends ESSingleNodeTestCase {

    @Test
    public void testCQLParentChildTest() throws Exception {
        process(ConsistencyLevel.ONE, "CREATE KEYSPACE IF NOT EXISTS company3 WITH replication={ 'class':'NetworkTopologyStrategy', 'DC1':'1' }");
        process(ConsistencyLevel.ONE, "CREATE TABLE company3.employee (branch text, \"_id\" text, name text, dob timestamp, hobby text, primary key ((branch), \"_id\"))");

        assertAcked(client().admin().indices().prepareCreate("company3")
            .addMapping("branch", "{ \"branch\": {} }")
            .addMapping("employee", "{ \"employee\" : { \"discover\" : \".*\", \"_parent\" : { \"type\": \"branch\", \"cql_parent_pk\":\"branch\" } }}")
            .get());
        ensureGreen("company3");

        assertThat(client().prepareIndex("company3", "branch", "london")
            .setSource("{ \"district\": \"London Westminster\", \"city\": \"London\", \"country\": \"UK\" }")
            .get().isCreated(), equalTo(true));
        assertThat(client().prepareIndex("company3", "branch", "liverpool")
            .setSource("{ \"district\": \"Liverpool Central\", \"city\": \"Liverpool\", \"country\": \"UK\" }")
            .get().isCreated(), equalTo(true));
        assertThat(client().prepareIndex("company3", "branch", "paris")
            .setSource("{ \"district\": \"Champs Élysées\", \"city\": \"Paris\", \"country\": \"France\" }")
            .get().isCreated(), equalTo(true));

        process(ConsistencyLevel.ONE, "INSERT INTO company3.employee (branch,\"_id\",name,dob,hobby) VALUES ('london','1','Alice Smith','1970-10-24','hiking')");
        process(ConsistencyLevel.ONE, "INSERT INTO company3.employee (branch,\"_id\",name,dob,hobby) VALUES ('london','2','Bob Robert','1970-10-24','hiking')");

        assertThat(client().prepareSearch().setIndices("company3").setTypes("branch")
            .setQuery(QueryBuilders.hasChildQuery("employee", QueryBuilders.rangeQuery("dob").gte("1970-01-01")))
            .get().getHits().getTotalHits(), equalTo(1L));
        assertThat(client().prepareSearch().setIndices("company3").setTypes("employee")
            .setQuery(QueryBuilders.hasParentQuery("branch", QueryBuilders.matchQuery("country", "UK")))
            .get().getHits().getTotalHits(), equalTo(2L));
    }
}

To run this specific test:

$ mvn test -Pdev -pl com.strapdata.elasticsearch:elasticsearch -Dtests.seed=56E318ABFCECC61 -Dtests.class=org.elassandra.ParentChildTests -Des.logger.level=DEBUG -Dtests.assertion.disabled=false -Dtests.security.manager=false -Dtests.heap.size=1024m -Dtests.locale=de-GR -Dtests.timezone=Etc/UTC

To run all unit tests:

$ mvn test


Breaking changes and limitations

Deleting an index does not delete cassandra data

By default, Cassandra is considered as the primary data storage for Elasticsearch, so deleting an Elasticsearch index does not delete the Cassandra content; the keyspace and tables remain unchanged. If you want Elassandra to behave like Elasticsearch, you can configure your cluster, or only some indices, with the drop_on_delete_index setting, like this:

$curl -XPUT "$NODE:9200/twitter/" -d'{

"settings":{ "index":{ "drop_on_delete_index":true } }

}'

Or to set drop_on delete_index at cluster level :

$curl -XPUT "$NODE:9200/_cluster/settings" -d'{

"persistent":{ "cluster.drop_on_delete_index":true }

}'

Cannot index document with empty mapping

Elassandra cannot index any document for a type having no mapped properties and no underlying clustering key, because Cassandra cannot create a secondary index on the partition key and there is no other indexed column. Example:

curl -XPUT "$NODE:9200/foo/bar/1?pretty" -d '{}'

{
  "_index" : "foo",
  "_type" : "bar",
  "_id" : "1",
  "_version" : 1,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}

The underlying Cassandra table foo.bar has only a primary key column and no secondary index. So, search operations won't return any result.

cqlsh> desc KEYSPACE foo;

CREATE KEYSPACE foo WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '1'} AND durable_writes = true;

CREATE TABLE foo.bar (
    "_id" text PRIMARY KEY
) WITH bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = 'Auto-created by Elassandra'
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

cqlsh> SELECT * FROM foo.bar;

 _id
-----
   1

(1 rows)

To get the same behavior as Elasticsearch, just add a dummy field in your mapping.

Nested or Object types cannot be empty

Because Elasticsearch nested and object types are backed by a Cassandra User Defined Type, they require at least one sub-field.

Document version is meaningless

Elasticsearch's versioning system helps to cope with conflicts, but in a multi-master database like Apache Cassandra, versioning cannot ensure global consistency of compare-and-set operations.

In Elassandra, Elasticsearch version management is disabled by default: the document version is no longer indexed in Lucene files, and the document version is always 1. This simplification improves write throughput and reduces the memory footprint by eliminating the in-memory version cache implemented in the Elasticsearch internal Lucene engine.


If you want to keep the Elasticsearch internal Lucene file format, including a version number for each document, you should create your index with index.version_less_engine set to false, like this:

curl -XPUT "$NODE:9200/twitter/" -d '{
  "settings" : { "index.version_less_engine" : false }
}'

Finally, if you need to avoid conflicts on write operations, you should use Cassandra lightweight transactions (or PAXOS transactions). Such lightweight transactions are also used when updating the Elassandra mapping or when indexing a document with op_type=create, but of course, this comes with a network cost.

Index and type names

Because Cassandra does not support special characters in keyspace and table names, Elassandra automatically replaces dot (.) and dash (-) characters with underscores (_) in index and type names to create the underlying Cassandra keyspaces and tables. When such a modification occurs, Elassandra keeps this change in memory to correctly convert keyspace/table to index/type.

Moreover, Cassandra table names are limited to 48 characters, so Elasticsearch type names are also limited to 48 characters.

Column names

For Elasticsearch, a field mapping is unique in an index. So two columns having the same name, indexed in the same index, should have the same CQL type and share the same Elasticsearch mapping.

Null values

To be able to search for null values, Elasticsearch can replace null with a default value (see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/null-value.html). In Elasticsearch, an empty array is not a null value, whereas in Cassandra, an empty array is stored as null and replaced by the default null value at index time.

Elasticsearch unsupported features

• The Tribe node, which allows querying multiple Elasticsearch clusters, is not currently supported by Elassandra.

• Elasticsearch snapshot and restore operations are disabled (see backup and restore in Operations).

Cassandra limitations

• Elassandra only supports the murmur3 partitioner.

• The thrift protocol is supported only for read operations.

• Elassandra synchronously indexes rows into Elasticsearch. This may increase the write duration, particularly when indexing complex documents like GeoShape, so the Cassandra write_request_timeout_in_ms is set to 5 seconds (the Cassandra default is 2000ms, see Cassandra config).

• In order to avoid concurrent mapping or persistent cluster settings updates, Elassandra plays a PAXOS transaction that requires QUORUM available nodes for the keyspace elastic_admin to succeed. So it is recommended to have at least 3 nodes in 3 distinct racks (a 2-node datacenter won't accept any mapping update when a node is unavailable).

• A CQL3 TRUNCATE on a Cassandra table deletes all associated Elasticsearch documents by playing a delete_by_query where _type = <table_name>. Of course, such a delete_by_query comes with a performance cost.

