Elassandra Documentation
Release v2.4.2-10
Vincent Royer
Sep 07, 2017
Contents

Install Elassandra from the APT repository
Start an elassandra server instance
Connect to Cassandra from an application in another Docker container
Container shell access and viewing Cassandra logs
Multi datacenter configuration
Mapping change with zero downtime
Dynamic mapping of Cassandra Map
Dynamic Template with Dynamic Mapping
Indexing Cassandra static columns
4.10 Elassandra as a JSON-REST Gateway
4.11 Check Cassandra consistency with elasticsearch
Create, delete and rebuild index
Restoring to a different cluster
5.11 How to change the elassandra cluster name
Integration with an existing cassandra cluster
Create a new elassandra datacenter
Installing an Elasticsearch plugins
Running Kibana with Elassandra
JDBC Driver sql4es + Elassandra
8 Breaking changes and limitations
Deleting an index does not delete cassandra data
Cannot index document with empty mapping
Nested or Object types cannot be empty
Document version is meaningless
Elasticsearch unsupported feature
Elassandra tightly integrates Elasticsearch within Cassandra.
Contents:
Chapter 1. Architecture
Elassandra tightly integrates Elasticsearch within Cassandra as a secondary index, allowing near-real-time search with all existing Elasticsearch APIs, plugins and tools like Kibana.
When you index a document, the JSON document is stored as a row in a Cassandra table and synchronously indexed in Elasticsearch.
Concepts Mapping

Elasticsearch          | Cassandra          | Description
-----------------------|--------------------|-------------------------------------------------------------------
Cluster                | Virtual Datacenter | All nodes of a datacenter form an Elasticsearch cluster
Shard                  | Node               | Each Cassandra node is an Elasticsearch shard for each indexed keyspace
Index                  | Keyspace           | An Elasticsearch index is backed by a keyspace
Type                   | Table              | Each Elasticsearch document type is backed by a Cassandra table
Document               | Row                | An Elasticsearch document is backed by a Cassandra row
Field                  | Cell               | Each indexed field is backed by a Cassandra cell (row x column)
Object or nested field | User Defined Type  | A User Defined Type is automatically created to store Elasticsearch objects

From an Elasticsearch perspective:
• An Elasticsearch cluster is a Cassandra virtual datacenter.
• Every Elassandra node is a master primary data node.
• Each node indexes only local data and acts as a primary local shard.
• Elasticsearch data is no longer stored in Lucene indices, but in Cassandra tables.
  – An Elasticsearch index is mapped to a Cassandra keyspace,
  – an Elasticsearch document type is mapped to a Cassandra table,
  – the Elasticsearch document _id is a string representation of the Cassandra primary key.
• Elasticsearch discovery now relies on the Cassandra gossip protocol. When a node joins or leaves the cluster, or when a schema change occurs, each node updates the node statuses and its local routing table.
• The Elasticsearch gateway now stores metadata in a Cassandra table and in the Cassandra schema. Metadata updates are applied sequentially through a Cassandra lightweight transaction. The metadata UUID is the Cassandra hostId of the node that last modified it.
• The Elasticsearch REST and Java APIs remain unchanged.
• Logging is now based on Logback, as in Cassandra.
From a Cassandra perspective:
• Columns with an ElasticSecondaryIndex are indexed in Elasticsearch.
• By default, Elasticsearch document fields are multivalued, so every field is backed by a list. A single-valued document field can be mapped to a basic type by setting cql_collection: singleton in your type mapping. See Elasticsearch document mapping for details.
• Nested documents are stored using a Cassandra User Defined Type or map.
• Elasticsearch provides a JSON-REST API to Cassandra, see Elasticsearch API.
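For example, a type mapping could pin a field to a basic CQL type like this (a sketch; the type and field names are illustrative):

```json
{
  "user" : {
    "properties" : {
      "name" : {
        "type" : "string",
        "cql_collection" : "singleton"
      }
    }
  }
}
```

Without cql_collection: singleton, the name field would be backed by a list collection rather than a plain text column.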
Durability
All writes to a Cassandra node are recorded both in a memory table and in a commit log. When a memtable flush occurs, it flushes the Elasticsearch secondary index to disk. When restarting after a failure, Cassandra replays the commitlogs and re-indexes Elasticsearch documents that were not flushed by Elasticsearch. This is the reason why the Elasticsearch translog is disabled in Elassandra.
Shards and Replica
Unlike Elasticsearch, sharding depends on the number of nodes in the datacenter, and the number of replicas is defined by your keyspace Replication Factor. The Elasticsearch numberOfShards is just information about the number of nodes.
• When adding a new Elassandra node, the Cassandra bootstrap process gets some token ranges from the existing ring and pulls the corresponding data. Pulled data is automatically indexed, and each node updates its routing table to distribute search requests according to the ring topology.
• When updating the Replication Factor, you will need to run nodetool repair <keyspace> on the new node to effectively copy and index the data.
• If a node becomes unavailable, the routing table is updated on all nodes in order to route search requests to available nodes. The current default strategy routes search requests to the primary token ranges' owner first, then to replica nodes if available. If some token ranges become unreachable, the cluster status is red; otherwise the cluster status is yellow.
After starting a new Elassandra node, data and elasticsearch indices are distributed on 2 nodes (with no replication).
nodetool status twitter
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  156,9 KB   2       70,3%             74ae1629-0149-4e65-b790-cd25c7406675  RAC1
UN  127.0.0.2  129,01 KB  2       29,7%             e5df0651-8608-4590-92e1-4e523e4582b9  RAC2
The routing table now distributes search requests over the 2 Elassandra nodes, covering 100% of the ring.
curl -XGET 'http://localhost:9200/_cluster/state/?pretty=true'
{
  "cluster_name" : "Test Cluster",
  "version" : 12,
  "master_node" : "74ae1629-0149-4e65-b790-cd25c7406675",
  "blocks" : { },
  "nodes" : {
    "74ae1629-0149-4e65-b790-cd25c7406675" : {
      "name" : "localhost",
      "status" : "ALIVE",
      "transport_address" : "inet[localhost/127.0.0.1:9300]",
      "attributes" : {
        "data" : "true",
        "rack" : "RAC1",
        "data_center" : "DC1",
        "master" : "true"
      }
    },
    "e5df0651-8608-4590-92e1-4e523e4582b9" : {
      "name" : "127.0.0.2",
      "status" : "ALIVE",
      "transport_address" : "inet[127.0.0.2/127.0.0.2:9300]",
      "attributes" : {
        "data" : "true",
        "rack" : "RAC2",
        "data_center" : "DC1",
        "master" : "true"
      }
    }
  },
  "metadata" : {
    "version" : 1,
    "uuid" : "e5df0651-8608-4590-92e1-4e523e4582b9",
    "templates" : { },
    "indices" : {
      "twitter" : {
        "state" : "open",
        "settings" : {
          "index" : {
            "creation_date" : "1440659762584",
            "uuid" : "fyqNMDfnRgeRE9KgTqxFWw",
            "number_of_replicas" : "1",
            "number_of_shards" : "1",
            "version" : {
              "created" : "1050299"
            }
          }
        },
        "mappings" : {
          "user" : {
            "properties" : {
              "name" : {
                "type" : "string"
              }
            }
          },
          "tweet" : {
            "properties" : {
              "message" : {
                "type" : "string"
              },
              "postDate" : {
                "format" : "dateOptionalTime",
                "type" : "date"
              },
              "user" : {
                "type" : "string"
              },
              "_token" : {
                "type" : "long"
              }
            }
          }
        },
        "aliases" : [ ]
      }
    }
  },
  "routing_table" : {
    "indices" : {
      "twitter" : {
        "shards" : {
          "0" : [ {
            "state" : "STARTED",
            "primary" : true,
            "node" : "74ae1629-0149-4e65-b790-cd25c7406675",
            "token_ranges" : [ "(-8879901672822909480,4094576844402756550]" ],
            "shard" : 0,
            "index" : "twitter"
          } ],
          "1" : [ {
            "state" : "STARTED",
            "primary" : true,
            "node" : "e5df0651-8608-4590-92e1-4e523e4582b9",
            "token_ranges" : [ "(-9223372036854775808,-8879901672822909480]",
                               "(4094576844402756550,9223372036854775807]" ],
            "shard" : 1,
            "index" : "twitter"
          } ]
        }
      }
    }
  },
  "routing_nodes" : {
    "unassigned" : [ ],
    "nodes" : {
      "e5df0651-8608-4590-92e1-4e523e4582b9" : [ {
        "state" : "STARTED",
        "primary" : true,
        "node" : "e5df0651-8608-4590-92e1-4e523e4582b9",
        "token_ranges" : [ "(-9223372036854775808,-8879901672822909480]",
                           "(4094576844402756550,9223372036854775807]" ],
        "shard" : 1,
        "index" : "twitter"
      } ],
      "74ae1629-0149-4e65-b790-cd25c7406675" : [ {
        "state" : "STARTED",
        "primary" : true,
        "node" : "74ae1629-0149-4e65-b790-cd25c7406675",
        "token_ranges" : [ "(-8879901672822909480,4094576844402756550]" ],
        "shard" : 0,
        "index" : "twitter"
      } ]
    }
  },
  "allocations" : [ ]
}
Internally, each node broadcasts its local shard status in the gossip application state X1 ("twitter":STARTED) and its current metadata UUID/version in the application state X2.
nodetool gossipinfo
127.0.0.2/127.0.0.2
  generation:1440659838
  heartbeat:396197
  DC:DC1
  NET_VERSION:8
  SEVERITY:1.3877787807814457E-17
  X1:{"twitter":3}
  X2:e5df0651-8608-4590-92e1-4e523e4582b9/1
  RELEASE_VERSION:2.1.8
  RACK:RAC2
  STATUS:NORMAL,-8879901672822909480
  SCHEMA:ce6febf4-571d-30d2-afeb-b8db9d578fd1
  INTERNAL_IP:127.0.0.2
  RPC_ADDRESS:127.0.0.2
  LOAD:131314.0
  HOST_ID:e5df0651-8608-4590-92e1-4e523e4582b9
localhost/127.0.0.1
  generation:1440659739
  heartbeat:396550
  DC:DC1
  NET_VERSION:8
  SEVERITY:2.220446049250313E-16
  X1:{"twitter":3}
  X2:e5df0651-8608-4590-92e1-4e523e4582b9/1
  RELEASE_VERSION:2.1.8
  RACK:RAC1
  STATUS:NORMAL,4318747828927358946
  SCHEMA:ce6febf4-571d-30d2-afeb-b8db9d578fd1
  RPC_ADDRESS:127.0.0.1
  INTERNAL_IP:127.0.0.1
  LOAD:154824.0
  HOST_ID:74ae1629-0149-4e65-b790-cd25c7406675
Write path
Write operations (Elasticsearch index, update, delete and bulk operations) are converted to CQL write requests managed by the coordinator node. The Elasticsearch document _id is converted to the underlying primary key, and the corresponding row is stored on many nodes according to the Cassandra replication factor. Then, on each node hosting this row, an Elasticsearch document is indexed through a Cassandra custom secondary index. Every document includes a _token field used when searching.
At index time, each node directly generates Lucene fields without any JSON parsing overhead, and Lucene files do not contain any version number, because version-based concurrency management becomes meaningless in a multi-master database like Cassandra.
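To illustrate how a document _id can stand in for a primary key, here is a minimal sketch. The single-value and JSON-array conventions shown are illustrative assumptions, not the exact Elassandra encoding:

```python
import json

def primary_key_to_es_id(pk_values):
    """Sketch: derive an Elasticsearch _id string from Cassandra
    primary-key column values. A single-column key maps to its plain
    string value; a compound key to a compact JSON array string."""
    if len(pk_values) == 1:
        return str(pk_values[0])
    return json.dumps(pk_values, separators=(",", ":"))

print(primary_key_to_es_id(["user1"]))      # user1
print(primary_key_to_es_id(["user1", 42]))  # ["user1",42]
```

The string form matters because the _id must round-trip: the coordinator must be able to rebuild the full primary key from the _id alone to issue the CQL write.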
Search path
A search request is processed in two phases. In the query phase, the coordinator node adds a token_ranges filter to the query and broadcasts the search request to all nodes. This token_ranges filter covers the whole Cassandra ring and avoids duplicate results. Then, in the fetch phase, the coordinator fetches the required fields by issuing a CQL request to the underlying Cassandra table, and builds the final JSON response.
Adding a token_ranges filter to the original Elasticsearch query introduces an overhead in the query phase, and the more vnodes you have, the more this overhead increases with many OR clauses. To mitigate this overhead, Elassandra provides a random search strategy requesting the minimum number of nodes needed to cover the whole Cassandra ring. For example, if you have a datacenter with four nodes and a replication factor of two, it will request only two nodes with simplified token_ranges filters (adjacent token ranges are automatically merged).
Additionally, as these token_ranges filters only change when the datacenter topology changes (for example when a node is down or when adding a new node), Elassandra introduces a token_range bitset cache for each Lucene segment. With this cache, out-of-range documents are seen as deleted documents at the Lucene segment layer for subsequent queries using the same token_range filter. This drastically improves search performance.
Finally, the CQL fetch overhead can be mitigated by using Cassandra key and row caching, possibly using the off-heap caching features of Cassandra.
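The merging of adjacent token ranges mentioned above can be sketched as follows (illustrative code, not the actual Elassandra implementation):

```python
def merge_adjacent_ranges(ranges):
    """Merge (start, end] token ranges that share a boundary, so a node's
    token_ranges filter needs fewer OR clauses (a sketch of the idea)."""
    merged = []
    for start, end in sorted(ranges):
        if merged and merged[-1][1] == start:
            # Previous range ends exactly where this one starts: extend it.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

# Two ranges sharing the boundary 0 collapse into one filter clause.
print(merge_adjacent_ranges([(-100, 0), (0, 50), (200, 300)]))
# [(-100, 50), (200, 300)]
```

Fewer clauses in the token_ranges filter means less query-rewriting work per node, which is exactly what the random search strategy exploits.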
Chapter 2. Installation
There are several ways to install Elassandra: from the tarball, from the DEB or RPM packages, from the Docker image, or by building from source.
Elassandra is based on Cassandra and Elasticsearch, so it will be easier if you're already familiar with one of these technologies.
Tarball
Elassandra requires at least Java 8. Oracle JDK is the recommended version, but OpenJDK should work as well. You can check which version is installed on your computer:

$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
Once Java is correctly installed, download the Elassandra tarball:

wget https://github.com/strapdata/elassandra/releases/download/v2.4.2-10/elassandra-2.4.2.tar.gz

Then extract its content:

tar -xzf elassandra-2.4.2.tar.gz

Go to the extracted directory:

cd elassandra-2.4.2

If needed, configure conf/cassandra.yaml (cluster name, listen address, snitch, ...), then start Elassandra:

bin/cassandra -f -e

This starts an Elassandra instance in the foreground, with Elasticsearch enabled. Afterwards your node is reachable on localhost on ports 9042 (CQL) and 9200 (HTTP). Keep this terminal open and launch a new one.
To use cqlsh, we first need to install the Cassandra driver for Python. Ensure python and pip are installed, then:

sudo pip install cassandra-driver

Now connect to the node with cqlsh:

bin/cqlsh

You should then be able to type CQL commands. See the CQL reference.
Also, since we started Elassandra with Elasticsearch enabled (the -e option), let's request the REST API:

curl -XGET http://localhost:9200/
You should get something like:

{
  "name" : "127.0.0.1",
  "cluster_name" : "Test Cluster",
  "cluster_uuid" : "7cb65cea-09c1-4d6a-a17a-24efb9eb7d2b",
  "version" : {
    "number" : "2.4.2",
    "build_hash" : "b0b4cb025cb8aa74538124a30a00b137419983a3",
    "build_timestamp" : "2017-04-19T13:11:11Z",
    "build_snapshot" : true,
    "lucene_version" : "5.5.2"
  },
  "tagline" : "You Know, for Search"
}
You're now ready to play with Elassandra. For instance, try to index a document with the Elasticsearch API, then from cqlsh look at the keyspace/table/row that was automatically created. Cassandra now benefits from dynamic mapping!
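A minimal round-trip might look like this (the twitter index, user type, and field values here are illustrative; this assumes a running local node):

```
# Index a document through the Elasticsearch API:
curl -XPUT 'http://localhost:9200/twitter/user/1' -d '{"name": "Paul"}'

# Then, from cqlsh, look at what was created automatically:
SELECT * FROM twitter.user;
```

The keyspace twitter and the table user are created on the fly, and the row's primary key is derived from the document _id.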
In a production environment, it's better to adjust some system settings, such as disabling swap. This guide shows you how to. On Linux, consider installing jemalloc.
DEB package
Our packages are hosted on packagecloud.io. Elassandra can be installed from an APT repository.
Note: Elassandra requires Java 8 to be installed.
Import the GPG Key
Download and install the public signing key:

curl -L https://packagecloud.io/elassandra/latest/gpgkey | sudo apt-key add -
Install Elassandra from the APT repository
Ensure apt is able to use https:

sudo apt-get install apt-transport-https

Add the Elassandra repository to your source list:

echo "deb https://packagecloud.io/elassandra/latest/debian jessie main" | sudo tee -a /etc/apt/sources.list.d/elassandra.list

Update the apt cache and install Elassandra:

sudo apt-get update
sudo apt-get install elassandra
Warning: You should uninstall Cassandra prior to installing Elassandra because the two packages conflict.
Install extra tools
Also install Python, pip, and cassandra-driver:

sudo apt-get update && sudo apt-get install python python-pip
sudo pip install cassandra-driver
Usage
This package installs a systemd service named cassandra, but does not start or enable it. For those who don't have systemd, an init.d script is also provided.
To start Elassandra using systemd, run:

sudo systemctl start cassandra
Files locations:
• /etc/cassandra : configurations
• /var/lib/cassandra: database storage
• /var/log/cassandra: logs
• /usr/share/cassandra: plugins, modules, cassandra.in.sh, lib...
RPM package
Our packages are hosted on packagecloud.io. Elassandra can be installed from an RPM repository.
Note: Elassandra requires Java 8 to be installed.
Setup the RPM repository
Create a file called elassandra.repo in the directory /etc/yum.repos.d/ (RedHat) or /etc/zypp/repos.d/ (OpenSuSE), containing:

[elassandra_latest]
name=Elassandra repository
baseurl=https://packagecloud.io/elassandra/latest/el/7/$basearch
type=rpm-md
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://packagecloud.io/elassandra/latest/gpgkey
autorefresh=1
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
Install Elassandra
Using yum:

sudo yum install elassandra

Warning: You should uninstall Cassandra prior to installing Elassandra because the two packages conflict.
Install extra tools
Also install Python, pip, and cassandra-driver:

sudo yum install python python-pip
sudo pip install cassandra-driver
Usage
This package installs a systemd service named cassandra, but does not start or enable it. For those who don't have systemd, an init.d script is also provided.
To start Elassandra using systemd, run:

sudo systemctl start cassandra
Files locations:
• /etc/cassandra : configurations
• /var/lib/cassandra: database storage
• /var/log/cassandra: logs
• /usr/share/cassandra: plugins, modules, cassandra.in.sh, lib...
Docker image
We provide an image on Docker Hub:

docker pull strapdata/elassandra

This image is based on the official Cassandra image, whose documentation is valid for Elassandra as well.
Start an elassandra server instance
Starting an Elassandra instance is simple:

docker run --name some-elassandra -d strapdata/elassandra:tag

...where some-elassandra is the name you want to assign to your container and tag is the tag specifying the Elassandra version you want. The default is latest.
Connect to Cassandra from an application in another Docker container
This image exposes the standard Cassandra ports and the Elasticsearch HTTP port (9200), so container linking makes the Elassandra instance available to other application containers. Start your application container like this in order to link it to the Elassandra container:

docker run --name some-app --link some-elassandra:elassandra -d app-that-uses-elassandra
Make a cluster
Using the environment variables documented below, there are two cluster scenarios: instances on the same machine and instances on separate machines. For the same machine, start the first instance as described above. To start other instances, just tell each new node where the first one is:

docker run --name some-elassandra2 -d -e CASSANDRA_SEEDS="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-elassandra)" strapdata/elassandra

...where some-elassandra is the name of your original Elassandra container, taking advantage of docker inspect to get the IP address of the other container.
Or you may use the docker run --link option to tell the new node where the first one is:

docker run --name some-elassandra2 -d --link some-elassandra:elassandra strapdata/elassandra
For separate machines (i.e., two VMs on a cloud provider), you need to tell Elassandra what IP address to advertise to the other nodes, since the address of the container is behind the docker bridge.
Assuming the first machine's IP address is 10.42.42.42 and the second's is 10.43.43.43, start the first with the gossip port exposed:

docker run --name some-elassandra -d -e CASSANDRA_BROADCAST_ADDRESS=10.42.42.42 -p 7000:7000 strapdata/elassandra

Then start an Elassandra container on the second machine, with the exposed gossip port and the seed pointing to the first machine:

docker run --name some-elassandra -d -e CASSANDRA_BROADCAST_ADDRESS=10.43.43.43 -p 7000:7000 -e CASSANDRA_SEEDS=10.42.42.42 strapdata/elassandra
Container shell access and viewing Cassandra logs
The docker exec command allows you to run commands inside a Docker container. The following command line will give you a bash shell inside your elassandra container:
$ docker exec -it some-elassandra bash
The Cassandra Server log is available through Docker’s container log:
$ docker logs some-elassandra
Environment Variables
When you start the Elassandra image, you can adjust the configuration of the Elassandra instance by passing one or more environment variables on the docker run command line. We have already seen some of them.

CASSANDRA_LISTEN_ADDRESS
  This variable controls which IP address to listen on for incoming connections. The default is the IP address of the container as it starts. This default should work in most use cases.
CASSANDRA_BROADCAST_ADDRESS
  This variable controls which IP address to advertise to other nodes. It will set the broadcast_address and broadcast_rpc_address options in cassandra.yaml.
CASSANDRA_RPC_ADDRESS
  This variable controls which address to bind the thrift rpc server to. If you do not specify an address, the wildcard address (0.0.0.0) will be used. It will set the rpc_address option in cassandra.yaml.
CASSANDRA_START_RPC
  This variable controls whether the thrift rpc server is started. It will set the start_rpc option in cassandra.yaml. As Elasticsearch uses this port in Elassandra, it is set ON by default.
CASSANDRA_SEEDS
  This variable is the comma-separated list of IP addresses used by gossip for bootstrapping new nodes joining a cluster. It will set the seeds value of the seed_provider option in cassandra.yaml. The CASSANDRA_BROADCAST_ADDRESS will be added to the seeds passed in, so that the server will talk to itself as well.
CASSANDRA_CLUSTER_NAME
  This variable sets the name of the cluster and must be the same for all nodes in the cluster. It will set the cluster_name option of cassandra.yaml.
CASSANDRA_NUM_TOKENS
  This variable sets the number of tokens for this node. It will set the num_tokens option of cassandra.yaml.
CASSANDRA_DC
  This variable sets the datacenter name of this node. It will set the dc option of cassandra-rackdc.properties.
CASSANDRA_RACK
  This variable sets the rack name of this node. It will set the rack option of cassandra-rackdc.properties.
CASSANDRA_ENDPOINT_SNITCH
  This variable sets the snitch implementation this node will use. It will set the endpoint_snitch option of cassandra.yaml.
CASSANDRA_DAEMON
  The Cassandra entry-point class: org.apache.cassandra.service.ElassandraDaemon to start with Elasticsearch enabled (the default), org.apache.cassandra.service.CassandraDaemon otherwise.
Build from source
Requirements:
• Oracle JDK 1.8 or OpenJDK 8
• maven >= 3.5
Clone the Elassandra repository and the Cassandra sub-module:

git clone --recursive git@github.com:strapdata/elassandra.git
cd elassandra

Elassandra uses Maven for its build system. Simply run:

mvn clean package -DskipTests

It's gonna take a while; you might go for a cup of tea.
If everything succeeds, tarballs will be built in:

distribution/tar/target/release/elassandra-2.4.2-SNAPSHOT.tar.gz
distribution/zip/target/release/elassandra-2.4.2-SNAPSHOT.zip

Then follow the tarball installation instructions.
Chapter 3. Configuration
Directory Layout
Elassandra merges the Cassandra and Elasticsearch directories as follows:
• conf : Cassandra configuration directory + the elasticsearch.yml default configuration file.
• bin : Cassandra scripts + the elasticsearch plugin script.
• lib : Cassandra and Elasticsearch JAR dependencies.
• pylib : cqlsh python library.
• tools : Cassandra tools.
• plugins : Elasticsearch plugins installation directory.
• modules : Elasticsearch modules directory.
• work : Elasticsearch working directory.
Elasticsearch paths are set according to the following environment variables and system properties:
• path.home : the CASSANDRA_HOME environment variable, the cassandra.home system property, or the current directory.
• path.conf : the CASSANDRA_CONF environment variable, path.conf, or path.home.
• path.data : cassandra_storagedir/data/elasticsearch.data, the path.data system property, or path.home/data/elasticsearch.data.
Configuration
The Elasticsearch configuration relies on the Cassandra configuration file conf/cassandra.yaml for the following parameters:

Cassandra             | Elasticsearch                                | Description
----------------------|----------------------------------------------|----------------------------------------------------------
cluster_name          | cluster.name                                 | The Elasticsearch cluster name is mapped to the Cassandra cluster name.
rpc_address           | network.host, transport.host                 | The Elasticsearch network and transport bind addresses are set to the Cassandra rpc listen address.
broadcast_rpc_address | network.publish_host, transport.publish_host | The Elasticsearch network and transport publish addresses are set to the Cassandra broadcast rpc address.

Node roles (master, primary, data) are automatically set by Elassandra; a standard configuration should only set cluster_name and rpc_address in conf/cassandra.yaml.
Caution: If you use the GossipingPropertyFileSnitch to configure your Cassandra datacenter and rack properties in conf/cassandra-rackdc.properties, keep in mind that this snitch falls back to the PropertyFileSnitch when gossip is not enabled. So, when restarting the first node, dead nodes can appear in the default DC and rack configured in conf/cassandra-topology.properties. This also breaks the replica placement strategy and the computation of the Elasticsearch routing tables. It is therefore strongly recommended to set the same default rack and datacenter in both conf/cassandra-topology.properties and conf/cassandra-rackdc.properties.
Logging configuration
The cassandra logs in logs/system.log include the elasticsearch logs, according to your conf/logback.conf settings. See the cassandra logging configuration.
Per-keyspace (or per-table) logging levels can be configured using the logger name org.elassandra.index.ExtendedElasticSecondaryIndex.<keyspace>.<table>.
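For instance, assuming the standard Cassandra logback XML configuration, a per-table logging level could be raised with a logger declaration like this (the keyspace and table names are placeholders):

```xml
<logger name="org.elassandra.index.ExtendedElasticSecondaryIndex.my_keyspace.my_table" level="DEBUG" />
```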
Multi datacenter configuration
By default, all elassandra datacenters share the same Elasticsearch cluster name and mapping. This mapping is stored in the elastic_admin keyspace.
If you want to manage distinct Elasticsearch clusters inside a cassandra cluster (when indexing different tables in different datacenters), you can set a datacenter.group in conf/elasticsearch.yml; all elassandra datacenters sharing the same datacenter group name will then share the same mapping. These elasticsearch clusters will be named <cluster_name>@<datacenter.group> and the mapping will be stored in a dedicated keyspace.table elastic_admin_<datacenter.group>.metadata.
All elastic_admin[_<datacenter.group>] keyspaces are configured with the NetworkTopologyStrategy (see data replication), where the replication factor is automatically set to the number of nodes in each datacenter. This ensures maximum availability for the elasticsearch metadata. When removing a node from an elassandra datacenter, you should manually decrease the elastic_admin[_<datacenter.group>] replication factor to the new number of nodes.
When a mapping change occurs, Elassandra updates the Elasticsearch metadata in elastic_admin[_<datacenter.group>].metadata within a lightweight transaction to avoid conflicts with concurrent updates. This transaction requires QUORUM available nodes, that is, more than half the nodes of one or more datacenters depending on your datacenter.group configuration. It also involves cross-datacenter network latency for each mapping update.
Tip: Cassandra cross-datacenter writes are not sent directly to each replica; instead, they are sent to a single replica with a parameter telling that replica to forward the write to the other replicas in that datacenter; those replicas respond directly to the original coordinator. This reduces network traffic between datacenters when there are many replicas.
Elassandra Settings
Most of the settings can be set at various levels:
• As a system property; the default property is es.<property_name>.
• At cluster level; the default setting is cluster.default_<property_name>.
• At index level; the setting is index.<property_name>.
• At table level; the setting is configured as _meta : { "<property_name>" : <value> } for a document type.
For example, drop_on_delete_index can be:
• set as a system property es.drop_on_delete_index for all created indices,
• set at the cluster level with the cluster.default_drop_on_delete_index dynamic setting,
• set at the index level with the index.drop_on_delete_index dynamic index setting,
• set at the Elasticsearch document type level with _meta : { "drop_on_delete_index" : true } in the document type mapping.
When a setting is dynamic, this is relevant only for the index and cluster setting levels; system and document type setting levels are immutable.
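As a sketch, assuming a node listening on localhost:9200 and an index named twitter, the cluster-level and index-level forms of drop_on_delete_index could be set through the standard Elasticsearch 2.x settings APIs:

```shell
# Cluster level: dynamic default for all indices
curl -XPUT "http://localhost:9200/_cluster/settings" -d '{
    "persistent" : { "cluster.default_drop_on_delete_index" : true }
}'

# Index level: dynamic setting for a single index
curl -XPUT "http://localhost:9200/twitter/_settings" -d '{
    "index.drop_on_delete_index" : true
}'
```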
Each setting is listed below with its update levels, its default value, and a description:
• secondary_index_class (cluster): Secondary index implementation class. This class must implement the org.apache.cassandra.index.Index interface.
• search_strategy_class (cluster): Search strategy class. Available strategies are:
  • PrimaryFirstSearchStrategy distributes search requests to all available nodes covering the whole cassandra ring.
  • RandomSearchStrategy distributes search requests to a subset of available nodes. This improves search performance when RF > 1.
• partition_function_class (cluster): Partition function implementation class. Available implementations are:
  • MessageFormatPartitionFunction, based on the java MessageFormat.format(),
  • StringPartitionFunction, based on the java String.format().
• version_less_engine (cluster, system; default true): If true, use the optimized lucene VersionLessEngine (which no longer manages any document version); otherwise, use the standard Elasticsearch Engine.
• mapping_update_timeout (cluster, system; default 30s): Dynamic mapping update timeout.
• include_node_id (type, index, cluster, system; default false): If true, indexes the cassandra hostId in the _node field.
• synchronous_refresh (type, index, cluster, system; default false): If true, synchronously refreshes the elasticsearch index on each index update.
• drop_on_delete_index (type, index, cluster, system; default false): If true, drops the underlying cassandra tables and keyspace when deleting an index, thus emulating the Elasticsearch behaviour.
• index_on_compaction (type, index, cluster, system; default false): If true, documents modified during compaction of Cassandra SSTables are indexed (removed columns or rows involve a read to reindex). This allows documents to be updated when rows or columns expire, but comes with a performance cost for both compactions and subsequent search requests because it generates lucene tombstones.
• snapshot_with_sstable (type, index, cluster, system; default false): If true, snapshots the lucene files when snapshotting an SSTable.
• token_ranges_bitset_cache (index, cluster, system; default false): If true, caches the token_range filter result for each lucene segment.
• token_ranges_query_expire (system): Defines how long a token_ranges filter query is cached in memory. When such a query is removed from the cache, the associated cached token_ranges bitsets are also removed for all lucene segments.
• precision_step (default 6): Sets the lucene numeric precision step; see Lucene Numeric Range Query.
• index_static_document (type, index, cluster, system; default false): If true, indexes static documents (elasticsearch documents containing only static and partition key columns).
• index_static_only (type, index, cluster, system; default false): If true and index_static_document is true, indexes a document containing only the static and partition key columns.
• index_static_columns (type, index, cluster, system; default false): If true and index_static_only is false, indexes static columns in the elasticsearch documents; otherwise, static columns are ignored.
Sizing and tuning
Basically, Elassandra requires more CPU than standalone Cassandra or Elasticsearch, and Elassandra write throughput should be about half the cassandra write throughput if you index all columns. If you only index a subset of columns, performance will be better.
Design recommendations :
• Increase the number of Elassandra nodes or use a partitioned index to keep shard size below 50Gb.
• Avoid huge wide rows; the write-lock on a wide row can dramatically affect write performance.
• Choose the right compaction strategy to fit your workload (See this blog post by Justin Cameron)
System recommendations :
• Turn swapping off.
• Configure less than half the total memory of your server as heap, up to 30.5Gb. The minimum recommended DRAM for production deployments is 32Gb. If you are not aggregating on analyzed string fields, you can probably use less heap memory to improve the file system cache used by doc values (see this excellent blog post by Chris Earle).
• Set -Xms to the same value as -Xmx.
• Ensure JNA and jemalloc are correctly installed and enabled.
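For example, the heap could be pinned explicitly in conf/cassandra-env.sh (the sizes below are illustrative for a 64Gb server; Cassandra derives both -Xms and -Xmx from MAX_HEAP_SIZE, satisfying the recommendation above):

```shell
# conf/cassandra-env.sh -- illustrative values
MAX_HEAP_SIZE="30G"   # stays below the ~30.5Gb compressed-oops threshold
HEAP_NEWSIZE="800M"
```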
Write performances
• By default, Elasticsearch analyzes the input data of all fields in a special _all field. If you don’t need it, disable it.
• By default, Elasticsearch shards are refreshed every second, making new documents visible for search within a second. If you don’t need this, increase the refresh interval to more than a second, or even turn it off temporarily by setting the refresh interval to -1.
• Use the optimized version-less Lucene engine (the default) to reduce index size.
• Disable index_on_compaction (default is false) to avoid the Lucene segments merge overhead when compacting SSTables.
• Index partitioning may increase write throughput by writing to several Elasticsearch indices in parallel, but choose an efficient partition function implementation. For example, String.format() is much faster than MessageFormat.format().
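For instance, the _all field and the refresh interval mentioned above can be tuned through the standard Elasticsearch APIs (the index and type names here are placeholders):

```shell
# Disable the _all field when creating the type mapping
curl -XPUT "http://localhost:9200/my_keyspace/_mapping/my_table" -d '{
    "my_table" : { "discover" : ".*", "_all" : { "enabled" : false } }
}'

# Raise the refresh interval, or set it to -1 during a bulk load
curl -XPUT "http://localhost:9200/my_keyspace/_settings" -d '{
    "index" : { "refresh_interval" : "30s" }
}'
```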
Search performances
• Use 16 to 64 vnodes per node to reduce the complexity of the token_ranges filter.
• Use the random search strategy and increase the Cassandra replication factor to reduce the number of nodes required for a search request.
• Enable the token_ranges_bitset_cache. This cache computes the token ranges filter once per Lucene segment. Check the token range bitset cache statistics to ensure this caching is efficient.
• Enable Cassandra row caching to reduce the overhead introduced by fetching the requested fields from the underlying Cassandra table.
• Enable Cassandra off-heap row caching in your Cassandra configuration.
• When possible, clean lucene tombstones (updated or deleted documents) and reduce the number of Lucene segments by forcing a merge.
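A merge can be forced with the Elasticsearch 2.x force-merge API, for example (the index name is a placeholder):

```shell
# Merge down to a single segment, expunging lucene tombstones
curl -XPOST "http://localhost:9200/my_index/_forcemerge?max_num_segments=1"
```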
Mapping
Basically, an Elasticsearch index is mapped to a cassandra keyspace, and a document type to a cassandra table.
Type mapping
Here is the mapping from Elasticsearch field basic types to CQL3 types:
• string → text
• integer, short, byte → int
• long → bigint
• double → double
• float → float
• boolean → boolean
• binary → blob
• ip → inet (Internet address)
• date → timestamp
• string → uuid, timeuuid (Specific mapping (1))
• geo_point → UDT geo_point or text (Built-In User Defined Type (2))
• geo_shape → text (Requires _source enabled (3))
• object, nested → Custom User Defined Type
1. Existing Cassandra uuid and timeuuid columns are mapped to Elasticsearch string, but such columns cannot be created through the elasticsearch mapping.
2. Existing Cassandra text columns containing a geohash string can be mapped to an Elasticsearch geo_point.
3. Geo shapes require _source to be enabled to store the original JSON document (default is disabled).
These parameters control the cassandra mapping.
• cql_collection (values: list, set or singleton; default list): Controls how a field of type X is mapped to a column of type list<X>, set<X> or X. Default is list because Elasticsearch fields are multivalued.
• cql_struct (values: udt or map; default udt): Controls how an object or nested field is mapped to a User Defined Type or to a cassandra map<text,?>.
• cql_mandatory (values: true or false; default true): Elasticsearch indexes the full document. For partial CQL updates, this controls which fields should be read to index a full document from a row. Default is true, meaning that updates involve reading all missing fields.
• cql_primary_key_order (integer; default -1): Field position in the primary key of the underlying cassandra table. Default is -1, meaning that the field is not part of the cassandra primary key.
• cql_partition_key (values: true or false; default false): When cql_primary_key_order >= 0, specifies whether the field is part of the cassandra partition key. Default is false, meaning that the field is not part of the cassandra partition key.
• cql_udt_name (default <table_name>_<field_name>): Specifies the Cassandra User Defined Type name used to store an object (dots are replaced by underscores).
For more information about cassandra collection types and compound primary keys, see CQL Collections and Compound keys.
Bidirectional mapping
Elassandra supports the Elasticsearch Indices API and automatically creates the underlying cassandra keyspaces and tables. For each Elasticsearch document type, a cassandra table is created to reflect the Elasticsearch mapping. However, deleting an index does not remove the underlying keyspace; it only removes the cassandra secondary indices associated with the mapped columns.
Additionally, with the put mapping parameter discover, Elassandra creates or updates the Elasticsearch mapping for an existing cassandra table. Columns matching the provided regular expression are mapped as Elasticsearch fields. The following command creates the elasticsearch mapping for all columns starting with 'a' of the cassandra table my_keyspace.my_table, and sets a specific analyzer for the column name.
curl -XPUT "http://localhost:9200/my_keyspace/_mapping/my_table" -d '{
    "my_table" : {
        "discover" : "a.*",
        "properties" : {
            "name" : {
                "type" : "string",
                "index" : "analyzed"
            }
        }
    }
}'
By default, all text columns are mapped with "index":"not_analyzed".
Tip:
When creating the first Elasticsearch index for a given cassandra table, elassandra creates a custom CQL secondary index asynchronously for each mapped field once all shards are started. Cassandra then builds the index on all nodes for all existing data. Subsequent CQL inserts or updates are automatically indexed in Elasticsearch.
If you then add a second or more Elasticsearch indices to an already indexed table, existing data are not automatically re-indexed because cassandra has already indexed them. Instead of re-inserting your data into the cassandra table,
you may use the following command to force a cassandra index rebuild. It will re-index your cassandra table to all associated elasticsearch indices:

nodetool rebuild_index -threads <N> <keyspace_name> <table_name> elastic_<table_name>_idx
• column_name is any indexed column (or elasticsearch top-level document field).
• rebuild_index reindexes SSTables from disk, but not from memtables. In order to index the very last inserted documents, run nodetool flush <keyspace_name> before rebuilding your elasticsearch indices.
• When deleting an elasticsearch index, the elasticsearch index files are removed from the data/elasticsearch.data directory, but the cassandra secondary indices remain in the CQL schema until the last associated elasticsearch index is removed. Cassandra acts as the primary data storage, so keyspaces, tables and data are never removed when deleting an elasticsearch index.
Meta-Fields
The meaning of the Elasticsearch meta-fields is slightly different in Elassandra:
• _index is the index name mapped to the underlying cassandra keyspace name (dash [-] and dot[.] are automatically replaced by underscore [_]).
• _type is the document type name mapped to the underlying cassandra table name (dash [-] and dot[.] are automatically replaced by underscore [_]).
• _id is the document ID, a string representation of the primary key of the underlying cassandra table. A single-field primary key is converted to a string; a compound primary key is converted to a JSON array.
• _source is the indexed JSON document.
By default, _source is disabled in Elassandra, meaning that _source is rebuilt from the underlying cassandra columns. If _source is enabled (see Mapping _source field), Elassandra stores documents indexed via the Elasticsearch API in a dedicated Cassandra text column named _source. This allows the original JSON document to be retrieved, for example for GeoShape queries.
• _routing is valued with a string representation of the partition key of the underlying cassandra table. A single partition key is converted to a string; a compound partition key is converted to a JSON array. Specifying _routing on get, index or delete operations is useless, since the partition key is included in _id. On search operations, Elassandra computes the cassandra token associated with _routing for the search type, and reduces the search to the cassandra nodes hosting this token. (WARNING: Without any search type, Elassandra cannot compute the cassandra token and returns an error: all shards failed.)
• _ttl and _timestamp are mapped to the cassandra TTL and WRITETIME. The returned _ttl and _timestamp for a document will be the ones of a regular cassandra column if there is one in the underlying table. Moreover, when indexing a document through the Elasticsearch API, all cassandra cells carry the same WRITETIME and TTL, but this could differ when upserting some cells using CQL.
• _parent is a string representation of the parent document primary key. If the parent document primary key is composite, this is a string representation of the columns defined by cql_parent_pk in the mapping. See Parent-Child Relationship.
• _token is a meta-field introduced by Elassandra, valued with token(<partition_key>).
• _node is a meta-field introduced by Elassandra, valued with the cassandra host id, allowing the datacenter consistency to be checked.
Mapping change with zero downtime
You can map several Elasticsearch indices with different mappings to the same cassandra keyspace. By default, an index is mapped to a keyspace with the same name, but you can specify a target keyspace in your index settings.
For example, you can create a new index twitter2 mapped to the cassandra keyspace twitter and set a mapping for type tweet associated to the existing cassandra table twitter.tweet.
curl -XPUT "http://localhost:9200/twitter2/" -d '{
    "settings" : { "keyspace" : "twitter" },
    "mappings" : {
        "tweet" : {
            "properties" : {
                "message" : { "type" : "string", "index" : "not_analyzed" },
                "post_date" : { "type" : "date", "format" : "yyyy-MM-dd" },
                "user" : { "type" : "string", "index" : "not_analyzed" },
                "size" : { "type" : "long" }
            }
        }
    }
}'
You can set a specific mapping for twitter2 and re-index existing data on each cassandra node with the following command (indices are named elastic_<table_name>_idx).

nodetool rebuild_index [-threads <N>] twitter tweet elastic_tweet_idx

By default, rebuild_index uses only one thread, but Elassandra supports multi-threaded index rebuild with the -threads parameter. The index name is elastic_<table_name>_<column_name>_idx, where column_name is any indexed column name. Once your twitter2 index is ready, set an alias twitter for twitter2 to switch from the old mapping to the new one, and delete the old twitter index.
curl -XPOST "http://localhost:9200/_aliases" -d '{ "actions" : [ { "add" : { "index" : "twitter2", "alias" : "twitter" } } ] }'
curl -XDELETE "http://localhost:9200/twitter"
Partitioned Index
Elasticsearch TTL support is deprecated since Elasticsearch 2.0, and the Elasticsearch TTLService is disabled in Elassandra. Rather than periodically looking for expired documents, Elassandra supports partitioned indices, allowing per-time-frame indices to be managed. Thus, old data can be removed by simply deleting old indices.
A partitioned index also allows more than 2^31 documents to be indexed on a node (2^31 is the lucene maximum number of documents per index).
An index partition function acts as a selector when many indices are associated with a cassandra table. A partition function is defined by 3 or more fields separated by a space character:
• the function name,
• the index name pattern,
• 1 to N document field names.
The target index name is the result of your partition function.
A partition function must implement the java interface org.elassandra.index.PartitionFunction. Two implementation classes are provided:
• StringFormatPartitionFunction (the default), based on the JDK function String.format(Locale locale, <pattern>, <arg1>, ...).
• MessageFormatPartitionFunction, based on the JDK function MessageFormat.format(<pattern>, <arg1>, ...).
Index partition functions are stored in a map, so a given partition function is executed exactly once for all mapped indices. For example, the toYearIndex function generates the target index logs_<year> depending on the value of the date_field for each document (or row).
You can define each per-year index as follows, with the same index.partition_function for all logs_<year>.
All those indices will be mapped to the keyspace logs, and all columns of the table mylog are automatically mapped to the document type mylog.
curl -XPUT "http://localhost:9200/logs_2016" -d '{
    "settings" : {
        "keyspace" : "logs",
        "index.partition_function" : "toYearIndex logs_{0,date,yyyy} date_field",
        "index.partition_function_class" : "MessageFormatPartitionFunction"
    },
    "mappings" : {
        "mylog" : { "discover" : ".*" }
    }
}'
Tip: When creating the first Elasticsearch index for a Cassandra table, Elassandra may create some Cassandra secondary indices. Only the first created secondary index triggers a compaction to index the existing data. So, if you create a partitioned index on a table that already contains data, the index rebuild may start before all partitions are created, and some rows could be ignored if they match a not-yet-created partitioned index. To avoid this situation, create the partitioned indices before injecting data, or rebuild the secondary index entirely.
Tip: The partition function is executed for each indexed document, so if write throughput is a concern, you should choose an efficient implementation class.
To remove an old index:
curl XDELETE "http://localhost:9200/logs_2013"
Cassandra TTL can be used in conjunction with a partitioned index to automatically remove rows during the normal cassandra compaction and repair processes when index_on_compaction is true, but this introduces a lucene merge overhead because documents are re-indexed when compacting. You can also use the DateTieredCompactionStrategy or the TimeWindowCompactionStrategy to improve performance of time-series-like workloads.
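As a sketch, switching a time-series table to TimeWindowCompactionStrategy could look like this (the window unit and size are assumptions to adapt to your retention; TWCS requires a recent Cassandra version):

```cql
ALTER TABLE logs.mylog WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
};
```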
Object and Nested mapping
By default, Elasticsearch Object or nested types are mapped to dynamically created Cassandra User Defined Types .
curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user" : {
        "name" : {
            "first_name" : "Vincent",
            "last_name" : "Royer"
        },
        "uid" : "12345"
    },
    "message" : "This is a tweet!"
}'

curl -XGET 'http://localhost:9200/twitter/tweet/1/_source'
{"message":"This is a tweet!","user":{"uid":["12345"],"name":[{"first_name":["Vincent"],"last_name":["Royer"]}]}}
The resulting cassandra user defined types and table:
cqlsh> describe keyspace twitter;

CREATE TYPE twitter.tweet_user (
    name frozen<list<frozen<tweet_user_name>>>,
    uid frozen<list<text>>
);

CREATE TYPE twitter.tweet_user_name (
    last_name frozen<list<text>>,
    first_name frozen<list<text>>
);

CREATE TABLE twitter.tweet (
    "_id" text PRIMARY KEY,
    message list<text>,
    user list<frozen<tweet_user>>
);

cqlsh> SELECT * FROM twitter.tweet;

 _id | message               | user
-----+-----------------------+-----------------------------------------------------------------------------
   1 | ['This is a tweet!']  | [{name: [{last_name: ['Royer'], first_name: ['Vincent']}], uid: ['12345']}]
Dynamic mapping of Cassandra Map
Nested documents can be mapped to a User Defined Type or to a CQL map. In the following example, the cassandra map is automatically mapped with cql_mandatory:true, so a partial CQL update causes a read of the whole map to re-index the document in the elasticsearch index.
cqlsh> CREATE KEYSPACE IF NOT EXISTS twitter WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'};
cqlsh> CREATE TABLE twitter.user (
    name text,
    attrs map<text,text>,
    PRIMARY KEY (name)
);
cqlsh> INSERT INTO twitter.user (name, attrs) VALUES ('bob', {'email': '[email protected]', 'firstname': 'bob'});
Create the type mapping from the cassandra table and search for the bob entry.
curl -XPUT "http://localhost:9200/twitter/_mapping/user" -d '{ "user" : { "discover" : ".*" }}'
{"acknowledged":true}

curl -XGET 'http://localhost:9200/twitter/_mapping/user?pretty=true'
{
  "twitter" : {
    "mappings" : {
      "user" : {
        "properties" : {
          "attrs" : {
            "type" : "nested",
            "cql_struct" : "map",
            "cql_collection" : "singleton",
            "properties" : {
              "email" : {
                "type" : "string"
              },
              "firstname" : {
                "type" : "string"
              }
            }
          },
          "name" : {
            "type" : "string",
            "cql_collection" : "singleton",
            "cql_partition_key" : true,
            "cql_primary_key_order" : 0
          }
        }
      }
    }
  }
}

curl -XGET "http://localhost:9200/twitter/user/bob?pretty=true"
{
  "_index" : "twitter",
  "_type" : "user",
  "_id" : "bob",
  "_version" : 0,
  "found" : true,
  "_source" : { "name" : "bob", "attrs" : { "email" : "[email protected]", "firstname" : "bob" } }
}
Now insert a new entry in the attrs map column and search for a nested field attrs.city:paris.
cqlsh> UPDATE twitter.user SET attrs = attrs + {'city': 'paris'} WHERE name = 'bob';

curl -XGET "http://localhost:9200/twitter/_search?pretty=true" -d '{
  "query" : {
    "nested" : {
      "path" : "attrs",
      "query" : { "match" : { "attrs.city" : "paris" } }
    }
  }
}'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 2.3862944,
    "hits" : [ {
      "_index" : "twitter",
      "_type" : "user",
      "_id" : "bob",
      "_score" : 2.3862944,
      "_source" : { "attrs" : { "city" : "paris", "email" : "[email protected]", "firstname" : "bob" }, "name" : "bob" }
    } ]
  }
}
Dynamic Template with Dynamic Mapping
Dynamic templates can be used when creating a dynamic field from a Cassandra map.
"mappings" : {
    "event_test" : {
        "dynamic_templates" : [ {
            "strings_template" : {
                "match" : "strings.*",
                "mapping" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                }
            }
        } ],
        "properties" : {
            "id" : {
                "type" : "string",
                "index" : "not_analyzed",
                "cql_collection" : "singleton",
                "cql_partition_key" : true,
                "cql_primary_key_order" : 0
            },
            "strings" : {
                "type" : "object",
                "cql_struct" : "map",
                "cql_collection" : "singleton"
            }
        }
    }
}
Then, a new entry key1 in the underlying cassandra map will have the following mapping:
"mappings" : {
    "event_test" : {
        "dynamic_templates" : [ {
            "strings_template" : {
                "match" : "strings.*",
                "mapping" : {
                    "type" : "string",
                    "index" : "not_analyzed",
                    "doc_values" : true
                }
            }
        } ],
        "properties" : {
            "strings" : {
                "type" : "nested",
                "cql_struct" : "map",
                "cql_collection" : "singleton",
                "properties" : {
                    "key1" : {
                        "type" : "string",
                        "index" : "not_analyzed"
                    }
                }
            },
            "id" : {
                "type" : "string",
                "index" : "not_analyzed",
                "cql_collection" : "singleton",
                "cql_partition_key" : true,
                "cql_primary_key_order" : 0
            }
        }
    }
}
Note that because doc_values is true by default for a not analyzed field, it does not appear in the mapping.
Parent-Child Relationship
Elassandra supports parent-child relationships when parent and child documents are located on the same cassandra node. This condition is met:
• when running a single-node cluster,
• when the keyspace replication factor equals the number of nodes, or
• when the parent and child documents share the same cassandra partition key, as shown in the following example.
Create an index company (a cassandra keyspace) and a cassandra table, insert two rows, and map this table as the document type employee.
cqlsh <<EOF
CREATE KEYSPACE IF NOT EXISTS company WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'};
CREATE TABLE company.employee (
    "_parent" text,
    "_id" text,
    name text,
    dob timestamp,
    hobby text,
    primary key (("_parent"), "_id")
);
INSERT INTO company.employee ("_parent","_id",name,dob,hobby) VALUES ('london','1','Alice Smith','1970-10-24','hiking');
INSERT INTO company.employee ("_parent","_id",name,dob,hobby) VALUES ('london','2','Alice Smith','1990-10-24','hiking');
EOF

curl -XPUT "http://$NODE:9200/company2" -d '{
  "mappings" : {
    "employee" : {
      "discover" : ".*",
      "_parent" : { "type" : "branch", "cql_parent_pk" : "branch" }
    }
  }
}'

curl -XPOST "http://127.0.0.1:9200/company/branch/_bulk" -d '
{ "index" : { "_id" : "london" }}
{ "district" : "London Westminster", "city" : "London", "country" : "UK" }
{ "index" : { "_id" : "liverpool" }}
{ "district" : "Liverpool Central", "city" : "Liverpool", "country" : "UK" }
{ "index" : { "_id" : "paris" }}
{ "district" : "Champs Élysées", "city" : "Paris", "country" : "France" }
'
Search for documents having child documents of type employee with a dob date greater than 1980.
curl -XGET "http://$NODE:9200/company2/branch/_search?pretty=true" -d '{
  "query" : {
    "has_child" : {
      "type" : "employee",
      "query" : {
        "range" : {
          "dob" : { "gte" : "1980-01-01" }
        }
      }
    }
  }
}'
Search for employee documents having a parent document where country matches UK.
curl -XGET "http://$NODE:9200/company2/employee/_search?pretty=true" -d '{
  "query" : {
    "has_parent" : {
      "parent_type" : "branch",
      "query" : {
        "match" : { "country" : "UK" }
      }
    }
  }
}'
Indexing Cassandra static columns
When a Cassandra table has one or more clustering columns, a static column is shared by all the rows with the same partition key.
A slight modification of the cassandra code provides support for secondary indexing of static columns, allowing searches on static column values (CQL search on static columns remains unsupported). Each time a static column is modified, a document containing the partition key and only the static columns is indexed in Elasticsearch. By default, static columns are not indexed with every wide row because any update on a static column would require re-indexing all the wide rows. However, you can request fields backed by a static column on any get/search request.
The following example demonstrates how to use static columns to store meta information of a time series.
curl -XPUT "http://localhost:9200/test" -d '{
  "mappings" : {
    "timeseries" : {
      "properties" : {
        "t" : {
          "type" : "date",
          "format" : "strict_date_optional_time||epoch_millis",
          "cql_primary_key_order" : 1,
          "cql_collection" : "singleton"
        },
        "meta" : {
          "type" : "nested",
          "cql_struct" : "map",
          "cql_static_column" : true,
          "cql_collection" : "singleton",
          "include_in_parent" : true,
          "index_static_document" : true,
          "index_static_columns" : true,
          "properties" : {
            "region" : {
              "type" : "string"
            }
          }
        },
        "v" : {
          "type" : "double",
          "cql_collection" : "singleton"
        },
        "m" : {
          "type" : "string",
          "cql_partition_key" : true,
          "cql_primary_key_order" : 0,
          "cql_collection" : "singleton"
        }
      }
    }
  }
}'

cqlsh <<EOF
INSERT INTO test.timeseries (m, t, v) VALUES ('server1-cpu', '2016-04-10 13:30', 10);
INSERT INTO test.timeseries (m, t, v) VALUES ('server1-cpu', '2016-04-10 13:31', 20);
INSERT INTO test.timeseries (m, t, v) VALUES ('server1-cpu', '2016-04-10 13:32', 15);
INSERT INTO test.timeseries (m, meta) VALUES ('server1-cpu', {'region': 'west'});
SELECT * FROM test.timeseries;
EOF

 m           | t                           | meta               | v
-------------+-----------------------------+--------------------+----
 server1-cpu | 2016-04-10 11:30:00.000000z | {'region': 'west'} | 10
 server1-cpu | 2016-04-10 11:31:00.000000z | {'region': 'west'} | 20
 server1-cpu | 2016-04-10 11:32:00.000000z | {'region': 'west'} | 15
Search for wide rows only where v=10 and fetch the meta.region field.
curl -XGET "http://localhost:9200/test/timeseries/_search?pretty=true&q=v:10&fields=m,t,v,meta.region,_source"
"hits" : [ {
  "_index" : "test",
  "_type" : "timeseries",
  "_id" : "[\"server1-cpu\",1460287800000]",
  "_score" : 1.9162908,
  "_routing" : "server1-cpu",
  "_source" : {
    "t" : "2016-04-10T11:30:00.000Z",
    "v" : 10.0,
    "meta" : { "region" : "west" },
    "m" : "server1-cpu"
  },
  "fields" : {
    "meta.region" : [ "west" ],
    "t" : [ "2016-04-10T11:30:00.000Z" ],
    "m" : [ "server1-cpu" ],
    "v" : [ 10.0 ]
  }
} ]
Search for rows where meta.region=west; this returns only a static document (i.e. a document containing the partition key and static columns) because index_static_document is true.
curl -XGET "http://localhost:9200/test/timeseries/_search?pretty=true&q=meta.region:west&fields=m,t,v,meta.region"
"hits" : {
  "total" : 1,
  "max_score" : 1.5108256,
  "hits" : [ {
    "_index" : "test",
    "_type" : "timeseries",
    "_id" : "server1-cpu",
    "_score" : 1.5108256,
    "_routing" : "server1-cpu",
    "fields" : {
      "m" : [ "server1-cpu" ],
      "meta.region" : [ "west" ]
    }
  } ]
}
If needed, you can change the default behavior for a specific cassandra table (or elasticsearch document type), by using the following custom metadata :
• index_static_document controls whether or not static document (i.e. document containg the partition key and static columns) are indexed (default is false).
• index_static_only if true, it ony indexes static documents with partition key as _id and static columns as fields.
• index_static_columns controls whether or not static columns are included in indexed documents (default is false).
Be careful: if ``index_static_document``=*false* and ``index_static_only``=*true*, no document is indexed at all.
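Taken together, the first two flags decide what gets indexed. A minimal sketch of that decision as a hypothetical shell helper (not part of Elassandra), covering the combinations listed above:

```shell
# Hypothetical helper: given the two metadata flags above, report which
# documents a mapping would index.
indexed_documents() {
  local static_document="$1" static_only="$2"
  if [ "$static_only" = "true" ]; then
    # index_static_only=true without index_static_document=true
    # indexes nothing at all, as the warning above states.
    if [ "$static_document" = "true" ]; then
      echo "static documents only"
    else
      echo "nothing"
    fi
  elif [ "$static_document" = "true" ]; then
    echo "wide rows and static documents"
  else
    echo "wide rows only"
  fi
}
indexed_documents false true   # -> nothing
indexed_documents true  false  # -> wide rows and static documents
```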
In our example, with the following mapping, static columns are indexed in every document, making them searchable.
curl -XPUT "http://localhost:9200/test/_mapping/timeseries" -d '{
    "timeseries" : {
        "discover" : ".*",
        "_meta" : {
            "index_static_document" : true,
            "index_static_columns" : true
        }
    }
}'
Elassandra as a JSON-REST Gateway
When dynamic mapping is disabled and a mapping type has no indexed field, Elassandra nodes can act as a JSON-REST gateway for Cassandra, getting, setting or deleting a Cassandra row without any indexing overhead. In this case, the mapping may be used to cast types or format date fields, as shown below.
CREATE TABLE twitter.tweet (
    "_id" text PRIMARY KEY,
    message list<text>,
    post_date list<timestamp>,
    size list<bigint>,
    user list<text>
);

curl -XPUT "http://$NODE:9200/twitter/" -d '{
"settings" :{ "index.mapper.dynamic" :false },
"mappings" :{
"tweet" :{
"properties" :{
"size" : { "type" : "long" , "index" : "no" },
"post_date" : { "type" : "date", "index" : "no", "format" : "strict_date_optional_time||epoch_millis" }
}
}
}
} '
As a result, you can index, get or delete a Cassandra row, including any column of your Cassandra table.
curl -XPUT "http://localhost:9200/twitter/tweet/1?consistency=one" -d '{
"user" : "vince",
"post_date" : "2009-11-15T14:12:12",
"message" : "look at Elassandra !!",
"size": 50
}'
{"_index":"twitter","_type":"tweet","_id":"1","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
$ curl -XGET "http://localhost:9200/twitter/tweet/1?pretty=true&fields=message,user,size,post_date"
{
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_version" : 1,
"found" : true,
"fields" : {
"size" : [ 50 ],
"post_date" : [ "2009-11-15T14:12:12.000Z" ],
"message" : [ "look at Elassandra !!" ],
"user" : [ "vince" ]
}
}
$ curl -XDELETE "http://localhost:9200/twitter/tweet/1?pretty=true"
{
"found" : true,
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_version" : 0,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
}
}
Check Cassandra consistency with elasticsearch
When index.include_node = true (default is false), the _node metafield, containing the Cassandra host id, is included in every indexed document. This allows you to distinguish multiple copies of a document when the datacenter replication factor is greater than one. A token range aggregation can then count the number of documents for each token range and each Cassandra node.

In the following example, we have 1000 account documents in a keyspace with RF=2 in a two-node datacenter, and each token range has the same number of documents on both nodes.

curl -XGET "http://$NODE:9200/accounts/_search?pretty=true&size=0" -d '{
"aggs" : {
"tokens" : {
"token_range" : {
"field" : "_token"
},
"aggs" : {
"nodes" : {
"terms" : { "field" : "_node" }
}
}
}
}
} '
{
  "took" : 23,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "hits" : {
    "total" : 2000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
"aggregations" : {
"tokens" : {
"buckets" : [ {
"key" : "(-9223372036854775807,-4215073831085397715]" ,
"from" : -9223372036854775807 ,
"from_as_string" : "-9223372036854775807" ,
"to" : -4215073831085397715 ,
"to_as_string" : "-4215073831085397715" ,
"doc_count" : 562 ,
"nodes" : {
"doc_count_error_upper_bound" : 0 ,
"sum_other_doc_count" : 0 ,
"buckets" : [ {
"key" : "528b78d3-fae9-49ae-969a-96668566f1c3" ,
"doc_count" : 281
}, {
"key" : "7f0b782e-5b75-409b-85e9-f5f96a75a7dc" ,
"doc_count" : 281
} ]
}
}, {
"key" : "(-4215073831085397714,7919694572960951318]" ,
"from" : -4215073831085397714 ,
"from_as_string" : "-4215073831085397714" ,
"to" : 7919694572960951318 ,
"to_as_string" : "7919694572960951318" ,
"doc_count" : 1268 ,
"nodes" : {
"doc_count_error_upper_bound" : 0 ,
"sum_other_doc_count" : 0 ,
"buckets" : [ {
"key" : "528b78d3-fae9-49ae-969a-96668566f1c3" ,
"doc_count" : 634
}, {
"key" : "7f0b782e-5b75-409b-85e9-f5f96a75a7dc" ,
"doc_count" : 634
} ]
}
}, {
"key" : "(7919694572960951319,9223372036854775807]" ,
"from" : 7919694572960951319 ,
"from_as_string" : "7919694572960951319" ,
"to" : 9223372036854775807 ,
"to_as_string" : "9223372036854775807" ,
"doc_count" : 170 ,
"nodes" : {
"doc_count_error_upper_bound" : 0 ,
"sum_other_doc_count" : 0 ,
"buckets" : [ {
"key" : "528b78d3-fae9-49ae-969a-96668566f1c3" ,
"doc_count" : 85
}, {
"key" : "7f0b782e-5b75-409b-85e9-f5f96a75a7dc" ,
"doc_count" : 85
} ]
}
} ]
}
}
}
Of course, depending on your use case, you should add a filter to your query to ignore write operations occurring during the check.
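For example, assuming the indexed documents carry a date field such as post_date (the field name and the bound below are illustrative, not part of the accounts example), a range query can restrict the aggregation to documents written before the check started:

```json
{
    "query" : { "range" : { "post_date" : { "lt" : "2016-04-10T11:30:00" } } },
    "aggs" : {
        "tokens" : {
            "token_range" : { "field" : "_token" },
            "aggs" : { "nodes" : { "terms" : { "field" : "_node" } } }
        }
    }
}
```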
Chapter 5. Operations
Indexing
Let's try to index some twitter-like information (demo from Elasticsearch). First, let's create a twitter user, and add some tweets (the twitter index will be created automatically, see automatic index and mapping creation in the Elasticsearch documentation):

curl -XPUT 'http://localhost:9200/twitter/user/kimchy' -d '{ "name" : "Shay Banon" }'

curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '
{
"user" : "kimchy" ,
"postDate" : "2009-11-15T13:12:00" ,
"message" : "Trying out Elassandra, so far so good?"
}'

curl -XPUT 'http://localhost:9200/twitter/tweet/2' -d '
{
"user" : "kimchy" ,
"postDate" : "2009-11-15T14:12:12" ,
"message" : "Another tweet, will it be indexed?"
}'
You now have two rows in the Cassandra twitter.tweet table.
cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 2.1.8 | CQL spec 3.2.0 | Native protocol v3]
Use HELP for help.
cqlsh> select * from twitter.tweet;

 _id | message                                    | postDate                     | user
-----+--------------------------------------------+------------------------------+------------
   2 |     ['Another tweet, will it be indexed?'] | ['2009-11-15 15:12:12+0100'] | ['kimchy']
   1 | ['Trying out Elassandra, so far so good?'] | ['2009-11-15 14:12:00+0100'] | ['kimchy']

(2 rows)
Apache Cassandra is a column store that only supports upsert operations. This means that deleting a cell or a row involves the creation of a tombstone (inserting a null) kept until the compaction later removes both the obsolete data and the tombstone (see this blog about Cassandra tombstones).

By default, when using the Elasticsearch API to replace a document with a new one, Elassandra inserts a row corresponding to the new document, including null for unset fields. Without these nulls (cell tombstones), old fields not present in the new document would be kept at the Cassandra level as zombie cells.
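For example, replacing tweet 1 (indexed earlier with user, postDate and message) by a document containing only message would translate to a CQL insert roughly like the following sketch, where the explicit nulls produce the cell tombstones discussed above:

```sql
-- Sketch (not actual Elassandra output): re-indexing tweet 1 with only
-- a "message" field inserts null for the unset columns, creating cell
-- tombstones instead of leaving zombie cells.
INSERT INTO twitter.tweet ("_id", message, "postDate", user)
VALUES ('1', ['a brand new message'], null, null);
```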
Moreover, indexing with op_type=create (see `Elasticsearch indexing <https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#operation-type>`_) requires a Cassandra PAXOS transaction to check whether the document already exists in the underlying datacenter. This comes with a useless performance cost if you use automatically generated document IDs (see Automatic ID generation), as this ID will be the Cassandra primary key.
Depending on op_type and the document ID, CQL requests are issued as follows when indexing with the Elasticsearch API:

op_type  | Generated ID                | Provided ID                                   | Comment
---------+-----------------------------+-----------------------------------------------+--------------------------------------------
create   | INSERT INTO ... VALUES(...) | INSERT INTO ... VALUES(...) IF NOT EXISTS (1) | Index a new document.
index    | INSERT INTO ... VALUES(...) | DELETE FROM ... WHERE ...;                    | Replace a document that may already exist.
         |                             | INSERT INTO ... VALUES(...)                   |
(1) The IF NOT EXISTS comes with the cost of the PAXOS transaction. If you don't need to check the uniqueness of the provided ID, add the parameter check_unique_id=false.
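A hedged sketch of how that parameter could be applied: a small shell helper (hypothetical, not part of Elassandra) that appends check_unique_id=false to the document URL when the caller already guarantees ID uniqueness:

```shell
# Hypothetical helper: build the indexing URL, appending
# check_unique_id=false when the caller already guarantees that the
# provided ID is unique (skipping the PAXOS IF NOT EXISTS check).
index_url() {
  local host="$1" index="$2" type="$3" id="$4" check="$5"
  local url="http://${host}:9200/${index}/${type}/${id}"
  if [ "$check" = "false" ]; then
    url="${url}?check_unique_id=false"
  fi
  echo "$url"
}
index_url localhost twitter tweet 1 false
# -> http://localhost:9200/twitter/tweet/1?check_unique_id=false
```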
GETing
Now, let's see if the information was added, by GETting it:

curl -XGET 'http://localhost:9200/twitter/user/kimchy?pretty=true'
curl -XGET 'http://localhost:9200/twitter/tweet/1?pretty=true'
curl -XGET 'http://localhost:9200/twitter/tweet/2?pretty=true'
The Elasticsearch state now reflects the new twitter index. Because we are currently running on one node, the token_ranges routing attribute covers 100% of the ring, from Long.MIN_VALUE to Long.MAX_VALUE.

curl -XGET 'http://localhost:9200/_cluster/state/?pretty=true'
{
"cluster_name" : "Test Cluster" ,
"version" : 5 ,
"master_node" : "74ae1629-0149-4e65-b790-cd25c7406675" ,
"blocks" : { },
"nodes" : {
"74ae1629-0149-4e65-b790-cd25c7406675" : {
"name" : "localhost" ,
"status" : "ALIVE" ,
"transport_address" : "inet[localhost/127.0.0.1:9300]" ,
"attributes" : {
"data" : "true" ,
"rack" : "RAC1" ,
"data_center" : "DC1" ,
"master" : "true"
}
}
},
"metadata" : {
"version" : 3 ,
"uuid" : "74ae1629-0149-4e65-b790-cd25c7406675" ,
"templates" : { },
"indices" : {
"twitter" : {
"state" : "open" ,
"settings" : {
"index" : {
"creation_date" : "1440659762584" ,
"uuid" : "fyqNMDfnRgeRE9KgTqxFWw" ,
"number_of_replicas" : "1" ,
"number_of_shards" : "1" ,
"version" : {
"created" : "1050299"
}
}
},
"mappings" : {
"user" : {
"properties" : {
"name" : {
"type" : "string"
}
}
},
"tweet" : {
"properties" : {
"message" : {
"type" : "string"
},
"postDate" : {
"format" : "dateOptionalTime" ,
"type" : "date"
},
"user" : {
"type" : "string"
}
}
}
},
"aliases" : [ ]
}
}
},
"routing_table" : {
"indices" : {
"twitter" : {
"shards" : {
"0" : [ {
"state" : "STARTED" ,
"primary" : true,
"node" : "74ae1629-0149-4e65-b790-cd25c7406675" ,
"token_ranges" : [ "(-9223372036854775808,9223372036854775807]" ],
"shard" : 0 ,
"index" : "twitter"
} ]
}
}
}
},
"routing_nodes" : {
"unassigned" : [ ],
"nodes" : {
"74ae1629-0149-4e65-b790-cd25c7406675" : [ {
"state" : "STARTED" ,
"primary" : true,
"node" : "74ae1629-0149-4e65-b790-cd25c7406675" ,
"token_ranges" : [ "(-9223372036854775808,9223372036854775807]" ],
"shard" : 0 ,
"index" : "twitter"
} ]
}
},
"allocations" : [ ]
}
Updates
In Cassandra, an update is an upsert operation (if the row does not exist, it's an insert). Like Elasticsearch, Elassandra issues a GET operation before any update. Then, to keep the same semantics as Elasticsearch, update operations are converted to upserts with the ALL consistency level, so that subsequent GET operations are consistent. (You should consider a CQL UPDATE operation to avoid this performance cost.)

Scripted updates and upserts (scripted_upsert and doc_as_upsert) are also supported.
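As a sketch of the CQL alternative mentioned above, the following hypothetical helper builds an UPDATE statement for the twitter.tweet table (message is a list-valued column, as shown in the cqlsh output earlier); feeding its output to cqlsh skips the Elasticsearch GET-before-update, at the price of losing Elasticsearch update semantics such as scripted updates:

```shell
# Hypothetical helper: build a CQL UPDATE for the twitter.tweet table.
# Running its output through cqlsh avoids the GET-before-update issued
# by the Elasticsearch update API.
cql_update_tweet() {
  local id="$1" message="$2"
  echo "UPDATE twitter.tweet SET message = ['${message}'] WHERE \"_id\" = '${id}';"
}
cql_update_tweet 1 "an updated tweet"
# -> UPDATE twitter.tweet SET message = ['an updated tweet'] WHERE "_id" = '1';
```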
Searching
Let's find all the tweets that kimchy posted:

curl -XGET 'http://localhost:9200/twitter/tweet/_search?q=user:kimchy&pretty=true'

We can also use the JSON query language Elasticsearch provides instead of a query string:

curl -XGET 'http://localhost:9200/twitter/tweet/_search?pretty=true' -d '
{
"query" : {
"match" : { "user" : "kimchy" }
}
} '
To avoid duplicate results when the Cassandra replication factor is greater than one, Elassandra adds a token_ranges filter to every query distributed to all nodes. Because every document contains a _token field computed at index time, this ensures that a node only retrieves documents for the requested token ranges. The token_ranges parameter is a conjunction of Lucene NumericRangeQuery built from the Elasticsearch routing tables to cover the entire Cassandra ring:
curl -XGET 'http://localhost:9200/twitter/tweet/_search?pretty=true&token_ranges=(0,9223372036854775807)' -d '
{
"query" : {
"match" : { "user" : "kimchy" }
}
} '
Of course, if the token range filter covers all ranges (Long.MIN_VALUE to Long.MAX_VALUE), Elassandra automatically removes the useless filter.
Finally, you can restrict a query to the coordinator node with the preference=_only_local parameter, for all token ranges, as shown below:

curl -XGET 'http://localhost:9200/twitter/tweet/_search?pretty=true&preference=_only_local&token_ranges=' -d '
{
"query" : {
"match" : { "user" : "kimchy" }
}
} '
Optimizing search requests
The search strategy
Elassandra supports various search strategies to distribute a search request over the Elasticsearch cluster. A search strategy is configured at index-level with the index.search_strategy_class parameter.
• org.elassandra.cluster.routing.PrimaryFirstSearchStrategy (default): searches on all alive nodes in the datacenter. All alive nodes respond for their primary token ranges, and for replica token ranges when some nodes are unavailable. This strategy is always used to build the routing table in the cluster state.
• org.elassandra.cluster.routing.RandomSearchStrategy: for each query, randomly distributes the search request to a minimum of nodes to reduce network traffic. For example, if your underlying keyspace replication factor is N, a search only involves 1/N of the nodes.
You can create an index with the RandomSearchStrategy as shown below.
curl -XPUT "http://localhost:9200/twitter/" -d '{
"settings" : {
"index.search_strategy_class" : "RandomSearchStrategy"
}
} '
Tip: When changing a keyspace replication factor, you can force an Elasticsearch routing table update by closing and re-opening all associated Elasticsearch indices. To troubleshoot search request routing, set the logging level to DEBUG for the class org.elassandra.cluster.routing in the conf/logback.xml file.
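The close/re-open sequence from the tip can be sketched as a small shell helper that emits the two commands for a given index (hypothetical; the index name and host are placeholders):

```shell
# Hypothetical helper: emit the close/open commands that force a
# routing table update for a given index after a replication change.
reopen_index_commands() {
  echo "curl -XPOST 'http://localhost:9200/$1/_close'"
  echo "curl -XPOST 'http://localhost:9200/$1/_open'"
}
reopen_index_commands twitter
```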
Caching features
Compared to Elasticsearch, Elassandra introduces a search overhead by adding a token ranges filter to each query and by fetching fields through a CQL request at the Cassandra layer. Both overheads can be mitigated by using caching features.
Token Ranges Query Cache
Token ranges filters depend on the node or vnodes configuration, are quite stable, and are shared by all keyspaces having the same replication factor. These filters only change when the datacenter topology changes, for example when a node is temporarily down or when a node is added to the datacenter. So, Elassandra uses a cache to keep these queries, a conjunction of Lucene NumericRangeQuery, often reused for every search request.

As a classic caching strategy, the token_ranges_query_expire setting controls how long unused token ranges filter queries are kept in memory. The default is 5 minutes.
Token Ranges Bitset Cache
When enabled, the token ranges bitset cache keeps in memory the result of the token range filter for each Lucene segment. This in-memory bitset, acting like the liveDocs Lucene tombstones mechanism, is then reused for subsequent Lucene search queries. For each Lucene segment, this document bitset is updated when the Lucene tombstones count increases (it's a bitwise AND between the actual Lucene tombstones and the token range filter result), or removed if the corresponding token ranges query is evicted as unused from the token range query cache.

You can enable the token range bitset cache at index level by setting index.token_ranges_bitset_cache to true (default is false), or configure its default value for newly created indices at cluster or system level.
You can also bypass this cache by adding token_ranges_bitset_cache=false to your search request:

curl -XGET "http://localhost:9200/twitter/_search?token_ranges_bitset_cache=false&q=*:*"

Finally, you can check the in-memory size of the token ranges bitset cache with the Elasticsearch stats API, and clear it when clearing the Elasticsearch query_cache:

curl -XGET "http://localhost:9200/_stats?pretty=true"
...
"segments" : {
  "count" : 3,
  "memory_in_bytes" : 26711,
  "terms_memory_in_bytes" : 23563,
  "stored_fields_memory_in_bytes" : 1032,
  "term_vectors_memory_in_bytes" : 0,
  "norms_memory_in_bytes" : 384,
  "doc_values_memory_in_bytes" : 1732,
  "index_writer_memory_in_bytes" : 0,
  "index_writer_max_memory_in_bytes" : 421108121,
  "version_map_memory_in_bytes" : 0,
  "fixed_bit_set_memory_in_bytes" : 0,
  "token_ranges_bit_set_memory_in_bytes" : 240
},
...
Cassandra Key and Row Cache
To improve CQL fetch request response times, Cassandra provides key and row caching features, configured for each Cassandra table as follows:
ALTER TABLE ...
WITH caching = { 'keys' : 'ALL' , 'rows_per_partition' : '1' };
To enable Cassandra row caching, set the row_cache_size_in_mb parameter in your conf/cassandra.yaml, and set row_cache_class_name: org.apache.cassandra.cache.OHCProvider to use off-heap memory.
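A minimal conf/cassandra.yaml sketch enabling the off-heap row cache; the 64 MB size is an arbitrary illustration, not a recommendation:

```yaml
# Enable the row cache with off-heap storage (size is illustrative).
row_cache_size_in_mb: 64
row_cache_class_name: org.apache.cassandra.cache.OHCProvider
```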
Tip: Elasticsearch also provides a Lucene query cache, used for segments having more than 10k documents, and for some frequent queries (queries run more than 5 or 20 times, depending on the nature of the query). The shard request cache can also be enabled if the token range bitset cache is disabled.
Create, delete and rebuild index
In order to create an Elasticsearch index from an existing Cassandra table, you can specify the underlying keyspace. In the following example, all columns but message are automatically mapped with the default mapping, and message is explicitly mapped with a custom mapping.
curl -XPUT 'http://localhost:9200/twitter_index' -d '{
    "settings" : { "keyspace" : "twitter" },
    "mappings" : {
        "tweet" : {
            "discover" : "^(?!message).*",
            "properties" : {
                "message" : { "type" : "string", "index" : "analyzed", "cql_collection" : "singleton" }
            }
        }
    }
}'
Deleting an Elasticsearch index does not remove any Cassandra data: it keeps the underlying Cassandra tables but removes the Elasticsearch index files.

curl -XDELETE 'http://localhost:9200/twitter_index'

To re-index your existing data, for example after a mapping change to index a new column, run a nodetool rebuild_index as follows:

nodetool rebuild_index [--threads <N>] <keyspace> <table> elastic_<table>_idx
Tip: By default, rebuild index runs on a single thread. In order to improve re-indexing performance, Elassandra comes with a multi-threaded rebuild_index implementation. The --threads parameter allows you to specify the number of threads dedicated to re-indexing a Cassandra table. The number of indexing threads should be tuned carefully to avoid CPU exhaustion. Moreover, indexing throughput is limited by locking at the Lucene level, but this limit can be exceeded by using a partitioned index involving many independent shards.
Alternatively, you can use the built-in rebuild action to rebuild an index on your whole Elasticsearch cluster at the same time. The num_threads parameter is optional, the default is one, but you should consider the load of your cluster in a production environment.

curl -XGET 'http://localhost:9200/twitter_index/_rebuild?num_threads=4'

Re-indexing existing data relies on the Cassandra compaction manager. You can trigger a Cassandra compaction when:
• Creating the first Elasticsearch index on a Cassandra table with existing data,
• Running a nodetool rebuild_index command,
• Running a nodetool repair on a keyspace having indexed tables (a repair actually creates new SSTables triggering index build).
If the compaction manager is busy, secondary index rebuild is added as a pending task and executed later on. You can check current running compactions with a nodetool compactionstats and check pending compaction tasks with a nodetool tpstats.
nodetool -h 52.43.156.196 compactionstats
pending tasks: 1
   compaction type                                   id   keyspace    table   completed       total   unit   progress
Secondary index build   052c70f0-8690-11e6-aa56-674c194215f6   lastfm   playlist    66347424   330228366   bytes     20,09%
Active compaction remaining time : 0h00m00s
To stop a compaction task (including a rebuild index task), you can either use a nodetool stop or use the JMX management operation stopCompactionById (on MBean org.apache.cassandra.db.CompactionManager).
Open, close, index
Open and close operations allow you to close and open an Elasticsearch index. Even if the Cassandra secondary index remains in the CQL schema while the index is closed, it has no overhead: it's just a dummy function call. Obviously, when several Elasticsearch indices are associated with the same Cassandra table, data is indexed in the open indices, but not in the closed ones.
curl -XPOST 'localhost:9200/my_index/_close'
curl -XPOST 'localhost:9200/my_index/_open'
Warning: The Elasticsearch translog is disabled in Elassandra, so you might lose some indexed documents when closing an index if index.flush_on_close is false.
Flush, refresh index
A refresh makes all index updates performed since the last refresh available for search. By default, refresh is scheduled every second. By design, setting refresh=true on an index operation has no effect with Elassandra, because write operations are converted to CQL queries and documents are indexed later by a custom secondary index. So, the per-index refresh interval should be set carefully according to your needs.
curl XPOST 'localhost:9200/my_index/_refresh'
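Since refresh=true is a no-op here, the refresh period is driven by the standard index.refresh_interval setting. A sketch of a settings body raising it to 10 seconds (an arbitrary value) to reduce refresh overhead, to be sent to the index _settings endpoint:

```json
{ "index" : { "refresh_interval" : "10s" } }
```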
A flush basically writes a Lucene index to disk. Because the document _source is stored in the Cassandra table in Elassandra, it makes sense to execute a nodetool flush <keyspace> <table> to flush both Cassandra Memtables to SSTables and Lucene files for all associated Elasticsearch indices. Moreover, remember that a nodetool snapshot also involves a flush before creating the snapshot.
curl XPOST 'localhost:9200/my_index/_flush'
Percolator
Elassandra supports the distributed percolator by storing percolation queries in a dedicated Cassandra table _percolator. As for documents, token ranges filtering applies to avoid duplicate query matching.
curl -XPUT "localhost:9200/my_index" -d '{
"mappings" : {
"my_type" : {
"properties" : {
"message" : { "type" : "string" },
"created_at" : { "type" : "date" }
}
}
    }
}'

curl -XPUT "localhost:9200/my_index/.percolator/1" -d '{
"query" : {
"match" : {
"message" : "bonsai tree"
}
}
}'

curl -XPUT "localhost:9200/my_index/.percolator/2" -d '{
"query" : {
"match" : {
"message" : "bonsai tree"
}
},
"priority" : "high"
}'

curl -XPUT "localhost:9200/my_index/.percolator/3" -d '{
"query" : {
"range" : {
"created_at" : {
"gte" : "2010-01-01T00:00:00" ,
"lte" : "2011-01-01T00:00:00"
}
}
},
"type" : "tweet" ,
"priority" : "high"
} '
Then search for matching queries.
curl -XGET 'localhost:9200/my_index/my_type/_percolate?pretty=true' -d '{
"doc" : {
"message" : "A new bonsai tree in the office"
}
} '
{
  "took" : 4 ,
  "_shards" : {
    "total" : 2 ,
    "successful" : 2 ,
    "failed" : 0
  },
  "total" : 2 ,
  "matches" : [ {
    "_index" : "my_index" ,
    "_id" : "2"
  }, {
    "_index" : "my_index" ,
    "_id" : "1"
  } ]
}

curl -XGET 'localhost:9200/my_index/my_type/_percolate?pretty=true' -d '{
"doc" : {
"message" : "A new bonsai tree in the office"
},
"filter" : {
"term" : {
"priority" : "high"
}
}
} '
{
"took" : 4 ,
"_shards" : {
"total" : 2 ,
"successful" : 2 ,
"failed" : 0
},
"total" : 1 ,
"matches" : [ {
"_index" : "my_index" ,
"_id" : "2"
} ]
}
Managing Elassandra nodes
You can add, remove or replace an Elassandra node using the same procedures as for Cassandra (see Adding nodes to an existing cluster). Even if it's technically possible, you should never bootstrap more than one node at a time.

During the bootstrap process, data pulled from existing nodes is automatically indexed by Elasticsearch on the new node, amounting to a kind of automatic Elasticsearch resharding. You can monitor and resume the Cassandra bootstrap process with the nodetool bootstrap command.

After the bootstrap successfully ends, you should clean up nodes to throw out any data that is no longer owned by that node,
with a nodetool cleanup. Because cleanup involves a delete-by-query in Elasticsearch indices, it is recommended to smoothly schedule cleanups one at a time in your datacenter.
Backup and restore
By design, Elassandra synchronously updates Elasticsearch indices on the Cassandra write path, and flushing a Cassandra table involves a flush of all associated Elasticsearch indices. Therefore, Elassandra can back up data by taking a snapshot of Cassandra SSTables and Elasticsearch Lucene files at the same time on each node, as follows:
1. nodetool snapshot --tag <snapshot_name> <keyspace_name>
2. For all indices associated to <keyspace_name>:
   cp -al $CASSANDRA_DATA/elasticsearch.data/<cluster_name>/nodes/0/indices/<index_name>/0/index/(_*|segment*) $CASSANDRA_DATA/elasticsearch.data/snapshots/<index_name>/<snapshot_name>/
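The two per-index paths used in step 2 can be sketched as shell helpers (hypothetical; the node ordinal 0 and the directory layout are taken from the steps above):

```shell
# Hypothetical helpers computing the source and destination paths of
# step 2, following the layout above (node ordinal 0 is an assumption).
lucene_src() {  # <data_dir> <cluster_name> <index_name>
  echo "$1/elasticsearch.data/$2/nodes/0/indices/$3/0/index"
}
snapshot_dst() {  # <data_dir> <index_name> <snapshot_name>
  echo "$1/elasticsearch.data/snapshots/$2/$3"
}
lucene_src /var/lib/cassandra MyCluster twitter_index
snapshot_dst /var/lib/cassandra twitter_index backup_20170901
```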
Of course, rebuilding Elasticsearch indices after a Cassandra restore is another option.
Restoring a snapshot
Restoring Cassandra SSTables and Elasticsearch Lucene files allows you to recover a keyspace and its associated Elasticsearch indices without stopping any node (but it is not intended to duplicate data to another virtual datacenter or cluster).

To perform a hot restore of a Cassandra keyspace and its Elasticsearch indices:
1. Close all Elasticsearch indices associated to the keyspace
2. Truncate all Cassandra tables of the keyspace (because of delete operations later than the snapshot)
3. Restore the Cassandra table with your snapshot on each node
4. Restore the Elasticsearch snapshot on each node (if an ES index is open during the nodetool refresh, this causes an Elasticsearch index rebuild by the compaction manager, usually on 2 threads).
5. Load restored SSTables with a nodetool refresh
6. Open all indices associated to the keyspace
Point in time recovery
Point-in-time recovery is intended to recover the data at any point in time. This requires restoring the last available Cassandra and Elasticsearch snapshots before your recovery point, and then applying the commitlogs from that restore point to the recovery point. In this case, replaying commitlogs on startup also re-indexes data in the Elasticsearch indices, ensuring consistency at the recovery point.

Of course, when stopping a production cluster is not possible, you should restore on a temporary cluster, make a full snapshot, and restore it on your production cluster as described by the hot restore procedure.

To perform a point-in-time recovery of a Cassandra keyspace and its Elasticsearch indices, for all nodes at the same time:
1. Stop all the datacenter nodes.
2. Restore the last Cassandra snapshot before the restore point and commitlogs from that point to the restore point
3. Restore the last Elasticsearch snapshot before the restore point.
4. Restart your nodes
Restoring to a different cluster
It is possible to restore a Cassandra keyspace and its associated Elasticsearch indices to another cluster.
1. On the target cluster, create the same Cassandra schema without any custom secondary indices
2. From the source cluster, extract the mapping of your associated indices and apply it to your destination cluster.
Your keyspace and indices should be open and empty at this step.
If you are restoring into a new cluster having the same number of nodes, configure it with the same token ranges (see https://docs.datastax.com/en/Cassandra/2.1/cassandra/operations/ops_snapshot_restore_new_cluster.html). In this case, you can restore from the Cassandra and Elasticsearch snapshots as described in steps 1, 3 and 4 of the snapshot restore procedure.

Otherwise, when the number of nodes and the token ranges of the source and destination clusters do not match, use sstableloader to restore your Cassandra snapshots (see https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsBulkloader_t.html). This approach is much more time- and IO-consuming, because all rows are read from the sstables and injected into the Cassandra cluster, causing a full Elasticsearch index rebuild.
How to change the elassandra cluster name
Because the cluster name is part of the Elasticsearch directory structure, managing snapshots with shell scripts can be a nightmare when the cluster name contains space characters. Therefore, it is recommended to avoid space characters in your Elassandra cluster name.
On all nodes:
1. In cqlsh: UPDATE system.local SET cluster_name = '<new_cluster_name>' WHERE key='local';
2. Update the cluster_name parameter with the same value in your conf/cassandra.yaml
3. Run a nodetool flush system (this flushes your system keyspace to disk)
Then:
4. On one node only, change the primary key of your cluster metadata in the elastic_admin.metadata table, using cqlsh:
• COPY elastic_admin.metadata (cluster_name, metadata, owner, version) TO 'metadata.csv';
• Update the cluster name in the file metadata.csv (first field in the JSON document).
• COPY elastic_admin.metadata (cluster_name, metadata, owner, version) FROM 'metadata.csv';
• DELETE FROM elastic_admin.metadata WHERE cluster_name='<old_cluster_name>';
5. Stop all nodes in the cluster
6. On all nodes, in you Cassandra data directory, move elasticsearch.data/<old_cluster_name> to elasticsearch.data/<new_cluster_name>
7. Restart all nodes
8. Check the cluster name in the Elasticsearch cluster state and that you can update the mapping.
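Step 4's CSV edit can be sketched as follows; the snippet fakes a one-row metadata.csv and renames the first field with GNU sed (cluster names are placeholders, and the real file is produced by the COPY ... TO command above):

```shell
# Sketch of step 4's CSV edit with a fake one-row metadata.csv
# (cluster names are placeholders; the real file comes from COPY ... TO).
OLD="Old Cluster"; NEW="NewCluster"
printf '%s,{"version":1},owner-uuid,42\n' "$OLD" > metadata.csv
# Rename the first CSV field (the cluster_name column); GNU sed syntax.
sed -i "s/^${OLD}/${NEW}/" metadata.csv
cat metadata.csv
# -> NewCluster,{"version":1},owner-uuid,42
```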
Chapter 6. Integration
Integration with an existing cassandra cluster
Elassandra includes a modified version of Cassandra 2.2, so all nodes of a cluster should run Elassandra binaries. However, you can start a node with or without Elasticsearch support. Obviously, all nodes of a datacenter should run either Cassandra only, or Cassandra with Elasticsearch.
Rolling upgrade to elassandra
Before starting any Elassandra node with Elasticsearch enabled, do a rolling replace of the Cassandra binaries with the Elassandra ones. For each node:
• Install elassandra.
• Replace the Elassandra configuration files with the ones from your existing cluster (cassandra.yaml and snitch configuration file)
• Stop your cassandra node.
• Restart Cassandra only (bin/cassandra), or Cassandra with Elasticsearch enabled (bin/cassandra -e)
Create a new elassandra datacenter
The overall procedure is similar to the Cassandra one described at https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_dc_to_cluster_t.html.

For each node in your new datacenter:
• Install elassandra.
• Set auto_bootstrap: false in your conf/cassandra.yaml.
• Start cassandra-only nodes in your new datacenter and check that all nodes join the cluster.
bin/cassandra
• Restart all nodes in your new datacenter with Elasticsearch enabled. You should see started shards but empty indices.

bin/cassandra -e
• Set the replication factor of indexed keyspaces to one or more in your new datacenter.
• Pull data from your existing datacenter.

nodetool rebuild <source datacenter name>
After the rebuild completes on all your new nodes, you should see the same number of documents for each index in your new and existing datacenters.
• Set auto_bootstrap: true (default value) in your conf/cassandra.yaml
• Create a new Elasticsearch index or map some existing Cassandra tables.
Tip: If you need to replay this procedure for a node :
• stop your node
• nodetool removenode <id-of-node-to-remove>
• clear data, commitlogs and saved_cache directories.
Installing Elasticsearch plugins

Elasticsearch plugin installation remains unchanged, see the Elasticsearch plugin installation documentation.
• bin/plugin install <url>
Running Kibana with Elassandra
Kibana version 4.6 can run with Elassandra, providing a visualization tool for Cassandra and Elasticsearch data.

• If you want to load sample data from the Kibana Getting Started tutorial, apply the following changes to logstash.jsonl with a sed command.
s/logstash-2015.05.18/logstash_20150518/g
s/logstash-2015.05.19/logstash_20150519/g
s/logstash-2015.05.20/logstash_20150520/g
s/article:modified_time/articleModified_time/g
s/article:published_time/articlePublished_time/g
s/article:section/articleSection/g
s/article:tag/articleTag/g
s/og:type/ogType/g
s/og:title/ogTitle/g
s/og:description/ogDescription/g
s/og:site_name/ogSite_name/g
s/og:url/ogUrl/g
s/og:image:width/ogImageWidth/g
s/og:image:height/ogImageHeight/g
s/og:image/ogImage/g
s/twitter:title/twitterTitle/g
s/twitter:description/twitterDescription/g
s/twitter:card/twitterCard/g
s/twitter:image/twitterImage/g
s/twitter:site/twitterSite/g
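These substitutions can be sanity-checked on a one-line sample before running sed over the full logstash.jsonl file (the sample line below is illustrative, not taken from the Kibana dataset):

```shell
# Apply three of the substitutions above to a sample line. Note that the
# more specific pattern (og:image:width) must run before the broader one
# (og:image), exactly as ordered in the sed script.
printf '{"_index":"logstash-2015.05.18","og:image:width":100,"og:image":"x"}\n' > sample.jsonl
sed -e 's/logstash-2015.05.18/logstash_20150518/g' \
    -e 's/og:image:width/ogImageWidth/g' \
    -e 's/og:image/ogImage/g' sample.jsonl > fixed.jsonl
cat fixed.jsonl
# → {"_index":"logstash_20150518","ogImageWidth":100,"ogImage":"x"}
```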
JDBC Driver sql4es + Elassandra
The Elasticsearch JDBC driver can be used with Elassandra. Here is a code example:

Class.forName("nl.anchormen.sql4es.jdbc.ESDriver");
Connection con = DriverManager.getConnection("jdbc:sql4es://localhost:9300/twitter?cluster.name=Test%20Cluster");
Statement st = con.createStatement();
ResultSet rs = st.executeQuery("SELECT user,avg(size),count(*) FROM tweet GROUP BY user");
ResultSetMetaData rsmd = rs.getMetaData();
int nrCols = rsmd.getColumnCount();
while (rs.next()) {
    for (int i = 1; i <= nrCols; i++) {
        System.out.println(rs.getObject(i));
    }
}
rs.close();
con.close();
Running Spark with Elassandra
A modified version of the elasticsearch-hadoop connector is available for Elassandra at https://github.com/vroyer/elasticsearch-hadoop.
This connector works with Spark as described in the Elasticsearch documentation available at https://www.elastic.co/guide/en/elasticsearch/hadoop/current/index.html.
For example, to submit a Spark job in client mode:

bin/spark-submit --driver-class-path <yourpath>/elasticsearch-spark_2.10-2.2.0.jar --master spark://<sparkmaster>:7077 --deploy-mode client <application.jar>
CHAPTER 7: Testing
Elasticsearch comes with a testing framework based on JUnit and RandomizedRunner, provided by the randomizedtesting project. Most of these tests work with Elassandra to ensure compatibility between Elasticsearch and Elassandra.
Testing environment
By default, JUnit creates one instance of each test class and executes each @Test method in parallel, in many threads. Because Cassandra uses many static variables, concurrent testing is not possible, so each test is executed sequentially (using a semaphore to serialize tests) on a single-node Elassandra cluster listening on localhost, see ESSingleNodeTestCase (https://github.com/strapdata/elassandra/blob/master/core/src/test/java/org/elasticsearch/test/ESSingleNodeTestCase.java). Test configuration is located in src/test/resources/conf; data and logs are generated in target/tests/.
Between each test, all indices (and underlying keyspaces and tables) are removed to keep tests idempotent and avoid conflicts on index names. The system settings es.synchronous_refresh and es.drop_on_delete_index are set to true in the parent pom.xml.
Finally, the testing framework randomizes the locale settings (representing a specific geographical, political, or cultural region), but Apache Cassandra does not support such a setting because string manipulations are implemented with the default locale settings (see CASSANDRA-12334). For example, String.format("SELECT %s FROM ...", ...) is computed as String.format(Locale.getDefault(), "SELECT %s FROM ...", ...), causing errors for some locale settings. As a workaround, a javassist byte-code manipulation in the Ant build step adds a Locale.ROOT argument to these method calls in all Cassandra classes.
Elassandra unit test
Elassandra unit tests allow using both the Elasticsearch API and CQL requests, as shown in the following sample.
public class ParentChildTests extends ESSingleNodeTestCase {

    @Test
    public void testCQLParentChildTest() throws Exception {
        process(ConsistencyLevel.ONE, "CREATE KEYSPACE IF NOT EXISTS company3 WITH replication={ 'class':'NetworkTopologyStrategy', 'DC1':'1' }");
        process(ConsistencyLevel.ONE, "CREATE TABLE company3.employee (branch text,\"_id\" text, name text, dob timestamp, hobby text, primary key ((branch),\"_id\"))");
        assertAcked(client().admin().indices().prepareCreate("company3")
            .addMapping("branch", "{ \"branch\": {} }")
            .addMapping("employee", "{ \"employee\" : { \"discover\" : \".*\", \"_parent\" : { \"type\": \"branch\", \"cql_parent_pk\":\"branch\" } }}")
            .get());
        ensureGreen("company3");

        assertThat(client().prepareIndex("company3", "branch", "london")
            .setSource("{ \"district\": \"London Westminster\", \"city\": \"London\", \"country\": \"UK\" }")
            .get().isCreated(), equalTo(true));
        assertThat(client().prepareIndex("company3", "branch", "liverpool")
            .setSource("{ \"district\": \"Liverpool Central\", \"city\": \"Liverpool\", \"country\": \"UK\" }")
            .get().isCreated(), equalTo(true));
        assertThat(client().prepareIndex("company3", "branch", "paris")
            .setSource("{ \"district\": \"Champs Élysées\", \"city\": \"Paris\", \"country\": \"France\" }")
            .get().isCreated(), equalTo(true));

        process(ConsistencyLevel.ONE, "INSERT INTO company3.employee (branch,\"_id\",name,dob,hobby) VALUES ('london','1','Alice Smith','1970-10-24','hiking')");
        process(ConsistencyLevel.ONE, "INSERT INTO company3.employee (branch,\"_id\",name,dob,hobby) VALUES ('london','2','Bob Robert','1970-10-24','hiking')");

        assertThat(client().prepareSearch().setIndices("company3").setTypes("branch")
            .setQuery(QueryBuilders.hasChildQuery("employee", QueryBuilders.rangeQuery("dob").gte("1970-01-01")))
            .get().getHits().getTotalHits(), equalTo(1L));
        assertThat(client().prepareSearch().setIndices("company3").setTypes("employee")
            .setQuery(QueryBuilders.hasParentQuery("branch", QueryBuilders.matchQuery("country", "UK")))
            .get().getHits().getTotalHits(), equalTo(2L));
    }
}
To run this specific test:

$mvn test -Pdev -pl com.strapdata.elasticsearch:elasticsearch -Dtests.seed=56E318ABFCECC61 -Dtests.class=org.elassandra.ParentChildTests -Des.logger.level=DEBUG -Dtests.assertion.disabled=false -Dtests.security.manager=false -Dtests.heap.size=1024m -Dtests.locale=de-GR -Dtests.timezone=Etc/UTC
To run all unit tests:
$mvn test
CHAPTER 8: Breaking changes and limitations
Deleting an index does not delete cassandra data
By default, Cassandra is considered the primary data store for Elasticsearch, so deleting an Elasticsearch index does not delete Cassandra content; keyspace and tables remain unchanged. If you want to use Elassandra as Elasticsearch, you can configure your cluster, or only some indices, with drop_on_delete_index like this.
$curl -XPUT "$NODE:9200/twitter/" -d'{
"settings":{ "index":{ "drop_on_delete_index":true } }
}'
Or, to set drop_on_delete_index at cluster level:
$curl -XPUT "$NODE:9200/_cluster/settings" -d'{
"persistent":{ "cluster.drop_on_delete_index":true }
}'
Cannot index document with empty mapping
Elassandra cannot index any document for a type having no mapped properties and no underlying clustering key, because Cassandra cannot create a secondary index on the partition key and there are no other indexed columns. Example:
$curl -XPUT "$NODE:9200/foo/bar/1?pretty" -d'{}'
{
  "_index" : "foo",
  "_type" : "bar",
  "_id" : "1",
  "_version" : 1,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}
The underlying Cassandra table foo.bar has only a primary key column and no secondary index, so search operations won't return any result.
cqlsh> DESC KEYSPACE foo;

CREATE KEYSPACE foo WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '1'} AND durable_writes = true;

CREATE TABLE foo.bar (
    "_id" text PRIMARY KEY
) WITH bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = 'Auto-created by Elassandra'
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

cqlsh> SELECT * FROM foo.bar;

 _id
-----
   1

(1 rows)
To get the same behavior as Elasticsearch, just add a dummy field in your mapping.
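For instance, creating the index with a single mapped field is enough to get a searchable type; a hedged sketch in the curl style used above (the field name dummy is illustrative):

```
$curl -XPUT "$NODE:9200/foo/" -d'{
   "mappings":{ "bar":{ "properties":{ "dummy":{ "type":"string" } } } }
}'
```

The dummy column then becomes an indexed column in the underlying Cassandra table, so search requests can match documents of this type.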
Nested or Object types cannot be empty
Because Elasticsearch nested and object types are backed by a Cassandra User Defined Type, they require at least one sub-field.
Document version is meaningless
Elasticsearch's versioning system helps to cope with conflicts, but in a multi-master database like Apache Cassandra, versioning cannot ensure global consistency of compare-and-set operations.
In Elassandra, Elasticsearch version management is disabled by default: the document version is no longer indexed in Lucene files and is always 1. This simplification improves write throughput and reduces the memory footprint by eliminating the in-memory version cache implemented in the Elasticsearch internal Lucene engine.
If you want to keep the Elasticsearch internal Lucene file format, including a version number for each document, you should create your index with index.version_less_engine set to false, like this:

$curl -XPUT "$NODE:9200/twitter/" -d'{
   "settings":{ "index.version_less_engine":false }
}'
Finally, if you need to avoid conflicts on write operations, you should use Cassandra lightweight transactions (or PAXOS transactions). Such lightweight transactions are also used when updating the Elassandra mapping or when indexing a document with op_type=create, but of course, they come with a network cost.
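As a sketch, a Cassandra lightweight transaction adds an IF clause to the write; the twitter.tweet table and its columns below are illustrative, not from the source:

```
cqlsh> INSERT INTO twitter.tweet ("_id", message) VALUES ('1', 'hello') IF NOT EXISTS;
```

The statement only succeeds if no row with that primary key exists, at the cost of an extra PAXOS round-trip between replicas.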
Index and type names
Because Cassandra does not support special characters in keyspace and table names, Elassandra automatically replaces dot (.) and dash (-) characters with underscore (_) in index and type names when creating the underlying Cassandra keyspaces and tables. When such a modification occurs, Elassandra keeps this change in memory to correctly convert keyspace/table to index/type.
Moreover, Cassandra table names are limited to 48 characters, so Elasticsearch type names are also limited to 48 characters.
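A minimal shell sketch of this conversion (an illustration of the rule stated above, not Elassandra's actual code path):

```shell
# Replace dots and dashes with underscores, then truncate to 48 characters,
# mimicking how Elassandra derives a keyspace name from an index name.
index_name="logstash-2015.05.18"
keyspace=$(printf '%s' "$index_name" | tr '.-' '__' | cut -c1-48)
echo "$keyspace"   # → logstash_2015_05_18
```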
Column names
For Elasticsearch, field mapping is unique within an index. So, two columns with the same name, indexed in the same index, must have the same CQL type and share the same Elasticsearch mapping.
Null values
To be able to search for null values, Elasticsearch can replace null with a default value (see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/null-value.html). In Elasticsearch, an empty array is not a null value, whereas in Cassandra, an empty array is stored as null and replaced by the default null value at index time.
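Such a default is declared in the mapping; a hedged sketch following the Elasticsearch 2.4 null_value mapping parameter (the status field name is an assumption):

```
$curl -XPUT "$NODE:9200/twitter/_mapping/tweet" -d'{
   "properties":{ "status":{ "type":"string", "null_value":"NULL" } }
}'
```

With this mapping, a Cassandra null in the status column is indexed as the string "NULL" and can be matched by a term query.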
Elasticsearch unsupported features
• Tribe node, which allows querying multiple Elasticsearch clusters, is not currently supported by Elassandra.
• Elasticsearch snapshot and restore operations are disabled (See backup and restore in operations).
Cassandra limitations
• Elassandra only supports the murmur3 partitioner.
• The thrift protocol is supported only for read operations.
• Elassandra synchronously indexes rows into Elasticsearch. This may increase the write duration, particularly when indexing complex documents like GeoShape, so Cassandra write_request_timeout_in_ms is set to 5 seconds (the Cassandra default is 2000 ms, see Cassandra config).
• In order to avoid concurrent mapping or persistent cluster settings updates, Elassandra plays a PAXOS transaction that requires QUORUM available nodes for the keyspace elastic_admin to succeed. So it is recommended to have at least 3 nodes in 3 distinct racks (a two-node datacenter won't accept any mapping update when a node is unavailable).
• CQL3 TRUNCATE on a Cassandra table deletes all associated Elasticsearch documents by playing a delete_by_query where _type = <table_name>. Of course, such a delete_by_query comes with a performance cost.
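The 5-second write timeout mentioned above corresponds to this line in conf/cassandra.yaml (Elassandra's shipped default):

```
write_request_timeout_in_ms: 5000
```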
CHAPTER 9: Indices and tables