Scalability with Apache Cassandra
Krisantus Sembiring
Jie Lu
Fengyi Hong
Computer Science Department
University of Crete
Heraklion, Greece
[email protected]
[email protected]
[email protected]
ABSTRACT
As the era of information explosion arrives, one huge server holding all the necessary data has become a thing of the past, and flash crowds that appear for only a short while can determine the fate of a business. Distributed, highly scalable databases have therefore become a serious research area. In this paper, we present our experiments with, and improvements to, dynamic scalability in one of the leading open source implementations, Apache Cassandra.

Keywords
Scalability, Apache Cassandra, Distributed Database

1. Introduction
A flash crowd causes a sudden load spike, but it is hard to justify paying tens of thousands of dollars for resources that are only occasionally needed [3]. By dynamically adding and removing resources as needed, one can handle the load spike while minimizing the cost required.
In a distributed database cluster, the resource is the number of VM instances running. Our objective in this project is to implement dynamic scaling up and down in a Cassandra cluster. To do so, we have to figure out how to identify highly loaded and lightly loaded nodes. The goal of scaling up is to alleviate a heavily loaded node by adding a new node; conversely, the goal of scaling down is to remove a lightly loaded node. The idea is to monitor the workload on each node and then scale up or down as needed.
The challenges in this project are how to monitor workload, and how to scale up and down while clients might be reading from or writing to the node being scaled.
The rest of this report is structured as follows: section 2 introduces Apache Cassandra; section 3 describes our working platform; section 4 covers the implementation; section 5 presents the evaluation; and the last section concludes.

2. Background
2.1 Cassandra
The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. Data can have several copies to ensure availability inside one cluster or across clusters. A cluster refers to the machines (nodes) in a logical Cassandra instance, and clusters can contain multiple keyspaces. To understand these concepts, we present next how Cassandra manages its data.

2.2 Data Management
The column is the lowest/smallest increment of data. It is a tuple (triplet) that contains a name, a value and a timestamp. All values are supplied by the client, including the timestamp. This means that clocks on the clients should be synchronized (this is useful in the Cassandra server environment as well), as these timestamps are used for conflict resolution. In many cases the timestamp is not used in client applications, and it becomes convenient to think of a column as a name/value pair. For the remainder of this document, timestamps will be elided for readability. A column family is a container for columns, analogous to a table in a relational system. In Cassandra, each column family is stored in a separate file, and the file is sorted in row-major order. Related columns, those that you will access together, should be kept within the same column family. A key space is the first dimension of the Cassandra hash, and is the container for column families. Key spaces are of roughly the same granularity as a schema or database in an RDBMS.
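Cassandra's column model, a (name, value, timestamp) triple with last-write-wins conflict resolution, can be sketched in a few lines of Python. This is a simplified illustration of the concept, not Cassandra's actual storage code:

```python
from collections import namedtuple

# A column is a (name, value, timestamp) triple; the client supplies the timestamp.
Column = namedtuple("Column", ["name", "value", "timestamp"])

def resolve(a, b):
    """Last-write-wins: when two replicas disagree about the same column,
    the one with the newer timestamp wins."""
    return a if a.timestamp >= b.timestamp else b

# Two replicas hold different versions of the same column.
old = Column("email", "old@example.com", 1000)
new = Column("email", "new@example.com", 2000)
winner = resolve(old, new)
```

This is why client clocks must be synchronized: the timestamps, not arrival order, decide which write survives.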
2.3 Cassandra Architecture
Table 1 shows Cassandra's layered architecture.
Table 1. Architecture
Core Layer: Messaging service, Gossip, Failure detection, Cluster state, Partitioner, Replication
Middle Layer: Commit log, Memtable, SSTable, Compaction
Top Layer: Tombstones, Hinted handoff, Read repair, Bootstrap, Monitoring, Admin tools
In this report we will primarily describe the parts tightly connected with normal (non-failure-recovery) operation: the Partitioner and Replication in the Core Layer; the Commit log, Memtable, SSTable and Compaction in the Middle Layer; and Tombstones, Bootstrap and Monitoring in the Top Layer.
First of all, Cassandra offers eventual consistency. Why? As written above, each piece of data has more than one copy in the database. The replication factor is the total number of replicas a key space has. The number of write replicas, W for short, is how many replicas must be updated before a write operation succeeds. Similarly, the number of read replicas, R for short, is how many replicas must be read before data is actually returned. Distributed databases give up some of the guarantees traditional databases provide (full ACID semantics) in exchange for durability and availability; in other words, eventual consistency. Eventual consistency ensures that data is available almost anytime, with accuracy. When W + R > N (the replication factor), on every read the client will compare the fetched replicas and return the most recent data. Users can set these three numbers to trade off read and write speed.
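The W + R > N condition above can be checked mechanically. A small illustrative sketch (not Cassandra code):

```python
def is_strongly_consistent(w, r, n):
    """True when every read quorum must overlap every write quorum (W + R > N),
    guaranteeing a read sees the most recent successful write."""
    return w + r > n
```

For example, with a replication factor of 3, quorum reads and writes (W = R = 2) overlap, while W = R = 1 gives only eventual consistency with faster operations.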
The Node Manager selects a node either randomly or as the node with the lowest workload. The client can then establish a connection to the selected node. If a connection cannot be established, the client simply requests another node. This spreads client requests across the cluster.
2.4 Operations
A write can be sent to any node; internally it involves the Partitioner, CommitLog, Memtable, SSTable and Compaction, and it should wait for all W responses.
Writing includes several steps: first write to a disk commit log, then send the update to the appropriate nodes; those nodes first write to their local logs and then update their memtables. A memtable is not flushed to disk unless it runs out of space, it has too many keys, or a user-defined time between flushes has elapsed. When a flush happens, the system also updates the SSTable, which is the data file for the system, and the SSTable index. Periodically, data files are merged and sorted into a new file (with a new index). This is called compaction. During compaction keys are merged, columns are combined, and data marked with tombstones is discarded.
Reading waits for R responses. The client connects to any node; the read operation compares the results from R replicas and returns the most recent one.
When deleting, like most other distributed databases, Cassandra does not directly delete the data. If it did (the operation would also be applied only to the selected write replicas), the nodes not yet updated would regard the already-deleted data as missing its last update and would repair it. So Cassandra marks deleted data with additional bits, setting up a tombstone. When the next compaction comes, all data marked as deleted is permanently removed.
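The tombstone mechanism can be sketched as follows. This is an illustrative model, not Cassandra internals: a delete only marks the entry, and a later compaction purges the marked entries.

```python
TOMBSTONE = object()  # sentinel marking a deleted value

def delete(store, key):
    # Deletion only marks the key; the data is not removed yet, so replicas
    # that missed the delete see the tombstone instead of "missing data".
    store[key] = TOMBSTONE

def compact(store):
    # Compaction permanently drops entries marked with a tombstone.
    return {k: v for k, v in store.items() if v is not TOMBSTONE}

store = {"a": 1, "b": 2}
delete(store, "a")
store = compact(store)
```

The tombstone is what prevents an out-of-date replica from "repairing" a deleted value back into existence before compaction runs.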
3. Platform
Our working environment is a Eucalyptus cloud. We had at most a total of 6 VMs running; each has a 1 GB volume attached where Cassandra and other prerequisite files are stored. We use Apache Cassandra 0.7.2 and JRE 6 update 24. Each VM instance uses an Ubuntu 10.04 image and the m1.xlarge type.
One VM instance is used for each Cassandra node. Cassandra data and commit log files are stored on the VM's secondary hard disk (~18 GB). In the next section, unless specified otherwise, we do not change the default values of the Cassandra configuration file.
We created a bash script to enable creation of a Cassandra cluster with an arbitrary number of nodes. The script also requires the number of seed nodes to allocate; the recommended number of seed nodes in one cluster is two [1]. Each time a cluster is created, we always specify the initial tokens in order to balance the token ranges between nodes. Following the recommendation in [1], we use the formula below to calculate the initial token for each node. The parameter "nodes" here is the number of nodes to allocate.
def tokens(nodes):
    # Evenly partition the 2**127 token space of the RandomPartitioner.
    for x in range(nodes):
        print(2 ** 127 // nodes * x)
4. Implementation
Figure 1 shows the architecture of our implementation. The client application requests a connection to a Cassandra node from a program that consists of two modules: the Node Manager and the Monitor. This program is written in Java.
Figure 1 Architecture
4.1 Monitor
The monitor module is responsible for retrieving workload information via the Java Management Extensions (JMX) API from all Cassandra nodes in the cluster. The retrieval is done periodically, for instance every 10 seconds.
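A monitor of this kind reduces to a periodic polling loop. A minimal sketch follows; `poll_node` and the node list are hypothetical placeholders for the real JMX retrieval:

```python
import time

def poll_node(host):
    # Placeholder for the real JMX retrieval of CPU, memory, latency, etc.
    return {"host": host, "cpu": 0.0}

def monitor(nodes, interval_s=10, rounds=1):
    """Collect one workload snapshot per node every `interval_s` seconds."""
    snapshots = []
    for i in range(rounds):
        snapshots.append([poll_node(n) for n in nodes])
        if i < rounds - 1:
            time.sleep(interval_s)  # wait before the next polling round
    return snapshots

snaps = monitor(["node1", "node2"], interval_s=10, rounds=1)
```

The polling interval is a trade-off: shorter intervals give the Node Manager fresher data for node selection, at the cost of more JMX traffic.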
JMX provides management of Cassandra nodes in two key ways. First, JMX enables users to understand a node's health and overall performance in terms of memory, threads, and CPU usage, things that are generally applicable to any Java application. Second, JMX allows users to work with specific aspects of Cassandra that have been instrumented.
Based on our observation, out of the rich information exposed by Cassandra through JMX, we select the following data as the key indicators of a node's workload: percentage of CPU usage, percentage of memory usage, read count, write count, read latency and write latency. In addition, we also monitor the hard disk usage of each Cassandra node.
Please note that this is not the load information reported by JMX but the percentage of available space on the hard drive where Cassandra data is stored. This seems a trivial matter, but we have seen an entire cluster go down as Cassandra ran out of space. Moreover, some Cassandra operations, e.g. compaction, often require extra free disk space to complete successfully. Running out of space in a production environment would thus be very troublesome, so disk usage is very important monitoring data.
If adding a new hard disk or increasing the capacity of a hard disk in a running VM instance is not supported, then disk usage should also be considered when scaling up and down.
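Disk usage of the data directory can be sampled with the standard library. A minimal sketch; the 75% threshold mirrors the scale-up example later in this report, and the path is illustrative:

```python
import shutil

def disk_usage_percent(path="/"):
    """Return used space as a percentage of total for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

# Flag the node if the volume holding Cassandra data is nearly full.
needs_scale_up = disk_usage_percent("/") >= 75.0
```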
For CPU usage, JMX provides a method to get process CPU time
via OperatingSystemMXBean. In order to calculate CPU usage
we use the following equation.
CPU_usage = (CPU_time1 - CPU_time0) / (nano_time1 - nano_time0)
The time interval between the two measurements is 100 ms. We have compared the output of this equation with the output of the top program in Linux. Although it is less sensitive, the output is consistent. We do not use top because it is much slower in our test environment, even after limiting the iteration to one frame. The ps command is much faster, but it shows a snapshot (the average of accumulated CPU usage from the moment the process started) instead of a real-time view.
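The same two-sample calculation can be illustrated in Python. The report uses JMX's OperatingSystemMXBean from Java; this standard-library sketch is only an analogue of that measurement:

```python
import time

def cpu_usage(interval_s=0.1):
    """Estimate this process's CPU usage as (delta CPU time) / (delta wall time)."""
    cpu0, wall0 = time.process_time_ns(), time.monotonic_ns()
    time.sleep(interval_s)  # measurement interval, 100 ms as in the report
    cpu1, wall1 = time.process_time_ns(), time.monotonic_ns()
    return (cpu1 - cpu0) / (wall1 - wall0)

usage = cpu_usage()
```

As with the JMX version, a shorter interval makes the estimate more responsive but noisier.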
4.2 Node Manager
To select the Cassandra node the client should connect to, the Node Manager maintains a list of available nodes. This list is updated based on the reports received from the monitor module.
There are two strategies for node selection: random or load balanced. To select the less loaded of two nodes, CPU usage is compared first; if the difference is not significant, then memory usage is compared, then read latency, and finally write latency. The drawback of load-balanced selection is that it is probably slower, and if the monitoring data is not updated fast enough, a node with a higher workload might be selected instead.
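The tie-breaking comparison above can be sketched as follows; the field names and the "significant difference" margins are illustrative assumptions, not values from the report:

```python
def less_loaded(a, b):
    # Metrics compared in priority order; a difference below the margin is
    # considered insignificant and falls through to the next metric.
    margins = (("cpu", 5.0), ("mem", 5.0), ("read_lat", 1.0))
    for metric, margin in margins:
        if abs(a[metric] - b[metric]) > margin:
            return a if a[metric] < b[metric] else b
    # Final tie-breaker: write latency.
    return a if a["write_lat"] <= b["write_lat"] else b

n1 = {"cpu": 70.0, "mem": 40.0, "read_lat": 2.0, "write_lat": 1.0}
n2 = {"cpu": 20.0, "mem": 45.0, "read_lat": 2.5, "write_lat": 1.5}
```

Here n2 wins on CPU despite its slightly worse memory and latency figures, which is exactly the priority ordering described above.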
In addition to providing clients with connections to Cassandra nodes, the Node Manager's other responsibility is to perform scale-up, scale-down and cleanup operations when the monitoring data exceeds the thresholds specified for these operations. The Node Manager also has to make sure only a limited number of nodes can scale up or down at the same time, to minimize the performance decrease.
We initially wrote bash scripts for the scaling up and down operations in order to test scaling manually. After that, scaling is done dynamically by having the Node Manager execute these scripts as needed.
4.3 Scaling up
The goal of scaling up is to alleviate the workload of a heavily loaded node by adding a new node to the cluster. To identify a heavily loaded node, a threshold can be specified, for instance: disk usage >= 75%, or (memory usage > 50% and CPU usage > 60%) and the node is not cleaning up. The condition that the node is not cleaning up is added because the cleanup operation is very CPU intensive, and from our observation CPU usage can jump high (often > 60%) while it runs; scaling up should therefore be avoided while this operation is in progress. The state of other Cassandra operations which might also be CPU intensive can be monitored through JMX. Furthermore, in our implementation we only allow one scale-up operation at a time, for performance reasons.
The steps performed when scaling up are as follows:
1. Allocate a new VM instance.
2. Calculate the initial token.
3. Bootstrap the new node at the specified token.
4. Schedule a cleanup operation on the source node(s).
For step 2, in our implementation the heavily loaded node gives up half of its token range to the new node. In the token ring, the new node is placed exactly before the heavily loaded node. Cassandra also provides automatic selection of the initial token based on the data size in each node. However, for this to work the cleanup operation must be performed before the next scale up; otherwise, the next bootstrap attempt will be thrown off. In our implementation we avoid this by specifying the initial token, hence allowing the next scale up to be performed immediately in order to handle the flash crowd. We just keep a list of nodes that need a cleanup job and schedule the cleanup to be performed later, ideally during off-peak hours. The nodes requiring cleanup are the neighboring nodes that shared the same sub-range with the new node.
4.4 Scaling down
The goal of scaling down is to remove a lightly loaded node, hence reducing cost by minimizing the number of VM instances running. To identify a lightly loaded node, a threshold can be specified, for instance: memory usage < 50% and CPU usage < 2% and no node is scaling up. For performance reasons, we only allow one node to be removed at a time.
The steps performed when scaling down are as follows:
1. Remove the node from the Node Manager list.
2. Decommission the lightly loaded node.
3. Terminate the VM instance.
4.5 Cleanup
The cleanup operation, depending on the data size, can take a relatively long time to finish. For instance, in our setup, cleanup on a node with a few hundred megabytes of data takes about 5 minutes, and with more than 1 GB it can take about 15 minutes. This operation is required to keep the node's reported load correct (the load here is the size of the data as reported by Cassandra through JMX). The cleanup operation removes extra replicas; it is usually performed after a bootstrap operation, as the data given up to the new node is not automatically removed from the source node(s). As we always specify the initial token for bootstrap, the cleanup operation can be postponed to off-peak hours. The steps to perform cleanup are:
1. Invoke the cleanup operation on the specified node.
2. Remove the node from the list of nodes to clean up.
Because of performance concerns, in our implementation we only allow one cleanup operation at a time, and we also make sure no other node is scaling up or down.
5. Evaluation
To evaluate our implementation, we created a client application sending read and write requests to the cluster. The client is written using the Java Thrift API. It uses the simple data model shown in Table 2.
Table 2 Client data model
Key space: one key space (replication factor as configured for the cluster)
Column families: Assignment (columns Id, Content); Grade (columns Name, Mark)
Average payload: under 100 KB per request
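The initial-token calculation used when a heavily loaded node gives up half of its token range (section 4.3) can be sketched as follows; `prev_token` denotes the token of the loaded node's predecessor on the ring and is an illustrative name:

```python
RING_SIZE = 2 ** 127  # token space of Cassandra's RandomPartitioner

def split_token(prev_token, loaded_token):
    """Token for a new node that takes over half of the heavily loaded node's
    range (the range between its predecessor and itself on the ring)."""
    width = (loaded_token - prev_token) % RING_SIZE  # handles ring wrap-around
    return (prev_token + width // 2) % RING_SIZE

# A loaded node owning the range (0, 2**126] gives up its first half.
new_token = split_token(0, 2 ** 126)
```

Placing the new node at this midpoint, exactly before the loaded node, means the new node immediately takes over half of the loaded node's keys.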
The client consists of 8 threads: two threads each for reading and writing to the Assignment and Grade column families. Every 5 requests, a client thread requests a new connection from the Node Manager, which chooses an available Cassandra node randomly. Each client connection uses a fixed consistency level.
The experiment starts with a cluster of two nodes. The response time for each request (read/write operation) is measured. In addition, the number of failures is counted. There are two common types of exceptions received by the client: UnavailableException if not enough replicas are available, and TimeoutException if a response is not received before the timeout.
As the number of requests processed changes, the workload of each node also changes. Thus, scaling up and down is performed dynamically as the workload crosses the specified thresholds. For this experiment, the CPU threshold for scaling up was lowered to 30% in order to shorten the experiment time.
To trigger scaling up, requests are sent with much smaller delays in between. As a result, a much higher number of requests is processed at a certain point in time: the flash crowd. The experiment results are presented in Figure 2. The first chart shows the number of nodes. The first scale up happens after processing about 20% of the total requests, and the second after 40% and before 60%. As some of the threads finish, scaling down happens after processing 75% of the total requests.
The second chart shows the response time during the experiment. In general the response time is small because the data payload for read and write operations is relatively small (less than 100 KB). As observed from the chart, response time increases until scaling up and down finish. This is because the bootstrap operation puts some stress on the source nodes, as the data required by the new node is streamed from them; conversely, decommissioning streams data from the removed node to the other nodes. Consequently, the number of failed operations slightly increases. After scaling up finishes, the response time improves.
Figure 2 Response time during scaling up and down
6. Conclusion
We have shown that by monitoring the nodes in the cluster using JMX we can dynamically scale up and down. Scaling up is done by bootstrapping a new node at a specified token, and scaling down by decommissioning a specified node.
Based on the experiment results, further investigation is required to avoid the performance decrease before scaling up finishes. It would also be interesting to measure performance when the load-balanced node selection strategy is used in the Node Manager.
7. References
[1] The Apache Cassandra Project. http://cassandra.apache.org/.
[2] Hewitt, Eben. 2010. Cassandra: The Definitive Guide. O'Reilly Media.
[3] Elson, J. and Howell, J. 2008. Handling Flash Crowds from your Garage. USENIX '08: 2008 USENIX Annual Technical Conference.