Scalability with Apache Cassandra

Krisantus Sembiring
Computer Science Department, University of Crete, Heraklion, Greece
+30-6944788396
[email protected]

Jie Lu
Computer Science Department, University of Crete, Heraklion, Greece
+49-17699033401
[email protected]

Fengyi Hong
Computer Science Department, University of Crete, Heraklion, Greece
+49-17699044078
[email protected]

ABSTRACT
In the era of the information explosion, one huge server holding all the necessary data has become a thing of the past, and flash crowds that last only a short while can determine the fate of a business. Highly scalable distributed databases have therefore become a serious research area. In this paper, we present our experiments with, and improvements to, dynamic scalability using one of the leading open-source implementations, Apache Cassandra.

Keywords
Scalability, Apache Cassandra, Distributed Database

1. INTRODUCTION
A flash crowd causes a sudden load spike, but it is hard to justify paying tens of thousands of dollars for resources that are needed only briefly [3]. By dynamically adding and removing resources as needed, one can handle the load spike while minimizing cost. In a distributed database cluster, the resource is the number of running VM instances. Our objective in this project is to implement dynamic scaling up and down of a Cassandra cluster. To do that, we have to figure out how to identify highly loaded and lightly loaded nodes. The goal of scaling up is to relieve a heavily loaded node by adding a new node; the goal of scaling down is to remove a lightly loaded node. The idea is to monitor the workload on each node and then scale up or down as needed. The challenges in this project are how to monitor the workload, and how to scale up and down while clients might be reading from or writing to the node being scaled.

The rest of this report is structured as follows: Section 2 introduces Apache Cassandra; Section 3 describes the working platform; Section 4 presents the implementation; Section 5 presents the evaluation; and the last section concludes.

2. Background
2.1 Cassandra
The Apache Cassandra project develops a highly scalable, second-generation distributed database that brings together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. Data can be kept in several copies to ensure availability within one cluster or across clusters. To understand how this works, we first present how Cassandra manages its data.

2.2 Data Management
The column is the smallest increment of data. It is a tuple (triplet) that contains a name, a value, and a timestamp. All values are supplied by the client, including the timestamp. This means that clocks on the clients should be synchronized (this is useful in the Cassandra server environment as well), as these timestamps are used for conflict resolution. In many cases the timestamp is not used by client applications, and it becomes convenient to think of a column as a name/value pair. For the remainder of this document, timestamps will be elided for readability.

A column family is a container for columns, analogous to a table in a relational system. In Cassandra, each column family is stored in a separate file, and the file is sorted in row-major order. Related columns, those that you will access together, should be kept within the same column family.

A keyspace is the first dimension of the Cassandra hash and is the container for column families. Keyspaces are of roughly the same granularity as a schema or database in an RDBMS.
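To make these definitions concrete, the minimal sketch below models the three containers as nested Java maps. It is purely illustrative: the Column class and the map layout are our own simplification, not Cassandra's API, and the "Grade"/"Mark" names are just sample data.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class DataModelSketch {
        /** A column: a (name, value, timestamp) triplet; the timestamp is client-supplied. */
        static class Column {
            final String name;
            final byte[] value;
            final long timestamp;
            Column(String name, byte[] value, long timestamp) {
                this.name = name; this.value = value; this.timestamp = timestamp;
            }
        }

        public static void main(String[] args) {
            // A column family maps a row key to its columns (kept sorted, as on disk).
            Map<String, TreeMap<String, Column>> columnFamily = new HashMap<>();
            // A keyspace is the container for column families, comparable to a schema.
            Map<String, Map<String, TreeMap<String, Column>>> keyspace = new HashMap<>();
            keyspace.put("Grade", columnFamily);

            // One row ("student1") holding a single column.
            TreeMap<String, Column> row = new TreeMap<>();
            row.put("Mark", new Column("Mark", "A".getBytes(),
                    System.currentTimeMillis() * 1000));
            columnFamily.put("student1", row);
        }
    }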
A cluster refers to the machines (nodes) in a logical Cassandra instance. Clusters can contain multiple keyspaces.

2.3 Cassandra Architecture
Cassandra's architecture is summarized in Table 1.

Table 1. Architecture
  Core Layer:   Messaging service, gossip, failure detection, cluster state, partitioner, replication
  Middle Layer: Commit log, memtable, SSTable, indexes, compaction
  Top Layer:    Tombstones, hinted handoff, read repair, bootstrap, monitoring, admin tools

In this report we primarily discuss the parts that are not concerned with failure recovery: the partitioner and replication in the Core Layer; the commit log, memtable, SSTable, and compaction in the Middle Layer; and tombstones, bootstrap, and monitoring in the Top Layer.

First of all, Cassandra has the property of eventual consistency. Why? As noted above, each piece of data has more than one copy in the database. The replication factor N is the total number of replicas a keyspace has. The write count, W for short, is the number of replicas that must be updated before a write operation succeeds; similarly, the read count, R for short, is the number of replicas that must be read from before data is actually returned. Distributed databases give up some of the guarantees traditional databases have, trading full ACID semantics for durability and availability; in other words, eventual consistency. Eventual consistency ensures that the data is available almost anytime, and accurately so when W + R > N: every read then overlaps the latest write in at least one replica, so the client can compare the fetched copies and return the most recent data. For example, with N = 3, choosing W = 2 and R = 2 gives W + R = 4 > 3, so every read quorum intersects every write quorum. Users can set these three numbers to achieve different read and write speeds.

2.4 Operations
A write can be issued to any node; it involves the partitioner, the commit log, the memtable, the SSTable, and compaction, and it waits for W responses. Writing includes several steps: the receiving node first writes to an on-disk commit log and then sends the update to the appropriate replica nodes; these nodes first write to their local logs and then update their memtables. A memtable is not flushed to disk unless it runs out of space, holds too many keys, or a user-defined time between flushes has elapsed. When a flush happens, the system also writes an SSTable, the on-disk data file, together with an SSTable index, the index for that file. Periodically, data files are merged and sorted into a new file (with a new index). This is called compaction. During compaction, keys are merged, columns are combined, and data marked with tombstones is discarded.

A read waits for R responses. The client connects to any node; the read operation compares the results from the R replicas and returns the most recent one.

When deleting, like most other distributed databases, Cassandra does not delete the data directly. If it did (the operation would be applied only to the selected write replicas), the nodes not yet updated would regard the already-deleted data as a missing update and would "repair" it. So Cassandra uses an additional marker for deleted data, the tombstone. At the next compaction, all data marked as deleted is physically removed.
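The read path just described can be illustrated with a small sketch: among the R replica responses, the copy with the latest client-supplied timestamp wins. The Replica class and method names below are ours, for illustration only; Cassandra's internal read-repair machinery is more involved.

    import java.util.Comparator;
    import java.util.List;

    class Replica {
        byte[] value;
        long timestamp; // client-supplied, used for conflict resolution
        Replica(byte[] value, long timestamp) {
            this.value = value; this.timestamp = timestamp;
        }
    }

    class ReadReconciler {
        // Expects at least R responses; with W + R > N the most recent
        // acknowledged write is guaranteed to be among them.
        static Replica mostRecent(List<Replica> responses) {
            return responses.stream()
                    .max(Comparator.comparingLong(r -> r.timestamp))
                    .orElseThrow(IllegalStateException::new);
        }
    }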
3. Platform
Our working environment is a Eucalyptus cloud. We ran at most six VMs; each has a 1 GB volume attached where Cassandra and other prerequisite files are stored. We use Apache Cassandra 0.7.2 and JRE 6 update 24. Each VM instance uses an Ubuntu 10.04 image and the m1.xlarge type. One VM instance is used for each Cassandra node. Cassandra data and commit log files are stored on the VM's secondary hard disk (~18 GB). In the following sections, unless specified otherwise, we do not change the default values of the Cassandra configuration file.

We created a bash script that enables the creation of a Cassandra cluster with an arbitrary number of nodes. The script also takes the number of seed nodes to allocate; the recommended number of seed nodes per cluster is two. Each time a cluster is created, we always specify the initial tokens in order to balance the token ranges between nodes. Following the recommendation in the Cassandra documentation [1], we use the formula below to calculate the initial token for each node; the parameter "nodes" is the number of nodes to allocate.

    def tokens(nodes):
        for x in xrange(nodes):
            print 2 ** 127 / nodes * x

4. Implementation
Figure 1 shows the architecture of our implementation. A client application requests a connection to a Cassandra node from a program that consists of two modules: the Node Manager and the Monitor. This program is written in Java.

Figure 1. Architecture

4.1 Monitor
The monitor module is responsible for retrieving workload information from all Cassandra nodes in the cluster via the Java Management Extensions (JMX) API. The retrieval is done periodically, for instance every 10 seconds. JMX provides management of Cassandra nodes in two key ways. First, JMX enables the user to understand a node's health and overall performance in terms of memory, threads, and CPU usage, things that are generally applicable to any Java application. Second, JMX allows the user to work with the specific aspects of Cassandra that have been instrumented. Based on our observations, out of the rich information Cassandra exposes through JMX, we select the following data as the key indicators of a node's workload: CPU usage (percent), memory usage (percent), read count, write count, read latency, and write latency.

In addition, we monitor the hard disk usage of each Cassandra node. Note that this is not the load information reported by JMX but the percentage of space used on the disk where the Cassandra data is stored. This may seem a trivial matter, but we have seen an entire cluster go down when Cassandra ran out of space. Moreover, some Cassandra operations, e.g. compaction, often require extra free disk space to complete successfully. Running out of space in a production environment would therefore be very troublesome, which makes disk usage an important piece of monitoring data. If adding a new hard disk or increasing the capacity of a hard disk in a running VM instance is not supported, then disk usage should also be considered when scaling up and down.

For CPU usage, JMX provides a method to get the process CPU time via OperatingSystemMXBean. To calculate CPU usage we use the following equation:

    CPU_usage = (CPU_time_1 - CPU_time_0) / (nano_time_1 - nano_time_0) * 100%

The time interval between the two measurements is 100 ms. We have compared the output of this equation with the output of the top program in Linux; although it is less sensitive, the output is consistent. We do not use top because it is much slower in our test environment, even after limiting it to one iteration. The ps command is much faster, but it shows a snapshot (the average of the accumulated CPU usage since the process started) instead of a real-time view.
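A minimal sketch of this measurement for the local JVM is shown below; in our monitor the same counters are read remotely over JMX, and everything other than the MXBean calls is our own illustrative naming.

    import java.lang.management.ManagementFactory;

    public class CpuUsageSampler {
        public static double sampleCpuUsage() throws InterruptedException {
            com.sun.management.OperatingSystemMXBean os =
                    (com.sun.management.OperatingSystemMXBean)
                            ManagementFactory.getOperatingSystemMXBean();
            long cpu0 = os.getProcessCpuTime();  // CPU time consumed, in nanoseconds
            long t0 = System.nanoTime();
            Thread.sleep(100);                   // 100 ms between measurements, as in the text
            long cpu1 = os.getProcessCpuTime();
            long t1 = System.nanoTime();
            // CPU_usage = (CPU_time_1 - CPU_time_0) / (nano_time_1 - nano_time_0) * 100%
            // On a multi-core machine this can exceed 100% unless it is also
            // divided by os.getAvailableProcessors().
            return (cpu1 - cpu0) / (double) (t1 - t0) * 100.0;
        }

        public static void main(String[] args) throws InterruptedException {
            System.out.printf("CPU usage: %.1f%%%n", sampleCpuUsage());
        }
    }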
4.2 Node Manager
To select the Cassandra node a client should connect to, the Node Manager maintains a list of available nodes. This list is updated based on the reports received from the monitor module. There are two node-selection strategies: random and load-balanced. With random selection, the Node Manager picks any node; with load-balanced selection, it picks the node with the lowest workload. The client then establishes a connection to the selected node; if a connection cannot be established, the client simply requests another node. This spreads client requests across the cluster.

To decide which of two nodes is less loaded, CPU usage is compared first; if the difference is not significant, memory usage is compared, then read latency, and finally write latency. The drawback of load-balanced selection is that it is slower, and if the monitoring data is not updated quickly enough, a node with a higher workload might be selected instead.

Besides providing clients with connections to Cassandra nodes, the Node Manager performs the scale-up, scale-down, and cleanup operations when the monitoring data exceeds the thresholds specified for these operations. The Node Manager also makes sure that only a limited number of nodes scale up or down at the same time, to minimize the performance impact. We initially wrote bash scripts for the scale-up and scale-down operations in order to test scaling manually. Scaling is then done dynamically by having the Node Manager execute these scripts as needed.

4.3 Scaling up
The goal of scaling up is to relieve the workload of a heavily loaded node by adding a new node to the cluster. To identify a heavily loaded node, a threshold can be specified, for instance: disk usage >= 75%, or (memory usage > 50% and CPU usage > 60%) while the node is not cleaning up. The condition that the node is not cleaning up is added because the cleanup operation is very CPU intensive, and from our observations CPU usage can jump high (often above 60%) while it runs; scaling up should therefore be avoided while this operation is in progress. The state of other Cassandra operations that might also be CPU intensive can be monitored through JMX. Furthermore, for performance reasons our implementation allows only one scale-up operation at a time.

The steps performed when scaling up are as follows:
1. Allocate a new VM instance.
2. Calculate the initial token.
3. Bootstrap the new node at the specified token.
4. Schedule a cleanup operation on the source node(s).

For step 2, in our implementation the heavily loaded node gives up half of its token range to the new node; in the token ring, the new node is placed immediately before the heavily loaded node. Cassandra also provides automatic selection of the initial token based on the data size of each node. However, for this to work, the cleanup operation must be performed before the next scale-up; otherwise the next bootstrap attempt will be thrown off. In our implementation we avoid this by specifying the initial token ourselves, which allows the next scale-up to be performed immediately in order to handle the flash crowd. We simply keep a list of the nodes that need a cleanup job and schedule the cleanup to be performed later, ideally during off-peak hours. The nodes requiring cleanup are the neighboring nodes that shared the same sub-range with the new node.

4.4 Scaling down
The goal of scaling down is to remove a lightly loaded node, reducing cost by minimizing the number of running VM instances. To identify a lightly loaded node, a threshold can be specified, for instance: memory usage < 50% and CPU usage < 2% while no node is scaling up. For performance reasons, we only allow one node to be removed at a time. The steps performed when scaling down are as follows (the sketch after this list summarizes both trigger conditions):
1. Remove the node from the Node Manager's list.
2. Decommission the lightly loaded node.
3. Terminate the VM instance.
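The trigger conditions of Sections 4.3 and 4.4 can be summarized in the following sketch; the NodeStats fields and class names are ours, and the thresholds are the example values given above.

    // Illustrative only: field and method names are assumptions, not the paper's code.
    class NodeStats {
        double cpuUsage, memoryUsage, diskUsage; // percent
        boolean cleaningUp;                      // cleanup in progress on this node
    }

    class ScalingPolicy {
        boolean anyNodeScalingUp; // maintained by the Node Manager

        boolean shouldScaleUp(NodeStats n) {
            boolean overloaded = n.diskUsage >= 75.0
                    || (n.memoryUsage > 50.0 && n.cpuUsage > 60.0);
            // Cleanup is CPU intensive; never treat a cleaning node as overloaded.
            return overloaded && !n.cleaningUp;
        }

        boolean shouldScaleDown(NodeStats n) {
            return n.memoryUsage < 50.0 && n.cpuUsage < 2.0 && !anyNodeScalingUp;
        }
    }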
4.5 Cleanup
The cleanup operation, depending on the data size, can take a relatively long time to finish. For instance, in our setup, cleanup on a node with a few hundred megabytes of data takes about 5 minutes, and with more than 1 GB it takes about 15 minutes. This operation is required to keep the node's load information correct (the load here is the size of the data as reported by Cassandra through JMX), and it can be used to remove extra replicas. It is usually performed after a bootstrap operation, as the data given up to the new node is not automatically removed from the source node(s). Since we always specify the initial token for bootstrap, the cleanup operation can be postponed to off-peak hours. The steps to perform a cleanup are:
1. Invoke the cleanup operation on the specified node.
2. Remove the node from the list of nodes to clean up.

For performance reasons, our implementation allows only one cleanup operation at a time and also makes sure that no other node is scaling up or down.

5. Evaluation
To evaluate our implementation, we created a client application that sends read and write requests to the cluster. The client is written using the Java Thrift API and uses the simple data model shown in Table 2.

Table 2. Client data model
  Keyspace:            Student
  Replication factor:  2
  Column family:       Assignment     Grade
  Columns:             Id, Content    Name, Mark
  Average payload:     100 kb         200 bytes

The client consists of 8 threads: two threads each for reading and writing to the Assignment and Grade column families. Every 5 requests, a client thread requests a new connection from the Node Manager, which chooses an available Cassandra node at random. Each client connection uses consistency level ONE; a sketch of such a client follows below.
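The sketch below shows a minimal read/write client against Cassandra 0.7's Thrift-generated Java API. The host address, keys, and error handling are simplified, and the generated signatures (in particular the Column constructor) vary between Cassandra versions, so treat this as an assumption-laden illustration rather than our exact client.

    import java.nio.ByteBuffer;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class GradeClient {
        public static void main(String[] args) throws Exception {
            // In our setup the endpoint would come from the Node Manager; fixed here.
            TTransport tr = new TFramedTransport(new TSocket("127.0.0.1", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
            tr.open();
            client.set_keyspace("Student");

            ByteBuffer key = ByteBuffer.wrap("student1".getBytes("UTF-8"));
            try {
                // Write one column to the Grade column family at consistency level ONE.
                Column mark = new Column(ByteBuffer.wrap("Mark".getBytes("UTF-8")),
                                         ByteBuffer.wrap("A".getBytes("UTF-8")),
                                         System.currentTimeMillis() * 1000);
                client.insert(key, new ColumnParent("Grade"), mark, ConsistencyLevel.ONE);

                // Read it back.
                ColumnPath path = new ColumnPath("Grade");
                path.setColumn(ByteBuffer.wrap("Mark".getBytes("UTF-8")));
                ColumnOrSuperColumn result = client.get(key, path, ConsistencyLevel.ONE);
                System.out.println(new String(result.getColumn().getValue(), "UTF-8"));
            } catch (UnavailableException e) {
                // Not enough replicas available; counted as a failure in our experiment.
            } catch (TimedOutException e) {
                // No response before the timeout; also counted as a failure.
            } finally {
                tr.close();
            }
        }
    }

In the actual experiment, each of the 8 client threads issues a stream of such requests and asks the Node Manager for a fresh connection every 5 requests.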
The experiment starts with a cluster of two nodes. The response time of each request (read or write operation) is measured, and the number of failures is counted. There are two common exceptions received by the client: UnavailableException, when not enough replicas are available, and TimedOutException, when no response is received before the timeout.

As the number of requests being processed changes, the workload of each node also changes; scaling up and down is therefore performed dynamically whenever the workload crosses the specified thresholds. For this experiment, the CPU threshold for scaling up is lowered to 30% in order to shorten the experiment time. To trigger scaling up, requests are sent with a much smaller delay in between, so that a much higher number of requests is processed at a certain point in time: the flash crowd.

The experiment results are presented in Figure 2. The first chart shows the number of nodes. The first scale-up occurs after about 20% of the total requests have been processed; the second occurs after 40% and before 60% of the requests. As some of the threads have already finished, scaling down happens after about 75% of the total requests. The second chart shows the response time during the experiment. In general the response times are small because the data payload of each read and write operation is relatively small (less than 100 kb). As observed from the chart, the response time increases until scaling up or down finishes. This is because the bootstrap operation puts some stress on the source nodes, since the data the new node needs is streamed from them; conversely, decommissioning streams data from the removed node to the remaining nodes. Consequently, the number of failed operations increases slightly. After scaling up finishes, the response time improves.

Figure 2. Response time during scaling up and down

6. Conclusion
We have shown that by monitoring the nodes in a cluster through JMX we can scale the cluster up and down dynamically. Scaling up is done by bootstrapping a new node at a specified token, and scaling down by decommissioning a specified node. Based on the experiment results, further investigation is required to avoid the performance decrease before scaling up finishes. It would also be interesting to measure the performance when the load-balanced node-selection strategy is used in the Node Manager.

7. REFERENCES
[1] The Apache Cassandra Project. http://cassandra.apache.org/
[2] Hewitt, Eben. 2010. Cassandra: The Definitive Guide. O'Reilly Media.
[3] Elson, J. and Howell, J. 2008. Handling Flash Crowds from your Garage. In USENIX '08: 2008 USENIX Annual Technical Conference.