VMware vFabric GemFire ™ High Performance, Distributed Main-Memory and Events Platform.

TECHNICAL WHITE PAPER
Table of Contents
1. Premise
1.1 Traditional design focus – ACID transactions and IO
1.2 Manage Data in Pooled Cluster Memory
1.3 Context Rich, Active Events
1.4 ‘Data-Aware’ Behavior and Parallel Execution
2. Object Storage Model
2.1 Dynamic Membership Based Distributed System
2.2 Failure Detection
3. Deployment Architectures
3.1 Peer to Peer (P2P)
3.2 Super-Peers (AKA Client-Server)
3.3 Gateway Connected Distributed Systems
4. Replication and Consistency
4.1 Replicated Regions
4.2 Partitioned Regions
5. Horizontal Partitioning with Dynamic Rebalancing
6. Partitioning with Redundancy
7. Persistence – ‘Shared-Nothing Operations Logging’
8. Caching Plug-Ins
9. Programming Model – “Hello World” Example
9.1 (I) Configure Cache Servers (Create <cache.xml>)
9.2 (II) Start the Cache Server Locator (discovery service)
9.3 (III) Launch the Cache Servers
9.4 (IV) Coding the Java Client
10. Reliable Publish Subscribe and Continuous Querying
10.1 Continuous Querying
10.2 ‘Delta Propagation’
10.3 Contextual Information Available at Memory Speeds
10.4 High Availability and Durability through Memory-Based Replication
11. Performance Benchmark
12. Replicated Region Query Test
13. Partitioned Region Query Test
14. Partitioned Region Write Test
15. Conclusion
References
1. Premise
1.1 Traditional design focus – ACID transactions and IO
Traditional database design is based on strict adherence to ACID (Atomicity, Consistency, Isolation, Durability)
transactional properties. In practice, when coordinated, enterprise-wide processes are built by integrating
stove-pipe applications together, we find that the individual database becomes merely a participant in a
complex transaction that involves many sources of data with no single coordinator ensuring data consistency
across disparate systems. Often, it is the application (or interfacing humans) that acts as the coordinator and is
required to apply compensatory action upon failures.
The traditional database has taken the ‘one size fits all’ approach to data management [1] with a design that
is overly centralized, disk oriented and reliant on extensive locking for concurrency control. As the features
offered have increased over the years, the complexity of the database engine has increased. The design has
a natural impedance to highly concurrent data access. It is highly optimized to address the single
biggest bottleneck for disk oriented databases – disk IO performance. That choice makes perfect sense:
disk access can be 100X slower than RAM. Initial attempts at memory-based data repositories were based on
centralized designs and only focused on preserving ACI properties for all data access [8].
Figure 1: Typical Relational Database Architecture
In contrast, vFabric GemFire™ starts with a design where the goal is to completely eliminate disk as the bottleneck, where data is managed primarily in memory, and where data is distributed across a cluster of nodes
for scalability and performance. In the vFabric GemFire™ design, there is no presumption of the requirement
for ACID transactions and the developer is given the flexibility of options and a conscious choice as to the
degree of consistency required for concurrent data modifications. A choice of consistency does not need to
be applied across the entire system, but instead can be tailored to specific data sets or logical data partitions.
This flexibility in data consistency enables use cases where applications concerned about the scalability
and availability characteristics of data can trade-off, or relax, the needs for strict data consistency, while
preserving the ability to employ strict consistency in other parts of the system as needed. In other words,
vFabric GemFire™ gives system designers the ability to optimize high performance solutions by flexibly tuning
consistency, availability, and partition-tolerance at different places in their application.
Traditional OLTP system design promotes the idea of a database tier for managing all data. This approach
overlooks the significant need for managing transient data in the application tier - session state, workflow
state, and the caching of frequently accessed data. Many caching solutions exist today and often, caching is
incorporated into middle tier platforms. These implementations, however, tend to be too relaxed with respect
to data consistency, to lack support for managing referential data integrity, or to be limited by
the available process memory on a single machine.
1.2 Manage Data in Pooled Cluster Memory
The design premise of vFabric GemFire™ is to distribute data across a cluster of nodes, providing a choice of
full replication to selected nodes, or fault tolerant partitions of data across selected nodes. While logging every
single update to disk is supported, it is usually unnecessary because each update is synchronously applied
to one or more replicas in memory. vFabric GemFire™ allows data to be queried, to participate in transactions,
to share the process space of the application and to be synchronized with external data repositories
(synchronously or asynchronously). Experience shows us that the working data set for most OLTP applications
is small enough to be held in the composite memory of a reasonable number of commodity servers. vFabric
GemFire™ shifts the focus from disk IO optimization to one that is optimized for distribution. The internal data
structures are optimized to conserve memory as a resource and are designed for highly concurrent access.
To effectively address the continuously changing load patterns across applications and to optimize resource
utilization, vFabric GemFire™ is designed to adapt to the dynamic expansion or contraction in the capacity of
the cluster that may accompany demand spikes or ebbs. The design also focuses on being able to handle a
myriad of Byzantine failure conditions. Data availability is ensured through efficient replication techniques
among distributed processes and/or replication across clusters that are geographically isolated. By replacing
disk IO with memory access and capitalizing on the increased, reliable network bandwidth now available,
vFabric GemFire™’s goal is to provide near-instantaneous propagation of updates across networks. The benefits
observed in customer deployments today indicate that this approach performs far better and is more cost
effective than database replication and/or the use of highly-available disk storage clusters. vFabric GemFire™
presents a clear opportunity for quick ROI.
1.3 Context Rich, Active Events
Recently, there has been a surge in ‘event driven’ application architectures that integrate event producers
and event consumers. Often the applications or services that communicate through events also share one or
more databases. Information propagated in events is typically kept to a minimum, in order to avoid degrading
the performance of the Message Oriented Middleware, which itself is usually employing disk IO to ensure
fault tolerance. The event data is detached from its contextual relationships and, during its trip through the
messaging system, does not participate in the transactional environment of the database. As a result, the first
thing that the receiver of an event must usually do is to make a call to the database to re-establish the context
of the message and ensure that it is working with the latest version of the information. The messaging middleware, in order to ensure its own integrity, introduces an additional transactional requirement, resulting in the
complexity and overhead of two-phase commit protocols.
vFabric GemFire™, as an alternative, allows applications to subscribe directly to data changes through interest
registration and query expressions. Any changes to the data or its relationships are propagated reliably as
change notifications to the subscriber. vFabric GemFire™ combines the power of stream data processing
capabilities with traditional database management. It extends the semantics offered in database management
systems with support for ‘Continuous Querying’ which eliminates the need for application polling and supports
the rich semantics of event driven architectures. With vFabric GemFire™, applications can execute ad-hoc
queries or register queries for continuous execution. New data updates are evaluated against registered queries
and change notifications are propagated through the data fabric, in a reliable manner, to interested, distributed
applications.
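The register-then-notify flow described above can be sketched in plain Java. This is a single-process illustration only: the class ContinuousQuerySketch and its methods are hypothetical names chosen for this example and are not part of the vFabric GemFire™ API. A query is modeled as a predicate, and every update is evaluated against the registered predicates so that matching changes are pushed to the subscriber instead of being polled for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

// Illustrative only: a single-process stand-in for a data region that
// evaluates registered queries on every update. Not the GemFire API.
class ContinuousQuerySketch {
    // A registered query: a predicate plus the callback to notify.
    private static final class Registration {
        final Predicate<Double> query;
        final BiConsumer<String, Double> listener;

        Registration(Predicate<Double> query, BiConsumer<String, Double> listener) {
            this.query = query;
            this.listener = listener;
        }
    }

    private final Map<String, Double> region = new ConcurrentHashMap<>();
    private final List<Registration> registrations = new CopyOnWriteArrayList<>();

    void register(Predicate<Double> query, BiConsumer<String, Double> listener) {
        registrations.add(new Registration(query, listener));
    }

    // Each update is stored, then pushed to every subscriber whose query matches.
    void put(String key, Double value) {
        region.put(key, value);
        for (Registration r : registrations) {
            if (r.query.test(value)) {
                r.listener.accept(key, value);
            }
        }
    }

    // Runs a small scenario and returns the notifications that were delivered.
    static List<String> demo() {
        ContinuousQuerySketch prices = new ContinuousQuerySketch();
        List<String> received = new ArrayList<>();
        prices.register(p -> p > 100.0, (k, v) -> received.add(k + "=" + v));
        prices.put("VMW", 95.0);   // does not match, no notification
        prices.put("VMW", 101.5);  // matches, subscriber is notified
        prices.put("IBM", 130.0);  // matches, subscriber is notified
        return received;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [VMW=101.5, IBM=130.0]
    }
}
```

In a distributed deployment the listener would run in a remote subscriber and the notification would travel over the network; the sketch leaves both out to keep the evaluation step visible.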
1.4 ‘Data-Aware’ Behavior and Parallel Execution
Database vendors introduced stored procedures so that applications could off-load data intensive application
logic to the database node where the behavior could be collocated with the data. vFabric GemFire™ extends
this paradigm by supporting invocation of application defined functions on highly partitioned data nodes such
that the behavior execution itself can be parallelized. Behavior is routed to and executed on the node(s) that
host(s) the data required by the behavior. Application functions can be executed on just one node, executed in
parallel on a subset of nodes or in parallel across all the nodes. This programming model is similar to the
Map-Reduce model popularized by Google.
Just as with ‘stored procedures’, applications can pass an arbitrary list of arguments and expect one or
more results back. In addition, the function invocation API also allows the application to hint about the keys
(corresponding to entries managed in the data regions) the function may depend on. These keys provide the
routing information required to precisely identify the members where the function should be parallelized. If no
routing information is provided, the function is executed only on a single data host or in parallel on all the data
hosts. If parallelized, results from each member are streamed back, aggregated and returned to the caller. The
application has the choice to provide custom aggregation algorithms as well.
Data-aware function routing is most appropriate for applications that require iteration over multiple data
items (such as query or custom aggregation functions). By using vFabric GemFire™ features to collocate the
relevant data and by parallelizing the function, the overall processing throughput will be dramatically increased.
More importantly, the calculation latency is inversely proportional to the number of nodes on which it can be
parallelized. GemStone customers have used these features to reduce complex calculations from taking hours
to complete down to seconds.
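As a rough illustration of data-aware routing, the following plain-Java sketch routes a function to the nodes that own the supplied keys, runs it on each node's local data in parallel, and aggregates the partial results. The placement rule (hash of the key modulo the node count), the fixed cluster size and all names are assumptions made for this example; they do not reflect the vFabric GemFire™ function execution API or its actual partitioning scheme.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustrative only: in-process "nodes" stand in for partitioned data hosts.
class DataAwareRoutingSketch {
    static final int NODES = 4; // hypothetical cluster size

    // Assumed placement rule: a key lives on node hash(key) mod NODES.
    static int ownerOf(String key) {
        return Math.floorMod(key.hashCode(), NODES);
    }

    // Runs fn once on each node that owns at least one routing key,
    // in parallel, and sums the partial results returned by the nodes.
    static long execute(List<Map<String, Long>> nodes, Set<String> routingKeys,
                        Function<Map<String, Long>, Long> fn) {
        Set<Integer> targets = new TreeSet<>();
        for (String key : routingKeys) {
            targets.add(ownerOf(key));
        }
        ExecutorService pool = Executors.newFixedThreadPool(NODES);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int node : targets) {
                Map<String, Long> localData = nodes.get(node);
                Callable<Long> task = () -> fn.apply(localData);
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // collect each member's result
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Spreads eight orders over the nodes and totals their amounts.
    static long demo() {
        List<Map<String, Long>> nodes = new ArrayList<>();
        for (int i = 0; i < NODES; i++) {
            nodes.add(new HashMap<>());
        }
        Set<String> keys = new TreeSet<>();
        for (int i = 1; i <= 8; i++) {
            String key = "order-" + i;
            nodes.get(ownerOf(key)).put(key, 10L * i);
            keys.add(key);
        }
        return execute(nodes, keys,
                local -> local.values().stream().mapToLong(Long::longValue).sum());
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 360
    }
}
```

Passing a smaller key set to execute restricts the work to the owning nodes only, which is the point of supplying routing keys.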
The rest of this paper is organized as follows: We review additional important concepts and features offered in
vFabric GemFire™, expose elements of the runtime architecture that speak to the low latency, data consistency
and high availability aspects of the design. We provide a glimpse into the programming model and how reliable
publish subscribe semantics is offered as a core part of the distributed container. Finally, we describe a simple
performance benchmark.
2. Object Storage Model
For simplicity and high performance, vFabric GemFire™ manages data as object entries (key-value pairs) in
concurrent hash maps. The average vFabric GemFire™ application gets access to the data through an enhanced
Map interface (for instance, as java.util.Map in Java) and configures a variety of attributes that control options
such as:
Replication (Will synchronized copies be maintained on each node?)
Partitioning (Will the entries be striped across nodes?)
Disk Storage (Will the data only be in memory, overflow to disk or be persisted to disk?)
Eviction (Will least recently used entries be evicted when memory usage exceeds a certain threshold?)
Caching Plug-ins To:
• Respond to events (on the server or in clients)
• Read-through (load from external source on a read miss)
• Write-through (synchronously write to external source)
• Write-behind (asynchronously write back to an external data source)
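To make the eviction and plug-in options above concrete, here is a minimal single-process sketch. It is not the vFabric GemFire™ region API: RegionSketch, its constructor arguments and the loader/writer callbacks are hypothetical, and a synchronized access-ordered LinkedHashMap is used for brevity in place of the concurrent hash maps the product uses. It shows read-through on a miss, synchronous write-through, and least-recently-used eviction.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Illustrative only: a map-like region with LRU eviction and two plug-ins.
class RegionSketch<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader;    // read-through plug-in
    private final BiConsumer<K, V> writer;  // write-through plug-in

    RegionSketch(final int maxEntries, Function<K, V> loader, BiConsumer<K, V> writer) {
        // An access-ordered LinkedHashMap drops the least recently used entry
        // once the configured capacity is exceeded.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
        this.loader = loader;
        this.writer = writer;
    }

    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);   // read miss: load from the external source
            if (value != null) {
                entries.put(key, value);
            }
        }
        return value;
    }

    synchronized void put(K key, V value) {
        writer.accept(key, value);       // synchronous write to the external source
        entries.put(key, value);
    }

    synchronized boolean isCached(K key) {
        return entries.containsKey(key);
    }

    static List<String> demo() {
        List<String> log = new ArrayList<>();
        RegionSketch<String, String> region = new RegionSketch<String, String>(2,
                k -> { log.add("load " + k); return "db-" + k; },
                (k, v) -> log.add("write " + k));
        region.get("a");         // miss, loaded from the external source
        region.get("a");         // hit, no load
        region.put("b", "new");  // written through
        region.get("c");         // miss; capacity exceeded, "a" is evicted
        log.add("a cached: " + region.isCached("a"));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [load a, write b, load c, a cached: false]
    }
}
```

A write-behind plug-in would differ only in that the writer callback enqueues the change and returns immediately, leaving a background thread to update the external source.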
Arbitrarily complex objects can be stored in vFabric GemFire™. vFabric GemFire™ only imposes a single
requirement on these objects - it requires them to be serializable. This collection of key-value pairs along
with the aforementioned configurable attributes constitutes a ‘Data Region’. In some sense, a data region is
similar to a relational table - except that it is typically managed across multiple servers. For instance, one could
configure a replicated data region for managing ‘product catalog’ objects and have a large partitioned data
region for managing millions of orders spread across many nodes.
Object fields in keys or values in the region can be indexed and regions can be queried using OQL (an ODMG
standard similar to SQL). The querying features do not include all of the fringe modifiers offered in modern
relational databases but are instead designed to offer very high performance, especially for queries based on
relational or Boolean operators (AND, OR, <, >, etc.). Main memory indexes can be created on any of the object
fields (in the root object or a nested field in the object graph). The query engine uses a cost based optimizer to
effectively utilize indexes.
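The benefit of a main memory index for relational operators can be shown with a small sketch. The code below is an assumption-laden illustration, not the vFabric GemFire™ query engine: it keeps a sorted index on one field (quantity) next to the region's key-value map and uses it to answer a condition of the form "quantity > 10 AND status = 'open'" without scanning every entry. The Order class and all other names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a region with one sorted, in-memory index on a field.
class FieldIndexSketch {
    static final class Order {
        final String id;
        final String status;
        final int quantity;

        Order(String id, String status, int quantity) {
            this.id = id;
            this.status = status;
            this.quantity = quantity;
        }
    }

    private final Map<String, Order> region = new HashMap<>();
    // Index: field value -> keys of the entries holding that value.
    private final NavigableMap<Integer, Set<String>> byQuantity = new TreeMap<>();

    void put(Order order) {
        Order old = region.put(order.id, order);
        if (old != null) {
            // Keep the index in step with the region when an entry changes.
            Set<String> keys = byQuantity.get(old.quantity);
            keys.remove(old.id);
            if (keys.isEmpty()) {
                byQuantity.remove(old.quantity);
            }
        }
        byQuantity.computeIfAbsent(order.quantity, q -> new TreeSet<>()).add(order.id);
    }

    // The range condition is served by the index; the status condition is
    // then checked only on the candidates the index returned.
    List<String> query(int minQuantityExclusive, String status) {
        List<String> result = new ArrayList<>();
        for (Set<String> keys : byQuantity.tailMap(minQuantityExclusive, false).values()) {
            for (String key : keys) {
                if (region.get(key).status.equals(status)) {
                    result.add(key);
                }
            }
        }
        return result;
    }

    static List<String> demo() {
        FieldIndexSketch orders = new FieldIndexSketch();
        orders.put(new Order("o1", "open", 5));
        orders.put(new Order("o2", "open", 50));
        orders.put(new Order("o3", "closed", 70));
        orders.put(new Order("o4", "open", 20));
        orders.put(new Order("o1", "open", 90)); // update moves o1 within the index
        return orders.query(10, "open");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [o4, o2, o1]
    }
}
```

A cost based optimizer would additionally decide which of several available indexes to drive the query from; the sketch has only one index and so skips that step.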
Data regions can themselves be nested and contain child data regions. All the data regions fit into a single
logical name-space that is accessible from any member that participates in a vFabric GemFire™ distributed
system (further described below). A typical deployment would consist of a cluster of cache server nodes that
manage the data regions (10s to 100s) and a larger number of client nodes (100s to 1000s) that access these
cache servers. The cache server could be launched using scripts or through APIs. Each cache server defines the
data regions it will manage declaratively (using XML) or explicitly using an API. The client nodes accessing the
cache servers could themselves also host a local ‘edge’ or ‘L1’ cache for fast, in-process cache reads, but they always
delegate to the servers when the data needs to be updated.
These in-process, ‘edge’ caches can be hosted in C++, Java or .NET client processes. All cache servers form
a peer to peer network and are always running in a JVM. vFabric GemFire™ uses a language-neutral binary
format for object representation on the wire that allows data interchange across different languages at very
high speeds without incurring the traditional costs associated with using a self describing interchange format
like XML.
The high level architecture of vFabric GemFire™ is depicted in figure (2) below. The vFabric GemFire™ schema
based on data regions is depicted in figure (3) below.
Figure 2: VMware vFabric GemFire Architecture
Figure 3: Regions host data in one or more members
2.1 Dynamic Membership Based Distributed System
In vFabric GemFire™, members (distributed processes) that host data connect to each other in a peer to peer
(P2P) network to form a ‘Distributed System’. vFabric GemFire™ supports dynamic group membership where
members may join or leave the distributed system at any time with minimal impact to other members in the
Distributed System. This ability to dynamically alter capacity is the most important characteristic that allows
stateful applications to be built without over provisioning for peak demands while still targeting specific SLA
goals on data such as availability or performance. Membership changes do not introduce locking or contention
points with the other members. Members can discover each other either by simply subscribing to a common
multicast channel or using a TCP-based discovery service (called a ‘Locator’) if the network is not enabled for
multicast. The system automatically elects a group membership coordinator - a member that is responsible for
allowing new members to join the distributed system and is responsible for communicating any membership
changes in a consistent fashion to all members of the distributed system.
When vFabric GemFire™ is being used as an embedded data fabric within a clustered application, access
to any data within the cluster will incur, at most, a single network hop, with each application node being
directly connected to every other member of the distributed system. In this model, if the data is managed in a
partitioned manner and if all concurrent access to the data were to be uniformly distributed across the entire
data set, increasing the capacity (the application cluster size) would linearly increase the aggregate throughput
for data access and processing, assuming the network bandwidth doesn’t become the bottleneck. Additionally,
the ‘at most a single hop’ access to any data offers predictable data access latency.
2.2 Failure Detection
vFabric GemFire™ is commonly used to build systems that offer predictable latency and continuous availability.
When any member of the distributed system fails, it is important for other services to detect the loss very
quickly and transition application clients to other members. vFabric GemFire™ uses multiple techniques
to detect such failure conditions. When a member departs, normally other members would be notified
immediately through a dedicated membership event channel. If a member departs abnormally, vFabric
GemFire™ detects this condition using a combination of TCP/IP stream-socket connections and UDP datagram
heartbeats (heartbeats are sent by each member to one of its neighbors forming a ring for failure detection).
Failure to respond to the heartbeat is communicated to other members. If all the members agree on the
offending member being a suspect, it is removed from the membership of the distributed system. When
communicating with peer replicas, vFabric GemFire™ defaults to synchronous communication using an ACK-based protocol. Lack of ACKs within a configured time interval will automatically trigger suspect processing
with the coordinator, which in turn makes the final determination. vFabric GemFire™ supports numerous
properties to configure the timings and tolerances of the failure detection system, allowing it to be tailored
to the conditions experienced in different network environments. It also includes specialized configuration for
diagnosing and properly handling ‘temporarily slow’ members differently from failed members – this control
can avoid dangerous ‘membership thrashing’ that might otherwise occur. Details aside, the important point is
that the distributed system membership view is kept consistent across all members.
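The heartbeat ring and suspect agreement described in this section can be approximated with the following sketch. It is a simplified, single-process model under stated assumptions: time is passed in explicitly, each member watches its successor in the view, and a member is removed only once every other member has reported it as suspect. The class and method names are hypothetical, and the real system's coordinator role and network transport are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: ring-based heartbeat monitoring with unanimous removal.
class RingFailureDetectorSketch {
    private final List<String> view = new ArrayList<>();            // membership view, in ring order
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Map<String, Set<String>> accusers = new HashMap<>(); // suspect -> members reporting it
    private final long timeoutMillis;

    RingFailureDetectorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void join(String member, long now) {
        view.add(member);
        lastHeartbeat.put(member, now);
    }

    void heartbeat(String member, long now) {
        lastHeartbeat.put(member, now);
    }

    // Each member watches the next member in the ring.
    String neighborOf(String member) {
        return view.get((view.indexOf(member) + 1) % view.size());
    }

    // A report counts only if the target's heartbeat is actually overdue.
    // The target leaves the view once all other members have reported it.
    void report(String accuser, String target, long now) {
        if (now - lastHeartbeat.get(target) <= timeoutMillis) {
            return; // target is still healthy
        }
        Set<String> votes = accusers.computeIfAbsent(target, t -> new HashSet<>());
        votes.add(accuser);
        if (votes.size() == view.size() - 1) {
            view.remove(target);
            lastHeartbeat.remove(target);
            accusers.remove(target);
        }
    }

    static List<String> demo() {
        RingFailureDetectorSketch ds = new RingFailureDetectorSketch(500);
        for (String m : new String[] {"A", "B", "C", "D"}) {
            ds.join(m, 0);
        }
        for (String m : new String[] {"A", "B", "D"}) {
            ds.heartbeat(m, 400);                 // C stays silent
        }
        ds.report("B", ds.neighborOf("B"), 600);  // B watches C and raises suspicion
        ds.report("A", "C", 600);                 // the other members confirm
        ds.report("D", "C", 600);
        return new ArrayList<>(ds.view);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [A, B, D]
    }
}
```

Raising the timeout is the simplest way to tolerate a ‘temporarily slow’ member in this model: with a timeout above 600 the reports in the demo would be ignored and the view would stay unchanged.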
3. Deployment Architectures
vFabric GemFire™ supports three primary architectures:
Peer to Peer (P2P) – where all the members have direct connectivity to each other.
Client Server or ‘Super Peer’ Model – where client application processes connect and load balance across a
subset of P2P vFabric GemFire™ cache servers. Each client process may employ an ‘edge’ or ‘near’ cache.
Hub-Spoke Connected Distributed Systems – where any P2P distributed system could be configured to
replicate all or a portion of its managed data using asynchronous store-and-forward gateways to one or more
remote Distributed Systems over LAN or WAN based connections.
3.1 Peer to Peer (P2P)
An ‘embedded P2P’ architecture in vFabric GemFire™ describes a case where application logic is running locally,
in process, with cached data on a set of server nodes. The embedded P2P architecture is most suitable when
the applications are long-running or continuously running and the number of processing nodes is relatively
stable. Short-lived processes can overwhelm the group membership system, and the architecture can become daunting
(from a manageability perspective) over a very large number of nodes.
While all members of a P2P architecture are ‘equal’ from the point of view of membership, the manifestation
of the combinations and permutations of options and hosted regions means that different members may often
play different roles in the architecture. Some members may host no data, but may still process events.
P2P architectures are commonly used to distribute a large data set, and the processing associated with it,
across a linearly scalable set of servers. This model is also used when vFabric GemFire™ is embedded within
clustered JEE application servers as depicted in figure (4).
Figure 4
3.2 Super-Peers (AKA Client-Server)
A super-peer is a member of the Distributed System network that operates both as a server to a set of clients,
and as an equal in a network of peers. vFabric GemFire™ super peer architectures can support thousands of
client connections (such as in a computational grid application), load balanced across a P2P server cluster.
The clients delegate their requests to one of the ‘super’ peers, resulting in a highly scalable architecture. The
trade-off (compared with the P2P architecture) is that it may potentially require an additional network hop in
order to access partitioned data. These clients themselves can host an ‘edge’ cache with the most frequently
accessed data stored locally. Edge cache consistency is maintained through expiry and/or by configuring the
servers to automatically push invalidations or data updates to the clients. Clients always connect to servers
using TCP/IP.
In order to scale the number of concurrent connections handled per server, non-blocking IO is used to multiplex
a large number of incoming connections to a configurable number of worker threads. The incoming channels
use a message streaming protocol that avoids overwhelming the socket buffers and consuming too
much memory. For instance, with a simple communication infrastructure, a thousand connections reading or
writing a one MB object would require one GB of buffer space if no streaming or chunking of the message were
used, expending memory that could otherwise be used to manage cached data.
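The buffer arithmetic in this example can be checked with a short sketch. The 64 KB chunk size is an assumption made purely for illustration, as are the class and method names; the source does not state the chunk size vFabric GemFire™ uses. The code streams a one MB message through a single fixed-size buffer and compares the buffer space needed by a thousand connections with and without chunking.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Illustrative only: chunked transfer keeps the per-connection buffer small.
class ChunkedTransferSketch {
    static final int CHUNK_BYTES = 64 * 1024; // hypothetical chunk size

    // Copies a message through one fixed-size buffer; returns the chunk count.
    static int copyInChunks(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[CHUNK_BYTES];
        int chunks = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            chunks++;
        }
        return chunks;
    }

    // Total buffer space when each connection holds the given number of bytes.
    static long bufferBytes(int connections, long bytesHeldPerConnection) {
        return connections * bytesHeldPerConnection;
    }

    // Streams a 1 MB message; the output stream stands in for a socket.
    static int demoChunks() {
        byte[] message = new byte[1024 * 1024];
        try {
            return copyInChunks(new ByteArrayInputStream(message), new ByteArrayOutputStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoChunks());                    // 16
        System.out.println(bufferBytes(1000, 1024 * 1024));  // 1048576000 (about 1 GB)
        System.out.println(bufferBytes(1000, CHUNK_BYTES));  // 65536000 (about 64 MB)
    }
}
```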
Load information from each peer server is continuously aggregated in the ‘locator’ service and used to
dynamically condition the client load across all super peer servers. Essentially, each client can be configured
to transparently acquire load information, re-connect to a less loaded server and drop the existing connection.
vFabric GemFire™ borrows several concepts highlighted in Staged Event-driven architecture (or SEDA) [7]
where each server can be well-conditioned to load, preventing resources from being overcommitted when
demand exceeds service capacity.
3.3 Gateway Connected Distributed Systems
The P2P Distributed System design requires that the peers are tightly coupled and share a common high
speed network backplane. Traffic among peers is generally synchronous and can be very chatty. To distribute
data across members that span WAN boundaries (or across multiple clusters within a single network), vFabric
GemFire™ offers a novel approach that extends the super peer model described earlier. vFabric GemFire™
provides a gateway feature to provide asynchronous replication to remote Distributed Systems. A gateway
process listens for update, delete, and insert events on one or more data regions and then enqueues the
events in a fault tolerant manner (in memory and/or on disk) for delivery to a remote system. Event ordering
is preserved inside of the queues, and when multiple updates are made against the same key, the events may
optionally be ‘conflated’ (i.e. earlier updates are dropped from the queue and the latest update stays in the
order it occurred). Events that are sent to a remote system are batched based on the optimal message size for
the network in use between the Distributed Systems. On the receiving side, conflict detection and resolution is
delegated to custom application code through a callback that provides access to both the existing and the new
object value. Gateways can be configured to be unidirectional or bidirectional and can be configured to receive
events from multiple remote distributed systems. Gateways can be used to create a number of layouts and
architectures, such as rings, hub-and-spoke, and combinations thereof.
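The conflation and batching behavior of the gateway queue can be sketched as follows. This is an in-memory illustration with hypothetical names, not the gateway implementation: conflation is always on, batches are limited by event count rather than by an optimal message size, and fault tolerance of the queue is not modeled.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: an ordered event queue that conflates updates per key.
class ConflatingQueueSketch {
    // Insertion-ordered map: iteration order is the delivery order.
    private final LinkedHashMap<String, String> pending = new LinkedHashMap<>();

    synchronized void enqueue(String key, String value) {
        pending.remove(key);      // drop the earlier update for this key, if any
        pending.put(key, value);  // the latest update keeps the position in which it occurred
    }

    // Removes and returns up to maxEvents events, oldest first.
    synchronized List<String> nextBatch(int maxEvents) {
        List<String> batch = new ArrayList<>();
        Iterator<Map.Entry<String, String>> it = pending.entrySet().iterator();
        while (it.hasNext() && batch.size() < maxEvents) {
            Map.Entry<String, String> event = it.next();
            batch.add(event.getKey() + "=" + event.getValue());
            it.remove();
        }
        return batch;
    }

    static List<List<String>> demo() {
        ConflatingQueueSketch queue = new ConflatingQueueSketch();
        queue.enqueue("k1", "v1");
        queue.enqueue("k2", "v1");
        queue.enqueue("k1", "v2"); // conflates with the first k1 update
        queue.enqueue("k3", "v1");
        List<List<String>> batches = new ArrayList<>();
        batches.add(queue.nextBatch(2));
        batches.add(queue.nextBatch(2));
        return batches;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [[k2=v1, k1=v2], [k3=v1]]
    }
}
```

Four events were enqueued but only three are delivered, which is the bandwidth saving conflation is meant to provide on a WAN link.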
Figure 5
The gateway design is biased towards high availability of data and propagation of events with the lowest
latency. To achieve sender side continuous availability, events are enqueued on at least two members, playing
primary and secondary roles. Failure of a ‘primary’ results in election of a secondary as the new primary to
continue propagating enqueued events. Similarly, to guard against receiver side failures, each sender node
has visibility to more than one receiver in order to continue propagating updates if one of the receivers fails or
becomes unresponsive.
4. Replication and Consistency
Some of the key principles of our design include:
Designed for high performance with the assumption that concurrent updates don’t typically conflict
(window of conflict being measured in a few milliseconds).
When conflicts do occur, it is the application that is in the best position to resolve the conflict. The application
can be notified with the changes along with the prior state of the object.
Tunability of Consistency, Availability, and Partition-Tolerance at functional points throughout the system.
For example, availability and scale out with predictable performance can be prioritized over
maintaining ACID transactional properties for all data. Thus updates will not be rejected due to concurrent
writes or failure conditions. [“Propagate operations quickly to avoid conflicts. While connected, propagate
often and keep replicas in close synchronization. This will minimize divergence when disconnection
does occur.” [6]]
We assume that distributed locking will be used at a coarse level and when the access is infrequent.
For instance, in an application that manages customer accounts, use distributed locking to prevent multiple
users from logging into a single account, but avoid using any pessimistic concurrency control when “fine
grained” operations are being done.
4.1 Replicated Regions
For replicated regions, all data in the region is replicated to every peer server node that hosts that region.
Replicated Regions use a ‘multiple masters’ replication scheme.
Multiple Masters: When a data region is purely replicated (not partitioned), there is no designated master for
each data entry. Updates initiated from any member are concurrently propagated to each member that hosts
the region. Upon successful processing of the received event, an ACK is sent as a response to the initiating
member. The initiating member waits for an ACK from each replica before returning from the data update call.
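The wait-for-every-ACK behavior can be sketched with standard Java concurrency utilities. The code below is an in-process approximation: Replica, putWithAcks and the thread pool standing in for the network transport are hypothetical names for this example, not vFabric GemFire™ APIs. The update is dispatched to all replicas in parallel and the call returns only once each replica has acknowledged or the timeout has expired.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: parallel dispatch to replicas with an ACK from each.
class AckReplicationSketch {
    static final class Replica {
        final Map<String, String> data = new ConcurrentHashMap<>();

        // Applies the update; the returned value plays the role of the ACK.
        boolean apply(String key, String value) {
            data.put(key, value);
            return true;
        }
    }

    // Returns true once every replica has acknowledged, false on timeout.
    static boolean putWithAcks(List<Replica> replicas, ExecutorService transport,
                               String key, String value, long timeoutMillis) {
        List<CompletableFuture<Boolean>> acks = new ArrayList<>();
        for (Replica replica : replicas) {
            acks.add(CompletableFuture.supplyAsync(() -> replica.apply(key, value), transport));
        }
        try {
            CompletableFuture.allOf(acks.toArray(new CompletableFuture<?>[0]))
                    .get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // a caller could start suspect processing here
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Replicates one update to three replicas and counts how many hold it.
    static int demo() {
        List<Replica> replicas = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            replicas.add(new Replica());
        }
        ExecutorService transport = Executors.newFixedThreadPool(3);
        try {
            boolean acked = putWithAcks(replicas, transport, "order-1", "shipped", 1000);
            int holding = 0;
            for (Replica replica : replicas) {
                if ("shipped".equals(replica.data.get("order-1"))) {
                    holding++;
                }
            }
            return acked ? holding : -1;
        } finally {
            transport.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 3
    }
}
```

Returning immediately after the dispatch loop, without waiting on the futures, would give the fire-and-forget behavior at the cost of not knowing whether the replicas applied the update.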
For updates to Replicated Regions, consistency is configurable, with the following options:
Distribution without ACKS (d-no-ACK): This is the most optimistic model where replication is eager assuming
the traffic on the network has no congestion. vFabric GemFire™, when using TCP as the transport, turns OFF
the Nagle algorithm and avoids sender-side buffering of packets. When using UDP or UDP multicast transports,
sender-side buffering is typically done (unless vFabric GemFire™ flow control kicks in to slow traffic on the
channel due to negative acknowledgements). The sender does not wait for a response from the replica and
returns control to the application the moment the message is routed to the transport layer. Applications use
“d-no-ACK” when lost updates can be tolerated. For instance, in financial trading applications, continuous price
updates on a very active instrument can tolerate data update misses as a new update will replace the value
within a short time window. Note that, in practice, distribution failures due to problems in the transport layer
are raised as alerts notifying network administrators. Also, the use of multiple failure detection protocols in the
system will forcibly disconnect a member from the distributed system if it has become unresponsive, reducing
the probability of replicas diverging from each other for too long.
Distribution with ACKs (d-ACK): Here, replication is eager, where the message is dispatched to each member in
parallel and ACKs are processed as they arrive from each receiver. The invocation completes only after all the
ACKs (one for each replica) have been received. Any update to an existing object is done by swizzling to a new
object to provide atomicity when multiple fields in an object get modified. Along with sending the entire object
over the wire, vFabric GemFire™ also supports sending just the ‘delta’ (only the updated fields) to replicas.
To guarantee the atomicity, by default, vFabric GemFire™ clones the existing object, applies the ‘delta’ and then
replaces the object. The application can choose to serialize concurrent threads using Java synchronization and
avoid the costs associated with cloning. Applications where primitive fields are constantly updated using this
mechanism can see a dramatic boost in performance with significant reduction in the garbage generated in the
‘older generation’ of the JVM.
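The clone, apply, and replace sequence described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `Quote` and `ReplicaEntry` are hypothetical names, not vFabric GemFire™ classes, and an `AtomicReference` stands in for the entry slot that a replica would hold.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical immutable value type; not part of the GemFire API.
final class Quote {
    final String symbol;
    final double bid;
    final double ask;

    Quote(String symbol, double bid, double ask) {
        this.symbol = symbol;
        this.bid = bid;
        this.ask = ask;
    }
}

// Hypothetical holder for one replicated entry.
final class ReplicaEntry {
    private final AtomicReference<Quote> current;

    ReplicaEntry(Quote initial) {
        this.current = new AtomicReference<>(initial);
    }

    // Copy the existing object, apply the changed fields to the copy,
    // then swap the reference in one step. Readers see either the old
    // object or the new one, never a partially updated one.
    void applyDelta(double newBid, double newAsk) {
        current.updateAndGet(old -> new Quote(old.symbol, newBid, newAsk));
    }

    Quote read() {
        return current.get();
    }
}
```

Each delta allocates a new object, which is the cloning cost the text mentions; updating fields in place under a lock avoids that allocation at the price of serializing writers.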
Distribution with ACKs and locking (d-ACK with locking): This is similar to the d-ACK protocol except that
before propagating to the replicas, a distributed lock is acquired on the entry key. If the lock is held by
some other thread or process, the replication can block until the lock becomes available or the request times out.
Replication and Transactions: All transactions are initiated and terminated in a single thread. Any replica
node that initiates a transaction acts as the transaction coordinator and only engages the replicas at commit
time. No distributed locks are acquired until commit time and the design generally assumes that the transactional
unit of work is small and there are no conflicts. Conflicts are detected at commit time and the transaction
is automatically rolled back if it fails. Repeatable read isolation level is provided when the data regions are
accessed using keys. The design avoids the overhead and complexities associated with undo and redo logs
in traditional database systems by associating the transactional working set with the thread of execution.
This allows the working set to be efficiently transmitted as a single batch at commit time and also simplifies
the query engine design. The transactional working set manages the read set (all key-value pairs fetched in
the scope of the transaction) as well as the dirty set (updated entries) and is used for queries only when the
access is based on keys. Any OQL query executed within the scope of the transaction is only executed on the
committed state. So, any transactional updates that need to be subsequently retrieved within the scope of the
transaction have to fetch using primary keys. Essentially, repeatable read semantics is offered to applications
that do key based access.
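A minimal sketch of a transactional working set with commit-time conflict detection is shown here. It is an assumption-laden illustration, not the product's implementation: `VersionedStore` and `Tx` are hypothetical names, per-key version counters are one possible way to detect conflicts, and a `Tx` instance is assumed to be used by a single thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical committed state with a version counter per key.
final class VersionedStore {
    private final Map<String, Object> values = new HashMap<>();
    private final Map<String, Long> versions = new HashMap<>();

    synchronized Object get(String key) { return values.get(key); }

    synchronized long version(String key) { return versions.getOrDefault(key, 0L); }

    // Apply the dirty set as one batch, but only if every key the
    // transaction touched still has the version it observed.
    synchronized boolean commit(Map<String, Long> seen, Map<String, Object> dirty) {
        for (Map.Entry<String, Long> s : seen.entrySet()) {
            if (version(s.getKey()) != s.getValue()) {
                return false; // conflict: someone else committed first
            }
        }
        for (Map.Entry<String, Object> d : dirty.entrySet()) {
            values.put(d.getKey(), d.getValue());
            versions.merge(d.getKey(), 1L, Long::sum);
        }
        return true;
    }
}

// Hypothetical working set owned by the thread running the transaction.
final class Tx {
    private final VersionedStore store;
    private final Map<String, Long> seen = new HashMap<>();
    private final Map<String, Object> readSet = new HashMap<>();
    private final Map<String, Object> dirtySet = new HashMap<>();

    Tx(VersionedStore store) { this.store = store; }

    // Key-based reads come from the working set first (repeatable read).
    Object get(String key) {
        if (dirtySet.containsKey(key)) return dirtySet.get(key);
        if (!seen.containsKey(key)) {
            seen.put(key, store.version(key));
            readSet.put(key, store.get(key));
        }
        return readSet.get(key);
    }

    void put(String key, Object value) {
        if (!seen.containsKey(key)) seen.put(key, store.version(key));
        dirtySet.put(key, value);
    }

    // Nothing reaches the store until commit; a failed commit simply
    // discards the working set, so no undo log is needed.
    boolean commit() { return store.commit(seen, dirtySet); }
}
```

Because updates live only in the working set until commit, a scan over the committed state (the analogue of an OQL query here) would not see them, which matches the key-based access restriction described above.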
Replication in the face of ‘slow receivers’: With synchronous communication to replicas, the sender can be
throttled if any one of the receivers is unable to keep up with the rate of replication. Often, it is the case that
one or more receivers are momentarily slow. When this occurs, you may want the system to continue as fast
as the healthy members can go. vFabric GemFire™ facilitates this by detecting the slowness of a receiver and
automatically switching to ‘asynchronous’ communication using a queue for that member. The queue can
optionally be conflated so continuous updates on the same set of keys are only propagated once. If the receiver
is later able to catch up then the queue is removed and all communication becomes synchronous again.
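Conflation can be illustrated with a small queue that keeps at most one pending update per key. This is a sketch only: `ConflatingQueue` is a hypothetical class, and the choice to keep a key's original queue position when its value is replaced is an assumption made for simplicity.

```java
import java.util.AbstractMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-receiver queue; not part of the GemFire API.
final class ConflatingQueue<K, V> {
    // Insertion-ordered map: one slot per key.
    private final LinkedHashMap<K, V> pending = new LinkedHashMap<>();

    // A newer update for a key that is still queued replaces the older
    // value instead of adding a second entry.
    synchronized void offer(K key, V value) {
        pending.put(key, value);
    }

    // Remove and return the oldest pending entry, or null if empty.
    synchronized Map.Entry<K, V> poll() {
        Iterator<Map.Entry<K, V>> it = pending.entrySet().iterator();
        if (!it.hasNext()) return null;
        Map.Entry<K, V> head = new AbstractMap.SimpleImmutableEntry<>(it.next());
        it.remove();
        return head;
    }

    synchronized int size() {
        return pending.size();
    }
}
```

With this structure the queue length is bounded by the number of distinct keys being updated, so a slow receiver only ever has to catch up on the latest value of each key.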
4.2 Partitioned Regions
For partitioned regions, the data in the region is distributed across every peer server node that hosts that region.
Some peer server nodes may also host replicates of partitions for backup purposes. Partitioned regions use a
‘single master’ replication scheme.
Single Master: When data is partitioned across many members of the distributed system, the system ensures
that there is only a single member at any moment in time that owns an object entry (identified by a primary
key). The key range itself is uniformly distributed across all the members hosting the partitioned region so that
no single member becomes a scalability bottleneck. It is the responsibility of the member owning the object to
propagate the changes to the replicas. All concurrent operations on the same entry are serialized by making
sure that all replicas see the changes in the exact same order. Essentially, when partitioned data regions are in
use, vFabric GemFire™ ensures that all concurrent modifications to an object entry are atomic and isolated from
each other and ‘total ordering’ is preserved across all replicas. When any member fails, the ownership of the
objects is transferred to an alternate node in a consistent manner (i.e. making sure that all peer servers have a
consistent view of the new owner).
Because of the single master scheme used for partitioned regions, alternative consistency mechanisms are not
required nor are they available. Key locking is available through a lock service, but is not enforced by the region
(i.e. all users must respect the lock protocol in order for it to be effective).
Traditional optimistic replication uses lazy replication techniques designed to conserve bandwidth and increase
throughput by batching and lazily forwarding messages. Conflicts are discovered after they happen, and
agreement on the final contents is reached incrementally. System availability is compromised for the sake of
higher throughput.
vFabric GemFire™ instead uses an eager replication model between peers by propagating to each replica in
parallel and synchronously. The approach is tilted in favor of data availability and the lowest possible latency for
propagating data changes. By eagerly propagating to each of its replicas, it is now possible for clients reading
data to be load balanced to any of the replicas.
5. Horizontal Partitioning with Dynamic Rebalancing
vFabric GemFire™ supports horizontal partitioning of entries in a partitioned data region. The default
partitioning strategy is based on a simple hashing algorithm applied against the entry’s key. Alternatively,
applications may configure custom partitioning strategies as well. When using custom partitioning, data in
multiple data regions may be collocated so that related data will always be located on the same server (still
in different regions) regardless of any rebalancing activity.
Unlike static partitioning systems where changes in the cluster size might require either re-hash of all data
or even a restart of the cluster, vFabric GemFire™ uses a variant of consistent hashing. vFabric GemFire™’s
partitioning starts with either the entry’s key (default) or with a ‘routing object’ returned from a custom
partitioning algorithm. These objects are then hashed into logical buckets. The number of buckets is
configurable, but should always be set to be a large prime number since it cannot be changed without
dropping the region from (or restarting) all servers hosting that region. Buckets are then assigned to physical
servers (processes) that are configured to host the region. Partition replicates are also based on the same
buckets. The principal advantage with consistent hashing is that when capacity is changed (added or removed),
it is not necessary to rehash all of the keys and a potential migration of every entry is avoided. When new peer
servers get added, and the administrator initiates rebalancing, a small number of buckets will be moved from
multiple peers (based on the current load on existing members) to the new member. The goal of rebalancing is
to achieve a fair load distribution across the entire system.
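The two-level mapping (key or routing object to bucket, bucket to server) can be sketched as below. The class `BucketRouter`, the round-robin initial assignment, and the bucket-count-based notion of load are all assumptions made for illustration; the actual rebalancing logic considers the current load on members, which this sketch reduces to a simple bucket count.

```java
// Hypothetical router; not part of the GemFire API.
final class BucketRouter {
    private final int[] owner; // bucket id -> member index

    BucketRouter(int bucketCount, int members) {
        owner = new int[bucketCount];
        for (int b = 0; b < bucketCount; b++) {
            owner[b] = b % members; // simple initial spread
        }
    }

    // The bucket depends only on the hash and the fixed bucket count,
    // so it never changes when members join or leave.
    int bucketFor(Object routingObject) {
        return Math.floorMod(routingObject.hashCode(), owner.length);
    }

    int ownerOf(int bucket) {
        return owner[bucket];
    }

    int memberFor(Object routingObject) {
        return owner[bucketFor(routingObject)];
    }

    // Move whole buckets from members holding more than a fair share to
    // the new member. Returns the number of buckets moved.
    int addMember(int newMember, int totalMembers) {
        int fairShare = owner.length / totalMembers;
        int[] load = new int[totalMembers];
        for (int m : owner) load[m]++;
        int moved = 0;
        for (int b = 0; b < owner.length && moved < fairShare; b++) {
            if (load[owner[b]] > fairShare) {
                load[owner[b]]--;
                owner[b] = newMember;
                moved++;
            }
        }
        return moved;
    }
}
```

Only the bucket-to-member table changes during rebalancing; no key is rehashed, and entries in buckets that do not move stay where they are.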
Each region can be configured to use a maximum amount of heap memory. That maximum can be expressed
in terms of maximum number of entries, total cumulative entry size or a maximum percentage of the available
heap. It may be configured separately on each server that hosts the region.
This fine control allows clusters to be built using heterogeneous server platforms and helps avoid the
occurrence of hot-spots (where too much data is stored on a single host with limited resources).
The process of publishing a new entry into a partitioned data region is depicted in figure (6) below. Each peer
member maintains a consistent view of which buckets are assigned to each peer server. Since all peers maintain
open communication channels to every other peer, a request for any data from any peer can be resolved with
at most a single network hop.
Figure 6
One of the key design goals for vFabric GemFire™ was that an increase in the number of peer servers would
translate to near linearly proportional increase in throughput (or ability to handle users), while maintaining
predictable, low latency. Most of the current data partitioning schemes in caching platforms or database designs
are based on sharding and promise linear scaling while making the assumption that concurrent access will be
uniform across all partitions at all times. In practice, data access patterns change over time causing uneven load
across the partitions. A system that can adapt to changing data access patterns and balance the load has a
higher probability of scaling linearly. In vFabric GemFire™, each peer server continuously monitors its resource
utilization (CPU, memory and network usage), throughput, GC pauses and latencies within different layers of
the system. When uneven load patterns are detected, rebalancing may be triggered by applications (using a
Java API) or through administrative action (via the JMX agent or vFabric GemFire™ tools). The smallest unit
of migration is a single, entire bucket. Bucket migration is a non-blocking operation meaning that data reads
and updates can occur while the bucket is being migrated. The creation of a new replica of the bucket is also
a non-blocking operation, where all interleaved updates will also be routed to the new replica. If a larger total
number of buckets is defined for the region, the size of each bucket will be smaller, resulting in smaller amounts
of data being transferred when buckets need to be migrated. Setting the number of buckets too high increases
the administrative load on the cluster. As a rule of thumb, the maximum number of buckets should be a prime
number that is 10 to 100 times larger than the expected maximum number of peer servers that will host the
partitioned region.
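The rule of thumb can be turned into a small calculation. The helper below is hypothetical (it is not a vFabric GemFire™ utility) and simply picks the smallest prime at or above a chosen multiple of the expected maximum server count; the multiplier is assumed to be chosen by the administrator within the 10 to 100 range.

```java
// Hypothetical helper; not part of the GemFire API.
final class BucketCountAdvisor {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (long i = 2; i * i <= n; i++) {
            if (n % i == 0) return false;
        }
        return true;
    }

    // Smallest prime >= expectedMaxServers * multiplier.
    static int suggest(int expectedMaxServers, int multiplier) {
        int candidate = expectedMaxServers * multiplier;
        while (!isPrime(candidate)) {
            candidate++;
        }
        return candidate;
    }
}
```

For example, a region expected to grow to at most 10 servers with a multiplier of 10 yields 101 buckets.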
6. Partitioning with Redundancy
When redundancy is configured for partitioned data regions, as discussed earlier, the replication is synchronous
and the update only returns after an explicit ACK from each replica has been received. Buckets are assigned to
be either primary or secondary buckets. All reads are load balanced across primary and secondary buckets but
writes are always coordinated by the primary and then routed to the secondary buckets. All inserts or updates
to a specific key are serialized by the primary bucket, which applies the update locally and then sends the update
to the secondary buckets. The replication to all secondary buckets is done in parallel. The serialization through
the primary bucket ensures that all replicas see changes to any entry in the same order, ensuring consistency.
By making sure that primary buckets are uniformly spread across all the eligible members, no single member
becomes a choke point.
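The write path through a primary bucket can be sketched as follows. `BucketCopy` and `PrimaryBucket` are hypothetical names, in-process method calls stand in for network messages, and a single lock per bucket is an assumed simplification of per-key serialization.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical copy of a bucket that records updates in arrival order.
final class BucketCopy {
    private final List<String> applied = Collections.synchronizedList(new ArrayList<>());

    void apply(String key, String value) {
        applied.add(key + "=" + value);
    }

    List<String> history() {
        synchronized (applied) {
            return new ArrayList<>(applied);
        }
    }
}

// Hypothetical primary; not part of the GemFire API.
final class PrimaryBucket {
    private final BucketCopy local = new BucketCopy();
    private final List<BucketCopy> secondaries;
    private final Executor executor;

    PrimaryBucket(List<BucketCopy> secondaries, Executor executor) {
        this.secondaries = secondaries;
        this.executor = executor;
    }

    // One update at a time: apply locally, send to all secondaries in
    // parallel, and return only after every secondary has acknowledged.
    synchronized void put(String key, String value) {
        local.apply(key, value);
        List<CompletableFuture<Void>> acks = new ArrayList<>();
        for (BucketCopy secondary : secondaries) {
            acks.add(CompletableFuture.runAsync(() -> secondary.apply(key, value), executor));
        }
        CompletableFuture.allOf(acks.toArray(new CompletableFuture[0])).join();
    }

    List<String> history() {
        return local.history();
    }
}
```

Because the next update cannot start until the previous one has been acknowledged everywhere, every copy records the same sequence even when several threads write the same key concurrently.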
When a member departs (normally or abnormally), primary buckets could be lost. A primary election process
picks the oldest secondary bucket to be the new primary and communicates this change to all peer servers
allowing updates to continue without any interruption. Then, if configured to maintain a certain redundancy
level, the remaining members recover the lost buckets by replicating the newly elected primary buckets. If not
enough members exist or there isn’t sufficient capacity to recover all the lost buckets, the system executes
a best effort algorithm and logs warnings on buckets with compromised redundancy levels. Often, when
machines are brought down for system maintenance, it is acceptable to allow the system to operate with
reduced redundancy for short durations. Delays can be configured causing the system to wait for certain
duration before attempting to recover secondary buckets. This allows the node(s) to come back up, have the
original buckets re-created, have ‘primary’ ownership restored and have the system load and performance
characteristics return back to the way they were before the machine(s) went down.
7. Persistence – ‘Shared-Nothing Operations Logging’
Unlike a traditional database system, vFabric GemFire™ does not manage data and transaction logs in separate
files. All insert, update and delete operations are written to log files in ‘append only’ mode. The design
minimizes the latency for relatively small writes to disk. This is traditionally a significant obstacle for scaling
OLTP applications with traditional database management systems. By avoiding the need to seek the disk block
to synchronously write to, the ‘append only’ operation allows disk seek times to be avoided (if the disk is not
concurrently being used by other processes) and the only cost incurred is the rotational latency (note that even
some of the best disk technology today takes about 2ms to seek to a track). vFabric GemFire™ also relaxes the
requirement to flush to disk even when synchronous writes to disk are configured. The design ensures that any
write has been flushed out of the vFabric GemFire™ process but relies on the optimizations in disk schedulers
to time the block writes to disk. vFabric GemFire™ relies on the underlying disk buffers so that both the I/O
interface and the disk read/write head can operate at full speed.
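An append-only operations log can be sketched with standard Java I/O. `OperationLog` and its record layout (one operation byte followed by two UTF strings) are assumptions made for illustration and do not describe the product's file format. Note that `flush()` hands the bytes to the operating system without forcing them to the physical disk, which mirrors the relaxed flushing described above.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical log; not part of the GemFire API.
final class OperationLog implements Closeable {
    static final byte PUT = 1;
    static final byte DELETE = 2;

    private final DataOutputStream out;

    OperationLog(File file) throws IOException {
        // The 'true' argument opens the file in append mode.
        out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file, true)));
    }

    // Every operation is appended at the end; nothing is updated in place.
    void append(byte op, String key, String value) throws IOException {
        out.writeByte(op);
        out.writeUTF(key);
        out.writeUTF(value);
        out.flush(); // out of the process, not necessarily on disk yet
    }

    @Override
    public void close() throws IOException {
        out.close();
    }

    // Rebuild the latest state by replaying the log from the start.
    static Map<String, String> replay(File file) throws IOException {
        Map<String, String> state = new HashMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                byte op;
                try {
                    op = in.readByte();
                } catch (EOFException endOfLog) {
                    break;
                }
                String key = in.readUTF();
                String value = in.readUTF();
                if (op == DELETE) {
                    state.remove(key);
                } else {
                    state.put(key, value);
                }
            }
        }
        return state;
    }
}
```

Superseded records accumulate in the file, which is why a background step that coalesces old log files is needed in any design of this kind.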
For Replicated Regions, any data loss concern from the sudden crash of a node may be mitigated by
configuring at least one additional replica to also write to disk. By configuring d-ACK based replication, it is
guaranteed that the data has been written to all the replicas and any replica that is configured for synchronous
disk persistence has actually flushed to the file system.
Each member is explicitly configured to use disk persistence at a data region level and does not share the
files at runtime with other members. This permits the use of commodity servers without the costs associated
with expensive storage systems. The log files can be pre-grown to avoid fragmentation issues and the system
automatically rolls to use a new file when one fills up. The old log files are coalesced in the background into a
compressed file on disk. Administrators can explicitly control when the coalescing operation occurs to minimize
thrashing during peak usage hours.
The vFabric GemFire™ operation logging architecture is depicted in figure (7).
Figure 7
8. Caching Plug-Ins
A common usage pattern with vFabric GemFire™ involves bootstrapping the Distributed System from one or
more data sources at startup time. Often the entire data set is loaded or a well defined subset of the data is
loaded into vFabric GemFire™ to act as a distributed cache. This permits the application to issue ad-hoc queries
using OQL with any number of memory based indexes on the managed data. Any updates to the data may
be propagated synchronously (‘write through’) or asynchronously (‘write behind’) to the data sources. When
ad-hoc queries are used and the entire data set is not held in the cache, eviction or expiry of data will result in
inconsistent data being returned. The ‘Overflow to disk’ option does not impact the accuracy of queries, since
indices and primary keys are always kept in memory.
When the application wants to use vFabric GemFire™ purely as a cache, all access to the data has to be based
on primary keys. This usage now permits the cache to evict objects that are not frequently used, and lazily
load on a cache miss. When used as a cache, vFabric GemFire™ can also be plugged in as an L2 cache within
Hibernate or as a caching interceptor within Spring containers.
The following artifacts are offered by vFabric GemFire™ to enable caching environments:
LRU Eviction: A clock based algorithm is used to efficiently evict entries when a certain eviction threshold is
crossed. Multiple policies are configurable to control when the eviction is triggered. For instance, the policy
could be based on the count of the entries managed in a data region, the memory consumed by a data region
or when the heap usage exceeds a certain threshold. With data being managed in the JVM heap, the design
takes into account how modern generational garbage collectors work. The eviction logic avoids
being too aggressive about eviction when most of the heap is consumed by garbage and avoids being too lazy
and causing ‘Out of memory’ conditions. The action taken when the eviction thresholds are crossed is also
configurable. For instance, one action could be to remove the entry from the cache, invalidate it so it can be
refreshed from the data source or simply overflow to disk. When heap based eviction is configured, then the
overflow to disk becomes a ‘safety valve’ that prevents an ‘Out of memory’ condition by moving objects that
aren’t frequently accessed, to disk.
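Count-based eviction with an ‘overflow’ action can be illustrated with the JDK's access-ordered `LinkedHashMap`. This is a stand-in technique, not the clock-based algorithm the text refers to; `LruRegion` is a hypothetical class, and an in-memory map takes the place of the disk store.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical region with an entry-count eviction threshold.
final class LruRegion<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;
    // Stands in for the disk store used by 'overflow to disk'.
    private final Map<K, V> overflow = new HashMap<>();

    LruRegion(int maxEntries) {
        super(16, 0.75f, true); // true = order entries by most recent access
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true
    // removes the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (size() > maxEntries) {
            overflow.put(eldest.getKey(), eldest.getValue());
            return true;
        }
        return false;
    }

    Map<K, V> overflowed() {
        return overflow;
    }
}
```

Swapping the body of `removeEldestEntry` changes the eviction action: dropping the `overflow.put` call corresponds to plain removal from the cache.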
Expiry: Objects in the cache can be configured to have a lifetime (specified through a ‘Time to live’ attribute).
If the object remains ‘inactive’ (no reads or writes) for the time period specified, the object can be removed,
invalidated or moved to disk.
Read through: When an application attempts a read on a key that is not in the cache, a ‘data loader’ callback
may be invoked to load the entry from an external data source. Synchronization is used to prevent multiple
concurrent threads from overwhelming the data source trying to fetch a popular key. These callbacks can be
configured on any node in the Distributed System and irrespective of which member originates the request, the
loader is invoked in the remote member, published in the cache and made available to the caller. Often, when
data is partitioned, the loader is collocated and executed on the node that is hosting the data.
Write through: All updates are synchronously propagated to the data source by invoking a ‘cache writer’
callback. If and only if the writer is successful, the update becomes visible in the cache. If the writer were to
be triggered on a thread that is currently in a transaction, the update to the data source can also participate in
the transaction. If the cache was embedded in a container such as JEE, then vFabric GemFire™ along with the
writer callback can participate in an externally coordinated JTA transaction. When participating in an externally
coordinated transaction, vFabric GemFire™ does not register an XA resource manager but instead relies on
the ‘before completion’ and ‘after completion’ callbacks to commit (or rollback) the transaction in the cache
once the transaction outcome has been determined and communicated to any other participating resource
managers (such as JDBC).
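The read-through and write-through behaviors described above can be sketched together in a few lines. `ThroughCache` is a hypothetical class; a `Function` stands in for the ‘data loader’ callback and a `BiConsumer` for the ‘cache writer’ callback, and the sketch ignores distribution and transactions entirely.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical single-node cache; not part of the GemFire API.
final class ThroughCache<K, V> {
    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // plays the role of the data loader
    private final BiConsumer<K, V> writer; // plays the role of the cache writer

    ThroughCache(Function<K, V> loader, BiConsumer<K, V> writer) {
        this.loader = loader;
        this.writer = writer;
    }

    // Read through: on a miss the loader runs at most once per key, even
    // when many threads ask for the same key at the same time.
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }

    // Write through: the data source is updated first. If the writer
    // throws, the cache is left untouched.
    void put(K key, V value) {
        writer.accept(key, value);
        entries.put(key, value);
    }

    boolean isCached(K key) {
        return entries.containsKey(key);
    }
}
```

`computeIfAbsent` provides the synchronization that keeps a popular key from triggering many simultaneous loads against the data source.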
Write behind: All updates are enqueued in the same order as seen by the Distributed System and delivered
asynchronously to a listener callback. The underlying machinery for how queues are managed in memory
with one or more secondaries, and the batching, conflation and durability semantics, are the same as those used when
replicating asynchronously to remote distributed systems (please see the section on ‘Deployment Architectures’).
9. Programming Model – “Hello World” Example
To illustrate the programming model, we assume the deployment uses a cluster of vFabric GemFire™ cache
server nodes managing the data and clients accessing data using keys and OQL queries. We also assume that
the client is a Java application and embeds a local cache. The C++ and C# APIs are very similar to the Java API.
The typical developer model consists of the following steps:
9.1 (I) Configure Cache Servers (Create <cache.xml>)
This is the declarative means to describe what data is being managed on any member node.
Each cache server is provided an XML description of the cache it will host during startup.
Here is an example:
<cache>
  <!-- Specify the port where this cache server instance will accept client
       connections -->
  <cache-server port="40404" />
  <!-- Each cache instance declares one or more data regions it wants to host -->
  <region name="Customers">
    <!-- region-attributes control how the data is managed within vFabric GemFire™.
         ‘data-policy’ of replicate means the data will be replicated
         to/from other replicates in the distributed system.
         ‘scope’ of distributed-ack means the data will be replicated to
         others with an ACK based protocol
    -->
    <region-attributes data-policy="replicate" scope="distributed-ack">
      <!-- Application wants to explicitly load the data
           from a mysql database upon a cache miss -->
      <cache-loader>
        <class-name>com.company.data.DatabaseLoader</class-name>
        <parameter name="URL">
          <string>jdbc:mysql://myObeseHost/UberDatabase</string>
        </parameter>
      </cache-loader>
    </region-attributes>
  </region>
</cache>
Each cache server announces itself to a discovery service so it is easy for clients to load balance across a farm
of cache servers. This discovery service is called the ‘locator’.
9.2 (II) Start the Cache Server Locator (discovery service)
shell > gemfire start-locator -server=true -port=41111
9.3 (III) Launch the Cache Servers
These are the servers that manage the application data. Each server is connected to every other server in a P2P
network so that they can distribute data to each other quickly and detect any failure conditions in the system. Servers can
discover each other using the locator or optionally use a multicast channel.
Cache servers are started using the following command on each node in the cluster where the application
wants to manage data.
shell > cacheserver start -J-Xmx8000m locator-port=41111 cache-xml-file=<location of cache.xml> -classpath=<classpath>
Here ‘-J-Xmx8000m’ specifies the maximum heap available for data, and the classpath specifies the location of
the application classes that are being invoked to load data.
vFabric GemFire™ cache servers can also be launched using APIs and embedded within any Java process.
9.4 (IV) Coding the Java Client
The client application starts up an edge cache, which will pull data from the remote cache server upon a
miss. All updates are synchronously propagated and replicated to all the cache servers. The client side cache
can be configured using a <cache.xml> similar to a server cache with one notable difference - the multithreaded client typically configures a connection pool to the server cluster.
Here is an example of the client <cache.xml>:
<cache>
  <!-- Point to the discovery service (locator) to create the client connection pool
       to servers -->
  <pool name="CacheServerConnections">
    <locator host="localhost" port="41111" />
  </pool>
  <!-- Client cache wants to host the "Customers" data region (local embedded cache) -->
  <region name="Customers">
    <!-- region-attributes control how the data is managed within vFabric GemFire™.
         ‘data-policy’ of local for a local embedded data region.
         ‘pool’ indicates which servers within the server cluster the client can
         connect to
    -->
    <region-attributes data-policy="local" pool="CacheServerConnections">
    </region-attributes>
  </region>
</cache>
The client application code looks quite straightforward with most of the vFabric GemFire™ specific
configuration hidden away in XML.
The example below connects to the vFabric GemFire™ distributed system, acquires an instance of the local
cache and accesses data regions as Maps.
import java.util.Map;
import java.util.Properties;
import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.query.SelectResults;
import com.gemstone.gemfire.distributed.DistributedSystem;

DistributedSystem distributedSystem = DistributedSystem.connect(new Properties());
Cache cache = CacheFactory.create(distributedSystem);
// Get a reference to the data region (a Region is also a java.util.Map)
Map customers = (Map) cache.getRegion("Customers");
// Put some customer objects
customers.put(<customerKey>, <Serializable Customer object>);
// Get a customer
Customer someCust = (Customer) customers.get(<custId>);
// Queries and batch puts are executed through the ‘Region’ interface
Region custRegion = (Region) customers;
// A simple OQL query illustrating navigating the objects using methods or public fields
SelectResults results = custRegion.query("select distinct * from /Customers c where c.getAddresses().size > 1");
// results is a collection of objects of the type managed in the region
// Batch put of a Map<K,V> of customers
custRegion.putAll(someMapOfCustomers);
10. Reliable Publish Subscribe and Continuous Querying
With the core system built for efficient and reliable data distribution, vFabric GemFire™ allows client
applications to express interest in rapidly changing data and reliably delivers data change notifications to the
clients. Unlike some database solutions in the market, where update event processing is built as a layer on top
of the core engine, vFabric GemFire™ incorporates messaging capabilities as a fundamental building block in its
design.
We often see applications that update data in a shared database and dispatch change events through a
messaging solution, such as JMS, to subscribers. These subscribers are often components or services that
closely cooperate and share the same database. Essentially, the services are loosely coupled from an availability
standpoint but are tightly coupled at a data structure level because of the shared database. The application
design is often complicated by the need to do 2-phase commits between the database and the
messaging server, which are inherently slow. This architecture also exposes the systems to increased failure
conditions that have to be handled at the application level and exposes the need to deal with race conditions
caused by multiple, asynchronous channels (message bus and replicated databases). For instance, an incoming
message causes the application to look for related data in a replicated database that has yet to arrive, resulting
in an exception. vFabric GemFire™ avoids all these issues by combining the data management aspects with the
messaging aspects for applications running in distributed environments.
There are a few characteristics that distinguish the active nature of vFabric GemFire™ from a traditional
messaging system.
10.1 Continuous Querying
vFabric GemFire™ clients can subscribe to data regions by expressing interest on the entire data region, specific
keys, or a subset of keys identified using an OQL query expression or a regular expression. When a query is
initially registered, a result set from the evaluation of the query is returned to the client. Subsequently, any
updates to the data region result in automatic, continuous query evaluation and updates satisfying the
registered queries are automatically pushed to the subscribing clients.
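The register-then-push behavior can be sketched in plain Java. `CqRegion` is a hypothetical class, a `Predicate` stands in for an OQL query expression, and a `BiConsumer` stands in for the client-side listener; delivery here is a direct method call rather than a server-to-client message.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

// Hypothetical region with continuous queries; not part of the GemFire API.
final class CqRegion<K, V> {
    private final Map<K, V> data = new HashMap<>();
    private final List<Predicate<V>> queries = new ArrayList<>();
    private final List<BiConsumer<K, V>> listeners = new ArrayList<>();

    // Registration evaluates the query once and returns the initial
    // result set; later matches are pushed to the listener.
    synchronized List<V> register(Predicate<V> query, BiConsumer<K, V> listener) {
        queries.add(query);
        listeners.add(listener);
        List<V> initial = new ArrayList<>();
        for (V value : data.values()) {
            if (query.test(value)) {
                initial.add(value);
            }
        }
        return initial;
    }

    // Every update is evaluated against all registered queries.
    synchronized void put(K key, V value) {
        data.put(key, value);
        for (int i = 0; i < queries.size(); i++) {
            if (queries.get(i).test(value)) {
                listeners.get(i).accept(key, value);
            }
        }
    }
}
```

The subscriber therefore never polls: it receives one snapshot at registration time and then only the updates that satisfy its query.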
10.2 ‘Delta Propagation’
When clients update any existing objects, they have the choice to capture only the ‘delta’ (i.e. the updated
fields) and propagate just the ‘delta’ across the distributed system all the way to subscribing clients that host a
local ‘edge’ cache. This ability to capture and propagate just the updated fields to a subscriber saves bandwidth
and can potentially deliver higher message throughput.
10.3 Contextual Information Available at Memory Speeds
Unlike a database, where individual data objects can have relationships, there are no inherent relationships
between multiple messages in a messaging system. The sender has the choice between including the
contextual information as part of the payload or forcing the receiver to use a shared database to retrieve the
context required to process the messages. The first choice slows down the messaging middleware (because
it has to push more data through its guaranteed delivery system). The second slows down the database
(because of additional ‘redundant’ queries). So, if the speed of message delivery were high, the subscriber
could still be throttled at the speed of access to the databases. With vFabric GemFire™, event notifications can
only be generated on objects that are being managed by vFabric GemFire™. When events are delivered, the
related data objects can be fetched at memory speeds. This combination of data management and messaging
provides a unique architecture enabling applications with predictable SLA characteristics in massively
distributed environments.
10.4 High Availability and Durability through Memory-Based Replication
Most messaging systems use some form of disk persistence to provide reliability. If a primary messaging server
goes down, the secondary reads from disk and delivers the rest of the messages. vFabric GemFire™ instead
uses a memory-based, highly available FIFO queue implementation where the queue is replicated to the
memory in at least one other node in the distributed system. This is far more efficient for overall throughput
and achieves the lowest possible latency for asynchronous message delivery. Each subscribing client’s events
are managed in a server side queue that is replicated to at least one other server node. Often, this is the
node where the cache is also replicated, so the events delivered into the queue are merely references to
cached objects. A pool of dispatcher threads is continuously trying to send the events to the client as fast as
possible. Keeping the queues separate allows different clients to operate at different speeds and no single
slow consumer can throttle the event propagation rate of the other consumers. vFabric GemFire™ provides
the option for durable subscriptions that keep the events alive and continue to accumulate events, even
if the client disconnects temporarily (normally or abnormally). Options also allow for the conflation of the
queues so that only the latest update for each key is sent to a slowly consuming client. A client session and its
event queue will be dropped if the consumer is so slow that the queue reaches a configurable maximum size
threshold.
11. Performance Benchmark
One of the goals of vFabric GemFire™’s design is to achieve a near linear increase in throughput with an increasing
number of concurrent clients and servers managing data. Below, we present a throughput benchmark
demonstrating the raw throughput and level of scalability that can be achieved using commodity hardware.
All tests used vFabric GemFire™ Enterprise 6.0 release.
The test used 2 blade centers, one for storing the data with redundancy in memory and the second blade center
was used to simulate a large number of concurrent clients. All blades were multi core commodity machines.
Each blade within either of the blade centers has Gigabit connectivity to each other but the network switch
connecting the 2 blade centers is limited to xGigabits/sec.
The specific details are available from GemStone Systems, a division of VMware.
12. Replicated Region Query Test
The number of servers used for managing data is doubled from 2 to 4 to 8. We limited the number of physical
nodes used to simulate the clients to five. Keys are ‘longs’ and the value is an object about 1KB in size.
Each client application is a simple Java process that starts 18 concurrent threads that query using the primary
key or write objects as fast as they possibly can. The test starts with a single server hosting data with a single
client JVM accessing the data and then progresses by increasing both the number of clients as well as the
servers in proportion.
Each client does some amount of warm up and then fetches the value stored in a data region using a randomly
generated key. There are 18 concurrent threads active in any client JVM. The X-axis in the graph below shows
the increase in the number of replicated servers (data hosts), the Y-axis on the left shows the number of
concurrent client threads (18 per JVM) and the Y-axis on the right shows the aggregate throughput observed
across all the data hosts.
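The client workload described above can be sketched as follows. This is a minimal illustration in Python, not the Java client used in the test: a plain dictionary stands in for a data region, and the names `run_client`, `region`, `key_count` and `ops_per_thread` are hypothetical. The real clients ran as fast as they could, whereas this sketch uses a fixed operation count per thread so that the result is repeatable.

```python
import random
import threading

THREADS_PER_CLIENT = 18  # from the test description: 18 threads per client JVM


def run_client(region, key_count, ops_per_thread, threads=THREADS_PER_CLIENT):
    """Run `threads` workers that each fetch `ops_per_thread` random keys.

    `region` is a plain dict standing in for a data region (an assumption;
    a real client would call the product's client API instead).
    Returns the total number of successful fetches across all threads.
    """
    hits = [0] * threads  # one slot per thread, so no lock is needed

    def worker(index):
        rng = random.Random(index)  # per-thread generator, seeded for repeatability
        for _ in range(ops_per_thread):
            key = rng.randrange(key_count)  # randomly generated primary key
            if region.get(key) is not None:
                hits[index] += 1

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return sum(hits)


# Keys are integers ("longs" in the test); values are roughly 1 KB each.
region = {key: bytes(1024) for key in range(1000)}
total = run_client(region, key_count=1000, ops_per_thread=100)
```

Dividing the returned count by the elapsed wall-clock time would give a per-client throughput figure; summing that over all clients corresponds to the aggregate throughput plotted in the benchmark.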
The results show that the aggregate throughput nearly doubles when the client thread count is doubled along
with the number of servers available to serve the data. With 1000 concurrent client threads, we don’t get double
the throughput because load on the client hosts reaches a saturation point with 200 threads competing for CPU
(150,000 context switches per second).
13. Partitioned Region Query Test
The partitioned region tests used 12 physical data hosts and 9 physical hosts for clients. Each client host ran
9 virtual machines, and each client virtual machine had 2 threads. The test was similar to the replicated region
test except that fewer clients were used and the object size was about 70 bytes.
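Taken together, the figures above imply the following client-side concurrency. This is a small worked calculation, assuming all client hosts were active at once; the variable names are ours.

```python
client_hosts = 9      # physical hosts used for clients
vms_per_host = 9      # client virtual machines per host
threads_per_vm = 2    # threads per client virtual machine

client_vms = client_hosts * vms_per_host        # total client virtual machines
client_threads = client_vms * threads_per_vm    # total concurrent client threads
```

That is 81 client virtual machines and 162 concurrent client threads at full load.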
Similar to the replicated region test, observed throughput nearly doubles when capacity is doubled but gets
bounded by the available CPU and network constraints at higher loads.
With a redundant copy for any key, client access patterns have a better chance of being load balanced
uniformly resulting in a more linear increase in throughput with increasing load and capacity.
14. Partitioned Region Write Test
Writes, in general, can be more expensive than reads because of the increased GC activity. Note that in this test
most of the writes update existing entries, creating garbage that needs to be collected.
Redundancy provides higher resiliency but comes at the cost of reduced throughput, because each update must be
synchronously applied to 2 member nodes before the write completes.
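The cost of a synchronous redundant write can be illustrated with a toy model. This is a sketch under stated assumptions, not the product's implementation: each dictionary stands in for the memory of one member node, and the class `RedundantStore` and its `applied` counter are hypothetical names introduced here.

```python
class RedundantStore:
    """Toy store that keeps each entry on a primary and its redundant copies."""

    def __init__(self, copies=2):
        # Each dict stands in for the memory of one member node (hypothetical).
        self.members = [dict() for _ in range(copies)]
        self.applied = 0  # counts individual member updates

    def put(self, key, value):
        # The write returns only after every copy has been updated,
        # so its cost grows with the number of copies.
        for member in self.members:
            member[key] = value
            self.applied += 1

    def get(self, key, member_index=0):
        # Any copy can serve a read once the write has completed.
        return self.members[member_index].get(key)


store = RedundantStore(copies=2)
store.put(42, "order-42")
```

One logical write here results in two member updates, which is why throughput drops when redundancy is enabled; in exchange, the value remains readable from the second copy if the first member is lost.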
15. Conclusion
The features in general purpose online transaction processing database systems were developed to support
transaction processing in the 1970’s and 1980’s, when an OLTP database was many times larger than the
main memory, and when the computers that ran these databases cost hundreds of thousands to millions of
dollars [9]. Today, the situation is quite different. Modern processors are very fast, memory is abundant,
cheap commodity servers and networks are a lot more reliable, and it isn’t uncommon to find clusters with
hundreds of gigabytes in total memory. Demand spikes are much more unpredictable, and yet customer
expectations with respect to availability and predictable latency are very high (expectations raised by engines
like Google). vFabric GemFire™ presents an alternative memory-oriented strategy that brings the
key semantics found in relational databases to commodity cluster environments and relaxes some of the strict
consistency requirements in favor of higher performance and scalability. It offers a single platform that serves
like a database (or a cache) and also like a reliable publish subscribe system allowing clients to express complex
interest on data through queries.
References
[1] M. Stonebraker and U. Cetintemel. “One size fits all”: An idea whose time has come and gone.
In ICDE ’05, pages 2-11, 2005.
[2] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland.
The end of an architectural era: (it’s time for a complete rewrite). In VLDB ’07, 2007.
[3] P. Helland. Life beyond Distributed Transactions: an Apostate’s Opinion. Position paper.
[4] J. Gray. The Revolution in Database Architecture.
Microsoft Research Technical Report, March 2004.
[5] I. Lee and H. Y. Yeom. A single phase distributed commit protocol for main memory database systems.
School of Computer Science, Seoul National University.
[6] http://highscalability.com/paper-optimistic-replication
[7] M. Welsh, D. Culler, and E. Brewer. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services.
Computer Science Division, University of California, Berkeley.
[8] A.-P. Liedes and A. Wolski. A memory-conserving, snapshot-consistent checkpoint algorithm for in-memory
databases. In ICDE ’06, page 99, 2006.
[9] S. Harizopoulos, M. Stonebraker, S. Madden, and D. J. Abadi. OLTP Through the Looking Glass, and What We
Found There. HP Labs; Massachusetts Institute of Technology, Cambridge, MA.
VMware, Inc. 3401 Hillview Avenue Palo Alto CA 94304 USA Tel 877-486-9273 Fax 650-427-5001 www.vmware.com
Copyright © 2010 VMware, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws. VMware products are covered by one or more patents listed at
http://www.vmware.com/go/patents. VMware is a registered trademark or trademark of VMware, Inc. in the United States and/or other jurisdictions. All other marks and names mentioned herein may be
trademarks of their respective companies. Item No: VMW_10Q3_WP_vFabric_GemFire_EN_R1