Segmentation in a Distributed
Real-Time Main-Memory Database
HS-IDA-MD-02-008
Gunnar Mathiason
Submitted by Gunnar Mathiason to the University of
Skövde as a dissertation towards the degree of M.Sc. by
examination and dissertation in the Department of
Computer Science.
September 2002
I certify that all material in this dissertation which is not my
own work has been identified and that no material is
included for which a degree has already been conferred
upon me.
__________________________________
Gunnar Mathiason
Abstract
To achieve better scalability, a fully replicated, distributed, main-memory database is divided into subparts, called segments. Segments may have individual degrees of redundancy and other properties that can be used for replication control. Segmentation is examined as an opportunity to decrease the replication effort, lower memory requirements, and decrease node recovery times. Typical usage scenarios are distributed databases with many nodes, where only a small number of the nodes share information. We present a framework for virtual full replication that implements segments with scheduled replication of updates between sharing nodes.

Selective replication control needs information about the application semantics, which is specified using segment properties, including consistency classes. We define a syntax for specifying the application semantics and segment properties of the segmented database. In particular, properties of segments that are subject to hard real-time constraints must be specified. We also analyze the potential improvements of such an architecture.
To my wife Lena and our children
Jacob, Frederik and Jesper.
Contents
Chapter 1 Introduction
1.1 Distributed computing
1.2 Replication
1.3 Motivation for segmentation
1.4 Segmentation and virtual full replication
1.5 Results
1.6 Dissertation outline
Chapter 2 Background
2.1 Databases and the transaction concept
2.1.1 Serializability and correctness
2.2 Distributed systems
2.3 Real-time systems
2.3.1 Real-time databases
2.4 Distributed databases
2.5 Redundancy and replication
2.5.1 Replicated databases
2.5.2 Issues for distributed and replicated databases
2.5.3 Trading off consistency
2.5.4 Replication in distributed real-time databases
2.6 Network partitioning and node failures
2.7 The DeeDS prototype
Chapter 3 The Partial Replication Problem
3.1 Replication and the driving scenario
3.1.1 The WITAS project and DeeDS
3.1.2 The scalability problem
3.1.3 Desired segment properties
3.2 Major goal
3.3 Subgoals
3.3.1 Partial replication in current systems
3.3.2 Segment properties and consistency classes
3.3.3 Prototype development
Chapter 4 Segmentation in Distributed Real-Time Databases
4.1 Segmentation of the database
4.1.1 Introducing segments
4.1.2 Assumptions
4.1.3 Definitions
4.1.4 Setup and allocation of segments
4.1.5 Distributing the replication schema
4.2 Properties of segments
4.2.1 Chosen subset of segment properties
4.2.2 Requirements on consistency classes
4.2.3 Consistency classes
4.3 Specifying segment properties
4.3.1 Specifying segments using the application properties
4.3.2 Syntax for specifying applications and segments
4.3.2.1 Specification of applications and transactions
4.3.2.2 Specification of segments
4.3.2.3 Specification of transaction access to segments
4.3.3 Consistency and tolerance classes
4.3.4 Manual segment setup by matching the specification
4.3.5 Automatic setup of segments with properties
4.3.6 Segment tables
4.4 Scheduled replication
4.4.1 Replication
4.4.2 Requirements for scheduled replication
Chapter 5 Implementation
5.1 Architecture
5.2 Support for replication of segments
5.3 Scheduled propagation
5.4 Scheduled integration
Chapter 6 Evaluation
6.1 Evaluation model
6.2 Discussion
6.2.1 Segmentation of distributed real-time databases
6.2.2 Replication
6.2.3 Scalability
6.2.4 Validation by simulation
6.3 Problems
6.3.1 Segments
6.3.2 Architecture
6.3.3 Replication
6.4 Related work
6.4.1 Partial replication
6.4.2 Eventual consistency
6.4.3 Specification of inconsistency
6.4.4 Mixed consistency
Chapter 7 Conclusions
7.1 Achievements
7.2 Contributions
7.3 Future work
7.3.1 Implementation
7.3.2 Dynamic allocation of segments by recovery
7.3.3 Segment properties and consistency classes
7.3.4 Other issues for future work
Acknowledgements
Bibliography
Chapter 1
Introduction
1.1 Distributed computing
In the early days of the computer era, computers most often executed dedicated tasks in isolation from each other. When inter-computer communication was introduced, several computers could cooperate on tasks: parts of computer programs, as well as parts of the data they used, could be located separately. Computing power could then be placed where there was a need for processing data, and the work could be shared between distributed computers cooperating towards a common goal.

Sharing work means that the system's state must be shared. The common data in a distributed system represents the state of the distributed system. For a common view of the system state, cooperating computers need access to common data. The data must be transferred, with a known degree of safety and reliability, to all cooperating computers.
1.2 Replication
Full replication of data makes the entire set of data available locally at all cooperating computers, here called nodes. This enables several nodes to work on copies of the same data in parallel, without needing to contact other nodes to obtain the data required for an operation, since all data is locally available. Replication also provides fault tolerance for data, since other copies of the data remain available in the system if a node fails. Replication requires a systematic and reliable way of propagating changes to the data between nodes, so that replicated data stays consistent among nodes.
1.3 Motivation for segmentation
A database may be replicated, and its replicas may be distributed to different nodes. A fully replicated distributed database needs extensive data communication between nodes to support global consistency, since all updates must be sent to all other nodes, requiring a large effort to keep the database replicas consistent. Full replication also needs memory large enough to store a full database copy at each site. A distributed database with improved scalability and a lower replication effort can be achieved by replicating only the parts of the database that are used at each node, while still supporting the same degree of fault tolerance and data availability as a fully replicated database, but with less communication for replication. We define segments as the units of granularity for allocating replicas and for controlling common data object properties.
In a fully replicated database, an update at a node is replicated to all other nodes, but in a segmented database an update is replicated only to certain nodes, since each segment may have an individual degree of replication: an update is replicated only to the nodes where the segment is allocated. A distributed database with segments allocated only to nodes where the data is actually used effectively provides virtual full replication (Andler, Hansson, Eriksson, Mellin, Berndtsson & Eftring, 1996) for all users at these nodes, who see no difference in data availability compared to a fully replicated database. Such a database scales much better, and the database system has a lower replication effort, since updates are replicated only to nodes that will use them. Improved scalability enables more nodes to be added and less network bandwidth to be used.
For this dissertation we use the WITAS system (Doherty, Granlund, Kuchinski, Sandewall, Nordberg, Skarman & Wiklund, 2000) as our driving scenario. In a WITAS system, unmanned (partly autonomous) aerial vehicles (UAVs) communicate with their mobile operators and a Command Center, in a distributed system where the actors have interest in different parts of the data, with different consistency requirements. A distributed main-memory database, such as the Distributed Active Real-Time Database System (DeeDS) (Andler et al., 1996), is well suited as a medium for real-time communication in such a system. Predictability is achieved in DeeDS by using main-memory residence (there are no unpredictable disk delays) and full replication of the database (there are no communication delays at data access, since all data is available locally), but the application must tolerate temporary inconsistencies (since there is no distributed commit).
Scalability and replication efficiency are main concerns in a WITAS system, since it may consist of many participants in large missions, with transfers of large amounts of data between nodes. Introducing segments in the DeeDS database system increases efficiency and scalability of communication in a WITAS system, since it is expected that most segments are shared by only a small number of nodes.
1.4 Segmentation and virtual full replication
In this dissertation we explore segmentation as a principle for improving replication efficiency, by providing virtual full replication in an active distributed real-time main-memory database. To do this, we have developed concepts and solutions for central issues in segmentation:

• How segments are set up, and which data objects to include in a particular segment.
• How access to segments determines to which nodes a particular segment is allocated.
• How other requirements on the data, in particular consistency requirements, influence segment properties and the replication of updates in the database.
• How the concept of segments constitutes a method for supporting virtual full replication.
1.5 Results
We use segments to introduce granularity in a distributed database to support virtual full replication (see 1.3). Depending on the application's requirements for a minimum degree of replication, we may considerably reduce the replication effort compared to a fully replicated database.

With segmentation of the database we achieve virtual full replication, which improves scalability and enables us to add more nodes and support larger databases in a distributed real-time main-memory database. In a WITAS system, we can add more participants and allow more updates to the database.
We differentiate segment properties and consistency classes for specifying requirements
on data objects to support scheduled replication of updates between nodes.
We introduce a segmentation architecture for the DeeDS distributed database, to support
replication of updates with different properties. We also present a high-level design for
segments in the DeeDS database system, analyze the implications of the design for a
segmented database and define conditions for a simulation to validate the design.
We specify algorithms for:

• segmentation of a replicated database based on user access patterns only
• segmentation based on user access patterns and data property requirements
• building of segment replication tables from the application specification
• scheduled replication of updates of mixed consistencies
We define a basic evaluation model for measuring the improvement in replication effort for a segmented system, and we draw conclusions about how parameter changes influence scalability and replication efficiency.
We compare the concept of segmentation with other work done in the area and with
similar concepts in other areas.
We propose future work on segmentation, including dynamic allocation of segments to support unspecified data needs while still providing virtual full replication, and we also propose work on other open issues in the area.
1.6 Dissertation outline
In chapter two, a background and an introduction to definitions within the areas of distributed real-time systems and distributed database replication are given. Chapter three defines the problem on which this dissertation focuses. In chapter four we present segmentation and the concepts that we have developed: segment properties and consistency classes, specification of segments, and scheduled replication of updates for segmented databases. Chapter five describes an architecture for a database system that supports segmented databases, and a high-level design in DeeDS, where some important issues are discussed. Chapter six contains an evaluation model and a discussion of segments, usage implications, and potential problems with our solution. Finally, in chapter seven we summarize our results and relate our work to other work in the area.
Chapter 2
Background
This work concerns the area of distributed real-time database systems. The concepts distributed, real-time and database system all have special meanings that impose requirements on the system. Transactions ensure that database updates are performed correctly. This chapter elaborates on each of these concepts and also introduces the reader to other key terminology and concepts.

This chapter also introduces the database system used as the intended target for a prototype implementation: the Distributed Active Real-Time Database System (DeeDS).
2.1 Databases and the transaction concept
A database is a related collection of data and metadata. Metadata is information about the data collection, such as descriptions of the relations and types of the database. Databases are accessed by using queries for retrieving data and updates for storing data.
Transactions are central to databases. They are the means for grouping database operations that logically belong together, and they have properties that ensure a well-defined state of the database after execution. Database updates and queries are executed in transactions. The consistency of the database can be controlled if the database is accessed through transactions only, due to the properties of transactions.

In a database context, consistency means that the integrity constraints between data entities are preserved when their values are updated, so that the data entities agree on the state of what is represented. Consistency concerns both value consistency and temporal consistency. Consistent data is the correctness criterion for many database applications, and if the data of the database is not consistent, the database cannot be used.
According to Gray & Reuter (1993), the term transaction is often given one of the following meanings:

• The request or input message that started the operation (here: transaction request/reply).
• All effects of the execution of the operation (here: transaction).
• The program(s) that execute(s) the operation (here: transaction program).

For this dissertation we choose the second alternative, which states that a transaction acts as an isolated execution unit that embeds a number of operations. No effect of the operations is seen by other transactions before the transaction completes (commits).
Transactions have specific properties that guarantee that the effect of the transaction and its operations is dependable. These properties are often called the ACID properties (Atomicity, Consistency, Isolation and Durability). In a database context, the ACID properties are (Gray & Reuter, 1993):

Atomicity – The changes done to the database by the transaction operations are atomic, i.e. either all changes or no changes apply.

Consistency – A transaction does not violate any integrity constraints when transforming the state of the database from one consistent state to another.

Isolation – Transactions may execute concurrently, but a transaction never sees other transactions execute concurrently; ongoing transactions are not observable. It appears as if each transaction executes on its own, in isolation.

Durability – Once the transaction has completed successfully, the changes are permanent and are not undone by subsequent failures.
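To make atomicity and isolation concrete, the following is a minimal sketch (illustrative Python, not DeeDS code) of a transaction that buffers writes and applies them all-or-nothing at commit:

    class Transaction:
        """Toy transaction over an in-memory key-value store (a dict).

        Atomicity: writes are buffered and applied all-or-nothing at commit.
        Isolation: readers of the store never observe buffered writes.
        """
        def __init__(self, store):
            self.store = store   # the shared database
            self.writes = {}     # buffered, uncommitted writes

        def read(self, key):
            # Read your own writes; otherwise fall through to the store.
            return self.writes.get(key, self.store.get(key))

        def write(self, key, value):
            self.writes[key] = value   # invisible outside the transaction

        def commit(self):
            self.store.update(self.writes)   # all changes apply together
            self.writes = {}

        def abort(self):
            self.writes = {}   # no changes apply

    db = {"x": 1}
    t = Transaction(db)
    t.write("x", 2)
    assert db["x"] == 1   # isolation: the uncommitted write is invisible
    t.commit()
    assert db["x"] == 2   # atomicity: all buffered writes applied at once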
Introducing transactions in a concurrent system simplifies the design of such a system. Since transactions execute in isolation, it is not necessary to explicitly synchronize processes that access the same data; active processes can be regarded as independent in that sense. For distributed systems, transactions offer a way to abstract concurrency control and to reduce the need for synchronization mechanisms between separated parts of the system.

When considering the effect on a database of executing several transactions concurrently, there is a need to order the apparent execution of the transactions. This ordering of transactions is called serialization, and it concerns the result of combining transactions on a set of data.
2.1.1 Serializability and correctness
For databases where full consistency is important, updates have serializability as the correctness criterion. Elmasri and Navathe (2000) explain serial schedules and serializability: a serial schedule is a sequence of transactions (a schedule) where only one transaction is executed at a time. Two schedules are equivalent if executing them on the same specific database state always results in the same resulting database state; the schedules are then result equivalent. However, two different schedules may produce the same database state by accident, so serializability usually implicitly means conflict serializability. There is a conflict when operations in two transactions access the same data element and one of the operations is a write operation on that data element. Serializability guarantees that all nodes in the system execute concurrent transactions in the same order, regardless of at which node a transaction entered the system. A formal introduction to serializability theory can be found in Helal, Heddaya and Bhargava (1996); Bernstein, Hadzilacos and Goodman (1987) give the details on the subject of serializability, while seminal work in the area is found in Gray, Lorie, Putzolu and Traiger (1976).
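The conflict-serializability test sketched above amounts to a cycle check on the precedence graph. The following is a minimal sketch (illustrative names, not code from the cited works), assuming a schedule is given as (transaction, action, item) triples:

    def conflict_serializable(schedule):
        """Test conflict serializability of a schedule.

        `schedule` is a list of (txn, action, item) triples in execution
        order, where action is 'r' (read) or 'w' (write). Two operations
        conflict if they belong to different transactions, access the same
        item, and at least one is a write. The schedule is conflict
        serializable iff the precedence graph is acyclic.
        """
        graph = {}   # txn -> set of txns that must come after it
        for i, (t1, a1, x1) in enumerate(schedule):
            for t2, a2, x2 in schedule[i + 1:]:
                if t1 != t2 and x1 == x2 and 'w' in (a1, a2):
                    graph.setdefault(t1, set()).add(t2)   # t1 precedes t2

        def cyclic(node, path, done):   # depth-first cycle detection
            if node in path:
                return True
            if node in done:
                return False
            path.add(node)
            if any(cyclic(v, path, done) for v in graph.get(node, ())):
                return True
            path.remove(node)
            done.add(node)
            return False

        done = set()
        return not any(cyclic(u, set(), done) for u in list(graph))

    # T1 and T2 both read, then both write x: a classic non-serializable
    # (lost update) interleaving; the precedence graph has a cycle.
    s = [(1, 'r', 'x'), (2, 'r', 'x'), (1, 'w', 'x'), (2, 'w', 'x')]
    print(conflict_serializable(s))   # False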
2.2 Distributed systems
This dissertation uses the definition of a distributed system from Burns and Wellings (2001): a distributed system is a system of multiple autonomous processing elements cooperating for a common purpose. It can be either a tightly or a loosely coupled system, depending on whether or not the processing elements have access to a common memory.
A number of issues arise when a system is distributed. The most essential problems for this dissertation are in the following areas:

• Concurrent updates and concurrent communication. In all systems with concurrent execution, there is a risk that the same variable is written from two threads of execution, where the values written by one thread are overwritten by the other thread (the lost-update problem; see the sketch below).
• Communication delays, discontinuous operation of communication mechanisms, and partitioning of the communication network. If the communication breaks down or slows down, the entire replicated system stops or loses performance.
• Replication of data when the same data is used at several nodes. Multiple copies of variables with the same data must have the same value.

Helal et al. (1996) describe solutions to these problems for databases with ACID properties, and protocols have also been developed for handling the problems of communication delays and discontinued communication links.
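The lost-update risk in the first item can be demonstrated with two unsynchronized threads performing read-modify-write on a shared counter (an illustrative sketch, not from the dissertation):

    import threading

    counter = {"value": 0}

    def unsynchronized_increment(n):
        # Read-modify-write without synchronization: both threads may read
        # the same old value, so one thread's write overwrites the other's.
        for _ in range(n):
            old = counter["value"]
            counter["value"] = old + 1

    threads = [threading.Thread(target=unsynchronized_increment,
                                args=(100000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Expected 200000, but with unlucky interleavings increments are lost.
    print(counter["value"])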
2.3 Real-time systems
A correct computer system is expected to give a logically correct result from a computation. In addition to giving correct logical results, a real-time system is expected to produce the results in a timely fashion. Timeliness requirements are typically expressed as deadlines, which specify when computing results are expected to be available for use.

Several classifications exist for real-time systems. One established classification is based on the value of the result around the deadline. Deadlines may be hard, firm or soft, depending on the value of the result if the deadline is missed (Locke, 1986), as seen in Figure 1. A real-time system where the value of a result after a missed deadline is strongly, or even infinitely, negative is called a hard real-time system. A missed deadline in a firm real-time system results in (near) zero value, while for a soft real-time system the value decreases with time after the deadline has passed.
Figure 1. Value functions (utility over time for hard, firm and soft deadlines: after the deadline, utility drops to a penalty/damage level for hard deadlines, drops to zero for firm deadlines, and decays gradually for soft deadlines).
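The value functions of Figure 1 can be expressed as functions of completion time. The sketch below uses illustrative parameters (unit utility, linear decay for the soft case); the figure fixes no particular shapes:

    def hard_value(t, deadline, utility=1.0, damage=float("inf")):
        # Hard deadline: a miss is strongly (even infinitely) negative.
        return utility if t <= deadline else -damage

    def firm_value(t, deadline, utility=1.0):
        # Firm deadline: a late result is worthless, but not harmful.
        return utility if t <= deadline else 0.0

    def soft_value(t, deadline, utility=1.0, decay=0.1):
        # Soft deadline: value decreases with lateness (linear decay is
        # an illustrative choice).
        if t <= deadline:
            return utility
        return max(0.0, utility - decay * (t - deadline))

    for value in (hard_value, firm_value, soft_value):
        print(value.__name__, value(5, deadline=10), value(15, deadline=10))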
2.3.1 Real-time databases
A real-time database system (Ramamritham, 1996) must process accesses so that specified deadlines for each access are met, and it must be designed so that it is possible to show that database transaction deadlines will be met. Transactions in a real-time system need to be time-cognizant. Since a value is connected with the timely execution of computations in the system, there is a value in using the results of a transaction. Transactions that execute outside their deadline boundaries have less value or may damage the system, depending on the type of deadline associated with them.

For real-time databases, the most important characteristic is predictability. The data of a real-time database has temporal consistency as an additional integrity constraint compared to traditional databases, and for this reason all hard deadlines, and as many soft deadlines as possible, must be met. Predictability is often more important than consistency, which is why the consistency constraint is relaxed to achieve predictability in a real-time database.
2.4 Distributed databases
A distributed database is allocated over several nodes. With a database split over several nodes, the data itself is regarded as the object of distribution. The database parts at the distributed nodes together form the complete database. Each separate part of the database is called a partition of the database.

A distributed database is primarily data-driven, as opposed to a distributed system, which is primarily state- or event-driven. In a distributed database, the data itself is the means for nodes to communicate and cooperate, while in a distributed system, explicit sending of messages between nodes transforms the state of the distributed application. With a distributed system, the partitioning of the application must be considered.
When distributing a database, we may locate a partition at the node where the data is most frequently used, which increases the performance of the database, since network communication can be reduced. The bandwidth requirement decreases while the overall system performance increases. In some distributed real-time databases, the data used by real-time transactions is placed at the node where the transactions enter the database, which enables the transactions to execute with local data accesses only. With only local data accesses, it is much easier to predict the execution time of a transaction. This can also be applied to many nodes, by replicating the same data to several nodes, enabling local execution at all of these nodes.
Transactions may also be distributed themselves (Hevner & Yao, 1979), which means that a transaction is transferred to other nodes if it cannot be fully executed at the node where it entered the system, or that data needs to be fetched from other nodes to complete the transaction. In such a case, the execution of a transaction is distributed, as well as the data that the transaction uses.
An example of a partitioned distributed database is the Domain Name System (DNS) service of the Internet. DNS servers translate textual names of Internet nodes into IP addresses, so that messages can be routed to the correct Internet node. Keeping the entire DNS translation table (the DNS database) for the whole Internet at one central site would mean that all Internet traffic using textual node names would need to be translated at that single DNS server, which would require extremely high availability and performance. Instead, the DNS database is partitioned and distributed over the sub-domains of the Internet in a hierarchical structure, where different DNS servers handle DNS requests for different Internet domains. Partitioning the DNS database results in higher overall capacity, since the traffic is spread over several nodes with less traffic at each node. DNS database partitioning also improves fault tolerance, since DNS servers fail independently of each other.
2.5 Redundancy and replication
Redundancy is the approach of adding resources that are not necessary for the system to operate under normal conditions. When redundancy is added, data availability increases and the system may become fault tolerant, at the cost of extra processing or extra space requirements.

Redundancy may be implemented by multiplying existing parts of the system (e.g. hardware, processes, data) or by adding other parts that partly contain already existing information or meta-information. Replication, where identical copies of parts of the system are multiplied, is essential for distributed systems to achieve fault tolerance and availability. With replication, a distributed system does not depend on all nodes, or the communication with all nodes, being in operation. Instead, an alternative node may perform the distributed task if a node fails. For a distributed and replicated database, there are other replicas of the data to use if one replica fails. Availability is essential for a distributed node to continue its operation in isolation. In a fully replicated system, all resources are available locally, so that local operation may continue even when other parts of the system fail.
Replication also improves fault tolerance in other ways: by replicating sensors, the system may become tolerant to sensor failures, and entire subsystems may be replicated, which is common for critical systems, e.g. space mission controllers.
2.5.1 Replicated databases
Distributed databases may employ redundancy by having data partitions duplicated at several different nodes. A database where data or partitions are replicated is called a replicated database. For distributed databases with replication and global consistency requirements, one-copy serializability is often used as the correctness criterion (Bernstein & Goodman, 1984). This means that transactions are globally ordered as if they were executed on a system with only one replica; hence they may be tested for serializability as if the replicated database were a database with no replication.

An example of a replicated database (even if not stored in a database system) is the cached data in an Internet proxy server, which acts as a gateway between a sub-net and the larger Internet. If local nodes on the sub-net frequently access certain nodes of the larger Internet, the proxy server stores copies of the original data locally. Once the data is stored locally, accesses to the same Internet node result in data retrieval from the local copy, if the local copy is consistent with the remote data. Being able to retrieve the local copy instead of the original copy results in higher availability. With an Internet proxy server, the remote data and the local copy are of different importance, since the principle is to always check the version of the remote data before retrieving the local copy: the original copy is regarded as the master copy.
Replicated databases may, however, have equally important copies, where other consistency checking algorithms are used. Replication of data requires synchronization of data updates. Helal et al. (1996) include a summary of many traditional replication techniques for replicated databases; both immediate consistency systems and relaxed consistency systems are described. The cost of having full replication with serializability as the correctness criterion in a distributed database is that all replicas must be locked during an update, and changes must be replicated to all nodes during the update transaction. With partially replicated databases, some updates are not replicated to all nodes, but they still require locking of all involved nodes during the transaction. For large-scale replicated databases, weak consistency systems are discussed by Helal et al. (1996) as a key to scalability and availability. With weak consistency, replicas are not guaranteed to have the same value at all points in time. For this dissertation, we use a database with eventual consistency, first introduced in Grapevine (Birrell, Levin, Needham & Schroeder, 1982), where updates are written to one node only and other replicas are updated in the background. Eventual consistency means that database replicas will converge to a consistent state at some point in time. An eventually consistent database is not expected to support mutual consistency at any given point in time, but if transactions are stopped, the replication mechanism will eventually bring the replicas into mutual and internal consistency.
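The pattern described above, local commit followed by background propagation, can be sketched as follows (illustrative names; conflict detection is omitted):

    class Node:
        """A node of an eventually consistent replicated database:
        writes commit locally and are propagated in the background."""
        def __init__(self, name):
            self.name = name
            self.data = {}
            self.log = []   # committed but not yet propagated updates

        def local_write(self, key, value):
            self.data[key] = value        # local commit: no node is locked
            self.log.append((key, value))

        def propagate(self, peers):
            # Background replication, decoupled from the transaction.
            for key, value in self.log:
                for peer in peers:
                    peer.data[key] = value   # conflicts ignored in sketch
            self.log.clear()

    a, b = Node("a"), Node("b")
    a.local_write("x", 1)
    print(b.data)       # {} : the replicas are temporarily inconsistent
    a.propagate([b])
    print(b.data)       # {'x': 1} : the replicas have converged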
2.5.2 Issues for distributed and replicated databases
The following issues need to be considered for distributed databases, according to the survey by Davidson (1984):

• Consistency requirements need to be known. For some applications, mutual consistency between replicas needs to be preserved. Such distributed applications need a globally consistent database after each transaction: a global commit mechanism must be available, and often a global lock is used during update transactions to lock the entire database during the update. Other databases require only local consistency, where the local data is internally consistent. Applications of such databases are tolerant to mutual inconsistencies; concurrent updates to data replicas are allowed, but conflicting updates must be detected and resolved. For applications that tolerate temporary mutual inconsistencies, concurrent updating transactions may commit locally, and no global locking is needed. Each node is then more autonomous than with global consistency.
• Fault tolerance must be addressed. A partitioned (but not replicated) database depends on all nodes being available for the database to be complete. If one node fails or cannot communicate with other nodes, the data allocated to that node becomes inaccessible. The system must either be reconfigured to use a backup/complementary node, or the entire database must be stopped. By replicating the database, we may support fault tolerance.

• A distributed database needs to support availability. Network partitioning or costly network communication may reduce availability and stop the operation of a non-replicated database. Replication increases availability, since the same data can be accessed without using the network.

• The partitions of a distributed database must be allocated with care. Proper allocation gives the opportunity to reduce network communication and increase database performance, if (replicated) partitions are placed close to where data is created and/or used. Other allocation policies may be to place partitions at trusted or safe nodes, nodes with a known high dependability history, nodes with high communication bandwidth, or nodes with predictable access times.
2.5.3 Trading off consistency
Pessimistic replication ensures global consistency in distributed databases that use serializability as the correctness criterion. With optimistic replication, the correctness criterion is traded off to improve the efficiency of the database through increased concurrency (Kung & Robinson, 1981). Many systems tolerate relaxed global consistency, since they do not require full mutual consistency between replicas at all times, at least not for most of the data. For these systems, there is an opportunity to improve performance by trading off consistency for efficiency. This often requires application-specific knowledge and means that the database uses semantics as the correctness criterion instead of the syntactic approach of serializability. Many applications perform their tasks equally well with relaxed consistency requirements, except that efficiency is higher.
Considerable research effort has been put into refining replication techniques, so that higher system efficiency can be obtained from less replication effort; relaxing the consistency requirement is one way.

With optimistic replication, it is possible to replace the global atomic commit with a local commit for an update transaction, and there is no need to lock other nodes to write the update to the database. Committing transactions locally makes the transactions more predictable, since there is no need to synchronize the update with all replicas (participating nodes) in the system.
To predict the execution time of pessimistic replication, it is necessary to include the delays of locking and synchronizing a known number of nodes with individually scheduled processing. It is very hard to know what timing can be expected from remote nodes. We must also be able to find worst-case timing for communication between nodes, and be able to detect and act on communication link failures or delays. Predicting a worst-case global commit time might be impossible, and in such a case we cannot support real-time properties for that system.
With optimistic replication, the update and the commit are done locally. The worst-case execution time is much easier to predict, since only local resources are involved. Later, independently of the transaction, the system propagates the update to the other replicas.

When abandoning serializability as the correctness criterion, we must have other ways to define correctness. This is usually done in degrees of consistency. It is common to refer to three types of consistency:
• Internal consistency. Data within a node (or a single database) is consistent.

• External consistency. The data of a database is a consistent representation of the state of the environment.

• Mutual consistency. Database replicas are consistent with each other. For replicated databases with serializability as the correctness criterion, replicas are always fully mutually consistent. For databases with relaxed consistency, there are several types of inconsistencies. Our work focuses on replicated databases with eventual consistency.
Similar to the optimistic and pessimistic approaches to data replication, other concepts are available in the literature. Gray, Helland, O'Neil & Shasha (1996) discuss the difference between eager and lazy replication. In systems with eager propagation, a data update is immediately propagated to all other nodes within the original transaction; during the propagation, all nodes are locked by the update. In lazy propagation systems, the data update is done without locking any node other than the node receiving the update transaction. The update is propagated to the other nodes after the transaction has committed locally, typically as separate transactions.
The minor difference between the pessimistic-optimistic and eager-lazy distinctions is that optimistic replication does not say how replication is done, but instead states generally that the node performs propagation after the local commit has taken place. According to Gray et al. (1996), lazy replication is explicitly done as separate time-stamped transactions.

This dissertation uses the terms optimistic and pessimistic replication for the difference between replicating data between nodes within or after transaction commit, implying that transactions are committed globally or locally, respectively.
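The difference can be sketched as follows (illustrative code, not from any cited system): an eager commit locks and writes every replica within the transaction, while a lazy commit touches only the local replica and defers propagation:

    import threading

    class Replica:
        def __init__(self):
            self.data = {}
            self.lock = threading.Lock()
            self.outbox = []   # locally committed, unpropagated updates

    def eager_commit(update, replicas):
        # Eager (pessimistic): all replicas are locked and written within
        # the transaction, so commit latency depends on every node.
        key, value = update
        for r in replicas:
            r.lock.acquire()
        try:
            for r in replicas:
                r.data[key] = value
        finally:
            for r in replicas:
                r.lock.release()

    def lazy_commit(update, origin):
        # Lazy (optimistic): commit locally only; propagation happens
        # later, typically as separate time-stamped transactions.
        key, value = update
        with origin.lock:
            origin.data[key] = value
        origin.outbox.append(update)

    def propagate(origin, replicas):
        for key, value in origin.outbox:
            for r in replicas:
                with r.lock:
                    r.data[key] = value
        origin.outbox.clear()

    r1, r2 = Replica(), Replica()
    eager_commit(("x", 1), [r1, r2])   # both replicas updated at commit
    lazy_commit(("y", 2), r1)          # only r1 updated at commit
    propagate(r1, [r2])                # r2 catches up in the background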
As a result of the replication approach, the distributed database has the property of weak or strong consistency. Weak consistency allows nodes to be temporarily mutually inconsistent but to converge to a globally consistent state at some point in time. Sheth & Rusinkiewicz (1990) use the term eventual consistency for replicas that converge into mutual consistency, distinguished from lagging consistency, which never reaches the consistent state.
Eventual consistency is central in the DeeDS database for achieving real-time properties in a distributed real-time database (Andler et al., 1996). In DeeDS, transactions commit locally, so that it is possible to predict the timeliness of a transaction. Two types of eventual consistency are supported in DeeDS: Gustavsson (1995) describes optimistic replication with as-soon-as-possible (ASAP) eventual consistency, while Lundström (1997) describes optimistic replication with bounded eventual consistency, where it is possible to calculate how long it takes before nodes are mutually consistent.
2.5.4 Replication in distributed real-time databases
Like any real-time system, real-time databases are concerned with the timeliness of processing. Transactions in real-time databases must have predictable execution times, implying that access to data read or written by a transaction must also be predictable (Ramamritham, 1996). In distributed real-time databases with replication and serializability as the correctness criterion, the timing of a transaction depends on other nodes. The node where the transaction entered the system can guarantee timing requirements only if the resources at the other nodes involved in the distributed transaction are known. Detailed a priori knowledge about the requirements on the system would be necessary, including all possible combinations of transactions. Also, overloads may cause transactions to be blocked, so that unpredictable delays occur. This could be solved by pre-allocating resources at nodes to support a certain number of requests from other nodes, but that lowers the efficiency of the system.
However, a full analysis of the application is often difficult to make. Certain critical parts or transactions may be known, so that requirements can be specified for these, but often far from all requirements are fully known.

To overcome the problem of fulfilling the requirements on real-time distributed and replicated databases, sources of unpredictability need to be removed, such as network delays and dependence on other nodes. Several replicated distributed real-time databases have been built, and they address different sources of unpredictability:
• Disk access. Most databases have their persistent storage on hard disks, for which access times are hard to predict. It is possible to define an average access time, but for real-time systems, the worst-case access time is what influences real-time behavior. For this reason, many real-time databases are implemented as main-memory databases to enable predictable access times (Garcia-Molina & Salem, 1992).

• Network access. Most commercial computer networks are built to support safe file transfers, where real-time properties are not of large interest. Some network types are very efficient (e.g. LANs), but without the ability to specify worst-case access times. By using real-time network protocols, like DOD-CSMA-CD or CSMA-DCR (Le Lann & Rivierre, 1993), the propagation time of messages can be bounded. Still, there are uncertainties due to the processing at the remote node that cannot easily be predicted.

• Full replication. When the complete database is available locally, there is no need for a transaction to make a remote access to retrieve data for the transaction. In other words, a transaction reads all data it will ever use locally, avoiding the need for network traffic before transaction commit (Andler et al., 1996).

• Local commit of transactions. In addition to the unpredictability of the network, a request to another node can be further delayed by the internal processing at the remote node answering the request. As mentioned in 2.5.3, by trading off consistency and allowing controlled inconsistencies, the predictability of the local transaction increases, since the transaction is committed locally only, and thereby only local worst-case transaction processing times need to be analyzed and predicted. Local commit protocols require conflict resolution mechanisms, such as version vectors (Parker & Ramos, 1982); a sketch of version vector comparison is given after this list. The DeeDS database uses local commit, and in particular, distributed commit is avoided.

• Failing nodes and recovery. Replicas of a replicated database may be destroyed by a failing node. Failing nodes must be detected by other nodes and recovered within a time bound, since transactions may depend on a failing node for their timely execution (Leifsson, 1999).
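Version vectors detect concurrent updates by comparing per-node update counters. A minimal sketch of the comparison, following the general idea of the cited work (names illustrative):

    def compare(v1, v2):
        """Compare two version vectors (dicts mapping node -> counter).

        Returns 'equal', 'v1 dominates', 'v2 dominates', or 'conflict'.
        A conflict means the replicas were updated concurrently and must
        be reconciled by some conflict resolution policy.
        """
        nodes = set(v1) | set(v2)
        v1_ahead = any(v1.get(n, 0) > v2.get(n, 0) for n in nodes)
        v2_ahead = any(v2.get(n, 0) > v1.get(n, 0) for n in nodes)
        if v1_ahead and v2_ahead:
            return "conflict"
        if v1_ahead:
            return "v1 dominates"
        if v2_ahead:
            return "v2 dominates"
        return "equal"

    # Nodes A and B both update the same object after version {A: 1}:
    print(compare({"A": 2}, {"A": 1, "B": 1}))   # 'conflict'
    # Identical histories are mutually consistent:
    print(compare({"A": 1}, {"A": 1}))           # 'equal'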
The DeeDS system is the research tool that we use in this dissertation, since it addresses
several of these issues.
2.6 Network partitioning and node failures
The nodes in a distributed database may fail independently of each other. A distributed and replicated system uses redundancy to provide fault tolerance: a failing node may be replaced by an operational replica of the node. Thus, the remaining nodes may uphold a certain level of service if they are built to compensate for failing nodes. By adding redundant nodes and data replicas to the system, continuous operation in case of failure can be guaranteed to a certain known degree. This is an improvement compared to a centralized database, where the entire system fails.

A failed node needs to be recovered to restore the previous level of fault tolerance for the system. The node must be restarted, reloaded with its replica of the distributed database, and updated so that the state of its database becomes consistent with the other replicas. The correct nodes need to know which nodes have failed. With pessimistic propagation algorithms, the entire system can be blocked until the node has recovered, since the transaction commit waits for the failed node to reply. Optimistic propagation does not rely on all replicas being available and locked during an update transaction.
Distributed systems and databases may also become partitioned, resulting in isolated sub-parts of the database that cannot communicate with each other. In pessimistic replication systems, partitioning blocks update transactions, since global locks cannot be acquired while the nodes of the other partition are unavailable. In optimistic replication systems, network communication is not directly involved in the transaction, since there is no distributed commit among nodes, and the nodes of the system continue to operate individually in each partition. During network partitioning, the partitions are read and updated independently of each other, so that the partitions and their (groups of) nodes become mutually inconsistent across partition borders. At reconnect, the database must be made consistent again by reconciliation (Bernstein, Hadzilacos & Goodman, 1987), for which there must be a conflict detection mechanism and conflict resolution policies to resolve the conflicts that are detected.
2.7 The DeeDS prototype
DeeDS (Andler et al., 1996) is a unique integration of several advanced concepts, such as active functionality, distribution, and real-time database support with hard and soft deadlines. A few selected principles guide the research on DeeDS. The goal is to develop a distributed database system with real-time properties, and for this, predictability is essential.
Key features of DeeDS are:

• Main-memory residency. There is no persistent storage on disk, which removes the unpredictability of disk access.
• Optimistic and full replication, which supports real-time properties at each local node and makes the system independent of network delays and network partitioning.
• Recovery and fault tolerance supported by node replication: failed nodes may be recovered in a timely fashion from identical node replicas.
• Active functionality, with rules that have time constructs.
By reducing or eliminating sources of unpredictability, predictable execution times, and thereby timeliness, can be achieved. By avoiding storage on disk and not relying on the propagation time of the network, sources of unpredictable delays are removed. All database accesses are done locally at the node where a transaction entered the system, so that no transaction needs to access other nodes during its execution. With local, main-memory accesses only, predictability is more easily ensured and the transaction execution time is much easier to predict.
With the full database locally available, there is no need for pessimistic synchronization mechanisms, since the local database replica is just as important as any other replica in the system. A transaction runs entirely on the local node, with no need for unpredictable remote data accesses. As a consequence of full replication, instead of preventing inconsistencies through one-copy serializability of the entire database system, mutual inconsistencies between independently updated database replicas are detected when the replica contents are propagated between nodes. This can be done independently of the actual execution of the real-time transaction; thus, the transaction is not limited by the propagation delays that will occur. As optimistic replication results in temporarily inconsistent replicas, applications need to be tolerant to such inconsistencies.
By recovering nodes from other main-memory nodes, timely recovery is supported, since no unpredictable disk access is needed (Leifsson, 1999). Since all nodes have the full database, a recovering node may be fully restored by copying the database contents from another node. However, recovery gives no timeliness guarantees for the node that is recovering.
Chapter 3
The Partial Replication Problem
3.1 Replication and the driving scenario
In a fully replicated distributed database, replicas of the entire database are available at every database node. All clients at each node may locally read or write any information in the database. Real-time execution is supported, since all execution is done on locally accessible data, giving predictable timeliness of data access. Clients with unknown requirements on the data can be added to the database system without restrictions, since all data is available at each node and is consistent to a certain (known) degree. This flexibility comes at a high cost in storage requirements, communication and data synchronization: a large replication effort is required to make the data mutually consistent between all database replicas at the nodes in the system.
The work in this dissertation aims at solving the scalability problem of fully replicated distributed databases and at reducing the replication effort, by dividing the fully replicated database into segments (partitions of the database that act as units of allocation of replicas), which can be replicated individually based on the replication requirements specified by all the clients at a certain database node. If the specification indicates that a segment will never be used by any client at a node, the segment does not need to be replicated to that node. A certain database segment may thus be unavailable at a node, but the clients at the node do not need to be aware of that, because they will never access it. The clients' assumption of full replication of the database is still valid, and we call this virtual full replication (Andler et al., 1996).
3.1.1 The WITAS project and DeeDS
The WITAS project (Doherty et al., 2000) aims at developing Unmanned Autonomous Vehicles (UAVs) that can be given high-level commands for surveillance missions, autonomously fly to a site to collect information, and later return and report the results of the mission. Besides the flying vehicles, there are also ground-based vehicles for communication and coordination, together with a central Command Center. The communication between the aerial and ground-based vehicles and the Command Center is required to have real-time properties, which can be supported by the DeeDS real-time distributed database system. Thus, DeeDS is suitable as a tool for communication between the vehicles and the Command Center (the participants), and it has been selected for use in communication between simulations of UAVs and ground vehicles (Brohede, 2001).
Figure 2. A WITAS system (unmanned aerial vehicles communicating with mobile control units and a command center).
It is expected that a typical WITAS system will have many participants and that large amounts of data will be transferred between the participants through the real-time database. With full replication, this means that much data is replicated to many nodes without actually being used at those participants, resulting in a high replication effort compared to the actual usage of the replicated data.

By dividing the database into segments with individual real-time and consistency properties, replication may be done more selectively and the bandwidth requirement reduced, resulting in a more efficient database system overall. In DeeDS, time-bounded replication is available. When replication efficiency improves, the entire database may have tighter bounds on its replication: the database becomes mutually consistent within tighter time bounds, resulting in a distributed database with tighter consistency.
3.1.2 The scalability problem
A fully replicated distributed database scales badly due to its excess replication. In a
distributed database with n nodes, an update to one data object at one of the nodes
initiates an update at the other n-1 nodes. Thus, the replication effort for an update to
one data object is n-1, since the update must be replicated to all other nodes. Updating
s data objects at one node results in a replication effort of s·(n-1). We assume that an
increase in the number of nodes in the system results in a proportional increase in the
number of updates that the distributed database receives. The scalability of the system
is thus O(s·n) = O(n²). For databases with optimistic replication the situation is
somewhat better, since updates do not need to access all nodes in the system during the
transaction itself. Still, all data items must be replicated to all nodes at some point in
time, so the replication effort is the same: the amount of data to replicate is unchanged,
but transaction timing does not depend on the replication effort. With both segmentation
and optimistic replication there is a potential for a lower replication effort, since the
amount of data to replicate is smaller.
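As an illustration of this growth, the following Python sketch (illustrative only, not
part of any DeeDS code) computes the full-replication effort:

def full_replication_effort(n_nodes, n_updates):
    # Each of the n_updates local updates must reach the other n_nodes - 1 nodes.
    return n_updates * (n_nodes - 1)

# With updates proportional to the number of nodes (s = n), effort grows quadratically:
for n in (2, 4, 8, 16):
    print(n, full_replication_effort(n, n))   # prints 2, 12, 56, 240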
For systems with a small number of nodes, the amount of communication required for
full replication may not be a significant problem. Here, the simplicity achieved by
replicating the entire database is more valuable than spending time on specifying what
should be replicated where. For databases with a large number of nodes, we can
dramatically reduce communication needs and improve timeliness, at the cost of a small
amount of extra work in preparing the database replication allocation, by introducing a
way of specifying properties for segments of the database.
3.1.3 Desired segment properties
Intuitively, there are some parameters that can be used to define segment properties
that influence the replication effort:

- Degree of replication – how many replicas of the segment are actually needed in the
system, and to which nodes they are allocated. This depends on where the segment is
actually used and on how many replicas are needed for recovery and for a minimum
degree of fault tolerance.

- Requirements on consistency – do we need to update segment replicas within the
execution of the transaction? How long can we wait before the database replicas must
be consistent? Applications that use the data may be tolerant to temporary
inconsistencies, enabling the database to use optimistic replication.
Without quantifying the specific gain of virtual full replication, it is clear that there is a
potential improvement in efficiency when segmenting the database. Much work has
been done already in partial distribution for distributed databases (Mukkamala, 1988;
Alonso, 1997). In this dissertation we show how a fully replicated distributed real-time
system can benefit from partial replication combined with support for real-time
properties, availability and scalability.
3.2 Major goal
The scalability problem described in 3.1.2 is the central problem of this dissertation.
There are large distributed real-time main memory database systems that use optimistic
replication strategies to improve scalability and require mechanisms for controlling
temporary inconsistencies to prevent database replicas from diverging. Our goal is to
introduce more selective replication of updates, so that a distributed database can serve
more clients and larger amounts of data, since data is replicated only to the nodes
that actually need it. With segments, we have the opportunity to apply different
consistency requirements to different segments, supporting both real-time requirements
and consistency requirements from clients of the system in the same framework.
To support virtual full replication, we explore how segments can reduce the
replication effort in real-time distributed main-memory databases, by specifying
database client requirements on data. With virtual full replication, database clients’
assumption of full database replication still holds, since clients are not aware of the fact
that the database is segmented. Excess data redundancy is reduced since unused data is
not replicated and stored.
By differentiating consistency classes, the database clients’ requirements on data
consistency may be specified in a structured way. The consistency specification enables
the database system to replicate data with urgent consistency needs before data that does
not need the same level of mutual consistency with other replicas. Also, memory can be
saved at a node that does not physically store the full database.
One of the properties of a segment could be the importance (by some ordering) of the
segment, which enables recovery to restore the most urgent segments before other
segments, possibly making the system ready for operation sooner than if the whole
database needed to be recovered before operation of a node. Another property of a
segment could be the media for storage, which creates an opportunity for storing or
swapping segments to other media than main memory in a virtual memory style, even
further reducing the physical memory needs at a node. Segments may also be specified
as not having any real-time properties, making it possible to use disks and network
communication for retrieval of the segment.
We relate the different segment properties of consistency, node allocation, real-time
properties, main-memory presence, recovery order etc. in a hierarchy of segment
properties, which is useful for our driving scenario requirements.
3.3 Subgoals
To evaluate segment-based replication, a number of different steps have been taken. We
have studied current approaches to similar problems, developed a framework for
scheduled replication of segments with segment properties and used the framework in a
design for a prototype.
3.3.1 Partial replication in current systems
We investigate current distributed databases to find solutions with architectures,
protocols and algorithms that may suit a distributed real-time main memory database
system. Ideas and concepts from existing solutions are considered, in particular
concepts in systems with optimistic replication. We search for and study replication at
a finer granularity, concepts from file replication systems, partial replication, and
replication where properties can be defined.
Partial replication is a large research area where many classical concepts for replication
control already exist, mainly for immediate consistency systems. The investigation of
existing work focuses on related topics in areas where existing solutions with real-time
properties and with optimistic replication are available and where the allocation of
partitioned and replicated data is not required to be known by applications, since we
want to support virtual full replication.
3.3.2 Segment properties and consistency classes
We differentiate a few segment properties. Our aim is not to develop a full structure of
possible segment properties and consistency classes, but to find possible properties for a
typical system that is able to make use of segments as replication units.
New concepts have been defined within the area of segmentation. Consistency classes
are used to specify both the usage patterns of clients and the properties of the
accessed segments.
3.3.3 Prototype development
A prototype design and an evaluation model are developed for evaluating the possible
advantages of segmentation. The prototype design describes how such a system would
need to be built in the DeeDS database system. What is essential is that conclusions can
be drawn about the benefits segmentation gives and the difficulties encountered in an
implementation. In particular, it is important to be able to draw conclusions about
replication efficiency, communication and scalability.
Chapter 4
Segmentation in Distributed Real-Time
Databases
In this chapter we introduce segments in distributed real-time databases for improved
scalability and lowered replication effort. We elaborate on how segments are set up and
present concepts for how segments and their properties can be defined, based both on
data access patterns and on application semantics for data requirements. Finally, we
discuss requirements for scheduled replication in a framework for segments of mixed
consistency.
4.1 Segmentation of the database
We divide a fully replicated database into segments to introduce granularity for
selective replication, allowing virtual full replication (Andler et al., 1996). A segment is
a group of data objects that share properties. All segment replicas are intended to be
identical copies, regarding data objects, data content (possibly temporarily inconsistent)
and properties of the segment. The fact that data objects in a segment have common
properties means that a segment captures some aspects of the application semantics,
which can be exploited to refine replication control and achieve more efficient
replication. Examples of segment properties are the required replication degree and the
allocation of the segment, which let us replicate the segment to certain nodes only,
instead of to all nodes as is the case with full replication.
Another important segment property that we use is the supported degree of consistency
between segment replicas. With segments and segment properties, a more selective and
concurrent replication is possible. Our solution has some resemblance with grouping of
data as described in (Mukkamala, 1988). However, instead of allocating data to the
most optimal locations only, we allocate data to all nodes where it is used. The
remaining sections of 4.1 introduce the principles of segmentation, section 4.2 presents
properties for segments and section 4.3 introduces a syntax for specifying properties for
specific segments.
4.1.1 Introducing segments
To achieve virtual full replication we need a granule for the allocation of groups of data
that share properties. With segments of data, we can choose where to allocate groups of
data, so that data is allocated only to nodes where used.
Segments have two key characteristics:

- Containers for data object properties – all data objects within the segment share
the same properties.

- Units of allocation – a segment is allocated to a specific subset of all nodes.
[Figure omitted: a database divided into segments X, Y, W and Z]
Figure 3. Segmenting the database
When a database is segmented, we introduce granularity into the database, which
enables us to treat different subparts of the original database differently based on the
different properties they have. In particular, we can replicate the different segments
according to different policies and we can choose different allocations for different
segments of the database. A fully distributed database has the entire database replicated
to all nodes.
[Figure omitted: three full database replicas]
Figure 4. A replicated database with three replicas
We differentiate Database replicas from Segment replicas. In a replicated database with
full replication there is one full database replica at each node, containing the entire
database and the database is the unit of allocation. In a replicated and segmented
database there may be several segment replicas allocated to a node, which are replicated
independent of each other based on their individual properties. This reduces excess
replication and communication effort required to replicate updates to data. A replicated
and segmented database can be de facto fully replicated at some nodes, as a consequence
of all segment replicas being allocated to the same node (as with Node 1 in Figure 5).
[Figure omitted: Node 1 holds replicas of segments X, Y and Z; Node 2 holds a replica
of X; Node 3 holds replicas of X, Y and Z]
Figure 5. A replicated and segmented database
A replicated and segmented database, according to the principle that each segment is
replicated only to nodes where it may be accessed, is called a virtually fully replicated
database. From the viewpoint of the application, such a database cannot be
distinguished from a fully replicated database. This implies that all transactions can be
run locally on each single node.
4.1.2 Assumptions
In this dissertation, we assume the following for segment-based replication:

• Data objects in a segment use the segment properties of degree of replication,
allocation, timeliness, consistency, etc., which are assigned to the segment. A data
object can only be assigned to one segment at a time.

• For every segment replica, there can be several users of the data. Segment access
can be done by concurrent transactions of different clients.

• The replication degree, r, for a specific segment is 1 ≤ r ≤ n, where n is the total
number of nodes.

• In this work, the number of segments, their allocations and their properties are
assumed to be fixed throughout the execution of the database.
4.1.3 Definitions
The following definitions for a segmented and distributed database are used:

• The size of a segment is the number of data objects in the segment. The span of a
data object is the set of nodes that the data object is allocated to, and the span of
a segment is the set of nodes that the segment is allocated to. The span describes not
only the degree of replication, but also the specific node allocation of a segment.
The database size is the sum of the sizes of the segments in the database. The spans
of segments and data objects are used for allocating segments to nodes and data
objects to segments.

• An application uses the data of the database to implement a system and may be
distributed. Within an application, there are processes that access segment replicas
to execute a function within the application. Processes have properties related to
their function, e.g. a control process may have real-time properties. The database is
accessed by means of transactions, which read or write data in segments and have
implicit properties depending on the process using them. Within each segment there
are data objects, which belong to one segment only. We also use the more general term
client for any entity that accesses the database and has requirements on properties
of the data.

In a WITAS context, the WITAS system as a whole may correspond to an application,
while the UAVs may contain one process for flight control and another for storing
tracking data, and the mobile unit has processes for interpreting tracking data and
for reporting to the Command Center. In such a system, there are segments shared
between processes at different nodes, e.g. the tracking data segment.
[Figure omitted: Application 1 with processes 1 and 2 accessing segments X, Y, W and Z
across Nodes 1 and 2; Application 2 with a process at Node 3]
Figure 6. Applications and processes in segmented databases
4.1.4 Setup and allocation of segments
The purpose of introducing granularity is to limit the replication effort while still
locally supplying all the data used by clients. In this dissertation, we present several
related ways of segmenting the database. Here we present a basic segmentation
algorithm, which is the basis for segmentation with segment properties. For our basic
approach to defining segments, we use access information about which data objects are
accessed at which nodes, which originates from a manual analysis of the data objects
accessed by processes and their transactions. From this information we can set up
segments and an allocation schema for these segments. Data objects that are accessed
at several nodes must be assigned to a segment that is available at all those nodes.

An algorithm for finding the allocation schema for segments may look like this:
1. Initialize an empty list of segments.
2. Initialize an empty list of data objects.
3. For every data object di of the database, list the nodes nj at which the data
object is accessed by database clients, di = {nj, …}, and add it to the list of data
objects.
4. Initialize a spanning size variable, s = number of nodes. This is the maximum
degree of replication for any segment in the system.
5. Find the data objects in the list that are accessed at s nodes. Define a new
segment gk for each combination of nodes that has data objects spanning it. (Note
that there may be combinations of nodes with no objects spanning them.)
6. For each new segment defined:
   a. Add the segment to the list of segments.
   b. Set up an allocation table that expresses the node allocations for that segment.
   c. Remove the elements di that were used for creating the segment from the list of
   data objects. Data objects can belong to one segment only.
7. Decrease s by 1.
8. Repeat from step 5 until s = 0.
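A minimal Python sketch of this algorithm, with hypothetical names (the dissertation
defines the algorithm in prose only), could look as follows:

from collections import defaultdict

def build_segments(access):
    # access: data object id -> set of nodes where the object is accessed (step 3)
    segments = defaultdict(list)             # span (frozenset of nodes) -> objects
    remaining = dict(access)
    n = len(set().union(*access.values()))   # step 4: s starts at the number of nodes
    for s in range(n, 0, -1):                # steps 7-8: decrease s from n down to 1
        for obj, span in list(remaining.items()):
            if len(span) == s:               # step 5: objects spanning exactly s nodes
                segments[frozenset(span)].append(obj)  # steps 6a-6b: one segment per span
                del remaining[obj]           # step 6c: an object belongs to one segment
    return dict(segments)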
For segments that require a minimum replication degree higher than the actual number
of nodes at which replicas are accessed, the allocation schema needs to be extended
with ghost replicas, which reside at a node without being accessed at that node. A
general lowest replication degree may be defined; to guarantee it, the following needs
to be executed:
1. Define the minimal replication degree required.
2. For each segment in the list of segments:
   a. Where the number of nodes allocated to the segment is smaller than the general
   replication degree, add allocations of the segment to other nodes (by adding nodes
   to the list for gk) until the required replication degree is fulfilled for the
   segment.
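Continuing the sketch above, ghost replicas can be added by extending each span until
the minimum degree is reached; the node choice here is arbitrary, whereas a real policy
might pick, e.g., the least loaded node:

def add_ghost_replicas(segments, min_degree, all_nodes):
    # segments: span -> objects, as returned by build_segments. Returns (span, objects)
    # pairs so that a segment whose extended span happens to coincide with another
    # segment's span is still kept separate.
    result = []
    for span, objs in segments.items():
        span = set(span)
        spare = sorted(n for n in all_nodes if n not in span)
        while len(span) < min_degree and spare:
            span.add(spare.pop(0))           # ghost replica; placement policy may vary
        result.append((frozenset(span), objs))
    return result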
The explicitly required degree of replication can be defined for many reasons, e.g.
tolerance to node failures, data availability and database efficiency. For this algorithm
we have chosen to use a general minimum replication degree for all segments, but
additional replicas could also be assigned individually per segment, as we do in our
more advanced approaches below, where the minimum degree of replication can be
specified for each segment individually.
The algorithm loops through 2ⁿ − 1 combinations of nodes, but the list of data objects
shrinks as objects are assigned to segments. The complexity of the algorithm is O(2ⁿ).
Consider the following example. We have a fully replicated database with three nodes
N1, N2 and N3 where the database contains the elements A,B,C,D,E,F,G,H,I. When
analyzing the access patterns we discover which database elements are used at which
nodes: A:N1,N2,N3; B:N1,N2; C:N3; D:N1,N3; E:N1,N3; F:N2; G:N2; H:N3 and
I:N1,N3. Now we can create segments for the data. Detecting which data objects span 3,
2 and 1 nodes gives the following segments: for data objects that span all n (= 3)
nodes we have segment g1:N1,N2,N3:A. For 2 nodes, we have segments g2:N1,N2:B;
g3:N1, N3:D,E,I and for 1 node, we have segments g4:N2:F,G; g5:N3:C,H. In total we
have 5 segments. The allocation table for each of the segments will look like:
g1:N1,N2,N3; g2:N1,N2; g3:N1,N3; g4:N2; and g5:N3. If we now have a requirement for
a replication degree of 2, we need to extend the allocation table with additional
allocations for segments g4 and g5 since they are allocated to only one node each. Since
there is no prerequisite for any node, in this example we can choose node N1 for both,
resulting in new allocation entries for segments g4:N2,(N1) and g5:N3,(N1). However,
allocation of ghost replicas may be done according to some other allocation policy, e.g.
on the least loaded node or on the fastest node.
With a fully replicated system, an update to any of the data objects would require 2
operations (one update is replicated to the two other nodes). Updating each one of the
9 data objects once would thus require 2·9 = 18 operations. With the segmented database
without ghost replicas, one update to each of the elements requires 2+1+1+1+1 = 6
operations (2 for A and 1 each for B, D, E and I; single-node segments require none);
with ghost replicas we would need 10 operations.
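Running the example through the earlier sketches reproduces these numbers
(illustrative only):

access = {
    "A": {"N1", "N2", "N3"}, "B": {"N1", "N2"}, "C": {"N3"},
    "D": {"N1", "N3"}, "E": {"N1", "N3"}, "F": {"N2"},
    "G": {"N2"}, "H": {"N3"}, "I": {"N1", "N3"},
}
segs = build_segments(access)                # the five segments g1..g5 above
# one update per object costs (span size - 1) remote operations:
print(sum((len(s) - 1) * len(o) for s, o in segs.items()))    # 6
ghosted = add_ghost_replicas(segs, 2, {"N1", "N2", "N3"})
print(sum((len(s) - 1) * len(o) for s, o in ghosted))         # 10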
4.1.5 Distributing the replication schema
By storing a copy of the segment allocation information within each segment replica,
the propagator at each node knows to which other nodes an update for a particular
segment needs to be sent. The propagation of the replication schema itself is done with
a pessimistic two-phase-commit distributed transaction to get immediate consistency
between nodes, since all nodes must agree consistently on where segment replicas are
allocated. Optimistic replication of similar meta-information for selective replication
(status vectors) is however possible and is used in the Ficus file system (Ratner, Popek
& Reiher, 1996). We reduce complexity by replicating segment tables with immediate
consistency.
4.2 Properties of segments
In 4.1, we divided a replicated and fully distributed database into segments based on the
actual usage of data. In addition to this access-based replication, we may use other
differences in the requirements that the application has on the different segments, to
replicate even more selectively and in a timely manner according to the consistency
requirements of clients. To be able to specify the requirements from an application on
segments, we
need to differentiate a set of possible properties that segments may have, which we can
use for the specification.
We define properties that could reduce the replication effort. The intention is to find
properties and consistency classes that are useful in the driving scenario, not a complete
class structure. The chosen properties are used in a syntax for the specification of an
application, so that an analysis can be made of the effect that the properties have on
replication for that application.
4.2.1 Chosen subset of segment properties
We have selected the following segment properties, which we expect to have an
influence on the replication effort. The segment properties are used for specifying each
segment individually.
• Degree of replication – we want to keep the degree of replication low to prevent
redundant replication, since data should not be replicated to a node where it is not
used. To support virtual full replication, we need local availability of data at the
nodes where it is used, which is also the degree of an optimal replication of data.

• Recovery – we want to support more efficient recovery. When defining time
constraints for segments, we know which segments are more time critical and probably
need to be recovered before others. For each segment, we also want to define the
media we recover from, so that we know whether we can expect real-time properties for
the recovery of the segment.

• Residence of segment – for segments stored in main memory we can support
predictable access times, but for swappable segments or segments stored on disk this
is not possible. Thus, knowing the storage media of a segment tells us whether we can
expect timely access to it.

• Consistency classes – knowledge of the different requirements that applications
have for mutual consistency between replicas can be used to schedule replication, to
support a mix of consistency requirements.
A lower degree of replication reduces communication and memory requirements, while
relaxed consistency reduces communication and delays during a transaction.
When relaxing the mutual consistency requirements for replicas, the communication
effort for replicating updates is reduced. A fully consistent replicated database with a
two-phase-commit replication policy requires 4 messages for each node to be updated,
while an approach that allows eventual consistency requires only 1 message to propagate
the update. Again consider the example in 4.1.4. For a fully replicated, fully
consistent database using a two-phase-commit protocol, we need 9 · 2 · 4 = 72 messages
to update the whole database (with a point-to-point implementation of the multicast
operation), which is almost an order of magnitude more than a segment-based, partially
replicated database with eventual consistency, which would require at most 10 messages.
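Under the stated assumptions (point-to-point multicast, 4 messages per remote node for
two-phase commit, 1 message per remote replica for eventual consistency), the counts
can be checked directly:

objects, remote_nodes = 9, 2
print(objects * remote_nodes * 4)   # 72 messages: fully replicated, two-phase commit
# Segmented with eventual consistency: 1 message per remote replica, i.e. 6 messages
# without ghost replicas and at most 10 with them (see 4.1.4).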
For reasons of replication efficiency, we consider the consistency class of a segment
to be one of the most important segment properties. Segments with different consistency
classes need to be replicated differently.
4.2.2 Requirements on consistency classes
A real-time system is often a mix of components with different real-time requirements.
A core of the system may have hard real-time constraints, while other parts will not
benefit from using a database with hard real-time features. In the real-time database
used in our work, we gain performance and predictability at the expense of consistency.
Real-time oriented tasks of the system need timely execution, also when they execute
transactions on the database. For some real-time tasks it may also be necessary to know
the worst-case propagation time for an update to reach the other nodes. However, a
database application may also need full consistency to perform its task correctly.
Thus, there is a clear need to support several classes of consistency-efficiency
tradeoffs. Adly, Nagi & Bacon (1993) have
presented a protocol for supporting different levels of asynchronous replication,
allowing strong and weak consistency to be integrated in the same framework. Our
solution adapts the idea of classes of consistency in the same framework by using
properties of segments for supporting both immediate and eventual consistency between
segment replicas.
For a distributed and replicated real-time database system, where we have requirements
for mixed consistency classes, the replication mechanism of the database system thus
needs to support simultaneous replication of updates for several classes. For this we
need to have a scheduling mechanism for the replication in the database system. Since a
global commit protocol (which supports full consistency) will lock all involved data
objects during the propagation of the update, other consistency requirements for the
same data object are not meaningful. We choose to assign consistency requirements at
the segment level and therefore objects within a segment all have the same consistency
properties. A node may have several segments of the database allocated to it and these
segments may each support a different consistency class.
4.2.3 Consistency classes
In this dissertation, we differentiate three consistency classes, with varying degrees of
consistency relaxation (immediate or eventual) and varying degrees of real-time
properties (as-soon-as-possible or bounded).
Immediate consistency (the immediate class) has serialization as correctness criterion,
expressed as one-copy-serialization for replicated systems (Helal et al., 1996). Replicas
of this consistency class are always mutually consistent at any point in time, and an
update to a replica will not be visible at any node until all the replicas have been
updated. Since an update is coordinated between replicas, it may take an unknown time
to replicate an update, as the coordination may be delayed by network partitioning or
failed nodes. At the time the update is propagated and committed, immediately
consistent segments will be globally consistent.
Eventual consistency (Birrell et al., 1982) segments let the transaction commit locally,
i.e., an update can be made visible on the local node (and each updated node) even
though all the replicas have not yet been updated. The replication (propagation and
integration) of updates may take place after the local commit, so the predictability of the
transaction on the local node does not depend on any execution at other nodes or the
communication with them.
For eventual consistency with as-soon-as-possible replication (the ASAP class)
(Gustavsson, 1995), the replication takes an unknown time to update the other
replicas. The execution time of the transaction with local commit can be predicted,
while the propagation time cannot.
For eventual consistency with bounded replication (the bounded class) (Lundström,
1997), the propagation and integration time is bounded so that replication can be
predictable. The time bound expresses the time it takes to propagate the effect of the
transaction to all nodes that have a replica of the data and integrate the changes
(including conflict detection and resolution).
There are several important differences between the consistency classes. Only segments
that have a predictable local execution time for transactions can support real-time
constraints of the clients. The price for this local timely execution is temporarily
inconsistent replicas, where the inconsistency lasts from the moment the local
transaction commits until the update has been replicated to all other segment replicas.
Applications
using an eventually consistent database must be tolerant to temporarily globally
inconsistent data. Using bounded replication will enable a time bound for the updates to
be replicated, which means that the mutual consistency in an eventually consistent
database may be re-established within a guaranteed time, if no additional transactions
enter the system.
The following table summarizes the key attributes of the consistency classes used.
Table 1. Key attributes for consistency classes

                                        Immediate        Eventual with     Eventual with
                                        consistency      ASAP replication  bounded replication
Real-time supported / bounded
(application)                           No               Local             Global
Tolerance to inconsistencies
required (application)                  No               Yes               Yes
Tolerance to partitioning (system)      No               Yes               No
Scalability (system)                    Bad              Better            Better
Usage constraints (system)              Blocking risk    Needs tolerant    Needs real-time
                                        at partitioning  applications      network
4.3 Specifying segment properties
We have described how the database is divided into segments (4.1) and we have defined
possible properties that segments can have (4.2). Now we can specify the properties of
segments in a particular database. We need to base the specification on knowledge
about the application semantics. Some clients are real-time oriented, while others may
have full mutual consistency as a requirement for the data stored in the database. The
real-time clients may also have different types of deadlines. The properties of the
clients need to be specified, so that the segments they access fulfill the needs of the
application. In this section we give a syntax for specifying an application and the
properties of segments used in the application. We also give an algorithm for checking
the validity of a specification of a system.
Other approaches for specifying consistency dependencies between replicas of data can
be found in Wiederhold & Qian (1987) and Alonso, Barbara & Garcia-Molina (1990).
Wiederhold & Qian introduce the identity connection for capturing inconsistencies
between data replicas. Alonso et al. use predicates to express the relation between
replicas. Our solution is based on a specification of the properties of a segment, which
is an indirect way of describing replica dependencies; there are fewer dependency
relations to describe due to the grouping of data objects.
[Figure omitted: Clients 1–5, with global time constraints, local time constraints or
full consistency requirements, accessing segments X, Y, W and Z at Nodes 1–3]
Figure 7. A replicated and segmented database accessed by clients with different requirements
4.3.1 Specifying segments using the application properties
We choose segments as the granule for the property specification of data objects. One
can argue that the database semantics originate from the application, and that a
consequence would be to specify the application and derive the object properties from
it. However, the semantics also include what data is shared and which clients share it.
From our algorithm in 4.1.4 we get segments based on data access, and the resulting
segment allocation table shows which data objects are shared between clients. When also
taking other segment properties into consideration, we need to match this allocation
knowledge with the other segment properties from the application.
Our restriction that each data object belongs to one segment only, and thus has one
unique combination of segment properties, is required for replicating updates of mixed
consistency in the same framework. For example, if one client requires immediate
consistency on the same data object that another client accesses with bounded
consistency, the immediate consistency access could block the bounded access,
effectively preventing the bounded transaction from being bounded in time. Thus, we let
each data object have one set of properties and then group data objects with the same
properties into segments. In this way a data object can have only one set of
properties, which is important for replicating the data so that its properties are
supported.
To validate the requirements from the application on the specification of segments, we
need to specify properties of the application. The combination of the segment
specification and the application requirement specification needs to be matched, which
is done offline, prior to execution.
The consistency requirements of an application can only be supported if the accessed
segments have properties that support them; e.g., if a real-time client has a
transaction that accesses a segment with immediate consistency, the timeliness of the
client can no longer be guaranteed.
4.3.2 Syntax for specifying applications and segments
To be able to match the segments in a database with the requirements from the
application, we use a syntax to describe the application requirements and the segmented
database properties. The syntax as expressed here is meant to be extensive enough to
analyze the effects of using segments with properties in a distributed database, as
defined in 4.2.1 and 4.2.3, but it is also extensible for additional properties. We
connect properties with the entities defined in 4.1.3.

Once the segments have been automatically created for us, we can specify properties of
the segments. We can either use the syntax to manually specify properties for segments
that have already been set up automatically based on access patterns, or we can
automatically set up segments with properties based on the specification, using our
algorithm in 4.1.4.
4.3.2.1 Specification of applications and transactions
We choose to associate the requirements from the application with processes, since we
want to allow applications to have several threads of execution with separate
properties. An alternative could have been to connect all application requirements to
the application entity, but in that case the entire application is given the same
properties. By pointing out which transactions processes use, and by specifying the
accesses by transactions to segments, we can connect processes with segments and
thereby check that the requirements from processes (and the application) match the
properties of the segments they access.
The application and segment specification is not only used for matching segments with
application requirements prior to database execution. It is also used during execution by
the segment allocation and replication scheduling and could also be used for recovery of
segments.
Syntax specification:
Notation:
RESERVED WORD
entity name
{} option, alternative
[] default
| alternative
: one-many relation
Syntax:
PROCESS = processid
NODE = nodeid
TOLERANT = {[YES] | NO}
TIMECONSTRAINT = {LOCAL | GLOBAL | [NO]}
: TRANSACTION = transactionid

processid = [id to address a process of the application,
application-wide name scope]
nodeid = [id to address the node where the process executes,
system-wide name scope]
transactionid = [id to address a transaction, processid name scope]

Note: transaction timebound, local <= global
Note: transactions are always addressed within a process name
scope; a pair [processid, transactionid] is needed to address a
transaction globally
Note: TOLERANT = NO and TIMECONSTRAINT = {LOCAL | GLOBAL} are not
compatible

TRANSACTION = transactionid
: OBJECT = objid
Explanation:
A process in the application uses a number of transactions, which have the
characteristics of the process using them. The process may or may not be tolerant to
temporarily mutually inconsistent replicas. The process may be constrained by time in
its execution, either locally or globally. If the process is globally time constrained,
there needs to be a bound on the propagation of database updates made by the process.
Example:
PROCESS = process2 NODE = node4 TOLERANT = YES
TIMECONSTRAINT = LOCAL
: TRANSACTION = trans1
: TRANSACTION = trans2
: TRANSACTION = trans3
4.3.2.2 Specification of segments
Syntax:
SEGMENT = segmentid
REPLICATION = degree
CONSISTENCY = {IMMEDIATE |
EVENTUAL {[ASAP] | BOUNDED WITHIN timeout}}
CONTAINER = containerid
STORAGE = {[MEMORY] | SWAPPABLE | DISK}
RECOVERFROM = {[PEER] | DISK | NONE}
: ALLOCATION = nodeid
segmentid = [id to address a segment, system-wide name scope]
degree = [replication degree], default 1
timeout = [relative time in ms, from transaction start]
containerid = [id to address a container, system-wide name scope]
nodeid = [id to address a node, system-wide name scope]
Explanation:
We have the option to specify a higher degree of replication for segments than we get
from the initial segmentation (according to 4.1.4). The segment has consistency
properties and for eventual consistency segments with bounded replication, the
maximum time for replication is specified. STORAGE defines where the segment is stored
and RECOVERFROM defines whether the segment can be recovered from a peer node (a
segment replica in main memory at another node), from permanent disk storage, or not at
all. ALLOCATION specifies nodes where we explicitly want to allocate the segment, since
we may want additional segment allocations at certain nodes for some reason (e.g. ghost
replicas that need to be stored at safe nodes), but usually allocation is a result of
the data access patterns used in the preceding segmentation step.
Example:
SEGMENT = segmentX
REPLICATION = 3
CONSISTENCY = EVENTUAL BOUNDED WITHIN 2000
CONTAINER = containerZ
STORAGE = MEMORY
RECOVERFROM = PEER
: ALLOCATION = node1
: ALLOCATION = node3
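To make the specification concrete, a segment entry could be represented in memory
roughly as follows; this is a hypothetical Python representation whose field names
mirror the syntax but are not part of the dissertation's design:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SegmentSpec:
    segment_id: str
    replication: int = 1                   # REPLICATION, default degree 1
    consistency: str = "EVENTUAL_ASAP"     # IMMEDIATE | EVENTUAL_ASAP | EVENTUAL_BOUNDED
    timeout_ms: Optional[int] = None       # WITHIN timeout, EVENTUAL_BOUNDED only
    container_id: Optional[str] = None     # CONTAINER
    storage: str = "MEMORY"                # MEMORY | SWAPPABLE | DISK
    recover_from: str = "PEER"             # PEER | DISK | NONE
    allocation: List[str] = field(default_factory=list)   # : ALLOCATION entries

segmentX = SegmentSpec("segmentX", replication=3,
                       consistency="EVENTUAL_BOUNDED", timeout_ms=2000,
                       container_id="containerZ", allocation=["node1", "node3"])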
4.3.2.3 Specification of transaction access to segments
Syntax:
ACCESS PROCESS = processid TRANSACTION = transactionid
: SEGMENT = segmentid
segmentid = [id to address a segment, system-wide name scope]
processid = [id to address a process, application-wide name scope]
transactionid = [id to address a transaction, process-wide name
scope]
Explanation:
We need to match the segment attributes that we have specified with the actual accesses
made by transactions, so that we can judge what properties the application process can
have. An example is a real-time process, which is not allowed to access an immediate
consistency segment, since its real-time properties cannot be guaranteed in that case.
With the keyword ACCESS, we list all segments that a transaction accesses. We specify
segment access per transaction, rather than per process, since a process uses several
transactions to access separate segments, and the same transaction may be used by
several processes.
Example:
ACCESS PROCESS = process1 TRANSACTION = trans1
: SEGMENT = segmentX
: SEGMENT = segmentY
: SEGMENT = segmentZ
ACCESS PROCESS = process1 TRANSACTION = trans3
: SEGMENT = segmentY
4.3.3 Consistency and tolerance classes
As stated in 4.3.1, the supported properties of a process depend on the segments that
its transactions access. For processes that access segments associated with different
consistency classes, we need a way to relate the process properties to the properties
of the accessed segments. From the consistency properties described in 4.2.3, we define
a consistency class hierarchy.
Consistency Class Hierarchy (CCH):
PROCESS is TIMECONSTRAINT = GLOBAL:
Only segments of the EVENTUAL BOUNDED class are accessed by the transaction, and all
the accessed segments have STORAGE = MEMORY. This allows the transaction the property
of TIMECONSTRAINT = GLOBAL. If all transactions of a process have this property, the
process has the same property,
else
PROCESS is TIMECONSTRAINT = LOCAL:
Only segments of the EVENTUAL BOUNDED or EVENTUAL ASAP classes are accessed by the
transaction, and all the accessed segments have STORAGE = MEMORY. This allows the
transaction the property of TIMECONSTRAINT = LOCAL. If all transactions of a process
have this property, the process has the same property,
else
PROCESS is TIMECONSTRAINT = NO:
Segments of the EVENTUAL BOUNDED, EVENTUAL ASAP or IMMEDIATE classes can be accessed
by the transaction when TIMECONSTRAINT = NO.

Since we use no consistency classes other than EVENTUAL BOUNDED, EVENTUAL ASAP and
IMMEDIATE, every process will have one of the time constraint properties
TIMECONSTRAINT = GLOBAL | LOCAL | NO.
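The CCH can be checked mechanically. The following sketch, which reuses the
hypothetical SegmentSpec representation from the sketch in 4.3.2.2, derives the
strongest time constraint a transaction, and then a process, can be given:

def transaction_time_constraint(segments):
    # segments: the SegmentSpec objects accessed by one transaction
    classes = {s.consistency for s in segments}
    in_memory = all(s.storage == "MEMORY" for s in segments)
    if in_memory and classes <= {"EVENTUAL_BOUNDED"}:
        return "GLOBAL"
    if in_memory and classes <= {"EVENTUAL_BOUNDED", "EVENTUAL_ASAP"}:
        return "LOCAL"
    return "NO"

def process_time_constraint(transactions):
    # transactions: for each transaction, the list of segments it accesses
    levels = {transaction_time_constraint(t) for t in transactions}
    for weakest in ("NO", "LOCAL", "GLOBAL"):   # the weakest level present wins
        if weakest in levels:
            return weakest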
Consider the following example. We have processes C1-C5 and transactions T1 to T6
accessing segments of consistency classes EVENTUAL BOUNDED, EVENTUAL
ASAP or IMMEDIATE according to Figure 8. C3 will have a GLOBAL time constraint, since
only EVENTUAL BOUNDED segments are accessed. C1 and C4 can only have a LOCAL time
constraint, since each accesses an EVENTUAL ASAP segment. C2 and C5 will have NO time
constraint, since at least one segment of IMMEDIATE consistency is accessed.
[Figure omitted: transactions T1–T6 of processes C1–C5 accessing segments of the
immediate, eventual ASAP and eventual bounded consistency classes]
Figure 8. Processes, time constraints and consistency classes
We also need to define a similar hierarchy for supporting the property of
NON-TOLERANCE to inconsistencies:

Tolerance Class Hierarchy (TCH):

PROCESS is TOLERANT = NO:
Only segments of the IMMEDIATE class are accessed by the transaction. This allows the
transaction the property of TOLERANT = NO. If all transactions of a process have this
property, the process has the same property,
else the process must have the property TOLERANT = YES.

The processes specified in the system must match their specified properties of
TIMECONSTRAINT and TOLERANT, respectively, with their segment accesses, according to
the class hierarchies (CCH and TCH).
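The TCH check is analogous (same assumptions as the CCH sketch above):

def process_tolerant(transactions):
    # TOLERANT = NO is only possible if every accessed segment is IMMEDIATE
    for segments in transactions:
        if any(s.consistency != "IMMEDIATE" for s in segments):
            return "YES"
    return "NO"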
4.3.4 Manual segment setup by matching the specification
From an application (with processes) and segment specification we can create segment
allocation tables, containing information about at which nodes the segments are
allocated and the properties of the segments. The allocation schema is indirectly
distributed to the nodes as segment tables as described in 4.1.5. A segment table also
contains information about segment storage media and segment recovery. Note that data
components of complex objects, where values of data objects are combinations of other
data, or where we have inheritance or associations between objects, cannot be assigned
to different segments. All objects within a complex object need to be assigned to the
same segment to be consistently replicated.
A segment allocation schema can only be created once the usage constraints of the
specification are matched with segments without conflicts. Application semantic
knowledge is required for resolving match conflicts. We present two alternative methods
for segmenting the database into segments with properties. The first alternative is to
design segments with properties in an interactive, manual approach, using the matching
algorithm given in this section. This enables the database designer to explicitly
assign properties to segments. For this we use the entities Process, Segment and Access
from the application specification.
The algorithm for matching the segment properties with the process requirements is as
follows:
Matching algorithm and segment allocation table generation
We assume that we have created segments of the database, as described in 4.1.4.
1. For every process specified:
   a. For every transaction in that process:
      i. Compare the time constraints of the process specification with the access
      patterns of its transactions, so that the specified time constraint for the
      process matches the consistency class hierarchy (CCH) level of its segment
      accesses. If the requirements of timeliness cannot be fulfilled, we manually
      split the segment into several segments, one for each time constraint. To split
      a segment, we use the access information for individual objects, as in 4.1.4,
      with the extension that we also consider consistency classes for objects. If
      there are conflicting consistency requirements on the same data object, a match
      error message is given.
      ii. Compare the tolerance to temporary inconsistencies of the process
      specification with the access patterns of its transactions, so that the
      specified constraint for the process matches the tolerance class hierarchy (TCH)
      level of its segment accesses. If the requirements of non-tolerance cannot be
      fulfilled, we manually split the segment into several segments, one for each
      tolerance type, in the same way as in step i. If there are conflicting
      consistency requirements on the same data object, a match error message is
      given.
      iii. If there are no matching errors, define segment allocations at the nodes
      where segments are accessed and at nodes where explicit segment allocation is
      specified. The list of segment allocations is called the segment allocation
      table and contains all segments in the system with their properties.
2. For every segment in the list of segments where the required replication degree is
greater than the actual number of allocations, add ghost replicas according to some
chosen policy.
Once we have a matched specification, we know that the allocation table follows the
consistency specification for the application and matches the segments that we have
specified. We may now distribute the segment allocation tables to the nodes of the
distributed database.
Consider the following example. The segment table from the example in 4.1.4 is
g1:N1,N2,N3; g2:N1,N2; g3:N1,N3; g4:N2,(N1) and g5:N3,(N1). Now let entity A in g1
contain a number of different objects A1..A4. Say that for segment g1 we access data
objects A1 and A3 at node N1 from a process with NO time constraints: N1:A1,A3 (NO).
Other accesses are N2:A2,A4 (LOCAL) and N3:A2 (GLOBAL). We can split segment g1 into
new segments g1-1:A1,A3 (IMMEDIATE) and g1-2:A2,A4 (EVASAP). Since processes at N2 and
N3 have different requirements (N2:LOCAL, N3:GLOBAL) on element A2, we split segment
g1-2 further into two new segments g1-2-1:A2 (EVBOUNDED) and g1-2-2:A4 (EVASAP). We now
have three segments where properties are considered, instead of just one.
4.3.5 Automatic setup of segments with properties
Our second alternative for setting up segments with properties is to generate segments
automatically, by adding information about which data objects are accessed by
transactions. In this case there are no initial segments to connect with the
application requirements, so the Access entity of the specification cannot be used. For
our automatic segmentation algorithm we use the entities Process, Segment and
Transaction from the application specification.
1. For each object specified to be accessed at a node, compare the time constraint
requirements from the processes accessing it and select the least common constraint
that the node poses on the data object, following the consistency class hierarchy
(CCH): if at least one process requires a global time constraint at the node, the
consensus access for that object at that node is the global time constraint,
regardless of whether all other accesses require only a local time constraint. With
all processes requiring only a local time constraint, the object may be accessed with
a local time constraint at that node. With all processes requiring no time constraint
and not being tolerant, the object must be accessed with full consistency at that
node. With incompatible time constraints, the algorithm signals a conflict and exits.
After this step we know the requirements from each node on each data object.
2. Go through the data objects and compare the requirements on each individual object
from every node accessing it, using the Process-Transaction information. For
incompatible requirements (e.g. full consistency at one node and local consistency at
another), give an error message for the object and stop the algorithm. For a mix of
global and local consistency requirements on a data object, the consistency class is
eventual bounded. If all accesses have local consistency requirements, the
consistency class is eventual ASAP, and if all accesses have full consistency
requirements, the consistency class is immediate.
3. Group objects that have the same combination of <span, consistency> into segments,
in the same way as in the algorithm in 4.1.4, starting with the highest replication
degree. For each span combination, collect the data objects with the same span and
consistency class into the same segment.
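A sketch of steps 2 and 3, assuming a hypothetical input format in which each object
maps to its per-node consensus requirement from step 1 ('global', 'local' or 'cons',
as in Table 2 below):

from collections import defaultdict

def consistency_class(node_reqs):
    reqs = set(node_reqs.values())
    if "cons" in reqs and reqs != {"cons"}:
        raise ValueError("incompatible requirements")   # step 2: stop with an error
    if reqs == {"cons"}:
        return "immediate"
    return "bounded" if "global" in reqs else "ASAP"

def auto_segments(object_reqs):
    segments = defaultdict(list)
    for obj, node_reqs in object_reqs.items():
        key = (frozenset(node_reqs), consistency_class(node_reqs))
        segments[key].append(obj)    # step 3: group by <span, consistency>
    return dict(segments)

# e.g. auto_segments({"A0": {"N1": "global", "N2": "local", "N3": "local"}})
# yields {(frozenset({"N1", "N2", "N3"}), "bounded"): ["A0"]}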
Consider the following example. We extend the segment table from the example in 4.1.4
by replacing data object A with data objects A0-A3. For each node we find the consensus
requirement for each data object, according to step 1 of the algorithm, and list it in
the following table. Then, for each object, we need to find a consensus requirement for
the segment. For object A0, replication needs to be bounded, since there is a
requirement that object A0 must be globally consistent at node N1. We do the same for
every object. For object C we see that there are conflicting requirements, and the
algorithm stops with an error. If this conflict is resolved by making the access from
N2 immediate (or possibly by removing object C), the algorithm may continue. Now we can
combine all objects with the same combination of <span, consistency> into segments.
Table 2. Timeliness requirements for data objects at nodes

      N1            N2            N3            Class
A0    X (global)    X (local)     X (local)     Bounded
A1    X (local)     X (local)     X (local)     ASAP
A2    X (local)     X (local)     X (local)     ASAP
A3    X (local)     X (local)     X (global)    Bounded
B     X (cons)      X (cons)                    Immediate
C                   X (local)     X (cons)      !!! (conflict)
D     X (local)                   X (local)     ASAP
E     X (local)                   X (local)     ASAP
...
The resulting segment table will be:
r=3
g1a:N1,N2,N3:A0,A3 (bounded)
g1b:N1,N2,N3:A1, A2 (ASAP)
r=2
g2:N1,N2:B (immediate)
g3:N1,N3:D,E (ASAP)
g4:N2,N3:C (immediate)
…
As a result of considering the time constraints on the data objects, we have set up
segments with different allocations and consistency classes.
4.3.6 Segment tables
The result of matching segments and application requirements is a segment allocation
table. From this table we can extract segment replication tables: lists of tables for
each segment that are sent to all the nodes that have a replica of the segment. The
segment tables stored at each node that has segment replicas contain the information
necessary for replicating those particular segments, according to the properties of the
segments. This also includes the storage and recovery media, since timeliness depends
on these parameters.
A segment table contains the following entries:
Segment ID – a system-wide unique identifier
List of allocation nodes – all nodes where the segment is allocated
Consistency class – IMMEDIATE | EVASAP | EVBOUNDED
Timeout – for the EVBOUNDED class, otherwise unused
Recovery media – PEER | DISK | NONE
Storage media – MEMORY | SWAPPABLE | DISK
Container ID – a system-wide unique identifier
The segment tables are distributed to the nodes of the system using immediate
consistency replication, as described in 4.1.5.
4.4 Scheduled replication
In a segmented and replicated system with consistency classes, we need a replication
mechanism for transferring updates to other nodes of the system according to the
consistency properties of the segments. In this section we define terms that we use in
this dissertation and that are also used in the work of Eriksson (2002). We also define
principles for an extended replication mechanism for the DeeDS prototype database
system that supports replication of segments with mixed properties.
4.4.1 Replication
In our prototype database system there are no distributed transactions. All changes to
the state of the database are contained in the updated data, and it is the propagated
data updates alone that carry the state information to other nodes. In this sense, the
local data update and the data replication process constitute a 'super-transaction'
that spans the entire distributed system. Garcia-Molina & Salem (1987) define this as a
SAGA.
[Figure omitted: an update transaction commits locally and is logged; log packages are
propagated over a network with reliable broadcast and integrated at remote replicas
using serialization or a log filter]
Figure 9. SAGA – the super-transaction concept
A SAGA covers the entire sequence of operations that are necessary to achieve eventual
consistency of the distributed replicated database, namely commit and log of the local
update transaction, propagation of the change and finally integration of the update at
other nodes.
For replication, we use the following terms for the DeeDS database:

• Replication – covers all operations included in propagating an update and
integrating it into the database, including serialization or conflict detection with
conflict resolution.

• Propagation – the operation of sending out an update to all other nodes that have a
replica of the data, according to the properties of the data object or the segment
that contains it.

• Integration – the operation of inserting the update, or simultaneous updates from
several nodes, into the database replica at a remote node. To integrate an update
from another node, we either serialize the update, or we use conflict detection to
find conflicting, concurrent updates to the same data object and conflict resolution
actions to resolve them.
Lundström (1997) describes the mechanism in DeeDS for conflict detection based on
extended version vectors, called log filters, presented in Parker & Ramos (1982). When
an update transaction arrives at a node, it executes locally only. Since all used data
is available locally, there is no need to access data at other nodes. Once the
transaction has committed, the update is stored in the log together with a version
vector for the update. When the transaction is propagated, the update is sent with its
version vector to the other nodes by a multicast operation. At the remote side, the
version vector is compared with the log filter, which describes what is already stored
in the database replica at that node. The value is copied into the database if its
version vector dominates the log filter information; if the log filter dominates, the
update is discarded. If neither the version vector nor the log filter dominates, the
update conflicts with the replica and conflict resolution is needed.
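A minimal sketch of the dominance test underlying this conflict detection, assuming a
simplified representation of version vectors and log filters as mappings from node id
to counter (DeeDS' actual log filters are richer; see Lundström, 1997):

def dominates(a, b):
    # a dominates b if a is at least as recent in every component
    return all(a.get(n, 0) >= b.get(n, 0) for n in set(a) | set(b))

def integrate(update_vv, log_filter):
    if dominates(update_vv, log_filter):
        return "apply"      # the update is newer: copy the value into the replica
    if dominates(log_filter, update_vv):
        return "discard"    # the replica already reflects this update
    return "conflict"       # concurrent updates: conflict resolution is needed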
4.4.2 Requirements for scheduled replication
We have concluded above that there is no easy way of ordering consistency classes (see
table in 4.2.3), since there are several attributes on which the ordering can be based
depending on which characteristic is important for an application. For this reason, we
cannot simply prioritize the replication based on some ordering of the consistency
classes. Instead, we suggest a scheduling algorithm and an architecture intended to
satisfy segment properties in the replication of segments. The implementation must be
designed for predictability in update replication.
In our suggested solution for replication of segments with mixed properties, we assume
the following:
•
There is no inter-segment dependency order for updates to other nodes, so
updates for different segments can be propagated in arbitrary order. This enables
62
the replication mechanism to concurrently handle updates for different segments
of different consistency classes.
•
There is no global ordering of updates. Several updates for the same segment
(possible coming from several different nodes) are not globally ordered, but are
handled at the integration side (remote node, which is receiving the update), by
using conflict detection and resolution. The scheduler for update replication
must prioritize updates that require a bounded time on replication. This means
that there could not be a forced propagation order at the propagating side for
other reasons than the time bound for replication. Also no order can be
established at the integrating side, instead updates are integrated individually.
Simultaneous integration of updates cannot occur.
•
Integration of updates into a local database uses comparison of version vectors
and log filters to detect conflicts. All information needed for successful
integration is contained within the update packet. We assume that only one pair
of [version vector, log filter] is compared at a time and thereby only one update
is integrated at a time and that all segment replicas use identical integration
processing. Comparing and integrating several simultaneous updates is possible
but also complex to handle and we consider this as future work, possibly in
combination with work on conflict resolution.
•
There is a maximum arrival rate for transactions that require bounded replication
and there is also a maximum burst size (the maximum number of arrivals within
a time period) for arrivals of such transactions. A lower bound on the interarrival interval enables us to limit the buffer size for updates to be propagated
and thereby we get a predictable propagation delay time in this buffer.
• Support for bounded replication requires a real-time network. The properties of a real-time network are (Verissimo, 1993): 1) enforcement of a bounded delay from request to transmission of a frame; 2) assurance that a message is delivered despite the occurrence of omissions; 3) control of partitioning. A real-time network enables us to have a bounded log buffer size, since we do not need to keep log entries for updates that have been sent; we know that the remote side will receive the updates. Once an update is sent out from the propagator side, the local log buffer can be emptied (Lundström, 1997).
• Updates are sent using multicast, which may be implemented as point-to-point communication or as hardware-supported multicast.
Chapter 5
Implementation
In the previous chapter, we have defined segmentation and specified an approach for
adding segmentation to a distributed real-time database. This chapter describes an
intended implementation of segmentation in the DeeDS database (Andler et al., 1996).
We have chosen to leave a full implementation for future work and limit the
implementation part to a description of what changes are required in DeeDS and an
analysis of the implications of supporting a segmented database, in particular for
scalability.
5.1 Architecture
A DeeDS database is made up of identical peer nodes and there is no master node for
replication or central coordination. Within a DeeDS node, there are different
components with different tasks and each component may contain several modules. The
DeeDS architecture is described in (Andler, Hansson, Mellin, Eriksson & Eftring, 1998)
Figure 10. The DeeDS architecture (Andler et al., 1998). [Diagram: real-time applications on top of DeeDS services; within a node, the OBST object store with the replication and concurrency modules and a scheduler, above the tdbm storage manager; all tightly coupled to a real-time OS with distributed communication connecting DeeDS at other nodes.]
To support segmentation in DeeDS, the replication of data needs to be changed. To support virtual full replication, replication needs to make use of segment properties, in particular for propagating updates to the nodes where segments are allocated. Both propagation and integration must be changed to support updates of mixed consistency.
Support for consistent distribution of segment tables is needed. The replication module
is integrated in both the tdbm and OBST components and the replication protocol itself
is implemented as an extension module in the tdbm component. A full implementation
of the replication module is described in (Eriksson, 2002).
The replication module contains several important parts, where the Logger is
responsible for collecting updates, and the Propagator and the Integrator are responsible
for sending out updates and receiving updates respectively. Typically the replication of
an update starts with a local update, performed by the tdbm on behalf of OBST and logged by the Logger until transaction commit. In the current architecture, the Propagator sends out updates in FIFO order to the other nodes, and the Integrator receives the updates from other nodes, which are then checked for conflicts by using version vectors and a log filter. In our extended replication mechanism, we replace the FIFO order of sending updates with a scheduling mechanism. A description of the design of the replication module is found in (Eriksson, 2002).
Figure 11. The replication module of DeeDS (Eriksson, 2002). [Diagram: the Logger collects local updates from tdbm; the VVHandler and Propagator send out replicated updates; the Integrator receives remote updates and checks them against the log filter via the DOI.]
5.2 Support for replication of segments
We have presented segmentation algorithms, the syntax for segment specification and the consistency hierarchy; with these we can structure the segmentation and replication of the database. To support this, the following changes are required in
DeeDS:
- Segment table storage and lookup – The DeeDS node receives static segment tables when the database system is initialized. Segment tables for the node are stored at the node and used by replication control in tdbm and OBST, both for scheduling and for directed addressing of updates (multicasts); a sketch of such a table follows this list.
- Scheduling updates for propagation – The Propagator and the Integrator need to schedule outgoing and incoming updates according to the segment tables, in such a way that segment properties are maintained.
- Coordination of two-phase-commit updates – The current DeeDS database system does not support transactions other than updates with eventual consistency. The Propagator is changed to also act as a coordinator, to handle both two-phase-commit and eventual consistency updates according to their properties.
- The mechanism of conflict detection with version vectors and log filters does not need to be changed to support segmentation. The data objects stored at a node are allowed to be reallocated to other segments within the node, so we need a common log filter for all data objects that are allocated to a node, which keeps the history of data updates even when objects are reallocated.
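As a concrete illustration of the first point, the following Python sketch shows one possible shape of a per-node segment table and the lookups needed for scheduling and directed multicast. All field names and the lookup function are hypothetical; the actual DeeDS table layout may differ.

from dataclasses import dataclass

@dataclass
class SegmentEntry:
    consistency: str      # "IMMEDIATE" | "BOUNDED" | "ASAP"
    replica_nodes: list   # nodes holding a replica (the multicast group)
    bound_ms: int = 0     # replication deadline for BOUNDED segments

SEGMENT_TABLE = {
    "A": SegmentEntry("BOUNDED", ["N1", "N2", "N3"], bound_ms=50),
    "B": SegmentEntry("ASAP", ["N1", "N3"]),
}
OBJECT_TO_SEGMENT = {"x": "A", "y": "B"}   # each object belongs to one segment

def replication_targets(obj, local_node):
    # Directed addressing: multicast only to the nodes sharing the segment.
    seg = SEGMENT_TABLE[OBJECT_TO_SEGMENT[obj]]
    return seg.consistency, [n for n in seg.replica_nodes if n != local_node]

print(replication_targets("x", "N1"))   # ('BOUNDED', ['N2', 'N3'])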
We elaborate on scheduling and coordination in the following sections.
5.3 Scheduled propagation
Bounded replication has deadlines for replication, which guarantees that database
updates are replicated within a known time. This requires predictability of replication,
which means that processing of bounded replication updates must have higher priority
than ASAP replication updates. Updates for segments with immediate consistency
requirements have no time constraints, but will lock resources globally as long as the
update has not committed. We want to have high priority for updates of such segments too, while still giving bounded updates the highest priority. To handle these update requirements we define a double-queue architecture.
The log of the propagating node contains updates to be sent to other nodes. Each update
is put into one of the new queues, depending on the consistency class for the segment
that the update belongs to. Bounded replication updates are put in the Bounded queue, while ASAP updates are put in the Ordinary update queue. For efficiency reasons, immediate consistency updates are put at the head of the Ordinary queue, since we do not want to lock resources unnecessarily.
Figure 12. Propagation scheduling queues. [Diagram: updates from the DOI are placed in the Bounded queue (bounded updates) or the Ordinary queue (ASAP updates at the tail, immediate consistency updates at the head); the Sender serves both queues from their heads.]
The Sender in the Propagation module sends out updates from the bounded queue as
long as there are such updates to propagate. The updates in this queue are sorted according to the scheduling policy earliest deadline first, so that all updates have the best chance of meeting the replication time bound. The maximum time that an update must wait in the Bounded queue is the time for one entry to be sent multiplied by the number of elements in the queue. The send time for one element is the time from the request to send until the message is sent, which is specified in (Lundström, 1997) and is bounded under certain conditions. With a Bounded queue we thus get a longer propagation time for a bounded update, since the wait time to send is multiplied by the number of elements queued in front of the update; for example, if one send takes at most 2 ms and ten updates are queued ahead, the wait is bounded by 20 ms. Other components of the total propagation time, as defined by Lundström, remain the same.
Sending out an update at the head of the Bounded queue is delayed by at most the time it takes to process any on-going send operation, regardless of what type of update is in progress. For a two-phase-commit (2PC) update, the coordinator needs four messages per remote node, but after the coordinator has sent any single message in such a 2PC update process, a bounded queue update may be sent out. Immediate consistency updates have no time constraints, and the data in such a segment will not be influenced by a bounded update that interleaves the coordination process. The Bounded and the Immediate updates are sent to independent segments, since all segments have disjoint sets of data objects.
The size of the Ordinary queue is not influenced by time constraints of the data, so it
may be large and may also be allowed to grow as update requests arrive. For the
Bounded queue, there should not be more bounded updates for a segment than can be replicated by the database system; otherwise we need to indicate a fault, since we are not allowed to lose updates. We need to restrict the maximum update rate (sporadic update behavior) that the system can handle, by enforcing a maximum update rate for each segment instance. This also effectively limits the number of elements in the Bounded queue, resulting in a predictable propagation time; a sketch of such dimensioning follows below.
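One simplistic way to reason about this dimensioning is sketched below, assuming purely sporadic arrivals with a per-segment maximum rate and burst size, and a bounded wait-to-send time for the real-time network. All numbers and names are illustrative assumptions.

segment_rates = {"A": 50, "B": 25}   # admitted max updates/second per segment
max_burst = {"A": 5, "B": 2}         # admitted max arrivals in one burst
max_send_time_s = 0.002              # bounded wait-to-send per message (assumed)

# Admission test: the aggregate admitted rate must not exceed the rate at
# which the Sender can drain the queue; otherwise updates could be lost,
# which must be treated as a fault.
assert sum(segment_rates.values()) <= 1.0 / max_send_time_s

# If bursts dominate the backlog, the queue never holds more than the sum
# of the admitted burst sizes, making the propagation wait predictable.
queue_capacity = sum(max_burst.values())
worst_case_wait_s = queue_capacity * max_send_time_s
print(queue_capacity, worst_case_wait_s)   # 7 entries, 0.014 s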
We want the Sender to have exclusive access to the network, and we also want to delay send-access of bounded updates by at most the time to complete one send operation. To fulfill this, we define the Sender to execute as a two-thread algorithm using a semaphore for exclusive access to the network. To collect replies for the immediate consistency segments, we use a separate thread.
High priority thread:
1. while forever
a. wait for (Bounded queue size > 0)
b. wait for network token
c. while (Bounded queue size > 0)
i. send update from Bounded queue
d. release network token
2. end
Low priority thread:
1. while forever
a. wait for (Ordinary queue size > 0)
b. if update type is Immediate (Act as a 2PC coordinator)
i. wait for network token
ii. send phase1 message
iii. release network token (Let Bounded updates interleave)
iv. await phase_1_signal (from the receiver thread)
v. wait for network token
vi. send phase2 message
vii. release network token
viii. await phase_2_signal
c. else
i. wait for network token
ii. send one asap update
iii. release network token
2. end
Thread for receiving 2PC replies and signaling the Low priority thread:
1. while forever
a. if phase1
i. set received_cnt = 0
ii. while received_cnt < number_of_nodes
1. receive message, increase received_cnt
iii. send phase_1_signal
b. else
i. set received_cnt = 0
ii. while received_cnt < number_of_nodes
1. receive message, increase received_cnt
iii. send phase_2_signal
2. end
Our send of 2PC messages in the Low priority thread relies on a hardware-supported multicast. If multicast is implemented as point-to-point, we need to let the High priority task interleave the multiple send operations of the 2PC. The following replaces the send phase 1 / 2 message and the release network token operations.
1. until propagated to all nodes for the segment (sent_cnt == number_of_nodes)
a. send to one node, increase sent_cnt
b. release network token
c. if sent_cnt < number_of_nodes
i. wait for network token
2. end
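The two-thread algorithm can be sketched as runnable Python, with the network token as a lock and counting semaphores replacing the wait conditions. It deviates slightly from the pseudocode in that the high priority thread re-acquires the token per update instead of draining the whole queue while holding it; the receiver thread is omitted, and all names are illustrative assumptions.

import heapq, threading, time
from collections import deque
from itertools import count

network_token = threading.Lock()        # exclusive access to the network
queues_lock = threading.Lock()          # protects the two queues
bounded_ready = threading.Semaphore(0)  # one permit per queued bounded update
ordinary_ready = threading.Semaphore(0)
bounded_queue, ordinary_queue, _seq = [], deque(), count()

def send(msg):
    print("sent:", msg)                 # stand-in for the real-time network send

def enqueue(update):
    with queues_lock:
        if update["class"] == "BOUNDED":
            heapq.heappush(bounded_queue, (update["deadline"], next(_seq), update))
            bounded_ready.release()
        else:
            if update["class"] == "IMMEDIATE":
                ordinary_queue.appendleft(update)   # head of the Ordinary queue
            else:
                ordinary_queue.append(update)       # ASAP updates at the tail
            ordinary_ready.release()

def high_priority_sender():
    while True:
        bounded_ready.acquire()             # wait for (Bounded queue size > 0)
        with network_token:                 # delayed by at most one on-going send
            with queues_lock:
                update = heapq.heappop(bounded_queue)[2]   # earliest deadline first
            send(update)

def low_priority_sender():
    while True:
        ordinary_ready.acquire()
        with queues_lock:
            update = ordinary_queue.popleft()
        if update["class"] == "IMMEDIATE":      # act as a 2PC coordinator
            for phase in (1, 2):
                with network_token:             # token released between phases,
                    send((phase, update))       # letting bounded updates interleave
                # here: await the phase signal from the receiver thread
        else:
            with network_token:
                send(update)

threading.Thread(target=high_priority_sender, daemon=True).start()
threading.Thread(target=low_priority_sender, daemon=True).start()
enqueue({"class": "BOUNDED", "deadline": 0.05, "id": 1})
enqueue({"class": "ASAP", "id": 2})
time.sleep(0.1)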
5.4 Scheduled integration
Integration of updates for segments of different consistency classes needs to be
scheduled in a similar way as with propagation of updates. The bounded replication
updates must have a predictable integration time. As with propagation, the immediate consistency updates should be integrated before ASAP updates, for efficiency reasons. Having a similar double-queue mechanism also on the integration side supports these requirements.
The size of the Bounded queue on the integration side must be much larger than on the propagation side. The worst-case scenario is when all nodes that may update a bounded segment at another node simultaneously send updates that need to be integrated. Since all these updates have been admitted to the system, we cannot reject or abort them, but must be able to store them in the Bounded queue at the integration side and process them in a predictable way. The Bounded integration queue will necessarily be large and scalability suffers, since much memory is required and the worst-case integration time becomes long. Similar to the Bounded queue, the size of the Ordinary queue must be large enough to receive all updates from other nodes. Since segmentation is used, the number of replicas of a segment is reduced compared to a fully replicated database, and the degree of replication is used when dimensioning the integration queue.
The integration process is identical to the description in (Lundström, 1997), where conflict detection is done by comparing the version vector of one update with the log filter of the database at the node, one at a time. Integration must take place within the processing of one update entry of the queue to achieve a predictable integration time. To be able to support bounded replication, the integration time must be bounded, which excludes conflict resolution actions that do not have a predictable execution time (e.g. compensating actions that cascade, resolution that involves other nodes, etc.). Bounded replication implies that the update is available within a known time, so any conflict resolution that jeopardizes the predictability of the update being integrated and available to other users must be avoided. In addition to unpredictable integration policies, this also excludes delayed or detached conflict resolution. A sketch of such bounded, one-at-a-time integration follows.
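The sketch below deliberately simplifies the log filter to a single version vector per node and leaves conflict resolution as a pluggable action that must itself have a bounded execution time; all structures are illustrative assumptions, not the actual DeeDS design.

from collections import namedtuple

Update = namedtuple("Update", "segment version_vector value")

def newer(a, b):
    # True if any per-node counter in vector a is ahead of b.
    return any(a.get(k, 0) > b.get(k, 0) for k in set(a) | set(b))

def integrate_one(update, db, log_filter, resolve):
    # Integrate exactly one update; the work per call is bounded.
    upd_new = newer(update.version_vector, log_filter)
    rep_new = newer(log_filter, update.version_vector)
    if upd_new and rep_new:
        resolve(update, db)                  # must have a bounded execution time
    elif upd_new:
        db[update.segment] = update.value    # the update dominates: apply it
    # otherwise: discard, the replica already dominates
    for k, v in update.version_vector.items():   # advance the log filter
        log_filter[k] = max(log_filter.get(k, 0), v)

db, log_filter = {}, {}
integrate_one(Update("A", {"N1": 1}, 42), db, log_filter, lambda u, d: None)
print(db, log_filter)   # {'A': 42} {'N1': 1}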
Chapter 6
Evaluation
In this chapter, we analyze the proposed framework for segmentation in a distributed
database. We define a model for an evaluation of our framework. The evaluation reasons about the framework to identify its advantages and problems.
6.1 Evaluation model
To reason about how segmentation with consistency classes improves communication
scalability and replication efficiency, we define a model for replication. When
evaluating replication, we assume that there is a validated segment allocation table
available according to the algorithm in 4.3.4.
Spatial model for replication with segments
Since we aim at reducing the communication effort, we need to calculate the amount of
data to be sent over the network to replicate updates. To be able to measure a potential
improvement of segmentation to support virtual full replication, we define:
Replication effort: A measure to express the effort of making the database
consistent after an update to any data object.
We can calculate the replication effort gain in reducing the degree of replication for
individual segments, by comparing the replication effort of a particular segmented
database with a fully replicated database, where all data objects are replicated to all
nodes. The definition enables us to evaluate the replication effort of introducing segments and having individual degrees of replication. We can also see that the replication effort of a segmented and fully replicated database is the same as for a non-segmented database.
To have a basic model for our evaluation, we make the following assumptions:
• For every update of a data object replica, we use one network message. All update messages are of the same size.
• All data objects have the same size in bytes.
• Our basic evaluation model is intended to be decoupled from any particular application. Thus, modeling of access patterns for updates, distribution of updates and frequency of updates is not included in the model.
We use the following denotations:

n – number of nodes
g – number of segments
S – size of the database (in number of data objects)
s_i – size of segment i (in number of data objects)
d_i – degree of replication for segment i
c_i – the consistency class for segment i (IMMEDIATE | BOUNDED | ASAP)
m(c) – number of messages required to replicate an update for consistency class c
c – the consistency class for a database
re – replication effort
In a segmented database, we can express the size of the database as the sum of the sizes
of all segments. The relation between segment size, number of segments and the size of
the database is:
S = \sum_{i=1}^{g} s_i
The replication effort can be calculated for specific applications, where the number of segments and their degrees of replication are known.
Replication effort:
re_1 = \sum_{i=1}^{g} (d_i - 1) \cdot s_i
The replication effort is O(n), due to O(1) for the factor (d_i - 1) and O(n) for the factor s_i.
By comparing the replication effort of the specific application with the same application being fully replicated, we can express the replication effort ratio as a measure of the improvement in replication effort.
Replication effort ratio:
re_{1ratio} = \frac{\sum_{i=1}^{g} (d_i - 1) \cdot s_i}{S \cdot (n - 1)}
In 3.1.2 we see that a fully replicated database results in O(n^2) replication effort when the size of the database increases with the number of nodes. Thus, re_{1ratio} is O(1/n). The replication effort is lowered with each separate segment that has a lower degree of replication. The replication effort is independent of the time constraints on individual segments, since replication effort is purely a matter of distributing the updates to the other replicas. Consistency requirements, however, influence the replication effort, since full consistency replication (e.g. 2PC) in general uses four messages for replicating an update, while optimistic replication updates a replica with one message only. When we add this to our formula for replication effort, we may compare the improvement in replication effort to the corresponding fully replicated database with full or eventual consistency.
re_{2ratio} = \frac{\sum_{i=1}^{g} (d_i - 1) \cdot s_i \cdot m(c_i)}{S \cdot (n - 1) \cdot m(c)}
We see that the replication effort can be reduced by decreasing the degree of replication
for segments or by lowering the consistency requirements on data.
For full consistency replication with 2PC we can improve the replication effort by using hardware-supported multicast. In that case we do not need four messages per node to distribute the update, but can reach all replicas with one message only, reducing the number of messages needed to establish consistency. However, the replies from the nodes during the 2PC cycle are still sent back to the 2PC coordinator individually.
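The measures are straightforward to compute for a given configuration. The following Python sketch evaluates re_1, re_{1ratio} and re_{2ratio} for a hypothetical segmented database; the segment sizes, degrees of replication and message counts per consistency class are illustrative assumptions (four messages for 2PC, one for optimistic replication).

MESSAGES = {"IMMEDIATE": 4, "BOUNDED": 1, "ASAP": 1}   # m(c), assumed

segments = [   # (size s_i in objects, degree of replication d_i, class c_i)
    (100, 3, "ASAP"),
    (20, 5, "BOUNDED"),
    (5, 2, "IMMEDIATE"),
]
n = 10                                 # number of nodes
S = sum(s for s, _, _ in segments)     # database size

re1 = sum((d - 1) * s for s, d, _ in segments)
re1_ratio = re1 / (S * (n - 1))        # improvement vs. full replication

re2 = sum((d - 1) * s * MESSAGES[c] for s, d, c in segments)
re2_ratio = re2 / (S * (n - 1) * MESSAGES["ASAP"])   # baseline class c = ASAP
print(re1, round(re1_ratio, 3), re2, round(re2_ratio, 3))   # 285 0.253 300 0.267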
Temporal model for replication with segments
To have a measure for the efficiency of replication in a segmented database, we need to
relate the replication effort to a time period. When the efficiency of a database system is improved, the replicas become consistent earlier than otherwise. Improved efficiency may give better real-time properties for the system, since we may have the opportunity to use tighter bounds for deadlines. For this dissertation we define replication throughput as a
specific measure of efficiency:
Replication throughput: The replication effort per time it takes to re-establish
consistency between replicas, where time spans from the time point of an update
until the time point where the change has been replicated to all replicas.
We consider a detailed temporal analysis with a formal definition for replication
throughput as subsequent work. In this dissertation we reason about how the defined
framework supports replication throughput and what parameters influence it.
For a certain replication effort we can achieve higher replication throughput by reducing
the time to process the replication. We identify three main components of time involved
in scheduled replication:
tprop – time for waiting for network propagation and propagation scheduling
tnetwork – time for propagating an update over the real-time network
tintegrate – time for integrating an update
The total replication time for an update is the sum of these three components.
To support real-time properties for replication, we have a minimum time between sporadic arrivals of updates, which corresponds to a maximum update rate. Due to this, there is a maximum on how many entries are stored in the Bounded queue, which also limits how long an update must wait in the propagation queue. Thus, the tprop time depends on the maximum allowed update rate. Similarly, the tintegrate time depends on the maximum size of the integration queue.
To support real-time behavior of the flow of updates, we dimension the system based on the update rates in the segmented database. Bounded replication of updates relies on there being a maximum wait-to-send time for the real-time network, and we dimension our flow control based on this. A maximum wait time for the network allows the sender to send sporadically at a rate of 1/max_wait_time. The updates for the segments on a node must share this update rate, so the sum of the maximum allowed rates for the segments must not exceed this rate. The maximum rate for each segment is a manual design decision that must be based on a priori application knowledge.
At the integration side, the nodes that receive updates must be dimensioned to receive the updates at the rate they are sent from the sending nodes, since we are not allowed to lose updates. Virtual full replication means that an integrating node receives updates only from nodes that share segments with it, so the receive rate at each node is dimensioned for this. The segment update rates have effectively been transformed into update rate requirements on the integrating side. If the integrating side cannot support the required update rates, the propagation rates must be re-dimensioned.
To calculate the required update rate for the integrating node we use the formula:
\sum_{i=1}^{g_{in}} \left( \sum_{j=1}^{g_{pr}} rate(j) \right)

where
g_pr – the number of propagating segments
g_in – the number of integrating segments
Consider the following example. The propagating nodes N1 and N2 contain segments (A, B, C) and (A) respectively. The integrating nodes N3 and N4 share segments (A, B) and (C) respectively. The send rate at the real-time network is 10 messages/second for both nodes N1 and N2. We divide this among the segments at nodes N1 and N2 as N1: A (5 messages/s), B (3/s) and C (2/s); N2: A (10/s). Node N3 will receive updates for segment A at a maximum of 5/s from node N1 and 10/s from node N2, 15/s in total. Node N3 will receive updates for segment B at a maximum of 3/s (from node N1 only). At node N4, segment C is updated at a maximum rate of 2/s from node N1. The update rates at the propagating nodes require that N3 in total can handle a minimum rate of 18/s, while N4 is required to handle a minimum rate of 2/s. Now, if node N3 cannot support an update rate for segment A of 15/s, or 18/s for the node, a solution can be to lower the allowed update rate for segment A at node N2. Reducing that allowed update rate to 5/s gives an update rate at N3 of 13/s for the node and 10/s for segment A at N3. Using rates in this example is somewhat misleading, since they actually correspond to minimum inter-arrival times of sporadic behavior; however, using rates makes the example more intuitive.
Figure 13. Update rates and resulting integration rate requirements. [Diagram: propagating node N1 sends segment A at 5/s, B at 3/s and C at 2/s; propagating node N2 sends A at 10/s; integrating node N3 receives A at 15/s and B at 3/s, 18/s in total; integrating node N4 receives C at 2/s.]
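The example can be checked mechanically; the following Python sketch computes the integration-side rate requirements from the admitted propagation rates, using the topology and numbers of the example above.

propagation = {                       # admitted max update rates (updates/s)
    "N1": {"A": 5, "B": 3, "C": 2},
    "N2": {"A": 10},
}
shares = {"N3": ["A", "B"], "N4": ["C"]}   # segments replicated at each node

for node, segs in shares.items():
    per_segment = {s: sum(r.get(s, 0) for r in propagation.values()) for s in segs}
    print(node, per_segment, "node total:", sum(per_segment.values()))
# N3 {'A': 15, 'B': 3} node total: 18
# N4 {'C': 2} node total: 2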
6.2 Discussion
6.2.1 Segmentation of distributed real-time databases
Segmentation reduces the replication effort by the amount of lowered replication for each individual segment. By allowing an individual degree of replication for each segment and defining segments based on access to data objects, we optimize the degree of replication for each data object. With full replication the replication effort is O(N^2) for N nodes, since updates at each of the N nodes result in sending out updates to N-1 nodes. The degree of replication required is application dependent, and the actual gain of segmentation can only be known within the context of an application, for which we can specify the degree of replication for each segment. However, locality and hot-spot models for distributed data suggest that only a few data objects are shared among many nodes, while many data objects are shared by only a small number of nodes. Gray & Reuter (1993) formulate this as:
“A popular rule of thumb, the 20/80 rule, holds that 20% of the data get 80% of all references, and the remaining 80% of the data get only 20% of the references … There seem to be many applications that can be more appropriately characterized by, say, a 1/99 rule” (Gray & Reuter, 1993, p.670)
Nodes with data used only locally will have a segment that is not replicated at all,
except for replicas that exist for fault-tolerance reasons, in contrast with a fully replicated database, where all such data is always replicated to all nodes, regardless of whether it is ever used or not.
When reducing replication, we reduce the need for reconciliation (conflict detection and resolution) at the nodes. With full replication, all nodes must check updates for conflicts, also for data that is not in use at the node. With full replication there is a high probability of conflicting updates; these conflicts must be resolved at all nodes in the system, and a large amount of the overall processing time is used for reconciliation of data that is never used. Gray et al. (1996) model the reconciliation behavior for such a system. With segmentation we need to detect and resolve conflicts only at nodes where data is replicated. The actual gain in overall processing time is again dependent on the degree of replication required in a particular application and the number of nodes in the system.
When recovering a segmented database, we may choose to recover certain segments
before others. If the segment table for a node is loaded early in the recovery process, it
can be used to recover segments in some priority order, enabling the system to support critical parts of the database earlier. Recovery of segmented distributed databases is regarded as future work.
6.2.2 Replication
In a fully replicated system, bounded updates are sent to all nodes in the system, either
by a broadcast or a multicast primitive. Lundström (1997) calculates the bounded replication time for a replication mechanism that is fully replicated, using a point-to-point implementation of a multicast. A segmented database uses smaller groups for the multicast, which makes the propagation more efficient for a point-to-point multicast. The actual gain in replication time depends on the degree of replication for the different segments in a particular application.
It is not possible to prove the scheduled replication architecture without a full formalism, which we do not have, but we can make some statements about the architecture that indicate that updates are handled sufficiently and that the result of an update will be correct:
Segments do not overlap – Data objects belong to one segment only, so there is no other way of updating a data object, and all updates for the data object follow the same replication procedure. Updates for different segments never conflict with each other, but updates for the same segment may conflict; for such conflicts we have either serializability to avoid conflicts (for immediate consistency) or conflict detection with conflict resolution (for eventual consistency). Replicas of data objects are updated in the same way regardless of their location or the behavior of the network, since the replication of a data object is based on the properties of the segment that it belongs to.
Propagation and integration of bounded updates are prioritized – Following our design
assumptions, we know that bounded updates will be replicated within a bounded time.
We also know that immediate consistency updates are one-copy-serialized and the
ASAP updates are replicated when possible. This means that for all segments, we can
guarantee the consistency properties specified.
Nodes have compatible integration – Updates for ASAP segments commute, Immediate
consistency updates are serialized and Bounded updates are processed in the same order
at all replicas. This means that updates are integrated so that all nodes converge into
mutual consistency, once all pending updates have been processed.
6.2.3 Scalability
Several of the parameters that influence the scalability of our solution have been
discussed above. This section summarizes how factors influence scalability of our
framework.
Consistency class – Birrell et al. (1982) show that pessimistic replication for global consistency scales very badly and that optimistic replication improves this significantly. Our framework supports both pessimistic and optimistic replication and basically scales accordingly, except that segmentation improves scalability for both replication types.
Degree of replication – Our evaluation model shows that when specifying the degree of
replication individually for each segment, we reduce the replication effort linearly for
the entire database, compared to full replication, where the replication effort is O(N^2). In addition, a lowered degree of replication improves reconciliation and recovery, which further improves the overall efficiency.
Number of nodes – Compared to a fully replicated system, our framework is less
sensitive to increasing the number of nodes of the system, since segmentation will lower
the replication effort.
Segment size and number of segments – The size of a segment is not related to the size
of an update to propagate to other segment replicas, so the segment size does not
directly influence scalability. Segment tables at each node contain the list of where other
segment replicas are allocated. With smaller segments it can be expected that there are
more segments in the database and the table will grow linearly.
Database size – Our evaluation model shows that with an increasing database size, the
replication effort increases linearly.
6.2.4 Validation by simulation
A simulation is required to extensively validate our framework and needs to include ways to evaluate both the offline processing of preparing allocation tables and the online processing for replication. Access to application semantics and requirements is required for a valid simulation; in our case we use requirements from the WITAS application.
The offline simulation is a matter of validating the process of transferring the
application requirements into allocation tables and using them for replication.
Application semantics is captured in the application and segment specification, which is
compiled into segment tables. For this an implementation of the translation algorithms
is needed. Once segment tables are transferred to the nodes of the system, the online
simulation is executed.
To validate the timeliness and correctness of the replication framework we have
proposed, we need an executing simulation model or a prototype implementation. To be
able to perform sufficient on-line validation, certain conditions need to be fulfilled:
Replication scheduling model – We need to implement or model the replication
scheduling processing mechanism, so that a correct evaluation of its timeliness and
behavior can be done. For this, a time-cognizant model with queues is required.
Real-time network – The propagation of network traffic, simulated or actual, must
follow the properties of a real-time network. This includes bounded transmission delay,
bounded omission degree and bounded inaccessibility according to (Verissimo, 1993).
Patterns of database accesses – The patterns of database access need to be modeled or
known, so that timely behavior of the database can be related to the ways the database is
used. In particular, the mix of application consistency requirements, the arrival rates and the burst sizes are required knowledge.
Global time reference – To be able to measure the lag for bringing the database to
mutual consistency and thereby measure the effectiveness and efficiency of replication
scheduling, we need specific measurement points at the nodes of replication and a
global time reference, so that the time from transaction commit to consistency can be
measured.
Once we have an accurate model for evaluating a simulation or a prototype, the validation can be performed:
Validation of segment tables – The algorithms for creating segment tables (plain
segmentation and segments with properties) can be validated separately. For segments
without properties, the segment tables must support replication of data to all nodes
where data is used, which can easily be verified from the list of data accesses. Segments
with properties are also verified per data object, but with considering the consistency
requirements and other properties for each data object.
Validation of improved replication effort – The improvement in replication effort can be
calculated from the segment allocation tables when using our formulas in 6.1, once the
segment specification has been completed.
Validation of timeliness of updates – With measurements of the update lag for replicas, we will know how long the slowest update needs. For bounded updates this is essential, since we can see whether the replication completed before its deadline. A valid scheduling architecture will complete all its bounded updates within the specified time bounds.
Validation of throughput / efficiency improvement – The efficiency improvement is
measured over a set of updates of mixed consistency. Increasing the efficiency means
that more updates can be completed over time, while still fully supporting consistency
requirements of these updates. To measure the improvement in efficiency with scheduled replication, an application setup using scheduled replication is compared to the same application setup with segmented but unscheduled replication. For scheduled replication, it is expected that more replication slack time will be available, and this time can be used for increasing the overall update load on the system. The parameters defined to influence scalability, discussed in 6.2.3, should be altered to validate the scalability discussion.
Validation of architecture and scheduling – Besides completing replication of updates
within bounded time, a valid scheduling architecture will also guarantee that all
segments are updated, that no updates are lost due to queue overruns, and that replicas do not diverge or dilute. The database should converge into consistency, even after it has been heavily updated for a long time, as long as the updates are done within the specified limits. We also need to identify possible breakdown conditions of the proposed architecture.
6.3 Problems
Some of the drawbacks of our suggested framework have been discussed in preceding paragraphs; we summarize them in this paragraph.
6.3.1 Segments
Our approach to capturing application semantics by specifying requirements on data objects may limit the usability of our framework. For many non-hard real-time applications it may not be possible to specify all data accesses at every node of the system a priori, and for such applications there is a clear need for an extended approach, where segments are allocated to additional nodes during execution. We have also indicated this as an extension to the work of this dissertation.
Matching the specification of the application with the segments that are defined may be
a difficult task. Our algorithm may not find a match of requirements and segments. Much manual work may be needed to tune the application and segment specification to get a matched segment table.
The syntax of the segment specification may be a source of limitation for the framework. The syntax may need to be extended to support some applications, since we have chosen a limited syntax to be able to primarily study the principles of introducing segments with properties. Further validation of the work should indicate whether additional constructs are required in the syntax to manage a full simulation or run a prototype. In such cases, the syntax can easily be extended.
For databases where there are many segments, we may get large segment tables at the
nodes and a high amount of processing. The processing of large segment tables must be
ensured to be predictable. We have not discussed scalability for such systems in detail
in this dissertation, which is a limitation of the dissertation and a subject for future
work.
Also, if we add more segment attributes, the database will be divided into more segments, since our assumption is that separate segments have disjoint sets of properties. With plain segmentation without properties we could possibly have 2^n - 1 segments (where n is the number of nodes). With disjoint properties this grows to 2^(n*p) - 1 segments (where p is the number of properties); for example, n = 4 and p = 2 gives up to 255 possible segments instead of 15.
6.3.2 Architecture
For the architecture to solve the problem that we have defined, we have made
assumptions that may not hold in a real environment. A problem with the proposed architecture may arise when usage conditions are not properly defined. Wrong dimensioning may result in lost updates or breakdowns. Database update rates, burst
sizes and network propagation times are parameters that are difficult to ultimately
define in a real environment. Partitioning is one of the reasons for both immediate
consistency and bounded updates to fail in a real environment.
6.3.3 Replication
Compared to the current replication architecture in DeeDS, we introduce additional processing of updates. While meant to improve efficiency, we may not achieve the improvement that can be defined theoretically. The reason is that scheduled replication requires more processing of updates. This algorithmic overhead needs to be measured to know the actual efficiency improvement gained.
For integration of updates, our framework does not support several consecutive and
related updates. We have chosen to integrate updates individually, which means that averages over several updates cannot be used for conflict resolution. Our update integration also cannot depend on orderings between updates.
The size of the integration queues may be large, in particular at nodes where there are
segments with a high degree of replication. The waiting time in the integration queue depends on the time to integrate one update multiplied by the number of updates in the queue. A long integration queue prolongs the waiting time linearly with the number of entries.
6.4 Related work
This paragraph lists work that has influenced our results and the solutions that we present. To improve scalability, distributed databases are often partitioned and replicated to make data locally available, which reduces communication between nodes. Different levels of granularity for partitioned databases can be found in the literature. To our knowledge there is no work on partial replication that aims at supporting full replication semantics. We have studied approaches for data locality, replication and weak consistency, where replicas are allowed to be temporarily inconsistent, in particular systems where the inconsistencies can be bounded, quantified or controlled in some other way. Some approaches use rules to describe the amount of allowed inconsistency, while others allow nodes to have unlimited but convergent inconsistency, so that replicas eventually become consistent.
6.4.1 Partial replication
With partial replication of databases, parts of the database can be allocated to the most
optimal locations (Mukkamala, 1988). When partitioning the database into fragments,
the fragments can be allocated to nodes where the data is used most frequently, reducing
the need for data access through the network (Elmasri & Navathe, 2000). Horizontal
and vertical fragmentation (Ceri, 1982; Navathe, Ceri, Wiederhold & Dou, 1984) are
particular kinds of partitioning, where database tables are split up and where parts of tables (fragments) are located at different nodes. In our work we are concerned with
replication of partitions of the database to have local availability of data at several
nodes, which is not a main concern for approaches with plain database partitioning,
where replication is of less importance and distributed transactions are used to reach the data from nodes other than those where partitions are allocated. Our solution replicates data so
that distributed transactions are not needed. Partitioned and replicated databases may
provide the semantics of virtual full replication by partitioning and replicating the
database so that local data is available. This however usually requires manual
intervention, while we support local data availability by automatically setting up
segments and their allocation.
Much work in partial replication can be found in the area of distributed file systems.
Several approaches exist where granularity is used for replicating smaller amounts of data to more efficiently support mutual consistency. The Andrew File System and the Coda file system (Satyanarayanan, 1993) use a set of servers that have full mutual consistency, where mobile nodes can connect and synchronize to make themselves consistent with the master data of a server group in a client/server model. In the Ficus file system (Guy, Heidemann, Mak, Page, Popek & Rothmeier, 1990; Page, Guy, Popek, Heidemann, Mak & Rothmeier, 1991; Reiher, Heidemann, Ratner, Skinner & Popek, 1994) the nodes are peer nodes that use optimistic replication with conflict detection and conflict resolution for resolving conflicting updates. Ficus replicates files only to nodes that use them and has an information model about files to support partial replication. Our work has been influenced by the ideas of granularity (segmentation of the database), the usage of peer nodes (the design of the DeeDS database system) and the use of a richer information model for selective replication of updates (segment properties).
The optimistic replication in the Ficus file system better supports scalability, but requires that conflicting updates are detected so that they can be resolved. A common approach for systems with optimistic replication is to use version vectors and log filters (Parker & Ramos, 1982). In the Ficus file system, application semantics is used for resolving file conflicts; for conflicts that cannot be automatically resolved, the user is informed by email for manual conflict resolution. Optimistic replication, version vectors and log filters are used in the DeeDS database, for which we have intended our prototype implementation.
Adly and Kumar (1994) present a protocol for optimistic replication only to certain neighboring nodes in a distributed system, as opposed to systems with full replication where all nodes are updated. The replication to neighboring nodes means that the closest nodes are closer in consistency than nodes farther away from the updated node. In our work, we also replicate updates to certain nodes, but this is based not on distance to the updated node but rather on the need for the data to be available.
Nicola and Jarke (2000) present a formal model for partial replication that can be used
for analyzing the best partial replication schema for an application. They differentiate
two dimensions of replication, ‘All objects to some sites’ and ‘Some objects to all sites’.
The two dimensions are combined in an orthogonal way to have more expressive and
formal expressions for degree of replication, which can be analyzed for optimal
performance. For our future work in the area, we intend to analyze our automatic
segmentation with this model.
6.4.2 Eventual consistency
Grapevine (Birrell et al., 1982) is one of the first systems using eventual consistency,
where name server replicas are allowed to be temporarily inconsistent. After an update
is written and committed at one of the replicas, asynchronous replication is used to
propagate the update to other replicas. There is mutual inconsistency between the
replicas as long as the update has not been propagated to all nodes.
The DeeDS database system (Andler et al., 1996) allows immediate updates at any
replica, where replicas eventually converge into a consistent database. Full replication
supports real-time access to data for all clients. Replication after local commit without
known replication time supports real-time updates for all clients (by simulating
partitioning of the network), but requires database clients to be tolerant to temporary
inconsistencies.
Lundström (1997) shows that replicated databases with eventual consistency and
delayed replication may have a bounded delay for replication. This requires that the
system uses a real-time network and that network partitions are of bounded durations.
Furthermore, conflict resolution must use forward recovery and there must be an upper bound on the transaction arrival rate. Update transactions must be treated as hard deadline transactions (with a replication bound), and resources must be sufficient to meet hard deadline transactions.
Pu & Leff (1991) present epsilon-serializability (ESR), which is a relaxed criterion for
mutual consistency and where data eventually converges into a consistent state. The
amount of accumulated inconsistency (the overlap) for a query is controlled so each
query can be kept within inconsistency bounds. With ESR, queries may see inconsistent
results and updates may arrive in different order to different replicas. This produces
eventually consistent replicas that are equivalent to a serial schedule while increasing
concurrency of the database. Son & Zhang (1995) use epsilon-serializability to guarantee both timeliness and consistency for a real-time database, where a mix of consistent queries and queries with ESR correctness is allowed and where transactions accumulate a certain allowed amount of inconsistency during their execution. Transactions are sorted on earliest deadline and are aborted when they cannot meet their deadlines. ESR is more specific in controlling the eventual consistency, where the accumulated inconsistency can be matched with the tolerance level of the application using it. This is a difference from our work, where we assume all applications tolerate any degree of temporary inconsistency. Eventual consistency with bounded replication limits inconsistency in time, while ESR limits inconsistency in value.
6.4.3 Specification of inconsistency
A number of approaches for a formal or controlled description of inconsistencies
between replicas can be found in the literature. Wiederhold and Qian (1987) define the
Identity connection that describes the relationship between two related data objects at
different nodes, including consistency divergence. Wiederhold and Qian (1990) define
inconsistencies in terms of the dependency between the transaction and the following propagation of the update: Immediate (propagation within the transaction), Deferred (propagation after commit), Independent (periodic update of copies) and Potentially inconsistent (mutual inconsistencies due to network failure or partitioning, repaired by compensating actions).
Sheth and Rusinciewicz (1990) differentiate consistency by structural and control dependencies, but also by temporal consistency (eventual and lagging consistency) and spatial consistency (divergence between replicas by the number of changed data objects, the amount of change of the value, and the number of change operations). Predicates for bounding mutual inconsistency between quasi-copies (inconsistent replicas) are introduced in (Alonso, Barbara & Garcia-Molina, 1990). Predicates describe coherency conditions between replicas as delay, version or arithmetic inconsistency. A similar coherency index is defined by Gallersdörfer & Nicola (1995).
Data dependencies and consistency requirements are specified using predicates in (Rusinciewicz, Sheth & Karabatis, 1991), where consistency requirements are expressed as replica differences in time and state.
In our work we do not capture consistency relations between data directly. Rather, we separate data with similar consistency requirements into different segments with properties, which to our knowledge is a new approach for separating consistency requirements for data.
6.4.4 Mixed consistency
Only a few approaches can be found that support both immediate and eventual consistency in the same framework. Adly and Kumar (1994), Adly (1995) and Adly, Nagi and Bacon (1993) present a framework for asynchronous propagation to neighboring nodes, where the nodes are arranged hierarchically. The closer the replicas are, the more mutually consistent are their values. Reads and writes can be done ‘fast’ or ‘slow’, where a ‘fast’ read or write uses the data of the local node and a ‘slow’ one forms a quorum of the root nodes of the hierarchy for read and write operations. The concept of supporting several levels of consistency in the same framework, where the data at a few nodes may share the same level of consistency, is similar to our concept. The difference from our work is that consistency is preserved between a few neighboring nodes in Adly’s framework, while in our framework all replicas of a segment have the same level of consistency. Our framework is designed for real-time requirements, while Adly’s framework focuses on supporting local replication and a mix of consistency levels.
The usage of different concurrency control methods for different classes of transactions in an optimistic replication scheme is explored in (Bassiouni, 1988), where queues for different transactions are used for concurrency control. The aim is to increase concurrency in an optimistic system by supporting transaction classes. A fully replicated approach is replaced by a partially duplicated distributed database. No implementation has been done in this work, but the system is validated through an analytical model. The concept of having different classes of transactions in an optimistic replication database is similar to our concept of transactions that access segments of different consistency classes. In our work the class comes from the application requirements and the segment properties, and we have no distributed transactions, while Bassiouni’s class is connected with the transactions. However, in both approaches the classes control the concurrency control for the replication.
Chapter 7
Conclusions
This chapter contains conclusions from our work with segmentation in distributed real-time databases. After a summary, we list what we see as the important contributions of our work, and we define future research directions for continued work in the area of segmented distributed real-time databases.
7.1 Achievements
We show that segmentation in a distributed database improves the replication effort for applications where full replication is not required. By supporting virtual full replication, the users of the database still have the image that the full database is available at the local node, and the advantages of predictable local and global real-time processing, eventual consistency of replicas and fault tolerance that we see in fully replicated databases still hold.
To support virtual full replication we present a syntax for specification of the requirements of the application using the database. We have developed algorithms for segmenting the database, both by manually specifying segments and by using access information to automatically set up segments.
We present a replication architecture that makes use of the application semantics we
capture in the segment specification to replicate updates of mixed consistency. Our
solution for replication of data updates in an environment of data with mixed
consistency requirements, supporting both full consistency and real-time users, is to our
knowledge a unique approach.
The implicit hypothesis of this dissertation is that segmentation improves scalability and efficiency, by lowering the replication effort and reducing the communication effort. We show that segmentation improves the replication effort, which strengthens the hypothesis. We have also not been able to disprove the hypothesis in the process of developing our solution.
7.2 Contributions
We consider the following to be the most important contributions in this dissertation:
• Exploration of virtual full replication. In this dissertation we elaborate on the initial ideas of virtual full replication, as presented in (Andler et al., 1996), by examining how segmentation may support a fully replicated database. We show how the replication effort is improved, while maintaining the level of local availability of data and the real-time properties of the database.
• Concepts in segmentation of distributed databases. We define the concept of segment properties with consistency classes and a syntax for specification of segments. We define algorithms for how to segment a database.
• Replication control. We present an architecture for replication control that uses the specification of segment properties to show that segments with data of different properties and requirements in consistency and timeliness may coexist in the same distributed database.
• Evaluation model. Our evaluation model for segmented databases and our discussion of scalability and replication efficiency show how to measure the improvements in replication effort for segmented databases.
7.3 Future work
We see a number of possible extensions to the work of this dissertation. The intention is that this work serves as the basis for a research proposal in the area.
7.3.1 Implementation
We see a simulation and/or an implementation of segmentation in DeeDS as necessary
for a full validation of the work and a way to achieve a better understanding of segmentation and its limitations. As a next step we propose a low-level design for
an implementation in DeeDS for investigating replication efficiency and scalability in a
typical WITAS application and an analysis of what parameters influence replication
efficiency and scalability in practice. The proposed segment properties must also be analyzed to see whether they are valid and sufficient for an application. Paragraph 6.2.4 describes
what needs to be validated in a simulation and/or an implementation. In particular, the
architecture of scheduled replication needs to be validated and possible breakdown
situations detected.
A more detailed model for replication throughput is necessary for better understanding
the potential efficiency improvement. Our current model is limited to a definition of
replication throughput, accompanied by a discussion of which parameters influence the replication throughput.
7.3.2 Dynamic allocation of segments by recovery
In this dissertation we have focused on a static description of segments, their properties
and their allocation. For many applications the need for data availability changes during
execution in different modes of operation and this motivates a dynamic allocation of
segments to nodes, following the dynamic needs of the database application. Allocation
and de-allocation of segments to nodes could be done in a similar way to how virtual memory is handled in an operating system, by using recovery techniques.
7.3.3 Segment properties and consistency classes
We have chosen a small set of segment properties and consistency classes for this
dissertation. An application may require a larger set of properties to support the
semantics of the application. For that reason the proposed set of segment properties may
need to be extended. A deeper analysis of the needs from different applications could
result in a more comprehensive set of segment properties.
7.3.4 Other issues for future work
Our basic evaluation model for replication effort needs to be refined so that update
access patterns can be described in greater detail for particular applications. Factors such as arrival rates, arrival distribution and size of updates are application dependent
and need to be specified more accurately. Also, architectural factors influence the
replication effort and scalability, such as how many network messages are used for
propagating updates (batch updates, broadcast updates etc.). An extensive evaluation
model also needs to consider the object and segment sizes and the actual size in bytes of
update messages.
Segment recovery is mentioned a few times in this dissertation and is an issue that needs
to be connected to recovery of distributed databases in general. We have proposed a possibility to recover segments from various sources, by adding the recover-from keyword to our segment specification syntax.
In our syntax for specification of segments we have defined the storage and recover-from keywords, but we have not actually used this information in the dissertation. By explicitly specifying the storage for a segment we can support disk-based segments and segments that can be swapped in and out of memory, from and to disk. Once we support dynamic allocation of segments, segment storage can be handled more easily.
In this dissertation we base the database segmentation on common properties for the application or common access patterns, but we do not optimize the replication effort for the cost of communication and thereby the cost of keeping replicas consistent. By adding the communication cost (network propagation cost, network delays etc.) to the segmentation algorithms, we could get a more cost-efficient setup of segments.
Acknowledgements
I would like to thank my supervisor Prof. Sten F. Andler for solid guidance and an open
mind during our discussions, which helped my incomplete ideas and writing evolve, and
for the time spent reviewing the result. I would also like to thank my co-supervisor
Marcus Brohede for feedback, support and good ideas, and Sanny Gustavsson for his
devoted participation and insight into the area. Also, I would like to thank the entire
Distributed Real-Time Systems research group at the University of Skövde for fruitful
discussions during this project. Thank you all for your committed involvement,
feedback and inspiring company.
I would like to thank Mikael Berndtsson for reviewing this dissertation and for his
valuable comments and questions. I would also like to thank Bengt Eftring for his
feedback.
Furthermore, I would like to thank my family for their faithful support when the summer
was warm and nice but I stayed inside to read, think and write. Thank you for
understanding.
Thank you to all the friends who helped me recharge from time to time during this lengthy
work: at wedding parties, when celebrating midsummer, when visiting a concert, at a
crayfish party, when chasing around to find me a sailboat, or when just making me a cup
of coffee.
Bibliography
Adly, N. (1995) Performance evaluation of HARP: A hierarchical asynchronous
replication protocol for large scale systems. Technical Report TR-378, Computer
Laboratory, University of Cambridge, August
Adly, N., & Kumar, A. (1994) HPP: A hierarchical propagation protocol for large
scale replication in wide area networks. Technical Report TR-331, Computer
Laboratory, University of Cambridge, March
Adly, N., Nagi, M., & Bacon, J. (1993) A hierarchical asynchronous replication
protocol for large scale systems, Proceedings of the IEEE Workshop on Advances
in Parallel and Distributed Systems, 152-157
Alonso, G. (1997) Partial database replication and group communication primitives
(extended abstract). In Proceedings of the 2nd European Research Seminar on
Advances in Distributed Systems (ERSADS'97), 171-176, January
Alonso, R., Barbara, D. & Garcia-Molina, H. (1990) Data caching issues in an
information retrieval system, ACM Transactions on Database Systems 15(3), 359-384
Andler, S. F., Hansson, J., Mellin, J., Eriksson, J. & Eftring, B. (1998) An overview of
the DeeDS real-time database architecture, Proceedings of the 1998 IPPS/SPDP Joint
Workshop on Parallel and Distributed Real-Time Systems (WPDRTS'98), Orlando, Florida
Andler, S. F., Hansson, J., Eriksson, J., Mellin, J., Berndtsson, M. & Eftring, B. (1996)
DeeDS towards a distributed and active real-time database system, ACM SIGMOD
Record, Special Section on Advances in Real-Time Database Systems, 25(1), 38-40, March
Andler, S., Berndtsson, M., Eftring, B., Eriksson, J., Hansson, J. & Mellin, J. (1995)
DeeDS: A distributed active real-time database system, Technical Report
TR-HS-IDA-95-008, Department of Computer Science, University of Skövde, June
Bassiouni, M. A. (1988) Single-site and distributed optimistic protocols for concurrency
control, IEEE Transactions on Software Engineering 14(8), 1071-1080
Bernstein, P. A. & Goodman, N. (1984) An algorithm for concurrency control and
recovery in replicated distributed databases, ACM Transactions on Database
Systems, 9(4), 596-615, December
Bernstein, P.A., Hadzilacos, V. & Goodman, N. (1987) Concurrency Control &
Recovery in Database Systems, Reading, MA: Addison-Wesley
Birrell, A. D., Levin, R., Needham, R. M. & Schroeder, M. D. (1982) Grapevine: An
exercise in distributed computing, Communications of the ACM 25(4), 260-274
Brohede, M. (2001) Real-time database support for distributed real-time simulations,
MSc Thesis HS-IDA-MD-01-002, University of Skövde
Burns, A. & Wellings, A. (2001) Real-Time Systems and Programming Languages (3rd
ed.), Harlow, England: Addison-Wesley
Ceri, S., Negri, M. & Pelagatti, G. (1982) Horizontal data partitioning in database
design, Proceedings of the ACM SIGMOD International Conference on Management of Data
Davidson, S. B. (1984) Optimism and consistency in partitioned distributed database
systems, ACM Transactions on Database Systems 9(3), 456-481
Doherty, P., Granlund, G., Kuchcinski, K., Sandewall, E., Nordberg, K., Skarman, E. &
Wiklund, J. (2000) The WITAS unmanned aerial vehicle project, Proceedings of
the 14th European Conference on Artificial Intelligence (ECAI), Berlin, 747-755
Elmasri, R. & Navathe, S. B. (2000) Fundamentals of Database Systems (3rd ed.),
Reading, MA: Addison-Wesley
Eriksson, D. (2002) How to implement bounded delay replication in DeeDS, Final year
project HS-IDA-EA-02-111, University of Skövde
Gallersdörfer, R. & Nicola, M. (1995) Improving performance in replicated databases
through relaxed coherency, Proceedings of the 21st VLDB Conference, 445-456
Garcia-Molina, H. & Salem, K. (1987) Sagas, Proceedings of the 1987 ACM SIGMOD
International Conference on Management of data 1987, San Francisco, CA
Garcia-Molina, H. & Salem, K. (1992) Main-memory database systems: An overview,
IEEE Transactions on Knowledge and Data Engineering 4(6), 509-516
Gray, J., Helland, P., O’Neil, P. & Shasha, D. (1996) The dangers of replication
and a solution, ACM SIGMOD Record 25(2), 173-182, June
Gray, J. & Reuter, A. (1993) Transaction processing: Concepts and Techniques, San
Francisco, CA: Morgan Kaufmann
Gray, J., Lorie, R., Putzolu, G. & Traiger, I. (1976) Granularity of locks and degrees of
consistency in a shared database, in G. Nijssen (ed.), Modelling in Database
Management Systems, Amsterdam: North-Holland
Gustavsson, P.M. (1995), How to Get Predictable Updates Using Lazy Replication in a
Distributed Real-Time Database System, MSc Thesis, University of Skövde
Guy, R. G., Heidemann, J. S., Mak, W., Page, T. W. Jr., Popek, G. J. & Rothmeier, D.
(1990) Implementation of the Ficus replicated file system, In USENIX Conference
Proceedings, 63-71, June
Helal, A., Heddaya, A. A. & Bhargava, B. B. (1996) Replication techniques in
distributed systems. Norwell, MA: Kluwer Academic Publishers
Hevner, A. & Yao, S. (1979) Query processing in distributed database systems, IEEE
Transactions on Software Engineering 5(3), May
Kung, H. T. & Robinson, J. T. (1981) On optimistic methods for concurrency control,
ACM Transactions on Database Systems 6(2), 213-226, June
Le Lann, G. & Rivierre, N. (1993) Real-time communication over broadcast networks:
The CSMA-DCR and the DOD-CSMA-CD protocols, Technical Report 1863,
INRIA, March
Leifsson, E. O. (1999) Recovery in distributed real-time database systems, MSc Thesis
HS-IDA-MD-99-009, University of Skövde
Locke, C. D. (1986) Best effort decision making for real-time scheduling, Technical
Report CMU-CS-86-134, Department of Computer Science, Carnegie Mellon University, May
Lundström, J. (1997) A conflict detection and resolution mechanism for bounded-delay
replication, MSc Thesis HS-IDA-MD-97-10, University of Skövde
Mukkamala, R. (1988) Design of partially replicated distributed database systems,
Proceedings of ACM SIGMETRICS, 187-196
Navathe, S., Ceri, S., Wiederhold, G. & Dou, J. (1984) Vertical partitioning algorithms
for database design, ACM Transactions on Database Systems 9(4), December
Nicola, M. & Jarke, M. (2000) Performance modelling of distributed and replicated
databases, IEEE Transactions on Knowledge and Data Engineering 12(4), 654-672
Page, T. W. Jr., Guy, R. G., Popek, G. J., Heidemann, J. S. & Rothmeier, D. (1991)
Management of replicated volume location data in the Ficus replicated file system,
In USENIX Conference Proceedings, 17-29, June
Parker, D. S. & Ramos, R. A. (1982) A distributed file system architecture supporting
high availability, Proceedings of the 6th Berkeley Workshop on Distributed Data
Management and Computer Networks, 161-183, February
Pu, C. & Leff, A. (1991) Replica control in distributed systems: an asynchronous
approach, ACM SIGMOD Record 20(2), 377-386
Ramamritham, K. (1996) Real-time databases, International Journal of Distributed and
Parallel Databases, 199-226
Ratner, D., Popek, G. J. & Reiher, P. (1996) Peer replication with selective control,
Technical Report CSD-960031, University of California, Los Angeles, July
Reiher, P., Heidemann, J. S., Ratner, D., Skinner, G. & Popek, G. J. (1994) Resolving
file conflicts in the Ficus file system, In USENIX Conference Proceedings, 183-195, June
Rusinkiewicz, M., Sheth, A. & Karabatis, G. (1991) Specifying inter-database
dependencies in a multidatabase environment, IEEE Computer 24(12), 46-53,
December
Satyanarayanan, M. (1993) Distributed file systems, in S. Mullender (ed.) Distributed
Systems (2nd ed.), Harlow, England: Addison-Wesley
Sheth, A. & Rusinkiewicz, M. (1990) Management of interdependent data: Specifying
dependency and consistency requirements, Proceedings of the 1st Workshop on the
Management of Replicated Data, Houston, 133-136
Son, S. H. & Zhang, F. (1995) Real-time replication control for distributed database
systems: Algorithms and their performance, Proceedings of the Fourth International
Conference on Database Systems for Advanced Applications, 214-221, April
Verissimo, P. (1993) Real-time communication, In S. Mullender, (ed.), Distributed
Systems (2nd ed.), Harlow, England: Addison-Wesley
Wiederhold, G. & Qian, X. (1987) Modelling asynchrony in distributed databases,
Proceedings of the 3rd International Conference on Data Engineering, 246-250
Wiederhold, G. & Qian, X. (1990) Consistency control of replicated data in federated
databases, Proceedings of the 1st Workshop on the Management of Replicated Data,
Houston, 130-132
List of figures
Figure 1. Value functions................................................................................................11
Figure 2. A WITAS system.............................................................................................26
Figure 3. Segmenting the database..................................................................................33
Figure 4. A replicated database with three replicas ........................................................33
Figure 5. A replicated and segmented database ..............................................................34
Figure 6. Applications and processes in segmented databases .......................................36
Figure 7. A replicated and segmented database accessed by clients with different
requirements............................................................................................................45
Figure 8. Processes, time constraints and consistency classes........................................53
Figure 9. SAGA - The super-transaction concept...........................................................61
Figure 10. The DeeDS architecture.................................................................................66
Figure 11. The replication module of DeeDS .................................................................67
Figure 12. Propagation scheduling queues......................................................................69
Figure 13. Update rates and resulting integration rate requirements ..............................81
List of tables
Table 1. Key attributes for consistency classes...............................................................44
Table 2. Timeliness requirements for data objects at nodes ...........................................58