Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP 2011)
October 23-26, 2011
Cascais, Portugal
Sponsored by: ACM SIGOPS and INESC-ID
Supported by:
Microsoft Research, National Science Foundation, Infosys,
VMware, Google, HP, NetApp, Intel, HiPEAC,
IBM Research, Facebook, Citrix, Akamai, SAP
Dear Reader,
Welcome to the Proceedings of the 23rd ACM Symposium on Operating Systems
Principles (SOSP 2011), held in Cascais, Portugal. This year’s program includes 28
papers and touches on a wide range of computer systems topics, from data center computing, storage systems, and geo-replication to operating system architecture, virtualization, concurrency, security, and mobile platforms. The program committee made every
effort to identify and include some of the most creative and thought-provoking ideas in
computer systems today. Each accepted paper was shepherded by a program committee
member to make sure the papers are as readable and complete as possible. We hope you
will enjoy the program as much as we did in selecting it.
This year’s proceedings are, for the first time, published digitally on a USB key with no
paper copy distributed at the conference. The cost to the environment of so many reams
of printed paper, plus the difficulty of shipping printed material to the conference site,
made this an easy decision. The workshop proceedings appear on the conference USB
key as well. You will find two copies of each of this year’s SOSP papers: a traditional two-column version designed for printing, and a one-column version intended for reading on a
computer. In addition, the USB key contains a copy of every SOSP paper from each of
the previous 22 instances of the conference, starting in 1967. The very nature of
publishing is changing as we speak. We look forward to your feedback about the
appropriate form and format for future SOSP proceedings.
We are most grateful to the authors of the 157 papers who chose to submit their work to
SOSP (five papers were subsequently withdrawn by the authors). Their ideas and efforts
are the basis of the conference’s success. Selecting the program out of so many quality
submissions was a difficult task. A program committee consisting of 28 leading scholars
in the broad area of computer systems conducted a three-stage reviewing process and
online discussion; final selections were made during a physical meeting in Boston, MA,
which was attended by a core of 13 PC members. Each submission received at least three
PC reviews, with a maximum of eight. All in all, 719 reviews were produced. The PC
made every effort to provide detailed and constructive feedback, which should help
authors to further improve their work. Author anonymity was maintained throughout the
reviewing and selection process; PC members were removed from the deliberations of
any paper with which they had a conflict of interest (co-author, same institution, recent
collaborator, former student/adviser).
The organizing committee is pleased to be able to follow the informal tradition of locating every third SOSP in Europe. Doing so would not have been possible this year without the tireless efforts of Luís Rodrigues and João Leitão as well as the support of the
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em
Lisboa (INESC-ID) which is conference co-sponsor along with ACM SIGOPS. For the
first time, SOSP is co-located with the ACM Symposium on Cloud Computing (SOCC),
to take place on October 27th and 28th. SOCC, a conference now in its second year, is co-sponsored by SIGOPS, SIGMOD and INESC-ID, and it should provide an excellent
counterpoint to SOSP in a very topical area of research.
Following the lead of SOSP 2009, this year’s conference also offers a full slate of workshops on the Sunday before the main event. These workshops cover a range of topics
related to operating systems: programming languages (PLOS), mobile handheld systems
(MobiHeld), power-aware computing (HotPower), and the management of large scale
systems through log analysis and machine learning (SLAML). We would like to thank
the organizers and sponsors of all four workshops as well as Rama Kotla and Rodrigo
Rodrigues, who served as workshop chairs. We welcome community feedback on ideas
for future co-located workshops.
We are especially thankful this year for our generous corporate and governmental donors.
These donors make it possible to host SOSP in an environment that is conducive to collegial interaction and, this year, they have provided funds for full travel grants to over 70
students from a wide range of countries and institutions. Special thanks go to Brett
Fleisch who assembled our NSF grant application, and to the folks at UC Riverside for
administering the resulting grant.
SOSP is a great conference mostly because it attracts so many high-quality submissions,
and we would like to again thank all the authors who submitted. We also thank the PC
members for the tremendous amount of work they did: reviewing the submissions, providing feedback, and shepherding the accepted submissions. We are grateful to the external reviewers who provided an additional perspective on a few papers. SOSP has always
been organized by volunteer efforts from a host of people; we would like to thank all the
following people who have dedicated so much of their time to the conference: Luís
Rodrigues (local arrangements), João Leitão (registration), John MacCormick (treasurer),
Chandu Thekkath, J.P. Martin, and Sape Mullender (sponsorships), David Lie, Nickolai
Zeldovich, and Dilma Da Silva (scholarships), Nuno Carvalho (website), Paarijaat Aditya
(submissions website), Rama Kotla and Rodrigo Rodrigues (workshops), Bryan Ford and
George Candea (WIPs/Posters), and Junfeng Yang and Rodrigo Rodrigues (publicity).
Finally, we would like to especially thank Andrew Birrell who assembled the conference
USB key and guided the production of “camera-ready” copy.
We hope that you will find the program interesting and inspiring, and trust that the symposium will provide you with a valuable opportunity to network and share ideas with
researchers and practitioners from institutions around the world.
Ted Wobber
General Chair
Peter Druschel
Program Chair
SILT: A Memory-Efficient,
High-Performance Key-Value Store
Hyeontaek Lim (Carnegie Mellon University)
Bin Fan (Carnegie Mellon University)
David G. Andersen (Carnegie Mellon University)
Michael Kaminsky (Intel Labs)
SILT (Small Index Large Table) is a memory-efficient, high-performance key-value store system based
on flash storage that scales to serve billions of key-value items on a single node. It requires only
0.7 bytes of DRAM per entry and retrieves key/value pairs using on average 1.01 flash reads each.
SILT combines new algorithmic and systems techniques to balance the use of memory, storage, and
computation. Our contributions include: (1) the design of three basic key-value stores each with a
different emphasis on memory-efficiency and write-friendliness; (2) synthesis of the basic key-value
stores to build a SILT key-value store system; and (3) an analytical model for tuning system parameters
carefully to meet the needs of different workloads. SILT requires one to two orders of magnitude
less memory to provide comparable throughput to current high-performance key-value systems on a
commodity desktop system with flash storage.
D.4.2 [Operating Systems]: Storage Management; D.4.7 [Operating Systems]: Organization and
Design; D.4.8 [Operating Systems]: Performance; E.1 [Data]: Data Structures; E.2 [Data]: Data
Storage Representations; E.4 [Data]: Coding and Information Theory
Algorithms, Design, Measurement, Performance
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
SOSP '11, October 23-26, 2011, Cascais, Portugal.
Copyright © 2011 ACM 978-1-4503-0977-6/11/10 ... $10.00.
                     2008 → 2011            Increase
CPU transistors      731 → 1,170 M          60%
DRAM capacity        0.062 → 0.153 GB/$     147%
Flash capacity       0.134 → 0.428 GB/$     219%
Disk capacity        4.92 → 15.1 GB/$       207%

Table 1: From 2008 to 2011, flash and hard disk capacity increased much faster than either CPU transistor count or DRAM capacity.

[Figure 1: The memory overhead (bytes/key) and lookup performance (flash reads per lookup) of SILT and the recent key-value stores. For both axes, smaller is better.]
Algorithms, design, flash, measurement, memory efficiency, performance
Key-value storage systems have become a critical building block for today’s large-scale, high-performance data-intensive applications. High-performance key-value stores have therefore received
substantial attention in a variety of domains, both commercial and academic: e-commerce platforms [21], data deduplication [1, 19, 20], picture stores [7], web object caching [4, 30], and more.
To achieve low latency and high performance, and make best use of limited I/O resources, key-value
storage systems require efficient indexes to locate data. As one example, Facebook engineers recently
created a new key-value storage system that makes aggressive use of DRAM-based indexes to avoid
the bottleneck caused by multiple disk operations when reading data [7]. Unfortunately, DRAM is up
to 8X more expensive and uses 25X more power per bit than flash, and, as Table 1 shows, is growing
more slowly than the capacity of the disk or flash that it indexes. As key-value stores scale in both size
and importance, index memory efficiency is increasingly becoming one of the most important factors
for the system’s scalability [7] and overall cost effectiveness.
Recent proposals have started examining how to reduce per-key in-memory index overhead [1, 2,
4, 19, 20, 32, 40], but these solutions either require more than a few bytes per key-value entry in
memory [1, 2, 4, 19], or compromise performance by keeping all or part of the index on flash or disk
and thus require many flash reads or disk seeks to handle each key-value lookup [20, 32, 40] (see
Figure 1 for the design space). We term this latter problem read amplification and explicitly strive to
avoid it in our design.
This paper presents a new flash-based key-value storage system, called SILT (Small Index Large
Table), that significantly reduces per-key memory consumption with predictable system performance
and lifetime. SILT requires approximately 0.7 bytes of DRAM per key-value entry and uses on average
only 1.01 flash reads to handle lookups. Consequently, SILT can saturate the random read I/O on our
experimental system, performing 46,000 lookups per second for 1024-byte key-value entries, and it
can potentially scale to billions of key-value items on a single host. SILT offers several knobs to trade
memory efficiency and performance to match available hardware.
This paper makes three main contributions:
• The design and implementation of three basic key-value stores (LogStore, HashStore, and SortedStore) that use new fast and compact indexing data structures (partial-key cuckoo hashing and entropy-coded tries), each of which places a different emphasis on memory-efficiency and write-friendliness;
• Synthesis of these basic stores to build SILT.
• An analytic model that enables an explicit and careful balance between memory, storage,
and computation to provide an accurate prediction of system performance, flash lifetime, and
memory efficiency.
Like other key-value systems, SILT implements a simple exact-match hash table interface including
PUT (map a new or existing key to a value), GET (retrieve the value by a given key), and DELETE
(delete the mapping of a given key).
For simplicity, we assume that keys are uniformly distributed 160-bit hash values (e.g., pre-hashed
keys with SHA-1) and that data is fixed-length. This type of key-value system is widespread in several
application domains such as data deduplication [1, 19, 20], and is applicable to block storage [18, 36],
microblogging [25, 38], WAN acceleration [1], among others. In systems with lossy-compressible
data, e.g., picture stores [7, 26], data can be adaptively compressed to fit in a fixed-sized slot. A
key-value system may also let applications choose one of multiple key-value stores, each of which is
optimized for a certain range of value sizes [21]. We discuss the relaxation of these assumptions in
Section 4.
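A minimal sketch of this exact-match interface, with keys pre-hashed to uniformly distributed 160-bit SHA-1 values as assumed above (the dict-backed store and the VALUE_LEN constant are illustrative stand-ins, not SILT’s implementation):

```python
import hashlib

class KVStore:
    """Stand-in for the exact-match hash table interface SILT exposes.
    Keys are pre-hashed to 160-bit values (SHA-1); values are fixed-length."""

    VALUE_LEN = 1024  # hypothetical fixed value size

    def __init__(self):
        self._table = {}

    @staticmethod
    def hash_key(raw_key: bytes) -> bytes:
        return hashlib.sha1(raw_key).digest()  # 160-bit key

    def put(self, raw_key: bytes, value: bytes) -> None:
        # Map a new or existing key to a value.
        assert len(value) == self.VALUE_LEN
        self._table[self.hash_key(raw_key)] = value

    def get(self, raw_key: bytes):
        # Retrieve the value for a given key, or None if absent.
        return self._table.get(self.hash_key(raw_key))

    def delete(self, raw_key: bytes) -> None:
        # Delete the mapping of a given key.
        self._table.pop(self.hash_key(raw_key), None)
```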
Design Goals and Rationale The design of SILT follows from five main goals:
1. Low read amplification: Issue at most 1 + ε flash reads for a single GET, where ε is configurable and small (e.g., 0.01).
Rationale: Random reads remain the read throughput bottleneck when using flash memory.
Read amplification therefore directly reduces throughput.
2. Controllable write amplification and favoring sequential writes: It should be possible to
adjust how many times a key-value entry is rewritten to flash over its lifetime. The system
should issue flash-friendly, large writes.
Rationale: Flash memory can undergo only a limited number of erase cycles before it fails.
Random writes smaller than the SSD log-structured page size (typically 4 KiB¹) cause extra
flash traffic.
¹ For clarity, binary prefixes (powers of 2) will include “i”, while SI prefixes (powers of 10) will appear without any “i”.
[Figure 2: Architecture of SILT.]

                  SortedStore (§3.3)       HashStore (§3.2)    LogStore (§3.1)
Data ordering     Key order                Hash order          Insertion order
Typical size      > 80% of total entries   < 20%               < 1%
DRAM usage        0.4 bytes/entry          2.2 bytes/entry     6.5 bytes/entry

Table 2: Summary of basic key-value stores in SILT.
Optimizations for memory efficiency and garbage collection often require data layout changes
on flash. The system designer should be able to select an appropriate balance of flash lifetime,
performance, and memory overhead.
3. Memory-efficient indexing: SILT should use as little memory as possible (e.g., less than one
byte per key stored).
Rationale: DRAM is both more costly and power-hungry per gigabyte than Flash, and its
capacity is growing more slowly.
4. Computation-efficient indexing: SILT’s indexes should be fast enough to let the system
saturate the flash I/O.
Rationale: System balance and overall performance.
5. Effective use of flash space: Some data layout options use the flash space more sparsely to
improve lookup or insertion speed, but the total space overhead of any such choice should
remain small – less than 20% or so.
Rationale: SSDs remain relatively expensive.
In the rest of this section, we first explore SILT’s high-level architecture, which we term a “multi-store
approach”, contrasting it with a simpler but less efficient single-store approach. We then briefly outline
the capabilities of the individual store types that we compose to form SILT, and show how SILT
handles key-value operations using these stores.
Conventional Single-Store Approach A common approach to building high-performance key-value
stores on flash uses three components:
1. an in-memory filter to efficiently test whether a given key is stored in this store before accessing flash;
2. an in-memory index to locate the data on flash for a given key; and
3. an on-flash data layout to store all key-value pairs persistently.
Unfortunately, to our knowledge, no existing index data structure and on-flash layout achieve all of
our goals simultaneously. For example, HashCache-Set [4] organizes on-flash keys as a hash table,
eliminating the in-memory index, but incurring random writes that impair insertion speed. To avoid
expensive random writes, systems such as FAWN-DS [2], FlashStore [19], and SkimpyStash [20]
append new values sequentially to a log. These systems then either require an in-memory hash table to map a key to its offset in the log (often requiring 4 bytes of DRAM or more per entry) [2, 20], or they keep part of the index on flash, incurring multiple random reads for each lookup [20].
Multi-Store Approach BigTable [14], Anvil [29], and BufferHash [1] chain multiple stores, each
with different properties such as high write performance or inexpensive indexing.
Multi-store systems impose two challenges. First, they require effective designs and implementations
of the individual stores: they must be efficient, compose well, and it must be efficient to transform data
between the store types. Second, it must be efficient to query multiple stores when performing lookups.
The design must keep read amplification low by not issuing flash reads to each store. A common
solution uses a compact in-memory filter to test whether a given key can be found in a particular store,
but this filter can be memory-intensive—e.g., BufferHash uses 4–6 bytes for each entry.
SILT’s multi-store design uses a series of basic key-value stores, each optimized for a different purpose:
1. Keys are inserted into a write-optimized store, and over their lifetime flow into increasingly
more memory-efficient stores.
2. Most key-value pairs are stored in the most memory-efficient basic store. Although data outside
this store uses less memory-efficient indexes (e.g., to optimize writing performance), the average
index cost per key remains low.
3. SILT is tuned for high worst-case performance—a lookup found in the last and largest store.
As a result, SILT can avoid using an in-memory filter on this last store, allowing all lookups
(successful or not) to take 1 + ε flash reads.
SILT’s architecture and basic stores (the LogStore, HashStore, and SortedStore) are depicted in
Figure 2. Table 2 summarizes these stores’ characteristics.
LogStore is a write-friendly key-value store that handles individual PUTs and DELETEs. To achieve
high performance, writes are appended to the end of a log file on flash. Because these items are ordered
by their insertion time, the LogStore uses an in-memory hash table to map each key to its offset in the
log file. The table doubles as an in-memory filter. SILT uses a memory-efficient, high-performance
hash table based upon cuckoo hashing [34]. As described in Section 3.1, our partial-key cuckoo
hashing achieves 93% occupancy with very low computation and memory overhead, a substantial
improvement over earlier systems such as FAWN-DS and BufferHash that achieved only 50% hash
table occupancy. Compared to the next two read-only store types, however, this index is still relatively
memory-intensive, because it must store one 4-byte pointer for every key. SILT therefore uses only
one instance of the LogStore (except during conversion to HashStore as described below), with fixed
capacity to bound its memory consumption.
Once full, the LogStore is converted to an immutable HashStore in the background. The HashStore’s
data is stored as an on-flash hash table that does not require an in-memory index to locate entries.
SILT uses multiple HashStores at a time before merging them into the next store type. Each HashStore
therefore uses an efficient in-memory filter to reject queries for nonexistent keys.
SortedStore maintains key-value data in sorted key order on flash, which enables an extremely compact
index representation (e.g., 0.4 bytes per key) using a novel design and implementation of entropy-coded
tries. Because of the expense of inserting a single item into sorted data, SILT periodically merges in
[Figure 3: Design of LogStore: an in-memory cuckoo hash table (index and filter) and an on-flash data log. A key x is indexed in the hash table by h1(x) or by h2(x); inserted key-value entries are appended to the log-structured data on flash.]
bulk several HashStores along with an older version of a SortedStore and forms a new SortedStore,
garbage collecting deleted or overwritten keys in the process.
Key-Value Operations Each PUT operation inserts a (key,value) pair into the LogStore, even
if the key already exists. DELETE operations likewise append a “delete” entry into the LogStore.
The space occupied by deleted or stale data is reclaimed when SILT merges HashStores into the
SortedStore. These lazy deletes trade flash space for sequential write performance.
To handle GET, SILT searches for the key in the LogStore, HashStores, and SortedStore in sequence,
returning the value found in the youngest store. If a “delete” entry is found first, SILT stops searching and returns “not found.”
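The GET path described above can be sketched as follows. Plain Python dicts stand in for the individual stores, and TOMBSTONE is a hypothetical marker for an on-flash “delete” entry; neither is SILT’s actual representation.

```python
TOMBSTONE = object()  # stands in for an on-flash "delete" entry

def silt_get(key, log_store, hash_stores, sorted_store):
    """Search stores from youngest to oldest: the LogStore first,
    then the HashStores, then the SortedStore. A tombstone means the
    key was deleted in a younger store, so the search stops there."""
    stores = [log_store] + hash_stores + [sorted_store]
    for store in stores:
        value = store.get(key)   # per-store lookup (dict stand-in)
        if value is TOMBSTONE:
            return None          # "delete" entry found: stop searching
        if value is not None:
            return value         # value in the youngest store wins
    return None
```

This ordering is what makes lazy deletes safe: a stale value in an older store is shadowed by the younger tombstone until the next merge reclaims its space.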
Partitioning Finally, we note that each physical node runs multiple SILT instances, responsible for
disjoint key ranges, each with its own LogStore, SortedStore, and HashStore(s). This partitioning
improves load-balance (e.g., virtual nodes [37]), reduces flash overhead during merge (Section 3.3),
and facilitates system-wide parameter tuning (Section 5).
The LogStore writes PUTs and DELETEs sequentially to flash to achieve high write throughput. Its
in-memory partial-key cuckoo hash index efficiently maps keys to their location in the flash log, as
shown in Figure 3.
Partial-Key Cuckoo Hashing The LogStore uses a new hash table based on cuckoo hashing [34]. As
with standard cuckoo hashing, it uses two hash functions h1 and h2 to map each key to two candidate
buckets. On insertion of a new key, if one of the candidate buckets is empty, the key is inserted in
this empty slot; if neither bucket is available, the new key “kicks out” the key that already resides in
one of the two buckets, and the displaced key is then inserted to its own alternative bucket (and may
kick out other keys). The insertion algorithm repeats this process until a vacant position is found, or it
reaches a maximum number of displacements (e.g., 128 times in our implementation). If no vacant slot is found, the hash table is almost full, so SILT freezes the LogStore and initializes a new one without expensive rehashing.
To make it compact, the hash table does not store the entire key (e.g., 160 bits in SILT), but only a
“tag” of the actual key. A lookup proceeds to flash only when the given key matches the tag, which can
prevent most unnecessary flash reads for non-existing keys. If the tag matches, the full key and its value are retrieved from the log on flash to verify that it is indeed the correct key.
Although storing only the tags in the hash table saves memory, it presents a challenge for cuckoo
hashing: moving a key to its alternative bucket requires knowing its other hash value. Here, however,
the full key is stored only on flash, but reading it from flash is too expensive. Even worse, moving this
key to its alternative bucket may in turn displace another key; ultimately, each displacement required
by cuckoo hashing would result in additional flash reads, just to insert a single key.
To solve this costly displacement problem, our partial-key cuckoo hashing algorithm stores the index
of the alternative bucket as the tag; in other words, partial-key cuckoo hashing uses the tag to reduce
flash reads for non-existent key lookups as well as to indicate an alternative bucket index to perform
cuckoo displacement without any flash reads. For example, if a key x is assigned to bucket h1 (x),
the other hash value h2 (x) will become its tag stored in bucket h1 (x), and vice versa (see Figure 3).
Therefore, when a key is displaced from bucket a, SILT reads the tag (value: b) at this bucket and moves the key to bucket b without needing to read from flash. It then sets the tag at bucket b to value a.
To find key x in the table, SILT checks if h1 (x) matches the tag stored in bucket h2 (x), or if h2 (x)
matches the tag in bucket h1 (x). If the tag matches, the (key,value) pair is retrieved from the
flash location indicated in the hash entry.
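The insertion, displacement, and lookup logic above can be sketched as a simplified single-entry-per-bucket model (the real LogStore uses 4-way buckets and packs the tag with a valid bit and offset into 6-byte entries). As in the paper, the tag stored in a slot is the index of the key’s alternative bucket; deriving the two bucket indices from slices of a SHA-1 digest is an illustrative assumption.

```python
import hashlib

NUM_BUCKETS = 2 ** 15        # as in SILT's 15-bit key fragments
MAX_DISPLACEMENTS = 128      # as in the paper's implementation

def _hashes(key: bytes):
    # Two candidate bucket indices from non-overlapping slices of the
    # key's SHA-1 digest (SILT slices the already-hashed 160-bit key).
    digest = hashlib.sha1(key).digest()
    h1 = int.from_bytes(digest[0:4], "big") % NUM_BUCKETS
    h2 = int.from_bytes(digest[4:8], "big") % NUM_BUCKETS
    return h1, h2

class PartialKeyCuckooIndex:
    """Single-entry-per-bucket sketch of partial-key cuckoo hashing.
    Each slot stores (tag, offset), where the tag names the entry's
    *other* candidate bucket, so displacement needs no flash read."""

    def __init__(self):
        self.buckets = [None] * NUM_BUCKETS

    def insert(self, key: bytes, offset: int) -> bool:
        h1, h2 = _hashes(key)
        if self.buckets[h1] is None:           # try both candidates first
            self.buckets[h1] = (h2, offset)
            return True
        if self.buckets[h2] is None:
            self.buckets[h2] = (h1, offset)
            return True
        bucket, tag = h1, h2                   # kick out the resident of h1
        for _ in range(MAX_DISPLACEMENTS):
            victim_tag, victim_offset = self.buckets[bucket]
            self.buckets[bucket] = (tag, offset)
            # The victim moves to its alternative bucket (its tag), and its
            # new tag is the bucket it just left.
            bucket, tag, offset = victim_tag, bucket, victim_offset
            if self.buckets[bucket] is None:
                self.buckets[bucket] = (tag, offset)
                return True
        return False  # table nearly full: freeze this LogStore

    def lookup(self, key: bytes):
        h1, h2 = _hashes(key)
        for bucket, tag in ((h1, h2), (h2, h1)):
            entry = self.buckets[bucket]
            if entry is not None and entry[0] == tag:
                return entry[1]  # candidate offset; verify full key on flash
        return None
```

Note that a matching tag is only probabilistic evidence: the caller must still read the full key from the flash log to rule out a false positive.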
Associativity Standard cuckoo hashing allows 50% of the table entries to be occupied before unresolvable collisions occur. SILT improves the occupancy by increasing the associativity of the cuckoo
hashing table. Each bucket of the table is of capacity four (i.e., it contains up to 4 entries). Our
experiments show that using a 4-way set associative hash table improves space utilization of the
table to about 93%,² which matches the known experimental result for various variants of cuckoo hashing [24]; moreover, 4 entries/bucket still allows each bucket to fit in a single cache line.³
Hash Functions Keys in SILT are 160-bit hash values, so the LogStore finds h1 (x) and h2 (x) by
taking two non-overlapping slices of the low-order bits of the key x.
By default, SILT uses a 15-bit key fragment as the tag. Each hash table entry is 6 bytes, consisting of
a 15-bit tag, a single valid bit, and a 4-byte offset pointer. The probability of a false positive retrieval
is 0.024% (see Section 5 for derivation), i.e., on average 1 in 4,096 flash retrievals is unnecessary. The
maximum number of hash buckets (not entries) is limited by the key fragment length. Given 15-bit
key fragments, the hash table has at most 2¹⁵ buckets, or 4 × 2¹⁵ = 128 Ki entries. To store more
keys in LogStore, one can increase the size of the key fragment to have more buckets, increase the
associativity to pack more entries into one bucket, and/or partition the key-space to smaller regions
and assign each region to one SILT instance with a LogStore. The tradeoffs associated with these
decisions are presented in Section 5.
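The capacity and false-positive figures above follow directly from the parameters; a hypothetical back-of-the-envelope check (not part of SILT):

```python
# Reproduce the LogStore index arithmetic from the text.
tag_bits = 15
ways = 4                        # 4-way set associative buckets
buckets = 2 ** tag_bits         # the tag must be able to name any bucket
entries = ways * buckets        # 4 * 2^15 = 128 Ki entries

# A lookup probes 2 candidate buckets of 4 entries each; each entry's
# 15-bit tag matches by chance with probability 2^-15.
false_positive = 2 * ways / 2 ** tag_bits   # = 2^-12

print(entries)                  # 131072 (128 Ki)
print(false_positive * 100)     # 0.0244140625 (percent), i.e. ~1 in 4,096
```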
² Space utilization here is defined as the fraction of used entries (not used buckets) in the table, which more precisely reflects actual memory utilization.
³ Note that another way to increase the utilization of a cuckoo hash table is to use more hash functions (i.e., each key has more possible locations in the table). For example, FlashStore [19] applies 16 hash functions to achieve 90% occupancy. However, having more hash functions increases the number of cache lines read upon lookup and, in our case, requires more than one tag stored in each entry, increasing overhead.
[Figure 4: Converting a LogStore to a HashStore. Four keys K1, K2, K3, and K4 are inserted into the LogStore, so the layout of the log file is the insertion order (K1 K2 K3 K4); the in-memory index keeps the offset of each key on flash. In the HashStore, the on-flash data forms a hash table in hash order (K2 K4 K1 K3), the same order as the in-memory filter.]
Once a LogStore fills up (e.g., the insertion algorithm terminates without finding any vacant slot after
a maximum number of displacements in the hash table), SILT freezes the LogStore and converts it into
a more memory-efficient data structure. Directly sorting the relatively small LogStore and merging
it into the much larger SortedStore requires rewriting large amounts of data, resulting in high write
amplification. On the other hand, keeping a large number of LogStores around before merging could
amortize the cost of rewriting, but unnecessarily incurs high memory overhead from the LogStore’s
index. To bridge this gap, SILT first converts the LogStore to an immutable HashStore with higher
memory efficiency; once SILT accumulates a configurable number of HashStores, it performs a bulk
merge to incorporate them into the SortedStore. During the LogStore to HashStore conversion, the old
LogStore continues to serve lookups, and a new LogStore receives inserts.
HashStore saves memory over LogStore by eliminating the index and reordering the on-flash
(key,value) pairs from insertion order to hash order (see Figure 4). HashStore is thus an on-flash
cuckoo hash table, and has the same occupancy (93%) as the in-memory version found in LogStore.
HashStores also have one in-memory component, a filter to probabilistically test whether a key is
present in the store without performing a flash lookup.
Memory-Efficient Hash Filter Although prior approaches [1] used Bloom filters [12] for the probabilistic membership test, SILT uses a hash filter based on partial-key cuckoo hashing. Hash filters
are more memory-efficient than Bloom filters at low false positive rates. Given a 15-bit tag in a 4-way set associative cuckoo hash table, the false positive rate is f = 2⁻¹² = 0.024% as calculated in Section 3.1. With 93% table occupancy, the effective number of bits per key using a hash filter is 15/0.93 = 16.12. In contrast, a standard Bloom filter that sets its number of hash functions to optimize space consumption requires at least 1.44 log₂(1/f) = 17.28 bits of memory to achieve the same false positive rate.
HashStore’s hash filter is also efficient to create: SILT simply copies the tags from the LogStore’s hash table, in order, discarding the offset pointers; in contrast, a Bloom filter would have to be built from scratch, hashing every item in the LogStore again.
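The bits-per-key comparison can be reproduced in a few lines, assuming the standard 1.44 log2(1/f) space bound for a Bloom filter tuned to minimize memory at a given false positive rate:

```python
import math

f = 2 ** -12             # false positive rate from Section 3.1
occupancy = 0.93         # partial-key cuckoo table occupancy
tag_bits = 15

# Hash filter: 15 tag bits per slot, amortized over 93%-full slots.
hash_filter_bits = tag_bits / occupancy       # ~16.1 bits per key
# Space-optimal Bloom filter at the same false positive rate.
bloom_bits = 1.44 * math.log2(1 / f)          # 17.28 bits per key

print(hash_filter_bits < bloom_bits)          # True
```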
[Figure 5: Example of a trie built for indexing sorted keys (only key MSBs are shown; suffix bits unused in indexing are shaded). The index of each leaf node matches the index of the corresponding key in the sorted keys.]
SortedStore is a static key-value store with very low memory footprint. It stores (key,value)
entries sorted by key on flash, indexed by a new entropy-coded trie data structure that is fast to
construct, uses 0.4 bytes of index memory per key on average, and keeps read amplification low
(exactly 1) by directly pointing to the correct location on flash.
Using Sorted Data on Flash Because of these desirable properties, SILT keeps most of the key-value
entries in a single SortedStore. The entropy-coded trie, however, does not allow for insertions or
deletions; thus, to merge HashStore entries into the SortedStore, SILT must generate a new SortedStore.
The construction speed of the SortedStore is therefore a large factor in SILT’s overall performance.
Sorting provides a natural way to achieve fast construction:
1. Sorting allows efficient bulk-insertion of new data. The new data can be sorted and sequentially
merged into the existing sorted data.
2. Sorting is well-studied. SILT can use highly optimized and tested sorting systems such as
Nsort [33].
Indexing Sorted Data with a Trie A trie, or a prefix tree, is a tree data structure that stores an array
of keys where each leaf node represents one key in the array, and each internal node represents the
longest common prefix of the keys represented by its descendants.
When fixed-length key-value entries are sorted by key on flash, a trie for the shortest unique prefixes
of the keys serves as an index for these sorted data. The shortest unique prefix of a key is the shortest
prefix of a key that enables distinguishing the key from the other keys. In such a trie, some prefix of a
lookup key leads us to a leaf node with a direct index for the looked up key in sorted data on flash.
Figure 5 shows an example of using a trie to index sorted data. Key prefixes with no shading are
the shortest unique prefixes which are used for indexing. The shaded parts are ignored for indexing
because any value for the suffix part would not change the key location.

[Figure 6: (a) Alternative view of Figure 5, where a pair of numbers in each internal node denotes the number of leaf nodes in its left and right subtries. (b) A recursive form that represents the trie: 3 2 1 3 1 1 1. (c) Its entropy-coded representation used by SortedStore: 00 0 1 11 11 1 1.]

A lookup of a key, for example, 10010, follows down to the leaf node that represents 100. As there are 3 preceding leaf nodes, the
index of the key is 3. With fixed-length key-value pairs on flash, the exact offset of the data is the
obtained index times the key-value pair size (see Section 4 for extensions for variable-length key-value
pairs). Note that a lookup of similar keys with the same prefix of 100 (e.g., 10000, 10011) would
return the same index even though they are not in the array; the trie guarantees a correct index lookup
for stored keys, but says nothing about the presence of a lookup key.
Representing a Trie A typical tree data structure is unsuitable for SILT because each node would
require expensive memory pointers, each 2 to 8 bytes long. Common trie representations such as
level-compressed tries [3] are also inadequate if they use pointers.
SortedStore uses a compact recursive representation to eliminate pointers. The representation for a trie
T having L and R as its left and right subtries is defined as follows:
Repr(T ) := |L| Repr(L) Repr(R)
where |L| is the number of leaf nodes in the left subtrie. When T is empty or a leaf node, the
representation for T is an empty string. (We use a special mark (-1) instead of the empty string for
brevity in the simplified algorithm description, but the full algorithm does not require the use of the
special mark.)
Figure 6 (a,b) illustrates the uncompressed recursive representation for the trie in Figure 5. As there
are 3 keys starting with 0, |L| = 3. In its left subtrie, |L| = 2 because it has 2 keys that have 0 in their
second bit position, so the next number in the representation is 2. It again recurses into its left subtrie,
yielding 1. Here there are no more non-leaf nodes, so it returns to the root node and then generates the
representation for the right subtrie of the root, 3 1 1 1.
Algorithm 1 shows a simplified algorithm that builds a (non-entropy-coded) trie representation from
sorted keys. It resembles quicksort in that it finds the partition of keys and recurses into both subsets.
Index generation is fast (≥ 7 M keys/sec on a modern Intel desktop CPU, Section 6).
Looking up Keys Key-lookup works by incrementally reading the trie representation (Algorithm 2).
The function is supplied the lookup key and a trie representation string. By decoding the encoded next
number, thead, SortedStore knows if the current node is an internal node where it can recurse into
its subtrie. If the lookup key goes to the left subtrie, SortedStore recurses into the left subtrie, whose
representation immediately follows in the given trie representation; otherwise, SortedStore recursively
decodes and discards the entire left subtrie and then recurses into the right. SortedStore sums thead at
every node where it recurses into a right subtrie; the sum of the thead values is the offset at which the
lookup key is stored, if it exists.
# @param T       array of sorted keys
# @return        trie representation
def construct(T):
    if len(T) == 0 or len(T) == 1:
        return [-1]
    # Partition keys according to their MSB
    L = [key[1:] for key in T if key[0] == 0]
    R = [key[1:] for key in T if key[0] == 1]
    # Recursively construct the representation
    return [len(L)] + construct(L) + construct(R)

Algorithm 1: Trie representation generation in Python-like syntax. key[0] and key[1:]
denote the most significant bit and the remaining bits of key, respectively.
For example, to look up 10010, SortedStore first obtains 3 from the representation. Then, as the first
bit of the key is 1, it skips the next numbers (2 1) which are for the representation of the left subtrie,
and it proceeds to the right subtrie. In the right subtrie, SortedStore reads the next number (3; not a
leaf node), checks the second bit of the key, and keeps recursing into its left subtrie. After reading the
next number for the current subtrie (1), SortedStore arrives at a leaf node by taking the left subtrie.
Until it reaches the leaf node, it takes a right subtrie only at the root node; from n = 3 at the root node,
SortedStore knows that the offset of the data for 10010 is (3 × key-value-size) on flash.
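As a concrete, self-contained check of Algorithms 1 and 2, the sketch below builds the representation for a small hypothetical set of sorted 5-bit keys, given as bit lists (these are not the exact keys of Figure 5), and verifies that every stored key maps back to its index:

```python
def construct(T):                        # Algorithm 1
    if len(T) <= 1:
        return [-1]
    L = [key[1:] for key in T if key[0] == 0]
    R = [key[1:] for key in T if key[0] == 1]
    return [len(L)] + construct(L) + construct(R)

def lookup(key, trepr):                  # Algorithm 2
    (thead, ttail) = (trepr[0], trepr[1:])
    if thead == -1:                      # leaf reached
        return 0
    if key[0] == 0:                      # recurse into the left subtrie
        return lookup(key[1:], ttail)
    ttail = discard_subtrie(ttail)       # skip the left subtrie
    return thead + lookup(key[1:], ttail)

def discard_subtrie(trepr):
    (thead, ttail) = (trepr[0], trepr[1:])
    if thead == -1:
        return ttail
    ttail = discard_subtrie(ttail)       # skip both subtries
    return discard_subtrie(ttail)

# Hypothetical sorted 5-bit keys, one bit per list element.
keys = [[0,0,0,1,0], [0,0,1,1,1], [0,1,0,0,1], [1,0,0,1,0],
        [1,0,1,0,0], [1,0,1,1,0], [1,1,0,1,1]]
rep = construct(keys)
for i, key in enumerate(keys):
    assert lookup(key, rep) == i         # each key maps to its sorted index
```

As noted above, looking up a non-stored key that shares a stored key's unique prefix returns the same index without signaling absence; the trie only guarantees correct indexes for stored keys.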
Compression Although the above uncompressed representation uses up to 3 integers per key on
average, for hashed keys, SortedStore can easily reduce the average representation size to 0.4 bytes/key
by compressing each |L| value using entropy coding (Figure 6 (c)). The value of |L| tends to be close to
half of |T | (the number of leaf nodes in T ) because fixed-length hashed keys are uniformly distributed
over the key space, so both subtries have the same probability of storing a key. More formally,
|L| ∼ Binomial(|T|, 1/2). When |L| is small enough (e.g., ≤ 16), SortedStore uses static, globally
shared Huffman tables based on the binomial distributions. If |L| is large, SortedStore encodes the
difference between |L| and its expected value (i.e., |T|/2) using Elias gamma coding [23] to avoid filling
the CPU cache with large Huffman tables. With this entropy coding optimized for hashed keys,
our entropy-coded trie representation is about twice as compact as the previous best recursive tree
encoding [16].
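For illustration, a minimal Elias gamma codec can be sketched as follows; this is not SILT's exact coder, which in addition must map the signed difference |L| − |T|/2 to a positive integer before encoding:

```python
# Illustrative Elias gamma codec: encode a positive integer n as
# (bit-length(n) - 1) zeros followed by n in binary.
def gamma_encode(n):
    assert n >= 1
    b = bin(n)[2:]                      # binary digits of n, without '0b'
    return "0" * (len(b) - 1) + b

def gamma_decode(bits, pos=0):
    z = 0
    while bits[pos] == "0":             # count the unary length prefix
        z += 1
        pos += 1
    n = int(bits[pos:pos + z + 1], 2)   # next z+1 bits hold n in binary
    return n, pos + z + 1               # value and position after the code
```

Small integers get short codes (e.g., 1 → "1", 5 → "00101"), so values near the expected |T|/2 compress well once the difference is mapped to a small integer.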
When handling compressed tries, Algorithms 1 and 2 are extended to keep track of the number of leaf
nodes at each recursion step. This does not require any additional information in the representation
because the number of leaf nodes can be calculated recursively using |T | = |L| + |R|. Based on
|T |, these algorithms choose an entropy coder for encoding len(L) and decoding thead. It is
noteworthy that the special mark (-1) takes no space with entropy coding, as its entropy is zero.
Ensuring Constant-Time Index Lookups As described, a lookup may have to decompress the entire
trie, so the cost of lookups would grow with the number of entries in the key-value store.
To bound the lookup time, items are partitioned into 2^k buckets based on the first k bits of their key.
Each bucket has its own trie index. Using, e.g., k = 10 for a key-value store holding 2^16 items, each
bucket would hold in expectation 2^(16−10) = 2^6 items. With high probability, no bucket holds more than
2^8 items, so the time to decompress the trie for a bucket is both bounded by a constant value, and small.
This bucketing requires additional information to be stored in memory: (1) the pointers to the trie
representations of each bucket and (2) the number of entries in each bucket. SILT keeps the amount of
this bucketing information small (less than 1 bit/key) by using a simpler version of a compact select
# @param key    lookup key
# @param trepr  trie representation
# @return       index of the key in the original array
def lookup(key, trepr):
    (thead, ttail) = (trepr[0], trepr[1:])
    if thead == -1:
        return 0
    if key[0] == 0:
        # Recurse into the left subtrie
        return lookup(key[1:], ttail)
    # Skip the left subtrie
    ttail = discard_subtrie(ttail)
    # Recurse into the right subtrie
    return thead + lookup(key[1:], ttail)

# @param trepr  trie representation
# @return       remaining trie representation with the next subtrie consumed
def discard_subtrie(trepr):
    (thead, ttail) = (trepr[0], trepr[1:])
    if thead == -1:
        return ttail
    # Skip both subtries
    ttail = discard_subtrie(ttail)
    ttail = discard_subtrie(ttail)
    return ttail

Algorithm 2: Key lookup on a trie representation.
data structure, semi-direct-16 [11]. With bucketing, our trie-based indexing belongs to the class of
data structures called monotone minimal perfect hashing [10, 13] (Section 7).
Further Space Optimizations for Small Key-Value Sizes For small key-value entries, SortedStore
can reduce the trie size by applying sparse indexing [22]. Sparse indexing locates the block that
contains an entry, rather than the exact offset of the entry. This technique requires scanning or binary
search within the block, but it reduces the amount of indexing information. It is particularly useful
when the storage media has a minimum useful block size to read; many flash devices, for instance,
provide increasing I/Os per second as the block size drops, but not past a certain limit (e.g., 512 or
4096 bytes) [31, 35]. SILT uses sparse indexing when configured for key-value sizes of 64 bytes or smaller.
SortedStore obtains a sparse-indexing version of the trie by pruning some subtries in it. When a trie
has subtries that have entries all in the same block, the trie can omit the representation of these subtries
because the omitted data only gives in-block offset information between entries. Pruning can reduce
the trie size to 1 bit per key or less if each block contains 16 key-value entries or more.
Merging HashStores into SortedStore SortedStore is immutable: once constructed, it cannot be changed.
Accordingly, the merge process generates a new SortedStore based on the given HashStores and the
existing SortedStore. Similar to the conversion from LogStore to HashStore, HashStores and the old
SortedStore can serve lookups during merging.
Table 3: Merge rule for SortedStore (the two columns give the action taken on KSS and on KHS).
KSS is the current key from SortedStore, and KHS is the current key from the sorted data of the
HashStores. “Deleted” means the current entry for KHS is a special entry indicating that a key
of SortedStore has been deleted.
The merge process consists of two steps: (1) sorting HashStores and (2) sequentially merging sorted
HashStores data and SortedStore. First, SILT sorts all data in HashStores to be merged. This task is
done by enumerating every entry in the HashStores and sorting these entries. Then, this sorted data
from HashStores is sequentially merged with already sorted data in the SortedStore. The sequential
merge chooses newest valid entries, as summarized in Table 3; either copy or drop action on a key
consumes the key (i.e., by advancing the “merge pointer” in the corresponding store), while the current
key remains available for the next comparison again if no action is applied to the key. After both
steps have been completed, the old SortedStore is atomically replaced by the new SortedStore. During
the merge process, both the old SortedStore and the new SortedStore exist on flash; however, the flash
space overhead from temporarily having two SortedStores is kept small by performing the merge
on only one SILT instance at a time.
In Section 5, we discuss how frequently HashStores should be merged into SortedStore.
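The per-key merge decision of Table 3 can be sketched as a two-pointer pass in which the HashStore side, being newer, wins ties; the tombstone marker and function name below are illustrative, not SILT's on-flash format:

```python
# Illustrative two-pointer merge: both inputs are (key, value) lists sorted
# by key; hash_entries come from the sorted HashStores and are newer.
# DELETED is a hypothetical tombstone marking a deleted SortedStore key.
DELETED = object()

def merge_stores(sorted_store, hash_entries):
    out, i, j = [], 0, 0
    while i < len(sorted_store) and j < len(hash_entries):
        ks, vs = sorted_store[i]
        kh, vh = hash_entries[j]
        if ks < kh:                      # only in old SortedStore: copy it
            out.append((ks, vs))
            i += 1
        elif ks > kh:                    # only in HashStores: copy unless deleted
            if vh is not DELETED:
                out.append((kh, vh))
            j += 1
        else:                            # same key: newer HashStore entry wins
            if vh is not DELETED:
                out.append((kh, vh))
            i += 1
            j += 1
    out.extend(sorted_store[i:])         # drain whichever input remains
    out.extend((k, v) for k, v in hash_entries[j:] if v is not DELETED)
    return out
```

For example, merging [("a",1),("b",2),("c",3)] with newer entries [("b",9),("c",DELETED),("d",4)] yields [("a",1),("b",9),("d",4)]: "b" is overwritten, "c" is dropped, and "d" is added.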
Application of False Positive Filters Since SILT maintains only one SortedStore per SILT instance,
SortedStore does not have to use a false positive filter to reduce unnecessary I/O. However, an
extension to the SILT architecture might have multiple SortedStores. In this case, the trie index can
easily accommodate the false positive filter; the filter is generated by extracting the key fragments
from the sorted keys. Key fragments can be stored in an in-memory array so that they have the same
order as the sorted data on flash. The extended SortedStore can consult the key fragments before
reading data from flash.
SILT can support an even wider range of applications and workloads than the basic design we have
described. In this section, we present potential techniques to extend SILT’s capabilities.
Crash Tolerance SILT ensures that all its in-memory data structures are backed-up to flash and/or easily re-created after failures. All updates to LogStore are appended to the on-flash log chronologically;
to recover from a fail-stop crash, SILT simply replays the log file to construct the in-memory index
and filter. For HashStore and SortedStore, which are static, SILT keeps a copy of their in-memory data
structures on flash, which can be re-read during recovery.
SILT’s current design, however, does not provide crash tolerance for new updates to the data store.
These writes are handled asynchronously, so a key insertion/update request to SILT may complete
before its data is written durably to flash. For applications that need this additional level of crash
tolerance, SILT would need to support an additional synchronous write mode. For example, SILT
could delay returning from write calls until it confirms that the requested write is fully flushed to the
on-flash log.
Figure 7: Read amplification as a function of flash space consumption when inlining is applied
to key-values whose sizes follow a Zipf distribution. “exp” is the exponent part of the distribution.
Variable-Length Key-Values For simplicity, the design we presented so far focuses on fixed-length
key-value data. In fact, SILT can easily support variable-length key-value data by using indirection with
inlining. This scheme follows the existing SILT design with fixed-sized slots, but stores (offset,
first part of (key, value)) pairs instead of the actual (key, value) in HashStores
and SortedStores (LogStores can handle variable-length data natively). These offsets point to the
remaining part of the key-value data stored elsewhere (e.g., a second flash device). If a whole item is
small enough to fit in a fixed-length slot, indirection can be avoided; consequently, large data requires
an extra flash read (or write), but small data incurs no additional I/O cost. Figure 7 plots an analytic
result on the tradeoff of this scheme with different slot sizes. It uses key-value pairs whose sizes
range between 4 B and 1 MiB and follow a Zipf distribution, assuming a 4-byte header (for key-value
lengths), a 4-byte offset pointer, and a uniform access pattern.
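The inline-versus-indirect decision in this scheme can be sketched as follows, using the assumed 4-byte header and 4-byte offset pointer from the analysis above (the function name is ours):

```python
# Decide whether a key-value item fits inline in a fixed-length slot or must
# spill its tail elsewhere via an offset pointer. All sizes are in bytes.
HEADER = 4     # key/value lengths (assumed 4-byte header)
POINTER = 4    # offset pointer to the spilled remainder

def place(item_size, slot_size):
    """Return (placement, bytes of item kept in the slot, extra flash reads)."""
    if HEADER + item_size <= slot_size:
        return ("inline", item_size, 0)        # whole item fits: no extra I/O
    inlined = slot_size - HEADER - POINTER     # first part of (key, value)
    return ("indirect", inlined, 1)            # one extra read for the rest
```

With a 64-byte slot, a 20-byte item is stored inline with no extra I/O, while a 1000-byte item keeps its first 56 bytes in the slot and costs one additional flash read.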
For specific applications, SILT can alternatively use segregated stores for further efficiency. Similar
to the idea of simple segregated storage [39], the system could instantiate several SILT instances
for different fixed key-value size combinations. The application may choose an instance with the
most appropriate key-value size as done in Dynamo [21], or SILT can choose the best instance for a
new key and return an opaque key containing the instance ID to the application. Since each instance
can optimize flash space overheads and additional flash reads for its own dataset, using segregated
stores can reduce the cost of supporting variable-length key-values close to the level of fixed-length
key-values. In the subsequent sections, we will discuss SILT with fixed-length key-value pairs only.
Fail-Graceful Indexing Under high memory pressure, SILT may temporarily operate in a degraded
indexing mode by allowing higher read amplification (e.g., more than 2 flash reads per lookup) to
avoid halting or crashing because of insufficient memory.
(1) Dropping in-memory indexes and filters. HashStore’s filters and SortedStore’s indexes are stored
on flash for crash tolerance, allowing SILT to drop them from memory. This option saves memory at
the cost of one additional flash read for the SortedStore, or two for the HashStore.
(2) Binary search on SortedStore. The SortedStore can be searched without an index, so the in-memory
trie can be dropped even without storing a copy on flash, at the cost of log(n) additional reads from flash.
These techniques also help speed SILT’s startup. By memory-mapping on-flash index files or performing
binary search, SILT can begin serving requests before it has loaded its indexes into memory in the background.
Compared to single key-value store approaches, the multi-store design of SILT has more system
parameters, such as the size of a single LogStore and HashStore, the total number of HashStores,
the frequency of merging data into SortedStore, and so on. Given this much larger design space, a
systematic way to select parameters is preferable.
This section provides a simple model of the tradeoffs between write amplification (WA), read amplification (RA), and memory overhead (MO) in SILT, with an eye towards being able to set the system
parameters properly to achieve the design goals from Section 2.
WA = (data written to flash) / (data written by application)
RA = (data read from flash) / (data read by application)
MO = (total memory consumed) / (number of items)
Model A SILT system has a flash drive of size F bytes with a lifetime of E erase cycles. The system
runs P SILT instances locally, each of which handles one disjoint range of keys using one LogStore,
one SortedStore, and multiple HashStores. Once an instance has d keys in total in its HashStores, it
merges these keys into its SortedStore.
We focus here on a workload where the total amount of data stored in the system remains constant
(e.g., only applying updates to existing keys). We omit for space the similar results when the data
volume is growing (e.g., new keys are inserted to the system) and additional nodes are being added to
provide capacity over time. Table 4 presents the notation used in the analysis.
Write Amplification An update first writes one record to the LogStore. Subsequently converting
that LogStore to a HashStore incurs 1/0.93 = 1.075 writes per key, because the space occupancy of
the hash table is 93%. Finally, d total entries (across multiple HashStores of one SILT instance) are
merged into the existing SortedStore, creating a new SortedStore with N/P entries. The total write
amplification is therefore
WA = 2.075 + N / (d · P).
Read Amplification The false positive rate of flash reads from a 4-way set associative hash table
using k-bit tags is f = 8/2^k because there are eight possible locations for a given key: two possible
buckets and four items per bucket.
This 4-way set associative cuckoo hash table with k-bit tags can store 2^(k+2) entries, so at 93% occupancy,
each LogStore and HashStore holds 0.93 · 2^(k+2) keys. In one SILT instance, the number of items stored
in HashStores ranges from 0 (after merging) to d, with an average of d/2, so the average number
of HashStores is

H = (d/2) / (0.93 · 2^(k+2)) = 0.134 · d / 2^k.
SILT design parameters
  d    maximum number of entries to merge     7.5 M
  k    tag size in bits                       15 bits
  P    number of SILT instances
  H    number of HashStores per instance
  f    false positive rate per store
Workload characteristics
  c    key-value entry size                   1024 B
  N    total number of distinct keys          100 M
  U    update rate                            5 K/sec
Storage constraints
  F    total flash size                       256 GB
  E    maximum flash erase cycles
Table 4: Notation.
In the worst-case of a lookup, the system reads once from flash at the SortedStore, after 1 + H failed
retrievals at the LogStore and H HashStores. Note that each LogStore or HashStore rejects all but an
f fraction of false positive retrievals; therefore, the expected total number of reads per lookup (read
amplification) is:
RA = (1 + H) · f + 1 = 8/2^k + 1.07 · d/4^k + 1.
By picking d and k to ensure 1.07 · d/4^k + 8/2^k < ε, SILT can achieve the design goal of read
amplification 1 + ε.
Memory Overhead Each entry in LogStore uses (k + 1)/8 + 4 bytes (k bits for the tag, one valid bit,
and 4 bytes for the pointer). Each HashStore filter entry uses k/8 bytes for the tag. Each SortedStore
entry consumes only 0.4 bytes. Using one LogStore, one SortedStore, and H HashStores, SILT’s
memory overhead is:
MO = (((k + 1)/8 + 4) · 2^(k+2) + (k/8) · 2^(k+2) · H + 0.4 · N/P) · P/N
   = ((16.5 + 0.5 k) · 2^k + 0.067 k d) · P/N + 0.4.
Tradeoffs Improving any one of write amplification, read amplification, and memory overhead comes
at the cost of one of the other two metrics. For example, using larger tags (i.e., increasing k) reduces
read amplification by reducing both f (the false positive rate per store) and H (the number of HashStores).
However, the HashStores then consume more DRAM due to the larger tags, increasing memory
overhead. Similarly, by increasing d, SILT can merge HashStores into the SortedStore less frequently
to reduce the write amplification, but doing so increases the amount of DRAM consumed by the
HashStore filters. Figure 8 illustrates how write and read amplification change as a function of memory
overhead when the maximum number of HashStore entries, d, is varied.
Update Rate vs. Flash Life Time The designer of a SILT instance handling U updates per second
wishes to ensure that the flash lasts for at least T seconds. Assuming the flash device has perfect
wear-leveling when being sent a series of large sequential writes [15], the total amount of data written
by the application, multiplied by the write amplification WA, must not exceed the flash device size times
its erase cycle budget. This creates a relationship between the lifetime, device size, update rate, and
memory overhead:

U · c · WA · T ≤ F · E.
Figure 8: WA and RA as a function of MO when N=100 M, P=4, and k=15, while d is varied.
Example Assume a SILT system is built with a 256 GB MLC flash drive supporting 10,000 erase
cycles [5] (E = 10000, F = 256 × 2^30). It is serving N = 100 million items with P = 4 SILT instances,
and d = 7.5 million. Its workload is 1 KiB entries, 5,000 updates per second (U = 5000).
By Eq. (4) the write amplification, WA, is 5.4. That is, each key-value update incurs 5.4 writes/entry.
On average the number of HashStores is 31 according to Eq. (5). The read amplification, however, is
very close to 1. Eq. (6) shows that when choosing 15 bits for the key fragment size, a GET incurs on
average 1.008 flash reads even when all stores must be consulted. Finally, we can see how the SILT
design achieves its design goal of memory efficiency: indexing a total of 102.4 GB of data, where
each key-value pair takes 1 KiB, requires only 73 MB in total or 0.73 bytes per entry (Eq. (7)). With
the write amplification of 5.4 from above, this device will last 3 years.
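These numbers follow directly from the model; a short script (our own, using the formulas of this section) reproduces them:

```python
# Plug the example parameters into the WA/RA/MO model of this section.
d, P, N, k = 7.5e6, 4, 100e6, 15
c, U, E, F = 1024, 5000, 10000, 256 * 2**30

WA = 2.075 + N / (d * P)                       # writes per update, ~5.4
H = 0.134 * d / 2**k                           # avg HashStores, ~31
f = 8 / 2**k                                   # false positive rate per store
RA = (1 + H) * f + 1                           # flash reads per GET, ~1.008
MO = ((16.5 + 0.5 * k) * 2**k + 0.067 * k * d) * P / N + 0.4   # bytes/key
total_index_mb = MO * N / 1e6                  # ~73 MB for 100 M items
lifetime_years = F * E / (U * c * WA) / (365.25 * 24 * 3600)   # ~3 years
```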
Using macro- and micro-benchmarks, we evaluate SILT’s overall performance and explore how its
system design and algorithms contribute to meeting its goals. We specifically examine (1) an end-to-end evaluation of SILT’s throughput, memory overhead, and latency; (2) the performance of SILT’s
in-memory indexing data structures in isolation; and (3) the individual performance of each data store
type, including flash I/O.
Implementation SILT is implemented in 15 K lines of C++ using a modular architecture similar to
Anvil [29]. Each component of the system exports the same, basic key-value interface. For example,
the classes which implement each of the three stores (LogStore, HashStore, and SortedStore) export
this interface but themselves call into classes which implement the in-memory and on-disk data
structures using that same interface. The SILT system, in turn, unifies the three stores and provides this
key-value API to applications. (SILT also has components for background conversion and merging.)
Evaluation System We evaluate SILT on Linux using a desktop equipped with:

CPU      Intel Core i7 860 @ 2.80 GHz (4 cores)
SSD-L    Crucial RealSSD C300 / 256 GB
SSD-S    Intel X25-E / 32 GB
The 256 GB SSD-L stores the key-value data, and the SSD-S is used as scratch space for sorting
HashStores using Nsort [33]. The drives connect using SATA and are formatted with the ext4 filesystem
Figure 9: GET throughput under high (upper) and low (lower) loads.
using the discard mount option (TRIM support) to enable the flash device to free blocks from
deleted files. The baseline performance of the data SSD is:
Random Reads (1024 B)    48 K reads/sec
Sequential Reads         256 MB/sec
Sequential Writes        233 MB/sec
Full System Benchmark
Workload Generation We use YCSB [17] to generate a key-value workload. By default, we use a
10% PUT / 90% GET workload for 20-byte keys and 1000-byte values, and we also use a 50% PUT /
50% GET workload for 64-byte key-value pairs in throughput and memory overhead benchmarks. To
avoid the high cost of the Java-based workload generator, we use a lightweight SILT client to replay
a captured trace file of queries made by YCSB. The experiments use four SILT instances (P = 4),
with 16 client threads concurrently issuing requests. When applicable, we limit the rate at which SILT
converts entries from LogStores to HashStores to 10 K entries/second, and from HashStores to the
SortedStore to 20 K entries/second in order to prevent these background operations from exhausting
I/O resources.
Throughput: SILT can sustain an average insert rate of 3,000 1 KiB key-value pairs per second, while
simultaneously supporting 33,000 queries/second, or 69% of the SSD’s random read capacity. With
no inserts, SILT supports 46 K queries per second (96% of the drive’s raw capacity), and with no
queries, can sustain an insert rate of approximately 23 K inserts per second. On a deduplication-like
workload with 50% writes and 50% reads of 64 byte records, SILT handles 72,000 requests/second.
SILT’s performance under insert workloads is limited by the time needed to convert and merge data
into HashStores and SortedStores. These background operations compete for flash I/O resources,
Figure 10: Index size changes for four different store combinations while inserting 50 M new entries.
resulting in a tradeoff between query latency and throughput. Figure 9 shows the sustainable query rate
under both high query load (approx. 33 K queries/second) and low query load (22.2 K queries/second)
for 1 KiB key-value pairs. SILT is capable of providing predictable, low latency, or can be tuned for
higher overall throughput. The middle line shows when SILT converts LogStores into HashStores
(periodically, in small bursts). The top line shows that at nearly all times, SILT is busy merging
HashStores into the SortedStore in order to optimize its index size.4 In Section 6.3, we evaluate in
more detail the speed of the individual stores and conversion processes.
Memory overhead: SILT meets its goal of providing high throughput with low memory overhead. We
measured the time and memory required to insert 50 million new 1 KiB entries into a table with 50
million existing entries, while simultaneously handling a high query rate. SILT used at most 69 MB
of DRAM, or 0.69 bytes per entry. (This workload is worst-case because it is never allowed time for
SILT to compact all of its HashStores into the SortedStore.) For the 50% PUT / 50% GET workload
with 64-byte key-value pairs, SILT required at most 60 MB of DRAM for 100 million entries, or 0.60
bytes per entry.
The drastic improvement in memory overhead from SILT’s three-store architecture is shown in
Figure 10. The figure shows the memory consumption during the insertion run over time, using four
different configurations of basic store types and 1 KiB key-value entries. The bottom right graph
shows the memory consumed using the full SILT system. The bottom left configuration omits the
intermediate HashStore, thus requiring twice as much memory as the full SILT configuration. The
upper right configuration instead omits the SortedStore, and consumes four times as much memory.
Finally, the upper left configuration uses only the basic LogStore, which requires nearly 10x as much
memory as SILT. To make this comparison fair, the test generates unique new items so that garbage
collection of old entries cannot help the SortedStore run faster.
The figures also help understand the modest cost of SILT’s memory efficiency. The LogStore-only
system processes the 50 million inserts (500 million total operations) in under 170 minutes, whereas
4 In both workloads, when merge operations complete (e.g., at 25 minutes), there is a momentary drop
in query speed. This is due to bursty TRIMming by the ext4 filesystem implementation (discard)
used in the experiment when the previous multi-gigabyte SortedStore file is deleted from flash.
Figure 11: GET query latency when served from different store locations.
the full SILT system takes only 40% longer (about 238 minutes) to incorporate the records, but achieves
an order of magnitude better memory efficiency.
Latency: SILT is fast, processing queries in 367 µs on average, as shown in Figure 11 for 100% GET
queries for 1 KiB key-value entries. GET responses are fastest when served by the LogStore (309 µs),
and slightly slower when they must be served by the SortedStore. The relatively small latency increase
when querying the later stores shows the effectiveness (reducing the number of extra flash reads to
ε < 0.01) and speed of SILT’s in-memory filters used in the Log and HashStores.
In the remaining sections, we evaluate the performance of SILT’s individual in-memory indexing
techniques, and the performance of the individual stores (in-memory indexes plus on-flash data structures).
Index Microbenchmark
The high random read speed of flash drives means that the CPU budget available for each index
operation is relatively limited. This microbenchmark demonstrates that SILT’s indexes meet their
design goal of computation-efficient indexing.
Experiment Design This experiment measures insertion and lookup speed of SILT’s in-memory
partial-key cuckoo hash and entropy-coded trie indexes. The benchmark inserts 126 million total
entries and looks up a subset of 10 million random 20-byte keys.
This microbenchmark involves memory only, no flash I/O. Although the SILT system uses multiple
CPU cores to access multiple indexes concurrently, access to individual indexes in this benchmark
is single-threaded. Note that inserting into the cuckoo hash table (LogStore) proceeds key-by-key,
whereas the trie (SortedStore) is constructed en masse using bulk insertion. Table 5 summarizes the
measurement results.
Individual Insertion Speed (Cuckoo Hashing) SILT’s cuckoo hash index implementation can
handle 10.18 M 20-byte key insertions (PUTs or DELETEs) per second. Even at a relatively small,
higher-overhead key-value entry size of 32 bytes (i.e., 12-byte data), the index would support 326 MB/s
of incoming key-value data on one CPU core. This rate exceeds the typical sequential write speed of a
single flash drive: inserting keys into our cuckoo hashing is unlikely to become a bottleneck in SILT
given current trends.

                  Individual insertion   Bulk insertion   Lookup
                  (K keys/s)             (K keys/s)       (K keys/s)
Cuckoo hashing    10,180                 -                1,840
Trie              -                      7,600            208

Table 5: In-memory performance of index data structures in SILT on a single CPU core.

Table 6: Construction performance (K keys/s) for basic stores: LogStore (by PUT), HashStore
(by CONVERT), SortedStore (by MERGE). The construction method is shown in parentheses.
Bulk Insertion Speed (Trie) Building the trie index over 126 million pre-sorted keys required
approximately 17 seconds, or 7.6 M keys/second.
Key Lookup Speed Each SILT GET operation requires a lookup in the LogStore and potentially in
one or more HashStores and the SortedStore. A single CPU core can perform 1.84 million cuckoo
hash lookups per second. If a SILT instance has 1 LogStore and 31 HashStores, each of which needs
to be consulted, then one core can handle about 57.5 K GETs/sec. Trie lookups are approximately 8.8
times slower than cuckoo hashing lookups, but a GET triggers a lookup in the trie only after SILT
cannot find the key in the LogStore and HashStores. When combined, the SILT indexes can handle
about 1/(1/57.5 K + 1/208 K) ≈ 45 K GETs/sec with one CPU core.
Insertions are faster than lookups in cuckoo hashing because insertions happen to only a few tables at
the same time and thus benefit from the CPU’s L2 cache; lookups, however, can occur to any table in
memory, making CPU cache less effective.
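The combined GET rate quoted above is the harmonic combination of the per-index rates; a quick check using the measured per-core numbers:

```python
# Combine per-core index lookup rates (in K keys/s) for a GET that scans
# 1 LogStore + 31 HashStores (cuckoo lookups) and then one trie lookup.
cuckoo = 1840.0                    # cuckoo hash lookups per core (K keys/s)
trie = 208.0                       # trie lookups per core (K keys/s)
stores = 32                        # 1 LogStore + 31 HashStores consulted
per_get_cuckoo = cuckoo / stores   # ~57.5 K GETs/s if all stores are checked
combined = 1 / (1 / per_get_cuckoo + 1 / trie)   # ~45 K GETs/s per core
```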
Operation on Multiple Cores Using four cores, SILT indexes handle 180 K GETs/sec in memory.
At this speed, the indexes are unlikely to become a bottleneck: their overhead is on-par or lower
than the operating system overhead for actually performing that many 1024-byte reads per second
from flash. As we see in the next section, SILT’s overall performance is limited by sorting, but its
index CPU use is high enough that adding many more flash drives would require more CPU cores.
Fortunately, SILT offers many opportunities for parallel execution: Each SILT node runs multiple,
completely independent instances of SILT to handle partitioning, and each of these instances can query
many stores.
Individual Store Microbenchmark
Here we measure the performance of each SILT store type in its entirety (in-memory indexing plus
on-flash I/O). The first experiment builds multiple instances of each basic store type with 100 M
key-value pairs (20-byte key, 1000-byte value). The second experiment queries each store for 10 M
random keys.
Table 7: Query performance for basic stores, including both in-memory and on-flash data structures;
columns report GET (hit) and GET (miss) rates in K ops/s.
Table 6 shows the construction performance for all three stores; the construction method is shown in
parentheses. LogStore construction, built through entry-by-entry insertion using PUT, can use 90%
of sequential write bandwidth of the flash drive. Thus, SILT is well-suited to handle bursty inserts.
The conversion from LogStores to HashStores is about three times slower than LogStore construction
because it involves bulk data reads and writes from/to the same flash drive. SortedStore construction is
slowest, as it involves an external sort for the entire group of 31 HashStores to make one SortedStore
(assuming no previous SortedStore). If constructing the SortedStore involved merging the new data
with an existing SortedStore, the performance would be worse. The large time required to create a
SortedStore was one of the motivations for introducing HashStores rather than keeping un-merged
data in LogStores.
Table 7 shows that the minimum GET performance across all three stores is 44.93 K ops/s. Note that
LogStores and HashStores are particularly fast at GET for non-existent keys (more than 7 M ops/s).
This extremely low miss penalty explains why there was only a small variance in the average GET
latency in Figure 11, where the worst cases looked up 32 LogStores and HashStores and failed to find a matching item in any of them.
Hashing Cuckoo hashing [34] is an open-addressing scheme to resolve hash collisions efficiently
with high space occupancy. Our partial-key cuckoo hashing—storing only a small part of the key in
memory without fetching the entire keys from slow storage on collisions—makes cuckoo hashing
more memory-efficient while ensuring high performance.
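A minimal sketch of the partial-key idea follows (table size, hash construction, the kick limit, and the `lookup_candidates` API are illustrative; SILT's slots additionally carry the entry's flash offset). Each in-memory slot records only the entry's alternate bucket index as its "partial key," so displacement during insertion never needs the full key from flash, and lookups return candidates that are verified against the full key stored on flash:

```python
import hashlib

def _h(key: bytes, seed: int, n: int) -> int:
    return int.from_bytes(hashlib.sha256(bytes([seed]) + key).digest()[:8], "big") % n

class PartialKeyCuckoo:
    def __init__(self, n=128):
        self.n = n
        self.slots = [None] * n  # each slot: (tag = alternate bucket index, value)

    def _buckets(self, key):
        return _h(key, 1, self.n), _h(key, 2, self.n)

    def insert(self, key, value, max_kicks=32):
        b1, b2 = self._buckets(key)
        bucket, entry = b1, (b2, value)
        for _ in range(max_kicks):
            if self.slots[bucket] is None:
                self.slots[bucket] = entry
                return True
            # Displace the occupant to its alternate bucket, which is exactly
            # the tag it stored -- no full key (and no flash read) needed.
            victim = self.slots[bucket]
            self.slots[bucket] = entry
            bucket, entry = victim[0], (bucket, victim[1])
        return False  # table too full

    def lookup_candidates(self, key):
        """Candidate values; the caller verifies the full key on flash."""
        b1, b2 = self._buckets(key)
        return [e[1] for b, alt in ((b1, b2), (b2, b1))
                for e in [self.slots[b]] if e is not None and e[0] == alt]

t = PartialKeyCuckoo()
assert t.insert(b"alice", 1) and t.insert(b"bob", 2)
assert 1 in t.lookup_candidates(b"alice")
```

Because an entry only ever resides in one of its two buckets, its stored tag always names the other, which is what makes key-free displacement sound.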
Minimal perfect hashing is a family of collision-free hash functions that map n distinct keys to n
consecutive integers 0 . . . n − 1, and is widely used for memory-efficient indexing. In theory, any
minimal perfect hash scheme requires at least 1.44 bits/key [27]; in practice, the state-of-the-art
schemes can index any static data set with 2.07 bits/key [10]. Our entropy-coded trie achieves 3.1
bits/key, but it also preserves the lexicographical order of the keys to facilitate data merging. Thus, it
belongs to the family of monotone minimal perfect hashing (MMPH). Compared to other proposals for
MMPH [8, 9], our trie-based index is simple, lightweight to generate, and has very small CPU/memory overhead.
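To make the monotone property concrete, here is a toy order-preserving index in the spirit of the trie (without the entropy coding of per-node counts that gives SILT its 3.1 bits/key; the names and the nested-tuple representation are illustrative). Each internal node records how many keys lie in its 0-bit subtree, and a key's rank is accumulated top-down; as with any minimal perfect hash, queries are only meaningful for keys in the indexed set:

```python
def bit(key: bytes, i: int) -> int:
    return (key[i // 8] >> (7 - i % 8)) & 1

def build(keys, depth=0):
    """keys: sorted, distinct, equal-length byte strings. Returns a nested
    (left_count, left_subtrie, right_subtrie) tuple; a leaf is None."""
    if len(keys) <= 1:
        return None
    split = sum(1 for k in keys if bit(k, depth) == 0)
    return (split, build(keys[:split], depth + 1), build(keys[split:], depth + 1))

def index_of(trie, key, depth=0, base=0):
    """Lexicographic rank of `key` among the indexed keys (key must be in the set)."""
    if trie is None:
        return base
    split, left, right = trie
    if bit(key, depth) == 0:
        return index_of(left, key, depth + 1, base)
    return index_of(right, key, depth + 1, base + split)

keys = sorted([b"ab", b"ad", b"ba", b"bb", b"ca", b"cz"])
trie = build(keys)
assert [index_of(trie, k) for k in keys] == list(range(len(keys)))
```

Because ranks equal sorted positions, two such indexes can be merged by a single sequential pass over their key ranges, which is the property SILT exploits when merging data.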
External-Memory Index on Flash Recent work such as MicroHash [40] and FlashDB [32] minimizes memory consumption by having indexes on flash. MicroHash uses a hash table chained
by pointers on flash. FlashDB proposes a self-tuning B+ -tree index that dynamically adapts the
node representation according to the workload. Both systems are optimized for memory and energy
consumption of sensor devices, but not for latency as lookups in both systems require reading multiple flash pages. In contrast, SILT achieves very low memory footprint while still supporting high
Key-Value Stores HashCache [4] proposes several policies to combine hash table-based in-memory
indexes and on-disk data layout for caching web objects. FAWN-DS [2] consists of an on-flash
data log and in-memory hash table index built using relatively slow CPUs with a limited amount of
memory. SILT dramatically reduces DRAM consumption compared to these systems by combining
more memory-efficient data stores with minimal performance impact. FlashStore [19] also uses a
single hash table to index all keys on flash similar to FAWN-DS. The flash storage, however, is used as
a cache of a hard disk-backed database. Thus, the cache hierarchy and eviction algorithm is orthogonal
to SILT. To achieve low memory footprint (about 1 byte/key), SkimpyStash [20] moves its indexing
hash table to flash with linear chaining. However, it requires on average 5 flash reads per lookup, while
SILT only needs 1 + ε per lookup.
More closely related to our design is BufferHash [1], which keeps keys in multiple equal-sized hash
tables—one in memory and the others on flash. The on-flash tables are guarded by in-memory Bloom
filters to reduce unnecessary flash reads. In contrast, SILT data stores have different sizes and types.
The largest store (SortedStore), for example, does not have a filter and is accessed at most once per
lookup, which saves memory while keeping the read amplification low. In addition, writes in SILT
are appended to a log stored on flash for crash recovery, whereas inserted keys in BufferHash do not
persist until flushed to flash in batch.
Several key-value storage libraries rely on caching to compensate for their high read amplifications [6,
28], making query performance depend greatly on whether the working set fits in the in-memory cache.
In contrast, SILT provides uniform and predictably high performance regardless of the working set
size and query patterns.
Distributed Key-Value Systems Distributed key-value storage clusters such as BigTable [14], Dynamo [21], and FAWN-KV [2] all try to achieve high scalability and availability using a cluster of
key-value store nodes. SILT focuses on how to use flash memory-efficiently with novel data structures,
and is complementary to the techniques used in these other systems aimed at managing failover and replication.
Modular Storage Systems BigTable [14] and Anvil [29] both provide a modular architecture for
chaining specialized stores to benefit from combining different optimizations. SILT borrows its design
philosophy from these systems; we believe and hope that the techniques we developed for SILT could
also be used within these frameworks.
SILT combines new algorithmic and systems techniques to balance the use of memory, storage, and computation to craft a memory-efficient, high-performance flash-based key-value store. It uses two new in-memory index structures—partial-key cuckoo hashing and entropy-coded tries—to drastically reduce the amount of memory needed compared to prior systems. SILT chains the right
combination of basic key-value stores together to create a system that provides high write speed, high
read throughput, and uses little memory, attributes that no single store can achieve alone. SILT uses in
total only 0.7 bytes of memory per entry it stores, and makes only 1.01 flash reads to service a lookup,
doing so in under 400 microseconds. Our hope is that SILT, and the techniques described herein, can
form an efficient building block for a new generation of fast data-intensive services.
This work was supported by funding from National Science Foundation award CCF-0964474, Google,
the Intel Science and Technology Center for Cloud Computing, and by CyLab at Carnegie Mellon under
grant DAAD19-02-1-0389 from the Army Research Office. Hyeontaek Lim is supported in part by the
Korea Foundation for Advanced Studies. We thank the SOSP reviewers, Phillip B. Gibbons, Vijay
Vasudevan, and Amar Phanishayee for their feedback, Guy Blelloch and Rasmus Pagh for pointing
out several algorithmic possibilities, and Robert Morris for shepherding this paper.
[1] A. Anand, C. Muthukrishnan, S. Kappes, A. Akella, and S. Nath. Cheap and large CAMs for high
performance data-intensive networked systems. In NSDI’10: Proceedings of the 7th USENIX
conference on Networked systems design and implementation, pages 29–29, Berkeley, CA, USA,
2010. USENIX Association.
[2] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN:
A fast array of wimpy nodes. In Proc. SOSP, Big Sky, MT, Oct. 2009.
[3] A. Andersson and S. Nilsson. Improved behaviour of tries by adaptive branching. Information
Processing Letters, 46(6):295–300, 1993.
[4] A. Badam, K. Park, V. S. Pai, and L. L. Peterson. HashCache: Cache storage for the next billion.
In Proc. 6th USENIX NSDI, Boston, MA, Apr. 2009.
[5] M. Balakrishnan, A. Kadav, V. Prabhakaran, and D. Malkhi. Differential RAID: Rethinking
RAID for SSD reliability. In Proc. European Conference on Computer Systems (Eurosys), Paris,
France, 2010.
[6] Berkeley DB, 2011.
[7] D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel. Finding a needle in Haystack: Facebook’s
photo storage. In Proc. 9th USENIX OSDI, Vancouver, Canada, Oct. 2010.
[8] D. Belazzougui, P. Boldi, R. Pagh, and S. Vigna. Monotone minimal perfect hashing: searching a
sorted table with O(1) accesses. In Proceedings of the twentieth Annual ACM-SIAM Symposium
on Discrete Algorithms, SODA ’09, pages 785–794, 2009.
[9] D. Belazzougui, P. Boldi, R. Pagh, and S. Vigna. Theory and practise of monotone minimal
perfect hashing. In Proc. 11th Workshop on Algorithm Engineering and Experiments, ALENEX
’09, 2009.
[10] D. Belazzougui, F. Botelho, and M. Dietzfelbinger. Hash, displace, and compress. In Proceedings
of the 17th European Symposium on Algorithms, ESA ’09, pages 682–693, 2009.
[11] D. K. Blandford, G. E. Blelloch, and I. A. Kash. An experimental analysis of a compact graph
representation. In Proc. 6th Workshop on Algorithm Engineering and Experiments, ALENEX
’04, 2004.
[12] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of
the ACM, 13(7):422–426, 1970.
[13] F. C. Botelho, A. Lacerda, G. V. Menezes, and N. Ziviani. Minimal perfect hashing: A competitive
method for indexing internal memory. Information Sciences, 181:2608–2625, 2011.
[14] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes,
and R. E. Gruber. Bigtable: A distributed storage system for structured data. In Proc. 7th
USENIX OSDI, Seattle, WA, Nov. 2006.
[15] L.-P. Chang. On efficient wear leveling for large-scale flash-memory storage systems. In
Proceedings of the 2007 ACM symposium on Applied computing (SAC ’07), Mar. 2007.
[16] D. R. Clark. Compact PAT trees. PhD thesis, University of Waterloo, Waterloo, Ontario, Canada.
[17] B. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving
systems with YCSB. In Proc. 1st ACM Symposium on Cloud Computing (SOCC), Indianapolis,
IN, June 2010.
[18] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage
with CFS. In Proc. 18th ACM Symposium on Operating Systems Principles (SOSP), Banff,
Canada, Oct. 2001.
[19] B. Debnath, S. Sengupta, and J. Li. FlashStore: High throughput persistent key-value store. Proc.
VLDB Endowment, 3:1414–1425, September 2010.
[20] B. Debnath, S. Sengupta, and J. Li. SkimpyStash: RAM space skimpy key-value store on
flash-based storage. In Proc. International Conference on Management of Data, ACM SIGMOD
’11, pages 25–36, New York, NY, USA, 2011.
[21] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly available key-value store.
In Proc. 21st ACM Symposium on Operating Systems Principles (SOSP), Stevenson, WA, Oct. 2007.
[22] J. Dong and R. Hull. Applying approximate order dependency to reduce indexing space. In Proc.
ACM SIGMOD International Conference on Management of Data, SIGMOD '82, pages 119–127, 1982.
[23] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on
Information Theory, 21(2):194–203, Mar. 1975.
[24] Ú. Erlingsson, M. Manasse, and F. Mcsherry. A cool and practical alternative to traditional hash
tables. In Proc. 7th Workshop on Distributed Data and Structures (WDAS’06), CA, USA, 2006.
[25] Facebook, 2011.
[26] Flickr, 2011.
[27] E. A. Fox, L. S. Heath, Q. F. Chen, and A. M. Daoud. Practical minimal perfect hash functions
for large databases. Communications of the ACM, 35:105–121, Jan. 1992.
[28] S. Ghemawat and J. Dean. LevelDB, 2011.
[29] M. Mammarella, S. Hovsepian, and E. Kohler. Modular data storage with Anvil. In Proc. SOSP,
Big Sky, MT, Oct. 2009.
[30] Memcached: A distributed memory object caching system, 2011.
[31] S. Nath and P. B. Gibbons. Online maintenance of very large random samples on flash storage.
In Proc. VLDB, Auckland, New Zealand, Aug. 2008.
[32] S. Nath and A. Kansal. FlashDB: Dynamic self-tuning database for NAND flash. In Proc.
ACM/IEEE International Conference on Information Processing in Sensor Networks, Cambridge,
MA, Apr. 2007.
[33] C. Nyberg and C. Koester. Ordinal Technology - NSort.
[34] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122–144, May 2004.
[35] M. Polte, J. Simsa, and G. Gibson. Enabling enterprise solid state disks performance. In Proc.
Workshop on Integrating Solid-state Memory into the Storage Hierarchy, Washington, DC, Mar. 2009.
[36] S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In Proc. USENIX
Conference on File and Storage Technologies (FAST), pages 89–101, Monterey, CA, Jan. 2002.
[37] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable
peer-to-peer lookup service for Internet applications. In Proc. ACM SIGCOMM, San Diego, CA,
Aug. 2001.
[38] Twitter., 2011.
[39] P. Wilson, M. Johnstone, M. Neely, and D. Boles. Dynamic storage allocation: A survey and
critical review. Lecture Notes in Computer Science, 1995.
[40] D. Zeinalipour-Yazti, S. Lin, V. Kalogeraki, D. Gunopulos, and W. A. Najjar. MicroHash: An
efficient index structure for flash-based sensor devices. In Proc. 4th USENIX Conference on File
and Storage Technologies, San Francisco, CA, Dec. 2005.
Scalable Consistency in Scatter
Lisa Glendenning
Arvind Krishnamurthy
Ivan Beschastnikh
Thomas Anderson
Department of Computer Science & Engineering
University of Washington
Abstract
Distributed storage systems often trade off strong semantics for improved scalability. This
paper describes the design, implementation, and evaluation of Scatter, a scalable and consistent distributed key-value storage system. Scatter adopts the highly decentralized and
self-organizing structure of scalable peer-to-peer systems, while preserving linearizable consistency even under adverse circumstances. Our prototype implementation demonstrates
that even with very short node lifetimes, it is possible to build a scalable and consistent
system with practical performance.
Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software—Distributed systems
General Terms
Design, Reliability
Keywords
Distributed systems, consistency, scalability, fault tolerance, storage, distributed transactions, Paxos
A long-standing and recurrent theme in distributed systems research is the design and
implementation of efficient and fault tolerant storage systems with predictable and wellunderstood consistency properties. Recent efforts in peer-to-peer (P2P) storage services
include Chord [36], CAN [26], Pastry [30], OpenDHT [29], OceanStore [16], and Kademlia [22]. Recent industrial efforts to provide a distributed storage abstraction across data
centers include Amazon’s Dynamo [10], Yahoo!’s PNUTS [8], and Google’s Megastore [1]
and Spanner [9] projects. Particularly with geographic distribution, whether due to using multiple data centers or a P2P resource model, the tradeoffs between efficiency and consistency are non-trivial, leading to systems that are complex to implement, complex to use, and sometimes both.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP '11, October 23-26, 2011, Cascais, Portugal.
Copyright © 2011 ACM 978-1-4503-0977-6/11/10 ... $10.00.
Our interest is in building a storage layer for a very large scale P2P system we are designing
for hosting planetary scale social networking applications. Purchasing, installing, powering
up, and maintaining a very large scale set of nodes across many geographically distributed
data centers is an expensive proposition; it is only feasible on an ongoing basis for those
applications that can generate revenue. In much the same way that Linux offers a free
alternative to commercial operating systems for researchers and developers interested in
tinkering, we ask: what is the Linux analogue with respect to cloud computing?
P2P systems provide an attractive alternative, but first generation storage layers were
based on unrealistic assumptions about P2P client behavior in the wild. In practice,
participating nodes have widely varying capacity and network bandwidth, connections are
flaky and asymmetric rather than well-provisioned, workload hotspots are common, and
churn rates are very high [27, 12]. This led to a choice for application developers: weakly
consistent but scalable P2P systems like Kademlia and OpenDHT, or strongly consistent
data center storage.
Our P2P storage layer, called Scatter, attempts to bridge this gap – to provide an open-source, free, yet robust alternative to data center computing, using only P2P resources.
Scatter provides scalable and consistent distributed hash table key-value storage. Scatter
is robust to P2P churn, heterogeneous node capacities, and flaky and irregular network
behavior. (We have left robustness to malicious behavior, such as against DDoS attacks
and Byzantine faults, to future work.) In keeping with our goal of building an open
system, an essential requirement for Scatter is that there be no central point of control for
commercial interests to exploit.
The base component of Scatter is a small, self-organizing group of nodes, each managing
a range of keys, akin to a BigTable [6] tablet. A set of groups together partition the table
space to provide the distributed hash table abstraction. Each group is responsible for
providing consistent read/write access to its key range, and for reconfiguring as necessary
to meet performance and availability goals. As nodes are added, as nodes fail, or as the
workload changes for a region of keys, individual groups must merge with neighboring
groups, split into multiple groups, or shift responsibility over parts of the key space to
neighboring groups, all while maintaining consistency. A lookup overlay topology connects
the Scatter groups in a ring, and groups execute distributed transactions in a decentralized
fashion to modify the topology consistently and atomically.
A key insight in the design of Scatter is that the consistent group abstraction provides
a stable base on which to layer the optimizations needed to maintain overall system performance and availability goals. While existing popular DHTs have difficulty maintaining
consistent routing state and consistent name space partitioning in the presence of high
churn, these properties are a direct consequence of Scatter’s design. Further, Scatter can
locally adjust the amount of replication, or mask a low capacity node, or merge/split
groups if a particular Scatter group has an unusual number of weak/strong nodes, all
without compromising the structural integrity of the distributed table.
Of course, some applications may tolerate weaker consistency models for application data
storage [10], while other applications have stronger consistency requirements [1]. Scatter
is designed to support a variety of consistency models for application key storage. Our
current implementation provides linearizable storage within a given key; we support cross-group transactions for consistent updates to meta-data during group reconfiguration, but we do not attempt to linearize multi-key application transactions. These steps are left for future work; however, we believe that the Scatter group abstraction will make them straightforward to implement.
Figure 1: (a) Key Assignment Violation; (b) Routing Violation. Two examples demonstrating how (a) key assignment consistency and (b) routing integrity may be violated in a traditional DHT. Bold lines indicate key assignment and are associated with nodes. Dotted lines indicate successor pointers. Both scenarios arise when nodes join and leave concurrently, as pictured in (a1) and (b1). The violation in (a2) may result in clients observing inconsistent key values, while (b2) jeopardizes overlay connectivity.
We evaluate our system in a variety of configurations, for both micro-benchmarks and for
a Twitter-style application. Compared to OpenDHT, a publicly accessible open-source
DHT providing distributed storage, Scatter provides equivalent performance with much
better availability, consistency, and adaptability. We show that we can provide practical
distributed storage even in very challenging environments. For example, if average node
lifetimes are as short as three minutes, therefore triggering very frequent reconfigurations
to maintain data durability, Scatter is able to maintain overall consistency and data availability, serving its reads in an average of 1.3 seconds in a typical wide area setting.
Scatter’s design synthesizes techniques from both highly scalable systems with weak guarantees and strictly consistent systems with limited scalability, to provide the best of both
worlds. This section overviews the two families of distributed systems whose techniques
we leverage in building Scatter.
Distributed Hash Tables (DHTs): DHTs are a class of highly distributed storage
systems providing scalable, key based lookup of objects in dynamic network environments.
As a distributed systems building primitive, DHTs have proven remarkably versatile, with
application developers having leveraged scalable lookup to support a variety of distributed
applications. They are actively used in the wild as the infrastructure for peer-to-peer
systems on the order of millions of users.
In a traditional DHT, both application data and node IDs are hashed to a key, and data
is stored at the node whose hash value immediately precedes (or follows) the key. In many
DHTs, the node storing the key’s value replicates the data to its neighbors for better
reliability and availability [30]. Even so, many DHTs suffer inconsistencies in certain
failure cases, both in how keys are assigned to nodes, and in how requests are routed to
keys, yielding inconsistent results or reduced levels of availability. These issues are not
new [12, 4]; we review them here to provide context for our work.
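The traditional assignment rule described above can be sketched as follows (the hash width and the choice of successor rather than predecessor are illustrative conventions):

```python
import bisect, hashlib

# Node IDs and data keys are hashed onto the same circular space; a key is
# stored at the node whose ID is the key's clockwise successor on the ring.

def ring_hash(x: bytes) -> int:
    return int.from_bytes(hashlib.sha1(x).digest(), "big")

def successor_node(node_ids, key: bytes):
    """node_ids: sorted hash positions of live nodes. Returns the position of
    the node owning `key`: its clockwise successor (with wraparound)."""
    h = ring_hash(key)
    i = bisect.bisect_left(node_ids, h)
    return node_ids[i % len(node_ids)]

nodes = sorted(ring_hash(b"node-%d" % i) for i in range(4))
assert successor_node(nodes, b"some-key") in nodes
```

The violations discussed next arise precisely because, under churn, different nodes can disagree on the contents of `node_ids`.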
Assignment Violation: A fundamental DHT correctness property is for each key to be
managed by at most one node. We refer to this property as assignment consistency. This
property is violated when multiple nodes claim ownership over the same key. In Figure 1a,
a section of a DHT ring is managed by three nodes, identified by their key values A, B,
and C. A new node D joins at a key between A and B and takes over the key-range (A, D].
However, before B can let C know of this change in the key-range assignment, B fails.
Node C detects the failure and takes over the key-range (A, B] maintained by B. This
key-range, however, includes keys maintained by D. As a result, clients accessing keys in
(A, D] may observe inconsistent key values depending on whether they are routed to node
C or D.
Routing Violation: Another basic correctness property stipulates that the system maintains
consistent routing entries at nodes so that the system can route lookup requests to the
appropriate node. In fact, the correctness of certain links is essential for the overlay
to remain connected. For example, the Chord DHT relies on the consistency of node
successor pointers (routing table entries that reference the next node in the key-space) to
maintain DHT connectivity [35]. Figure 1b illustrates how a routing violation may occur
when node joins and leaves are not handled atomically. In the figure, node D joins at a
key between B and C, and B fails immediately after. Node D has a successor pointer
correctly set to C; however, A is not aware of D and incorrectly believes that C is its
successor (when a successor fails, a node uses its locally available information to set its
successor pointer to the failed node’s successor). In this scenario, messages routed through
A to keys maintained by D will skip over node D and will be incorrectly forwarded to
node C. A more complex routing algorithm that allows for backtracking may avoid this
scenario, but such tweaks come at the risk of routing loops [35]. More generally, such
routing inconsistencies jeopardize connectivity and may lead to system partitions.
Both violations occur for keys in DHTs, e.g., one study of OpenDHT found that on average
5% of the keys are owned by multiple nodes simultaneously even in settings with low
churn [31]. The two examples given above illustrate how such a scenario may occur in
the context of a Chord-like system, but these issues are known to affect all types of self-organizing systems in deployment [12].
Needless to say, inconsistent naming and routing can make it challenging for developers
to understand and use a DHT. Inconsistent naming and routing also complicates system
performance. For example, if a particular key becomes a hotspot, we may wish to shift the
load from nearby keys to other nodes, and potentially to shift responsibility for managing
the key to a well-provisioned node. In a traditional DHT, however, doing so would increase
the likelihood of naming and routing inconsistencies. Similarly, if a popular key happens
to land on a node that is likely to exit the system shortly (e.g., because it only recently
joined), we can improve overall system availability by changing the key’s assignment to a
better provisioned, more stable node, but only if we can make assignment changes reliably
and consistently.
One approach to addressing these anomalies is to broadcast all node join and leave events
to all nodes in the system, as in Dynamo. This way, every node has an eventually consistent view of its key-range, at some scalability cost. Since key storage in Dynamo is only
eventually consistent, applications must already be written to tolerate temporary inconsistency. Further, since all nodes in the DHT know the complete set of nodes participating
in the DHT, routing is simplified.
Coordination Services: In enterprise settings, applications desiring strong consistency
and high availability use coordination services such as Chubby [2] or ZooKeeper [14]. These
services use rigorous distributed algorithms with provable properties to implement strong
consistency semantics even in the face of failures. For instance, ZooKeeper relies on an
atomic broadcast protocol, while Chubby uses the Paxos distributed consensus algorithm
[18] for fault-tolerant replication and agreement on the order of operations.
Coordination services are, however, scale-limited as every update to a replicated data object requires communication with some quorum of all nodes participating in the service;
therefore the performance of replication protocols rapidly degrades as the number of participating nodes increases (see Figure 9(a) and [14]). Scatter is designed with the following
insight: what if we had many instances of a coordination service, cooperatively managing
a large scale storage system?
We now describe the design of Scatter, a scalable consistent storage layer designed to
support very large scale peer-to-peer systems. We discuss our goals and assumptions,
provide an overview of the structure of Scatter, and then discuss the technical challenges
in building Scatter.
Goals and Assumptions
Scatter has three primary goals:
1. Consistency: Scatter provides linearizable consistency semantics for operations on a
single key/value pair, despite (1) lossy and variable-latency network connections, (2)
dynamic system membership including uncontrolled departures, and (3) transient,
asymmetric communication faults.
2. Scalability: Scatter is designed to scale to the largest deployed DHT systems with
more than a million heterogeneous nodes with diverse churn rates, computational
capacities, and network bandwidths.
3. Adaptability: Scatter is designed to be self-optimizing to a variety of dynamic
operating conditions. For example, Scatter reconfigures itself as nodes come and go
to preserve the desired balance between high availability and high performance. It
can also be tuned to optimize for both WAN and LAN environments.
Our design is limited in the kinds of failures it can handle. Specifically, we are not robust to
malicious behavior, such as Byzantine faults and denial of service attacks, nor do we provide
a mechanism for continued operation during pervasive network outages or correlated and
widespread node outages. We leave adding these features to future work.
Design Overview
While existing systems partially satisfy some of our requirements outlined in the preceding
paragraphs, none exhibit all three. Therefore, we set out to design a new system, Scatter,
that synthesizes techniques from a spectrum of distributed storage systems.
The first technique we employ to achieve our goals is to use self-managing sets of nodes,
which we term groups, rather than individual nodes as building blocks for the system.
Groups internally use a suite of replicated state machine (RSM) mechanisms [33] based
on the Paxos consensus algorithm [18] as a basis for consistency and fault-tolerance. Scatter also implements many standard extensions and optimizations [5] to the basic Paxos
algorithm, including: (a) an elected leader to initiate actions on behalf of the group as a
whole, and (b) reconfiguration algorithms [19] to both exclude failed members and include
new members over time.
Figure 2: Overview of Scatter architecture
As groups maintain internal integrity using consensus protocols with provable properties,
a simple and aggressive failure detector suffices. Nodes that are excluded from a group
after being detected as failed cannot influence any future actions of the group. On the
other hand, the failure to quickly detect a failed node will not impede the liveness of the
group because only a quorum of the current members are needed to make progress.
Scatter implements a simple DHT model in which a circular key-space is partitioned among
groups (see Figure 2). Each group maintains up-to-date knowledge of the two neighboring
groups that immediately precede and follow it in the key-space. These consistent lookup
links form a global ring topology, on top of which Scatter layers a best-effort routing policy
based on cached hints. If this soft routing state is stale or incomplete, then Scatter relies
on the underlying consistent ring topology as ground truth.
Carefully engineered groups go a long way to meeting our stated design goals for Scatter.
However, a system composed of some static set of groups will be inherently limited in
many ways. For example, if there is a burst of failures or sufficient disparity between
the rate of leaves and joins for a particular group, then that group is at risk of losing a
functional quorum. Not only is a static set of groups limited in robustness, but it is also
restricted in both scalability and the ability to adapt gracefully to dynamic conditions. For
instance, the performance of consensus algorithms degrades significantly as the number of
participants increases. Therefore, a static set of groups will not be able to incrementally
scale with the online addition of resources. As another example, if one group is responsible
for a hotspot in the key-space, it needs some way of coordinating with other groups, which
may be underutilized, to alleviate the hotspot.
Therefore, we provide mechanisms to support the following multi-group operations:
split: partition the state of an existing group into two groups.
merge: create a new group from the union of the state of two neighboring groups.
migrate: move members from one group to a different group.
repartition: change the key-space partitioning between two adjacent groups.
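The range bookkeeping behind these operations can be sketched as follows (the `GroupRange` representation and function names are illustrative; the real operations also move replicated key-value state and routing links transactionally):

```python
# Each group owns a half-open key range [lo, hi); split/merge/repartition must
# keep the set of ranges a partition of the key space, which the asserts check.

class GroupRange:
    def __init__(self, lo, hi):
        assert lo < hi
        self.lo, self.hi = lo, hi

def split(g, mid):
    """Partition one group's range into two adjacent ranges."""
    assert g.lo < mid < g.hi
    return GroupRange(g.lo, mid), GroupRange(mid, g.hi)

def merge(a, b):
    """Union of two *neighboring* groups' ranges."""
    assert a.hi == b.lo
    return GroupRange(a.lo, b.hi)

def repartition(a, b, new_boundary):
    """Shift the boundary between two adjacent groups."""
    assert a.hi == b.lo and a.lo < new_boundary < b.hi
    return GroupRange(a.lo, new_boundary), GroupRange(new_boundary, b.hi)

a, b = split(GroupRange(0, 100), 60)
a2, b2 = repartition(a, b, 40)
whole = merge(a2, b2)
assert (whole.lo, whole.hi) == (0, 100)
```

(`migrate` changes group membership rather than ranges, so it does not appear in this range-only sketch.)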
Although our approach is straightforward and combines well-known techniques from the
literature, we encountered a number of technical challenges that may not be apparent from
a cursory inspection of the high-level design.
Atomicity: Multi-group operations modify the routing state across multiple groups, but
as we discussed in Section 2, strong consistency is difficult or impossible to guarantee when
modifications to the routing topology are not atomic. Therefore, we chose to structure each
multi-group operation in Scatter as a distributed transaction. We illustrate this design
pattern, which we call nested consensus, in Figure 3. We believe that this general idea
of structuring protocols as communication between replicated participants, rather than
between individual nodes, can be applied more generally to the construction of scalable,
consistent distributed systems.
Nested consensus uses a two-tiered approach. At the top tier, groups execute a two-phase
commit protocol (2PC), while within each group the actions that the group takes are
agreed on using consensus protocols. Multi-group operations are coordinated by whichever
group decides to initiate the transaction as a result of some local policy. As Scatter is
decentralized, multiple groups can concurrently initiate conflicting transactions. Section 4
details the mechanisms used to coordinate distributed transactions across groups.
Performance: Strong consistency in distributed systems is commonly thought to come with an unacceptably high performance and availability cost. The challenge of maximizing system performance, whether defined in terms of latency, throughput, or availability, without compromising the system’s core integrity influenced every level of Scatter’s design and implementation. Although many before us have shown that strongly consistent replication techniques can be implemented efficiently at small scale, the bigger challenge for us was the additional layer of “heavy-weight” mechanisms, distributed transactions, on top of multiple instantiations of independent replicated state machines.
Self Organization: Our choice of complete decentralization makes the design of policies
non-trivial. In contrast to designs in which a system deployment is tuned through human
intervention or an omnipotent component, Scatter is tuned by the actions of individual
groups using local information for optimization. Section 6 outlines various techniques for
optimizing the resilience, performance, and load-balance of Scatter groups using local or
partially sampled non-local information.
Multi-Group Operations
In this section, we describe how we use nested consensus to implement multi-group operations. Section 4.1 characterizes our requirements for a consistent and available overlay
topology. Section 4.2 details the nested consensus technique, and Section 4.3 walks through
a concrete example of the group split operation.
Overlay Consistency Requirements
Scatter’s overlay was designed to solve the consistency and availability problems discussed
in Section 2. As Scatter is defined in terms of groups rather than nodes, we will slightly
Figure 3: Overview of nested consensus. Groups coordinate distributed transactions using a two-phase commit protocol. Within each group, nodes coordinate using the Paxos distributed consensus algorithm.
rephrase the assignment consistency correctness condition as the following system invariant: groups that are adjacent in the overlay agree on a partitioning of the key-space between
them. For individual links in the overlay to remain highly available, Scatter maintains an
additional invariant: a group can always reach its adjacent groups. Although these invariants are locally defined, they are sufficient to provide global consistency and availability
properties for Scatter’s overlay.
We can derive further requirements from these conditions for operations that modify either
the set of groups, the membership of groups, or the partitioning of the key-space among
groups. For instance, in order for a group Ga to be able to communicate directly with an
adjacent group Gb , Ga must have knowledge of some subset of Gb ’s members. The following
property is sufficient, but perhaps stronger than necessary, to maintain this connectivity:
every adjacent group of Gb has up-to-date knowledge of the membership of Gb . This
requirement motivated our implementation of operations that modify the membership of
a group Gb to be eagerly replicated across all groups adjacent to Gb in the overlay.
In keeping with our goal to build on classic fault-tolerant distributed algorithms rather than
inventing ad-hoc protocols, we chose to structure group membership updates as distributed
transactions across groups. This approach not only satisfied our requirement of eager
replication but provided a powerful framework for implementing the more challenging
multi-group operations such as group splits and merges.
Consider, for example, the
scenario in Figure 4 where two adjacent groups, G1 and G2 , propose a merge operation
simultaneously. To maintain Scatter’s two overlay consistency invariants, the adjacent
groups G0 and G4 must be involved as well. Note that the changes required by G1 ’s
proposal and G2 ’s proposal conflict — i.e., if both operations were executed concurrently
they would violate the structural integrity of the overlay. These anomalies are prevented
by the atomicity and concurrency control provided by our transactional framework.
Nested Consensus
Figure 4: Scenario where two adjacent groups, G1 and G2 , propose a merge
operation simultaneously. G1 proposes a merge of G1 and G2 , while G2 proposes
a merge of G2 and G3 . These two proposals conflict.
Scatter implements distributed transactions across groups using a technique we call nested
consensus (Figure 3). At a high level, groups execute a two-phase commit protocol (2PC);
before a group executes a step in the 2PC protocol it uses the Paxos distributed consensus
algorithm to internally replicate the decision to execute the step. Thus distributed replication plays the role of write-ahead logging to stable storage in the classic 2PC protocol.
We will refer to the group initiating a transaction as the coordinator group and to the
other groups involved as the participant groups. The following sequence of steps loosely
captures the overall structure of nested consensus:
1. The coordinator group replicates the decision to initiate the transaction.
2. The coordinator group broadcasts a transaction prepare message to the nodes of the
participant groups.
3. Upon receiving the prepare message, a participant group decides whether or not to
commit the proposed transaction and replicates its vote.
4. A participant group broadcasts a commit or abort message to the nodes of the coordinator group.
5. When the votes of all participant groups are known, the coordinator group replicates
whether or not the transaction was committed.
6. The coordinator group broadcasts the outcome of the transaction to all participant groups.
7. Participant groups replicate the transaction outcome.
8. When a group learns that a transaction has been committed then it executes the
steps of the proposed transaction, the particulars of which depend on the multigroup operation.
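The eight steps above can be sketched as follows. Each group's Paxos-replicated write-ahead log is simulated here by a plain in-memory list, and all names and the vote callback are our own stand-ins, not Scatter's actual interfaces.

```python
class Group:
    def __init__(self, name):
        self.name = name
        self.log = []  # stands in for a Paxos-replicated write-ahead log

    def replicate(self, entry):
        # In Scatter this is a Paxos round among the group's nodes;
        # here we simply append to an in-memory log.
        self.log.append(entry)

def nested_consensus(coordinator, participants, proposal, vote):
    coordinator.replicate(("begin", proposal))             # step 1
    votes = []
    for p in participants:                                 # steps 2-4
        v = "commit" if vote(p, proposal) else "abort"
        p.replicate(("vote", v))                           # vote replicated in-group
        votes.append(v)
    outcome = "committed" if all(v == "commit" for v in votes) else "aborted"
    coordinator.replicate(("outcome", outcome))            # step 5
    for p in participants:                                 # steps 6-7
        p.replicate(("outcome", outcome))
    return outcome  # step 8: each group applies the operation iff committed
```

Because every decision is replicated before the corresponding 2PC message is sent, any node in a group can reconstruct the transaction's progress from the group's log after a leader failure.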
Figure 5: Group G2 splits into two groups, G2a and G2b . Groups G1 , G2 , and
G3 participate in the distributed transaction. Causal time advances vertically,
and messages between groups are represented by arrows. The cells beneath
each group name represent the totally-ordered replicated log of transaction
steps for that group.
Note that nested consensus is a non-blocking protocol. Provided a majority of nodes in each group remain alive and connected, the two-phase commit protocol will terminate. Even if the previous leader of the coordinating group fails, another node can take its place and resume the transaction. This is not the case when two-phase commit is applied to managing routing state in a traditional DHT.
In our implementation the leader of a group initiates every action of the group, but we note
that a judicious use of broadcasts and message batching lowers the apparently high number
of message rounds implied by the above steps. We also think that the large body of work
on optimizing distributed transactions could be applied to further optimize performance
of nested consensus, but our experimental evaluations in Section 7 show that performance
is reasonable even with a relatively conservative implementation.
Our implementation encourages concurrency while respecting safety. For example, the
storage service (Section 5) continues to process client requests during the execution of
group transactions except for a brief period of unavailability during any reconfiguration
required by a committed transaction. Also, groups continue to serve lookup requests during
transactions that modify the partitioning of the key-space provided that the lookups are
serialized with respect to the transaction commit.
To illustrate the mechanics of nested consensus, the remainder of the section walks through
an example group split operation and then considers the behavior of this mechanism in
the presence of faults and concurrent transactions.
Example: Group Split
Figure 5 illustrates three groups executing a split transaction. For clarity, this example
demonstrates the necessary steps in nested consensus in the simplest case — a non-faulty
leader and no concurrent transactions. At t0 , G2 has replicated its intent to split into
the two groups G2a and G2b and then sends a 2PC prepare message to G1 and G3 . In
parallel, G1 and G3 internally replicate their votes to commit the proposed split before replying to G2. After each group has learned and replicated the outcome (committed) of
the split operation at time t3 , then the following updates are executed by the respective
group: (1) G1 updates its successor pointer to G2a , (2) G3 updates its predecessor pointer
to G2b , and (3) G2 executes a replicated state machine reconfiguration to instantiate the
two new groups, which partition between them G2 ’s original key-range and set of member nodes.
To introduce some of the engineering considerations needed for nested consensus, we consider the behavior of this example in more challenging conditions. First, suppose that
the leader of G1 fails after replicating intent to begin the transaction but before sending
the prepare messages to the participant groups. The other nodes of G1 will eventually
detect the leader failure and elect a new leader. When the new leader is elected, it behaves
just like a restarted classical transaction manager: it queries the replicated write-ahead
log and continues executing the transaction. We also implemented standard mechanisms
for message timeouts and re-deliveries, with the caveat that individual steps should be
implemented so that they are idempotent or have no effect when re-executed.
We return to the question of concurrency control. Say that G1 proposed a merge operation
with G2 simultaneously with G2 ’s split proposal. The simplest response is to enforce
mutual exclusion between transactions by participant groups voting to abort liberally. We
implemented a slightly less restrictive definition of conflicting multi-group operations by
defining a lock for each link in the overlay. Finer-grained locks reduce the incidence of
deadlock; for example, two groups, G1 and G3 , that are separated by two hops in the
overlay would be able to update their membership concurrently; whereas with complete
mutual exclusion these two operations would conflict at the group in the middle (G2 ).
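This link-locking rule admits a minimal sketch (the path encoding is our own, not Scatter's): an operation touching a run of adjacent groups locks each overlay link on that run, and two operations conflict only if their lock sets intersect.

```python
def link_locks(path):
    """Locks for an operation touching a path of adjacent groups:
    one lock per overlay link, i.e. per unordered pair of neighbors."""
    return {frozenset(pair) for pair in zip(path, path[1:])}

def conflicts(op_a, op_b):
    """Two multi-group operations conflict iff they lock a common link."""
    return bool(link_locks(op_a) & link_locks(op_b))
```

This reproduces the example from the text: membership updates at G1 (touching G0-G1-G2) and at G3 (touching G2-G3-G4) lock disjoint links and may proceed concurrently, whereas coarse per-group mutual exclusion would serialize them at G2.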
A consistent and scalable lookup service provides a useful abstraction onto which richer
functionality can be layered. This section describes the storage service that each Scatter
group provides for its range of the global key-space. To evaluate Scatter, we implemented
a peer-to-peer Twitter-like application layered on a standard DHT interface. This allowed
us to do a relatively direct comparison with OpenDHT in Section 7.
As explained in Section 4, each group uses Paxos to replicate the intermediate state needed
for multi-group operations. Since multi-group operations are triggered by environmental
changes such as churn or shifts in load, our design assumes these occur with low frequency
in comparison to normal client operations. Therefore Scatter optimizes each group to
provide low latency and high throughput client storage.
To improve throughput, we partition each group’s storage state among its member nodes
(see Figure 6). Storage operations in Scatter take the form of a simple read or write on
an individual key. Each operation is forwarded to the node of the group assigned to a
particular key – referred to as the primary for that key.
The group leader replicates information regarding the assignment of keys to primaries
using Paxos, as it does with the state for multi-group operations. The key assignment is
cached as soft state by the routing service in the other Scatter groups. All messages are
implemented on top of UDP, and Scatter makes no guarantees about reliable delivery or
ordering of client messages. Once an operation is routed to the correct group for a given
key, then any node in the group will forward the operation to the appropriate primary.
Each primary uses Paxos to replicate operations on its key-range to all the other nodes
in the group – this provides linearizability. Our use of the Paxos algorithm in this case
behaves very much like other primary-backup replication protocols – a single message round
Figure 6: Example Scatter group composed of three nodes (a, b, c) and assigned
to the key-range [ka , kd ). The group’s key-range is partitioned such that each
node of the group is the primary for some subset of the group’s key-space.
The primary of a key-range owns those keys and both orders and replicates
all operations on the keys to the other nodes in the group; e.g., a is assigned
[ka , kb ] and replicates all updates to these keys to b and c using Paxos.
usually suffices for replication, and operations on different keys and different primaries are
not synchronized with respect to each other.
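The assignment of keys to primaries within a group can be sketched as a sorted list of sub-range boundaries, one per primary; routing a client operation then reduces to a binary search. The class and layout below are our own illustration of the scheme in Figure 6, not Scatter's code.

```python
import bisect

class GroupStorage:
    """A group's key-range partitioned among member primaries.
    boundaries[i] is the first key owned by primaries[i] (hypothetical layout)."""
    def __init__(self, boundaries, primaries):
        assert len(boundaries) == len(primaries)
        self.boundaries = boundaries
        self.primaries = primaries

    def primary_for(self, key):
        """Return the node that orders and replicates operations on `key`."""
        i = bisect.bisect_right(self.boundaries, key) - 1
        assert i >= 0, "key below the group's range"
        return self.primaries[i]
```

Mirroring Figure 6 with numeric stand-ins for ka..kd, a group `GroupStorage([0, 10, 20], ["a", "b", "c"])` forwards an operation on key 5 to primary a, which then runs a Paxos round to replicate it to b and c.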
In parliamentary terms [18], the structure within a group can be explained as follows.
The group nodes form the group parliament which elects a parliamentary leader and then
divides the law into disjoint areas, forming a separate committee to manage each resulting
area of the law independently. All members of parliament are also a member of every
committee, but each committee appoints a different committee chair (i.e., the primary)
such that no individual member of parliament is unfairly burdened in comparison to his
peers. Because the chair is a distinguished proposer in his area, in the common case only a
single round of messages is required to pass a committee decree. Further, since committees
are assigned to disjoint areas of the law, decrees in different committees can be processed
concurrently without requiring a total ordering of decrees among committees.
In addition to the basic mechanics described in this section and the previous section,
Scatter implements additional optimizations including:
• Leases: Our mechanism for delegating keys to primaries does not require time-based leases; however, leases can be turned on for a given deployment. Leases allow
primaries to satisfy reads without communicating to the rest of the group; however,
the use of leases can also delay the execution of certain group operations when a
primary fails.
• Diskless Paxos: Our implementation of Paxos does not require writing to disk.
Nodes that restart just rejoin the system as new nodes.
• Relaxed Reads: All replicas for a given key can answer read requests from local –
possibly stale – state. Relaxed reads violate linearizability, but are provided as an
option for clients.
Table 1: Deployment settings and system properties that a Scatter policy may target. The settings are: low churn with uniform latency, low churn with non-uniform latency, and high churn with non-uniform latency. A check mark indicates that we have developed a policy for the combination of setting and property.
Figure 7: Impact of group size on group failure probability for two Pareto
distributed node churn rates, with average lifetimes µ = 100s and µ = 500s.
An important property of Scatter’s design is the separation of policy from mechanism.
For example, the mechanism by which a node joins a group does not prescribe how the
target group is selected. Policies enable Scatter to adapt to a wide range of operating
conditions and are a powerful means of altering system behavior with no change to any of
the underlying mechanisms.
In this section we describe the policies that we have found to be effective in the three
experimental settings where we have deployed and evaluated Scatter (see Section 7). These
are: (1) low churn and uniform network latency, (2) low churn and non-uniform network
latency, and (3) high churn and non-uniform network latency. Table 1 lists each of these
settings, and three system properties that a potential policy might optimize. A X in
the table indicates that we have developed a policy for the corresponding combination of
deployment setting and system property. We now describe the policies for each of the
three system properties.
Resilience
Scatter must be resilient to node churn as nodes join and depart the system unexpectedly.
A Scatter group with 2k + 1 nodes guarantees data availability with up to k node failures.
With more than k failures, a group cannot process client operations safely. To improve
resilience, Scatter employs a policy that prompts a group to merge with an adjacent group
if its node count is below a predefined threshold. This maintains high data availability and
helps prevent data loss. This policy trades performance for availability, since smaller groups are more efficient.
To determine the appropriate group size threshold we carried out a simulation, parameterized with the base reconfiguration latency plotted in Figure 12(b). Figure 7 plots the
probability of group failure for different group sizes for two node churn rates with node
lifetimes drawn from heavy-tailed Pareto distributions observed in typical peer-to-peer
systems [3, 32]. The plot indicates that a modest group size of 8-12 prevents group failure
with high probability.
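Under the simplifying assumption that node failures within a reconfiguration window are independent with probability p (unlike the Pareto-lifetime simulation behind Figure 7), the chance that a group of n = 2k + 1 nodes loses its quorum is a binomial tail, which falls off quickly with group size:

```python
from math import comb

def group_failure_prob(n, p):
    """Probability that more than n // 2 of n nodes fail within a window,
    i.e. the group loses a functional quorum (assumes independent failures)."""
    k = n // 2  # a group of n = 2k + 1 nodes tolerates k failures
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1, n + 1))
```

This simplified model agrees qualitatively with Figure 7: even a modest increase in group size drives the failure probability down sharply, which is why sizes of 8-12 suffice.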
The resilience policy also directs how nodes join the system. A new node samples k random
groups and joins the group that is most likely to fail. The group failure probability is
computed using node lifetime distribution information, if available. In the absence of
this data, the policy defaults to having a new node join the sampled group with the fewest
nodes. The default policy also takes into account the physical diversity of nodes in a group,
e.g., the number of distinct BGP prefixes spanned by the group. It then assigns a joining
node to a group that has the smallest number of nodes and spans a limited number of BGP
prefixes in order to optimize for both uncorrelated and correlated failures. We performed a
large-scale simulation to determine the impact of the number of groups sampled and found
that checking four groups is sufficient to significantly reduce the number of reconfiguration
operations performed later. If multiple groups have the expected failure probability below
the desired threshold, then the new node picks the target group based on the policy for
optimizing latency as described below.
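The sampling rule just described might be sketched as follows; the group representation and the optional failure-probability scorer are our assumptions, not Scatter's interfaces.

```python
import random

def pick_group(groups, k=4, failure_prob=None):
    """Sample k random groups and join the one most at risk of failing.
    Without lifetime-distribution data, fall back to the fewest-members rule."""
    sample = random.sample(groups, min(k, len(groups)))
    if failure_prob is not None:
        return max(sample, key=failure_prob)  # join the most failure-prone group
    return min(sample, key=lambda g: len(g["members"]))  # default heuristic
```

A fuller version would also penalize groups spanning few BGP prefixes, so that joins improve resilience to correlated as well as uncorrelated failures, and would break ties using the latency policy below.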
Latency
Client operation latency depends on the time to reach the primary, and the time for the primary to
reach consensus with the other replicas. A join policy can optimize client latency by placing
new nodes into groups where their latencies to the other nodes in the group will be low.
The latency-optimized join policy accomplishes this by having the joining node randomly
select k groups and pass a no-op operation in each of them as a pseudo primary. This allows
the node to estimate the performance of operating within each group. While performing
these operations, nodes do not explicitly join and leave each group. The node then joins
the group with the smallest command execution latency. Note that latency-optimized join
is used only as a secondary metric when there are multiple candidate groups with the
desired resiliency properties. As a consequence, these performance optimizations are not
at the cost of reduced levels of physical diversity or group robustness. Experiments in
Section 7.1.1 compare the latency-optimized join policy with k = 3 against the random
join policy.
The latency-optimized leader selection policy optimizes the RSM command latency in a
different way – the group elects the node that has the lowest Paxos agreement latency as
the leader. We evaluate the impact of this policy on reconfiguration, merge, and split costs
in Section 7.1.3.
Load Balance
Scatter also balances load across groups in order to achieve scalable and predictable performance. A simple and direct method for balancing load is to direct a new node to join
a heavily loaded group. The load-balanced join policy does exactly this – a joining
node samples k groups, selects groups that have low failure probability, and then joins the
group that has processed the most client operations in the recent past. The load-balance
policy also repartitions the keyspace among adjacent groups when the request load to their
respective keyspaces is skewed. In our implementation, groups repartition their keyspaces
proportionally to their respective loads whenever a group’s load is a factor of 1.6 or above
that of its neighboring group. As this check is performed locally between adjacent groups,
it does not require global load monitoring, but it might require multiple iterations of the
load-balancing operation to disperse hotspots. We note that the overall, cumulative effect
of many concurrent locally optimal modifications is non-trivial to understand. A thorough
analysis of the effect of local decisions on global state is an intriguing direction for future work.
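The local repartitioning check could be sketched as follows, modeling a pair of adjacent key-ranges as numeric intervals with load assumed uniform within each range (both the interval model and the uniformity assumption are ours):

```python
def rebalanced_boundary(lo, mid, hi, load_a, load_b, threshold=1.6):
    """Given neighbors A = [lo, mid) and B = [mid, hi), move the shared
    boundary so each group carries roughly half the combined load, but
    only when one load exceeds the other by `threshold` (1.6 in Scatter)."""
    if max(load_a, load_b) < threshold * min(load_a, load_b):
        return mid  # loads are balanced enough; keep the current boundary
    half = (load_a + load_b) / 2
    if load_a > half:
        # shrink the heavier left group: find the point carrying half the load
        density_a = load_a / (mid - lo)
        return lo + half / density_a
    density_b = load_b / (hi - mid)
    return mid + (half - load_a) / density_b
```

Because each check involves only two adjacent groups, a hotspot spanning several groups disperses over multiple iterations rather than in one globally coordinated step, matching the behavior described above.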
Evaluation
We evaluated Scatter across three deployment environments, corresponding to the churn/latency
settings listed in Table 1: (1) single cluster: a homogeneous and dedicated Emulab cluster to evaluate the low churn/uniform latency setting; (2) multi-cluster: multiple dedicated clusters (Emulab and Amazon’s EC2) at LAN sites connected over the wide-area
to evaluate the low churn/non-uniform latency setting; (3) peer-to-peer: machines from
PlanetLab in the wide-area to evaluate the high churn/non-uniform latency setting.
Figure 8: Inter-site latencies in the multi-cluster setting used in experiments.
In all experiments Scatter ran on a single core on a given node. On Emulab we used
150 nodes with 2.4GHz 64-bit Xeon processor cores. On PlanetLab we used 840 nodes,
essentially all nodes on which we could install both Scatter and OpenDHT.
For multi-cluster experiments we used 50 nodes each from Emulab (Utah), EC2-West
(California) and EC2-East (Virginia). The processors on the EC2 nodes were also 64-bit
processor cores clocked at 2.4GHz. Figure 8 details the inter-site latencies for the multi-cluster experiments. We performed our multi-cluster experiments using nodes at physically
distinct locations in order to study the performance of our system under realistic wide-area
network conditions.
We used Berkeley-DB for persistent disk-based storage, and a memory cache to pipeline
operations to BDB in the background.
Section 7.1 quantifies specific Scatter overheads with deployments on dedicated testbeds
(single-cluster, multi-cluster). We then evaluate Scatter at large scales on PlanetLab with
varying churn rates in the context of a Twitter-like publish-subscribe application called
Chirp in Section 7.2, and also compare it to a Chirp implementation on top of OpenDHT.
In this section we show that a Scatter group imposes a minor latency overhead and that
primaries dramatically increase group operation processing throughput. Then, we evaluate
the latency of group reconfiguration, split and merge. The results indicate that group
operations are more expensive than client operations, but the overheads are tolerable since
these operations are infrequent.
Figure 9 plots a group’s client operation processing latency for single cluster and multi-cluster settings. The plotted latencies do not include the network delay between the client
and the group. The client perceived latency will have an additional delay component
that is simply the latency from the client to the target group. We present the end-to-end
application-level latencies in Section 7.2.
Figure 9(a) plots client operation latency for different operations in groups of different
Figure 9: Latency of different client operations in (a) a single-cluster deployment for groups of different sizes, and (b) a multi-cluster deployment in which
no site had a majority of nodes in the group.
Figure 10: The impact of join policy on write latency in two PlanetLab deployments. The latency-optimized join policy is described in Section 6.2. The
random join policy directs nodes to join a group at random.
sizes. The latency of leased reads did not vary with group size – they are processed locally by the primary. Non-leased reads were slightly faster than primary writes as they differ
only in the storage layer overhead. Non-primary writes were significantly slower than
primary-based operations because the primary uses the faster leader-Paxos for consensus.
In the multi-cluster setting no site had a node majority. Figure 9(b) plots the latency for
Figure 11: Scatter group throughput in single cluster and multi-cluster settings.
operations that require a primary to coordinate with nodes from at least one other site. As
a result, inter-cluster WAN latency dominates client operation latency. As expected, operations initiated by primaries at EC2-East had significantly higher latency, while operations
by primaries at EC2-West and Emulab had comparable latency.
To illustrate how policy may impact client operation latency, Figure 10 compares the
impact of latency-optimized join policy with k = 3 (described in Section 6.2) to the random
join policy on the primary’s write latency in a PlanetLab setting. In both PlanetLab
deployments, nodes joined Scatter using the respective policy, and after all nodes joined,
millions of writes were performed to random locations. The effect of the latency-optimized
policy is a clustering of nodes that are close in latency into the same group. Figure 10
shows that this policy greatly improves write performance over the random join policy –
median latency decreased by 45%, from 124ms to 68ms.
Latencies in the PlanetLab deployment also demonstrate the benefit of majority consensus
in mitigating the impact of slow-performing outlier nodes on group operation latency.
Though PlanetLab nodes are globally distributed, the 124ms median latency of a primary
write (with random join policy) is not much higher than that of the multi-cluster setting.
Slow nodes impose a latency cost but they also benefit the system overall as they improve
fault tolerance by consistently replicating state, albeit slowly.
Figure 12: CDFs of group reconfiguration latencies for a P2P setting with two
sets of policies: (a) random join and random leader, and (b) latency-optimized
join and latency-optimized leader.
Figure 11 plots write throughput of a single group in single cluster and multi-cluster settings. Writes were performed on randomly selected segments. Throughput was determined
by varying both the number of clients (up to 20) and the number of outstanding operations
per client (up to 100).
The figure demonstrates the performance benefit of using primaries. In both settings, a
single leader becomes a scalability bottleneck and throughput quickly degrades for groups
with more nodes. This happens because the message overhead associated with executing
a group command is linear in group size. Each additional primary, however, adds extra
capacity to the group since primaries process client operations in parallel and also pipeline
client operations. The result is that in return for higher reliability (afforded by having
more nodes) a group’s throughput decreases only slightly when using primaries.
Though the latency of a typical operation in the multi-cluster deployment is significantly
higher than the corresponding operation in the single cluster setting, group throughput in
the multi-cluster setting (Figure 11(b)) is within 30% of the group throughput in a single cluster setting (Figure 11(a)). And for large groups this disparity is marginal. The reason
for this is the pipelining of client requests by group primaries, which allows the system to
mask the cost of wide-area network communication.
Reconfiguration, Split, and Merge
We evaluated the latency cost of group reconfiguration, split, and merge operations. In
the case of failure, this latency is the duration between a failure detector sensing a failure
and the completion of the resulting reconfiguration. Table 2 lists the average latencies
and standard deviations for single and multi-cluster settings across thousands of runs and
across group sizes 2-13. These measurements do not account for data transfer latency.
Single cluster: (all operations below 10 ms)
Multi-cluster: 90.9 ± 31.8   208.8 ± 48.8   246.5 ± 45.4   307.6 ± 69.8
Multi-cluster (opt. leader): 55.6 ± 7.6   135.8 ± 15.2   178.5 ± 15.1   200.7 ± 24.4
Table 2: Group reconfiguration, split, and merge latencies in milliseconds and
standard deviations for different deployment settings.
Basic single cluster latency.
In the single cluster setting all operations take less
than 10ms. Splitting and merging are the most expensive operations as they require
coordination between groups, and merging is more expensive because it involves more
groups than splitting.
Impact of policy on multi-cluster latency. The single-cluster setting provides little
opportunity for optimization due to latency homogeneity. However, in the multi-cluster
settings, we can decrease the latency cost with a leader election policy. Table 2 lists
latencies for two multi-cluster deployments, one with a random leader election policy, and
one that used a latency-optimized leader policy described in Section 6.2. From the table,
the latency optimizing policy significantly reduced the latency cost of all operations.
Impact of policy on PlanetLab latency. Figure 12 plots CDFs of latencies for the
PlanetLab deployment. It compares the random join with random leader policies (Figure 12(a)) against latency-optimized join and latency-optimized leader policies described
in Section 6.2 (Figure 12(b)). In combination, the two latency optimizing policies shift
the CDF curves to the left, decreasing the latency of all operations – reconfiguration, split
and merge.
Application-level Benchmarks
To study the macro-level behavior of Scatter, we built and deployed Chirp, a Twitter-like application. In this section we compare PlanetLab deployments of Chirp on top of Scatter and on top of OpenDHT, an open-source DHT implementation that is currently deployed on PlanetLab. OpenDHT uses lightweight techniques for DHT maintenance, and its access latencies are comparable to those of other DHTs [28]. It therefore allows us to evaluate the impact of the more heavy-weight techniques used in Scatter.
For a fair comparison, both Scatter and OpenDHT send node heartbeat messages every 0.5s. After four consecutive heartbeat failures, OpenDHT re-replicates the failed node’s keys, and Scatter reconfigures to exclude the failed node and re-partitions the group’s keyspace among primaries. Additionally, Scatter used the same base-16 recursive routing algorithm as OpenDHT. Only forward and reverse group pointers were maintained consistently in Scatter; the system relied on them only when the soft routing state turned out to be inconsistent. In both systems the replication factor was set to provide at least seven 9s of reliability, i.e., with an average lifetime of 100 seconds, we use 9 replicas (see Figure 7).
To induce churn we use two different lifetime distributions, Poisson and Pareto. Pareto is
Figure 13: Impact of varying churn rates for Poisson distributed lifetimes.
The graphs plot (a) performance (update fetch latency in ms), (b) availability
(failed fetches, %), and (c) consistency (missed updates, %) against median
session time (secs), for P2P deployments of Chirp on both Scatter (dashed
line) and OpenDHT (solid line).
a typical way of modeling churn in P2P systems [3, 32], and Poisson is a common reference
distribution. For both churn distributions, the join policy directed an arriving node to the group with the lowest expected residual lifetime; for Poisson this is equivalent to joining the group with
the fewest nodes.
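As a rough sketch of this join policy (the `Group` structure and the summation of members' expected residual lifetimes are our assumptions, not Scatter's actual estimator):

```python
from collections import namedtuple

Group = namedtuple("Group", ["gid", "members"])

def choose_group(groups, residual_lifetime):
    """Join-policy sketch: pick the group with the lowest expected residual
    lifetime, estimated here as the sum of its members' expected residual
    lifetimes. Under Poisson (exponential) lifetimes every node has the same
    expected residual, so this reduces to the group with the fewest nodes."""
    return min(groups, key=lambda g: sum(residual_lifetime(n) for n in g.members))
```

Under a heavy-tailed distribution the `residual_lifetime` estimate differs per node, so the choice is no longer simply the smallest group.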
Chirp overview
Chirp works much like Twitter: to participate in the system a user u creates a user-name, which is associated with two user-specific keys, K_updates and K_follows, that are computed by hashing u’s user-name. A user may write and post an update, which is at most 140 characters in length; follow another user; or fetch updates posted by the users being followed. An update by a user u is appended to the value of K_updates. When u follows u′, the key K_updates of u′ is appended to the value of K_follows, which maintains the list of all users u is following.
Appending to a key value is implemented as a non-atomic read-modify-write, requiring
two storage operations. This was done to more fairly compare Scatter and OpenDHT. A
key’s maximum value was 8K in both systems. When a key’s value capacity is exceeded
(e.g., a user posts over 57 maximum-sized updates), a new key is written and the new key
is appended to the end of the value of the old key, as a pointer to the continuation of the
list. The Chirp client application caches previously known tails of each list accessed by
the user in order to avoid repeatedly scanning through the list to fetch the most recent
updates. In addition, the pointer to the tail of the list is stored at its header so that a
user’s followers can efficiently access the most recent updates of the user.
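The append-and-overflow scheme can be sketched as follows; the `Store` class, the record layout, and the `#next` continuation-key naming are hypothetical stand-ins for the two systems' get/put interfaces:

```python
MAX_VALUE_BYTES = 8192  # per-key value cap used in both systems

class Store:
    """Minimal stand-in for the DHT's get/put interface (illustrative only)."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        # a fresh list record: no updates, no continuation, tail unknown
        return self.data.setdefault(key, {"updates": [], "next": None, "tail": None})
    def put(self, key, value):
        self.data[key] = value

def append_update(store, head_key, update):
    """Append via a non-atomic read-modify-write (two storage operations).
    On overflow, link a continuation key from the old tail and record the
    new tail in the list header, as described in the text."""
    head = store.get(head_key)
    tail_key = head["tail"] or head_key
    tail = head if tail_key == head_key else store.get(tail_key)
    if sum(len(u) for u in tail["updates"]) + len(update) > MAX_VALUE_BYTES:
        new_key = tail_key + "#next"       # hypothetical continuation key name
        tail["next"] = new_key             # old tail points to its continuation
        store.put(tail_key, tail)
        tail, tail_key = {"updates": [], "next": None}, new_key
        head["tail"] = new_key             # header tracks the tail for fast access
        store.put(head_key, head)
    tail["updates"].append(update)
    store.put(tail_key, tail)              # write back: second op of the RMW
```

Because the read-modify-write is not atomic, two concurrent appends can race; the text notes this choice was made to compare Scatter and OpenDHT on equal footing.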
Figure 14: Impact of varying churn rates for Pareto distributed lifetimes
(α = 1.1). The graphs plot (a) performance (update fetch latency in ms),
(b) availability (failed fetches, %), and (c) consistency (missed updates, %)
against median session time (secs).
We evaluated the performance of Chirp on Scatter and OpenDHT by varying churn, the
distribution of node lifetimes, and the popularity distribution of keys. For the experiments
below, we used workloads obtained from Twitter measurement studies [17, 15]. The measurements include both the updates posted by the users and the structure of the social
network over which the updates are propagated.
Impact of Churn
We first evaluate the performance by using node lifetime distributions that are Poisson
distributed and by varying the mean lifetime value from 100 seconds to 1000 seconds.
We based our lifetime distributions on measurements of real-world P2P systems such as
Gnutella, Kazaa, FastTrack, and Overnet [32, 13, 7, 34]. For this experiment, the update/fetch Chirp workload was synthesized as follows: we played a trace of status updates
derived from the Twitter measurement studies, and for each user u posting an update, we
randomly selected one of u’s followers and issued a request from this user for the newly
posted update. Figure 13 plots performance, availability, and consistency of the fetches in
this workload as we vary churn. Each data point represents the mean value for a million
fetch operations.
Figure 13(a) indicates that the performance of both systems degrades with increasing churn
as routing state becomes increasingly stale, and the probability of the value residing on a
failed node increases. OpenDHT slightly outperforms Scatter in fetch latency because a
fetch in Scatter incurs a round of group communication.
Figure 13(b) shows that Scatter has better availability than OpenDHT. The availability
loss in OpenDHT was often due to the lack of structural integrity, with inconsistent successor pointer information or because a key being fetched had not been assigned to any of the nodes (see Figure 1). To compute the fetch failure for Scatter in Figure 13(b), an operation was considered to have failed if a response had not been received within three seconds. The loss of availability for Scatter arose because an operation may be delayed for over three seconds when the destination key belonged to a group undergoing reconfiguration in response to churn.
Figure 15: High load results for Chirp with node churn distributed as
Pareto(α = 1.1). (a) Scatter has better latency than OpenDHT at high loads;
(b) Scatter maintains a more balanced distribution of load among its nodes
than OpenDHT.
Figure 13(c) compares the consistency of the values stored in the two systems. OpenDHT’s
inconsistency results confirmed prior studies, e.g., [31] — even at low churn rates over 5%
of the fetches were inconsistent. These inconsistencies stem from a failure to keep replicas
consistent, either because an update to a replica failed or because different nodes have
different views regarding how the keyspace is partitioned. In contrast, Scatter had no
inconsistencies across all experiments.
Heavy-tailed node lifetimes
Next, we considered a node lifetime distribution in which lifetimes are drawn from a heavy-tailed Pareto distribution that is typical of many P2P workloads. Heavy-tailed distributions exhibit “memory”, i.e., nodes that have been part of the system for some period of
time are more likely to persist than newly arriving nodes. Scatter provides for a greater
ability to optimize for skewed node lifetime distribution due to its group abstraction. Note
that all of the keys associated with a group are replicated on the same set of nodes,
whereas in OpenDHT each node participates in multiple different replica sets. In this setting, Scatter takes into account the measured residual lifetime distribution in the various
reconfiguration operations, e.g., deciding which group an arriving node should join, when groups should merge or split, and what group size is needed to meet the desired
(seven 9s) availability guarantee. For these experiments the workload was generated in
the same way as the workload used in Section 7.2.2.
OpenDHT slightly outperformed Scatter with respect to access latency (see Figure 14(a)).
However, Scatter’s availability fared better under the heavy-tailed churn rate than that of
OpenDHT (Figure 14(b)). As before, Scatter had no inconsistencies, while OpenDHT was
more inconsistent with the heavy tailed churn rate (Figure 14(c)).
Non-uniform load
In the next experiment, we studied the impact of high load on Scatter. For this experiment,
we batched and issued one million updates from the Twitter trace, and after all of the
updates had been posted, the followers of the selected users fetched the updates. The
fetches were issued in a random order and throttled to a rate of 10,000 fetches per second
for the entire system. Note that in this experiment the keys corresponding to popular
users received more requests, as the load is based on social network properties. The load
is further skewed by the fact that users with a large number of followers are more likely to
post updates [17].
Figure 15(a) shows that Scatter had a slightly better fetch latency than OpenDHT due
to its better load balance properties. However, latency in Scatter tracked OpenDHT’s
latency as in prior experiments (Figures 13(a) and 14(a)).
Figure 15(b) plots the normalized node load for Scatter and OpenDHT. This was computed
in both systems by tracking the total number of fetch requests processed by a node, and
then dividing this number by the mean. The figure shows that Scatter’s load-balance policy
(Section 6.3) is effective at distributing load across nodes in the system. OpenDHT’s load
distribution was more skewed.
For our final set of experiments, we evaluated the scalability of Scatter and its ability to
adapt to variations in system load. We also compared Scatter with ZooKeeper, a system
that provides strongly consistent storage. As ZooKeeper is a centralized and scale-limited
system, we built a decentralized system comprising multiple ZooKeeper instances, where
the global keyspace is statically partitioned across the different instances. We also optimized the performance of this ZooKeeper-based alternative by basing the keyspace partitioning on the historical load estimates of the various key values; we split our workload
into two halves, used the first half to derive the keyspace partitioning, and then performed
the evaluations using the second half of the trace. Each ZooKeeper instance comprised
five nodes. We performed these experiments without node churn, as the system based on
ZooKeeper did not have a management layer for dealing with churn.
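The load-based static partitioning can be sketched as a greedy pass over per-key load counts gathered from the first half of the trace; the function name and representation are ours, not the paper's implementation:

```python
def partition_keyspace(key_loads, num_instances):
    """Derive num_instances - 1 split keys so that contiguous key ranges
    carry roughly equal historical load (a greedy sketch of the strategy
    described in the text).
    key_loads: list of (key, request_count) pairs, sorted by key."""
    total = sum(count for _, count in key_loads)
    target = total / num_instances          # ideal load per instance
    splits, acc = [], 0
    for key, count in key_loads:
        acc += count
        if acc >= target and len(splits) < num_instances - 1:
            splits.append(key)              # boundary: this range ends here
            acc = 0
    return splits
```

Because the split keys are fixed before the run, hotspots that emerge in the second half of the trace cannot be rebalanced, which is exactly the limitation the scalability experiment exposes.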
Figure 16 plots the average throughput results with standard deviations as we vary the
number of nodes in the system. The throughput of Scatter is comparable to that of the
ZooKeeper-based system for a small number of nodes, indicating that Scatter stacks up well
against a highly optimized implementation of distributed consensus. As we increase the
number of nodes, the performance of the ZooKeeper-based alternative scales sub-linearly. This indicates that, even though the keyspace partitioning was derived from historical workload characteristics, the inability to adapt to dynamic hotspots in the access pattern limits
the scalability of the ZooKeeper-based system. Further, the variability in the throughput
Figure 16: Comparison of Scatter with a system that composes multiple
ZooKeeper instances. The figure provides the throughput of the two systems
as we vary the number of nodes.
also increases with the number of ZooKeeper instances used in the experiment. In contrast,
Scatter’s throughput scales linearly with the number of nodes, with only a small amount
of variability due to uneven group sizes and temporary load skews.
Our work builds on foundational techniques for fault-tolerant distributed computing, such as Paxos [18], replicated state machines [33], and transactions [20]. In particular,
our design draws inspiration from the implementation of distributed transactions across
multiple replication groups in Viewstamped Replication [25].
A number of recent distributed systems in industry also rely on distributed consensus
algorithms to provide strongly consistent semantics — such systems provide a low-level
control service for an ecosystem of higher-level infrastructure applications. Well-known
examples of such systems include Google’s Chubby lock service [2] and Yahoo’s ZooKeeper
coordination service [14]. Scatter extends the techniques in such systems to a larger scale.
At another extreme, peer-to-peer systems such as distributed hash tables (DHTs) [26, 30,
22, 29] provide only best-effort probabilistic guarantees, and although targeted at planetary
scale have been found to be brittle and slow in the wild [27, 28]. Still, the large body of
work on peer-to-peer systems has made numerous valuable contributions. Scatter benefits from
many decentralized self-organizing techniques such as sophisticated overlay routing, and
the extensive measurements on workload and other environmental characteristics in this
body of work (e.g. [11]) are invaluable to the design and evaluation of effective policies [23].
One recent system showing that DHTs are a valuable abstraction even in an industrial
setting is Amazon’s Dynamo [10], a highly available distributed key-value store supporting
one of the largest e-commerce operations in the world. Unlike Scatter, Dynamo chooses
availability over consistency, and this trade-off motivates a different set of design choices.
Lynch et al. [21] propose the idea of using state machine replication for atomic data access
in DHTs. An important insight of this theoretical work is that a node in a DHT can be
made more robust if it is implemented as a group of nodes that execute operations atomically using a consensus protocol. An unsolved question in the paper is how to atomically
modify the ring topology under churn, a question which we answer in Scatter with our
principled design of multi-group operations.
Motivated by the same problems with large scale DHTs (as discussed in Section 2), Castro
et al. developed MSPastry [4]. MSPastry makes the Pastry [30] design more dependable,
without sacrificing performance. It does this with robust routing, active probes, and per-hop acknowledgments. A fundamental difference between MSPastry and Scatter is that
Scatter provides provable consistency guarantees. Moreover, Scatter’s group abstraction
can be reused to support more advanced features in the future, such as consistency of
multi-key operations.
Although we approached the problem of scalable consistency by starting with a clean slate,
other approaches in the literature propose mechanisms for consistent operations layered
on top of a weakly-consistent DHT. Etna [24] is a representative system of this approach.
Unfortunately such systems inherit the structural problems of the underlying data system,
resulting in lower object availability and system efficiency. For example, inconsistencies
in the underlying routing protocol will manifest as unavailability at the higher layers (see
Figures 13(b) and 14(b)).
This paper presented the design, implementation and evaluation of Scatter — a scalable
distributed key-value storage system that provides clients with linearizable semantics. Scatter organizes computing resources into fault-tolerant groups, each of which independently serves client requests for segments of the keyspace. Groups employ self-organizing
techniques to manage membership and to coordinate with other groups for improved performance and reliability. Principled and robust group coordination is the primary contribution of our work.
We presented detailed evaluation results for various deployments. Our results demonstrate
that Scatter is efficient in practice, scales linearly with the number of nodes, and provides
high availability even at significant node churn rates. Additionally, we illustrate how
Scatter provides tunable knobs to effectively adapt to the different deployment settings for
significant improvements in load balance, latency, and resilience.
Acknowledgments. This work was supported in part by grant CNS-0963754 from the
National Science Foundation. We would like to thank Vjekoslav Brajkovic and Justin
Cappos for their contributions to earlier versions of Scatter. We would also like to thank
our shepherd Ant Rowstron and the anonymous reviewers for their feedback.
[1] J. Baker, C. Bond, J. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li,
A. Lloyd, and V. Yushprakh. Megastore: Providing Scalable, Highly Available
Storage for Interactive Services. In Proc. of CIDR, 2011.
[2] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In
Proc. of OSDI, 2006.
[3] F. Bustamante and Y. Qiao. Friendships that last: Peer lifespan and its role in P2P
protocols. In Proc. of IEEE WCW, 2003.
[4] M. Castro, M. Costa, and A. Rowstron. Performance and dependability of
structured peer-to-peer overlays. In Proc. of DSN, 2004.
[5] T. D. Chandra, R. Griesemer, and J. Redstone. Paxos Made Live: An Engineering
Perspective. In Proc. of PODC, 2007.
[6] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows,
T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A Distributed Storage System
for Structured Data. ACM Transactions on Computer Systems, 26(2), 2008.
[7] J. Chu, K. Labonte, and B. N. Levine. Availability and Locality Measurements of
Peer-To-Peer File Systems. In Proc. of ITCom, 2002.
[8] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A.
Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!’s Hosted Data
Serving Platform. Proc. VLDB Endow., 1:1277–1288, August 2008.
[9] J. Dean. Large-Scale Distributed Systems at Google: Current Systems and Future
Directions, 2009.
[10] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin,
S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s Highly
Available Key-value Store. In Proc. of SOSP, 2007.
[11] J. Falkner, M. Piatek, J. P. John, A. Krishnamurthy, and T. Anderson. Profiling a
Million User DHT. In Proc. of IMC, 2007.
[12] M. J. Freedman, K. Lakshminarayanan, S. Rhea, and I. Stoica. Non-transitive
connectivity and DHTs. In Proc. of WORLDS, 2005.
[13] P. K. Gummadi, R. J. Dunn, S. Saroiu, S. D. Gribble, H. M. Levy, and J. Zahorjan.
Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload. In
Proc. of SOSP, 2003.
[14] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-Free
Coordination for Internet-scale systems. In Proc. of USENIX ATC, 2010.
[15] J. Yang and J. Leskovec. Temporal Variation in Online Media. In Proc. of WSDM, 2011.
[16] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels,
R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao.
OceanStore: An Architecture for Global-Scale Persistent Storage. In Proc. of
ASPLOS, 2000.
[17] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a
news media? In Proc. of WWW, 2010.
[18] L. Lamport. The Part-Time Parliament. ACM Transactions on Computer Systems,
16(2), 1998.
[19] L. Lamport, D. Malkhi, and L. Zhou. Reconfiguring a State Machine. ACM
SIGACT News, 41(1), 2010.
[20] B. W. Lampson and H. E. Sturgis. Crash recovery in a distributed data storage
system. Technical report, Xerox PARC, 1976.
[21] N. A. Lynch, D. Malkhi, and D. Ratajczak. Atomic Data Access in Distributed Hash
Tables. In Proc. of IPTPS, 2002.
[22] P. Maymounkov and D. Mazières. Kademlia: A Peer-to-Peer Information System
Based on the XOR Metric. In Proc. of IPTPS, 2002.
[23] M. Mitzenmacher. The Power of Two Choices in Randomized Load Balancing. IEEE
Transactions on Parallel and Distributed Systems, 12(10), 2001.
[24] A. Muthitacharoen, S. Gilbert, and R. Morris. Etna: A fault-tolerant algorithm for
atomic mutable DHT data. Technical Report MIT-LCS-TR-993, MIT, June 2005.
[25] B. M. Oki and B. H. Liskov. Viewstamped Replication: A New Primary Copy
Method to Support Highly-Available Distributed Systems. In Proc. of PODC, 1988.
[26] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker. A Scalable
Content-Addressable Network. In Proc. of SIGCOMM, 2001.
[27] S. Rhea, B. Chun, J. Kubiatowicz, and S. Shenker. Fixing the Embarrassing
Slowness of OpenDHT on PlanetLab. In Proc. of WORLDS, 2005.
[28] S. Rhea, D. Geels, T. Roscoe, and J. Kubiatowicz. Handling Churn in a DHT. In
Proc. of USENIX ATC, 2004.
[29] S. Rhea, B. Godfrey, B. Karp, J. Kubiatowicz, S. Ratnasamy, S. Shenker, I. Stoica,
and H. Yu. OpenDHT: A Public DHT Service and Its Uses. In Proc. of SIGCOMM, 2005.
[30] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and
routing for large-scale peer-to-peer systems. In Proc. of Middleware, 2001.
[31] S. Sankararaman, B.-G. Chun, C. Yatin, and S. Shenker. Key Consistency in DHTs.
Technical Report UCB/EECS-2005-21, UC Berkeley, 2005.
[32] S. Saroiu, P. Gummadi, and S. Gribble. A Measurement Study of Peer-to-Peer File
Sharing Systems. In Proc. of MMCN, 2002.
[33] F. B. Schneider. Implementing Fault-Tolerant Services Using the State Machine
Approach: A Tutorial. ACM Computing Surveys, 22(4), 1990.
[34] S. Sen and J. Wang. Analyzing Peer-to-Peer Traffic Across Large Networks.
IEEE/ACM Transactions on Networking, 2004.
[35] I. Stoica, R. Morris, D. Liben-Nowell, D. Karger, M. F. Kaashoek, F. Dabek, and
H. Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet
Applications. Technical Report MIT-LCS-TR-819, MIT, Mar 2001.
[36] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and
H. Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet
Applications. IEEE/ACM Transactions on Networking, 11(1), 2003.
Fast Crash Recovery in RAMCloud
Diego Ongaro, Stephen M. Rumble, Ryan Stutsman,
John Ousterhout, and Mendel Rosenblum
Stanford University
RAMCloud is a DRAM-based storage system that provides inexpensive durability and availability by
recovering quickly after crashes, rather than storing replicas in DRAM. RAMCloud scatters backup
data across hundreds or thousands of disks, and it harnesses hundreds of servers in parallel to reconstruct lost data. The system uses a log-structured approach for all its data, in DRAM as well as on
disk; this provides high performance both during normal operation and during recovery. RAMCloud
employs randomized techniques to manage the system in a scalable and decentralized fashion. In a
60-node cluster, RAMCloud recovers 35 GB of data from a failed server in 1.6 seconds. Our measurements suggest that the approach will scale to recover larger memory sizes (64 GB or more) in
less time with larger clusters.
Categories and Subject Descriptors
D.4.7 [Operating Systems]: Organization and Design—Distributed systems; D.4.2 [Operating Systems]: Storage Management—Main memory; D.4.5 [Operating Systems]: Reliability—Fault-tolerance;
D.4.8 [Operating Systems]: Performance—Measurements
General Terms
Design, Measurement, Performance, Reliability, Experimentation
Keywords
Storage systems, Main memory databases, Crash recovery, Scalability
The role of DRAM in storage systems has been increasing rapidly in recent years, driven by the needs
of large-scale Web applications. These applications manipulate very large datasets with an intensity
that cannot be satisfied by disks alone. As a result, applications are keeping more and more of their
data in DRAM. For example, large-scale caching systems such as memcached [3] are being widely
used (in 2009 Facebook used a total of 150 TB of DRAM in memcached and other caches for a
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
SOSP '11, October 23-26, 2011, Cascais, Portugal.
Copyright © 2011 ACM 978-1-4503-0977-6/11/10 ... $10.00.
database containing 200 TB of disk storage [15]), and the major Web search engines now keep their
search indexes entirely in DRAM.
Although DRAM’s role is increasing, it still tends to be used in limited or specialized ways. In most
cases DRAM is just a cache for some other storage system such as a database; in other cases (such as
search indexes) DRAM is managed in an application-specific fashion. It is difficult for developers to
use DRAM effectively in their applications; for example, the application must manage consistency
between caches and the backing storage. In addition, cache misses and backing store overheads make
it difficult to capture DRAM’s full performance potential.
RAMCloud is a general-purpose storage system that makes it easy for developers to harness the full
performance potential of large-scale DRAM storage. It keeps all data in DRAM all the time, so there
are no cache misses. RAMCloud storage is durable and available, so developers need not manage
a separate backing store. RAMCloud is designed to scale to thousands of servers and hundreds of
terabytes of data while providing uniform low-latency access (5-10 μs round-trip times for small read operations).
The most important factor in the design of RAMCloud was the need to provide a high level of durability and availability without impacting system performance. Replicating all data in DRAM would
have solved some availability issues, but with 3x replication this would have tripled the cost and energy usage of the system. Instead, RAMCloud keeps only a single copy of data in DRAM; redundant
copies are kept on disk or flash, which is both cheaper and more durable than DRAM. However, this
means that a server crash will leave some of the system’s data unavailable until it can be reconstructed
from secondary storage.
RAMCloud’s solution to the availability problem is fast crash recovery: the system reconstructs
the entire contents of a lost server’s memory (64 GB or more) from disk and resumes full service
in 1-2 seconds. We believe this is fast enough to be considered “continuous availability” for most applications.
This paper describes and evaluates RAMCloud’s approach to fast recovery. There are several interesting aspects to the RAMCloud architecture:
• Harnessing scale: RAMCloud takes advantage of the system’s large scale to recover quickly
after crashes. Each server scatters its backup data across all of the other servers, allowing thousands of disks to participate in recovery. Hundreds of recovery masters work together to avoid
network and CPU bottlenecks while recovering data. RAMCloud uses both data parallelism
and pipelining to speed up recovery.
• Log-structured storage: RAMCloud uses techniques similar to those from log-structured file
systems [21], not just for information on disk but also for information in DRAM. The log-structured approach provides high performance and simplifies many issues related to crash recovery.
• Randomization: RAMCloud uses randomized approaches to make decisions in a distributed
and scalable fashion. In some cases randomization is combined with refinement: a server
selects several candidates at random and then chooses among them using more detailed information; this provides near-optimal results at low cost.
• Tablet profiling: RAMCloud uses a novel dynamic tree structure to track the distribution of
data within tables; this helps divide a server’s data into partitions for fast recovery.
We have implemented the RAMCloud architecture in a working system and evaluated its crash recovery properties. Our 60-node cluster recovers in 1.6 seconds from the failure of a server with 35 GB
of data, and the approach scales so that larger clusters can recover larger memory sizes in less time.
Measurements of our randomized replica placement algorithm show that it produces uniform allocations that minimize recovery time and that it largely eliminates straggler effects caused by varying
disk speeds.
Overall, fast crash recovery allows RAMCloud to provide durable and available DRAM-based storage for the same price and energy usage as today’s volatile DRAM caches.
Crash recovery and normal request processing are tightly intertwined in RAMCloud, so this section provides background on the RAMCloud concept and the basic data structures used to process
requests. We have omitted some details because of space limitations.
2.1 Basics
RAMCloud is a storage system where every byte of data is present in DRAM at all times. The
hardware for RAMCloud consists of hundreds or thousands of off-the-shelf servers in a single datacenter, each with as much DRAM as is cost-effective (24 to 64 GB today). RAMCloud aggregates
the DRAM of all these servers into a single coherent storage system. It uses backup copies on disk or
flash to make its storage durable and available, but the performance of the system is determined by
DRAM, not disk.
The RAMCloud architecture combines two interesting properties: low latency and large scale. First,
RAMCloud is designed to provide the lowest possible latency for remote access by applications in
the same datacenter. Our goal is end-to-end times of 5-10 μs for reading small objects in datacenters
with tens of thousands of machines. This represents an improvement of 50-5,000x over existing
datacenter-scale storage systems.
Unfortunately, today’s datacenters cannot meet RAMCloud’s latency goals (Ethernet switches and
NICs typically add at least 200-500 μs to round-trip latency in a large datacenter). Thus we use low-latency Infiniband NICs and switches in our development environment as an approximation to the
networking hardware we hope will be commonplace in a few years; this makes it easier to explore
latency issues in the RAMCloud software. The current RAMCloud system supports 5 μs reads in a
small cluster, and each storage server can handle about 1 million small read requests per second.
The second important property of RAMCloud is scale: a single RAMCloud cluster must support
thousands of servers in order to provide a coherent source of data for large applications. Scale creates
several challenges, such as the likelihood of frequent component failures and the need for distributed
decision-making to avoid bottlenecks. However, scale also creates opportunities, such as the ability
to enlist large numbers of resources on problems like fast crash recovery.
RAMCloud’s overall goal is to enable a new class of applications that manipulate large datasets more
intensively than has ever been possible. For more details on the motivation for RAMCloud and some
of its architectural choices, see [18].
2.2 Data Model
The current data model in RAMCloud is a simple key-value store. RAMCloud supports any number
of tables, each of which contains any number of objects. An object consists of a 64-bit identifier, a
variable-length byte array (up to 1 MB), and a 64-bit version number. RAMCloud provides a simple
set of operations for creating and deleting tables and for reading, writing, and deleting objects within
a table. Objects are addressed with their identifiers and are read and written in their entirety. There is
no built-in support for atomic updates to multiple objects, but RAMCloud does provide a conditional
update (“replace the contents of object O in table T only if its current version number is V ”), which
can be used to implement more complex transactions in application software. In the future we plan to
experiment with more powerful features such as indexes, mini-transactions [4], and support for large
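As a sketch of how the conditional update supports application-level read-modify-write, assuming hypothetical `read` and `conditional_write` client methods (not the actual RAMCloud client API):

```python
class FakeClient:
    """Toy stand-in for a RAMCloud client; read/conditional_write are
    hypothetical method names used only for this illustration."""
    def __init__(self):
        self.objects = {}  # (table, object id) -> (value, version)

    def read(self, table, oid):
        return self.objects[(table, oid)]

    def conditional_write(self, table, oid, value, expected_version):
        # "replace object O in table T only if its current version is V"
        _, version = self.objects[(table, oid)]
        if version != expected_version:
            return False            # a concurrent writer got there first
        self.objects[(table, oid)] = (value, version + 1)
        return True

def atomic_increment(client, table, oid):
    """Optimistic read-modify-write built on the conditional update:
    retry until the version observed at read time is still current."""
    while True:
        value, version = client.read(table, oid)
        if client.conditional_write(table, oid, value + 1, version):
            return value + 1
```

The retry loop is the standard optimistic-concurrency pattern; more complex transactions can be layered on the same primitive in application software.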
Figure 1: RAMCloud cluster architecture. Each storage server contains a master and a backup. A central
coordinator manages the server pool and tablet configuration. Client applications run on separate machines and
access RAMCloud using a client library that makes remote procedure calls.
2.3 System Structure
As shown in Figure 1, a RAMCloud cluster consists of a large number of storage servers, each of
which has two components: a master, which manages RAMCloud objects in its DRAM and services
client requests, and a backup, which stores redundant copies of objects from other masters using
its disk or flash memory. Each RAMCloud cluster also contains one distinguished server called the
coordinator. The coordinator manages configuration information such as the network addresses of
the storage servers and the locations of objects; it is not involved in most client requests.
The coordinator assigns objects to storage servers in units of tablets: consecutive key ranges within a
single table. Small tables are stored in their entirety on a single storage server; larger tables are split
across multiple servers. Client applications do not have control over the tablet configuration; however,
they can achieve some locality by taking advantage of the fact that small tables (and adjacent keys in
large tables) are stored together on a single server.
The coordinator stores the mapping between tablets and storage servers. The RAMCloud client
library maintains a cache of this information, fetching the mappings for each table the first time it
is accessed. Clients can usually issue storage requests directly to the relevant storage server without
involving the coordinator. If a client’s cached configuration information becomes stale because a
tablet has moved, the client library discovers this when it makes a request to a server that no longer
contains the tablet, at which point it flushes the stale data from its cache and fetches up-to-date
information from the coordinator. Clients use the same mechanism during crash recovery to find the
new location for data.
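The client-side caching and invalidation flow can be sketched as follows. This is a minimal Python model, not the actual RAMCloud client library; the class and method names (Coordinator, fetch_tablets, and so on) are illustrative assumptions:

```python
# Sketch of a client-side cache of tablet-to-server mappings with the
# stale-entry retry described above. All names are illustrative.

class Coordinator:
    """Authoritative tablet configuration."""
    def __init__(self):
        self.tablets = {}          # table -> list of (first_key, last_key, server)

    def fetch_tablets(self, table):
        return list(self.tablets[table])

class Server:
    def __init__(self):
        self.objects = {}          # (table, key) -> value, only for owned tablets

    def read(self, table, key):
        if (table, key) in self.objects:
            return True, self.objects[(table, key)]
        return False, None         # server no longer owns this tablet

class Client:
    def __init__(self, coordinator, servers):
        self.coordinator = coordinator
        self.servers = servers     # server name -> Server
        self.cache = {}            # table -> cached tablet list

    def _locate(self, table, key):
        if table not in self.cache:                 # first access: fetch mapping
            self.cache[table] = self.coordinator.fetch_tablets(table)
        for first, last, server in self.cache[table]:
            if first <= key <= last:
                return server
        raise KeyError(key)

    def read(self, table, key):
        while True:
            server = self._locate(table, key)
            ok, value = self.servers[server].read(table, key)
            if ok:
                return value
            # Stale mapping: flush the cached entry and refetch from the
            # coordinator, as described in the text.
            del self.cache[table]
```

The same retry path serves during crash recovery: a request to a dead or reassigned server fails, and the refetched configuration points the client at the new owner.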
2.4 Managing Replicas
The internal structure of a RAMCloud storage server is determined primarily by the need to provide
durability and availability. In the absence of these requirements, a master would consist of little more
than a hash table that maps from table identifier, object identifier pairs to objects in DRAM. The
main challenge is providing durability and availability without sacrificing performance or greatly
increasing system cost.
One possible approach to availability is to replicate each object in the memories of several servers.
However, with a typical replication factor of three, this approach would triple both the cost and energy
usage of the system (each server is already fully loaded, so adding more memory would also require
adding more servers and networking). The cost of main-memory replication can be reduced by using
coding techniques such as parity striping [20], but this makes crash recovery considerably more
expensive. Furthermore, DRAM-based replicas are still vulnerable in the event of power failures.
Instead, RAMCloud keeps only a single copy of each object in DRAM, with redundant copies on
Figure 2: When a master receives a write request, it updates its in-memory log and forwards the new data to
several backups, which buffer the data in their memory. The data is eventually written to disk or flash in large
batches. Backups must use an auxiliary power source to ensure that buffers can be written to stable storage after
a power failure.
secondary storage such as disk or flash. This makes replication nearly free in terms of cost and energy
usage (the DRAM for primary copies will dominate both of these factors), but it raises two issues.
First, the use of slower storage for backup might impact the normal-case performance of the system
(e.g., by waiting for synchronous disk writes). Second, this approach could result in long periods of
unavailability or poor performance after server crashes, since the data will have to be reconstructed
from secondary storage. Section 2.5 describes how RAMCloud solves the performance problem, and
Section 3 deals with crash recovery.
2.5 Log-Structured Storage
RAMCloud manages object data using a logging approach. This was originally motivated by the
desire to transfer backup data to disk or flash as efficiently as possible, but it also provides an efficient
memory management mechanism, enables fast recovery, and has a simple implementation. The data
for each master is organized as a log as shown in Figure 2. When a master receives a write request,
it appends the new object to its in-memory log and forwards that log entry to several backup servers.
The backups buffer this information in memory and return immediately to the master without writing
to disk or flash. The master completes its request and returns to the client once all of the backups
have acknowledged receipt of the log data. When a backup’s buffer fills, it writes the accumulated
log data to disk or flash in a single large transfer, then deletes the buffered data from its memory.
Backups must ensure that buffered log data is as durable as data on disk or flash (i.e., information must
not be lost in a power failure). One solution is to use new DIMM memory modules that incorporate
flash memory and a super-capacitor that provides enough power for the DIMM to write its contents
to flash after a power outage [2]; each backup could use one of these modules to hold all of its
buffered log data. Other alternatives are per-server battery backups that extend power long enough
for RAMCloud to flush buffers, or enterprise disk controllers with persistent cache memory.
RAMCloud manages its logs using techniques similar to those in log-structured file systems [21].
Each master’s log is divided into 8 MB segments. The master keeps a count of unused space within
each segment, which accumulates as objects are deleted or overwritten. It reclaims wasted space by
occasionally invoking a log cleaner; the cleaner selects one or more segments to clean, reads the live
records from the segments and rewrites them at the head of the log, then deletes the cleaned segments
along with their backup copies. Segments are also the unit of buffering and I/O on backups; the large
segment size enables efficient I/O for both disk and flash.
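The segment log, hash table, and cleaner can be sketched as follows. This is a simplified Python model, not RAMCloud's implementation: a "segment" here holds a fixed number of records rather than 8 MB of bytes, and the record format and method names are assumptions:

```python
# Sketch of the log-structured storage described above: appends go to the
# head segment, the hash table tracks each object's current location, and
# the cleaner rewrites live records at the head.

SEGMENT_SIZE = 8  # records per segment in this model; 8 MB in RAMCloud

class Log:
    def __init__(self):
        self.segments = [[]]       # list of segments; the last is the head
        self.hash_table = {}       # (table_id, object_id) -> (seg_idx, offset)

    def write(self, table_id, object_id, value):
        head = self.segments[-1]
        if len(head) >= SEGMENT_SIZE:      # head full: start a new segment
            self.segments.append([])
            head = self.segments[-1]
        head.append((table_id, object_id, value))
        self.hash_table[(table_id, object_id)] = (len(self.segments) - 1,
                                                  len(head) - 1)

    def read(self, table_id, object_id):
        seg, off = self.hash_table[(table_id, object_id)]
        return self.segments[seg][off][2]

    def free_space(self, seg_idx):
        """Unused space: records the hash table no longer points at."""
        seg = self.segments[seg_idx]
        live = sum(1 for off, (t, o, v) in enumerate(seg)
                   if self.hash_table.get((t, o)) == (seg_idx, off))
        return len(seg) - live

    def clean(self, seg_idx):
        """Rewrite live records at the head, then free the cleaned segment."""
        for off, (t, o, v) in enumerate(self.segments[seg_idx]):
            if self.hash_table.get((t, o)) == (seg_idx, off):
                self.write(t, o, v)
        self.segments[seg_idx] = []    # segment (and its backup copies) freed
```

The hash-table check in clean() is the liveness test described in Section 2.5: a record whose hash-table entry points elsewhere has been overwritten or deleted.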
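(Removed duplicate anchor; see the sketch above.)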
Figure 3: (a) Disk bandwidth is a recovery bottleneck if each master’s data is mirrored on a small number of
backup machines. (b) Scattering log segments across many backups removes the disk bottleneck, but recovering
all data on one recovery master is limited by the network interface and CPU of that machine. (c) Fast recovery
is achieved by partitioning the data of the crashed master and recovering each partition on a separate recovery master.
RAMCloud uses a log-structured approach not only for backup storage, but also for information in
DRAM: the memory of a master is structured as a collection of log segments identical to those stored
on backups. This allows masters to manage both their in-memory data and their backup data using a
single mechanism. The log provides an efficient memory management mechanism, with the cleaner
implementing a form of generational garbage collection. In order to support random access to objects
in memory, each master keeps a hash table that maps from table identifier, object identifier pairs
to the current version of an object in a segment. The hash table is used both to look up objects
during storage operations and to determine whether a particular object version is the current one
during cleaning (for example, if there is no hash table entry for a particular object in a segment being
cleaned, it means the object has been deleted).
The buffered logging approach allows writes to complete without waiting for disk operations, but it
limits overall system throughput to the bandwidth of the backup storage. For example, each RAMCloud server can handle about 300,000 100-byte writes/second (versus 1 million reads/second) assuming 2 disks per server, 100 MB/s write bandwidth for each disk, 3 disk replicas of each object, and
a 100% bandwidth overhead for log cleaning. Additional disks can be used to boost write throughput.
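The throughput figure can be recomputed step by step from the stated assumptions (the small remaining gap to the quoted 300,000 writes/second is accounted for by per-record log overhead, which this arithmetic ignores):

```python
# The write-throughput estimate above, from the paper's stated assumptions.

disks_per_server = 2
disk_write_bw = 100e6        # 100 MB/s write bandwidth per disk
replicas = 3                 # disk replicas of each object
cleaning_overhead = 2        # 100% extra bandwidth consumed by log cleaning
object_size = 100            # bytes per object

raw_bw = disks_per_server * disk_write_bw              # 200 MB/s per server
effective_bw = raw_bw / replicas / cleaning_overhead   # ~33 MB/s of new data
writes_per_sec = effective_bw / object_size            # ~330K/s, i.e. ~300K
```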
3 Recovery
When a RAMCloud storage server crashes, the objects that had been present in its DRAM must be reconstructed by replaying its log. This requires reading log segments from backup storage, processing
the records in those segments to identify the current version of each live object, and reconstructing
the hash table used for storage operations. The crashed master’s data will be unavailable until the
hash table has been reconstructed.
Fortunately, if the period of unavailability can be made very short, so that it is no longer than other
delays that are common in normal operation, and if crashes happen infrequently, then crash recovery
will be unnoticeable to the application’s users. We believe that 1-2 second recovery is fast enough to
constitute “continuous availability” for most applications; our goal is to achieve this speed for servers
with at least 64 GB of memory.
3.1 Using Scale
The key to fast recovery in RAMCloud is to take advantage of the massive resources of the cluster. This subsection introduces RAMCloud’s overall approach for harnessing scale; the following
subsections describe individual elements of the mechanism in detail.
As a baseline, Figure 3a shows a simple mirrored approach where each master chooses 3 backups
and stores copies of all its log segments on each backup. Unfortunately, this creates a bottleneck
for recovery because the master’s data must be read from only a few disks. In the configuration of
Figure 3a with 3 disks, it would take about 3.5 minutes to read 64 GB of data.
RAMCloud works around the disk bottleneck by using more disks during recovery. Each master
scatters its log data across all of the backups in the cluster (each segment on a different set of backups)
as shown in Figure 3b. During recovery, these scattered log segments can be read simultaneously;
with 1,000 disks, 64 GB of data can be read into memory in less than one second.
Once the segments have been read from disk into backups’ memories, they must be combined to find
the most recent version for each object (no backup can tell in isolation whether a particular object in
a particular segment is the most recent version). One approach is to send all the log segments to a
single recovery master and replay the log on that master, as in Figure 3b. Unfortunately, the recovery
master is a bottleneck in this approach: with a 10 Gbps network interface, it will take about 1 minute
to read 64 GB of data, and the master’s CPU will also be a bottleneck.
To eliminate the recovery master as the bottleneck, RAMCloud uses multiple recovery masters as
shown in Figure 3c. During recovery RAMCloud divides the objects of the crashed master into
partitions of roughly equal size. Each partition is assigned to a different recovery master, which
fetches the log data for the partition’s objects from backups and incorporates those objects into its
own log and hash table. With 100 recovery masters operating in parallel, 64 GB of data can be
transferred over a 10 Gbps network in less than 1 second. As will be shown in Section 4, this is also
enough time for each recovery master’s CPU to process the incoming data.
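The recovery-time estimates above and in Figure 3 follow from straightforward bandwidth arithmetic:

```python
# The recovery-time estimates of Figure 3, recomputed.

data = 64e9                    # 64 GB of DRAM to recover
disk_bw = 100e6                # 100 MB/s read bandwidth per disk
nic_bw = 10e9 / 8              # 10 Gbps network interface, in bytes/s

t_mirrored = data / (3 * disk_bw)       # Fig. 3a: 3 disks -> ~213 s (~3.5 min)
t_scattered = data / (1000 * disk_bw)   # Fig. 3b: 1,000 disks -> ~0.64 s
t_one_master = data / nic_bw            # one recovery master -> ~51 s (~1 min)
t_many_masters = t_one_master / 100     # Fig. 3c: 100 masters -> ~0.5 s
```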
Thus, the overall approach to recovery in RAMCloud is to combine the disk bandwidth, network
bandwidth, and CPU cycles of thousands of backups and hundreds of recovery masters. The subsections below describe how RAMCloud divides its work among all of these resources and how it
coordinates the resources to recover in 1-2 seconds.
3.2 Scattering Log Segments
For fastest recovery the log segments for each RAMCloud master should be distributed uniformly
across all of the backups in the cluster. However, there are several factors that complicate this approach:
• Segment placement must reflect failure modes. For example, a segment’s master and each of
its backups must reside in different racks, in order to protect against top-of-rack switch failures
and other problems that disable an entire rack.
• Different backups may have different bandwidth for I/O (different numbers of disks, different
disk speeds, or different storage classes such as flash memory); segments should be distributed
so that each backup uses the same amount of time to read its share of the data during recovery.
• All of the masters are writing segments simultaneously; they should coordinate to avoid overloading any individual backup. Backups have limited buffer space.
• Storage servers are continuously entering and leaving the cluster, which changes the pool of
available backups and may unbalance the distribution of segments.
Making decisions such as segment replica placement in a centralized fashion on the coordinator
would limit RAMCloud’s scalability. For example, a cluster with 10,000 servers could back up
100,000 or more segments per second; this could easily cause the coordinator to become a performance bottleneck.
Instead, each RAMCloud master decides independently where to place each replica, using a combination of randomization and refinement. When a master needs to select a backup for a segment, it
chooses several candidates at random from a list of all backups in the cluster. Then it selects the best
candidate, using its knowledge of where it has already allocated segment replicas and information
about the speed of each backup’s disk (backups measure the speed of their disks when they start up
and provide this information to the coordinator, which relays it on to masters). The best backup is the
one that can read its share of the master’s segment replicas most quickly from disk during recovery.
A backup is rejected if it is in the same rack as the master or any other replica for the current segment.
Once a backup has been selected, the master contacts that backup to reserve space for the segment.
At this point the backup can reject the request if it is overloaded, in which case the master selects
another candidate.
The use of randomization eliminates pathological behaviors such as all masters choosing the same
backups in a lock-step fashion. Adding the refinement step provides a solution nearly as optimal as
a centralized manager (see [17] and [5] for a theoretical analysis). For example, if a master scatters
8,000 segments across 1,000 backups using a purely random approach, backups will have 8 segments
on average. However, some backups are likely to end up with 15-20 segments, which will result in
uneven disk utilization during recovery. Adding just a small amount of choice makes the segment
distribution nearly uniform and also allows for compensation based on other factors such as disk
speed (see Section 4.4). This mechanism also handles the entry of new backups gracefully: a new
backup is likely to be selected more frequently than existing backups until every master has taken full
advantage of it.
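The placement heuristic can be sketched as follows. This is an illustrative model, not RAMCloud's code: the load estimate stands in for the real prediction of each backup's disk read time during recovery, and all names are assumptions:

```python
# Sketch of randomized replica placement with refinement ("power of
# choices"): examine a few random candidates and pick the best, rather
# than scanning every backup in the cluster.

import random
from collections import namedtuple

Backup = namedtuple("Backup", ["name", "rack"])

def choose_backup(backups, replica_counts, disk_speeds, exclude_racks,
                  candidates=5, rng=random):
    """Pick the best of several random candidates.

    replica_counts: replicas this master already placed on each backup.
    disk_speeds: relative disk read speed reported by each backup.
    exclude_racks: racks holding the master or another replica of this segment.
    """
    best = None
    for _ in range(candidates):
        b = rng.choice(backups)
        if b.rack in exclude_racks:      # rack-level fault isolation
            continue
        # Estimated time for this backup to read its share during recovery:
        # replicas it would then hold, divided by its disk speed.
        load = (replica_counts[b.name] + 1) / disk_speeds[b.name]
        if best is None or load < best[0]:
            best = (load, b)
    return best[1] if best else None
```

In the real system the chosen backup can still reject the reservation if it is overloaded, in which case the master simply picks another candidate.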
RAMCloud masters mark one of the replicas for each segment as the primary replica. Only the
primary replicas are read during recovery (unless they are unavailable), and the performance optimizations described above consider only primary replicas.
We considered the possibility of storing one of the backup replicas on the same machine as the
master. This would reduce network bandwidth requirements, but it has two disadvantages. First, it
would reduce system fault tolerance: the master already has one copy in its memory, so placing a
second copy on the master’s disk provides little benefit. If the master crashes, the disk copy will be
lost along with the memory copy; it would only provide value in a cold start after a power failure.
Second, storing one replica on the master would limit the burst write bandwidth of a master to the
bandwidth of its local disks. In contrast, with all replicas scattered, a single master can potentially
use the disk bandwidth of the entire cluster (up to the limit of its network interface).
3.3 Failure Detection
RAMCloud detects server failures in two ways. First, RAMCloud clients will notice if a server fails
to respond to a remote procedure call. Second, RAMCloud checks its own servers to detect failures
even in the absence of client activity; this allows RAMCloud to replace lost replicas before multiple
crashes cause permanent data loss. Each RAMCloud server periodically issues a ping RPC to another
server chosen at random and reports failures to the coordinator. This is another example of using a
randomized distributed approach in place of a centralized approach. The probability of detecting a
crashed machine in a single round of pings is about 63% for clusters with 100 or more nodes; the
odds are greater than 99% that a failed server will be detected within five rounds.
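The quoted probabilities follow from each of the n-1 surviving servers independently picking the crashed server with probability 1/(n-1):

```python
# The detection probabilities quoted above: a crashed server goes unnoticed
# in a round only if none of the n-1 survivors happens to ping it.

n = 100
p_miss_round = (1 - 1 / (n - 1)) ** (n - 1)   # approaches 1/e for large n
p_one_round = 1 - p_miss_round                # ~0.63
p_five_rounds = 1 - p_miss_round ** 5         # > 0.99
```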
In either case, server failures are reported to the coordinator. The coordinator verifies the problem
by attempting to communicate with the server itself, then initiates recovery if the server does not
respond. Timeouts must be relatively short (tens of milliseconds) so that they don’t significantly
delay recovery. See Section 5 for a discussion of the risks introduced by short timeouts.
3.4 Recovery Flow
The coordinator supervises the recovery process, which proceeds in three phases:
1. Setup. The coordinator finds all replicas of all log segments belonging to the crashed master,
selects recovery masters, and assigns each recovery master a partition to recover.
2. Replay. Recovery masters fetch log segments in parallel and incorporate the crashed master’s
partitions into their own logs.
3. Cleanup. Recovery masters begin serving requests, and the crashed master’s log segments are
freed from backup storage.
These phases are described in more detail below.
3.5 Setup
3.5.1 Finding Log Segment Replicas
At the start of recovery, replicas of the crashed master’s segments must be located among the cluster’s backups. RAMCloud does not keep a centralized map of replicas since it would be difficult to
scale and would hinder common-case performance. Only masters know where their segments are
replicated, but this information is lost when they crash.
The coordinator reconstructs the locations of the crashed master’s replicas by querying all of the
backups in the cluster. Each backup responds with a list of the replicas it has stored for the crashed
master (backups maintain this index in memory). The coordinator then aggregates the responses into
a single location map. By using RAMCloud’s fast RPC system and querying multiple backups in
parallel, the segment location information is collected quickly.
3.5.2 Detecting Incomplete Logs
After backups return their lists of replicas, the coordinator must determine whether the reported
segment replicas form the entire log of the crashed master. The redundancy in RAMCloud makes
it highly likely that the entire log will be available, but the system must be able to detect situations
where some data is missing (such as network partitions).
RAMCloud avoids centrally tracking the list of the segments that comprise a master’s log by making
each log self-describing; the completeness of the log can be verified using data in the log itself. Each
segment includes a log digest, which is a list of identifiers for all segments in the log at the time this
segment was written. Log digests are small (less than 1% storage overhead even when uncompressed,
assuming 8 MB segments and 8,000 segments per master).
This leaves a chance that all the replicas for the newest segment in the log are unavailable, in which
case the coordinator would not be able to detect that the log is incomplete (the most recent digest it
could find would not list the newest segment). To prevent this, when a master creates a new segment
replica it makes its transition to the new digest carefully. First, a new digest is inserted in the new
replica, and it is marked as active. Then, after the new active digest is durable, a final update to the
prior active digest marks it as inactive. This ordering ensures the log always has an active digest, even
if the master crashes between segments. Two active log digests may be discovered during recovery,
but the coordinator simply ignores the newer one since its segment must be empty.
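The completeness check can be sketched as follows; the replica and digest representation is an illustrative assumption, not RAMCloud's on-disk format:

```python
# Sketch of the log-completeness check: find an active digest among the
# replicas reported by backups, then verify that every segment it lists
# has at least one replica.

def check_log_complete(replicas):
    """replicas: list of dicts {seg_id, digest (list of seg ids or None),
    active (bool)}. Returns (complete, missing_segment_ids)."""
    found = {r["seg_id"] for r in replicas}
    digests = [r["digest"] for r in replicas if r.get("active")]
    if not digests:
        return False, None           # no active digest: cannot verify the log
    # If two active digests are found, the newer segment must be empty, so
    # the shorter (older) digest suffices, as noted above.
    digest = min(digests, key=len)
    missing = sorted(s for s in digest if s not in found)
    return not missing, missing
```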
If the active log digest and a replica for each segment cannot be found, then RAMCloud cannot
recover the crashed master. In this unlikely case, RAMCloud notifies the operator and waits for
backups to return to the cluster with replicas for each of the missing segments. Alternatively, at the
operator’s discretion, RAMCloud can continue recovery with loss of data.
3.5.3 Starting Partition Recoveries
Next, the coordinator must divide up the work of recovering the crashed master. The choice of
partitions for a crashed master is made by the master itself: during normal operation each master
analyzes its own data and computes a set of partitions that would evenly divide the work of recovery.
This information is called a will (it describes how a master’s assets should be divided in the event
of its demise). Masters periodically upload their wills to the coordinator. Section 3.9 describes how
masters compute their wills efficiently.
During recovery setup, the coordinator assigns each of the partitions in the crashed master’s will to
an existing master within the cluster. Each of these recovery masters receives two things from the
Figure 4: During recovery, segment data flows from disk or flash on a backup over the network to a recovery
master, then back to new backups as part of the recovery master’s log.
coordinator: a list of the locations of all the crashed master’s log segments and a list of tablets that
the recovery master must recover and incorporate into the data it manages.
3.6 Replay
The vast majority of recovery time is spent replaying segments to reconstruct partitions on the recovery masters. During replay the contents of each segment are processed in six stages (see Figure 4):
1. The segment is read from disk into the memory of a backup.
2. The backup divides the records in the segment into separate groups for each partition based on
table and object identifiers in the log records.
3. The records for each partition are transferred over the network to the recovery master for that partition.
4. The recovery master incorporates the data into its in-memory log and hash table.
5. As the recovery master fills segments in memory, it replicates those segments over the network
to backups with the same scattering mechanism used in normal operation.
6. The backups write the new segment replicas to disk or flash.
RAMCloud harnesses concurrency in two dimensions during recovery. The first dimension is data
parallelism: different backups read different segments from disk in parallel, different recovery masters reconstruct different partitions in parallel, and so on. The second dimension is pipelining: all of
the six stages listed above proceed in parallel, with a segment as the basic unit of work. While one
segment is being read from disk on a backup, another segment is being partitioned by that backup’s
CPU, and records from another segment are being transferred to a recovery master; similar pipelining
occurs on recovery masters. For fastest recovery all of the resources of the cluster should be kept fully
utilized, including disks, CPUs, and the network.
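The pipelining can be sketched as a chain of worker threads connected by queues, with a segment as the unit of work. Three generic stages stand in for the six above; this is a model of the structure, not RAMCloud's implementation:

```python
# Sketch of the segment pipeline: each stage runs in its own thread and
# hands completed segments to the next stage through a queue, so different
# segments occupy different stages at the same time.

import queue
import threading

def stage(fn, inq, outq):
    while True:
        seg = inq.get()
        if seg is None:        # sentinel: propagate shutdown downstream
            outq.put(None)
            return
        outq.put(fn(seg))

def run_pipeline(segments, stages):
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=stage,
                                args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for seg in segments:       # feed segments into the first stage
        queues[0].put(seg)
    queues[0].put(None)
    results = []
    while True:                # drain the last stage
        item = queues[-1].get()
        if item is None:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```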
3.7 Segment Replay Order
In order to maximize concurrency, recovery masters and backups operate independently. As soon
as the coordinator contacts each backup to obtain its list of segments, the backup begins prefetching
segments from disk and dividing them by partition. At the same time, masters fetch segment data
from backups and replay it. Ideally backups will constantly run ahead of masters, so that segment
data is ready and waiting whenever a recovery master requests it. However, this only works if the
recovery masters and backups process segments in the same order. If a recovery master accidentally
requests the last segment in the backup’s order then the master will stall: it will not receive any data
to process until the backup has read all of its segments.
In order to avoid pipeline stalls, each backup decides in advance the order in which it will read its
segments. It returns this information to the coordinator during the setup phase, and the coordinator
includes the order information when it communicates with recovery masters to initiate recovery. Each
recovery master uses its knowledge of backup disk speeds to estimate when each segment’s data is
likely to be loaded. It then requests segment data in order of expected availability. (This approach
causes all masters to request segments in the same order; we could introduce randomization to avoid
contention caused by lock-step behavior.)
Unfortunately, there will still be variations in the speed at which backups read and process segments.
In order to avoid stalls because of slow backups, each master keeps several concurrent requests for
segment data outstanding at any given time during recovery; it replays segment data in the order that
the requests return.
Because of the optimizations described above, recovery masters will end up replaying segments in a
different order than the one in which the segments were originally written. Fortunately, the version
numbers in log records allow the log to be replayed in any order without affecting the result. During replay each master simply retains the version of each object with the highest version number,
discarding any older versions that it encounters.
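The version-number rule reduces to a simple merge, sketched here with an assumed record layout:

```python
# Sketch of order-independent replay: keep only the highest version of
# each object, regardless of the order in which records arrive.

def replay(records, hash_table):
    """records: iterable of (table_id, object_id, version, value)."""
    for t, o, ver, val in records:
        key = (t, o)
        if key not in hash_table or hash_table[key][0] < ver:
            hash_table[key] = (ver, val)
    return hash_table
```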
Although each segment has multiple replicas stored on different backups, only the primary replicas
are read during recovery; reading more than one would waste valuable disk bandwidth. Masters
identify primary replicas when scattering their segments as described in Section 3.2. During recovery
each backup reports all of its segments, but it identifies the primary replicas and only prefetches the
primary replicas from disk. Recovery masters request non-primary replicas only if there is a failure
reading the primary replica.
3.8 Cleanup
After a recovery master completes the recovery of its assigned partition, it notifies the coordinator
that it is ready to service requests. The coordinator updates its configuration information to indicate
that the master now owns the tablets in the recovered partition, at which point the partition is available
for client requests. Clients with failed RPCs to the crashed master have been waiting for new configuration information to appear; they discover it and retry their RPCs with the new master. Recovery
masters can begin service independently without waiting for other recovery masters to finish.
Once all recovery masters have completed recovery, the coordinator contacts each of the backups
again. At this point the backups free the storage for the crashed master’s segments, since it is no
longer needed. Recovery is complete once all of the backups have been notified.
3.9 Tablet Profiling
Each master is responsible for creating a will, which describes how its objects should be partitioned
during recovery. A partition consists of one or more tablets. The master should balance its partitions
so that they require roughly equal time to recover, and the partitions should be sized based on the
desired recovery time. The master’s storage is not actually partitioned during normal operation as this
would create unnecessary overheads; partitioning only occurs during recovery. The master uploads
its will to the coordinator and updates the will as its data evolves.
RAMCloud computes wills using tablet profiles. Each tablet profile tracks the distribution of resource
usage within a single table or tablet in a master. It consists of a collection of buckets, each of which
Figure 5: A tablet profile consists of a hierarchical collection of bucket arrays; buckets are subdivided dynamically when their counts become large. The tree structure creates (bounded) uncertainty when assigning partition
boundaries, since counts in ancestor buckets may represent objects either before or after the boundary.
counts the number of log records corresponding to a range of object identifiers, along with the total
log space consumed by those records. Tablet profiles are updated as new log records are created and
old segments are cleaned, and the master periodically scans its tablet profiles to compute a new will.
Unfortunately, it isn’t possible to choose the buckets for a tablet profile statically because the space
of object identifiers is large (2^64) and clients can allocate object identifiers however they wish. With
any static choice of buckets, it is possible that all of the objects in a table could end up in a single
bucket, which would provide no information for partitioning. Buckets must be chosen dynamically
so that the contents of each bucket are small compared to the contents of a partition.
RAMCloud represents a tablet profile as a dynamic tree of bucket arrays, as shown in Figure 5.
Initially the tree consists of a single bucket array that divides the entire 64-bit identifier space into
buckets of equal width (in the current implementation there are 256 buckets in each array). Whenever
a master creates a new log record it updates the appropriate bucket. If a bucket becomes too large
(the number of records or space usage exceeds a threshold) then a child bucket array is created to
subdivide the bucket’s range into smaller buckets. Future log records are profiled in the child bucket
array instead of the parent. However, the counts in the parent bucket remain (RAMCloud does not
attempt to redistribute them in the child bucket array since this could require rescanning a large
portion of the log). The master decrements bucket counts when it cleans log segments. Each bucket
array records the position of the log head when that array was created, and the master uses this
information during cleaning to decrement the same bucket that was incremented when the record was
created (thus, over time the counts in non-leaf buckets are likely to become small). Bucket arrays are
collapsed back into their parents when usage drops.
To calculate partitions, a master scans its tablet profiles in a depth-first search, accumulating counts
of records and space usage and establishing partition boundaries whenever the counts reach threshold
values. For example, one policy might be to assign partitions based on log space usage so that no
partition has more than 600 MB of log data or more than three million objects.
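The scan can be sketched over a flat bucket array; the real structure is the hierarchical bucket tree of Figure 5, and the representation and thresholds here are illustrative:

```python
# Sketch of the will computation: walk buckets in key order, accumulate
# space usage, and close a partition whenever adding the next bucket
# would exceed the per-partition limit.

def compute_will(buckets, max_bytes):
    """buckets: list of (first_key, last_key, bytes_used) in key order.
    Returns a list of (first_key, last_key) partitions."""
    partitions, start, total = [], None, 0
    for first, last, used in buckets:
        if start is None:
            start, total = first, 0
        if total + used > max_bytes and total > 0:
            partitions.append((start, first - 1))   # close current partition
            start, total = first, 0
        total += used
    if start is not None:
        partitions.append((start, buckets[-1][1]))
    return partitions
```

The real policy also caps the object count per partition (e.g., three million objects); adding a second accumulator follows the same pattern.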
The tablet profile structure creates uncertainty in the actual usage of a partition, as illustrated in Figure 5. If a partition boundary is placed at the beginning of a leaf bucket, it isn’t possible to tell
whether counts in ancestor buckets belong to the new partition or the previous one. Fortunately, the
uncertainty is bounded. For example, in the current RAMCloud implementation, there could be up
to 7 ancestor buckets, each of which could account for 8 MB of data (the threshold for subdividing a
bucket), for a worst-case uncertainty of 56 MB for each partition boundary. In order to bound recov-
ery times, RAMCloud pessimistically assumes that unknown counts fall within the current partition.
In the configuration used for RAMCloud, the memory overhead for tablet profiles is 0.6% in the
worst case (8 levels of bucket array for 8 MB of data). The parameters of the tablet profile can be
changed to make trade-offs between the storage overhead for profiles and the accuracy of partition boundaries.
3.10 Consistency
We designed RAMCloud to provide a strong form of consistency (linearizability [13], which requires
exactly-once semantics), even across host failures and network partitions. A full discussion of RAMCloud’s consistency architecture is beyond the scope of this paper, and the implementation is not yet
complete; however, it affects crash recovery in two ways. First, a master that is suspected of failure
(a sick master) must stop servicing requests before it can be recovered, to ensure that applications
always read and write the latest version of each object. Second, when recovering from suspected
coordinator failures, RAMCloud must ensure that only one coordinator can manipulate and serve the
cluster’s configuration at a time.
RAMCloud will disable a sick master’s backup operations when it starts recovery, so the sick master
will be forced to contact the coordinator to continue servicing writes. The coordinator contacts backups at the start of recovery to locate a replica of every segment in the sick master’s log, including the
active segment to which the master may still be writing. Once a backup with a replica of the active
segment has been contacted, it will reject backup operations from the sick master with an indication
that the master must stop servicing requests until it has contacted the coordinator. Masters will periodically check in with their backups, so disabling a master’s backup operations will also stop it from
servicing read requests by the time recovery completes.
Coordinator failures will be handled safely using the ZooKeeper service [14]. The coordinator will
use ZooKeeper to store its configuration information, which consists of a list of active storage servers
along with the tablets they manage. ZooKeeper uses its own replication mechanisms to provide
a high level of durability and availability for this information. To handle coordinator failures, the
active coordinator and additional standby coordinators will compete for a single coordinator lease
in ZooKeeper, which ensures that at most one coordinator runs at a time. If the active coordinator
fails or becomes disconnected, its lease will expire and it will stop servicing requests. An arbitrary
standby coordinator will acquire the lease, read the configuration information from ZooKeeper, and
resume service. The configuration information is small, so we expect to recover from coordinator
failures just as quickly as other server failures.
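The single-coordinator-lease behavior described above can be modeled with a toy in-memory lease. This is a sketch of the idea, not RAMCloud's actual ZooKeeper integration; the class and method names are ours:

```python
class LeaseService:
    """Toy model of a single coordinator lease: at most one holder at a
    time, and a standby can take over only after the lease expires."""
    def __init__(self, ttl):
        self.ttl = ttl
        self.holder = None
        self.expires = 0.0

    def try_acquire(self, candidate, now):
        # A standby may take the lease only once the current one expires.
        if self.holder is None or now >= self.expires:
            self.holder = candidate
            self.expires = now + self.ttl
            return True
        return False

    def renew(self, holder, now):
        # The active coordinator must renew before expiry or lose the lease.
        if self.holder == holder and now < self.expires:
            self.expires = now + self.ttl
            return True
        return False

svc = LeaseService(ttl=1.0)
assert svc.try_acquire("coord-A", now=0.0)      # active coordinator
assert not svc.try_acquire("coord-B", now=0.5)  # standby is blocked
assert svc.try_acquire("coord-B", now=1.5)      # A's lease expired; B takes over
```

In the real system the lease lives in ZooKeeper, so its replication mechanisms, not the coordinator's, guarantee that expiry and acquisition are seen consistently.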
3.11 Additional Failure Modes
Our work on RAMCloud so far has focused on recovering the data stored in the DRAM of a single
failed master. The sections below describe several other ways in which failures can occur in a RAMCloud cluster and some preliminary ideas for dealing with them; we defer a full treatment of these
topics to future work.
3.11.1 Backup Failures
RAMCloud handles the failure of a backup server by creating new replicas to replace the ones on
the failed backup. Every master is likely to have at least one segment replica on the failed backup,
so the coordinator notifies all of the masters in the cluster when it detects a backup failure. Each
master checks its segment table to identify segments stored on the failed backup, then it creates new
replicas using the approach described in Section 3.2. All of the masters perform their rereplication
concurrently and the new replicas are scattered across all of the disks in the cluster, so recovery from
backup failures is fast. If each master has 64 GB of memory then each backup will have about 192
GB of data that must be rewritten (assuming 3 replicas for each segment). For comparison, 256 GB
of data must be transferred to recover a dead master: 64 GB must be read, then 192 GB must be
written during rereplication.
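The per-master bookkeeping for backup-failure rereplication can be sketched as follows. This is an illustrative model, not RAMCloud's code; placement is simplified to uniform random, whereas the real algorithm (Section 3.2) also weighs disk speed and load:

```python
import random

def rereplicate_after_backup_failure(segment_tables, failed_backup, backups):
    """segment_tables maps master -> {segment id -> set of backups holding
    a replica}. Each master scans its own table for replicas on the failed
    backup and picks a replacement; returns the list of new placements."""
    moves = []
    for master, table in segment_tables.items():
        for seg_id, replicas in table.items():
            if failed_backup in replicas:
                candidates = [b for b in backups
                              if b != failed_backup and b not in replicas]
                new_backup = random.choice(candidates)
                replicas.discard(failed_backup)
                replicas.add(new_backup)
                moves.append((master, seg_id, new_backup))
    return moves

# All masters rereplicate concurrently in RAMCloud; here we just show the
# bookkeeping for two masters after backup "b2" fails.
random.seed(1)
tables = {"m1": {0: {"b1", "b2", "b3"}, 1: {"b2", "b4", "b5"}},
          "m2": {0: {"b1", "b4", "b5"}}}
moves = rereplicate_after_backup_failure(
    tables, "b2", ["b1", "b2", "b3", "b4", "b5", "b6"])
assert len(moves) == 2                    # only m1 had replicas on b2
for table in tables.values():
    for replicas in table.values():
        assert "b2" not in replicas and len(replicas) == 3
```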
CPU: Xeon X3470 (4x2.93 GHz cores, 3.6 GHz Turbo)
RAM: 16 GB DDR3 at 1333 MHz
Disk 1: WD 2503ABYX (7200 RPM, 250 GB); effective read/write: 105/110 MB/s
Disk 2: Seagate ST3500418AS (7200 RPM, 500 GB); effective read/write: 108/87 MB/s
Flash: Crucial M4 CT128M4SSD2 (128 GB); effective read/write: 269/182 MB/s
NIC: Mellanox ConnectX-2 Infiniband HCA
Switches: 5x 36-port Mellanox InfiniScale IV (4X QDR)
Table 1: Experimental cluster configuration. All 60 nodes have identical hardware. Effective disk bandwidth is
the average throughput from 1,000 8 MB sequential accesses to random locations in the first 72 GB of the disk.
Flash drives were used in place of disks for Figure 9 only. The cluster has 5 network switches arranged in two
layers. Each port’s maximum network bandwidth is 32 Gbps, but nodes are limited to about 25 Gbps by PCI
Express. The switching fabric is oversubscribed, providing at best about 22 Gbps of bisection bandwidth per node
when congested.
3.11.2 Multiple Failures
Given the large number of servers in a RAMCloud cluster, there will be times when multiple servers
fail simultaneously. When this happens, RAMCloud recovers from each failure independently. The
only difference in recovery is that some of the primary replicas for each failed server may have been
stored on the other failed servers. In this case the recovery masters will use secondary replicas;
recovery will complete as long as there is at least one replica available for each segment. It should
be possible to recover multiple failures concurrently; for example, if a RAMCloud cluster contains
5,000 servers with flash drives for backup, the measurements in Section 4 indicate that a rack failure
that disables 40 masters, each with 64 GB storage, could be recovered in about 2 seconds.
If many servers fail simultaneously, such as in a power failure that disables many racks, RAMCloud
may not be able to recover immediately. This problem arises if no replicas are available for a lost
segment or if the remaining masters do not have enough spare capacity to take over for all the lost
masters. In this case RAMCloud must wait until enough machines have rebooted to provide the
necessary data and capacity (alternatively, an operator can request that the system continue with
some loss of data). RAMCloud clusters should be configured with enough redundancy and spare
capacity to make situations like this rare.
3.11.3 Cold Start
RAMCloud must guarantee the durability of its data even if the entire cluster loses power at once.
In this case the cluster will need to perform a “cold start” when power returns. Normally, when a
backup restarts, it discards all of the segments stored on its disk or flash, since they have already
been rereplicated elsewhere. However, in a cold start this information must be preserved. Backups
will contact the coordinator as they reboot, and the coordinator will instruct them to retain existing
data; it will also retrieve a list of their segments. Once a quorum of backups has become available,
the coordinator will begin reconstructing masters. RAMCloud can use the same partitioned approach
described above, but it may make more sense to use a different approach where masters are reconstructed exactly as they existed before the cold start. This will be faster than the partitioned approach
because masters will not need to write any backup data: the existing backups can continue to serve
after the masters are reconstructed.
The current RAMCloud implementation does not perform cold starts.
4 Evaluation
We implemented the RAMCloud architecture described in Sections 2 and 3, and we evaluated the
performance and scalability of crash recovery using a 60-node cluster. The cluster hardware consists
of standard off-the-shelf components (see Table 1) with the exception of its networking equipment,
which is based on Infiniband; with it our end hosts achieve both high bandwidth (25 Gbps) and low
[Figure 6 plot: Recovery Time (ms) vs. Partition Size (MB), with curves for 128 B, 256 B, and 1 KB objects.]
Figure 6: Recovery time as a function of partition size with a single recovery master and 60 backups. Each curve
uses objects of a single uniform size.
latency (user-level applications can communicate directly with the NICs to send and receive packets,
bypassing the kernel).
The default experimental configuration used one backup server on each machine, with a single disk.
A subset of these machines also ran recovery masters. One additional machine ran the coordinator,
the crashed master, and the client application. In order to increase the effective scale of the system,
some experiments ran two independent backup servers on each machine (each with one disk).
In each experiment a client application observed and measured a crash of a single master and the
subsequent recovery. The client initially filled the master with objects of a single size (1,024 bytes
by default). It then sent a magic RPC to the coordinator which caused it to recover the master. The
client waited until all partitions had been successfully recovered, then read a value from one of those
partitions and reported the end-to-end recovery time. All experiments used a disk replication factor
of 3 (i.e., 3 replicas on disk in addition to one copy in DRAM). The CPUs, disks, and networks were
idle and entirely dedicated to recovery (in practice, recovery would have to compete for resources
with application workloads, though we would argue for giving priority to recovery).
Each of the subsections below addresses one question related to the performance of recovery. The
overall results are:
• A 60-node cluster can recover lost data at about 22 GB/sec (a crashed server with 35 GB storage
can be recovered in 1.6 seconds), and recovery performance scales with cluster size. However,
our scalability measurements are limited by the small size of our test cluster.
• The speed of an individual recovery master is limited primarily by network speed for writing
new segment replicas.
• The segment scattering algorithm distributes segments effectively and compensates for varying
disk speeds.
• Fast recovery significantly reduces the risk of data loss.
4.1 How Large Should Partitions Be?
Our first measurements provide data for configuring RAMCloud (partition size and number of disks
needed per recovery master). Figure 6 measures how quickly a single recovery master can process
[Figure 7 plot: Recovery Time (ms) vs. Number of Backups (Disks), with curves for Total Recovery, Max. Disk Reading, and Avg. Disk Reading.]
Figure 7: Recovery time as a function of the number of disks, with a single recovery master, one 600 MB partition
with 1,024 byte objects, and each disk on a separate machine. “Avg. Disk Reading” measures the average elapsed
time (across all disks) to read backup data during recovery; “Max. Disk Reading” graphs the longest time for any
disk in the cluster. Once 6-8 disks are available recovery time is limited by the network of the recovery master.
backup data, assuming enough backups to keep the recovery master fully occupied. Depending on the
object size, a recovery master can replay log data at a rate of 400-800 MB/s, including the overhead
for reading the data from backups and writing new backup copies. With small objects the speed
of recovery is limited by the cost of updating the hash table and tablet profiles. With large objects
recovery is limited by the network speed during writes to new backups (for example, with 600 MB
partitions and a disk replication factor of 3, the recovery master must write 1.8 GB of data to backups).
For 1-second recovery Figure 6 suggests that partitions should be limited to no more than 800 MB
and no more than 3 million log records (with 128-byte objects a recovery master can process 400 MB
of data per second, which is roughly 3 million log records). With 10 Gbps Ethernet, partitions must
be limited to 300 MB due to the bandwidth requirements for rereplication.
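The sizing figures above follow from simple arithmetic. The constants below come from the text; the 4x network factor in the Ethernet bound is one plausible accounting (each partition byte crosses the recovery master's NIC once inbound from backups and three times outbound for replicas), which lands close to the quoted 300 MB:

```python
# Replay rate for 128-byte objects (from the text) gives the log-record
# budget for 1-second recovery.
records_per_sec = 400e6 / 128
assert abs(records_per_sec - 3.125e6) < 1e3   # roughly 3 million records/s

# 10 Gbps Ethernet bound on partition size with a replication factor of 3:
# each byte crosses the master's NIC ~4 times (1 read + 3 replica writes).
nic_bytes_per_sec = 10e9 / 8
bound_mb = nic_bytes_per_sec / 4 / 1e6
assert 300 <= bound_mb <= 315   # ~312 MB, matching the ~300 MB figure
```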
In our measurements we filled the log with live objects, but the presence of deleted versions will,
if anything, make recovery faster. The master’s memory has the same log structure as the backup
replicas, so the amount of log data to read will always be equal to the size of the master’s memory,
regardless of deleted versions. However, deleted versions may not need to be rereplicated (depending
on the order of replay).
4.2 How Many Disks Are Needed for Each Recovery Master?
Each of our disks provided an effective bandwidth of 100-110 MB/s when reading 8 MB segments;
combined with Figure 6, this suggests that RAMCloud will need about 6-8 disks for each recovery
master in order to keep the pipeline full. Figure 7 graphs recovery performance with one recovery
master and a varying number of disks and reaches the same conclusion. With large numbers of disks,
the speed of recovery is limited by outbound network bandwidth on the recovery master.
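The 6-8 disk requirement is simple division of the rates quoted above:

```python
# Replay rates from Figure 6 (600-800 MB/s per recovery master) divided by
# per-disk read bandwidth (~100 MB/s, Table 1) give the disks needed to
# keep one recovery master's pipeline full.
for replay_mb_s in (600, 800):
    disks_needed = replay_mb_s / 100
    assert 6 <= disks_needed <= 8
```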
4.3 How Well Does Recovery Scale?
The most important issue in recovery for RAMCloud is scalability: if one recovery master can recover
600 MB of data in one second, can 10 recovery masters recover 6 GB in the same time, and can 100
recovery masters recover 60 GB? Unfortunately, the disk bandwidth available in our cluster limited us
to 20 recovery masters (120 backups), which is only about 20% of the number we would expect in a full-size RAMCloud recovery. Nonetheless, within this limited range RAMCloud demonstrates excellent
scalability. Figure 8 graphs recovery time as the amount of lost data is increased and the cluster size is
increased to match. For each 600 MB partition of lost data, the cluster includes one recovery master
[Figure 8 plot: Recovery Time (ms) vs. Number of 600 MB Partitions (Recovery Masters), with curves for Total Recovery, Max. Disk Reading, and Avg. Disk Reading.]
Figure 8: Recovery performance under proportional scaling (one recovery master and 6 backups for each 600 MB
partition of data to recover). Each recovery master shared a host with 2 backups, and each point is an average of 5
runs (Figure 11 shows the variance between runs). A horizontal line would indicate perfect scalability. Recovery
time is limited by disk bandwidth.
and 6 backups with one disk each. With 20 recovery masters and 120 disks, RAMCloud can recover
11.7 GB of data in under 1.1 seconds, which is only 13% longer than it takes to recover 600 MB with
a single master and 6 disks.
In order to allow more recovery masters to participate in recovery, we replaced all the disks in our
cluster with flash drives, each of which provided 270 MB/s read bandwidth (as opposed to 110 MB/s
for the disks). With this configuration we were able to run recoveries that used 60 recovery masters,
as shown in Figure 9. The system still scales well: with 60 recovery masters RAMCloud can recover
35 GB of data from a lost server in about 1.6 seconds, which is 26% longer than it takes 2 recovery
masters to recover 1.2 GB of data.
It is important to keep the overhead for additional masters and backups small, so that recovery can
span hundreds of hosts in large clusters. In order to isolate these overheads, we ran additional experiments with artificially small segments (16 KB) and kept all segment replicas in DRAM to eliminate
disk overheads. Figure 10 (bottom curve) shows the recovery time using trivial partitions containing
just a single 1 KB object; this measures the cost for the coordinator to contact all the backups and
masters during the setup phase. Our cluster scales to 60 recovery masters with only about a 10 ms
increase in recovery time (thanks in large part to fast RPCs).
Figure 10 also shows recovery time using 1.2 MB partitions and 16 KB segments (upper curve). In
this configuration the cluster performs roughly the same number of RPCs as it does in Figure 8, but it
has very little data to process. This exposes the fixed overheads for recovery masters to communicate
with backups: as the system scale increases, each master must contact more backups, retrieving
less data from each individual backup. Each additional recovery master adds only about 1.5 ms of
overhead, so work can be split across 100 recovery masters without substantially increasing recovery time.
4.4 How Well Does Segment Scattering Work?
Figure 11 shows that the segment placement algorithm described in Section 3.2 works well. We
measured three different variations of the placement algorithm: the full algorithm, which considers
both disk speed and number of segments already present on each backup; a version that uses purely
random placement; and an in-between version that attempts to even out the number of segments
on each backup but does not consider disk speed. The top graph in Figure 11 shows that the full
[Figure 9 plot: Recovery Time (ms) vs. Number of 600 MB Partitions (Recovery Masters), showing Total Recovery.]
Figure 9: Recovery time under proportional scaling, using flash drives instead of disks. Each partition contained
600 MB of data, and there were 2 backups for each recovery master. As with Figure 8, scaling is proportional: the
number of recovery masters and backups increases with the number of partitions being recovered. Each point is
an average of 5 runs. A horizontal line would indicate perfect scalability. Recovery is slower than in Figure 8 for
a number of reasons: less disk bandwidth available per master (540 MB/s vs. 600-660 MB/s), network saturation,
and processor and memory contention between the master and backups on each node.
[Figure 10 plot: Recovery Time (ms) vs. Number of Partitions (Recovery Masters), with curves for 1.2 MB partitions and 1 KB partitions.]
Figure 10: Management overhead as a function of system scale. Partition size is reduced to 16 KB and segment
replicas are stored in DRAM in order to eliminate overheads related to data size or disk. For “1 KB partitions”
each partition only contains a single object; this measures the coordinator’s overheads for contacting masters and
backups. “1.2 MB partitions” maintains the same number of segments (and roughly the same number of RPCs)
as in Figure 8; it measures the overhead for masters to contact more and more backups as cluster size increases.
Each data point is the average over 5 runs, and there were 2 backups for each recovery master.
[Figure 11 plots: cumulative distributions of Recovery Time (seconds) for Even Read Time, Even Segments, and Uniform Random placement; top panel with fans at normal speed, bottom panel with fans at high speed.]
Figure 11: Impact of segment placement on recovery time. Each line is a cumulative distribution of 120 recoveries
of twenty 600 MB partitions, showing the percent of recoveries that completed within a given time. “Even Read
Time” uses the placement algorithm described in Section 3.2; “Uniform Random” uses a purely random approach;
and “Even Segments” attempts to spread segments evenly across backups without considering disk speed. The top
graph measured the cluster in its normal configuration, with relatively uniform disk performance; the bottom graph
measured the system as it was shipped (unnecessarily high fan speed caused vibrations that degraded performance
significantly for some disks). With fans at normal speed, “Even Read Time” and “Even Segments” perform nearly
the same since there is little variation in disk speed.
algorithm improves recovery time by about 33% over a purely random placement mechanism. Much
of the improvement came from evening out the number of segments on each backup; considering
disk speed improves recovery time by only 12% over the even-segment approach because the disks
did not vary much in speed.
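The "Even Read Time" idea can be sketched as a randomized placement that samples a few candidate backups and picks the one whose expected disk-read time after adding the segment is smallest. This is our illustrative reconstruction of the Section 3.2 approach, not RAMCloud's code; the field names are ours:

```python
import random

def choose_backup(backups, k=5):
    """Sample k candidate backups at random, then pick the one that would
    finish reading its segments soonest if given one more. 'backups' maps
    name -> dict(segments=..., disk_mb_s=...)."""
    candidates = random.sample(list(backups), k=min(k, len(backups)))
    def read_time(name):
        b = backups[name]
        return (b["segments"] + 1) * 8.0 / b["disk_mb_s"]  # 8 MB segments
    best = min(candidates, key=read_time)
    backups[best]["segments"] += 1
    return best

# Scatter 1,000 segments across backups with mixed disk speeds; faster
# disks should end up with proportionally more segments.
random.seed(0)
backups = {f"fast{i}": {"segments": 0, "disk_mb_s": 200} for i in range(3)}
backups.update({f"slow{i}": {"segments": 0, "disk_mb_s": 50} for i in range(3)})
for _ in range(1000):
    choose_backup(backups, k=6)
assert min(backups[f"fast{i}"]["segments"] for i in range(3)) > \
       max(backups[f"slow{i}"]["segments"] for i in range(3))
```

Dropping disk speed from `read_time` yields the "Even Segments" variant; replacing the scored choice with `random.choice` yields "Uniform Random".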
To further test how the algorithm handles variations in disk speed, we also took measurements using
the configuration of our cluster when it first arrived. The fans were shipped in a “max speed” debugging setting, and the resulting vibration caused large variations in speed among the disks (as much as
a factor of 4x). In this environment the full algorithm provided an even larger benefit over purely random placement, but there was relatively little benefit from considering segment counts without also
considering disk speed (Figure 11, bottom graph). RAMCloud’s placement algorithm compensates
effectively for variations in the speed of disks, allowing recovery times almost as fast with highly
variable disks as with uniform disks. Disk speed variations may not be significant in our current
cluster, but we think they will be important in large datacenters where there are likely to be different
generations of hardware.
4.5 Will Scattering Result in Data Loss?
RAMCloud’s approach of scattering segment replicas allows faster recovery, but it increases the system’s vulnerability in the event of simultaneous node failures. For example, consider a cluster with
1,000 nodes and 2x disk replication. With RAMCloud’s scattering approach to segment placement,
there is a 5% chance that data will be lost if any 3 nodes fail simultaneously (the three nodes will
account for the master and both backups for at least one segment). In contrast, if each master concentrates all its segment replicas on two backups, as in Figure 3a, the probability of data loss drops
to less than 10^-5 with 3 simultaneous failures.
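Both probabilities can be reproduced with a short calculation under a simplified model we assume here: each segment's two disk replicas land on a uniformly random pair of the other nodes (scattered), or each master's entire log lands on one random pair of backups (concentrated):

```python
from math import comb

n, segs = 1000, 8000           # cluster size, segments per master (from text)
pairs = comb(n - 1, 2)         # ways to place a segment's 2 replicas
p_pair = 1 / pairs             # chance both replicas hit a specific pair

# Scattered: the 3 failed nodes master 3*8000 segments; a segment is lost
# if both of its replicas landed on the other two failed nodes.
p_scatter = 1 - (1 - p_pair) ** (3 * segs)

# Concentrated: loss requires a failed master whose chosen backup pair is
# exactly the other two failed nodes.
p_conc = 1 - (1 - p_pair) ** 3

assert 0.04 < p_scatter < 0.06   # ~5%, as stated above
assert p_conc < 1e-5             # below 10^-5, as stated above
```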
Fortunately, the fast recovery enabled by scattering makes it unlikely that a second or third failure
will occur before a first failure has been recovered, and this more than makes up for the additional
vulnerability, as shown in Figure 12. With one-second recovery the probability of data loss is very
low (about 10^-5 in one year even with a 100,000-node cluster). The risk of data loss rises rapidly
with recovery time: if recovery takes 1,000 seconds, then RAMCloud is likely to lose data in any
one-year period. The line labeled “100s” corresponds roughly to the recovery mechanisms in other
systems such as GFS and HDFS (these systems keep 3 replicas on disk, vs. 1 replica in DRAM and 2
[Figure 12 plot: Probability of Data Loss in One Year vs. Number of Servers, with lines for each recovery time, including Concentrated 100s and Concentrated 10s.]
Figure 12: Probability of data loss in one year as a function of cluster size, assuming 8,000 segments per master,
two disk replicas for each DRAM copy, and two crashes per year per server with a Poisson arrival distribution.
Different lines represent different recovery times. Lines labeled “Concentrated” assume that segments are concentrated instead of scattered: each master picks 2 backups at random and replicates all of its segments on each
of those backups.
replicas on disk for the corresponding RAMCloud); with large cluster sizes these other systems may
be vulnerable to data loss. Using a concentrated approach rather than scattering improves reliability,
but the benefit from faster recovery is much larger: a 10x improvement in recovery time improves
reliability more than a 1,000x reduction in scattering.
One risk with Figure 12 is that it assumes server failures are independent. There is considerable
evidence that this is not the case in datacenters [23, 8]; for example, it is not unusual for entire racks
to become inaccessible at once. Thus it is important for the segment scattering algorithm to consider
sources of correlated failure, such as rack boundaries. If there are unpredictable sources of correlated
failure, they will result in longer periods of unavailability while RAMCloud waits for one or more of
the backups to reboot (RAMCloud is no better or worse than other systems in this respect).
Although we made all of the performance measurements in this section with 3x disk replication to be
conservative, Figure 12 suggests that the combination of two copies on disk and one copy in DRAM
should be quite safe. The main argument for 3x disk replication is to ensure 3-way redundancy even
in the event of a datacenter power outage, which would eliminate the DRAM copies. With 3x disk
replication in addition to the DRAM copy, the likelihood of data loss is extremely small: less than
1% in a year even with 100,000 servers and 1,000-second recovery times.
4.6 What Is the Fastest Possible Recovery?
Assuming that recovery is scalable, it should be possible to recover even faster than 1-2 seconds by
using more backups and more recovery masters, with smaller partitions. However, we think that it
will be difficult to recover faster than a few hundred milliseconds without significant changes to the
recovery mechanism. For example, RAMCloud currently requires 150 milliseconds just to detect
failure, and the time for the coordinator to contact every backup may approach 100 ms in a large
cluster. In addition, it takes nearly 100 ms to read a single segment from disk (but this could be
reduced if flash memory replaces disk for backup storage).
5 Risks
There are three risks associated with RAMCloud's recovery mechanism that we have not been able
to fully evaluate yet. We hope to learn more about these risks (and devise solutions, if necessary) as
we gain more experience with the system.
Scalability. The measurements of scalability in Section 4.3 are encouraging, but they are based on a
cluster size about one-fifth of what we would expect in production. It seems likely that larger clusters
will expose problems that we have not yet seen.
Over-hasty recovery. In order to recover quickly, RAMCloud must also detect failures quickly.
Whereas traditional systems may take 30 seconds or more to decide that a server has failed, RAMCloud makes that decision in 150 ms. This introduces a risk that RAMCloud will treat performance
glitches as failures, resulting in unnecessary recoveries that could threaten both the performance and
the integrity of the system. Furthermore, fast failure detection precludes some network protocols. For
example, most TCP implementations wait 200 ms before retransmitting lost packets; if TCP is to be
used in RAMCloud, either its retransmit interval must be shortened or RAMCloud’s failure detection
interval must be lengthened. The current implementation of RAMCloud supports several transport
protocols for its RPC system (including TCP), most of which support fast failure detection.
Fragmented partitions. Our approach to recovery assumes that a master’s objects can be divided
into partitions during recovery. However, this changes the locality of access to those objects, which
could degrade application performance after recovery. Our current data model does not benefit much
from locality, but as we experiment with richer data models, this issue could become important.
6 Related Work
There are numerous examples where DRAM has been used to improve the performance of storage
systems. Early experiments in the 1980s and 1990s included file caching [19] and main-memory
database systems [10, 11]. In recent years, large-scale Web applications have found DRAM indispensable to meet their performance goals. For example, both Google and Yahoo! keep their entire
Web search indexes in DRAM; Facebook offloads its database servers by caching tens of terabytes
of data in DRAM with memcached [3]; and Bigtable allows entire column families to be loaded into
memory [6]. RAMCloud differs from these systems because it keeps all data permanently in DRAM
(unlike Bigtable and Facebook, which use memory as a cache on a much larger disk-based storage
system) and it is general-purpose (unlike the Web search indexes).
There has recently been a resurgence of interest in main-memory databases. One example is H-Store [16], which keeps all data in DRAM, supports multiple servers, and is general-purpose. However, H-Store is focused more on achieving full RDBMS semantics and less on achieving large scale
or low latency to the same degree as RAMCloud. H-Store keeps redundant data in DRAM and does
not attempt to survive coordinated power failures.
A variety of “NoSQL” storage systems have appeared recently, driven by the demands of large-scale
Web applications and the inability of relational databases to meet their needs. Examples include
Dynamo [9] and PNUTS [7]. Many of these systems use DRAM in some form, but all are fundamentally disk-based and none are attempting to provide latencies in the same range as RAMCloud.
These systems provide availability using symmetric replication instead of fast crash recovery.
RAMCloud is similar in many ways to Google’s Bigtable [6] and GFS [12]. Bigtable, like RAMCloud, implements fast crash recovery (during which data is unavailable) rather than online replication. Bigtable also uses a log-structured approach for its (meta)data, and it buffers newly-written
data in memory, so that write operations complete before data has been written to disk. GFS serves
a role for Bigtable somewhat like the backups in RAMCloud. Both Bigtable and GFS use aggressive data partitioning to speed up recovery. However, Bigtable and GFS were designed primarily for
disk-based datasets; this allows them to store 10-100x more data than RAMCloud, but their access
latencies are 10-100x slower (even for data cached in DRAM).
Caching mechanisms such as memcached [3] appear to offer a particularly simple mechanism for
crash recovery: if a caching server crashes, its cache can simply be re-created as needed, either on the
crashed server (after it restarts) or elsewhere. However, in large-scale systems, caching approaches
can cause large gaps in availability after crashes. Typically these systems depend on high cache hit
rates to meet their performance requirements; if caches are flushed, the system may perform so poorly
that it is essentially unusable until the cache has refilled. This happened in an outage at Facebook in
September 2010 [1]: a software error caused 28 TB of memcached data to be flushed, rendering the
site unusable for 2.5 hours while the caches refilled from slower database servers.
Randomization has been used by several previous systems to allow system management decisions to
be made in a distributed and scalable fashion. For example, consistent hashing uses randomization
to distribute objects among a group of servers [24, 9]. Mitzenmacher and others have studied the
theoretical properties of randomization with refinement and have shown that it produces near-optimal
results [17, 5].
RAMCloud’s log-structured approach to storage management is similar in many ways to log-structured
file systems (LFS) [21]. However, log management in RAMCloud is simpler and more efficient than
in LFS. RAMCloud is simpler because the log need not contain metadata to enable random-access
reads as in LFS: the hash table enables fast access to data in DRAM, and the disk log is never read
except during recovery, at which time the entire log is read. Thus the log consists primarily of object
records and tombstones that mark their deletion. RAMCloud does not require checkpoints as in LFS,
because it replays the entire log during recovery. RAMCloud is more efficient than LFS because it
need not read data from disk during cleaning: all live data is always in memory. The only I/O during
cleaning is to rewrite live data at the head of the log; as a result, RAMCloud consumes 3-10x less
bandwidth for cleaning than LFS (cleaning cost has been a controversial topic for LFS; see [22], for example).
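A minimal sketch of this log organization (ours, not RAMCloud's implementation) shows why cleaning needs no disk reads: the hash table already points at every live entry in memory, so cleaning simply rewrites live data at the head of the log:

```python
class MiniLog:
    """Toy log-structured store: a hash table maps keys to log positions;
    the log holds object records and tombstones; cleaning copies only
    live entries into a fresh log."""
    def __init__(self):
        self.log = []          # list of (key, value-or-None) entries
        self.index = {}        # key -> position of the live entry

    def write(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def delete(self, key):
        self.log.append((key, None))   # tombstone marks the deletion
        self.index.pop(key, None)

    def read(self, key):
        return self.log[self.index[key]][1]

    def clean(self):
        # No random-access reads needed: every live entry is reachable
        # through the in-memory hash table.
        live = [(k, self.log[pos][1]) for k, pos in self.index.items()]
        self.log, self.index = [], {}
        for k, v in live:
            self.write(k, v)

m = MiniLog()
m.write("a", 1); m.write("b", 2); m.write("a", 3)   # overwrite "a"
m.delete("b")
assert m.read("a") == 3 and len(m.log) == 4          # 3 writes + 1 tombstone
m.clean()
assert len(m.log) == 1 and m.read("a") == 3          # only live data remains
```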
7 Conclusion
In this paper we have demonstrated that the resources of a large-scale storage system can be used to
recover quickly from server crashes. RAMCloud distributes backup data across a large number of
secondary storage devices and employs both data parallelism and pipelining to achieve end-to-end
recovery times of 1-2 seconds. Although we have only been able to evaluate RAMCloud on a small
cluster, our measurements indicate that the techniques will scale to larger clusters. Our implementation uses a simple log-structured representation for data, both in memory and on secondary storage,
which provides high write throughput in addition to enabling fast recovery.
Fast crash recovery is a key enabler for RAMCloud: it allows a high-performance DRAM-based
storage system to provide durability and availability at one-third the cost of a traditional approach
using online replicas.
Acknowledgments
Asaf Cidon reeducated us on the fundamentals of probability and assisted us with several calculations,
including Figure 12. Nanda Kumar Jayakumar helped us with performance measurements and some
of the figures in the paper. Several people provided helpful feedback on the paper, including Asaf
Cidon, Ankita Kejriwal, Kay Ousterhout, George Varghese, the anonymous SOSP reviewers, and our
shepherd Geoff Voelker. This work was supported by the Gigascale Systems Research Center and the
Multiscale Systems Center, two of six research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program, and by Facebook, Mellanox, NEC, NetApp,
SAP, and Samsung. This work was also partially supported by NSF Cybertrust awards CNS-0716806
and CNS-1052985 (CT-T: A Clean-Slate Infrastructure for Information Flow Control). Diego Ongaro
is supported by The Junglee Corporation Stanford Graduate Fellowship. Steve Rumble is supported
by a Natural Sciences and Engineering Research Council of Canada Postgraduate Scholarship.
References
[1] More Details on Today's Outage | Facebook, Sept. 2010.
[2] AgigA Tech AGIGARAM, Mar. 2011.
[3] memcached: a distributed memory object caching system, Jan. 2011.
[4] M. K. Aguilera, A. Merchant, M. Shah, A. Veitch, and C. Karamanolis. Sinfonia: A new
paradigm for building scalable distributed systems. ACM Trans. Comput. Syst., 27:5:1–5:48,
November 2009.
[5] Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced allocations (extended abstract). In
Proceedings of the twenty-sixth annual ACM symposium on theory of computing, STOC ’94,
pages 593–602, New York, NY, USA, 1994. ACM.
[6] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra,
A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM
Trans. Comput. Syst., 26:4:1–4:26, June 2008.
[7] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen,
N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB
Endow., 1:1277–1288, August 2008.
[8] J. Dean. Keynote talk: Evolution and future directions of large-scale storage and computation
systems at Google. In Proceedings of the 1st ACM symposium on Cloud computing, June 2010.
[9] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin,
S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly available key-value
store. In Proceedings of twenty-first ACM SIGOPS symposium on operating systems principles,
SOSP ’07, pages 205–220, New York, NY, USA, 2007. ACM.
[10] D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. R. Stonebraker, and D. A. Wood.
Implementation techniques for main memory database systems. In Proceedings of the 1984
ACM SIGMOD international conference on management of data, SIGMOD ’84, pages 1–8,
New York, NY, USA, 1984. ACM.
[11] H. Garcia-Molina and K. Salem. Main memory database systems: An overview. IEEE Trans.
on Knowl. and Data Eng., 4:509–516, December 1992.
[12] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proceedings of the
nineteenth ACM symposium on Operating systems principles, SOSP ’03, pages 29–43, New
York, NY, USA, 2003. ACM.
[13] M. P. Herlihy and J. M. Wing. Linearizability: a correctness condition for concurrent objects.
ACM Trans. Program. Lang. Syst., 12:463–492, July 1990.
[14] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: wait-free coordination for
internet-scale systems. In Proceedings of the 2010 USENIX annual technical conference,
USENIX ATC ’10, pages 11–11, Berkeley, CA, USA, 2010. USENIX Association.
[15] R. Johnson and J. Rothschild. Personal Communications, March 24 and August 20, 2009.
[16] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden,
M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi. H-Store: a high-performance, distributed
main memory transaction processing system. Proc. VLDB Endow., 1:1496–1499, August 2008.
[17] M. D. Mitzenmacher. The power of two choices in randomized load balancing. PhD thesis,
University of California, Berkeley, 1996. AAI9723118.
[18] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra,
A. Narayanan, D. Ongaro, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and
R. Stutsman. The case for RAMCloud. Commun. ACM, 54:121–130, July 2011.
[19] J. K. Ousterhout, A. R. Cherenson, F. Douglis, M. N. Nelson, and B. B. Welch. The Sprite
network operating system. Computer, 21:23–36, February 1988.
[20] D. A. Patterson, G. Gibson, and R. H. Katz. A case for redundant arrays of inexpensive disks
(RAID). In Proceedings of the 1988 ACM SIGMOD international conference on management of
data, SIGMOD ’88, pages 109–116, New York, NY, USA, 1988. ACM.
[21] M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file
system. ACM Trans. Comput. Syst., 10:26–52, February 1992.
[22] M. Seltzer, K. A. Smith, H. Balakrishnan, J. Chang, S. McMains, and V. Padmanabhan. File
system logging versus clustering: a performance comparison. In Proceedings of the USENIX
1995 Technical Conference, TCON’95, pages 21–21, Berkeley, CA, USA, 1995. USENIX Association.
[23] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In
Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies
(MSST), MSST ’10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society.
[24] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and
H. Balakrishnan. Chord: a scalable peer-to-peer lookup protocol for Internet applications.
IEEE/ACM Trans. Netw., 11:17–32, February 2003.
Design Implications for Enterprise
Storage Systems via
Multi-Dimensional Trace Analysis
Yanpei Chen, Kiran Srinivasan∗, Garth Goodson∗, Randy Katz
University of California, Berkeley; ∗NetApp Inc.
{ychen2, randy}, {skiran, goodson}
Enterprise storage systems are facing enormous challenges due to increasing growth and
heterogeneity of the data stored. Designing future storage systems requires comprehensive
insights that existing trace analysis methods are ill-equipped to supply. In this paper, we
seek to provide such insights by using a new methodology that leverages an objective, multi-dimensional statistical technique to extract data access patterns from network storage
system traces. We apply our method on two large-scale real-world production network
storage system traces to obtain comprehensive access patterns and design insights at user,
application, file, and directory levels. We derive simple, easily implementable, threshold-based design optimizations that enable efficient data placement and capacity optimization
strategies for servers, consolidation policies for clients, and improved caching performance
for both.
Categories and Subject Descriptors
C.4 [Performance of Systems]: Measurement techniques; D.4.3 [Operating Systems]:
File Systems Management—Distributed file systems
Enterprise storage systems are designed around a set of data access patterns. The storage
system can be specialized by designing to a specific data access pattern; e.g., a storage
system for streaming video supports different access patterns than a document repository.
The better the access pattern is understood, the better the storage system design. Insights
into access patterns have been derived from the analysis of existing file system workloads,
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
SOSP '11, October 23-26, 2011, Cascais, Portugal.
Copyright © 2011 ACM 978-1-4503-0977-6/11/10 ... $10.00.
typically through trace analysis studies [1, 3, 17, 19, 24]. While this is the correct general
strategy for improving storage system design, past approaches have critical shortcomings,
especially given recent changes in technology trends. In this paper, we present a new
design methodology to overcome these shortcomings.
The data stored on enterprise network-attached storage systems is undergoing changes due
to a fundamental shift in the underlying technology trends. We have observed three such
trends:
• Scale: Data size grows at an alarming rate [12], due to new types of social, business
and scientific applications [20], and the desire to “never delete” data.
• Heterogeneity: The mix of data types stored on these storage systems is becoming
increasingly complex, each having its own requirements and access patterns [22].
• Consolidation: Virtualization has enabled the consolidation of multiple applications
and their data onto fewer storage servers [6, 23]. These virtual machines (VMs) also
present aggregate data access patterns more complex than those from individual clients.
Better design of future storage systems requires insights into the changing access patterns
due to these trends. While trace studies have been used to derive data access patterns, we
believe that they have the following shortcomings:
• Unidimensional: Although existing methods analyze many access characteristics, they
do so one at a time, without revealing cross-characteristic dependencies.
• Expertise bias: Past analyses were performed by storage system designers looking for
specific patterns based on prior workload expectations. This introduces a bias that
needs to be revisited based on the new technology trends.
• Storage server centric: Past file system studies focused primarily on storage servers.
This creates a critical knowledge gap regarding client behavior.
To overcome these shortcomings, we propose a new design methodology backed by the
analysis of storage system traces. We present a method that simultaneously analyzes multiple characteristics and their cross dependencies. We use a multi-dimensional, statistical
correlation technique, called k-means [2], that is completely agnostic to the characteristics of each access pattern and their dependencies. The k-means algorithm can analyze
hundreds of dimensions simultaneously, providing added objectivity to our analysis. To
further reduce expertise bias, we involve as many relevant characteristics as possible for
each access pattern. In addition, we analyze patterns at different granularities (e.g., at
the user session, application, file level) on the storage server as well as the client, thus addressing the need for understanding client patterns. The resulting design insights enable
policies for building new storage systems.
We analyze two recent, network-attached storage file system traces from a production
enterprise datacenter. Table 1 summarizes our key observations and design implications; they are detailed later in the paper. Our methodology leads to observations that
would be difficult to extract using past methods. We illustrate two such access patterns,
one showing the value of multi-granular analysis (Observation 1 in Table 1) and another
showing the value of multi-feature analysis (Observation 8).
First, we observe (Observation 1) that sessions with more than 128KB of data reads or
writes are either read-only or write-only. This observation affects shared caching and
consolidation policies across sessions. Specifically, client OSs can detect and co-locate cache-sensitive sessions (read-only) with cache-insensitive sessions (write-only) using just one
parameter (read-write ratio). This improves cache utilization and consolidation (increased
density of sessions per server).
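As a concrete illustration, the co-location heuristic above fits in a few lines. The 128KB threshold and the read-write-ratio test come from Observation 1; the function names and the greedy pairing below are our own illustrative assumptions, not the paper's implementation.

```python
# Sketch of Observation 1 as a consolidation heuristic: sessions moving
# more than 128 KB are read-only or write-only, so a client OS can pair
# cache-sensitive (read-heavy) sessions with cache-insensitive
# (write-heavy) ones using only the read-write ratio.
# All names here are illustrative, not from the paper.

THRESHOLD_BYTES = 128 * 1024

def classify(read_bytes: int, write_bytes: int) -> str:
    """Label a session using only its IO size and read:write ratio."""
    total = read_bytes + write_bytes
    if total <= THRESHOLD_BYTES:
        return "small"  # below threshold: no strong signal either way
    ratio = read_bytes / total
    return "cache-sensitive" if ratio >= 0.5 else "cache-insensitive"

def colocate(sessions: dict[str, tuple[int, int]]) -> list[tuple[str, str]]:
    """Greedily pair read-only sessions with write-only sessions."""
    readers = [s for s, (r, w) in sessions.items() if classify(r, w) == "cache-sensitive"]
    writers = [s for s, (r, w) in sessions.items() if classify(r, w) == "cache-insensitive"]
    return list(zip(readers, writers))
```

For example, `colocate({"a": (4 << 20, 0), "b": (0, 4 << 20)})` pairs the read-only session `a` with the write-only session `b`.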
Client side observations and design implications
1. Client sessions with IO sizes >128KB are read-only or write-only. ⇒ Clients can consolidate sessions based on only the read-write ratio.
2. Client sessions with duration >8 hours do ≈10MB of IO. ⇒ Client caches can already fit an entire day’s IO.
3. Number of client sessions drops off linearly by 20% from Monday to Friday. ⇒ Servers can get an extra “day” for background tasks by running at appropriate times during week days.
4. Applications with <4KB of IO per file open and many opens of a few files do only random IO. ⇒ Clients should always cache the first few KB of IO per file per application.
5. Applications with >50% sequential read or write access entire files at a time. ⇒ Clients can request file prefetch (read) or delegation (write) based on only the IO sequentiality.
6. Engineering applications with >50% sequential read and sequential write are doing code compile tasks, based on file extensions. ⇒ Servers can identify compile tasks; server has to cache the output of these tasks.
Server side observations and design implications
7. Files with >70% sequential read or write have no repeated reads or overwrites. ⇒ Servers should delegate sequentially accessed files to clients to improve IO performance.
8. Engineering files with repeated reads have random accesses. ⇒ Servers should delegate repeatedly read files to clients; clients need to store them in flash or memory.
9. All files are active (have opens, IO, and metadata access) for only 1-2 hours in a few months. ⇒ Servers can use file idle time to compress or deduplicate to increase storage capacity.
10. All files have either all random access or >70% sequential access. (Seen in past studies too.) ⇒ Servers can select the best storage medium for each file based on only access sequentiality.
11. Directories with sequentially accessed files almost always contain randomly accessed files as well. ⇒ Servers can change from per-directory placement policy (default) to per-file policy upon seeing any sequential IO to any file in a directory.
12. Some directories aggregate only files with repeated reads and overwrites. ⇒ Servers can delegate these directories entirely to clients, tradeoffs permitting.
Table 1: Summary of design insights, separated into insights derived from client access
patterns and server access patterns.
Similarly, we observe (Observation 8) that files with >70% sequential read or sequential
write have no repeated reads or overwrites.
This access pattern involves four characteristics: read sequentiality, write sequentiality, repeated read behavior, and overwrite
behavior. The observation leads to a useful policy: sequentially accessed files do not need
to be cached at the server (no repeated reads), which leads to an efficient buffer cache.
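One way to read this policy as code is the following sketch. The 70% sequentiality threshold is from Observation 8; everything else, including the cache structure and names, is our own illustration.

```python
# Sketch of the buffer-cache admission policy implied by Observation 8:
# files whose accesses are >70% sequential show no repeated reads, so
# the server skips caching their data. The threshold is from the text;
# the cache structure itself is illustrative.

SEQ_THRESHOLD = 0.70

class ServerCache:
    def __init__(self):
        self.blocks: set[tuple[str, int]] = set()

    def on_read(self, path: str, block: int, file_seq_fraction: float) -> None:
        # Sequentially accessed files bypass the cache entirely:
        # they will not be read again soon.
        if file_seq_fraction > SEQ_THRESHOLD:
            return
        self.blocks.add((path, block))

cache = ServerCache()
cache.on_read("/a/random.db", 0, 0.1)    # randomly accessed: cached
cache.on_read("/a/stream.mp4", 0, 0.95)  # sequential: bypasses cache
```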
These observations illustrate that our methodology can derive unique design implications
that leverage the correlation between different characteristics. To summarize, our contributions are:
• Identify storage system access patterns using a multi-dimensional, statistical analysis.
• Build a framework for analyzing traces at different granularity levels at both server
and client.
• Analyze our specific traces and present the access patterns identified.
• Derive design implications for various storage system components from the access patterns.
In the rest of the paper, we motivate and describe our analysis methodology (Sections 2 and
3), present the access patterns we found and the design insights (Section 4), provide the
implications on storage system architecture (Section 5), and suggest future work (Section 6).
Past trace-based studies have examined a range of storage system protocols and use cases,
delivering valuable insights for designing storage servers. Table 2 summarizes the contributions of past studies. Many studies predate current technology trends. Analysis of
real-world, corporate workloads or traces has been sparse, with only three such studies among
the ones listed [13, 15, 18].
[Table 2, garbled in extraction, tabulates prior trace studies ([1, 3, 7, 8, 10, 13, 15, 17, 18, 19, 24, 25]) against this paper, summarizing each study's workloads and key findings, e.g., large sequential read access, limited read-write sharing, bursty I/O, short file lifetimes, caching effects, file and directory attribute distributions (size, age, lifetime, directory depth), trends in NTFS, personal computer workloads, file re-open, sharing, and activity characteristics, and web workloads studied via server events.]
Table 2: Past studies of storage system traces. “Corp” stands for corporate use cases. “Eng” stands for engineering use cases. “Live” implies live requests or events in traces were studied; “Snap” implies snapshots of file systems were studied.
A number of studies have focused on NFS trace analysis only [8,
10]. This focus somewhat neglects systems using the Common Internet File System (CIFS)
protocol [5], with only a single CIFS study [15]. CIFS systems are important since CIFS
is the network storage protocol for Windows, the dominant OS on commodity platforms.
Our work uses the same traces as [15], but we perform analysis using a methodology that
extracts multi-dimensional insights at different layers. This methodology is sufficiently
different from prior work as to make the analysis findings not comparable. The following
discusses the need for this methodology.
Need for Insights at Different Layers
We divide our view of the storage system into behavior at clients and servers. Storage
clients interface directly with users, who create and view content via applications. Separately, servers store the content in a durable and efficient fashion over the network. Past
network storage system trace studies focus mostly on storage servers (Table 2). Storage
client behavior is underrepresented primarily due to the reliance on stateless NFS traces.
This leaves a knowledge gap about access patterns at storage clients. Specifically, these
questions are unanswered:
• Do applications exhibit clear access patterns?
• What are the user-level access patterns?
• Is there any correlation between users and applications?
• Do all applications interact with files the same way?
Insights on these access patterns lead to better design of both clients and servers. They
enable server capabilities such as per session quality of service (QoS), or per application
service level objectives (SLOs). They also inform various consolidation, caching, and
prefetching decisions at clients.
Each of these access patterns is visible only at a particular semantic layer within the client:
users or applications. We define each such layer as an access unit, with the observed
behaviors at each access unit being an access pattern. The analysis of client side access
units represents an improvement on prior work.
On the server side, we extend the previous focus on files. We need to also understand
how files are grouped within a directory, as well as cross-file dependencies and directory
organization. Thus, we perform multi-layer and cross-layer dependency analysis on the
server also. This is another improvement on past work.
Need for Multi-Dimensional Insights
Each access unit has certain inherent characteristics. Characteristics that can be quantified
are features of that access unit. For example, for an application, the read size in bytes
is a feature; the number of unique files accessed is another. Each feature represents an
independent mathematical dimension that describes an access unit. We use the terms
dimension, feature, and characteristic interchangeably. The global set of features for an
access unit is limitless. Picking a good feature set requires domain knowledge.
Many recent studies analyze access patterns only one feature at a time. This represents a
key limitation. The resulting insights, although valuable, lead to uniform policies around a
single design point. For example, study [15] revealed that most bytes are transferred from
larger files. Although this is a useful observation, it does not reveal other characteristics
of such large files: Do they have repeated reads? Do they have overwrites? Do they have
many metadata requests? And so on. Adding these dimensions breaks up the predominant
access pattern into smaller, minority access patterns, each of which may require a specific storage policy.
Understanding minority access patterns is increasingly important, because the trend toward data heterogeneity implies that no “common case” will dominate storage system
behavior. Minority access patterns become visible only upon analyzing multiple features
simultaneously, hence the need for multi-dimensional insights. We also need to select a
reasonable number of features. Doing so allows us to fully describe the access patterns
and reduce the bias in picking any one feature.
Manually identifying multi-feature dependencies is difficult, and can lead to an untenable
analysis. Therefore, we need techniques that analyze a large number of features, scale
to a high number of analysis data points, and do not require a priori knowledge of any
cross-feature dependencies. Multi-dimensional statistics techniques have solved similar
problems in other domains [4, 9, 21]. We can apply similar techniques and combine them
with domain specific knowledge of the storage systems being analyzed.
In short, the need for multi-layered and multi-dimensional insights motivates our methodology.
In this section, we describe our analysis method in detail. We start with a description
of the traces we analyzed, followed by a description of the access units selected for our
study. Next, we describe key steps in our analysis process, including selecting the right
features for each access unit, using the k-means data clustering algorithm to identify access
patterns, and additional information needed to interpret and generalize the results.
Traces Analyzed
We collected CIFS traces from two large-scale, enterprise-class file servers deployed at
our corporate datacenters. One server covers roughly 1000 employees in marketing, sales,
finance, and other corporate roles. We call this the corporate trace. The other server covers
roughly 500 employees in various engineering roles. We call this the engineering trace. We
described the trace collecting infrastructure in [15].
The corporate trace reflects activities on 3TB of active storage from 09/20/2007 to 11/21/2007.
It contains activity from many Windows applications. The engineering trace reflects activities on 19TB of active storage from 08/10/2007 to 11/14/2007. It interleaves activity
from both Windows and Linux applications. In both traces, many clients use virtualization
technologies. Thus, we believe we have representative traces with regard to the technology trends in scale, heterogeneity, and consolidation. Also, since protocol-independent
users, applications, and stored data remain the primary factors affecting storage system
behavior, we believe our analysis is relevant beyond CIFS.
Access Units
As mentioned in Section 2.1, we analyze access patterns at multiple access units at the
server and the client. Selecting access units is subjective. We chose access units that form
clear semantic design boundaries. On the client side, we analyze two access units:
• Sessions: Sessions reflect the aggregate behavior of a user. A CIFS session is bounded by matching session connect and logoff requests. CIFS identifies it by the tuple {client IP address, session ID}.
• Application instance: Analysis at this level leads to application specific optimizations in
client VMs. CIFS identifies each application instance by the tuple {client IP address, session ID, process ID}.
We also analyzed file open-closes, but obtained no useful insights. Hence we omit that
access unit from the paper.
We also examined two server side access units:
• File: Analyzing file level access patterns facilitates per-file policies and optimization
techniques. Each file is uniquely identified by its full path name.
• Deepest subtree: This access unit is identified by the directory path immediately containing the file. Analysis at this level enables per-directory policies.
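The identifying tuples above can be sketched as grouping keys over a generic trace record. The field names below are our assumption for illustration, not the CIFS wire format.

```python
# Sketch: the four access units as grouping keys over a trace record.
# Record field names are illustrative, not from the paper or CIFS.

def session_key(rec: dict) -> tuple:
    return (rec["client_ip"], rec["session_id"])

def app_instance_key(rec: dict) -> tuple:
    return (rec["client_ip"], rec["session_id"], rec["process_id"])

def file_key(rec: dict) -> str:
    return rec["path"]  # full path name uniquely identifies the file

def subtree_key(rec: dict) -> str:
    # deepest subtree: the directory immediately containing the file
    return rec["path"].rsplit("/", 1)[0] if "/" in rec["path"] else ""
```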
Figure 1: Access units analyzed. At clients, each session contains many application instances. At servers, each subtree contains many files.
Figure 1 shows the semantic hierarchy among different access units. At clients, each session
contains many application instances. At servers, each subtree contains many files.
Analysis Process
Our method (Figure 2) involves the following steps:
1. Collect network storage system traces (Section 3.1).
2. Define the descriptive features for each access unit. This step requires domain knowledge about storage systems (Section 3.3.1).
3. Extract multiple instances of each access unit, and compute from the trace the corresponding numerical feature values of each instance.
4. Input those values into k-means, a multi-dimensional statistical data clustering technique (Section 3.3.2).
5. Interpret the k-means output and derive access patterns by looking at only the relevant subset of features. This step requires knowledge of both storage systems and statistics. We also need to extract considerable additional information to support our interpretations (Section 3.3.3).
6. Translate access patterns to design insights.
We give more details about Steps 2, 4, and 5 below.
Selecting features for each access unit
Selecting the set of descriptive features for each access unit requires domain knowledge
about storage systems (Step 2 in Figure 2). It also introduces some subjectivity, since the
choice of features limits how one access pattern can differ from another. The human
designer needs to select some basic features initially, e.g., total IO size and read-write ratio
for a file. We will not know whether we have a good set of features until we have completed
the entire analysis process. If the analysis results leave some design choice ambiguities, we
need to add new features to clarify those ambiguities, again using domain knowledge. For
example, for the deepest subtrees, we compute various percentiles (25th, 50th, and 75th)
of certain features like read-write ratio because the average value for those features did not
clearly separate the access patterns. We then repeat the analysis process using the new
feature set. This iterative process leads to a long feature set for all access units, somewhat
reducing the subjective bias of a small feature set. We list in Section 4 the chosen features
for each access unit.
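For illustration, the percentile-feature step might look like the following sketch, assuming per-file read-write ratios have already been computed for a subtree. The nearest-rank percentile method and all names here are our choices, not the paper's.

```python
# Sketch: augment a subtree's feature vector with 25th/50th/75th
# percentiles of a per-file metric (here, read-write ratio), used when
# the mean alone fails to separate access patterns. Illustrative only.

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (0 < p <= 100) over a non-empty list."""
    ordered = sorted(values)
    rank = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[rank]

def subtree_features(rw_ratios: list[float]) -> dict[str, float]:
    return {
        "rw_mean": sum(rw_ratios) / len(rw_ratios),
        "rw_p25": percentile(rw_ratios, 25),
        "rw_p50": percentile(rw_ratios, 50),
        "rw_p75": percentile(rw_ratios, 75),
    }
```

On a bimodal directory such as `[0.0, 0.0, 1.0, 1.0]`, the mean of 0.5 hides the two modes, while the 25th and 75th percentiles expose them.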
Most of the features used in our analysis (Section 4) are self-explanatory; some ambiguous
or complex features require precise definitions, such as:
IO: We use “IO” as a substitute for “read and write”.
Figure 2: Methodology overview (1. Trace → 2. Select layers, define features → 3. Compute numerical feature values → 4. Identify access patterns by k-means → 5. Interpret → 6. Design insights). The two-way arrows and the loop from Step 2 through Step 5 indicate our many iterations between the steps.
Sequential reads or writes: We consider two read or write requests to be sequential if they
are consecutive in time, and the file offset + request size of the first request equals the file
offset of the second request. A single read or write request is by definition not sequential.
Repeated reads or overwrites: We track accesses at 4KB block boundaries within a file,
with the offset of the first block being zero. A read is considered repeated if it accesses
a block that has been read in the past half hour. We use an equivalent definition for overwrites.
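The two definitions above can be sketched as follows. The request and read tuples are our own simplified representation of trace records, and the per-request scoring of sequentiality is one illustrative choice.

```python
# Sketch of the sequentiality and repeated-read definitions above.
# A request is (offset, size); a request is sequential if the previous
# request's offset + size equals its offset. Repeated reads are tracked
# at 4 KB block granularity with a 30-minute window.

BLOCK = 4096
WINDOW = 30 * 60  # seconds

def sequential_fraction(requests: list[tuple[int, int]]) -> float:
    """Fraction of requests that extend the immediately preceding one."""
    if len(requests) < 2:
        return 0.0  # a single request is by definition not sequential
    seq = sum(1 for (o1, s1), (o2, _) in zip(requests, requests[1:])
              if o1 + s1 == o2)
    return seq / len(requests)

def repeated_read_ratio(reads: list[tuple[float, int, int]]) -> float:
    """reads: (timestamp, offset, size) in time order. Ratio of reads
    touching a 4 KB block already read within the past half hour."""
    last_seen: dict[int, float] = {}
    repeated = 0
    for ts, off, size in reads:
        blocks = range(off // BLOCK, (off + size - 1) // BLOCK + 1)
        if any(b in last_seen and ts - last_seen[b] <= WINDOW for b in blocks):
            repeated += 1
        for b in blocks:
            last_seen[b] = ts
    return repeated / len(reads) if reads else 0.0
```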
Identifying access patterns via k-means
A key part of our methodology is the k-means multi-dimensional correlation algorithm. We
use it to identify access patterns simultaneously across many features (Step 4 in Figure 2).
K-means is a well-known, statistical correlation algorithm. It identifies sets of data points
that congregate around a region in n-dimensional space. These congregations are called
clusters. Given data points in an n-dimensional space, k-means picks k points at random as
initial cluster centers, assigns data points to their nearest cluster centers, and recomputes
new cluster centers via arithmetic means across points in the cluster. K-means iterates
the assignment-recompute process until the cluster centers become stationary. K-means
can run with multiple sets of initial cluster centers and return the best result [2].
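A minimal sketch of this assign-and-recompute loop follows (plain Lloyd's algorithm). The paper used a modified k-means C library [14], so this Python version is purely illustrative.

```python
# Minimal k-means (Lloyd's algorithm) over n-dimensional feature
# vectors, matching the assign/recompute loop described above.
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                  # random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                             # assignment step
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        new = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
               for i, c in enumerate(clusters)]      # recompute means
        if new == centers:                           # stationary: done
            return centers, clusters
        centers = new
    return centers, clusters
```

With well-separated toy points, e.g. `kmeans([(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)], 2)`, the two natural clusters are recovered and the final center coordinates describe each cluster, mirroring how the paper reads access patterns off cluster centers.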
For each access unit, we extract different instances of it from the trace, i.e., all session
instances, application instances, etc. For each instance, we compute the numerical values
of all its features. This gives us a data array in which each row correspond to an instance,
i.e., a data point, and each column correspond to a feature, i.e., a dimension. We input the
array into k-means, and the algorithm finds the natural clusters across all data points. We
consider all data points in a cluster as belonging to a single equivalence class, i.e., a single
access pattern. The numerical values of the cluster centers indicate the characteristics of
each access pattern.
We choose k-means for two reasons. First, k-means is algorithmically simple. This allows rapid processing on large data sets. We used a modified version of the k-means C
library [14], in which we made some improvements to limit the memory footprint when
processing large data sizes. Second, k-means leads to intuitive labels of the cluster centers.
This helps us translate the statistical behavior extracted from the traces into tangible
insights. Thus, we prefer k-means to other clustering algorithms such as hierarchical clustering and k-means derivatives [2].
K-means requires us to specify k, the number of clusters. This is a difficult task since we
do not know a priori the number of “natural” clusters in the data. We compute the intra-cluster “residual” variance from the k-means results: the sum of squared distances from
each data point to its assigned cluster center. This is a standard metric for cluster quality,
and gives us a lower bound on k. We cannot set k so small that the residual variance forms
a large fraction of the total variance, i.e., the sum of squared distances
from each data point to the global average of all data points. We optionally increase k
beyond the lower bound until some key access patterns can be separated. Concurrently,
we take care not to increase k too high, to prevent having an unwieldy number of access
patterns and design targets. We applied this reasoning to set k at each client and server
access unit.
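This k-selection reasoning can be sketched as follows. `kmeans_fn` stands for any clustering routine returning centers and clusters; `max_frac` and the search bounds are illustrative parameters, not values from the paper.

```python
# Sketch of choosing k: the residual (intra-cluster) variance should be
# a small fraction of the total variance, which lower-bounds k.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def residual_fraction(points, centers, clusters):
    """Residual variance as a fraction of total variance."""
    mean = tuple(sum(dim) / len(points) for dim in zip(*points))
    total = sum(sq_dist(p, mean) for p in points)
    resid = sum(sq_dist(p, centers[i])
                for i, c in enumerate(clusters) for p in c)
    return resid / total if total else 0.0

def smallest_k(points, kmeans_fn, max_frac=0.2, k_max=10):
    """Smallest k whose residual fraction is acceptable (a lower bound;
    k may then be raised by hand to separate key access patterns)."""
    for k in range(1, k_max + 1):
        centers, clusters = kmeans_fn(points, k)
        if residual_fraction(points, centers, clusters) <= max_frac:
            return k
    return k_max
```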
Interpreting and generalizing the results
The k-means algorithm gives us a set of access patterns with various characteristics. We
need additional information to understand the significance of the results. This information
comes from computing various secondary data outside of the k-means analysis (Step 5 in Figure 2):
• We gathered the start and end times of each session instance, aggregated by times of
the day and days of the week. This gave us insight into how users launch and end sessions.
• We examine filename extensions of files associated with every access pattern belonging
to these access units: application instances, files, and deepest subtrees. This information connects the access patterns to more easily recognizable file extensions.
• We perform correlation analysis between the file and deepest subtrees access units.
Specifically, we compute the number of files of each file access pattern that are located
within directories in each deepest subtree access pattern. This information captures
the organizations of files in directories.
Such information gives us a detailed picture about the semantics of the access patterns,
resulting in human understandable labels to the access patterns. Such labels help us
translate observations to design implications.
Furthermore, after identifying the design implications, we explore if the design insights can
be extrapolated to other trace periods and other storage system use cases. We accomplish
this by repeating our exact analysis over multiple subsets of the traces, for example, a
week’s worth of traces at a time. This allows us to examine how our analysis would be
different had we obtained only a week’s trace. Access patterns that remain stable across different weeks are likely to be more general than just our tracing period or our use cases.
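A sketch of this week-by-week stability check follows, with illustrative names and an arbitrary distance tolerance for calling two weeks' cluster centers consistent.

```python
# Sketch: re-run the same clustering on each week's slice of the trace
# and compare the resulting cluster centers. The tolerance and all
# names are illustrative.

def weekly_slices(records, week_seconds=7 * 24 * 3600):
    """Group (timestamp, features) records into consecutive weeks."""
    if not records:
        return []
    start = min(ts for ts, _ in records)
    weeks: dict[int, list] = {}
    for ts, feat in records:
        weeks.setdefault(int((ts - start) // week_seconds), []).append(feat)
    return [weeks[i] for i in sorted(weeks)]

def stable(center_sets, tol=1.0):
    """True if every week's centers match week 0's within `tol`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    base = sorted(center_sets[0])
    return all(
        len(cs) == len(base) and
        all(dist(c, b) <= tol for c, b in zip(sorted(cs), base))
        for cs in center_sets[1:]
    )
```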
This section presents the access patterns we identified and the accompanying design insights. We discuss client and server side access patterns (Sections 4.1, 4.2). We also check
if these patterns persist across time (Section 4.3).
For each access unit, we list the descriptive features (only some of which help separate
access patterns), outline how we derived the high-level name (label) for each access pattern,
and discuss relevant design insights.
Client Side Access Patterns
As mentioned in Section 3.2, we analyze sessions and application instances at clients.
Sessions reflect aggregate behavior of human users. We used 17 features to describe sessions
(Table 3). The corporate trace has 509,076 sessions, and the engineering trace has 232,033.
In Table 3, we provide quantitative descriptions and short names for all the session access
patterns. We derive the names from examining the significant features: duration, read-write ratio, and IO size.
(a) Descriptive features for each session: duration; avg. time between IO requests; total IO size; total IO requests; total metadata requests; read:write ratio by bytes; read:write ratio by requests; read sequentiality; write sequentiality; repeated read ratio; overwrite ratio; tree connects; unique trees accessed; file opens; unique files opened; directories accessed; application instances seen.
[Parts (b) and (c) of Table 3, the corporate and engineering session access patterns with their separating feature values, were garbled in extraction. Recoverable fragments include full-day work sessions (≈8 hrs, ≈11 MB of IO), half-day (≈4 hrs, ≈3 MB) and shorter sessions, content generation sessions, and machine generated supporting metadata and supporting read-write sessions lasting on the order of 7-10 seconds.]
Table 3: Session access patterns. (a): Full list of descriptive features. (b) and (c): Short
names and descriptions of sessions in each access pattern; listing only the features that help
separate the access patterns.
We also looked at the aggregate session start and end times to get additional semantic
knowledge about each access pattern. Figure 3 shows the start and end times for selected
session access patterns. The start times of corporate full-day work sessions correspond
exactly to the U.S. work day – 9am start, 12pm lunch, 5pm end. Corporate content generation sessions show a slight increase in the evening and towards Friday, indicating rushes
to meet daily or weekly deadlines. In the engineering trace, the application generated
backup and machine generated update sessions depart significantly from human workday
and work week patterns, leading us to label them as application and machine (client OS) generated, respectively.
One surprise was that the ‘supporting metadata’ sessions account for >90% of all sessions
in both traces. We believe these sessions are not generated by humans. They last roughly
10 seconds, leaving little time for human mediated interactions. Also, the session start
rate averages to roughly one per employee per minute. We are certain that our colleagues
are not connecting and logging off every minute of the entire day. However, the shape of the start time graphs has a strong correlation with the human work day and work week.
We call these supporting metadata sessions – machine generated in support of human
user activities. These metadata sessions form a sort of “background noise” to the storage
system. We observe the same background noise at other layers both at clients and servers.
Figure 3: Number of sessions that start or end at a particular time. Number of session starts and ends in times of the day (top) and session starts in days of the week (bottom). Showing only selected access patterns: corporate full day work, corporate short content generation, corporate supporting metadata, engineering application generated backup or copy, and engineering machine generated update.
Observation 1: The sessions with IO sizes greater than 128KB are either read-only or write-only, except for the full-day work sessions. Among these sessions, only read-only
sessions utilize buffer cache for repeated reads and prefetches. Write-only sessions only use
the cache to buffer writes. Thus, if we have a cache eviction policy that recognizes their
write-only nature and releases the buffers immediately on flushing dirty data, we can satisfy
many write-only sessions with relatively little buffer cache space. We can attain better
consolidation and buffer cache utilization by managing the ratio of co-located read-only
and write-only sessions. This insight can be used by virtualization managers and client
operating systems to manage a shared buffer cache between sessions. Recognizing such
read-only and write-only sessions is easy. Examining a session’s total read size and write
size reveals their read-only or write-only nature. Implication 1: Clients can consolidate
sessions efficiently based only on the read-write ratio.
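Implication 1 amounts to a trivial classifier over a session's byte totals. The sketch below is illustrative (the function name and the reuse of the 128KB figure from Observation 1 as a "too small to matter" cutoff are our assumptions, not the paper's implementation):

```python
def classify_session(total_read_bytes, total_write_bytes, io_threshold=128 * 1024):
    """Tag a session as read-only, write-only, or mixed from its byte totals.

    Sessions under the (illustrative) 128KB threshold are too small to
    matter for buffer-cache planning.
    """
    total = total_read_bytes + total_write_bytes
    if total < io_threshold:
        return "small"
    if total_write_bytes == 0:
        return "read-only"
    if total_read_bytes == 0:
        return "write-only"
    return "mixed"

# A consolidation manager could pair write-only sessions (cheap: their
# buffers are released on flush) with read-only sessions that actually
# need the cache space.
assert classify_session(2 * 2**20, 0) == "read-only"
assert classify_session(0, 3 * 2**20) == "write-only"
```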
Observation 2: The full-day work, content-viewing, and content-generating sessions all do
≈10MB of IO. This means that a client cache of 10s of MB can fit the working set of a day
for most sessions. Given the growth of flash devices on clients for caching, despite large-scale consolidation, clients should easily cache a day's worth of data for all users. In such
a scenario, most IO requests would be absorbed by the cache, reducing network latency
and bandwidth utilization, and load on the server. Moreover, complex cache eviction
algorithms are unnecessary. Implication 2: Client caches can already fit an entire day's working set.
Observation 3: The number of human-generated sessions and supporting sessions peaks on
Monday and decreases steadily to 80% of the peak on Friday (Figure 3). This is true for all
human generated sessions, including the ones not shown in Figure 3. There is considerable
“slack” in the server load during evenings, lunch times, and even during working hours.
This implies that the server can perform background tasks such as consistency checks,
maintenance, or compression/deduplication, at appropriate times during the week. A
simple count of active sessions can serve as an effective start and stop signal. By computing
the area under the curve for session start times by days of the week, we estimate that
background tasks can squeeze out roughly one extra day’s worth of processing without
altering the peak demand on the system. This is a 50% improvement over a setup which
performs background tasks only during weekends. In the engineering trace, the application
generated backup or copy sessions seem to have been already designed this way. Implication
3: Servers get an extra “day” for background tasks by running them at appropriate times
during weekdays.
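The "simple count of active sessions" that Implication 3 proposes as a start/stop signal could look like the following sketch; the class name and both thresholds are illustrative and would be calibrated against the weekly session-start curve from Observation 3:

```python
class BackgroundTaskGate:
    """Start/stop signal for server maintenance tasks (consistency checks,
    compression, deduplication) driven by a count of active sessions.
    Thresholds are illustrative; the hysteresis gap avoids flapping."""

    def __init__(self, start_below=10, stop_above=25):
        self.start_below = start_below   # quiet enough to run tasks
        self.stop_above = stop_above     # busy again: pause tasks
        self.running = False

    def update(self, active_sessions):
        if self.running and active_sessions > self.stop_above:
            self.running = False
        elif not self.running and active_sessions < self.start_below:
            self.running = True
        return self.running

gate = BackgroundTaskGate()
assert gate.update(5) is True       # lunchtime lull: start deduplication
assert gate.update(30) is False     # peak load: pause
```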
Application instances
Application instance access patterns reflect application behavior, facilitating application-specific optimizations. We used 16 features to describe application instances (Table 4). The
corporate trace has 138,723 application instances, and the engineering trace has 741,319.
Table 4 provides quantitative descriptions and short names for all the application instance
access patterns. We derive the names from examining the read-write ratio, IO size, and
file extensions accessed (Figures 4 and 5).
We see again the metadata background noise. The supporting metadata application instances account for the largest fraction, and often do not even open a file.
There are many files without a file extension, a phenomenon also observed in recent storage
system snapshot studies [16]. We notice that file extensions turn out to be poor indicators
of application instance access patterns. This is not surprising because we separate access
patterns based on read/write properties. A user could either view a .doc or create a
.doc. The same application software has different read/write patterns. This speaks to the
strength of our multi-layer framework. Aggregating IO by application instances gives clean separation of patterns, while aggregating just by application software or file extensions will not.
We also find it interesting that most file extensions are immediately recognizable. This
means that what people use network storage systems for, i.e., the file extensions, remains
easily recognizable, even though how people use network storage systems, i.e., the access
patterns, is ever changing and becoming more complex.
Observation 4: The small content viewing application and content update application instances have <4KB total reads per file open and access a few unique files many times. The
small read size and multiple reads from the same files means that clients should prefetch
and place the files in a cache optimized for random access (flash/SSD/memory). The trend
towards flash caches on clients should enable this transfer.
Application instances have bimodal total IO size – either very small or large. Thus, a simple cache management algorithm suffices: we always keep the first 2 blocks of 4KB in cache. If the application instance does more IO, it is likely to have IO size in the 100KB-1MB range, so we evict it from the cache. We should note that such a policy makes sense
even though we proposed earlier to cache all 11MB of a typical day’s working set - 11MB of
cache becomes a concern when we have many consolidated clients. Implication 4: Clients
should always cache the first few KB of IO per file per application.
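The keep-the-first-two-blocks policy from the paragraph above can be sketched as a small client-side structure; all names here are illustrative, not from a real client cache:

```python
BLOCK = 4 * 1024  # 4KB block

class FirstBlocksCache:
    """Per-(file, app-instance) policy: always cache the first two 4KB
    blocks, then evict the entry once the instance's IO exceeds them --
    it is then likely a large 100KB-1MB transfer (the bimodal case)."""

    def __init__(self, keep_blocks=2):
        self.limit = keep_blocks * BLOCK
        self.cached = {}  # (file_id, app_id) -> bytes cached so far

    def on_io(self, file_id, app_id, nbytes):
        key = (file_id, app_id)
        seen = self.cached.get(key, 0) + nbytes
        if seen > self.limit:
            self.cached.pop(key, None)   # large IO: stop caching, evict
            return False                 # do not cache this request
        self.cached[key] = seen
        return True

cache = FirstBlocksCache()
assert cache.on_io("a.doc", 1, 4096) is True    # first block cached
assert cache.on_io("a.doc", 1, 4096) is True    # second block cached
assert cache.on_io("a.doc", 1, 4096) is False   # exceeds 8KB: evicted
```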
Observation 5: We see >50% sequential read and write ratio for the content update application instances (corporate) and the content viewing application instances for human-generated content (both corporate and engineering). Dividing the total IO size by the number of file opens suggests that these application instances are sequentially reading and writing entire files for office productivity (.xls, .doc, .ppt, .pdf, etc.) and multimedia files.
This implies that the files associated with these applications should be prefetched and
delegated to the client. Prefetching means delivering the whole file to the client before the
whole file is requested. Delegation means giving a client temporary, exclusive access to a
file, with the client periodically synchronizing to server to ensure data durability. CIFS
does delegation using opportunistic locks, while NFSv4 has a dedicated operation for delegation. Prefetching and delegation of such files will improve read and write performance,
lower network traffic, and lighten server load.
Table 4: Application instance access patterns. (a) Descriptive features for each application instance include total IO size, total IO requests by bytes, total metadata requests, average time between IO requests, read sequentiality, write sequentiality, read:write ratio by bytes, read:write ratio by requests, repeated read ratio, overwrite ratio, tree connects, unique trees accessed, file opens, unique files opened, directories accessed, and file extensions accessed. (b) and (c): Short names and descriptions of application instances in each corporate and engineering access pattern (e.g., supporting metadata, app generated file updates, content update app, and content viewing app instances); listing only the features that help separate the access patterns.
The access patterns again offer a simple, threshold-based decision algorithm. If an application instance does more than 10s of KB of sequential IO, and has no overwrite, then it
is likely to be a content viewing or update application instance; such files are prefetched
and delegated to the clients. Implication 5: Clients can request file prefetch (read) and
delegation (write) based on only IO sequentiality.
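Implication 5's threshold-based decision can be written down directly; the function name and the specific 64KB cutoff (standing in for "10s of KB") are our assumptions:

```python
def request_delegation(seq_io_bytes, overwrite_ratio, threshold=64 * 1024):
    """Client-side heuristic: after tens of KB of sequential IO with no
    overwrites, the instance is likely a content viewing/update
    application, so ask the server to prefetch the file (reads) or
    delegate it (writes). Threshold is illustrative."""
    return seq_io_bytes > threshold and overwrite_ratio == 0.0

assert request_delegation(800 * 1024, 0.0) is True    # viewing a large file
assert request_delegation(800 * 1024, 0.3) is False   # overwrites: stay at server
```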
Observation 6: Engineering applications with >50% sequential reads and >50% sequential
writes are doing code compile tasks. We know this from looking at the file extensions in
Figure 5. These compile processes show read sequentiality, write sequentiality, a significant
overwrite ratio and large number of metadata requests. They rely on the server heavily
for data accesses. We need more detailed client side information to understand why client
caches are ineffective in this case. However, it is clear that the server cache needs to prefetch
the read files for these applications. The high percentage of sequential reads and writes
gives us another threshold-based algorithm to identify these applications. Implication 6:
Servers can identify compile tasks by the presence of both sequential reads and writes; the server has to cache the output of these tasks.
Server Side Access Patterns
As mentioned in Section 3.2, we analyzed two kinds of server side access units: files and
deepest subtrees.
Figure 4: File extensions for corporate application instance access patterns. For each access pattern (column), showing the fraction of the two most frequent file extensions that are accessed together within a single application instance. “n.f.e.” denotes files with “no file extension”.
Figure 5: File extensions for engineering application instance access patterns. For each access pattern (column), showing the fraction of the two most frequent file extensions that are accessed together within a single application instance. “n.f.e.” denotes files with “no file extension”.
File access patterns help storage server designers develop per-file placement and optimization techniques. We used 25 features to describe files (Table 5). Note that some of the
features include different percentiles of a characteristic, e.g., read request size as percentiles
of all read requests. We believe including different percentiles rather than just the average
would allow better separation of access patterns. The corporate trace has 1,155,099 files,
and the engineering trace has 1,809,571.
In Table 5, we provide quantitative descriptions and short names for all the file access patterns.
Table 5: File access patterns. (a) Descriptive features for each file: number of hours with 1, 2-3, or >4 file opens; number of hours with 1-100KB, 100KB-1MB, or >1MB reads; number of hours with 1-100KB, 100KB-1MB, or >1MB writes; number of hours with 1, 2-3, or >4 metadata requests; read request size (25th, 50th, and 75th percentile of all requests); write request size (25th, 50th, and 75th percentile of all requests); average time between IO requests (25th, 50th, and 75th percentile of all request pairs); read sequentiality; write sequentiality; read:write ratio by bytes; repeated read ratio; and overwrite ratio. (b) and (c): Short names and descriptions of files in each corporate and engineering access pattern (e.g., sequential write, sequential read, read-only log/backup, and edit code & compile output files); listing only the features that help separate the access patterns.
Figures 6 and 7 give the most common file extensions in each. We derived the names by
examining the read-write ratio and IO size. For the engineering trace, examining the file
extensions also proved useful, leading to labels such as “edit code and compile output”,
and “read only log/backup”.
We see that there are groupings of files with similar extensions. For example, in the corporate trace, the small random read access patterns include many file extensions associated
with web browser caches. Also, multi-media files like .mp3 and .jpg congregate in the
sequential read and write access patterns. In the engineering trace, code libraries group
under the sequential write files, and read only log/backup files contain file extensions .0
to .99. However, the most common file extensions in each trace still spread across many
access patterns, e.g., office productivity files in the corporate trace and code files in the
engineering trace.
Observation 7: For files with >70% sequential reads or sequential writes, the repeated read
and overwrite ratios are close to zero. This implies that there is little benefit in caching
these files at the server. They should be prefetched as a whole and delegated to the client.
Figure 6: File extensions for corporate files. Fraction of file extensions in each file access pattern.
Again, the bimodal IO sequentiality offers a simple algorithm for the server to detect which
files should be prefetched and delegated – if a file has any sequential access, it is likely to
have a high percentage of sequential access, therefore it should be prefetched and delegated
to the client. Future storage servers can suggest such information to clients, leading to
delegation requests. Implication 7: Servers should delegate sequentially accessed files to
clients to improve IO performance.
Observation 8: In the engineering trace, only the edit code and compile output files have
a high % of repeated reads. Those files should be delegated to the clients as well. The
repeated reads do not show up in the engineering application instances, possibly because a
compilation process launches many child processes repeatedly reading the same files. Each
child process reads “fresh data,” even though the server sees repeated reads. With larger
memory or flash caches at clients, we expect this behavior to drop. The working set issues
that lead to this scenario need to be examined. If the repeated reads come from a single
client, then the server can suggest that the client cache the appropriate files.
We can again employ a threshold-based algorithm. Detecting any repeated reads at the
server signals that the file should be delegated to the client. At worst, only the first few
reads will hit the server. Subsequent repeated reads are stopped at the client. Implication
8: Servers should delegate repeatedly read files to clients.
Observation 9: Almost all files are active (have opens, IO, and metadata access) for only
1-2 hours over the entire trace period, as indicated by the typical opens/read/write activity
of all access patterns. There are some regularly accessed files, but they are so few that
they do not affect the k-means analysis. The lack of regular access for most files means that
there is room for the server to employ techniques to increase capacity by doing compaction
on idle files.
Common techniques include deduplication and compression. The activity on these files indicates that the IO performance impact should be small. Even if run constantly, compaction
has a low probability of affecting an active file. Since common libraries like gzip optimize
for decompression [11], decompressing files at read time should have only a slight performance impact. Implication 9: Servers can use file idle time to compress or deduplicate data to increase storage capacity.
Figure 7: File extensions for engineering files. Fraction of file extensions in each file access pattern.
Observation 10: All files have either all random access or >70% sequential access. The
small random read and write files in both traces can benefit from being placed on media
with high random access performance, such as solid state drives (SSDs). Files with a high
percentage of sequential access can reside on traditional hard disk drives (HDDs), which
already optimize for sequential access. The bimodal IO sequentiality offers yet another
threshold-based placement algorithm – if a file has any sequential access, it is likely to have
a high percentage of sequential access; therefore place it on HDDs. Otherwise, place it on
SSDs. We note that there are more randomly accessed files than sequentially accessed files.
Even though sequential files tend to be larger, we still need to do a working set analysis
to determine the right size of server SSDs for each use case. Implication 10: Servers can
select the best storage medium for each file based only on access sequentiality.
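Implication 10's placement rule is a one-line threshold test; the helper below is a hypothetical sketch (names and the zero cutoff are our choices, leaning on the observation that any sequential access predicts mostly-sequential access):

```python
def place_file(read_seq_ratio, write_seq_ratio, seq_cutoff=0.0):
    """Tiering rule: sequentiality is bimodal, so any sequential access
    predicts a high percentage of sequential access. Put such files on
    HDDs (good sequential bandwidth) and fully random files on SSDs."""
    if read_seq_ratio > seq_cutoff or write_seq_ratio > seq_cutoff:
        return "hdd"
    return "ssd"

assert place_file(0.7, 0.0) == "hdd"   # sequentially read file
assert place_file(0.0, 0.0) == "ssd"   # small random read/write file
```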
Deepest subtrees
Deepest subtree access patterns help storage server designers develop per-directory policies.
We used 40 features to describe deepest subtrees (Table 6). Some of the features include
different percentiles of a characteristic, e.g., per-file read sequentiality as percentiles of all
files in a directory. Including different percentiles rather than just the average allows better
separation of access patterns. The corporate trace has 117,640 deepest subtrees, and the
engineering trace has 161,858. We use “directories” and “deepest subtrees” interchangeably.
In Table 6, we provide quantitative descriptions and short names for all the deepest subtree
access patterns. We derive the names using two types of information. First, we analyze
the file extensions in each subtree access pattern (Figures 8 and 9). Second, we examine how many files of each file access pattern are within each subtree pattern (Figure 10). For brevity, we show only the graph for corporate deepest subtrees. The graph for the engineering deepest subtrees conveys the same information with regard to our design insights.
For example, the “random read” and “client cacheable” labels come from looking at the IO
patterns. “Temporary directories” accounted for the .tmp files in those directories. “Mix read” and “mix write” directories considered the presence of both sequential and randomly accessed files in those directories.
Table 6: Deepest subtree access patterns. (a) Descriptive features for each subtree: number of hours with 1, 2-3, or >4 file opens; number of hours with 1-100KB, 100KB-1MB, or >1MB reads; number of hours with 1-100KB, 100KB-1MB, or >1MB writes; number of hours with 1, 2-3, or >4 metadata requests; read request size, write request size, and average time between IO requests (each as the 25th, 50th, and 75th percentile); and read sequentiality, write sequentiality, read:write ratio, repeated read ratio, and overwrite ratio, each both as the 25th, 50th, and 75th percentile of files in the subtree and aggregated across all files. (b) and (c): Short names and descriptions of subtrees in each corporate and engineering access pattern; listing only the features that help separate access patterns.
The metadata background noise remains visible at the subtree layer. The spread of file
extensions is similar to that for file access patterns – some file extensions congregate and
others spread evenly. Interestingly, some subtrees have a large fraction of metadata-only
files that do not affect the descriptions of those subtrees.
Figure 8: File extensions for corporate deepest subtrees. Fraction of file extensions in deepest subtree access patterns.
Some subtrees contain only files of a single access pattern (e.g., small random read subtrees in Figure 10). There, we can apply the design insights from the file access patterns to the
entire subtree. For example, the small random read subtrees can reside on SSDs. Since
there are more files than subtrees, per-subtree policies can lower the amount of policy
information kept at the server.
In contrast, the mix read and mix write directories contain both sequential and randomly
accessed files. Those subtrees need per-file policies: Place the sequentially accessed files
on HDDs and the randomly accessed files on SSDs. Soft links to files can preserve the user-facing directory organization, while allowing the server to optimize per-file placement.
The server should automatically decide when to apply per-file or per-subtree policies.
Observation 11: Directories with sequentially accessed files almost always contain randomly accessed files as well. Conversely, some directories with randomly accessed files will not contain sequentially accessed files. Thus, we can default all subtrees to per-subtree policies.
Concurrently, we track the IO sequentiality per subtree. If the sequentiality is above some
threshold, then the subtree switches to per-file policies. Implication 11: Servers can change
from per-directory placement policy (default) to per-file policy upon seeing any sequential
IO to any files in a directory.
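The default-then-switch policy of Implication 11 can be sketched as a small per-subtree state machine; the class name and the 1MB switching threshold are illustrative assumptions:

```python
class SubtreePolicy:
    """Subtrees default to a single per-subtree placement policy; once
    sequential IO in the subtree passes a (tunable, illustrative)
    threshold, flip to per-file policies, since sequential files almost
    always share directories with randomly accessed ones."""

    def __init__(self, seq_bytes_threshold=1 * 2**20):
        self.threshold = seq_bytes_threshold
        self.seq_bytes = 0
        self.mode = "per-subtree"

    def record_io(self, nbytes, sequential):
        if sequential:
            self.seq_bytes += nbytes
            if self.seq_bytes > self.threshold:
                self.mode = "per-file"
        return self.mode

tree = SubtreePolicy()
assert tree.record_io(64 * 1024, sequential=False) == "per-subtree"
assert tree.record_io(2 * 2**20, sequential=True) == "per-file"
```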
Observation 12: The client cacheable subtrees and temporary subtrees aggregate files with
repeated reads or overwrites. Additional computation showed that the repeated reads
and overwrites almost always come from a single client. Thus, it is possible for the entire
directory to be prefetched and delegated to the client. Delegating entire directories can
preempt all accesses that are local to a directory, but consumes client cache space. We
need to understand the tradeoffs through a more in-depth working set and temporal locality
analysis at both the file and deepest subtree levels. Implication 12: Servers can delegate
repeated read and overwrite directories entirely to clients, tradeoffs permitting.
Access Pattern Evolutions Over Time
We want to know if the access patterns are restricted to our particular tracing period or
if they persist across time. Only if the design insights remain relevant across time can we
rationalize their existence in similar use cases.
Figure 9: File extensions for engineering deepest subtrees. Fraction of file extensions in deepest subtree access patterns.
Figure 10: Corporate file access patterns within each deepest subtree. For each deepest subtree access pattern (i.e., each graph), showing the number of files belonging to each file access pattern that belong to subtrees in the subtree access pattern. Corporate file access pattern indices: 0. metadata only files; 1. sequential write files; 2. sequential read files; 3. small random write files; 4. small random read files; 5. less small random read files.
We do not have enough traces to generalize beyond our monitoring period. We investigate
the reverse problem - if we had to analyze traces from only a subset of our tracing period,
how would our results differ? We divided our traces into weeks and repeated the analysis
for each week. For brevity, we present only the results for weekly analysis of corporate
application instances and files. These two layers have yielded the most interesting design
insights and they highlight separate considerations at the client and server.
Figure 11 shows the result for files. All the large access patterns remain steady across
the weeks. However, the access pattern corresponding to the smallest number of files, the
small random write files, comes and goes week to week. Exactly two temporary, previously unseen access patterns appear, both very similar to the small random files. The
peaks in the metadata only files correspond to weeks that contain U.S. federal holidays or
weeks immediately preceding a holiday long weekend. Furthermore, the numerical values
of the descriptive features for each access pattern vary in a moderate range. For example,
the write sequentiality of the sequential write files ranges from 50% to 90%.
Figure 12 shows the result for application instances. We see no new access patterns, and
the fractional weight of each access pattern remains nearly constant, despite holidays.
Furthermore, the numerical values of descriptive features also remain nearly constant. For
example, the write sequentiality of the content update applications varies in a narrow range from 80% to 85%.
Figure 11: Corporate file access patterns over 8 weeks. All patterns remain (hollow markers), but the fractional weight of each changes greatly between weeks. Some small patterns (medium partly sequential write and small read-write files) temporarily appear and disappear (solid markers).
Figure 12: Corporate application instance access patterns over 8 weeks. All patterns remain with near constant fractional weight. No new patterns appear.
Thus, if we had done our analysis on just a week’s traces, we would have gotten nearly identical results for application instances, and qualitatively similar results for files. We
believe that the difference comes from the limited duration of client sessions and application
instances, versus the long-term persistence of files and subtrees.
Based on our results, we are confident that the access patterns are not restricted just to
our particular trace period. Future storage systems should continuously monitor the access
patterns at all levels, automatically adjusting policies as needed, and notify designers of
previously unseen access patterns.
We should always be cautious when generalizing access patterns from one use case to
another. For use cases with the same applications running on the same OS file API, we
expect to see the same application instance access patterns. Session access patterns such
as daily work sessions are also likely to be general. For the server side access patterns, we
expect the files and subtrees with large fractional weights to appear in other use cases.
Section 4 offered many specific optimizations for placement, caching, delegation, and consolidation decisions. We combine the insights here to speculate on the architecture of
future enterprise storage systems.
We see a clear separation of roles for clients and servers. The client design can target
high IO performance by a combination of efficient delegation, prefetching and caching of
the appropriate data. The servers should focus on increasing their aggregated efficiency
across clients: collaboration with clients (on caching, delegation, etc.) and exploiting user
patterns to schedule background tasks. Automating background tasks such as offline data
deduplication delivers capacity savings in a timely and hassle-free fashion, i.e., without
system downtime or explicit scheduling. Regarding caching at the server, we observe
that very few access patterns actually leverage the server’s buffer cache for data accesses.
Design insights 4-6, 8, and 12 indicate a heavy role for the client cache, and Design insight
7 suggests how not to use the server buffer cache: caching metadata only and acting as a
warm/backup cache for clients would result in lower latencies for many access patterns.
We also see simple ways to take advantage of new storage media such as SSDs. The
clear identification of sequential and random access file patterns enables efficient device-specific data placement algorithms (Design insights 10 and 11). Also, the background
metadata noise seen at all levels suggests that storage servers should both optimize for
metadata accesses and redesign client-server interactions to decrease the metadata chatter.
Depending on the growth of metadata and the performance requirements, we also need to
consider placing metadata on low latency, non-volatile media like flash or SSDs.
Furthermore, we believe that storage systems should introduce many monitoring points to
dynamically adjust the decision thresholds of placement, caching, or consolidation policies.
We need to monitor both clients and servers. For example, when repeated read and
overwrite files have been properly delegated to clients, the server would no longer see files
with such access patterns. Without monitoring points at the clients, we would not be able
to quantify the file delegation benefits. Storage systems should provide extensible tracing
APIs to expedite the collection of future long-term traces. This will facilitate future work
similar to ours.
We must address the storage technology trends toward ever-increasing scale, heterogeneity,
and consolidation. Current storage design paradigms that rely on existing trace analysis
methods are ill equipped to meet the emerging challenges because they are unidimensional,
focus only on the storage server, and are subject to designer bias. We showed that a multi-dimensional, multi-layered trace-driven design methodology leads to more objective design
points with highly targeted optimizations at both storage clients and servers. Using our
corporate and engineering use cases, we presented a number of insights that inform future
designs. We described in some detail the access patterns we observed, and we encourage
fellow storage system designers to extract further insights from our observations.
Future work includes exploring the dynamics of changing working sets and access sequences, with the goal of anticipating data accesses before they happen. Another worthwhile analysis is to look for optimization opportunities across clients; this requires collecting traces at different clients, instead of only at the server. Also, we would like to explore
opportunities for deduplication, compression, or data placement. Doing so requires extending our analysis from data movement patterns to also include data content patterns.
Furthermore, we would like to perform on-line analysis in live storage systems to enable
dynamic feedback on placement and optimization decisions. In addition, it would be useful to build tools to synthesize the access patterns, to enable designers to evaluate the
optimizations we proposed here.
We believe that storage system designers face an increasing challenge to anticipate access
patterns. Our paper builds the case that system designers can longer accurately anticipate
access patterns using intuition only. We believe that the corporate and engineering traces
from our corporate headquarters reflect use cases found at other traditional and high-tech
businesses. Other use cases would require us to perform the same trace collection and
analysis process to extract the same kind of “ground truth”. We also need similar studies
at regular intervals to track the evolving use of storage systems. We hope that this paper
contributes to an objective and principled design approach targeting rapidly changing data
access patterns.
NetApp, the NetApp logo, and Go further, faster are trademarks or registered trademarks
of NetApp, Inc. in the United States and/or other countries.
[1] N. Agrawal, W. J. Bolosky, J. R. Douceur, and J. R. Lorch. A Five-Year Study of
File-System Metadata. In FAST 2007.
[2] E. Alpaydin. Introduction to Machine Learning. MIT Press, Cambridge,
Massachusetts, 2004.
[3] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K. Ousterhout.
Measurements of a distributed file system. In SOSP 1991.
[4] P. Bodik, M. Goldszmidt, A. Fox, D. B. Woodard, and H. Andersen. Fingerprinting
the datacenter: automated classification of performance crises. In EuroSys 2010.
[5] Common Internet File System Technical Reference. Storage Network Industry
Association, 2002.
[6] IDC Whitepaper: The economics of Virtualization.
[7] J. R. Douceur and W. J. Bolosky. A Large-Scale Study of File-System Contents. In
[8] D. Ellard, J. Ledlie, P. Malkani, and M. Seltzer. Passive NFS Tracing of Email and
Research Workloads. In FAST 2003.
[9] A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and
D. Patterson. Predicting Multiple Metrics for Queries: Better Decisions Enabled by
Machine Learning. In ICDE 2009.
[10] S. Gribble, G. S. Manku, E. Brewer, T. J. Gibson, and E. L. Miller. Self-Similarity
in File Systems: Measurement and Applications. In SIGMETRICS 1998.
[11] The gzip algorithm.
[12] IDC Report: Worldwide File-Based Storage 2010-2014 Forecast Update.
[13] S. Kavalanekar, B. L. Worthington, Q. Zhang, and V. Sharda. Characterization of
storage workload traces from production Windows Servers. In IISWC 2008.
[14] Open Source Clustering Software - C Clustering Library., 2010.
[15] A. Leung, S. Pasupathy, G. Goodson, and E. Miller. Measurement and analysis of
large-scale network file system workloads. In USENIX ATC 2008.
[16] D. T. Meyer and W. J. Bolosky. A Study of Practical Deduplication. In FAST 2010.
[17] J. K. Ousterhout, H. D. Costa, D. Harrison, J. A. Kunze, M. Kupfer, and J. G.
Thompson. A trace-driven analysis of the Unix 4.2 BSD file system. In SOSP 1985.
[18] K. K. Ramakrishnan, P. Biswas, and R. Karedla. Analysis of file I/O traces in
commercial computing environments. In SIGMETRICS 1992.
[19] D. Roselli, J. Lorch, and T. Anderson. A comparison of file system workloads. In
USENIX 2000.
[20] I. Stoica. A Berkeley View of Big Data: Algorithms, Machines and People. UC
Berkeley EECS Annual Research Symposium, 2011.
[21] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song. Design and evaluation of a
real-time URL spam filtering service. In IEEE Symposium on Security and Privacy
[22] R. Villars. The Migration to Converged IT: What it Means for Infrastructure,
Applications, and the IT Organization. IDC Directions Conference 2011.
[23] VMware Whitepaper: Server Consolidation and Containment.
[24] W. Vogels. File system usage in Windows NT 4.0. In SOSP 1999.
[25] M. Zhou and A. J. Smith. Analysis of Personal Computer Workloads. In MASCOTS
Differentiated Storage Services
Michael Mesnier, Jason B. Akers, Feng Chen, Tian Luo
Intel Corporation
Hillsboro, OR
We propose an I/O classification architecture to close the widening semantic gap between
computer systems and storage systems. By classifying I/O, a computer system can request
that different classes of data be handled with different storage system policies. Specifically,
when a storage system is first initialized, we assign performance policies to predefined
classes, such as the filesystem journal. Then, online, we include a classifier with each I/O
command (e.g., SCSI), thereby allowing the storage system to enforce the associated policy
for each I/O that it receives.
Our immediate application is caching. We present filesystem prototypes and a database
proof-of-concept that classify all disk I/O — with very little modification to the filesystem,
database, and operating system. We associate caching policies with various classes (e.g.,
large files shall be evicted before metadata and small files), and we show that end-to-end
file system performance can be improved by over a factor of two, relative to conventional
caches like LRU. And caching is simply one of many possible applications. As part of our
ongoing work, we are exploring other classes, policies and storage system mechanisms that
can be used to improve end-to-end performance, reliability and security.
Categories and Subject Descriptors
D.4 [Operating Systems]; D.4.2 [Storage Management]: [Storage hierarchies]; D.4.3
[File Systems Management]: [File organization]; H.2 [Database Management]
General Terms
Classification, quality of service, caching, solid-state storage
∗The Ohio State University
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
SOSP '11, October 23-26, 2011, Cascais, Portugal.
Copyright © 2011 ACM 978-1-4503-0977-6/11/10 ... $10.00.
[Figure 1: High-level architecture. In the computer system, the application layer and file system layer classify I/O and assign policies, and the block layer binds classes to I/O commands. In the storage system, the storage controller extracts classes from commands, and QoS mechanisms enforce per-class QoS policies across storage pools (Pool A, Pool B, Pool C). The two halves communicate over the storage transport.]
The block-based storage interface is arguably the most stable interface in computer systems
today. Indeed, the primary read/write functionality is quite similar to that used by the
first commercial disk drive (IBM RAMAC, 1956). Such stability has allowed computer and
storage systems to evolve in an independent yet interoperable manner, but at a cost
– it is difficult for computer systems to optimize for increasingly complex storage system
internals, and storage systems do not have the semantic information (e.g., on-disk FS and
DB data structures) to optimize independently.
By way of analogy, shipping companies have long recognized that classification is the key to
providing differentiated service. Boxes are often classified (kitchen, living room, garage),
assigned different policies (deliver-first, overnight, priority, handle-with-care), and thusly
treated differently by a shipper (hand-carry, locked van, truck). Separating classification
from policy allows customers to pack and classify (label) their boxes once; the handling
policies can be assigned on demand, depending on the shipper. And separating policy from
mechanism frees customers from managing the internal affairs of the shipper, like which
pallets to place their shipments on.
In contrast, modern computer systems expend considerable effort attempting to manage
storage system internals, because different classes of data often need different levels of
service. As examples, the “middle” of a disk can be used to reduce seek latency, and the
“outer tracks” can be used to improve transfer speeds. But, with the increasing complexity
of storage systems, these techniques are losing their effectiveness — and storage systems
can do very little to help because they lack the semantic information to do so.
We argue that computer and storage systems should operate in the same manner as the
shipping industry — by utilizing I/O classification. In turn, this will enable storage systems
to enforce per-class QoS policies. See Figure 1.
Differentiated Storage Services is such a classification framework: I/O is classified in the
computer system (e.g., filesystem journal, directory, small file, database log, index, ...),
policies are associated with classes (e.g., an FS journal requires low-latency writes, and a
database index requires low-latency reads), and mechanisms in the storage system enforce
policies (e.g., a cache provides low latency).
Our approach only slightly modifies the existing block interface, so eventual standardization and widespread adoption are practical. Specifically, we modify the OS block layer so
that every I/O request carries a classifier. We copy this classifier into the I/O command
(e.g., SCSI CDB), and we specify policies on classes through the management interface of
the storage system. In this way, a storage system can provide block-level differentiated
services (performance, reliability, or security) — and do so on a class-by-class basis. The
[Table 1: An example showing FS classes (e.g., small file, large file) mapped to various performance policies: Vendor A offers service levels, Vendor B offers performance targets (low latency, high bandwidth), and Vendor C offers relative priorities. This paper focuses on priorities; lower numbers are higher priority.]
storage system does not need any knowledge of computer system internals, nor does the
computer system need knowledge of storage system internals.
Classifiers describe what the data is, and policies describe how the data is to be managed.
Classifiers are handles that the computer system can use to assign policies and, in our
SCSI-based prototypes, a classifier is just a number used to distinguish various filesystem
classes, like metadata versus data. We also have user-definable classes that, for example, a
database can use to classify I/O to specific database structures like an index. Defining the
classes (the classification scheme) should be an infrequent operation that happens once for
each filesystem or database of interest.
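Concretely, a classification scheme of this kind reduces to a small enumeration. The sketch below is illustrative only: the class names and numeric values are hypothetical, not the exact IDs used in our prototypes, but each ID must fit in the 5-bit SCSI field.

```c
#include <assert.h>

/* Hypothetical classification scheme: each on-disk structure gets a
 * small numeric class ID that fits in the 5-bit SCSI Group Number
 * field (values 0..31). */
enum io_class {
    CLASS_UNCLASSIFIED = 0,
    CLASS_METADATA     = 1,
    CLASS_JOURNAL      = 2,
    CLASS_DIRECTORY    = 3,
    CLASS_FILE_SMALL   = 4,   /* e.g., files <= 4KB */
    CLASS_FILE_LARGE   = 5
    /* classes 16..31 could be reserved for applications */
};

/* A classifier is valid only if it fits in the 5-bit field. */
static int class_fits_scsi(int c) {
    return c >= 0 && c < 32;
}
```

A real scheme would be fixed once per filesystem or database, as described above, with the numeric values agreed between the computer system and the storage system's management interface.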
In contrast, we expect that policies will vary across storage systems, and that vendors
will differentiate themselves through the policies they offer. As examples, storage system vendors may offer service levels (platinum, gold, silver, bronze), performance levels
(bandwidth and latency targets), or relative priority levels (the approach we take in this
paper). A computer system must map its classes to the appropriate set of policies, and
I/O classification provides a convenient way to do this dynamically when a filesystem or
database is created on a new storage system. Table 1 shows a hypothetical mapping of
filesystem classes to available performance policies, for three different storage systems.
Beyond performance, there could be numerous other policies that one might associate with
a given class, such as replication levels, encryption and integrity policies, perhaps even
data retention policies (e.g., secure erase). Rather than attempt to send all of this policy
information along with each I/O, we simply send a classifier. This will make efficient use
of the limited space in an I/O command (e.g., SCSI has 5 bits that we use as a classifier).
In the storage system the classifier can be associated with any number of policies.
We begin with a priority-based performance policy for cache management, specifically for
non-volatile caches composed of solid-state drives (SSDs). That is, to each FS and DB
class we assign a caching policy (a relative priority level). In practice, we assume that the
filesystem or database vendor, perhaps in partnership with the storage system vendor, will
provide a default priority assignment that a system administrator may choose to tune.
We present prototypes for Linux Ext3 and Windows NTFS, where I/O is classified as
metadata, journal, directory, or file, and file I/O is further classified by the file size (e.g.,
≤4KB, ≤16KB, ..., >1GB). We assign a caching priority to each class: metadata, journal,
and directory blocks are highest priority, followed by regular file data. For the regular files,
we give small files higher priority than large ones.
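The size-based part of this classification can be sketched as a simple bucketing function. The thresholds and class numbering below are illustrative assumptions that follow the buckets named above (≤4KB, ≤16KB, ..., >1GB); the prototypes' exact cut-offs may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mapping from file size to a size-bucket class.
 * Smaller buckets get numerically lower classes, which the
 * reference policy treats as higher caching priority. */
static int size_to_class(uint64_t bytes) {
    const uint64_t KB = 1024, MB = 1024 * KB, GB = 1024 * MB;
    if (bytes <= 4 * KB)  return 0;
    if (bytes <= 16 * KB) return 1;
    if (bytes <= 64 * KB) return 2;
    if (bytes <= 1 * MB)  return 3;
    if (bytes <= 64 * MB) return 4;
    if (bytes <= 1 * GB)  return 5;
    return 6; /* > 1GB: lowest caching priority */
}
```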
These priority assignments reflect our goal of reserving cache space for metadata and small
files. To this end, we introduce two new block-level caching algorithms: selective allocation
and selective eviction. Selective allocation uses the priority information when allocating
I/O in a cache, and selective eviction uses this same information during eviction. The
end-to-end performance improvements of selective caching are considerable. Relative to
conventional LRU caching, we improve the performance of a file server by 1.8x, an email server by 2x, and metadata-intensive FS utilities (e.g., find and fsck) by up to 6x.
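The selective eviction idea can be sketched in a few lines of user-space C. The fixed four-slot cache, the LRU tie-break within a priority level, and the helper names below are illustrative assumptions, not the prototype's actual implementation.

```c
#include <assert.h>

#define CACHE_SLOTS 4

struct cache_entry {
    int  valid;
    int  block;     /* cached block number */
    int  priority;  /* lower number = higher priority */
    long last_use;  /* logical clock for LRU within a priority level */
};

static struct cache_entry cache[CACHE_SLOTS];
static long clock_tick;

/* Selective eviction sketch: prefer a free slot; otherwise evict the
 * entry with the lowest priority (largest number), breaking ties by
 * least-recent use. Returns the chosen slot index. */
static int select_victim(void) {
    int victim = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!cache[i].valid)
            return i;                         /* free slot available */
        if (cache[i].priority > cache[victim].priority ||
            (cache[i].priority == cache[victim].priority &&
             cache[i].last_use < cache[victim].last_use))
            victim = i;
    }
    return victim;
}

static void cache_insert(int block, int priority) {
    int slot = select_victim();
    cache[slot] =
        (struct cache_entry){1, block, priority, ++clock_tick};
}
```

Selective allocation would add a check on the insert path as well, e.g., refusing to allocate cache space at all for the lowest-priority classes when the cache is under pressure.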
Furthermore, a TCO analysis by Intel IT Research shows that priority-based caching can
reduce caching costs by up to 50%, as measured by the acquisition cost of hard drives and SSDs.
It is important to note that in both of our FS prototypes, we do not change which logical
blocks are being accessed; we simply classify I/O requests. Our design philosophy is that
the computer system continues to see a single logical volume and that the I/O into that
volume be classified. In this sense, classes can be considered “hints” to the storage system.
Storage systems that know how to interpret the hints can optimize accordingly, otherwise
they can be ignored. This makes the solution backward compatible, and therefore suitable
for legacy applications.
To further show the flexibility of our approach, we present a proof-of-concept classification
scheme for PostgreSQL [33]. Database developers have long recognized the need for intelligent buffer management in the database [10] and in the operating system [45]; buffers
are often classified by type (e.g., index vs. table) and access pattern (e.g., random vs. sequential). To share this knowledge with the storage system, we propose a POSIX file flag
(O_CLASSIFIED). When a file is opened with this flag, the OS extracts classification information from a user-provided data buffer that is sent with each I/O request and, in turn,
binds the classifier to the outgoing I/O command. Using this interface, we can easily classify all DB I/O, with only minor modification to the DB and the OS. This same interface
can be used by any application. Application-level classes will share the classification space
with the filesystem — some of the classifier bits can be reserved for applications, and the
rest for the filesystem.
This paper is organized as follows. Section 2 motivates the need for Differentiated Storage
Services, highlighting the shortcomings of the block interface and building a case for block-level differentiation. Alternative designs, not based on I/O classification, are discussed. We
present our design in Section 3, our FS prototypes and DB proof-of-concept in Section 4,
and our evaluation in Section 5. Related work is presented in Section 6, and we conclude
in Section 7.
The contemporary challenge motivating Differentiated Storage Services is the integration
of SSDs, as caches, into conventional disk-based storage systems. The fundamental limitation imposed by the block layer (lack of semantic information) is what makes effective
integration so challenging. Specifically, the block layer abstracts computer systems from
the details of the underlying storage system, and vice versa.
2.1 Computer system challenges
Computer system performance is often determined by the underlying storage system, so
filesystems and databases must be smart in how they allocate on-disk data structures. As
examples, the journal (or log) is often allocated in the middle of a disk drive to minimize
the average seek distance [37], files are often created close to their parent directories, and
file and directory data are allocated contiguously whenever possible. These are all attempts
by a computer system to obtain some form of differentiated service through intelligent block allocation.
Unfortunately, the increasing complexity of storage systems is making intelligent allocation
difficult. Where is the “middle” of the disk, for example, when a filesystem is mounted
atop a logical volume with multiple devices, or perhaps a hybrid disk drive composed
of NAND and shingled magnetic recording? Or, how do storage system caches influence
the latency of individual read/write operations, and how can computer systems reliably
manage performance in the context of these caches? One could use models [27, 49, 52] to
predict performance, but if the predicted performance is undesirable there is very little a
computer system can do to change it.
In general, computer systems have come to expect only best-effort performance from their
storage systems. In cases where performance must be guaranteed, dedicated and over-provisioned solutions are deployed.
2.2 Storage system challenges
Storage systems already offer differentiated service, but only at a coarse granularity (logical
volumes). Through the management interface of the storage system, administrators can
create logical volumes with the desired capacity, reliability, and performance characteristics
— by appropriately configuring RAID and caching.
However, before an I/O enters the storage system, valuable semantic information is stripped
away at the OS block layer, such as user, group, application, and process information.
And, any information regarding on-disk structures is obfuscated. This means that all I/O
receives the same treatment within the logical volume.
For a storage system to provide any meaningful optimization within a volume, it must have
semantic computer system information. Without help from the computer system, this can
be very difficult to get. Consider, for example, that a filename could influence how a file is
cached [26], and what would be required for a storage system to simply determine the
name of a file associated with a particular I/O. Not only would the storage system need to
understand the on-disk metadata structures of the filesystem, particularly the format of
directories and their filenames, but it would have to track all I/O requests that modify these
structures. This would be an extremely difficult and potentially fragile process. Expecting
storage systems to retain sufficient and up-to-date knowledge of the on-disk structures for
each of its attached computer systems may not be practical, or even possible, to realize in practice.
2.3 Attempted solutions & shortcomings
Three schools of thought have emerged to better optimize the I/O between a computer
and storage system. Some show that computer systems can obtain more knowledge of
storage system internals and use this information to guide block allocation [11, 38]. In
some cases, this means managing different storage volumes [36], often foregoing storage
system services like RAID and caching. Others show that storage systems can discover
more about on-disk data structures and optimize I/O accesses to these structures [9, 41,
42, 43]. Still others show that the I/O interface can evolve and become more expressive;
object-based storage and type-safe disks fall into this category [28, 40, 58].
Unfortunately, none of these approaches has gained significant traction in the industry.
First, increasing storage system complexity is making it difficult for computer systems to
reliably gather information about internal storage structure. Second, increasing computer
system complexity (e.g., virtualization, new filesystems) is creating a moving target for
semantically-aware storage systems that learn about on-disk data structures. And third,
although a more expressive interface could address many of these issues, our industry has
developed around a block-based interface, for better or for worse. In particular, filesystem
and database vendors have a considerable amount of intellectual property in how blocks
are managed and would prefer to keep this functionality in software, rather than offload
to the storage system through a new interface.
When a new technology like solid-state storage emerges, computer system vendors prefer
to innovate above the block level, and storage system vendors below. But, this tug-of-war
has no winner as far as applications are concerned, because considerable optimization is
left on the table.
We believe that a new approach is needed. Rather than teach computer systems about
storage system internals, or vice versa, we can have them agree on shared, block-level
goals — and do so through the existing storage interfaces (SCSI and ATA). This will not
introduce a disruptive change in the computer and storage systems ecosystem, thereby
allowing computer system vendors to innovate above the block level, and storage system
vendors below. To accomplish this, we require a means by which block-level goals can be
communicated with each I/O request.
Differentiated Storage Services closes the semantic gap between computer and storage
systems, but does so in a way that is practical in an industry built around blocks. The
problem is not the block interface, per se, but a lack of information as to how disk blocks
are being used.
We must be careful, though, not to give a storage system too much information, as this could
break interoperability. So, we simply classify I/O requests and communicate block-level
goals (policies) for each class. This allows storage systems to provide meaningful levels of
differentiation, without requiring that detailed semantic information be shared.
3.1 Operating system requirements
We associate a classifier with every block I/O request in the OS. In UNIX and Windows,
we add a classification field to the OS data structure for block I/O (the Linux “BIO,” and
the Windows “IRP”) and we copy this field into the actual I/O command (SCSI or ATA)
before it is sent to the storage system. The expressiveness of this field is only limited by
its size, and in Section 4 we present a SCSI prototype where a 5-bit SCSI field can classify
I/O in up to 32 ways.
In addition to adding the classifier, we modify the OS I/O scheduler, which is responsible for coalescing contiguous I/O requests, so that requests with different classifiers are
never coalesced. Otherwise, classification information would be lost when two contiguous
requests with different classifiers are combined. This does reduce a scheduler’s ability to coalesce I/O, but the benefits gained from providing differentiated service to the uncoalesced
requests justify the cost, and we quantify these benefits in Section 5.
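In user-space terms, the scheduler restriction amounts to one extra clause in the merge test. The request structure and the can_merge helper below are hypothetical simplifications of the kernel code, not the actual scheduler interface.

```c
#include <assert.h>

/* Hypothetical block request: starting sector, length in sectors,
 * and the I/O classifier carried with the request. */
struct request {
    long sector;
    long nr_sectors;
    int  io_class;
};

/* Sketch of the scheduler's merge test: two requests may be
 * coalesced only if they are contiguous AND carry the same
 * classifier, so no classification information is lost. */
static int can_merge(const struct request *a, const struct request *b) {
    return a->sector + a->nr_sectors == b->sector &&
           a->io_class == b->io_class;
}
```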
The OS changes needed to enable filesystem I/O classification are minor. In Linux, we
have a small kernel patch. In Windows, we use closed-source filter drivers to provide the
same functionality. Section 4 details these changes.
3.2 Filesystem requirements
First, a filesystem must have a classification scheme for its I/O, and this is to be designed
by a developer that has a good understanding of the on-disk FS data structures and their
performance requirements. Classes should represent blocks with similar goals (e.g., journal
blocks, directory blocks, or file blocks); each class has a unique ID. In Section 4, we present
our prototype classification schemes for Linux Ext3 and Windows NTFS.
Then, the filesystem developer assigns a policy to each class; refer back to the hypothetical
examples given in Table 1. How this policy information is communicated to the storage
system can be vendor specific, such as through an administrative GUI, or even standardized. The Storage Management Initiative Specification (SMI-S) is one possible avenue for
this type of standardization [3]. As a reference policy, also presented in Section 4, we use
a priority-based performance policy for storage system cache management.
Once mounted, the filesystem classifies I/O as per the classification scheme. And blocks
may be reclassified over time. Indeed, block reuse in the filesystem (e.g., file deletion or
defragmentation) may result in frequent reclassification.
3.3 Storage system requirements
Upon receipt of a classified I/O, the storage system must extract the classifier, lookup the
policy associated with the class, and enforce the policy using any of its internal mechanisms;
legacy systems without differentiated service can ignore the classifier. The mechanisms
used to enforce a policy are completely vendor specific, and in Section 4 we present our
prototype mechanism (priority-based caching) that enforces the FS-specified performance policy.
Because each I/O carries a classifier, the storage system does not need to record the class
of each block. Once allocated from a particular storage pool, the storage system is free to
discard the classification information. So, in this respect, Differentiated Storage Services
is a stateless protocol. However, if the storage system wishes to later move blocks across
storage pools, or otherwise change their QoS, it must do so in an informed manner. This
must be considered, for example, during de-duplication. Blocks from the same allocation
pool (hence, same QoS) can be de-duplicated. Blocks from different pools cannot.
If the classification of a block changes due to block re-use in the filesystem, the storage
system must reflect that change internally. In some cases, this may mean moving one
or more blocks across storage pools. In the case of our cache prototype, a classification
change can result in cache allocation, or the eviction of previously cached blocks.
3.4 Application requirements
Applications can also benefit from I/O classification; two good examples are databases
and virtual machines. To allow for this, we propose a new file flag O_CLASSIFIED. When a
file is opened with this flag, we overload the POSIX scatter/gather operations (readv and
writev) to include one extra list element. This extra element points to a 1-byte user buffer
that contains the classification ID of the I/O request. Applications not using scatter/gather
I/O can easily convert each I/O to a 2-element scatter/gather list. Applications already
issuing scatter/gather need only create the additional element.
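A sketch of the application side follows. Because O_CLASSIFIED is a proposed flag rather than part of stock POSIX, the example only constructs the 2-element scatter/gather list; build_classified_iov is a hypothetical helper, and on an unmodified kernel writev would simply treat the trailing classifier byte as ordinary file data.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Turn an ordinary write buffer into the 2-element scatter/gather
 * list used by the proposed O_CLASSIFIED interface: element 0 is the
 * real payload, element 1 points to a 1-byte classification ID.
 * Returns the element count to pass to writev(). */
static int build_classified_iov(struct iovec iov[2],
                                void *data, size_t len,
                                unsigned char *class_id) {
    iov[0].iov_base = data;      /* the real payload */
    iov[0].iov_len  = len;
    iov[1].iov_base = class_id;  /* trailing 1-byte classifier */
    iov[1].iov_len  = 1;
    return 2;
}
```

With the proposed VFS change in place, the kernel would strip the final element, bind the classifier to the block I/O request, and write only the payload.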
Next, we modify the OS virtual file system (VFS) in order to extract this classifier from each
readv() and writev() request. Within the VFS, we know to inspect the file flags when
processing each scatter/gather operation. If a file handle has the O_CLASSIFIED flag set, we
extract the I/O classifier and reduce the scatter/gather list by one element. The classifier
is then bound to the kernel-level I/O request, as described in Section 3.1. Currently, our
user-level classifiers override the FS classifiers. If a user-level class is specified on a file
I/O, the filesystem classifiers will be ignored.
Without further modification to POSIX, we can now explore various ways of differentiating user-level I/O. In general, any application with complex, yet structured, block
relationships [29] may benefit from user-level classification. In this paper, we begin with
the database and, in Section 4, present a proof-of-concept classification scheme for PostgreSQL [33]. By simply classifying database I/O requests (e.g., user tables versus indexes),
we provide a simple way for storage systems to optimize access to on-disk database structures.
We present our implementations of Differentiated Storage Services, including two filesystem prototypes (Linux Ext3 and Windows NTFS), one database proof-of-concept (Linux
PostgreSQL), and two storage system prototypes (SW RAID and iSCSI). Our storage systems implement a priority-based performance policy, so we map each class to a priority
level (refer back to Table 1 for other possibilities). For the FS, the priorities reflect our goal
to reduce small random access in the storage system, by giving small files and metadata
higher priority than large files. For the DB, we simply demonstrate the flexibility of our
approach by assigning caching policies to common data structures (indexes, tables, and logs).
4.1 OS changes needed for FS classification
The OS must provide in-kernel filesystems with an interface for classifying each of their
I/O requests. In Linux, we do this by adding a new classification field to the FS-visible
kernel data structure for disk I/O (struct buffer_head). This code fragment illustrates
how Ext3 can use this interface to classify the OS disk buffers into which an inode (class
5 in this example) will be read:
bh->b_class = 5;     /* classify inode buffer */
submit_bh(READ, bh); /* submit read request   */
Once the disk buffers associated with an I/O are classified, the OS block layer has the information needed to classify the block I/O request used to read/write the buffers. Specifically,
it is in the implementation of submit_bh that the generic block I/O request (the BIO) is
generated, so it is here that we copy in the FS classifier:
int submit_bh(int rw, struct buffer_head *bh) {
    ...
    bio->bi_class = bh->b_class; /* copy in class */
    submit_bio(rw, bio);         /* issue read    */
    ...
    return ret;
}
Finally, we copy the classifier once again from the BIO into the 5-bit, vendor-specific Group
Number field in byte 6 of the SCSI CDB. This one-line change is all that is needed to enable
classification at the SCSI layer:
SCpnt->cmnd[6] = SCpnt->request->bio->bi_class;
These 5 bits are included with each WRITE and READ command, and we can fill this field
in up to 32 different ways (2^5). An additional 3 reserved bits could also be used to classify
data, allowing for up to 256 classifiers (2^8), and there are ways to grow even beyond this
if necessary (e.g., other reserved bits, or extended SCSI commands).
In general, adding I/O classification to an existing OS is a matter of tracking an I/O as
it proceeds from the filesystem, through the block layer, and down to the device drivers.
Whenever I/O requests are copied from one representation to another (e.g., from a buffer
head to a BIO, or from a BIO to a SCSI command), we must remember to copy the
classifier. Beyond this, the only other minor change is to the I/O scheduler which, as
previously mentioned, must be modified so that it only coalesces requests that carry the
same classifier.
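To make the scheduler constraint concrete, a back-merge test might look like the following sketch (struct and field names are illustrative, not the actual Linux elevator interface):

```c
#include <stdbool.h>

/* Illustrative sketch of an I/O scheduler merge check: requests are
 * coalesced only when they are physically contiguous AND carry the
 * same classifier. Struct and field names are hypothetical. */
struct io_request {
    unsigned long long sector;  /* starting sector */
    unsigned int nr_sectors;    /* length in sectors */
    unsigned char io_class;     /* 5-bit classifier */
};

static bool can_back_merge(const struct io_request *rq,
                           const struct io_request *next)
{
    if (rq->sector + rq->nr_sectors != next->sector)
        return false;                       /* not contiguous */
    return rq->io_class == next->io_class;  /* same class only */
}
```

Without the class comparison, the block layer could merge, say, a journal write with an adjacent data write, and the merged request would carry only one of the two classifiers.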
Overall, adding classification to the Linux block layer requires that we modify 10 files (156
lines of code), which results in a small kernel patch. Table 2 summarizes the changes. In
Windows, the changes are confined to closed-source filter drivers. No kernel code needs to
be modified because, unlike Linux, Windows provides a stackable filter driver architecture
for intercepting and modifying I/O requests.
Table 2: Linux 2.6.34 files modified for I/O classification. Modified lines of
code (LOC) shown. The changes add the classifier to the block-layer data structures
(including buffer_head.h), copy the classifier as an I/O moves between representations,
merge only I/O of the same class, classify file sizes, and insert the classifier into the
SCSI CDB.
Table 3: Reference classes and caching priorities for Ext3. Each class is assigned
a unique SCSI Group Number and assigned a priority (0 is highest). The classes (with
class IDs) are: unclassified (0); metadata, including the Group Descriptor and Indirect
block (1-5); the Directory entry (6); the Journal entry (7); and the file-size classes
from File <= 4KB through File > 1GB (8-18).
4.2 Filesystem prototypes
A filesystem developer must devise a classification scheme and assign storage policies to
each class. The goals of the filesystem (performance, reliability, or security) will influence
how I/O is classified and policies are assigned.
4.2.1 Reference classification scheme
The classification schemes for the Linux Ext3 and Windows NTFS are similar, so we only
present Ext3. Any number of schemes could have been chosen, and we begin with one
well-suited to minimizing random disk access in the storage system. The classes include
metadata blocks, directory blocks, journal blocks, and regular file blocks. File blocks are
further classified by the file size (≤4KB, ≤16KB, ≤64KB, ≤256KB, ..., ≤1GB, >1GB) —
11 file size classes in total.
The goal of our classification scheme is to provide the storage system with a way of prioritizing which blocks get cached and the eviction order of cached blocks. Because metadata
and small files can be responsible for the majority of disk seeks,
we classify I/O in such a way that we can separate these random requests from large-file
requests that are commonly accessed sequentially. Database I/O is an obvious exception
and, in Section 4.3 we introduce a classification scheme better suited for the database.
Table 3 (first two columns) summarizes our classification scheme for Linux Ext3. Every
disk block that is written or read falls into exactly one class.
Table 4: Ext3 changes for Linux 2.6.34. The modified files classify block bitmaps,
inode tables, and inode bitmaps; indirect blocks, inodes, directories, and file sizes;
superblocks, journal blocks, and group descriptors; and journal I/O.
Class 0 (unclassified) occurs
when I/O bypasses the Ext3 filesystem. In particular, all I/O created during filesystem
creation (mkfs) is unclassified, as there is no mounted filesystem to classify the I/O. The
next 5 classes (superblocks through indirect data blocks) represent filesystem metadata,
as classified by Ext3 after it has been mounted. Note that the unclassified metadata blocks
will be re-classified as one of these metadata types when they are first accessed by Ext3.
Although we differentiate metadata classes 1 through 5, we could have combined them
into one class. For example, it is not critical that we differentiate superblocks and block
bitmaps, as these structures consume very little disk (and cache) space. Still, we do this
for illustrative purposes and system debugging.
Continuing, class 6 represents directory blocks, class 7 journal blocks, and 8-18 are the file
size classes. File size classes are only approximate. As a file is being created, the file size
is changing while writes are being issued to the storage system; files can also be truncated.
Subsequent I/O to a file will reclassify the blocks with the latest file size.
Approximate file sizes allow the storage system to differentiate small files from large files.
For example, a storage system can cache all files 1MB or smaller, by caching all the file
blocks with a classification up to 1MB. The first 1MB of files larger than 1MB may also
fall into this category until they are later reclassified. This means that small files will fit
entirely in cache, and large files may be partially cached with the remainder stored on disk.
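The approximate file-size buckets can be sketched as a small helper (hypothetical, not code from the prototype) that maps a byte count to a class ID, with class 8 for files up to 4KB and class 18 for files larger than 1GB:

```c
/* Map a file size to one of the 11 file-size classes (8..18).
 * Thresholds grow by 4x: 4KB, 16KB, 64KB, ..., 1GB, then >1GB.
 * Hypothetical helper; boundary values follow the prose above. */
static int size_to_class(unsigned long long size)
{
    unsigned long long limit = 4096;        /* <= 4KB is class 8 */
    int cls = 8;
    while (limit <= (1ULL << 30) && size > limit) {
        limit *= 4;                         /* next bucket */
        cls++;
    }
    return cls;                             /* > 1GB ends at 18 */
}
```

Because the classifier travels with every WRITE, reclassification after a truncate or append is just a matter of issuing subsequent I/O with the new class.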
We classify Ext3 using 18 of the 32 available classes from a 5-bit classifier. To implement
this classification scheme, we modify 8 Ext3 files (126 lines of code). Table 4 summarizes
our changes.
The remaining classes (19 through 31) could be used in other ways by the FS (e.g., text
vs. binary, media file, bootable, read-mostly, or hot file), and we are exploring these as part
of our future work. The remaining classes could also be used by user-level applications,
like the database.
4.2.2 Reference policy assignment
Our prototype storage systems implement 16 priorities; to each class we assign a priority
(0 is the highest). Metadata, journal, and directory blocks are highest priority, followed by
the regular file blocks. 4KB files are higher priority than 16KB files, and so on. Unclassified
I/O, or the unused metadata created during file system creation, is assigned the lowest
priority. For this mapping, we only require 13 priorities, so 3 of the priority levels (13-15)
are unused. See Table 3.
This priority assignment is specifically tuned for a file server workload (e.g., SPECsfs), as
we will show in Section 5, and reflects our bias to optimize the filesystem for small files
and metadata. Should this goal change, the priorities could be set differently. Should the
storage system offer policies other than priority levels (like those in Table 1), the FS classes
would need to be mapped accordingly.
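A minimal sketch of such a mapping, assuming the class IDs of Section 4.2.1 (the exact per-class values here are our reading of the prose, not the published table):

```c
/* Map an Ext3 class ID to one of 16 cache priorities (0 highest).
 * Metadata, directory, and journal classes (1-7) share priority 0;
 * the file-size classes (8-18) get priorities 1-11, smaller files
 * first; unclassified I/O (class 0) gets the lowest used priority,
 * 12. Levels 13-15 are unused. Values are illustrative. */
static int class_to_priority(int class_id)
{
    if (class_id >= 1 && class_id <= 7)
        return 0;              /* metadata/dir/journal: highest */
    if (class_id >= 8 && class_id <= 18)
        return class_id - 7;   /* smaller files -> higher priority */
    return 12;                 /* unclassified: lowest used */
}
```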
4.3 Database proof-of-concept
In addition to the kernel-level I/O classification interface described in Section 4.1, we
provide a POSIX interface for classifying user-level I/O. The interface builds on the scatter/gather functionality already present in POSIX.
Using this new interface, we classify all I/O from the PostgreSQL open source database [33].
As with FS classification, user-level classifiers are just numbers used to distinguish the
various I/O classes, and it is the responsibility of the application (a DB in this case) to
design a classification scheme and associate storage system policies with each class.
4.3.1 A POSIX interface for classifying DB I/O
We add an additional scatter/gather element to the POSIX readv and writev system
calls. This element points to a user buffer that contains a classifier for the given I/O. To
use our interface, a file is opened with the flag O_CLASSIFIED. When this flag is set, the
OS will assume that all scatter/gather operations contain 1 + n elements, where the first
element points to a classifier buffer and the remaining n elements point to data buffers.
The OS can then extract the classifier buffer, bind the classifier to the kernel-level I/O
(as described in Section 4.1), reduce the number of scatter/gather elements by one, and
send the I/O request down to the filesystem. Table 5 summarizes the changes made to
the VFS to implement user-level classification. As with kernel-level classification, this is a
small kernel patch.
The following code fragment illustrates the concept for a simple program with a 2-element
gathered-write operation:
unsigned char class = 23;                  /* a class ID */
int fd = open("foo", O_RDWR|O_CLASSIFIED);
struct iovec iov[2];                       /* an sg list */
iov[0].iov_base = &class; iov[0].iov_len = 1;
iov[1].iov_base = "Hello, world!";
iov[1].iov_len = strlen("Hello, world!");
rc = writev(fd, iov, 2);                   /* 2 elements */
The filesystem will classify the file size as described in Section 4.2, but we immediately
override this classification with the user-level classification, if it exists. Combining user-level and FS-level classifiers is an interesting area of future work.
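In user space, the effect of the kernel-side extraction can be mimicked as follows (a sketch; `extract_class` is a hypothetical helper, not part of the prototype's interface):

```c
#include <sys/uio.h>

/* Peel the classifier element off the front of a scatter/gather list,
 * as the VFS would for a file opened with O_CLASSIFIED: return the
 * class and advance the list past the first element, leaving only
 * the data buffers. */
static unsigned char extract_class(struct iovec **iov, int *iovcnt)
{
    unsigned char c = *(unsigned char *)(*iov)[0].iov_base;
    (*iov)++;       /* remaining elements are the data buffers */
    (*iovcnt)--;
    return c;
}
```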
4.3.2 A DB classification scheme
Our proof-of-concept PostgreSQL classification scheme includes the transaction log, system
tables, free space maps, temporary tables, user tables, and indexes. And we further classify
the user tables by their access pattern, which the PostgreSQL database already identifies,
internally, as random or sequential. Passing this access pattern information to the storage
system avoids the need for (laborious) sequential stream detection.
Table 6 summarizes our proof-of-concept DB classification scheme, and Table 7 shows the
minor changes required of PostgreSQL. We include this database example to demonstrate
the flexibility of our approach and the ability to easily classify user-level I/O. How to
properly assign block-level caching priorities for the database is part of our current research,
but we do share some early results in Section 5 to demonstrate the performance potential.
Table 5: Linux changes for user-level classification. The modified OS files extract
the class from the scatter/gather list and add the classifier to the readahead, page read,
and FS page read paths.
Table 6: A classification scheme for PostgreSQL. The classes are the Transaction
Log, System table, Free space map, Temporary table, Random user table, Sequential user
table, and Index file. Each class is assigned a unique number, which is copied into the
5-bit SCSI Group Number field in the SCSI WRITE and READ commands.
Table 7: PostgreSQL changes. The modified DB files pass the classifier to the storage
manager; classify the transaction log, indexes, system tables, and regular tables; classify
sequential vs. random access; assign SCSI group numbers; and add the classifier to the
scatter/gather list while classifying temporary tables.
4.4 Storage system prototypes
With the introduction of solid-state storage, storage system caches have increased in popularity. Examples include LSI's CacheCade and Adaptec's MaxIQ. Each of these systems
uses solid-state storage as a persistent disk cache in front of a traditional disk-based RAID array.
We create similar storage system caches and apply the necessary modifications to take
advantage of I/O classification. In particular, we introduce two new caching algorithms:
selective allocation and selective eviction. These algorithms inspect the relative priority
of each I/O and, as such, provide a mechanism by which computer system performance
policies can be enforced in a storage system. These caching algorithms build upon a
baseline cache, such as LRU.
4.4.1 Our baseline storage system cache
Our baseline cache uses a conventional write-back cache with LRU eviction. Recent research shows that solid-state LRU caching solutions are not cost-effective for enterprise
workloads [31]. We confirm this result in our evaluation, but also build upon it by demonstrating that a conventional LRU algorithm can be cost-effective with Differentiated Storage Services. Algorithms beyond LRU [13, 25] may produce even better results.
A solid-state drive is used as the cache, and we divide the SSD into a configurable number
of allocation units. We use 8 sectors (4KB, a common memory page size) as the allocation
unit, and we initialize the cache by contiguously adding all of these allocation units to a
free list. Initially, this free list contains every sector of the SSD.
For new write requests, we allocate cache entries from this free list. Once allocated, the
entries are removed from the free list and added to a dirty list. We record the entries
allocated to each I/O, by saving the mapping in a hash table keyed by the logical block address.
A syncer daemon monitors the size of the free list. When the free list drops below a low
watermark, the syncer begins cleaning the dirty list. The dirty list is sorted in LRU order.
As dirty entries are read or written, they are moved to the end of the dirty list. In this
way, the syncer cleans the least recently used entries first. Dirty entries are read from the
SSD and written back to the disk. As entries are cleaned, they are added back to the free
list. The free list is also sorted in LRU order, so if clean entries are accessed while in the
free list, they are moved to the end of the free list.
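The bookkeeping described above can be reduced to the following toy sketch (counts only; the LRU ordering of both lists and the hash-table mapping are elided, and all names and sizes are illustrative):

```c
#define CACHE_UNITS 8   /* cache size in 4KB allocation units (tiny, for illustration) */

static int free_count = CACHE_UNITS;   /* entries on the free list */
static int dirty_count = 0;            /* entries on the dirty list */

/* Allocate one unit for a new write; returns -1 if the caller must
 * wait for the syncer to clean entries. */
static int cache_write_alloc(void)
{
    if (free_count == 0)
        return -1;
    free_count--;
    dirty_count++;     /* write-back: new entries start out dirty */
    return 0;
}

/* Syncer: clean least-recently-used dirty entries (read from SSD,
 * write back to disk) until the free list reaches `high` units. */
static void syncer_run(int high)
{
    while (free_count < high && dirty_count > 0) {
        dirty_count--;
        free_count++;
    }
}
```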
It is atop this baseline cache that we implement selective allocation and selective eviction.
4.4.2 Conventional allocation
Two heuristics are commonly used by current storage systems when deciding whether to
allocate an I/O request in the cache. These relate to the size of the request and its access
pattern (random or sequential). For example, a 256KB request in NTFS tells you that
the file the I/O is directed to is at least 256KB in size, and multiple contiguous 256KB
requests indicate that the file may be larger. It is the small random requests that benefit
most from caching, so large requests or requests that appear to be part of a sequential
stream will often bypass the cache, as such requests are just as efficiently served from disk.
There are at least two fundamental problems with this approach.
First, the block-level request size is only partially correlated with file size. Small files can
be accessed with large requests, and large files can be accessed with small requests. It all
depends on the application request size and caching model (e.g., buffered or direct). A
classic example of this is the NTFS Master File Table (MFT). This key metadata structure
is a large, often sequentially written file. Though when read, the requests are often small
and random. If a storage system were to bypass the cache when the MFT is being written,
subsequent reads would be forced to go to disk. Fixing this problem would require that the
MFT be distinguished from other large files and, without an I/O classification mechanism,
this would not be easy.
The second problem is that operating systems have a maximum request size (e.g., 512KB).
If one were to make a caching decision based on request size, one could not differentiate
file sizes that were larger than this maximum request. This has not been a problem with
small DRAM caches, but solid-state caches are considerably larger and can hold many
files. So, knowing that a file is, say, 1MB as opposed to 1GB is useful when making a
caching decision. For example, it can be better to cache more small files than fewer large
ones, which is particularly the case for file servers that are seek-limited from small files
and their metadata.
4.4.3 Selective allocation
Because of the above problems, we do not make a cache allocation decision based on
request size. Instead, for the FS prototypes, we differentiate metadata from regular files,
and we further differentiate the regular files by size.
Metadata and small files are always cached. Large files are conditionally cached. Our
current implementation checks to see if the syncer daemon is active (cleaning dirty entries), which indicates cache pressure, and we opt to not cache large files in this case (our
configurable cut-off is 1MB or larger — such blocks will bypass the cache). However, an
idle syncer daemon indicates that there is space in the cache, so we choose to cache even
the largest of files.
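As a sketch, the allocation decision reduces to a few lines (the class-ID boundary is our assumption from Section 4.2.1, treating the file-size buckets above 1MB as "large"):

```c
#include <stdbool.h>

/* Selective allocation: metadata and small files are always cached;
 * blocks of large files bypass the cache when the syncer is active
 * (i.e., under cache pressure). Class IDs follow the assumed Ext3
 * scheme, with 13-18 taken as the file-size buckets above 1MB. */
static bool should_cache(int class_id, bool syncer_active)
{
    bool large_file = (class_id >= 13 && class_id <= 18);
    if (large_file && syncer_active)
        return false;   /* fence off large files under pressure */
    return true;        /* metadata and small files: always cache */
}
```

Note that no request size appears in the decision; the classifier alone distinguishes a small file from the first blocks of a large one.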
4.4.4 Selective eviction
Selective eviction is similar to selective allocation in its use of priority information. Rather
than evict entries in strict LRU order, we evict the lowest priority entries first. This is
accomplished by maintaining a dirty list for each I/O class. When the number of free
cache entries reaches a low watermark, the syncer cleans the lowest priority dirty list first.
When that list is exhausted, it selects the next lowest priority list, and so on, until a high
watermark of free entries is reached and the syncer is put to sleep.
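The victim-selection step can be sketched as follows (each per-class dirty list is represented only by its length; names are illustrative):

```c
#define NUM_PRIORITIES 16

/* Selective eviction: the syncer always cleans from the
 * lowest-priority non-empty dirty list first. Priority 0 is the
 * highest, so scan from 15 downward; return -1 if nothing is
 * dirty. */
static int pick_eviction_priority(const int dirty_len[NUM_PRIORITIES])
{
    for (int p = NUM_PRIORITIES - 1; p >= 0; p--)
        if (dirty_len[p] > 0)
            return p;
    return -1;
}
```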
With selective eviction, we can completely fill the cache without the risk of priority inversion. For an FS, this allows the caching of larger files, but not at the expense of evicting
smaller files. Large files will evict themselves under cache pressure, leaving the small files
and metadata effectively pinned in the cache. High priority I/O will only be evicted after
all lower priority data has been evicted. As we illustrate in our evaluation, small files and
metadata rarely get evicted in our enterprise workloads, which contain realistic mixes of
small and large file size [29].
4.4.5 Linux implementation
We implement a SW cache as RAID level 9 in the Linux RAID stack (MD); RAID-9 is not
a standard RAID level, but simply a way for us to create cached volumes in Linux MD.
The mapping to RAID is a natural one. RAID levels (e.g., 0, 1, 5) and the nested versions
(e.g., 10, 50) simply define a static mapping from logical blocks within a volume to physical
blocks on storage devices. RAID-0, for example, specifies that logical blocks will be allocated
round-robin. A Differentiated Storage Services architecture, in comparison, provides a dynamic
mapping. In our implementation, the classification scheme and associated policies provide
a mapping to either the cache device or the storage device, though one might also consider
a mapping to multiple cache levels or different storage pools.
Managing the cache as a RAID device allows us to build upon existing RAID management
utilities. We use the Linux mdadm utility to create a cached volume. One simply specifies
the storage device and the caching device (devices in /dev), both of which may be another
RAID volume. For example, the cache device may be a mirrored pair of SSDs, and the
storage device a RAID-50 array. Implementing Differentiated Storage Services in this
manner makes for easy integration into existing storage management utilities.
Our SW cache is implemented in a kernel RAID module that is loaded when the cached
volume is created; information regarding the classification scheme and priority assignment
are passed to the module as runtime parameters. Because the module is part of the kernel,
I/O requests are terminated in the block layer and never reach the SCSI layer. The I/O
classifiers are, therefore, extracted directly from the block I/O requests (BIOs), not the
5-bit classification field in the SCSI request.
4.4.6 iSCSI implementation
Our second storage system prototype is based on iSCSI [12]. Unlike the RAID-9 prototype,
iSCSI is OS-independent and can be accessed by both Linux and Windows. In both cases,
the I/O classifier is copied into the SCSI request on the host. On the iSCSI target the
I/O classifier is extracted from the request, the priority of the I/O class is determined,
and a caching decision is made. The caching implementation is identical to the RAID-9 prototype.
5 Evaluation
We evaluate our filesystem prototypes using a file server workload (based on SPECsfs [44]),
an e-mail server workload (modeled after the Swiss Internet Analysis [30]), a set of filesystem utilities (find, tar, and fsck), and a database workload (TPC-H [47]).
We present data from the Linux RAID-9 implementation for the filesystem workloads;
NTFS results using our iSCSI prototype are similar. For Linux TPC-H, we use iSCSI.
5.1 Experimental setup
All experiments are run on a single Linux machine. Our Linux system is a 2-way quad-core
Xeon server system (8 cores) with 8GB of RAM. We run Fedora 13 with a 2.6.34 kernel
modified as described in Section 4. As such, the Ext3 filesystem is modified to classify all
I/O, the block layer copies the classification into the Linux BIO, and the BIO is consumed
by our cache prototype (a kernel module running in the Linux RAID (MD) stack).
Our storage device is a 5-disk LSI RAID-1E array. Atop this base device we configure a
cache as described in Section 4.4.5, or 4.4.6 (for TPC-H); an Intel 32GB X25-E SSD is
used as the cache. For each of our tests, we configure a cache that is a fraction of the used
disk capacity (10-30%).
5.2 Workloads
Our file server workload is based on SPECsfs2008 [44]; the file size distributions are shown
in Table 8 (File server). The setup phase creates 262,144 files in 8,738 directories (SFS
specifies 30 files per directory). The benchmark performs 262,144 transactions against
this file pool, where a transaction is reading an existing file or creating a new file. The
read/write ratio is 2:1. The total capacity used by this test is 184GB, and we configure
an 18GB cache (10% of the file pool size). We preserve the file pool at the end of the file
server transactions and run a set of filesystem utilities. Specifically, we search for a non-existent
file (find), archive the filesystem (tar), and then check the filesystem for errors (fsck).
Table 8: File size distributions for the file server and e-mail server workloads.
Our e-mail server workload is based on a study of e-mail server file sizes [30]. We use
a request size of 4KB and a read/write ratio of 2:1. The setup phase creates 1 million
files in 1,000 directories. We then perform 1 million transactions (reading or creating an
e-mail) against this file pool. The file size distribution for this workload is shown in Table 8
(E-mail server). The total disk capacity used by this test is 204GB, and we configure a
20GB cache.
Finally, we run the TPC-H decision support workload [47] atop our modified PostgreSQL [33]
database (Section 4.3). Each PostgreSQL file is opened with the flag O_CLASSIFIED,
thereby enabling user-level classification and disabling file size classification from Ext3.
We build a database with a scale factor of 8, resulting in an on-disk footprint of 29GB,
and we run the I/O intensive queries (2, 17, 18, and 19) back-to-back. We compare 8GB
LRU and LRU-S caches.
5.3 Test methodology
We use an in-house, file-based workload generator for the file and e-mail server workloads. As input, the generator takes a file size distribution, a request size distribution, a
read/write ratio, and the number of subdirectories.
For each workload, our generator creates the specified number of subdirectories and, within
these subdirectories, creates files using the specified file and write request size distribution.
After the pool is created, transactions are performed against the pool, using these same
file and request size distributions. We record the number of files written/read per second
and, for each file size, the 95th percentile (worst case) latency, or the time to write or read
the entire file.
We compare the performance of three storage configurations: no SSD cache, an LRU cache,
and an enhanced LRU cache (LRU-S) that uses selective allocation and selective eviction.
For the cached tests, we also record the contents of the cache on a class-by-class basis,
the read hit rate, and the eviction overhead (percentage of transferred blocks related to
cleaning the cache). These three metrics are performance indicators used to explain the
performance differences between LRU and LRU-S. Elapsed time is used as the performance
metric in all tests.
Figure 2: SFS results. Cache contents and breakdown of blocks written/read: (a) LRU
cache and I/O breakdown; (b) LRU-S cache and I/O breakdown.
Figure 3: SFS performance indicators: (a) read hit rate; (b) syncer overhead; (c) SFS
performance.
5.4 File server
Figure 2a shows the contents of the LRU cache at completion of the benchmark (left
bar), the percentage of blocks written (middle bar), and the percentage of blocks read
(right bar). The cache bar does not exactly add to 100% due to round-off. Although
the cache activity (and contents) will naturally differ across applications, these results are
representative for a given benchmark across a range of different cache sizes.
As shown in the figure, the LRU breakdown is similar to the blocks written and read.
Most of the blocks belong to large files — a tautology given the file sizes in SPECsfs2008
(most files are small, but most of the data is in large files). Looking again at the leftmost
bar, one sees that nearly the entire cache is filled with blocks from large files. The smallest
sliver of the graph (bottommost layer of cache bar) represents files up to 64KB in size.
Smaller files and metadata consume less than 1% and cannot be seen.
Figure 2b shows the breakdown of the LRU-S cache. The write and read breakdown are
identical to Figure 2a, as we are running the same benchmark, but we see a different
outcome in terms of cache utilization. Over 40% of the cache is consumed by files 64KB
and smaller, and metadata (bottommost layer) is now visible. Unlike LRU eviction alone,
selective allocation and selective eviction limit the cache utilization of large files. As
utilization increases, large-file blocks are the first to be evicted, thereby preserving small
files and metadata.
Figure 3a compares read hit rates. With a 10% cache, the read hit rate is approximately
10%. Given the uniformly random distribution of the SPECsfs2008 workload, this result
is expected. However, although the read hit rates are identical, the miss penalties are not.
In the case of LRU, most of the hits are to large files. In the case of LRU-S, the hits are to
small files and metadata (some of the classes consume less than 1% and round to 0 in Figure 2).
Given the random seeks associated with small files and metadata, it is better to miss on
large sequential files.
Figure 3b compares the overhead of the syncer daemon, where overhead is the percentage
of transferred blocks due to cache evictions. When a cache entry is evicted, the syncer
must read blocks from the cache device and write them back to the disk device — and
this I/O can interfere with application I/O. Selective allocation can reduce the job of the
syncer daemon by fencing off large files when there is cache pressure. As a result, we
see the percentage of I/O related to evictions drop by more than a factor of 3. This can
translate into more available performance for the application workload.
Finally, Figure 3c shows the actual performance of the benchmark. We compare the
performance of no cache, an LRU cache, and LRU-S. Performance is measured in running
time, so smaller is better. As can be seen in the graph, an LRU cache is only slightly better
than no cache at all, and an LRU-S cache is 80% faster than LRU. In terms of running
time, the no-cache run completes in 135 minutes, LRU in 124 minutes, and LRU-S in 69 minutes.
The large performance difference can also be measured by the improvement in file latencies.
Figures 4a and 4b compare the 95th percentile latency of write and read operations, where
latency is the time to write or read an entire file. The x-axis represents the file sizes (as
per SPECsfs2008) and the y-axis represents the reduction in latency relative to no cache
at all. Although LRU and LRU-S reduce write latency equally for many of the file sizes
(e.g., 1KB, 8KB, 256KB, and 512KB), LRU suffers from outliers that account for the
increase in 95th percentile latency. The bars that extend below the x-axis indicate that
LRU increased write latency relative to no cache, due to cache thrash. And the read
latencies show even more improvement with LRU-S. Files 256KB and smaller have latency
reductions greater than 50%, compared to the improvements in LRU which are much more
modest. Recall, with a 10% cache, only 10% of the working set can be cached. Whereas
LRU-S uses this 10% to cache small files and metadata, standard LRU wastes the cache
on large, sequentially-accessed files. Stated differently, the cache space we save by evicting
large files allows for many more small files to be cached.
5.5 E-mail server
The results from the e-mail server workload are similar to the file server. The read cache
hit rate for both LRU and LRU-S is 11%. Again, because the files are accessed with
a uniformly random distribution, the hit rate is correlated with the size of the working
set that is cached. The miss penalties are again quite different. LRU-S reduces the read
latency considerably. In this case, files 32KB and smaller see a large read latency reduction.
For example, the read latency for 2KB e-mails is 85ms, LRU reduces this to 21ms, and
LRU-S reduces this to 4ms (a reduction of 81% relative to LRU).
As a result of the reduced miss penalty and lower eviction overhead (reduced from 54% to
25%), the e-mail server workload is twice as fast when running with LRU-S. Without any
cache, the test completes the 1 million transactions in 341 minutes, LRU completes in 262
minutes, and LRU-S completes in 131 minutes.
Like the file server, an e-mail server is often throughput limited. By giving preference to
metadata and small e-mails, significant performance improvements can be realized. This
benchmark also demonstrates the flexibility of our FS classification approach. That is, our
file size classification is sufficient to handle both file and e-mail server workloads, which
have very different file size distributions.
Figure 4: SFS file latencies: (a) 95th percentile write latency; (b) 95th percentile read
latency.
Figure 5: TPC-H results. Cache contents and breakdown of blocks written/read: (a) LRU
cache and I/O breakdown; (b) LRU-S cache and I/O breakdown.
Figure 6: TPC-H performance indicators: (a) read hit rate; (b) syncer overhead; (c) TPC-H
performance.
5.6 FS utilities
The FS utilities further demonstrate the advantages of selective caching. Following the file
server workload, we search the filesystem for a non-existent file (find, a 100% read-only
metadata workload), create a tape archive of an SFS subdirectory (tar), and check the
filesystem (fsck).
For the find operation, the LRU configuration sees an 80% read hit rate, compared to
100% for LRU-S. As a result, LRU completes the find in 48 seconds, and LRU-S in 13
(a 3.7x speedup). For tar, LRU has a 5% read hit rate, compared to 10% for LRU-S.
Moreover, nearly 50% of the total I/O for LRU is related to syncer daemon activity, as
LRU write-caches the tar file, causing evictions of the existing cache entries and leading
to cache thrash. In contrast, the LRU-S fencing algorithm directs the tar file to disk. As
a result, LRU-S completes the archive creation in 598 seconds, compared to LRU's 850
seconds (a 42% speedup).
Finally, LRU completes fsck in 562 seconds, compared to 94 seconds for LRU-S (a 6x
speedup). Unlike LRU, LRU-S retains filesystem metadata in the cache, throughout all of
the tests, resulting in a much faster consistency check of the filesystem.
5.7 TPC-H
As one example of how our proof-of-concept DB can prioritize I/O, we give highest priority
to filesystem metadata, user tables, log files, and temporary tables; all of these classes are
managed as a single class (they share an LRU list). Index files are given lowest priority.
Unused indexes can consume a considerable amount of cache space and, in these tests,
are served from disk sufficiently fast. We discovered this when we first began analyzing
the DB I/O requests in our storage system. That is, classified I/O both identifies the
opportunity for cache optimization and provides the means by which the optimization
can be realized.
Figure 5 compares the cache contents of LRU and LRU-S. For the LRU test, most of the
cache is consumed by index files; user tables and temporary tables consume the remainder.
Because index files are created after the DB is created, it is understandable that they
consume such a large portion of the cache. In contrast, LRU-S fences off the index files,
leaving more cache space for user tables, which are often accessed randomly.
The end result is an improved cache hit rate (Figure 6a), slightly less cache cleaning
overhead (Figure 6b), and a 20% improvement in query time (Figure 6c). The non-cached
run completes all 4 queries in 680 seconds, LRU in 463 seconds, and LRU-S in 386 seconds.
Also, unlike the file and e-mail server runs, we see more variance in TPC-H running time
when not using LRU-S. This applies to both the non-cached run and the LRU run. Because
of this, we average over three runs and include error bars. As seen in Figure 6c, LRU-S
not only runs faster, but it also reduces performance outliers.
6 Related work
File and storage system QoS is a heavily researched area. Previous work focuses on QoS
guarantees for disk I/O [54], QoS guarantees for filesystems [4], configuring storage systems
to meet performance goals [55], allocating storage bandwidth to application classes [46],
and mapping administrator-specified goals to appropriate storage system designs [48]. In
contrast, we approach the QoS problem with I/O classification, which benefits from a
coordinated effort between the computer system and the storage system.
More recently, providing performance differentiation (or isolation) has been an active area
of research, due to the increasing degree to which storage systems are shared within a
data center. Such techniques manage I/O scheduling to achieve fairness within a shared
storage system [17, 50, 53]. The work presented in this paper provides a finer granularity
of control (classes) for such systems.
Regarding caching, numerous works focus on flash and its integration into storage systems as a conventional cache [20, 23, 24]. However, because enterprise workloads often
exhibit poor locality of reference, it can be difficult to make conventional caches
cost-effective [31]. In contrast, we show that selective caching, even when applied to the
simplest of caching algorithms (LRU), can be cost-effective. Though we introduce selective
caching in the context of LRU [39], any of the more advanced caching algorithms could
be used, such as LRU-K [32], CLOCK-Pro [13], 2Q [15], ARC [25], LIRS [14], FBR [35],
MQ [59], and LRFU [19].
Our block-level selective caching approach is similar to FS-level approaches, such as Conquest [51] and zFS [36], where faster storage pools are reserved for metadata and small
files. Other block-level caching systems share similar goals but take different approaches.
In particular, Hystor [6] uses data migration to move metadata and other
latency-sensitive blocks into faster storage, and Karma [57] relies on a priori hints on
database block access patterns to improve multi-level caching.
The characteristics of flash [7] make it attractive as a medium for persistent transactions [34] and for hosting flash-based filesystems [16]. Other forms of byte-addressable non-volatile memory introduce additional filesystem opportunities [8].
Data migration [1, 2, 5, 6, 18, 21, 56] is, in general, complementary to the work presented in
this paper. However, migration can be expensive [22], so it is best to allocate space from
the appropriate storage device at file-creation time, whenever possible. Many files have
well-known patterns of access, making such allocation possible [26].
We are not the first to exploit semantic knowledge in the storage system. Most
notably, semantically-smart disks [43] and type-safe disks [40, 58] explore how knowledge
of on-disk data structures can be used to improve performance, reliability, and security.
But we differ, quite fundamentally, in that we send higher-level semantic information with
each I/O request, rather than detailed block information (e.g., inode structure) through
explicit management commands. Further, unlike this previous work, we do not offload
block management to the storage system.
7 Conclusion
The inexpressive block interface limits I/O optimization in two ways. First, computer
systems have difficulty optimizing around complex storage system internals; RAID,
caching, and non-volatile memory are good examples. Second, storage systems, lacking
semantic information, experience equal difficulty when trying to optimize I/O requests.
Yet, an entire computer industry has been built around blocks, so major changes to this
interface are, today, not practical. Differentiated Storage Services addresses this problem
with I/O classification. By adding a small classifier to the block interface, we can associate
QoS policies with I/O classes, thereby allowing computer systems and storage systems to
agree on shared, block-level policies. This will enable continued innovation on both sides
of the block interface.
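The idea of a small classifier carried with each block request can be sketched as follows. The class names, priority values, and fencing threshold here are hypothetical illustrations, not the actual classification scheme defined by Differentiated Storage Services.

```python
from dataclasses import dataclass

# Hypothetical class-to-policy table, for illustration only. The storage
# system looks up the QoS policy agreed upon for each class.
POLICY = {
    "metadata":   {"cache_priority": 0},   # highest priority
    "small_file": {"cache_priority": 1},
    "large_file": {"cache_priority": 2},
    "background": {"cache_priority": 3},   # lowest: fenced from the cache
}

@dataclass
class BlockRequest:
    lba: int          # logical block address
    length: int       # request length, in blocks
    is_write: bool
    io_class: str     # small classifier carried with each request

def admit_to_cache(req: BlockRequest, fence_threshold: int = 3) -> bool:
    """Storage-side policy: admit a request's blocks to the cache only if
    its class priority is above the fencing threshold."""
    return POLICY[req.io_class]["cache_priority"] < fence_threshold

print(admit_to_cache(BlockRequest(1024, 8, False, "metadata")))    # True
print(admit_to_cache(BlockRequest(2048, 64, True, "background")))  # False
```

The point of the sketch is that the block interface itself is unchanged; only a small class field is added per request, and all policy lives behind it on the storage side.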
Our filesystem prototypes show significant performance improvements when applied to
storage system caching, and our database proof-of-concept suggests similar improvements.
We are extending our work to other realms such as reliability and security. Over time, as
applications come to expect differentiated service from their storage systems, additional
usage models are likely to evolve.
Acknowledgments
We thank our Intel colleagues who helped contribute to the work presented in this paper,
including Terry Yoshii, Mathew Eszenyi, Pat Stolt, Scott Burridge, Thomas Barnes, and
Scott Hahn. We also thank Margo Seltzer for her very useful feedback.
References
[1] M. Abd-El-Malek, W. V. C. II, C. Cranor, G. R. Ganger, J. Hendricks, A. J. Klosterman,
M. Mesnier, M. Prasad, B. Salmon, R. R. Sambasivan, S. Sinnamohideen, J. D. Strunk,
E. Thereska, M. Wachs, and J. J. Wylie. Ursa Minor: versatile cluster-based storage. In
Proceedings of the 4th USENIX Conference on File and Storage Technologies, San
Francisco, CA, December 2005. The USENIX Association.
[2] A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau. Information and Control in Gray-Box
Systems. In Proceedings of the 18th ACM Symposium on Operating Systems Principles
(SOSP 01), Chateau Lake Louise, Banff, Canada, October 2001.
[3] S. N. I. Association. A Dictionary of Storage Networking Terminology.
[4] P. R. Barham. A Fresh Approach to File System Quality of Service. In Proceedings of the
IEEE 7th International Workshop on Network and Operating System Support for Digital
Audio and Video (NOSSDAV 97), St. Louis, MO, May 1997.
[5] M. Bhadkamkar, J. Guerra, L. Useche, S. Burnett, J. Liptak, R. Rangaswami, and
V. Hristidis. BORG: Block-reORGanization for Self-optimizing Storage Systems. In
Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST 09),
San Francisco, CA, February 2009. The USENIX Association.
[6] F. Chen, D. Koufaty, and X. Zhang. Hystor: Making the best use of solid state drives in
high performance storage systems. In Proceedings of the 25th ACM International
Conference on Supercomputing (ICS 2011), Tucson, AZ, May 31 - June 4 2011.
[7] F. Chen, D. A. Koufaty, and X. Zhang. Understanding Intrinsic Characteristics and System
Implications of Flash Memory based Solid State Drives. In Proceedings of the International
Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2009),
Seattle, WA, June 2009. ACM Press.
[8] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, D. Burger, B. C. Lee, and D. Coetzee.
Better I/O Through Byte-Addressable, Persistent Memory. In Proceedings of the 22nd
ACM Symposium on Operating Systems Principles (SOSP 09), Big Sky, MT, October 2009.
[9] X. Ding, S. Jiang, F. Chen, K. Davis, and X. Zhang. DiskSeen: Exploiting Disk Layout and
Access History to Enhance I/O Prefetch. In Proceedings of the 2007 USENIX Annual
Technical Conference, Santa Clara, CA, June 2007. The USENIX Association.
[10] W. Effelsberg and T. Haerder. Principles of database buffer management. ACM
Transactions on Database Systems (TODS), 9(4):560–595, December 1984.
[11] H. Huang, A. Hung, and K. G. Shin. FS2: Dynamic Data Replication in Free Disk Space for
Improving Disk Performance and Energy Consumption. In Proceedings of 20th ACM
Symposium on Operating System Principles, pages 263–276, Brighton, UK, October 2005.
ACM Press.
[12] Intel Corporation. Open Storage Toolkit.
[13] S. Jiang, F. Chen, and X. Zhang. CLOCK-Pro: An Effective Improvement of the CLOCK
Replacement. In Proceedings of the 2005 USENIX Annual Technical Conference (USENIX
ATC 2005), Anaheim, CA, April 10-15 2005. The USENIX Association.
[14] S. Jiang and X. Zhang. LIRS: An Efficient Low Inter-reference Recency Set Replacement
Policy to Improve Buffer Cache Performance. In Proceedings of the International
Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2002),
Marina Del Rey, CA, June 15-19 2002. ACM Press.
[15] T. Johnson and D. Shasha. 2Q: A Low Overhead High Performance Buffer Management
Replacement Algorithm. In Proceedings of the 20th International Conference on Very Large
Data Bases (VLDB’94), Santiago Chile, Chile, September 12-15 1994. Morgan Kaufmann.
[16] W. K. Josephson, L. A. Bongo, D. Flynn, and K. Li. DFS: A File System for Virtualized
Flash Storage. In Proceedings of the 8th USENIX Conference on File and Storage
Technologies (FAST 10), San Jose, CA, February 2010. The USENIX Association.
[17] M. Karlsson, C. Karamanolis, and X. Zhu. Triage: performance differentiation for storage
systems using adaptive control. ACM Transactions on Storage, 1(4):457–480, November 2005.
[18] S. Khuller, Y.-A. Kim, and Y.-C. J. Wan. Algorithms for data migration with cloning. In
Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of
Database Systems, San Diego, CA, June 2003. ACM Press.
[19] D. Lee, J. Choi, J. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim. LRFU: A Spectrum
of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies.
IEEE Transactions on Computers, 50(12):1352–1361, December 2001.
[20] A. Leventhal. Flash storage memory. In Communications of the ACM, volume 51(7), pages
47–51, July 2008.
[21] C. Lu, G. A. Alvarez, and J. Wilkes. Aqueduct: online data migration with performance
guarantees. In Proceedings of the 1st USENIX Conference on File and Storage Technologies
(FAST 02), Monterey, CA, January 2002. The USENIX Association.
[22] P. Macko, M. Seltzer, and K. A. Smith. Tracking Back References in a Write-Anywhere File
System. In Proceedings of the 8th USENIX Conference on File and Storage Technologies
(FAST 10), San Jose, CA, February 2010. The USENIX Association.
[23] B. Marsh, F. Douglis, and P. Krishnan. Flash memory file caching for mobile computers. In
Proceedings of the 27th Hawaii Conference on Systems Science, Wailea, HI, Jan 1994.
[24] J. Matthews, S. Trika, D. Hensgen, R. Coulson, and K. Grimsrud. Intel Turbo Memory:
Nonvolatile disk caches in the storage hierarchy of mainstream computer systems. In ACM
Transactions on Storage (TOS), volume 4, May 2008.
[25] N. Megiddo and D. S. Modha. Outperforming LRU with an Adaptive Replacement Cache
Algorithm. IEEE Computer Magazine, 37(4):58–65, April 2004.
[26] M. Mesnier, E. Thereska, G. Ganger, D. Ellard, and M. Seltzer. File classification in self-*
storage systems. In Proceedings of the 1st International Conference on Autonomic
Computing (ICAC-04), New York, NY, May 2004. IEEE Computer Society.
[27] M. Mesnier, M. Wachs, R. R. Sambasivan, A. Zheng, and G. R. Ganger. Modeling the
relative fitness of storage. In Proceedings of the International Conference on Measurement
and Modeling of Computer Systems (SIGMETRICS 2007), San Diego, CA, June 2007.
ACM Press.
[28] M. P. Mesnier, G. R. Ganger, and E. Riedel. Object-based Storage. IEEE Communications,
44(8):84–90, August 2003.
[29] D. T. Meyer and W. J. Bolosky. A Study of Practical Deduplication. In Proceedings of the
9th USENIX Conference on File and Storage Technologies (FAST 11), San Jose, CA, Feb
15-17 2011. The USENIX Association.
[30] O. Muller and D. Graf. Swiss Internet Analysis 2002.
[31] D. Narayanan, E. Thereska, A. Donnelly, S. Elnikety, and A. Rowstron. Migrating Server
Storage to SSDs: Analysis of Tradeoffs. In Proceedings of the 4th ACM European
Conference on Computer systems (EuroSys ’09), Nuremberg, Germany, March 31 - April 3
2009. ACM Press.
[32] E. J. O’Neil, P. E. O’Neil, and G. Weikum. The LRU-K page replacement algorithm for
database disk buffering. In Proceedings of the 1993 ACM International Conference on
Management of Data (SIGMOD ’93), Washington, D.C., May 26-28 1993. ACM Press.
[33] PostgreSQL Global Development Group. Open source database.
[34] V. Prabhakaran, T. L. Rodeheffer, and L. Zhou. Transactional Flash. In Proceedings of the
8th USENIX Symposium on Operating Systems Design and Implementation (OSDI 08),
San Diego, CA, December 2008. The USENIX Association.
[35] J. T. Robinson and M. V. Devarakonda. Data Cache Management Using Frequency-Based
Replacement. In Proceedings of the International Conference on Measurement and Modeling
of Computer Systems (SIGMETRICS 1990), Boulder, CO, May 22-25 1990. ACM Press.
[36] O. Rodeh and A. Teperman. zFS - A Scalable Distributed File System Using Object Disks.
In Proceedings of the 20th Goddard Conference on Mass Storage Systems (MSS’03), San
Diego, CA, April 2003. IEEE.
[37] C. Ruemmler and J. Wilkes. Disk shuffling. Technical Report HPL-91-156, Hewlett-Packard
Laboratories, October 1991.
[38] J. Schindler, J. L. Griffin, C. R. Lumb, and G. R. Ganger. Track-aligned Extents: Matching
Access Patterns to Disk Drive Characteristics. In Proceedings of the 1st USENIX
Conference on File and Storage Technologies (FAST 02), Monterey, CA, January 2002. The
USENIX Association.
[39] A. Silberschatz, P. B. Galvin, and G. Gagne. Operating Systems Concepts. Wiley, 8th
edition, 2009.
[40] G. Sivathanu, S. Sundararaman, and E. Zadok. Type-safe Disks. In Proceedings of the 7th
USENIX Symposium on Operating Systems Design and Implementation (OSDI 06),
Seattle, WA, November 2006. The USENIX Association.
[41] M. Sivathanu, L. N. Bairavasundaram, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau.
Life or Death at Block Level. In Proceedings of the 6th Symposium on Operating Systems
Design and Implementation (OSDI 04), pages 379–394, San Francisco, CA, December 2004.
The USENIX Association.
[42] M. Sivathanu, V. Prabhakaran, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau.
Improving Storage System Availability with D-GRAID. In Proceedings of the 3rd USENIX
Conference on File and Storage Technologies (FAST 04), pages 15–30, San Francisco, CA,
March 2004. The USENIX Association.
[43] M. Sivathanu, V. Prabhakaran, F. I. Popovici, T. E. Denehy, A. C. Arpaci-Dusseau, and
R. H. Arpaci-Dusseau. Semantically-Smart Disk Systems. In Proceedings of the 2nd USENIX
Conference on File and Storage Technologies (FAST 03), San Francisco, CA, March-April
2003. The USENIX Association.
[44] Standard Performance Evaluation Corporation. SPEC SFS.
[45] M. Stonebraker. Operating system support for database management. Communications of
the ACM, 24(7):412–418, July 1981.
[46] V. Sundaram and P. Shenoy. A Practical Learning-based Approach for Dynamic Storage
Bandwidth Allocation. In Proceedings of the Eleventh International Workshop on Quality of
Service (IWQoS 2003), Berkeley, CA, June 2003. Springer.
[47] Transaction Processing Performance Council. TPC Benchmark H.
[48] S. Uttamchandani, K. Voruganti, S. Srinivasan, J. Palmer, and D. Pease. Polus: Growing
Storage QoS Management Beyond a “4-Year Old Kid”. In Proceedings of the 3rd USENIX
Conference on File and Storage Technologies (FAST 04), San Francisco, CA, March 2004.
The USENIX Association.
[49] M. Uysal, G. A. Alvarez, and A. Merchant. A modular, analytical throughput model for
modern disk arrays. In Proceedings of the 9th International Symposium on Modeling
Analysis and Simulation of Computer and Telecommunications Systems
(MASCOTS-2001), Cincinnati, OH, August 2001. IEEE/ACM.
[50] M. Wachs, M. Abd-El-Malek, E. Thereska, and G. R. Ganger. Argon: Performance
Insulation for Shared Storage Servers. In Proceedings of the 5th USENIX Conference on
File and Storage Technologies (FAST 07), San Jose, CA, February 2007. The USENIX Association.
[51] A.-I. A. Wang, P. Reiher, G. J. Popek, and G. H. Kuenning. Conquest: Better performance
through a Disk/Persistent-RAM hybrid file system. In Proceedings of the 2002 USENIX
Annual Technical Conference (USENIX ATC 2002), Monterey, CA, June 2002. The
USENIX Association.
[52] M. Wang, K. Au, A. Ailamaki, A. Brockwell, C. Faloutsos, and G. R. Ganger. Storage
device performance prediction with CART models. In Proceedings of the 12th International
Symposium on Modeling Analysis and Simulation of Computer and Telecommunications
Systems (MASCOTS-2004), Volendam, The Netherlands, October 2004. IEEE.
[53] Y. Wang and A. Merchant. Proportional share scheduling for distributed storage systems. In
Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 07),
San Jose, CA, February 2007. The USENIX Association.
[54] R. Wijayaratne and A. L. N. Reddy. Providing QoS guarantees for disk I/O. Multimedia
Systems, 8(1):57–68, February 2000.
[55] J. Wilkes. Traveling to Rome: QoS specifications for automated storage system
management. In Proceedings of the 9th International Workshop on Quality of Service
(IWQoS 2001), Karlsruhe, Germany, June 2001.
[56] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID Hierarchical Storage
System. ACM Transactions on Computer Systems (TOCS), 14(1):108–136, February 1996.
[57] G. Yadgar, M. Factor, and A. Schuster. Karma: Know-it-All Replacement for a Multilevel
cAche. In Proceedings of the 5th USENIX Conference on File and Storage Technologies
(FAST 07), San Jose, CA, February 2007. The USENIX Association.
[58] C. Yalamanchili, K. Vijayasankar, E. Zadok, and G. Sivathanu. DHIS: discriminating
hierarchical storage. In Proceedings of The Israeli Experimental Systems Conference
(SYSTOR 09), Haifa, Israel, May 2009. ACM Press.
[59] Y. Zhou, J. F. Philbin, and K. Li. The Multi-Queue Replacement Algorithm for Second
Level Buffer Caches. In Proceedings of the 2001 USENIX Annual Technical Conference,
Boston, MA, June 25-30 2001. The USENIX Association.
A File is Not a File: Understanding the I/O
Behavior of Apple Desktop Applications
Tyler Harter, Chris Dragga, Michael Vaughn,
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Department of Computer Sciences
University of Wisconsin, Madison
{harter, dragga, vaughn, dusseau, remzi}
We analyze the I/O behavior of iBench, a new collection of productivity and multimedia application
workloads. Our analysis reveals a number of differences between iBench and typical file-system
workload studies, including the complex organization of modern files, the lack of pure sequential
access, the influence of underlying frameworks on I/O patterns, the widespread use of file synchronization and atomic operations, and the prevalence of threads. Our results have strong ramifications
for the design of next generation local and cloud-based storage systems.
The design and implementation of file and storage systems has long been at the forefront of computer
systems research. Innovations such as namespace-based locality [21], crash consistency via journaling [15, 29] and copy-on-write [7, 34], checksums and redundancy for reliability [5, 7, 26, 30],
scalable on-disk structures [37], distributed file systems [16, 35], and scalable cluster-based storage
systems [9, 14, 18] have greatly influenced how data is managed and stored within modern computer systems.
Much of this work in file systems over the past three decades has been shaped by measurement: the
deep and detailed analysis of workloads [4, 10, 11, 16, 19, 25, 33, 36, 39]. One excellent example
is found in work on the Andrew File System [16]; detailed analysis of an early AFS prototype led
to the next-generation protocol, including the key innovation of callbacks. Measurement helps us
understand the systems of today so we can build improved systems for tomorrow.
Whereas most studies of file systems focus on the corporate or academic intranet, most file-system
users work in the more mundane environment of the home, accessing data via desktop PCs, laptops,
and compact devices such as tablet computers and mobile phones. Despite the large number of
previous studies, little is known about home-user applications and their I/O patterns.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
SOSP '11, October 23-26, 2011, Cascais, Portugal.
Copyright © 2011 ACM 978-1-4503-0977-6/11/10 ... $10.00.
Home-user applications are important today, and their importance will increase as more users store
data not only on local devices but also in the cloud. Users expect to run similar applications across
desktops, laptops, and phones; therefore, the behavior of these applications will affect virtually every
system with which a user interacts. I/O behavior is especially important to understand since it greatly
impacts how users perceive overall system latency and application performance [12].
While a study of how users typically exercise these applications would be interesting, the first step
is to perform a detailed study of I/O behavior under typical but controlled workload tasks. This
style of application study, common in the field of computer architecture [40], is different from the
workload study found in systems research, and can yield deeper insight into how the applications are
constructed and how file and storage systems need to be designed in response.
Home-user applications are fundamentally large and complex, containing millions of lines of code [20].
In contrast, traditional UNIX-based applications are designed to be simple, to perform one task well,
and to be strung together to perform more complex tasks [32]. This modular approach of UNIX
applications has not prevailed [17]: modern applications are standalone monoliths, providing a rich
and continuously evolving set of features to demanding users. Thus, it is beneficial to study each
application individually to ascertain its behavior.
In this paper, we present the first in-depth analysis of the I/O behavior of modern home-user applications; we focus on productivity applications (for word processing, spreadsheet manipulation, and
presentation creation) and multimedia software (for digital music, movie editing, and photo management). Our analysis centers on two Apple software suites: iWork, consisting of Pages, Numbers, and
Keynote; and iLife, which contains iPhoto, iTunes, and iMovie. As Apple’s market share grows [38],
these applications form the core of an increasingly popular set of workloads; as device convergence
continues, similar forms of these applications are likely to access user files from both stationary
machines and moving cellular devices. We call our collection the iBench task suite.
To investigate the I/O behavior of the iBench suite, we build an instrumentation framework on top of
the powerful DTrace tracing system found inside Mac OS X [8]. DTrace allows us not only to monitor
system calls made by each traced application, but also to examine stack traces, in-kernel functions
such as page-ins and page-outs, and other details required to ensure accuracy and completeness.
We also develop an application harness based on AppleScript [3] to drive each application in the
repeatable and automated fashion that is key to any study of GUI-based applications [12].
Our careful study of the tasks in the iBench suite has enabled us to make a number of interesting
observations about how applications access and manipulate stored data. In addition to confirming
standard past findings (e.g., most files are small; most bytes accessed are from large files [4]), we find
the following new results.
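The two classic distributions cited above ("most files are small; most bytes accessed are from large files") can be computed directly from a trace of file sizes. The sizes and the 64KB "small" threshold below are invented for illustration, not drawn from the iBench traces.

```python
# Illustrative only: 90 small (4KB) files and 10 large (10MB) files.
sizes = [4096] * 90 + [10 * 2**20] * 10

SMALL = 64 * 1024                      # hypothetical smallness threshold
small = [s for s in sizes if s <= SMALL]

frac_files_small = len(small) / len(sizes)        # count-weighted view
frac_bytes_large = 1 - sum(small) / sum(sizes)    # byte-weighted view

print(f"{frac_files_small:.0%} of files are small")            # 90%
print(f"{frac_bytes_large:.1%} of bytes are in large files")   # 99.6%
```

The contrast between the count-weighted and byte-weighted views is exactly why both statistics appear together in workload studies: each answers a different design question (metadata layout vs. data placement).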
A file is not a file. Modern applications manage large databases of information organized into complex directory trees. Even simple word-processing documents, which appear to users as a “file”, are
in actuality small file systems containing many sub-files (e.g., a Microsoft .doc file is actually a FAT
file system containing pieces of the document). File systems should be cognizant of such hidden
structure in order to lay out and access data in these complex files more effectively.
Sequential access is not sequential. Building on the trend noticed by Vogels for Windows NT [39],
we observe that even for streaming media workloads, “pure” sequential access is increasingly rare.
Since file formats often include metadata in headers, applications often read and re-read the first
portion of a file before streaming through its contents. Prefetching and other optimizations might
benefit from a deeper knowledge of these file formats.
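The notion of "pure" sequential access can be made precise by splitting a read trace into maximal sequential runs; a header re-read of the kind described above breaks an otherwise streaming access into multiple runs. A minimal sketch (offsets and lengths are illustrative, in KB):

```python
def sequential_runs(accesses):
    """Split a list of (offset, length) reads into maximal sequential
    runs: a run continues while each read starts exactly where the
    previous one ended. Returns a list of (start, end) intervals."""
    runs = []
    for off, length in accesses:
        if runs and off == runs[-1][1]:           # continues the last run
            runs[-1] = (runs[-1][0], off + length)
        else:                                      # seek: start a new run
            runs.append((off, off + length))
    return runs

# A 'streaming' pattern that first reads, then re-reads, the file header:
trace = [(0, 4), (0, 4), (4, 4), (8, 4), (12, 4)]
print(sequential_runs(trace))   # [(0, 4), (0, 16)] -- two runs, not one
```

Even though every byte is eventually read in order, the header re-read produces two runs, which is what a purely sequential prefetcher would mispredict.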
Auxiliary files dominate. Applications help users create, modify, and organize content, but user
files represent a small fraction of the files touched by modern applications. Most files are helper files
that applications use to provide a rich graphical experience, support multiple languages, and record
history and other metadata. File-system placement strategies might reduce seeks by grouping the
hundreds of helper files used by an individual application.
Writes are often forced. As the importance of home data increases (e.g., family photos), applications
are less willing to simply write data and hope it is eventually flushed to disk. We find that most written
data is explicitly forced to disk by the application; for example, iPhoto calls fsync thousands of
times in even the simplest of tasks. For file systems and storage, the days of delayed writes [22] may
be over; new ideas are needed to support applications that desire durability.
Renaming is popular. Home-user applications commonly use atomic operations, in particular
rename, to present a consistent view of files to users. For file systems, this may mean that transactional capabilities [23] are needed. It may also necessitate a rethinking of traditional means of
file locality; for example, placing a file on disk based on its parent directory [21] does not work as
expected when the file is first created in a temporary location and then renamed.
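The fsync-plus-rename behavior described in the last two findings is the standard POSIX idiom for durable, atomic file replacement; the sketch below is our own illustration of that idiom, not code taken from any of the applications studied.

```python
import os
import tempfile

def atomic_save(path, data):
    """Write data durably and atomically: write a temp file in the same
    directory, fsync it, then rename it over the destination. Readers
    see either the old contents or the new, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # force the data to disk (durability)
        # Atomic replace (consistency). Note the new content is first
        # created under a temporary name, which is exactly what defeats
        # parent-directory-based placement heuristics.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "doc.pages")
    atomic_save(target, b"old contents")
    atomic_save(target, b"new contents")
    with open(target, "rb") as f:
        print(f.read())   # b'new contents'
```

Each save costs a data write, an fsync, and a directory update, which is consistent with the heavy fsync and rename traffic observed in the traces.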
Multiple threads perform I/O. Virtually all of the applications we study issue I/O requests from a
number of threads; a few applications launch I/Os from hundreds of threads. Part of this usage stems
from the GUI-based nature of these applications; it is well known that threads are required to perform
long-latency operations in the background to keep the GUI responsive [24]. Thus, file and storage
systems should be thread-aware so they can better allocate bandwidth.
Frameworks influence I/O. Modern applications are often developed in sophisticated IDEs and
leverage powerful libraries, such as Cocoa and Carbon. Whereas UNIX-style applications often directly invoke system calls to read and write files, modern libraries put more code between applications
and the underlying file system; for example, including "cocoa.h" in a Mac application imports
112,047 lines of code from 689 different files [28]. Thus, the behavior of the framework, and not
just the application, determines I/O patterns. We find that the default behavior of some Cocoa APIs
induces extra I/O and possibly unnecessary (and costly) synchronizations to disk. In addition, use of
different libraries for similar tasks within an application can lead to inconsistent behavior between
those tasks. Future storage design should take these libraries and frameworks into account.
This paper contains four major contributions. First, we describe a general tracing framework for
creating benchmarks based on interactive tasks that home users may perform (e.g., importing songs,
exporting video clips, saving documents). Second, we deconstruct the I/O behavior of the tasks in
iBench; we quantify the I/O behavior of each task in numerous ways, including the types of files accessed (e.g., counts and sizes), the access patterns (e.g., read/write, sequentiality, and preallocation),
transactional properties (e.g., durability and atomicity), and threading. Third, we describe how these
qualitative changes in I/O behavior may impact the design of future systems. Finally, we present the
34 traces from the iBench task suite; by making these traces publicly available and easy to use, we
hope to improve the design, implementation, and evaluation of the next generation of local and cloud
storage systems.
The remainder of this paper is organized as follows. We begin by presenting a detailed timeline of
the I/O operations performed by one task in the iBench suite; this motivates the need for a systematic
study of home-user applications. We next describe our methodology for creating the iBench task
suite. We then spend the majority of the paper quantitatively analyzing the I/O characteristics of the
full iBench suite. Finally, we summarize the implications of our findings on file-system design.
[Figure 1 graphic: the top panel plots per-thread file accesses over the 75-second run, grouped by file type with per-category file counts and I/O totals (e.g., 118 multimedia files, 106MB; 218 .plist files, 0.8MB; 25 .strings files, 0.3MB); the bottom panel magnifies the reads and writes to the .doc file during the save, labeled by file section and sequential runs.]
Figure 1: Pages Saving A Word Document. The top graph shows the 75-second timeline of the
entire run, while the bottom graph is a magnified view of seconds 54 to 58. In the top graph, annotations on the left categorize files by type and indicate file count and amount of I/O; annotations on
the right show threads. Black bars are file accesses (reads and writes), with thickness logarithmically
proportional to bytes of I/O. / is an fsync; \ is a rename; X is both. In the bottom graph, individual
reads and writes to the .doc file are shown. Vertical bar position and bar length represent the offset
within the file and number of bytes touched. Thick white bars are reads; thin gray bars are writes.
Repeated runs are marked with the number of repetitions. Annotations on the right indicate the name
of each file section.
The I/O characteristics of modern home-user applications are distinct from those of UNIX applications studied in the past. To motivate the need for a new study, we investigate the complex I/O
behavior of a single representative task. Specifically, we report in detail the I/O performed over time
by the Pages (4.0.3) application, a word processor, running on Mac OS X Snow Leopard (10.6.2) as
it creates a blank document, inserts 15 JPEG images each of size 2.5MB, and saves the document as
a Microsoft .doc file.
Figure 1 shows the I/O this task performs (see the caption for a description of the symbols used).
The top portion of the figure illustrates the accesses performed over the full lifetime of the task: at a
high level, it shows that more than 385 files spanning six different categories are accessed by eleven
different threads, with many intervening calls to fsync and rename. The bottom portion of the
figure magnifies a short time interval, showing the reads and writes performed by a single thread
accessing the primary .doc productivity file. From this one experiment, we illustrate each finding
described in the introduction. We first focus on the single access that saves the user’s document
(bottom), and then consider the broader context surrounding this file save, where we observe a flurry
of accesses to hundreds of helper files (top).
A file is not a file. Focusing on the magnified timeline of reads and writes to the productivity .doc
file, we see that the file format comprises more than just a simple file. Microsoft .doc files are based
on the FAT file system and allow bundling of multiple files in the single .doc file. This .doc file
contains a directory (Root), three streams for large data (WordDocument, Data, and 1Table), and a
stream for small data (Ministream). Space is allocated in the file with three sections: a file allocation
table (FAT), a double-indirect FAT (DIF) region, and a ministream allocation region (Mini).
Sequential access is not sequential. The complex FAT-based file format causes random access
patterns in several ways: first, the header is updated at the beginning and end of the magnified access;
second, data from individual streams is fragmented throughout the file; and third, the 1Table stream
is updated before and after each image is appended to the WordDocument stream.
Auxiliary files dominate. Although saving the single .doc we have been considering is the sole
purpose of this task, we now turn our attention to the top timeline and see that 385 different files are
accessed. There are several reasons for this multitude of files. First, Pages provides a rich graphical
experience involving many images and other forms of multimedia; together with the 15 inserted
JPEGs, this requires 118 multimedia files. Second, users want to use Pages in their native language,
so application text is not hard-coded into the executable but is instead stored in 25 different .strings
files. Third, to save user preferences and other metadata, Pages uses a SQLite database (2 files) and
a number of key-value stores (218 .plist files).
Writes are often forced; renaming is popular. Pages uses both of these actions to enforce basic
transactional guarantees. It uses fsync to flush write data to disk, making it durable; it uses rename
to atomically replace old files with new files so that a file never contains inconsistent data. The
timeline shows these invocations numerous times. First, Pages regularly uses fsync and rename
when updating the key-value store of a .plist file. Second, fsync is used on the SQLite database.
Third, for each of the 15 image insertions, Pages calls fsync on a file named “tempData” (classified
as “other”) to update its automatic backup.
Multiple threads perform I/O. Pages is a multi-threaded application and issues I/O requests from
many different threads during the experiment. Using multiple threads for I/O allows Pages to avoid
blocking while I/O requests are outstanding. Examining the I/O behavior across threads, we see that
Thread 1 performs the most significant portion of I/O, but ten other threads are also involved. In most
cases, a single thread exclusively accesses a file, but it is not uncommon for multiple threads to share
a file.
Frameworks influence I/O. Pages was developed in a rich programming environment where frameworks such as Cocoa or Carbon are used for I/O; these libraries impact I/O patterns in ways the developer might not expect. For example, although the application developers did not bother to use fsync
or rename when saving the user’s work in the .doc file, the Cocoa library regularly uses these calls
to atomically and durably update relatively unimportant metadata, such as “recently opened” lists
stored in .plist files. As another example, when Pages tries to read data in 512-byte chunks from
the .doc, each read goes through the STDIO library, which only reads in 4 KB chunks. Thus, when
Pages attempts to read one chunk from the 1Table stream, seven unrequested chunks from the WordDocument stream are also incidentally read (offset 12039 KB). In other cases, regions of the .doc file
are repeatedly accessed unnecessarily. For example, around the 3KB offset, read/write pairs occur
dozens of times. Pages uses a library to write 2-byte words; each time a word is written, the library
reads, updates, and writes back an entire 512-byte chunk. Finally, we see evidence of redundancy
between libraries: even though Pages has a backing SQLite database for some of its properties, it also
uses .plist files, which function across Apple applications as generic property stores.
This one detailed experiment has shed light on a number of interesting I/O behaviors that indicate
that home-user applications are indeed different than traditional workloads. A new workload suite is
needed that more accurately reflects these applications.
Our goal in constructing the iBench task suite is two-fold. First, we would like iBench to be representative of the tasks performed by home users. For this reason, iBench contains popular applications
from the iLife and iWork suites for entertainment and productivity. Second, we would like iBench
to be relatively simple for others to use for file and storage system analysis. For this reason, we
automate the interactions of a home user and collect the resulting traces of I/O system calls. The
traces are available online at this site:
We now describe in more detail how we met these two goals.
To capture the I/O behavior of home users, iBench models the actions of a “reasonable” user interacting with iPhoto, iTunes, iMovie, Pages, Numbers, and Keynote. Since the research community
does not yet have data on the exact distribution of tasks that home users perform, iBench contains
tasks that we believe are common and uses files with sizes that can be justified for a reasonable user.
iBench contains 34 different tasks, each representing a home user performing one distinct operation.
If desired, these tasks could be combined to create more complex workflows and I/O workloads. The
six applications and corresponding tasks are as follows.
iLife iPhoto 8.1.1 (419): digital photo album and photo manipulation software. iPhoto stores photos
in a library that contains the data for the photos (which can be in a variety of formats, including
JPG, TIFF, and PNG), a directory of modified files, a directory of scaled down images, and two
files of thumbnail images. The library stores metadata in a SQLite database. iBench contains six
tasks exercising user actions typical for iPhoto: starting the application and importing, duplicating,
editing, viewing, and deleting photos in the library. These tasks modify both the image files and the
underlying database. Each of the iPhoto tasks operates on 400 2.5 MB photos, representing a user
who has imported 12 megapixel photos (2.5 MB each) from a full 1 GB flash card on his or her camera.
iLife iTunes 9.0.3 (15): a media player capable of both audio and video playback. iTunes organizes
its files in a private library and supports most common music formats (e.g., MP3, AIFF, WAVE,
AAC, and MPEG-4). iTunes does not employ a database, keeping media metadata and playlists in
both a binary and an XML file. iBench contains five tasks for iTunes: starting iTunes, importing and
playing an album of MP3 songs, and importing and playing an MPEG-4 movie. Importing requires
copying files into the library directory and, for music, analyzing each song file for gapless playback.
The music tasks operate over an album (or playlist) of ten songs while the movie tasks use a single
3-minute movie.
iLife iMovie 8.0.5 (820): video editing software. iMovie stores its data in a library that contains
directories for raw footage and projects, and files containing video footage thumbnails. iMovie supports both MPEG-4 and Quicktime files. iBench contains four tasks for iMovie: starting iMovie,
importing an MPEG-4 movie, adding a clip from this movie into a project, and exporting a project to
MPEG-4. The tasks all use a 3-minute movie because this is a typical length found from home users
on video-sharing websites.
iWork Pages 4.0.3 (766): a word processor. Pages uses a ZIP-based file format and can export to
DOC, PDF, RTF, and basic text. iBench includes eight tasks for Pages: starting up, creating and
saving, opening, and exporting documents with and without images and with different formats. The
tasks use 15-page documents.
iWork Numbers 2.0.3 (332): a spreadsheet application. Numbers organizes its files with a ZIP-based
format and exports to XLS and PDF. The four iBench tasks for Numbers include starting Numbers,
generating a spreadsheet and saving it, opening the spreadsheet, and exporting that spreadsheet to
XLS. To model a possible home user working on a budget, the tasks utilize a five-page spreadsheet
with one column graph per sheet.
iWork Keynote 5.0.3 (791): a presentation and slideshow application. Keynote saves to a .key ZIP-based format and exports to Microsoft’s PPT format. The seven iBench tasks for Keynote include
starting Keynote, creating slides with and without images, opening and playing presentations, and
exporting to PPT. Each Keynote task uses a 20-slide presentation.
Open iPhoto with library of 400 photos
Import 400 photos into empty library
Duplicate 400 photos from library
Sequentially edit 400 photos from library
Sequentially delete 400 photos; empty trash
Sequentially view 400 photos
Open iTunes with 10 song album
Import 10 song album to library
Import 3 minute movie to library
Play album of 10 songs
Play 3 minute movie
Open iMovie with 3 minute clip in project
Import 3 minute .m4v (20MB) to “Events”
Paste 3 minute clip from “Events” to project
Export 3 minute video clip
Open Pages
Create 15 text page document; save as .pages
Create 15 JPG document; save as .pages
Open 15 text page document
Export 15 page document as .pdf
Export 15 JPG document as .pdf
Export 15 page document as .doc
Export 15 JPG document as .doc
Open Numbers
Save 5 sheets/column graphs as .numbers
Open 5 sheet spreadsheet
Export 5 sheets/column graphs as .xls
Open Keynote
Create 20 text slides; save as .key
Create 20 JPG slides; save as .key
Open and play presentation of 20 text slides
Open and play presentation of 20 JPG slides
Export 20 text slides as .ppt
Export 20 JPG slides as .ppt
Table 1: 34 Tasks of the iBench Suite. The table summarizes the 34 tasks of iBench, specifying the
application, a short name for the task, and a longer description of the actions modeled. The I/O is
characterized according to the number of files read or written, the sum of the maximum sizes of all
accessed files, the number of file accesses that read or write data, the number of bytes read or written,
the percentage of I/O bytes that are part of a read (or write), and the rate of I/O per CPU-second in
terms of both file accesses and bytes. Each core is counted individually, so at most 2 CPU-seconds
can be counted per second on our dual-core test machine. CPU utilization is measured with the
UNIX top utility, which in rare cases produces anomalous CPU utilization snapshots; those values
are ignored.
Table 1 contains a brief description of each of the 34 iBench tasks as well as the basic I/O characteristics of each task when running on Mac OS X Snow Leopard 10.6.2. The table illustrates that
the iBench tasks perform a significant amount of I/O. Most tasks access hundreds of files, which in
aggregate contain tens or hundreds of megabytes of data. The tasks typically access files hundreds of
times. The tasks perform widely differing amounts of I/O, from less than a megabyte to more than a
gigabyte. Most of the tasks perform many more reads than writes. Finally, the tasks exhibit high I/O
throughput, often transferring tens of megabytes of data for every second of computation.
Easy to Use
To enable other system evaluators to easily use these tasks, the iBench suite is packaged as a set of
34 system call traces. To ensure reproducible results, the 34 user tasks were first automated with
AppleScript, a general-purpose GUI scripting language. AppleScript provides generic commands
to emulate mouse clicks through menus and application-specific commands to capture higher-level
operations. Application-specific commands bypass a small amount of I/O by skipping dialog boxes;
however, we use them whenever possible for expediency.
The system call traces were gathered using DTrace [8], a kernel and user level dynamic instrumentation tool. DTrace is used to instrument the entry and exit points of all system calls dealing with the
file system; it also records the current state of the system and the parameters passed to and returned
from each call.
While tracing with DTrace was generally straightforward, we addressed four challenges in collecting the iBench traces. First, file sizes are not always available to DTrace; thus, we record every
file’s initial size and compute subsequent file size changes caused by system calls such as write
or ftruncate. Second, iTunes uses the ptrace system call to disable tracing; we circumvent
this block by using gdb to insert a breakpoint that automatically returns without calling ptrace.
Third, the volfs pseudo-file system in HFS+ (Hierarchical File System) allows files to be opened
via their inode number instead of a file name; to include pathnames in the trace, we instrument the
build_path function to obtain the full path when the task is run. Fourth, tracing system calls
misses I/O resulting from memory-mapped files; therefore, we purged memory and instrumented
kernel page-in functions to measure the amount of memory-mapped file activity. We found that the
amount of memory-mapped I/O is negligible in most tasks; we thus do not include this I/O in the
iBench traces or analysis.
To provide reproducible results, the traces must be run on a single file-system image. Therefore, the
iBench suite also contains snapshots of the initial directories to be restored before each run; initial
state is critical in file-system benchmarking [1].
The iBench task suite enables us to study the I/O behavior of a large set of home-user actions. As
shown from the timeline of I/O behavior for one particular task in Section 2, these tasks are likely to
access files in complex ways. To characterize this complex behavior in a quantitative manner across
the entire suite of 34 tasks, we focus on answering four categories of questions.
• What different types of files are accessed and what are the sizes of these files?
• How are files accessed for reads and writes? Are files accessed sequentially? Is space preallocated?
• What are the transactional properties? Are writes flushed with fsync or performed atomically?
• How do multi-threaded applications distribute I/O across different threads?
Answering these questions has two benefits. First, the answers can guide file and storage system
developers to target their systems better to home-user applications. Second, the characterization
will help users of iBench to select the most appropriate traces for evaluation and to understand their
resulting behavior.
All measurements were performed on a Mac Mini running Mac OS X Snow Leopard version 10.6.2
and the HFS+ file system. The machine has 2 GB of memory and a 2.26 GHz Intel Core 2 Duo processor.
Nature of Files
Our analysis begins by characterizing the high-level behavior of the iBench tasks. In particular, we
study the different types of files opened by each iBench task as well as the sizes of those files.
File Types
The iLife and iWork applications store data across a variety of files in a number of different formats;
for example, iLife applications tend to store their data in libraries (or data directories) unique to each
user, while iWork applications organize their documents in proprietary ZIP-based files. The extent to
which tasks access different types of files greatly influences their I/O behavior.
To understand accesses to different file types, we place each file into one of six categories, based
on file name extensions and usage. Multimedia files contain images (e.g., JPEG), songs (e.g., MP3,
AIFF), and movies (e.g., MPEG-4). Productivity files are documents (e.g., .pages, DOC, PDF),
spreadsheets (e.g., .numbers, XLS), and presentations (e.g., .key, PPT). SQLite files are database
files. Plist files are property-list files in XML containing key-value pairs for user preferences and
application properties. Strings files contain strings for localization of application text. Finally, Other
contains miscellaneous files such as plain text, logs, files without extensions, and binary files.
Figure 2 shows the frequencies with which tasks open and access files of each type; most tasks
perform hundreds of these accesses. Multimedia file opens are common in all workloads, though
they seldom predominate, even in the multimedia-heavy iLife applications. Conversely, opens of
productivity files are rare, even in iWork applications that use them; this is likely because most iWork
tasks create or view a single productivity file. Because .plist files act as generic helper files, they
are relatively common. SQLite files only have a noticeable presence in iPhoto, where they account
for a substantial portion of the observed opens. Strings files occupy a significant minority of most
workloads (except iPhoto and iTunes). Finally, between 5% and 20% of files are of type “Other”
(except for iTunes, where they are more prevalent).
Figure 3 displays the percentage of I/O bytes accessed for each file type. In bytes, multimedia I/O
dominates most of the iLife tasks, while productivity I/O has a significant presence in the iWork
tasks; file descriptors on multimedia and productivity files tend to receive large amounts of I/O.
SQLite, Plist, and Strings files have a smaller share of the total I/O in bytes relative to the number
of opened files; this implies that tasks access only a small quantity of data for each of these files
opened (e.g., several key-value pairs in a .plist). In most tasks, files classified as “Other” receive a
more significant portion of the I/O (the exception is iTunes).
Summary: Home applications access a wide variety of file types, generally opening multimedia files
the most frequently. iLife tasks tend to access bytes primarily from multimedia or files classified
as “Other”; iWork tasks access bytes from a broader range of file types, with some emphasis on
productivity files.
Figure 2: Types of Files Accessed By Number of Opens. This plot shows the relative frequency
with which file descriptors are opened upon different file types. The number at the end of each bar
indicates the total number of unique file descriptors opened on files.
Figure 3: Types of Files Opened By I/O Size. This plot shows the relative frequency with which
each task performs I/O upon different file types. The number at the end of each bar indicates the total
bytes of I/O accessed.
Figure 4: File Sizes, Weighted by Number of Accesses. This graph shows the number of accessed files in each file size range upon access ends. The total number of file accesses appears at the
end of the bars. Note that repeatedly-accessed files are counted multiple times, and entire file sizes
are counted even upon partial file accesses.
Figure 5: File Sizes, Weighted by the Bytes in Accessed Files. This graph shows the portion of
bytes in accessed files of each size range upon access ends. The sum of the file sizes appears at the
end of the bars. This number differs from total file footprint since files change size over time and
repeatedly accessed files are counted multiple times.
File Sizes
Large and small files present distinct challenges to the file system. For large files, finding contiguous
space can be difficult, while for small files, minimizing initial seek time is more important. We
investigate two different questions regarding file size. First, what is the distribution of file sizes
accessed by each task? Second, what portion of accessed bytes resides in files of various sizes?
To answer these questions, we record file sizes when each unique file descriptor is closed. We categorize sizes as very small (< 4 KB), small (< 64 KB), medium (< 1 MB), large (< 10 MB), or very large (≥ 10 MB). We track how many accesses are to files in each category and how many of the bytes
belong to files in each category.
Figure 4 shows the number of accesses to files of each size. Accesses to very small files are extremely
common, especially for iWork, accounting for over half of all the accesses in every iWork task. Small
file accesses have a significant presence in the iLife tasks. The large quantity of very small and small
files is due to frequent use of .plist files that store preferences, settings, and other application data;
these files often fill just one or two 4 KB pages.
Figure 5 shows the proportion of accessed bytes that reside in files of each size. Large and very
large files dominate every startup workload and nearly every task that processes multimedia files.
Small files account for few bytes and very small files are essentially negligible.
Summary: Agreeing with many previous studies (e.g., [4]), we find that while applications tend to
open many very small files (< 4 KB), most of the bytes accessed are in large files (> 1 MB).
Access Patterns
We next examine how the nature of file accesses has changed, studying the read and write patterns
of home applications. These patterns include whether files are used for reading, writing, or both;
whether files are accessed sequentially or randomly; and finally, whether or not blocks are preallocated via hints to the file system.
File Accesses
One basic characteristic of our workloads is the division between reading and writing on open file
descriptors. If an application uses an open file only for reading (or only for writing) or performs more
activity on file descriptors of a certain type, then the file system may be able to make more intelligent
memory and disk allocations.
To determine these characteristics, we classify each opened file descriptor based on the types of
accesses (read, write, or both read and write) performed during its lifetime. We ignore the actual flags used when opening the file, since we found they do not accurately reflect behavior; in all
workloads, almost all write-only file descriptors were opened with O_RDWR. We measure both the
proportional usage of each type of file descriptor and the relative amount of I/O performed on each.
Figure 6 shows how many file descriptors are used for each type of access. The overwhelming
majority of file descriptors are used exclusively for reading or writing; read-write file descriptors
are quite uncommon. Overall, read-only file descriptors are the most common across a majority of
workloads; write-only file descriptors are popular in some iLife tasks, but are rarely used in iWork.
We observe different patterns when analyzing the amount of I/O performed on each type of file
descriptor, as shown in Figure 7. First, even though iWork tasks have very few write-only file descriptors, they often write significant amounts of I/O to those descriptors. Second, even though
read-write file descriptors are rare, when present, they account for relatively large portions of total
I/O (particularly when exporting to .doc, .xls, and .ppt).
Summary: While many files are opened with the O_RDWR flag, most of them are subsequently
accessed write-only; thus, file open flags cannot be used to predict how a file will be accessed.
However, when an open file is both read and written by a task, the amount of traffic to that file
occupies a significant portion of the total I/O. Finally, the rarity of read-write file descriptors may
derive in part from the tendency of applications to write to a temporary file which they then rename
as the target file, instead of overwriting the target file; we explore this tendency more in §4.3.2.
Figure 6: Read/Write Distribution By File Descriptor. File descriptors can be used only for reads,
only for writes, or for both operations. This plot shows the percentage of file descriptors in each
category. This is based on usage, not open flags. Any duplicate file descriptors (e.g., created by
dup) are treated as one and file descriptors on which the program does not perform any subsequent
action are ignored.
Figure 7: Read/Write Distribution By Bytes. The graph shows how I/O bytes are distributed among
the three access categories. The unshaded dark gray indicates bytes read as a part of read-only accesses. Similarly, unshaded light gray indicates bytes written in write-only accesses. The shaded
regions represent bytes touched in read-write accesses, and are divided between bytes read and
bytes written.
Figure 8: Read Sequentiality. This plot shows the portion of file read accesses (weighted by bytes)
that are sequentially accessed.
Figure 9: Write Sequentiality. This plot shows the portion of file write accesses (weighted by bytes)
that are sequentially accessed.
Historically, files have usually been read or written entirely sequentially [4]. We next determine
whether sequential accesses are dominant in iBench. We measure this by examining all reads and
writes performed on each file descriptor and noting the percentage of files accessed in strict sequential
order (weighted by bytes).
We display our measurements for read and write sequentiality in Figures 8 and 9, respectively. The
portions of the bars in black indicate the percent of file accesses that exhibit pure sequentiality. We
observe high read sequentiality in iWork, but little in iLife (with the exception of the Start tasks and
iTunes Import). The inverse is true for writes: most iLife tasks exhibit high sequentiality; iWork
accesses are largely non-sequential.
Investigating the access patterns to multimedia files more closely, we note that many iLife applications first touch a small header before accessing the entire file sequentially. To better reflect this
behavior, we define an access to a file as “nearly sequential” when at least 95% of the bytes read or
written to a file form a sequential run. We found that a large number of accesses fall into the “nearly
sequential” category given a 95% threshold; the results do not change much with lower thresholds.
The slashed portions of the bars in Figures 8 and 9 show observed sequentiality with a 95% threshold.
Tasks with heavy use of multimedia files exhibit greater sequentiality with the 95% threshold for
both reading and writing. In several workloads (mainly iPhoto and iTunes), the I/O classified almost
entirely as non-sequential with a 100% threshold is classified as nearly sequential. The difference for
iWork applications is much less striking, indicating that accesses are more random.
Summary: A substantial number of tasks contain purely sequential accesses. When the definition
of a sequential access is loosened such that only 95% of bytes must be consecutive, then even more
tasks contain primarily sequential accesses. These “nearly sequential” accesses result from metadata
stored at the beginning of complex multimedia files: tasks frequently touch bytes near the beginning
of multimedia files before sequentially reading or writing the bulk of the file.
One of the difficulties file systems face when allocating contiguous space for files is not knowing how
much data will be written to those files. Applications can communicate this information by providing
hints [27] to the file system to preallocate an appropriate amount of space. In this section, we quantify
how often applications use preallocation hints and how often these hints are useful.
We instrument two calls usable for preallocation: pwrite and ftruncate. pwrite writes a
single byte at an offset beyond the end of the file to indicate the future end of the file; ftruncate
directly sets the file size. Sometimes a preallocation does not communicate anything useful to the
file system because it is immediately followed by a single write call with all the data; we flag these
preallocations as unnecessary.
Figure 10 shows the portion of file growth that is the result of preallocation. In all cases, preallocation
was due to calls to pwrite; we never observed ftruncate preallocation. Overall, applications
rarely preallocate space and preallocations are often useless.
The three tasks with significant preallocation are iPhoto Dup, iPhoto Edit, and iMovie Exp. iPhoto
Dup and Edit both call a copyPath function in the Cocoa library that preallocates a large amount
of space and then copies data by reading and writing it in 1 MB chunks. iPhoto Dup sometimes
uses copyPath to copy scaled-down images of size 50-100 KB; since these smaller files are copied
with a single write, the preallocation does not communicate anything useful. iMovie Exp calls a
Quicktime append function that preallocates space before writing the actual data; however, the data is appended in small 128 KB increments, so each preallocation is immediately followed by a single write call that fills it. The preallocation is therefore useless.
Figure 10: Preallocation Hints. The sizes of the bars indicate the portion of file growth that is preallocated; unnecessary preallocations are diagonally striped. The number atop each bar indicates the absolute amount preallocated.
Summary: Although preallocation has the potential to be useful, few tasks use it to provide hints,
and a significant number of the hints that are provided are useless. The hints are provided inconsistently: although iPhoto and iMovie both use preallocation for some tasks, neither application uses
preallocation during import.
Transactional Properties
In this section, we explore the degree to which the iBench tasks require transactional properties from
the underlying file and storage system. In particular, we investigate the extent to which applications
require writes to be durable; that is, how frequently they invoke calls to fsync and which APIs
perform these calls. We also investigate the atomicity requirements of the applications, whether from
renaming files or exchanging inodes.
Writes typically involve a trade-off between performance and durability. Applications that require
write operations to complete quickly can write data to the file system’s main memory buffers, which
are lazily copied to the underlying storage system at a subsequent convenient time. Buffering writes
in main memory has a wide range of performance advantages: writes to the same block may be
coalesced, writes to files that are later deleted need not be performed, and random writes can be more
efficiently scheduled.
On the other hand, applications that rely on durable writes can flush written data to the underlying
storage layer with the fsync system call. The frequency of fsync calls and the number of bytes
they synchronize directly affect performance: if fsync appears often and flushes only several bytes,
then performance will suffer. Therefore, we investigate how modern applications use fsync.
Figure 11: Percentage of Fsync Bytes. The percentage of fsync’d bytes written to file descriptors
is shown, broken down by cause. The value atop each bar shows total bytes synchronized.
Figure 12: Fsync Sizes. This plot shows a distribution of fsync sizes. The total number of fsync
calls appears at the end of the bars.
Figure 11 shows the percentage of written data each task synchronizes with fsync. The graph further subdivides the source of the fsync activity into six categories. SQLite indicates that the SQLite
database engine is responsible for calling fsync; Archiving indicates an archiving library frequently
used when accessing ZIP formats; Pref Sync is the PreferencesSynchronize function call
from the Cocoa library; writeToFile is the Cocoa call writeToFile with the atomically flag
set; and finally, FlushFork is the Carbon FSFlushFork routine.
At the highest level, the figure indicates that half the tasks synchronize close to 100% of their written
data while approximately two-thirds synchronize more than 60%. iLife tasks tend to synchronize
many megabytes of data, while iWork tasks usually only synchronize tens of kilobytes (excluding
tasks that handle images).
To delve into the APIs responsible for the fsync calls, we examine how each bar is subdivided. In
iLife, the sources of fsync calls are quite varied: every category of API except for Archiving is
represented in one of the tasks, and many of the tasks call multiple APIs which invoke fsync. In
iWork, the sources are more consistent; the only sources are Pref Sync, SQLite, and Archiving (for
manipulating compressed data).
Given that these tasks require durability for a significant percentage of their write traffic, we next
investigate the frequency of fsync calls and how much data each individual call pushes to disk.
Figure 12 groups fsync calls based on the amount of I/O performed on each file descriptor when
fsync is called, and displays the relative percentage each category comprises of the total I/O.
These results show that iLife tasks call fsync frequently (from tens to thousands of times), while
iWork tasks call fsync infrequently except when dealing with images. From these observations,
we infer that calls to fsync are mostly associated with media. The majority of calls to fsync
synchronize small amounts of data; only a few iLife tasks synchronize more than a megabyte of data
in a single fsync call.
Summary: Developers want to ensure that data enters stable storage durably, and thus, these tasks
synchronize a significant fraction of their data. Based on our analysis of the source of fsync calls,
some calls may be incidental and an unintentional side-effect of the API (e.g., those from SQLite or
Pref Sync), but most are performed intentionally by the programmer. Furthermore, some of the tasks
synchronize small amounts of data frequently, presenting a challenge for file systems.
Atomic Writes
Applications often require file changes to be atomic. In this section, we quantify how frequently
applications use different techniques to achieve atomicity. We also identify cases where performing writes atomically can interfere with directory locality optimizations by moving files from their
original directories. Finally, we identify the causes of atomic writes.
Applications can atomically update a file by first writing the desired contents to a temporary file and
then using either the rename or exchangedata call to atomically replace the old file with the
new file. With rename, the new file is given the same name as the old, deleting the original and
replacing it. With exchangedata, the inode numbers assigned to the old file and the temporary
file are swapped, causing the old path to point to the new data; this allows the file path to remain
associated with the original inode number, which is necessary for some applications.
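The rename-based technique can be sketched as follows (the helper name is ours, not from the paper); the temporary file is created in the target's own directory so the rename stays within one file system:

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace path with data atomically: a reader sees either the old
    file or the new one, never a partially written mixture."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)  # temp file beside the target
    try:
        os.write(fd, data)
        os.fsync(fd)                         # make new contents durable first
    finally:
        os.close(fd)
    os.rename(tmp, path)                     # atomically swap in the new file
```

Note that, unlike exchangedata, this gives the path a new inode number, which is exactly the distinction the paper draws.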
Figure 13 shows how much write I/O is performed atomically with rename or exchangedata;
rename calls are further subdivided into those which keep the file in the same directory and those
which do not. The results show that atomic writes are quite popular and that, in many workloads,
all the writes are atomic. The breakdown of each bar shows that rename is frequent; a significant
minority of the rename calls move files between directories. exchangedata is rare and used only
by iTunes for a small fraction of file updates.
Figure 13: Atomic Writes. The portion of written bytes written atomically is shown, divided into
groups: (1) rename leaving a file in the same directory; (2) rename causing a file to change
directories; (3) exchangedata which never causes a directory change. The atomic file-write count
appears atop each bar.
Figure 14: Rename Causes. This plot shows the portion of rename calls caused by each of the top
four higher level functions used for atomic writes. The number of rename calls appears at the end
of the bars.
We find that most of the rename calls causing directory changes occur when a file (e.g., a document
or spreadsheet) is saved at the user’s request. We suspect different directories are used so that users
are not confused by seeing temporary files in their personal directories. Interestingly, atomic writes
are performed when files are saved to Apple formats, but not when exporting to Microsoft formats.
We suspect that the interface between applications and the Microsoft libraries does not specify atomic
operations well.
Figure 14 identifies the APIs responsible for atomic writes via rename. Pref Sync, from the Cocoa
library, allows applications to save user and system wide settings in .plist files. WriteToFile and
movePath are Cocoa routines and FSRenameUnicode is a Carbon routine. A solid majority of the
atomic writes are caused by Pref Sync; this is an example of I/O behavior caused by the API rather
than explicit programmer intention. The second most common atomic writer is writeToFile; in this
case, the programmer is requesting atomicity but leaving the technique up to the library. Finally, in
a small minority of cases, programmers perform atomic writes themselves by calling movePath or
FSRenameUnicode, both of which are essentially rename wrappers.
Summary: Many of our tasks write data atomically, generally doing so by calling rename. The bulk
of atomic writes result from API calls; while some of these hide the underlying nature of the write,
others make it clear that they act atomically. Thus, developers desire atomicity for many operations,
and file systems will need to either address the ensuing renames that accompany it or provide an
alternative mechanism for it. In addition, the absence of atomic writes when writing to Microsoft
formats highlights the inconsistencies that can result from the use of high level libraries.
Threads and Asynchronicity
Home-user applications are interactive and need to avoid blocking when I/O is performed. Asynchronous I/O and threads are often used to hide the latency of slow operations from users. For our
final experiments, we investigate how often applications use asynchronous I/O libraries or multiple
threads to avoid blocking.
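The thread-based approach can be sketched with a worker pool; this is a hypothetical photo-import loop, subdivided per file in the way the paper describes for iPhoto (file names and sizes are made up):

```python
import concurrent.futures
import os
import tempfile

def read_file(path):
    # Work done on a pool thread, keeping the (notional) UI thread free.
    with open(path, "rb") as f:
        return path, len(f.read())

# Stand-ins for a batch of imported photos.
d = tempfile.mkdtemp()
photos = []
for i in range(8):
    p = os.path.join(d, "photo%d.jpg" % i)
    with open(p, "wb") as f:
        f.write(b"x" * 1024)
    photos.append(p)

# Reads proceed in parallel across four worker threads.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    sizes = dict(pool.map(read_file, photos))
```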
Figure 15 shows the portion of read operations performed asynchronously with aio_read; none of
the tasks use aio_write. We find that asynchronous I/O is used rarely and only by iLife applications. However, in those cases where asynchronous I/O is performed, it is used quite heavily.
Figure 16 investigates how threads are used by these tasks: specifically, the portion of I/O performed
by each of the threads. The numbers at the tops of the bars report the number of threads performing
I/O. iPhoto and iTunes leverage a significant number of threads for I/O, since many of their tasks are
readily subdivided (e.g., importing 400 different photos). Only a handful of tasks perform all their
I/O from a single thread. For most tasks, a small number of threads are responsible for the majority
of I/O.
Figure 17 shows the responsibilities of each thread that performs I/O, where a thread can be responsible for reading, writing, or both. The measurements show that significantly more threads are devoted
to reading than to writing, with a fair number of threads responsible for both. These results indicate
that threads are the preferred technique for avoiding blocking and that applications may be particularly
concerned with avoiding blocking due to reads.
Summary: Our results indicate that iBench tasks are concerned with hiding long-latency operations
from interactive users and that threads are the preferred method for doing so. Virtually all of the
applications we study issue I/O requests from multiple threads, and some launch I/Os from hundreds
of different threads.
Figure 15: Asynchronous Reads. This plot shows the percentage of read bytes read asynchronously
via aio_read. The total amount of asynchronous I/O is provided at the end of the bars.
Figure 16: I/O Distribution Among Threads. The stacked bars indicate the percentage of total
I/O performed by each thread. The I/O from the threads that do the most and second most I/O are
dark and medium gray respectively, and the other threads are light gray. Black lines divide the I/O
across the latter group; black areas appear when numerous threads do small amounts of I/O. The total
number of threads that perform I/O is indicated next to the bars.
Figure 17: Thread Type Distribution. The plot categorizes threads that do I/O into three groups:
threads that read, threads that write, or threads that both read and write. The total number of threads
that perform I/O is indicated next to the bars.
Related Work
Although our study is unique in its focus on the I/O behavior of individual applications, a body
of similar work exists both in the field of file systems and in application studies. As mentioned
earlier, our work builds upon that of Baker [4], Ousterhout [25], Vogels [39], and others who have
conducted similar studies, providing an updated perspective on many of their findings. However,
the majority of these focus on academic and engineering environments, which are likely to have
noticeably different application profiles from the home environment. Some studies, like those by
Ramakrishnan [31] and by Vogels, have included office workloads on personal computers; these are
likely to feature applications similar to those in iWork, but are still unlikely to contain analogues to
iLife products. None of these studies, however, look at the characteristics of individual application
behaviors; instead, they analyze trends seen in prolonged usage. Thus, our study complements the
breadth of this research with a more focused examination, providing specific information on the
causes of the behaviors we observe, and is one of the first to address the interaction of multimedia
applications with the file system.
In addition to these studies of dynamic workloads, a variety of papers have examined the static characteristics of file systems, starting with Satyanarayanan’s analysis of files at Carnegie-Mellon University [36]. One of the most recent of these examined metadata characteristics on desktops at Microsoft
over a five year time span, providing insight into file-system usage characteristics in a setting similar
to the home [2]. This type of analysis provides insight into long term characteristics of files that
ours cannot; a similar study for home systems would, in conjunction with our paper, provide a more
complete image of how home applications interact with the file system.
While most file-system studies deal with aggregate workloads, our examination of application-specific
behaviors has precedent in a number of hardware studies. In particular, Flautner et al.’s [13] and Blake
et al.’s [6] studies of parallelism in desktop applications bear strong similarities to ours in the variety
of applications they examine. In general, they use a broader set of applications, a difference that
derives from the subjects studied. In particular, we select applications likely to produce interesting
I/O behavior; many of the programs they use, like the video game Quake, are more likely to exercise
threading than the file system. Finally, it is worth noting that Blake et al. analyze Windows software
using event tracing, which may prove a useful tool to conduct a similar application file-system study
to ours in Windows.
Conclusions
We have presented a detailed study of the I/O behavior of complex, modern applications. Through
our measurements, we have discovered distinct differences between the tasks in the iBench suite and
traditional workload studies. To conclude, we consider the possible effects of our findings on future
file and storage systems.
We observed that many of the tasks in the iBench suite frequently force data to disk by invoking
fsync, which has strong implications for file systems. Delayed writing has long been the basis of
increasing file-system performance [34], but it is of greatly decreased utility given small synchronous
writes. Thus, more study is required to understand why the developers of these applications and
frameworks are calling these routines so frequently. For example, is data being flushed to disk to
ensure ordering between writes, safety in the face of power loss, or safety in the face of application
crashes? Finding appropriate solutions depends upon the answers to these questions. One possibility
is for file systems to expose new interfaces to enable applications to better express their exact needs
and desires for durability, consistency, and atomicity. Another possibility is that new technologies,
such as flash and other solid-state devices, will be a key solution, allowing writes to be buffered
quickly, perhaps before being staged to disk or even the cloud.
The iBench tasks also illustrate that file systems are now being treated as repositories of highly structured “databases” managed by the applications themselves. In some cases, data is stored in a
literal database (e.g., iPhoto uses SQLite), but in most cases, data is organized in complex directory
hierarchies or within a single file (e.g., a .doc file is basically a mini-FAT file system). One option is
that the file system could become more application-aware, tuned to understand important structures
and to better allocate and access these structures on disk. For example, a smarter file system could
improve its allocation and prefetching of “files” within a .doc file: seemingly non-sequential patterns
in a complex file are easily deconstructed into accesses to metadata followed by streaming sequential
access to data.
Our analysis also revealed the strong impact that frameworks and libraries have on I/O behavior.
Traditionally, file systems have been designed at the level of the VFS interface, not breaking into
the libraries themselves. However, it appears that file systems now need to take a more “vertical” approach and incorporate some of the functionality of modern libraries. This vertical approach hearkens
back to the earliest days of file-system development when the developers of FFS modified standard
libraries to buffer writes in block-sized chunks to avoid costly sub-block overheads [21]. Future
storage systems should further integrate with higher-level interfaces to gain deeper understanding of
application desires.
Finally, modern applications are highly complex, containing millions of lines of code, divided over
hundreds of source files and libraries, and written by many different programmers. As a result, their
own behavior is increasingly inconsistent: along similar, but distinct code paths, different libraries
are invoked with different transactional semantics. To simplify these applications, file systems could
add higher-level interfaces, easing construction and unifying data representations. While the systems
community has developed influential file-system concepts, little has been done to transition this class
of improvements into the applications themselves. Database technology does support a certain class
of applications quite well but is generally too heavyweight. Future storage systems should consider
how to bridge the gap between the needs of current applications and the features low-level systems provide.
Our evaluation may raise more questions than it answers. To build better systems for the future, we
believe that the research community must study applications that are important to real users. We
believe the iBench task suite takes a first step in this direction and hope others in the community will
continue along this path.
Acknowledgments
We thank the anonymous reviewers and Rebecca Isaacs (our shepherd) for their tremendous feedback,
as well as members of our research group for their thoughts and comments on this work at various stages.
This material is based upon work supported by the National Science Foundation under CSR-1017518
as well as by generous donations from Network Appliance and Google. Tyler Harter and Chris
Dragga are supported by the Guri Sohi Fellowship and the David DeWitt Fellowship, respectively.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of
the authors and do not necessarily reflect the views of NSF or other institutions.
References
[1] N. Agrawal, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Generating Realistic Impressions for File-System Benchmarking. In FAST ’09, San Jose, CA, February 2009.
[2] N. Agrawal, W. J. Bolosky, J. R. Douceur, and J. R. Lorch. A Five-Year Study of File-System
Metadata. In FAST ’07, San Jose, CA, February 2007.
[3] Apple Computer, Inc. AppleScript Language Guide, March 2011.
[4] M. Baker, J. Hartman, M. Kupfer, K. Shirriff, and J. Ousterhout. Measurements of a Distributed
File System. In SOSP ’91, pages 198–212, Pacific Grove, CA, October 1991.
[5] W. Bartlett and L. Spainhower. Commercial Fault Tolerance: A Tale of Two Systems. IEEE
Transactions on Dependable and Secure Computing, 1(1):87–96, January 2004.
[6] G. Blake, R. G. Dreslinski, T. Mudge, and K. Flautner. Evolution of Thread-level Parallelism in
Desktop Applications. SIGARCH Comput. Archit. News, 38:302–313, June 2010.
[7] J. Bonwick and B. Moore. ZFS: The Last Word in File Systems. zfs/docs/zfs_last.pdf, 2007.
[8] B. Cantrill, M. W. Shapiro, and A. H. Leventhal. Dynamic Instrumentation of Production Systems. In USENIX ’04, pages 15–28, Boston, MA, June 2004.
[9] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s Highly Available Key-Value Store. In
SOSP ’07, Stevenson, WA, October 2007.
[10] J. R. Douceur and W. J. Bolosky. A Large-Scale Study of File-System Contents. In SIGMETRICS ’99, pages 59–69, Atlanta, GA, May 1999.
[11] D. Ellard and M. I. Seltzer. New NFS Tracing Tools and Techniques for System Analysis. In
LISA ’03, pages 73–85, San Diego, CA, October 2003.
[12] Y. Endo, Z. Wang, J. B. Chen, and M. Seltzer. Using Latency to Evaluate Interactive System
Performance. In OSDI ’96, Seattle, WA, October 1996.
[13] K. Flautner, R. Uhlig, S. Reinhardt, and T. Mudge. Thread-level Parallelism and Interactive
Performance of Desktop Applications. SIGPLAN Not., 35:129–138, November 2000.
[14] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In SOSP ’03, pages 29–43,
Bolton Landing, NY, October 2003.
[15] R. Hagmann. Reimplementing the Cedar File System Using Logging and Group Commit. In
SOSP ’87, Austin, TX, November 1987.
[16] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West.
Scale and Performance in a Distributed File System. ACM Transactions on Computer Systems,
6(1), February 1988.
[17] B. Lampson. Computer Systems Research – Past and Present. SOSP 17 Keynote Lecture, December 1999.
[18] E. K. Lee and C. A. Thekkath. Petal: Distributed Virtual Disks. In ASPLOS VII, Cambridge,
MA, October 1996.
[19] A. W. Leung, S. Pasupathy, G. R. Goodson, and E. L. Miller. Measurement and Analysis of
Large-Scale Network File System Workloads. In USENIX ’08, pages 213–226, Boston, MA,
June 2008.
[20] Macintosh
[21] M. K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry. A Fast File System for UNIX. ACM
Transactions on Computer Systems, 2(3):181–197, August 1984.
[22] J. C. Mogul. A Better Update Policy. In USENIX Summer ’94, Boston, MA, June 1994.
[23] J. Olson. Enhance Your Apps With File System Transactions, July 2007.
[24] J. Ousterhout. Why Threads Are A Bad Idea (for most purposes), September 1995.
[25] J. K. Ousterhout, H. D. Costa, D. Harrison, J. A. Kunze, M. Kupfer, and J. G. Thompson. A
Trace-Driven Analysis of the UNIX 4.2 BSD File System. In SOSP ’85, pages 15–24, Orcas
Island, WA, December 1985.
[26] D. Patterson, G. Gibson, and R. Katz. A Case for Redundant Arrays of Inexpensive Disks
(RAID). In SIGMOD ’88, pages 109–116, Chicago, IL, June 1988.
[27] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. Informed Prefetching
and Caching. In SOSP ’95, pages 79–95, Copper Mountain, CO, December 1995.
[28] R. Pike. Another Go at Language Design, April 2010.
[29] V. Prabhakaran, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Analysis and Evolution of
Journaling File Systems. In USENIX ’05, pages 105–120, Anaheim, CA, April 2005.
[30] V. Prabhakaran, L. N. Bairavasundaram, N. Agrawal, H. S. Gunawi, A. C. Arpaci-Dusseau,
and R. H. Arpaci-Dusseau. IRON File Systems. In SOSP ’05, pages 206–220, Brighton, UK,
October 2005.
[31] K. K. Ramakrishnan, P. Biswas, and R. Karedla. Analysis of File I/O Traces in Commercial
Computing Environments. SIGMETRICS Perform. Eval. Rev., 20:78–90, June 1992.
[32] D. M. Ritchie and K. Thompson. The UNIX Time-Sharing System. In SOSP ’73, Yorktown
Heights, NY, October 1973.
[33] D. Roselli, J. R. Lorch, and T. E. Anderson. A Comparison of File System Workloads. In
USENIX ’00, pages 41–54, San Diego, CA, June 2000.
[34] M. Rosenblum and J. Ousterhout. The Design and Implementation of a Log-Structured File
System. ACM Transactions on Computer Systems, 10(1):26–52, February 1992.
[35] R. Sandberg. The Design and Implementation of the Sun Network File System. In Proceedings
of the 1985 USENIX Summer Technical Conference, pages 119–130, Berkeley, CA, June 1985.
[36] M. Satyanarayanan. A Study of File Sizes and Functional Lifetimes. In SOSP ’81, pages 96–
108, Pacific Grove, CA, December 1981.
[37] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck. Scalability in the
XFS File System. In USENIX 1996, San Diego, CA, January 1996.
[38] M. Tilmann. Apple’s Market Share In The PC World Continues To Surge, April
[39] W. Vogels. File system usage in Windows NT 4.0. In SOSP ’99, pages 93–109, Kiawah Island
Resort, SC, December 1999.
[40] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In ISCA ’95, pages 24–36, Santa Margherita
Ligure, Italy, June 1995.
CryptDB: Protecting Confidentiality with
Encrypted Query Processing
Raluca Ada Popa, Catherine M. S. Redfield,
Nickolai Zeldovich, and Hari Balakrishnan
Online applications are vulnerable to theft of sensitive information because adversaries can exploit
software bugs to gain access to private data, and because curious or malicious administrators may
capture and leak data. CryptDB is a system that provides practical and provable confidentiality in the
face of these attacks for applications backed by SQL databases. It works by executing SQL queries
over encrypted data using a collection of efficient SQL-aware encryption schemes. CryptDB can
also chain encryption keys to user passwords, so that a data item can be decrypted only by using the
password of one of the users with access to that data. As a result, a database administrator never gets
access to decrypted data, and even if all servers are compromised, an adversary cannot decrypt the data
of any user who is not logged in. An analysis of a trace of 126 million SQL queries from a production
MySQL server shows that CryptDB can support operations over encrypted data for 99.5% of the
128,840 columns seen in the trace. Our evaluation shows that CryptDB has low overhead, reducing
throughput by 14.5% for phpBB, a web forum application, and by 26% for queries from TPC-C,
compared to unmodified MySQL. Chaining encryption keys to user passwords requires 11–13 unique
schema annotations to secure more than 20 sensitive fields and 2–7 lines of source code changes for
three multi-user web applications.
Introduction
Theft of private information is a significant problem, particularly for online applications [40]. An
adversary can exploit software vulnerabilities to gain unauthorized access to servers [32]; curious
or malicious administrators at a hosting or application provider can snoop on private data [6]; and
attackers with physical access to servers can access all data on disk and in memory [23].
One approach to reduce the damage caused by server compromises is to encrypt sensitive data, as
in SUNDR [28], SPORC [16], and Depot [30], and run all computations (application logic) on clients.
Unfortunately, several important applications do not lend themselves to this approach, including
database-backed web sites that process queries to generate data for the user, and applications that
compute over large amounts of data. Even when this approach is tenable, converting an existing
server-side application to this form can be difficult. Another approach would be to consider theoretical
solutions such as fully homomorphic encryption [19], which allows servers to compute arbitrary
functions over encrypted data, while only clients see decrypted data. However, fully homomorphic
encryption schemes are still prohibitively expensive by orders of magnitude [10, 21].
This paper presents CryptDB, a system that explores an intermediate design point to provide
confidentiality for applications that use database management systems (DBMSes). CryptDB leverages
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. Copyrights for components of this work owned by others
than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on
servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP ’11, October 23–26, 2011, Cascais, Portugal.
Copyright 2011 ACM 978-1-4503-0977-6/11/10 . . . $10.00.
Figure 1: CryptDB’s architecture consisting of two parts: a database proxy and an unmodified DBMS. CryptDB uses user-defined functions (UDFs) to perform cryptographic
operations in the DBMS. Rectangular and rounded boxes represent processes and data, respectively. Shading indicates components added by CryptDB. Dashed lines indicate
separation between users’ computers, the application server, a server running CryptDB’s database proxy (which is usually the same as the application server), and the DBMS
server. CryptDB addresses two kinds of threats, shown as dotted lines. In threat 1, a curious database administrator with complete access to the DBMS server snoops on
private data, in which case CryptDB prevents the DBA from accessing any private information. In threat 2, an adversary gains complete control over both the software and
hardware of the application, proxy, and DBMS servers, in which case CryptDB ensures the adversary cannot obtain data belonging to users that are not logged in (e.g., user 2).
the typical structure of database-backed applications, consisting of a DBMS server and a separate
application server, as shown in Figure 1; the latter runs the application code and issues DBMS queries
on behalf of one or more users. CryptDB’s approach is to execute queries over encrypted data, and the
key insight that makes it practical is that SQL uses a well-defined set of operators, each of which we
are able to support efficiently over encrypted data.
CryptDB addresses two threats. The first threat is a curious database administrator (DBA) who
tries to learn private data (e.g., health records, financial statements, personal information) by snooping
on the DBMS server; here, CryptDB prevents the DBA from learning private data. The second threat
is an adversary that gains complete control of application and DBMS servers. In this case, CryptDB
cannot provide any guarantees for users that are logged into the application during an attack, but can
still ensure the confidentiality of logged-out users’ data.
There are two challenges in combating these threats. The first lies in the tension between
minimizing the amount of confidential information revealed to the DBMS server and the ability to
efficiently execute a variety of queries. Current approaches for computing over encrypted data are
either too slow or do not provide adequate confidentiality, as we discuss in §9. On the other hand,
encrypting data with a strong and efficient cryptosystem, such as AES, would prevent the DBMS
server from executing many SQL queries, such as queries that ask for the number of employees in the
“sales” department or for the names of employees whose salary is greater than $60,000. In this case,
the only practical solution would be to give the DBMS server access to the decryption key, but that
would allow an adversary to also gain access to all data.
The second challenge is to minimize the amount of data leaked when an adversary compromises
the application server in addition to the DBMS server. Since arbitrary computation on encrypted data
is not practical, the application must be able to access decrypted data. The difficulty is thus to ensure
that a compromised application can obtain only a limited amount of decrypted data. A naïve solution
of assigning each user a different database encryption key for their data does not work for applications
with shared data, such as bulletin boards and conference review sites.
CryptDB addresses these challenges using three key ideas:
• The first is to execute SQL queries over encrypted data. CryptDB implements this idea using a
SQL-aware encryption strategy, which leverages the fact that all SQL queries are made up of a
well-defined set of primitive operators, such as equality checks, order comparisons, aggregates
(sums), and joins. By adapting known encryption schemes (for equality, additions, and order
checks) and using a new privacy-preserving cryptographic method for joins, CryptDB encrypts
each data item in a way that allows the DBMS to execute on the transformed data. CryptDB is
efficient because it mostly uses symmetric-key encryption, avoids fully homomorphic encryption,
and runs on unmodified DBMS software (by using user-defined functions).
• The second technique is adjustable query-based encryption. Some encryption schemes leak more
information than others about the data to the DBMS server, but are required to process certain
queries. To avoid revealing all possible encryptions of data to the DBMS a priori, CryptDB
carefully adjusts the SQL-aware encryption scheme for any given data item, depending on the
queries observed at run-time. To implement these adjustments efficiently, CryptDB uses onions of
encryption. Onions are a novel way to compactly store multiple ciphertexts within each other in
the database and avoid expensive re-encryptions.
• The third idea is to chain encryption keys to user passwords, so that each data item in the database
can be decrypted only through a chain of keys rooted in the password of one of the users with
access to that data. As a result, if the user is not logged into the application, and if the adversary
does not know the user’s password, the adversary cannot decrypt the user’s data, even if the DBMS
and the application server are fully compromised. To construct a chain of keys that captures the
application’s data privacy and sharing policy, CryptDB allows the developer to provide policy
annotations over the application’s SQL schema, specifying which users (or other principals, such
as groups) have access to each data item.
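The key-chaining idea above can be sketched in a few lines. This is a toy illustration only (the function names and the XOR-based key wrapping are our own, not CryptDB's actual construction); it shows how a per-item data key becomes recoverable solely through a key derived from a user's password:

```python
import hashlib
import os

def password_key(password, salt):
    # Key-encryption key derived from the user's password (PBKDF2-HMAC-SHA256).
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

def xor_wrap(data_key, kek):
    # Toy key wrapping: XOR the 32-byte per-item key under the derived key.
    return bytes(a ^ b for a, b in zip(data_key, kek))

xor_unwrap = xor_wrap  # XOR is its own inverse

salt = os.urandom(16)
item_key = os.urandom(32)          # key that encrypts one user's data items
stored = xor_wrap(item_key, password_key("alice-secret", salt))

# Only a chain rooted in the correct password recovers the item key.
assert xor_unwrap(stored, password_key("alice-secret", salt)) == item_key
assert xor_unwrap(stored, password_key("wrong-guess", salt)) != item_key
```

If the user is logged out and the adversary does not know the password, `stored` is useless on its own, which is the property the annotation-driven key chains provide.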
We have implemented CryptDB on both MySQL and Postgres; our design and most of our
implementation should be applicable to most standard SQL DBMSes. An analysis of a 10-day trace
of 126 million SQL queries from many applications at MIT suggests that CryptDB can support
operations over encrypted data for 99.5% of the 128,840 columns seen in the trace. Our evaluation
shows that CryptDB has low overhead, reducing throughput by 14.5% for the phpBB web forum
application, and by 26% for queries from TPC-C, compared to unmodified MySQL. We evaluated the
security of CryptDB on six real applications (including phpBB, the HotCRP conference management
software [27], and the OpenEMR medical records application); the results show that CryptDB protects
most sensitive fields with highly secure encryption schemes. Chaining encryption keys to user
passwords requires 11–13 unique schema annotations to enforce privacy policies on more than 20
sensitive fields (including a new policy in HotCRP for handling papers in conflict with a PC chair) and
2–7 lines of source code changes for three multi-user web applications.
The rest of this paper is structured as follows. In §2, we discuss the threats that CryptDB defends
against in more detail. Then, we describe CryptDB’s design for encrypted query processing in §3 and
for key chaining to user passwords in §4. In §5, we present several case studies of how applications can
use CryptDB, and in §6, we discuss limitations of our design, and ways in which it can be extended.
Next, we describe our prototype implementation in §7, and evaluate the performance and security of
CryptDB, as well as the effort required for application developers to use CryptDB, in §8. We compare
CryptDB to related work in §9 and conclude in §10.
Figure 1 shows CryptDB’s architecture and threat models. CryptDB works by intercepting all SQL
queries in a database proxy, which rewrites queries to execute on encrypted data (CryptDB assumes
that all queries go through the proxy). The proxy encrypts and decrypts all data, and changes some
query operators, while preserving the semantics of the query. The DBMS server never receives
decryption keys to the plaintext, so it never sees sensitive data, ensuring that a curious DBA cannot
gain access to private information (threat 1).
To guard against application, proxy, and DBMS server compromises (threat 2), developers annotate
their SQL schema to define different principals, whose keys will allow decrypting different parts of the
database. They also make a small change to their applications to provide encryption keys to the proxy,
as described in §4. The proxy determines what parts of the database should be encrypted under what
key. The result is that CryptDB guarantees the confidentiality of data belonging to users that are not
logged in during a compromise (e.g., user 2 in Figure 1), and who do not log in until the compromise
is detected and fixed by the administrator.
Although CryptDB protects data confidentiality, it does not ensure the integrity, freshness, or
completeness of results returned to the application. An adversary that compromises the application,
proxy, or DBMS server, or a malicious DBA, can delete any or all of the data stored in the database.
Similarly, attacks on user machines, such as cross-site scripting, are outside of the scope of CryptDB.
We now describe the two threat models addressed by CryptDB, and the security guarantees
provided under those threat models.
Threat 1: DBMS Server Compromise
In this threat, CryptDB guards against a curious DBA or other external attacker with full access to the
data stored in the DBMS server. Our goal is confidentiality (data secrecy), not integrity or availability.
The attacker is assumed to be passive: she wants to learn confidential data, but does not change queries
issued by the application, query results, or the data in the DBMS. This threat includes DBMS software
compromises, root access to DBMS machines, and even access to the RAM of physical machines.
With the rise in database consolidation inside enterprise data centers, outsourcing of databases to
public cloud computing infrastructures, and the use of third-party DBAs, this threat is increasingly
important.
CryptDB aims to protect data confidentiality against this threat by executing SQL
queries over encrypted data on the DBMS server. The proxy uses secret keys to encrypt all data
inserted or included in queries issued to the DBMS. Our approach is to allow the DBMS server to
perform query processing on encrypted data as it would on an unencrypted database, by enabling it to
compute certain functions over the data items based on encrypted data. For example, if the DBMS
needs to perform a GROUP BY on column c, the DBMS server should be able to determine which
items in that column are equal to each other, but not the actual content of each item. Therefore, the
proxy needs to enable the DBMS server to determine relationships among data necessary to process a
query. By using SQL-aware encryption that adjusts dynamically to the queries presented, CryptDB is
careful about what relations it reveals between tuples to the server. For instance, if the DBMS needs to
perform only a GROUP BY on a column c, the DBMS server should not know the order of the items in
column c, nor should it know any other information about other columns. If the DBMS is required to
perform an ORDER BY, or to find the MAX or MIN, CryptDB reveals the order of items in that column,
but not otherwise.
CryptDB provides confidentiality for data content and for names of columns and
tables; CryptDB does not hide the overall table structure, the number of rows, the types of columns, or
the approximate size of data in bytes. The security of CryptDB is not perfect: CryptDB reveals to
the DBMS server relationships among data items that correspond to the classes of computation that
queries perform on the database, such as comparing items for equality, sorting, or performing word
search. The granularity at which CryptDB allows the DBMS to perform a class of computations is an
entire column (or a group of joined columns, for joins), which means that even if a query requires
equality checks for a few rows, executing that query on the server would require revealing that class of
computation for an entire column. §3.1 describes how these classes of computation map to CryptDB’s
encryption schemes, and the information they reveal.
More intuitively, CryptDB provides the following properties:
• Sensitive data is never available in plaintext at the DBMS server.
• The information revealed to the DBMS server depends on the classes of computation required by
the application’s queries, subject to constraints specified by the application developer in the schema:
1. If the application requests no relational predicate filtering on a column, nothing about the
data content leaks (other than its size in bytes).
2. If the application requests equality checks on a column, CryptDB’s proxy reveals which
items repeat in that column (the histogram), but not the actual values.
3. If the application requests order checks on a column, the proxy reveals the order of the
elements in the column.
• The DBMS server cannot compute the (encrypted) results for queries that involve computation
classes not requested by the application.
How close is CryptDB to “optimal” security? Fundamentally, optimal security is achieved by
recent work in theoretical cryptography enabling any computation over encrypted data [18]; however,
such proposals are prohibitively impractical. In contrast, CryptDB is practical, and in §8.3, we
demonstrate that it also provides significant security in practice. Specifically, we show that all or
almost all of the most sensitive fields in the tested applications remain encrypted with highly secure
encryption schemes. For such fields, CryptDB provides optimal security, assuming their value is
independent of the pattern in which they are accessed (which is the case for medical information,
social security numbers, etc). CryptDB is not optimal for fields requiring more revealing encryption
schemes, but we find that most such fields are semi-sensitive (such as timestamps).
Finally, we believe that a passive attack model is realistic because malicious DBAs are more likely
to read the data, which may be hard to detect, than to change the data or query results, which is more
likely to be discovered. In §9, we cite related work on data integrity that could be used in combination
with our work. An active adversary that can insert or update data may be able to indirectly compromise
confidentiality. For example, an adversary that modifies an email field in the database may be able
to trick the application into sending a user’s data to the wrong email address, when the user asks the
application to email her a copy of her own data. Such active attacks on the DBMS fall under the
second threat model, which we now discuss.
Threat 2: Arbitrary Threats
We now describe the second threat where the application server, proxy, and DBMS server infrastructures may be compromised arbitrarily. The approach in threat 1 is insufficient because an adversary
can now get access to the keys used to encrypt the entire database.
The solution is to encrypt different data items (e.g., data belonging to different users) with
different keys. To determine the key that should be used for each data item, developers annotate the
application’s database schema to express finer-grained confidentiality policies. A curious DBA still
cannot obtain private data by snooping on the DBMS server (threat 1), and in addition, an adversary
who compromises the application server or the proxy can now decrypt only data of currently logged-in
users (which are stored in the proxy). Data of currently inactive users would be encrypted with keys
not available to the adversary, and would remain confidential.
In this configuration, CryptDB provides strong guarantees in the face of arbitrary server-side
compromises, including those that gain root access to the application or the proxy. CryptDB leaks at
most the data of currently active users for the duration of the compromise, even if the proxy behaves
in a Byzantine fashion. By “duration of a compromise”, we mean the interval from the start of the
compromise until any trace of the compromise has been erased from the system. For a read SQL
injection attack, the duration of the compromise spans the attacker’s SQL queries. In the above
example of an adversary changing the email address of a user in the database, we consider the system
compromised for as long as the attacker’s email address persists in the database.
This section describes how CryptDB executes SQL queries over encrypted data. The threat model
in this section is threat 1 from §2.1. The DBMS machines and administrators are not trusted, but the
application and the proxy are trusted.
CryptDB enables the DBMS server to execute SQL queries on encrypted data almost as if it were
executing the same queries on plaintext data. Existing applications do not need to be changed. The
DBMS’s query plan for an encrypted query is typically the same as for the original query, except that
the operators comprising the query, such as selections, projections, joins, aggregates, and orderings,
are performed on ciphertexts, and use modified operators in some cases.
CryptDB’s proxy stores a secret master key MK, the database schema, and the current encryption
layers of all columns. The DBMS server sees an anonymized schema (in which table and column
names are replaced by opaque identifiers), encrypted user data, and some auxiliary tables used by
CryptDB. CryptDB also equips the server with CryptDB-specific user-defined functions (UDFs) that
enable the server to compute on ciphertexts for certain operations.
Processing a query in CryptDB involves four steps:
1. The application issues a query, which the proxy intercepts and rewrites: it anonymizes each table
and column name, and, using the master key MK, encrypts each constant in the query with an
encryption scheme best suited for the desired operation (§3.1).
2. The proxy checks if the DBMS server should be given keys to adjust encryption layers before
executing the query, and if so, issues an UPDATE query at the DBMS server that invokes a UDF to
adjust the encryption layer of the appropriate columns (§3.2).
3. The proxy forwards the encrypted query to the DBMS server, which executes it using standard
SQL (occasionally invoking UDFs for aggregation or keyword search).
4. The DBMS server returns the (encrypted) query result, which the proxy decrypts and returns to the
application.
SQL-aware Encryption
We now describe the encryption types that CryptDB uses, including a number of existing cryptosystems,
an optimization of a recent scheme, and a new cryptographic primitive for joins. For each encryption
type, we explain the security property that CryptDB requires from it, its functionality, and how it is
implemented.
Random (RND).
RND provides the maximum security in CryptDB: indistinguishability under an
adaptive chosen-plaintext attack (IND-CPA); the scheme is probabilistic, meaning that two equal values
are mapped to different ciphertexts with overwhelming probability. On the other hand, RND does
not allow any computation to be performed efficiently on the ciphertext. An efficient construction of
RND is to use a block cipher like AES or Blowfish in CBC mode together with a random initialization
vector (IV). (We mostly use AES, except for integer values, where we use Blowfish for its 64-bit block
size because the 128-bit block size of AES would cause the ciphertext to be significantly longer).
Since, in this threat model, CryptDB assumes the server does not change results, CryptDB does
not require a stronger IND-CCA2 construction (which would be secure under a chosen-ciphertext
attack). However, it would be straightforward to use an IND-CCA2-secure implementation of RND
instead, such as a block cipher in UFE mode [13], if needed.
Deterministic (DET).
DET has a slightly weaker guarantee, yet it still provides strong security: it
leaks only which encrypted values correspond to the same data value, by deterministically generating
the same ciphertext for the same plaintext. This encryption layer allows the server to perform equality
checks, which means it can perform selects with equality predicates, equality joins, GROUP BY, COUNT,
DISTINCT, etc.
In cryptographic terms, DET should be a pseudo-random permutation (PRP) [20]. For 64-bit and
128-bit values, we use a block cipher with a matching block size (Blowfish and AES respectively);
we make the usual assumption that the AES and Blowfish block ciphers are PRPs. We pad smaller
values out to 64 bits, but for data that is longer than a single 128-bit AES block, the standard CBC
mode of operation leaks prefix equality (e.g., if two data items have an identical prefix that is at least
128 bits long). To avoid this problem, we use AES with a variant of the CMC mode [24], which can
be approximately thought of as one round of CBC, followed by another round of CBC with the blocks
in the reverse order. Since the goal of DET is to reveal equality, we use a zero IV (or “tweak” [24]) for
our AES-CMC implementation of DET.
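The contrast between RND and DET can be illustrated with a small stand-in sketch. Note the caveats: CryptDB builds RND from AES/Blowfish in CBC mode and DET from a block-cipher PRP, whereas this toy uses HMAC-SHA256 as a keyed function, which demonstrates the leakage properties but is not an invertible cipher:

```python
import hashlib
import hmac
import os

KEY = os.urandom(32)

def det(key, value):
    # Deterministic: equal plaintexts give equal ciphertexts, so the
    # server can evaluate equality predicates, GROUP BY, etc.
    return hmac.new(key, value, hashlib.sha256).digest()

def rnd(key, value):
    # Probabilistic: a fresh random IV makes equal plaintexts map to
    # different ciphertexts with overwhelming probability.
    iv = os.urandom(16)
    return iv + hmac.new(key, iv + value, hashlib.sha256).digest()

assert det(KEY, b"alice") == det(KEY, b"alice")  # equality is visible
assert rnd(KEY, b"alice") != rnd(KEY, b"alice")  # nothing is visible
```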
Order-preserving encryption (OPE).
OPE allows order relations between data items to be established based on their encrypted values, without revealing the data itself. If x < y, then OPE_K(x) <
OPE_K(y), for any secret key K. Therefore, if a column is encrypted with OPE, the server can perform
range queries when given encrypted constants OPE_K(c1) and OPE_K(c2) corresponding to the range
[c1, c2]. The server can also perform ORDER BY, MIN, MAX, SORT, etc.
OPE is a weaker encryption scheme than DET because it reveals order. Thus, the CryptDB proxy
will only reveal OPE-encrypted columns to the server if users request order queries on those columns.
OPE has provable security guarantees [4]: the encryption is equivalent to a random mapping that
preserves order.
The scheme we use [4] is the first provably secure such scheme. Until CryptDB, there was no
implementation nor any measure of the practicality of the scheme. The direct implementation of
the scheme took 25 ms per encryption of a 32-bit integer on an Intel 2.8 GHz Q9550 processor. We
improved the algorithm by using AVL binary search trees for batch encryption (e.g., database loads),
reducing the cost of OPE encryption to 7 ms per encryption without affecting its security. We also
implemented a hypergeometric sampler that lies at the core of OPE, porting a Fortran implementation
from 1988 [25].
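As a toy illustration of the order-preserving property (not the provably secure scheme of [4], which requires no domain enumeration), one can assign each plaintext in a small known domain a strictly increasing pseudorandom code:

```python
import random

def build_ope_table(domain_size, seed):
    # Assign each plaintext in [0, domain_size) a strictly increasing
    # pseudorandom code; positive gaps guarantee order preservation.
    rng = random.Random(seed)
    codes, acc = [], 0
    for _ in range(domain_size):
        acc += rng.randint(1, 1000)
        codes.append(acc)
    return codes

table = build_ope_table(256, seed=42)
enc = lambda x: table[x]

# x < y implies enc(x) < enc(y), so the server can answer range queries.
assert all(enc(x) < enc(x + 1) for x in range(255))
```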
Homomorphic encryption (HOM).
HOM is a secure probabilistic encryption scheme (IND-CPA secure), allowing the server to perform computations on encrypted data with the final result
decrypted at the proxy. While fully homomorphic encryption is prohibitively slow [10], homomorphic
encryption for specific operations is efficient. To support summation, we implemented the Paillier
cryptosystem [35]. With Paillier, multiplying the encryptions of two values results in an encryption
of the sum of the values, i.e., HOMK (x) · HOMK (y) = HOMK (x + y), where the multiplication is
performed modulo some public-key value. To compute SUM aggregates, the proxy replaces SUM with
calls to a UDF that performs Paillier multiplication on a column encrypted with HOM. HOM can also
be used for computing averages by having the DBMS server return the sum and the count separately,
and for incrementing values (e.g., SET id=id+1), on which we elaborate shortly.
With HOM, the ciphertext is 2048 bits. In theory, it should be possible to pack multiple values
from a single row into one HOM ciphertext for that row, using the scheme of Ge and Zdonik [17],
which would result in an amortized space overhead of 2× (e.g., a 32-bit value occupies 64 bits) for
a table with many HOM-encrypted columns. However, we have not implemented this optimization
in our prototype. This optimization would also complicate partial-row UPDATE operations that reset
some—but not all—of the values packed into a HOM ciphertext.
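The additive property can be checked with a minimal textbook Paillier implementation. The primes here are demo-sized and far too small for real security; as noted above, real deployments use parameters giving 2048-bit ciphertexts:

```python
import math
import random

# Demo-sized primes; real Paillier uses ~1024-bit primes.
p, q = 104723, 104729
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
g = n + 1
mu = pow(lam, -1, n)

def enc(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:      # r must be a unit modulo n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    x = pow(c, lam, n2)
    return (((x - 1) // n) * mu) % n

# Multiplying ciphertexts adds plaintexts: HOM_K(x) * HOM_K(y) = HOM_K(x + y).
assert dec((enc(17) * enc(25)) % n2) == 42
```

A SUM UDF on the server would simply fold the column's ciphertexts together with modular multiplication, returning one ciphertext for the proxy to decrypt.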
[Figure 2 diagram: four onions, layered outermost to innermost.
Onion Eq: RND (no functionality) → DET (equality selection) → JOIN (equality join) → any value.
Onion Ord: RND (no functionality) → OPE (order) → OPE-JOIN (range join) → any value.
Onion Search: SEARCH (word search) → text value.
Onion Add: HOM (add) → int value.]
Figure 2: Onion encryption layers and the classes of computation they allow. Onion names stand for the
operations they allow at some of their layers (Equality, Order, Search, and Addition). In practice, some
onions or onion layers may be omitted, depending on column types or schema annotations provided by
application developers (§3.5.2). DET and JOIN are often merged into a single onion layer, since JOIN is a
concatenation of DET and JOIN-ADJ (§3.4). A random IV for RND (§3.1), shared by the RND layers in Eq
and Ord, is also stored for each data item.
Join (JOIN and OPE-JOIN).
A separate encryption scheme is necessary to allow equality joins
between two columns, because we use different keys for DET to prevent cross-column correlations.
JOIN supports all operations allowed by DET, and additionally enables the server to determine repeating values between two columns. OPE-JOIN enables joins by order relations. We provide a new
cryptographic scheme for JOIN and we discuss it in §3.4.
Word search (SEARCH).
SEARCH is used to perform searches on encrypted text to support
operations such as MySQL’s LIKE operator. We implemented the cryptographic protocol of Song et
al. [46], which was not previously implemented by the authors; we also use their protocol in a different
way, which results in better security guarantees. For each column needing SEARCH, we split the text
into keywords using standard delimiters (or using a special keyword extraction function specified by
the schema developer). We then remove repetitions in these words, randomly permute the positions of
the words, and then encrypt each of the words using Song et al.’s scheme, padding each word to the
same size. SEARCH is nearly as secure as RND: the encryption does not reveal to the DBMS server
whether a certain word repeats in multiple rows, but it leaks the number of keywords encrypted with
SEARCH; an adversary may be able to estimate the number of distinct or duplicate words (e.g., by
comparing the size of the SEARCH and RND ciphertexts for the same data).
When the user performs a query such as SELECT * FROM messages WHERE msg LIKE "%
alice %", the proxy gives the DBMS server a token, which is an encryption of alice. The server
cannot decrypt the token to figure out the underlying word. Using a user-defined function, the DBMS
server checks if any of the word encryptions in any message match the token. In our approach, all
the server learns from searching is whether a token matched a message or not, and this happens only
for the tokens requested by the user. The server would learn the same information when returning
the result set to the users, so the overall search scheme reveals the minimum amount of additional
information needed to return the result.
Note that SEARCH allows CryptDB to only perform full-word keyword searches; it cannot
support arbitrary regular expressions. For applications that require searching for multiple adjacent
words, CryptDB allows the application developer to disable duplicate removal and re-ordering by
annotating the schema, even though this is not the default. Based on our trace evaluation, we find that
most uses of LIKE can be supported by SEARCH with such schema annotations. Of course, one can
still combine multiple LIKE operators with AND and OR to check whether multiple independent words
are in the text.
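A rough sketch of the indexing side conveys the flow. The key and function names here are ours; HMAC stands in for Song et al.'s scheme, and unlike the real scheme this toy reveals word repetition across rows and omits the padding of words to a common size:

```python
import hashlib
import hmac
import random

KEY = b"per-column-search-key"  # hypothetical column key

def word_token(word):
    # Token the proxy hands to the server; it cannot be inverted to the word.
    return hmac.new(KEY, word.lower().encode(), hashlib.sha256).digest()

def index_text(text, seed=0):
    # Split into keywords, deduplicate, randomly permute positions,
    # then encrypt each word.
    words = list(set(text.lower().split()))
    random.Random(seed).shuffle(words)
    return [word_token(w) for w in words]

row = index_text("meet alice at noon alice confirmed")

# The server tests token membership without learning the word itself.
assert word_token("alice") in row
assert word_token("bob") not in row
```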
Adjustable Query-based Encryption
A key part of CryptDB’s design is adjustable query-based encryption, which dynamically adjusts the
layer of encryption on the DBMS server. Our goal is to use the most secure encryption schemes that
enable running the requested queries. For example, if the application issues no queries that compare
data items in a column, or that sort a column, the column should be encrypted with RND. For columns
that require equality checks but not inequality checks, DET suffices. However, the query set is not
[Figure 3 diagram: the application’s table Employees, with columns (ID, Name) and one row (23, Alice), is stored at the DBMS server as an anonymized table with columns C1-IV, C1-Eq, C1-Ord, C1-Add, C2-IV, C2-Eq, C2-Ord, C2-Search, holding ciphertexts such as x27c3, x2b82, xcb94, xc2e4, x8a13, xd1e3, x7eb1.]
Figure 3: Data layout at the server. When the application creates the table shown on the left, the table
created at the DBMS server is the one shown on the right. Ciphertexts shown are not full-length.
always known in advance. Thus, we need an adaptive scheme that dynamically adjusts encryption layers.
Our idea is to encrypt each data item in one or more onions: that is, each value is dressed in layers
of increasingly stronger encryption, as illustrated in Figures 2 and 3. Each layer of each onion enables
certain kinds of functionality as explained in the previous subsection. For example, outermost layers
such as RND and HOM provide maximum security, whereas inner layers such as OPE provide more functionality.
Multiple onions are needed in practice, both because the computations supported by different
encryption schemes are not always strictly ordered, and because of performance considerations (size
of ciphertext and encryption time for nested onion layers). Depending on the type of the data (and any
annotations provided by the application developer on the database schema, as discussed in §3.5.2),
CryptDB may not maintain all onions for each column. For instance, the Search onion does not make
sense for integers, and the Add onion does not make sense for strings.
For each layer of each onion, the proxy uses the same key for encrypting values in the same
column, and different keys across tables, columns, onions, and onion layers. Using the same key for
all values in a column allows the proxy to perform operations on a column without having to compute
separate keys for each row that will be manipulated. (We use finer-grained encryption keys in §4 to
reduce the potential amount of data disclosure in case of an application or proxy server compromise.)
Using different keys across columns prevents the server from learning any additional relations. All
of these keys are derived from the master key MK. For example, for table t, column c, onion o, and
encryption layer l, the proxy uses the key
K_{t,c,o,l} = PRP_{MK}(table t, column c, onion o, layer l),    (1)
where PRP is a pseudorandom permutation (e.g., AES).
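This derivation can be sketched as follows, using HMAC-SHA256 as a PRF stand-in for the PRP named above (the label encoding and separator are our own choices):

```python
import hashlib
import hmac

MK = b"proxy-master-key"  # the proxy's secret master key

def derive_key(table, column, onion, layer):
    # K_{t,c,o,l} = PRF_{MK}(t, c, o, l); HMAC-SHA256 stands in for the
    # PRP (e.g., AES) used by CryptDB.
    label = "|".join((table, column, onion, layer)).encode()
    return hmac.new(MK, label, hashlib.sha256).digest()

# Same tuple, same key; change any component and the key is unrelated.
assert derive_key("t1", "c2", "Eq", "DET") == derive_key("t1", "c2", "Eq", "DET")
assert derive_key("t1", "c2", "Eq", "DET") != derive_key("t1", "c2", "Eq", "RND")
```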
Each onion starts out encrypted with the most secure encryption scheme (RND for onions Eq and
Ord, HOM for onion Add, and SEARCH for onion Search). As the proxy receives SQL queries from
the application, it determines whether layers of encryption need to be removed. Given a predicate P
on column c needed to execute a query on the server, the proxy first establishes what onion layer is
needed to compute P on c. If the encryption of c is not already at an onion layer that allows P, the
proxy strips off the onion layers to allow P on c, by sending the corresponding onion key to the server.
The proxy never decrypts the data past the least-secure encryption onion layer (or past some other
threshold layer, if specified by the application developer in the schema, §3.5.1).
CryptDB implements onion layer decryption using UDFs that run on the DBMS server. For
example, in Figure 3, to decrypt onion Ord of column 2 in table 1 to layer OPE, the proxy issues the
following query to the server using the DECRYPT_RND UDF:
UPDATE Table1 SET C2-Ord = DECRYPT_RND(K, C2-Ord, C2-IV)
where K is the appropriate key computed from Equation (1). At the same time, the proxy updates its
own internal state to remember that column C2-Ord in Table1 is now at layer OPE in the DBMS. Each
column decryption should be included in a transaction to avoid consistency problems with clients
accessing columns being adjusted.
Note that onion decryption is performed entirely by the DBMS server. In the steady state, no
server-side decryptions are needed, because onion decryption happens only when a new class of
computation is requested on a column. For example, after an equality check is requested on a column
and the server brings the column to layer DET, the column remains in that state, and future queries
with equality checks require no decryption. This property is the insight into why CryptDB’s overhead
is modest in the steady state (see §8): the server mostly performs typical SQL processing.
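The adjustment bookkeeping described above can be sketched as a small proxy-side state machine. The names (`OnionState`, the single-step layer list) are hypothetical simplifications; an adjustment from RND directly to JOIN would in reality strip two layers:

```python
# Onion Eq layers, outermost to innermost.
LAYERS = ["RND", "DET", "JOIN"]

class OnionState:
    def __init__(self):
        self.layer = {}  # (table, column) -> current outermost layer

    def adjust(self, table, column, needed):
        cur = self.layer.get((table, column), "RND")
        if LAYERS.index(cur) >= LAYERS.index(needed):
            return None  # already exposed: steady state, no decryption
        self.layer[(table, column)] = needed
        iv_col = column.split("-")[0] + "-IV"
        # K would be derived from MK as in Equation (1).
        return (f"UPDATE {table} SET {column} = "
                f"DECRYPT_{cur}(K, {column}, {iv_col})")

state = OnionState()
first = state.adjust("Table1", "C2-Eq", "DET")
assert first == "UPDATE Table1 SET C2-Eq = DECRYPT_RND(K, C2-Eq, C2-IV)"
assert state.adjust("Table1", "C2-Eq", "DET") is None  # no repeat work
```

The second call returning `None` is the steady-state property: once a class of computation has been enabled on a column, later queries incur no server-side decryption.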
Executing over Encrypted Data
Once the onion layers in the DBMS are at the layer necessary to execute a query, the proxy transforms
the query to operate on these onions. In particular, the proxy replaces column names in a query with
corresponding onion names, based on the class of computation performed on that column. For example,
for the schema shown in Figure 3, a reference to the Name column for an equality comparison will be
replaced with a reference to the C2-Eq column.
The proxy also replaces each constant in the query with a corresponding onion encryption of that
constant, based on the computation in which it is used. For instance, if a query contains WHERE Name
= ‘Alice’, the proxy encrypts ‘Alice’ by successively applying all encryption layers corresponding
to onion Eq that have not yet been removed from C2-Eq.
Finally, the server replaces certain operators with UDF-based counterparts. For instance, the SUM
aggregate operator and the + column-addition operator must be replaced with an invocation of a UDF
that performs HOM addition of ciphertexts. Equality and order operators (such as = and <) do not
need such replacement and can be applied directly to the DET and OPE ciphertexts.
Once the proxy has transformed the query, it sends the query to the DBMS server, receives query
results (consisting of encrypted data), decrypts the results using the corresponding onion keys, and
sends the decrypted result to the application.
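The name-and-constant rewriting can be sketched as follows. The schema map, the `enc_eq` stand-in, and the helper names are hypothetical, and a real proxy parses SQL rather than building strings:

```python
import hashlib
import hmac

# Hypothetical anonymization map and Eq-onion key for the sketch.
SCHEMA = {"Employees": ("Table1", {"ID": "C1", "Name": "C2"})}
KEY = b"eq-onion-key"

def enc_eq(value):
    # Stand-in for applying the remaining Eq-onion layers to a constant.
    return "x" + hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:8]

def rewrite_equality(table, sel_col, where_col, constant):
    anon_table, cols = SCHEMA[table]
    sel, whr = cols[sel_col], cols[where_col]
    # Fetch the IV column too, so the proxy can strip the RND layer.
    return (f"SELECT {sel}-Eq, {sel}-IV FROM {anon_table} "
            f"WHERE {whr}-Eq = {enc_eq(constant)}")

sql = rewrite_equality("Employees", "ID", "Name", "Alice")
assert "Table1" in sql and "C2-Eq" in sql
assert "Employees" not in sql and "Alice" not in sql  # nothing leaks
```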
Read query execution.
To understand query execution over ciphertexts, consider the example
schema shown in Figure 3. Initially, each column in the table is dressed in all onions of encryption,
with RND, HOM, and SEARCH as outermost layers, as shown in Figure 2. At this point, the server
can learn nothing about the data other than the number of columns, rows, and data size.
To illustrate when onion layers are removed, consider the query:
SELECT ID FROM Employees WHERE Name = ‘Alice’,
which requires lowering the encryption of Name to layer DET. To execute this query, the proxy first
issues the query
UPDATE Table1 SET C2-Eq = DECRYPT_RND(K_{T1,C2,Eq,RND}, C2-Eq, C2-IV),
where column C2 corresponds to Name. The proxy then issues
SELECT C1-Eq, C1-IV FROM Table1 WHERE C2-Eq = x7..d,
where column C1 corresponds to ID, and where x7..d is the Eq onion encryption of “Alice” with keys
K_{T1,C2,Eq,JOIN} and K_{T1,C2,Eq,DET} (see Figure 2). Note that the proxy must request the random IV from
column C1-IV in order to decrypt the RND ciphertext from C1-Eq. Finally, the proxy decrypts the
results from the server using keys K_{T1,C1,Eq,RND}, K_{T1,C1,Eq,DET}, and K_{T1,C1,Eq,JOIN}, obtains the result
23, and returns it to the application.
If the next query is SELECT COUNT(*) FROM Employees WHERE Name = ‘Bob’, no server-side decryptions are necessary, and the proxy directly issues the query SELECT COUNT(*) FROM
Table1 WHERE C2-Eq = xbb..4a, where xbb..4a is the Eq onion encryption of “Bob” using
K_{T1,C2,Eq,JOIN} and K_{T1,C2,Eq,DET}.
Write query execution.
To support INSERT, DELETE, and UPDATE queries, the proxy applies the
same processing to the predicates (i.e., the WHERE clause) as for read queries. DELETE queries require
no additional processing. For all INSERT and UPDATE queries that set the value of a column to a
constant, the proxy encrypts each inserted column’s value with each onion layer that has not yet been
stripped off in that column.
The remaining case is an UPDATE that sets a column value based on an existing column value,
such as salary=salary+1. Such an update would have to be performed using HOM, to handle
additions. However, in doing so, the values in the OPE and DET onions would become stale. In fact,
any hypothetical encryption scheme that simultaneously allows addition and direct comparison on the
ciphertext is insecure: if a malicious server can compute the order of the items, and can increment
the value by one, the server can repeatedly add one to each field homomorphically until it becomes
equal to some other value in the same column. This would allow the server to compute the difference
between any two values in the database, which is almost equivalent to knowing their values.
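This argument can be made concrete with a toy: grant the server oracles for "add one" and "equality" over a deterministic toy cipher (all names here are ours, for illustration), and it recovers the difference between any two ciphertexts simply by counting increments:

```python
PAD = 0xDEADBEEF

# Toy deterministic "cipher" plus the two oracles the text warns about.
enc = lambda m: m ^ PAD
add1 = lambda c: enc((c ^ PAD) + 1)   # increment under encryption
eq = lambda c1, c2: c1 == c2          # direct ciphertext comparison

def difference(ct_a, ct_b, limit=100_000):
    # Repeatedly add one to ct_a until it equals ct_b, counting steps.
    c, steps = ct_a, 0
    while not eq(c, ct_b):
        c, steps = add1(c), steps + 1
        assert steps <= limit
    return steps

# The server learns the exact gap between two hidden values.
assert difference(enc(100), enc(142)) == 42
```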
There are two approaches to allow updates based on existing column values. If a column is
incremented and then only projected (no comparisons are performed on it), the solution is simple:
when a query requests the value of this field, the proxy should request the HOM ciphertext from
the Add onion, instead of ciphertexts from other onions, because the HOM value is up-to-date. For
instance, this approach applies to increment queries in TPC-C. If a column is used in comparisons after
it is incremented, the solution is to replace the update query with two queries: a SELECT of the old
values to be updated, which the proxy increments and encrypts accordingly, followed by an UPDATE
setting the new values. This strategy would work well for updates that affect a small number of rows.
Other DBMS features.
Most other DBMS mechanisms, such as transactions and indexing, work
the same way with CryptDB over encrypted data as they do over plaintext, with no modifications.
For transactions, the proxy passes along any BEGIN, COMMIT, and ABORT queries to the DBMS. Since
many SQL operators behave differently on NULLs than on non-NULL values, CryptDB exposes
NULL values to the DBMS without encryption. CryptDB does not currently support stored procedures,
although certain stored procedures could be supported by rewriting their code in the same way that
CryptDB’s proxy rewrites SQL statements.
The DBMS builds indexes for encrypted data in the same way as for plaintext. Currently, if the
application requests an index on a column, the proxy asks the DBMS server to build indexes on that
column’s DET, JOIN, OPE, or OPE-JOIN onion layers (if they are exposed), but not for RND, HOM,
or SEARCH. More efficient index selection algorithms could be investigated.
Computing Joins
There are two kinds of joins supported by CryptDB: equi-joins, in which the join predicate is based
on equality, and range joins, which involve order checks. To perform an equi-join of two encrypted
columns, the columns should be encrypted with the same key so that the server can see matching
values between the two columns. At the same time, to provide better privacy, the DBMS server should
not be able to join columns for which the application did not request a join, so columns that are never
joined should not be encrypted with the same keys.
If the queries that can be issued, or the pairs of columns that can be joined, are known a priori,
equi-join is easy to support: CryptDB can use the DET encryption scheme with the same key for each
group of columns that are joined together. §3.5 describes how the proxy learns the columns to be
joined in this case. However, the challenging case is when the proxy does not know the set of columns
to be joined a priori, and hence does not know which columns should be encrypted with matching keys.
To solve this problem, we introduce a new cryptographic primitive, JOIN-ADJ (adjustable join),
which allows the DBMS server to adjust the key of each column at runtime. Intuitively, JOIN-ADJ can
be thought of as a keyed cryptographic hash with the additional property that hashes can be adjusted
to change their key without access to the plaintext. JOIN-ADJ is a deterministic function of its input,
which means that if two plaintexts are equal, the corresponding JOIN-ADJ values are also equal.
JOIN-ADJ is collision-resistant, and has a sufficiently long output length (192 bits) to allow us to
assume that collisions never happen in practice.
JOIN-ADJ is non-invertible, so we define the JOIN encryption scheme as
JOIN(v) = JOIN-ADJ(v) ‖ DET(v),
where ‖ denotes concatenation. This construction allows the proxy to decrypt a JOIN(v) column to
obtain v by decrypting the DET component, and allows the DBMS server to check two JOIN values
for equality by comparing the JOIN-ADJ components.
Each column is initially encrypted at the JOIN layer using a different key, thus preventing any
joins between columns. When a query requests a join, the proxy gives the DBMS server an onion key
to adjust the JOIN-ADJ values in one of the two columns, so that it matches the JOIN-ADJ key of
the other column (denoted the join-base column). After the adjustment, the columns share the same
JOIN-ADJ key, allowing the DBMS server to join them for equality. The DET components of JOIN
remain encrypted with different keys.
Note that our adjustable join is transitive: if the user joins columns A and B and then joins columns
B and C, the server can join A and C. However, the server cannot join columns in different “transitivity
groups”. For instance, if columns D and E were joined together, the DBMS server would not be able
to join columns A and D on its own.
After an initial join query, the JOIN-ADJ values remain transformed with the same key, so no
re-adjustments are needed for subsequent join queries between the same two columns. One exception
is if the application issues another query, joining one of the adjusted columns with a third column,
which causes the proxy to re-adjust the column to another join-base. To avoid oscillations and to
converge to a state where all columns in a transitivity group share the same join-base, CryptDB chooses
the first column in lexicographic order on table and column name as the join-base. For n columns, the
overall maximum number of join transitions is n(n − 1)/2.
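The transitivity-group bookkeeping can be sketched as follows. `JoinGroups` and its method names are hypothetical; for simplicity it tracks each group as a shared set and derives the join-base as the lexicographically smallest member, as the text describes.

```python
# Sketch of adjustable-join transitivity groups: columns joined (directly or
# transitively) share a group, and the group's join-base is the
# lexicographically first table.column name in it.
class JoinGroups:
    def __init__(self):
        self.group = {}  # column -> set of columns sharing a JOIN-ADJ key

    def join(self, a, b):
        merged = self.group.get(a, {a}) | self.group.get(b, {b})
        for col in merged:
            self.group[col] = merged

    def join_base(self, col):
        return min(self.group.get(col, {col}))

    def can_join(self, a, b):
        return b in self.group.get(a, {a})

g = JoinGroups()
g.join("t.a", "t.b"); g.join("t.b", "t.c")   # joins A-B, then B-C
g.join("t.d", "t.e")                          # a separate group
print(g.can_join("t.a", "t.c"))  # True: transitive within a group
print(g.can_join("t.a", "t.d"))  # False: different transitivity groups
print(g.join_base("t.c"))        # t.a
```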
For range joins, a similar dynamic re-adjustment scheme is difficult to construct due to lack of
structure in OPE schemes. Instead, CryptDB requires that pairs of columns that will be involved
in such joins be declared by the application ahead of time, so that matching keys are used for layer
OPE-JOIN of those columns; otherwise, the same key will be used for all columns at layer OPE-JOIN.
Fortunately, range joins are rare; they are not used in any of our example applications, and are used in
only 50 out of 128,840 columns in a large SQL query trace we describe in §8, corresponding to just
three distinct applications.
JOIN-ADJ construction.
Our algorithm uses elliptic-curve cryptography (ECC). JOIN-ADJ_K(v) is computed as
JOIN-ADJ_K(v) := P^(K·PRF_K0(v)),
where K is the initial key for that table, column, onion, and layer; P is a point on an elliptic curve (a
public parameter); and PRF_K0 is a pseudo-random function [20] mapping values to pseudorandom
numbers, such as AES_K0(SHA(v)), with K0 being a key that is the same for all columns and derived
from MK. The “exponentiation” is in fact repeated geometric addition of elliptic curve points; it is
considerably faster than RSA exponentiation.
When a query joins columns c and c′, each having keys K and K′ at the join layer, the proxy
computes ∆K = K/K′ (in an appropriate group) and sends it to the server. Then, given JOIN-ADJ_K′(v)
(the JOIN-ADJ values from column c′) and ∆K, the DBMS server uses a UDF to adjust the key in c′
by computing:
(JOIN-ADJ_K′(v))^∆K = P^(K′·PRF_K0(v)·(K/K′)) = P^(K·PRF_K0(v)) = JOIN-ADJ_K(v).
Now columns c and c′ share the same JOIN-ADJ key, and the DBMS server can perform an equi-join
on c and c′ by taking the JOIN-ADJ component of the JOIN onion ciphertext.
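The key-adjustment algebra can be illustrated in the multiplicative group modulo a safe prime, as a stand-in for the paper's elliptic-curve group, with SHA-256 modeling the PRF. The parameters below are far too small to be secure; this toy only makes the exponent arithmetic concrete.

```python
import hashlib

# Toy adjustable join (JOIN-ADJ) in the multiplicative group mod p = 2q + 1,
# where g generates the order-q subgroup. Insecure, illustration-only sizes.
p, q, g = 23, 11, 2
K0 = b"global-prf-key"            # shared PRF key (assumed derived from MK)

def prf(v: bytes) -> int:
    """Model PRF_K0 with SHA-256; output in [1, q-1]."""
    return int.from_bytes(hashlib.sha256(K0 + v).digest(), "big") % (q - 1) + 1

def join_adj(key: int, v: bytes) -> int:
    return pow(g, (key * prf(v)) % q, p)      # analogue of P^(K * PRF(v))

def delta_key(k_new: int, k_old: int) -> int:
    return (k_new * pow(k_old, -1, q)) % q    # Delta-K = K / K' in Z_q

def adjust(ct: int, dk: int) -> int:
    return pow(ct, dk, p)                     # the server-side UDF

K, K2 = 3, 7                                  # join-layer keys of columns c, c'
dk = delta_key(K, K2)
v = b"alice"
print(join_adj(K, v) == join_adj(K2, v))      # False: keys differ, no join
print(adjust(join_adj(K2, v), dk) == join_adj(K, v))  # True after adjustment
```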
At a high level, the security of this scheme is that the server cannot infer join relations among
groups of columns beyond those requested by legitimate join queries, and that the scheme does not
reveal the plaintext. We proved the security of this scheme based on the standard Elliptic-Curve
Decisional Diffie-Hellman hardness assumption, and implemented it using a NIST-approved elliptic
curve. We plan to publish a more detailed description of this algorithm and the proof on our web
site [37].
Improving Security and Performance
Although CryptDB can operate with an unmodified and unannotated schema, as described above, its
security and performance can be improved through several optional optimizations, as described below.
Security Improvements
Minimum onion layers.
Application developers can specify the lowest onion encryption layer
that may be revealed to the server for a specific column. In this way, the developer can ensure that the
proxy will not execute queries exposing sensitive relations to the server. For example, the developer
could specify that credit card numbers should always remain at RND or DET.
In-proxy processing.
Although CryptDB can evaluate a number of predicates on the server,
evaluating them in the proxy can improve security by not revealing additional information to the server.
One common use case is a SELECT query that sorts on one of the selected columns, without a LIMIT
on the number of returned rows. Since the proxy receives the entire result set from the server,
sorting these results in the proxy does not require a significant amount of computation, and does not
increase the bandwidth requirements. Doing so avoids revealing the OPE encryption of that column to
the server.
Training mode.
CryptDB provides a training mode, which allows a developer to provide a trace of
queries and get the resulting onion encryption layers for each field, along with a warning in case some
query is not supported. The developer can then examine the resulting encryption levels to understand
what each encryption scheme leaks, as described in §2.1. If some onion level is too low for a sensitive
field, she should arrange to have the query processed in the proxy (as described above), or to process
the data in some other fashion, such as by using a local instance of SQLite.
Onion re-encryption.
In cases when an application performs infrequent queries requiring a low
onion layer (e.g., OPE), CryptDB could be extended to re-encrypt onions back to a higher layer after
the infrequent query finishes executing. This approach reduces leakage to attacks happening in the
time window when the data is at the higher onion layer.
Performance Optimizations
Developer annotations.
By default, CryptDB encrypts all fields and creates all applicable onions
for each data item based on its type. If many columns are not sensitive, the developer can instead
provide explicit annotations indicating the sensitive fields (as described in §4), and leave the remaining
fields in plaintext.
Known query set.
If the developer knows some of the queries ahead of time, as is the case for
many web applications, the developer can use the training mode described above to adjust onions
to the correct layer a priori, avoiding the overhead of runtime onion adjustments. If the developer
provides the exact query set, or annotations that certain functionality is not needed on some columns,
CryptDB can also discard onions that are not needed (e.g., discard the Ord onion for columns that
are not used in range queries, or discard the Search onion for columns where keyword search is not
performed), discard onion layers that are not needed (e.g., the adjustable JOIN layer, if joins are known
a priori), or discard the random IV needed for RND for some columns.
Ciphertext pre-computing and caching.
The proxy spends a significant amount of time encrypting values used in queries with OPE and HOM. To reduce this cost, the proxy pre-computes (for
HOM) and caches (for OPE) encryptions of frequently used constants under different keys. Since
HOM is probabilistic, ciphertexts cannot be reused. Therefore, in addition, the proxy pre-computes
HOM’s Paillier rn randomness values for future encryptions of any data. This optimization reduces the
amount of CPU time spent by the proxy on OPE encryption, and assuming the proxy is occasionally
idle to perform HOM pre-computation, it removes HOM encryption from the critical path.
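The r^n pre-computation can be sketched with a toy Paillier instance (the HOM scheme), using g = n + 1 so that encryption needs only one modular multiplication once r^n is available. The primes below are tiny and insecure; this is an illustration of the optimization, not CryptDB's implementation.

```python
import random
from math import gcd

# Toy Paillier with g = n + 1: the proxy precomputes the expensive r^n values
# while idle, leaving only a cheap multiplication per encryption at query time.
p_, q_ = 61, 53
n, n2 = p_ * q_, (p_ * q_) ** 2
lam = (p_ - 1) * (q_ - 1) // gcd(p_ - 1, q_ - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)

def precompute_rn() -> int:
    """The costly part of encryption: r^n mod n^2 for a fresh random r."""
    while True:
        r = random.randrange(2, n)
        if gcd(r, n) == 1:
            return pow(r, n, n2)

pool = [precompute_rn() for _ in range(16)]        # filled while proxy is idle

def encrypt(m: int) -> int:
    rn = pool.pop() if pool else precompute_rn()   # cheap at query time
    return ((1 + m * n) * rn) % n2                 # (n+1)^m * r^n mod n^2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n * mu) % n

c1, c2 = encrypt(42), encrypt(100)
print(decrypt((c1 * c2) % n2))                     # HOM addition -> 142
```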
Multi-Principal Mode
We now extend the threat model to the case when the application infrastructure and proxy are also
untrusted (threat 2). This model is especially relevant for a multi-user web site running a web and
application server. To understand both the problems faced by a multi-user web application and
CryptDB’s solution to these problems, consider phpBB, a popular online web forum. In phpBB, each
user has an account and a password, belongs to certain groups, and can send private messages to other
users. Depending on their groups’ permissions, users can read entire forums, only forum names, or
not be able to read a forum at all.
There are several confidentiality guarantees that would be useful in phpBB. For example, we
would like to ensure that a private message sent from one user to another is not visible to anyone
else; that posts in a forum are accessible only to users in a group with access to that forum; and that
the name of a forum is shown only to users belonging to a group that’s allowed to view it. CryptDB
provides these guarantees in the face of arbitrary compromises, thereby limiting the damage caused by
a compromise.
Achieving these guarantees requires addressing two challenges. First, CryptDB must capture the
application’s access control policy for shared data at the level of SQL queries. To do this, CryptDB
requires developers to annotate their database schema to specify principals and the data that each
principal has access to, as described in §4.1.
The second challenge is to reduce the amount of information that an adversary can gain by
compromising the system. Our solution limits the leakage resulting from a compromised application
or proxy server to just the data accessible to users who were logged in during the compromise. In
particular, the attacker cannot access the data of users that were not logged in during the compromise.
Leaking the data of active users in case of a compromise is unavoidable: given the impracticality
of arbitrary computation on encrypted data, some data for active users must be decrypted by the proxy.
In CryptDB, each user has a key (e.g., her application-level password) that gives her access to
her data. CryptDB encrypts different data items with different keys, and enforces the access control
policy using chains of keys starting from user passwords and ending in the encryption keys of SQL
data items, as described in §4.2. When a user logs in, she provides her password to the proxy (via the
application). The proxy uses this password to derive onion keys to process queries on encrypted data,
as presented in the previous section, and to decrypt the results. The proxy can decrypt only the data
that the user has access to, based on the access control policy. The proxy gives the decrypted data to
the application, which can now compute on it. When the user logs out, the proxy deletes the user’s key.
Policy Annotations
To express the data privacy policy of a database-backed application at the level of SQL queries, the
application developer can annotate the schema of a database in CryptDB by specifying, for any subset
of data items, which principal has access to it. A principal is an entity, such as a user or a group,
over which it is natural to specify an access policy. Each SQL query involving an annotated data item
requires the privilege of the corresponding principal. CryptDB defines its own notion of principals
instead of using existing DBMS principals for two reasons: first, many applications do not map
application-level users to DBMS principals in a sufficiently fine-grained manner, and second, CryptDB
requires explicit delegation of privileges between principals that is difficult to extract in an automated
way from an access control list specification.
An application developer annotates the schema using the three steps described below and illustrated
in Figure 4. In all examples we show, italics indicate table and column names, and bold text indicates
annotations added for CryptDB.
Step 1. The developer must define the principal types (using PRINCTYPE) used in her application,
such as users, groups, or messages. A principal is an instance of a principal type, e.g., principal 5 of
type user. There are two classes of principals: external and internal. External principals correspond to
end users who explicitly authenticate themselves to the application using a password. When a user
logs into the application, the application must provide the user password to the proxy so that the user
can get the privileges of her external principal. Privileges of other (internal) principals can be acquired
only through delegation, as described in Step 3. When the user logs out, the application must inform
the proxy, so that the proxy forgets the user’s password as well as any keys derived from the user’s
password.
Step 2. The developer must specify which columns in her SQL schema contain sensitive data,
along with the principals that should have access to that data, using the ENC FOR annotation. CryptDB
requires that for each private data item in a row, the name of the principal that should have access to
that data be stored in another column in the same row. For example, in Figure 4, the decryption of
msgtext x37a21f is available only to principal 5 of type msg.
Step 3. Programmers can specify rules for how to delegate the privileges of one principal to other
principals, using the speaks-for relation [49]. For example, in phpBB, a user should also have the
privileges of the groups she belongs to. Since many applications store such information in tables,
programmers can specify to CryptDB how to infer delegation rules from rows in an existing table.
In particular, programmers can annotate a table T with (a x) SPEAKS FOR (b y). This annotation
indicates that each row present in that table specifies that principal a of type x speaks for principal
PRINCTYPE physical user EXTERNAL;
PRINCTYPE user, msg;
CREATE TABLE privmsgs (
msgid int,
subject varchar(255) ENC FOR (msgid msg),
msgtext text
ENC FOR (msgid msg) );
CREATE TABLE privmsgs to (
msgid int, rcpt id int, sender id int,
(sender id user) SPEAKS FOR (msgid msg),
(rcpt id user) SPEAKS FOR (msgid msg) );
CREATE TABLE users ( userid int, username varchar(255),
(username physical user) SPEAKS FOR (userid user) );
Example table contents (data values omitted): Table privmsgs (msgid, subject, msgtext); Table privmsgs to (msgid, rcpt id, sender id); Table users (userid, username).
Figure 4: Part of phpBB’s schema with annotations to secure private messages. Only the sender and
receiver may see the private message. An attacker that gains complete access to phpBB and the DBMS can
access private messages of only currently active users.
b of type y, meaning that a has access to all keys that b has access to. Here, x and y must always
be fixed principal types. Principal b is always specified by the name of a column in table T . On
the other hand, a can be either the name of another column in the same table, a constant, or T2.col,
meaning all principals from column col of table T2. For example, in Figure 4, principal “Bob” of type
physical user speaks for principal 2 of type user, and in Figure 6, all principals in the contactId column
from table PCMember (of type contact) speak for the paperId principal of type review. Optionally, the
programmer can specify a predicate, whose inputs are values in the same row, to specify a condition
under which delegation should occur, such as excluding conflicts in Figure 6. §5 provides more
examples of using annotations to secure applications.
Key Chaining
Each principal (i.e., each instance of each principal type) is associated with a secret, randomly chosen
key. If principal B speaks for principal A (as a result of some SPEAKS FOR annotation), then principal
A’s key is encrypted using principal B’s key, and stored as a row in the special access keys table in
the database. This allows principal B to gain access to principal A’s key. For example, in Figure 4, to
give users 1 and 2 access to message 5, the key of msg 5 is encrypted with the key of user 1, and also
separately encrypted with the key of user 2.
Each sensitive field is encrypted with the key of the principal in the ENC FOR annotation. CryptDB
encrypts the sensitive field with onions in the same way as for single-principal CryptDB, except that
onion keys are derived from a principal’s key as opposed to a global master key.
The key of each principal is a combination of a symmetric key and a public–private key pair. In the
common case, CryptDB uses the symmetric key of a principal to encrypt any data and other principals’
keys accessible to this principal, with little CPU cost. However, this is not always possible, if some
principal is not currently online. For example, in Figure 4, suppose Bob sends message 5 to Alice, but
Alice (user 1) is not online. This means that CryptDB does not have access to user 1’s key, so it will
not be able to encrypt message 5’s key with user 1’s symmetric key. In this case, CryptDB looks up
the public key of the principal (i.e., user 1) in a second table, public keys, and encrypts message 5’s
key using user 1’s public key. When user 1 logs in, she will be able to use the secret key part of her
key to decrypt the key for message 5 (and re-encrypt it under her symmetric key for future use).
For external principals (i.e., physical users), CryptDB assigns a random key just as for any other
principal. To give an external user access to the corresponding key on login, CryptDB stores the key
of each external principal in a third table, external keys, encrypted with the principal’s password. This
allows CryptDB to obtain a user’s key given the user’s password, and also allows a user to change her
password without changing the key of the principal.
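The key chains above can be sketched as follows. The `wrap`/`unwrap` XOR construction is an insecure toy cipher, and the `access_keys`/`external_keys` "tables" are plain dicts; both are hypothetical stand-ins used only to show how the chains compose from a password down to a data key.

```python
import hashlib, os

# Toy key-chaining sketch: SPEAKS_FOR(B, A) stores A's key encrypted under
# B's key in access_keys; an external user's key sits in external_keys
# encrypted under her password.
def wrap(key: bytes, secret: bytes) -> bytes:
    stream = hashlib.sha256(key).digest()
    return bytes(a ^ b for a, b in zip(stream, secret))

unwrap = wrap                        # XOR with the same stream inverts it

def pw_key(password: str) -> bytes:
    return hashlib.sha256(password.encode()).digest()

external_keys, access_keys = {}, {}

# User 1's principal key, stored encrypted under her password.
user1_key = os.urandom(32)
external_keys["user1"] = wrap(pw_key("hunter2"), user1_key)

# (user1 user) SPEAKS_FOR (msg5 msg): msg 5's key encrypted under user 1's key.
msg5_key = os.urandom(32)
access_keys[("user1", "msg5")] = wrap(user1_key, msg5_key)

# On login, the proxy follows the chain: password -> user key -> message key.
uk = unwrap(pw_key("hunter2"), external_keys["user1"])
mk = unwrap(uk, access_keys[("user1", "msg5")])
print(mk == msg5_key)                # True: the proxy recovered msg 5's key
```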
When a table with a SPEAKS FOR relation is updated, CryptDB must update the access keys table
accordingly. To insert a new row into access keys for a new SPEAKS FOR relation, the proxy must have
access to the key of the principal whose privileges are being delegated. This means that an adversary
that breaks into an application or proxy server cannot create new SPEAKS FOR relations for principals
that are not logged in, because neither the proxy nor the adversary have access to their keys. If a
SPEAKS FOR relation is removed, CryptDB revokes access by removing the corresponding row from
access keys.
When encrypting data in a query or decrypting data from a result, CryptDB follows key chains
starting from passwords of users logged in until it obtains the desired keys. As an optimization, when
a user logs in, CryptDB’s proxy loads the keys of some principals to which the user has access (in
particular, those principal types that do not have too many principal instances—e.g., for groups the
user is in, but not for messages the user received).
Applications inform CryptDB of users logging in or out by issuing INSERT and DELETE SQL
queries to a special table cryptdb active that has two columns, username and password. The proxy
intercepts all queries for cryptdb active, stores the passwords of logged-in users in memory, and never
reveals them to the DBMS server.
CryptDB guards the data of inactive users at the time of an attack. If a compromise occurs,
CryptDB provides a bound on the data leaked, allowing the administrators to not issue a blanket
warning to all the users of the system. In this respect, CryptDB is different from other approaches
to database security (see §9). However, some special users such as administrators with access to a
large pool of data enable a larger compromise upon an attack. To avoid attacks happening when the
administrator is logged in, the administrator should create a separate user account with restricted
permissions when accessing the application as a regular user. Also, as good practice, an application
should automatically log out users who have been inactive for some period of time.
Application Case Studies
In this section, we explain how CryptDB can be used to secure three existing multi-user web applications. For brevity, we show simplified schemas, omitting irrelevant fields and type specifiers.
Overall, we find that once a programmer specifies the principals in the application’s schema, and
the delegation rules for them using SPEAKS FOR, protecting additional sensitive fields just requires
additional ENC FOR annotations.
phpBB is a widely used open source forum with a rich set of access control settings. Users are
organized in groups; both users and groups have a variety of access permissions that the application
administrator can choose. We already showed how to secure private messages between two users
in phpBB in Figure 4. A more detailed case is securing access to posts, as shown in Figure 5. This
example shows how to use predicates (e.g., IF optionid=...) to implement a conditional speaks-for
relation on principals, and also how one column (forumid) can be used to represent multiple principals
(of different type) with different privileges. There are more ways to gain access to a post, but we omit
them here for brevity.
HotCRP is a popular conference review application [27]. A key policy for HotCRP is that
PC members cannot see who reviewed their own (or conflicted) papers. Figure 6 shows CryptDB
annotations for HotCRP’s schema to enforce this policy. Today, HotCRP cannot prevent a curious or
careless PC chair from logging into the database server and seeing who wrote each review for a paper
that she is in conflict with. As a result, conferences often set up a second server to review the chair’s
papers or use inconvenient out-of-band emails. With CryptDB, a PC chair cannot learn who wrote
each review for her paper, even if she breaks into the application or database, since she does not have
PRINCTYPE physical user EXTERNAL;
PRINCTYPE user, group, forum post, forum name;
CREATE TABLE users ( userid int, username varchar(255),
(username physical user) SPEAKS FOR (userid user) );
CREATE TABLE usergroup ( userid int, groupid int,
(userid user) SPEAKS FOR (groupid group) );
CREATE TABLE aclgroups ( groupid int, forumid int, optionid int,
(groupid group) SPEAKS FOR (forumid forum post)
IF optionid=20,
(groupid group) SPEAKS FOR (forumid forum name)
IF optionid=14);
CREATE TABLE posts ( postid int, forumid int,
post text ENC FOR (forumid forum post) );
CREATE TABLE forums ( forumid int,
name varchar(255) ENC FOR (forumid forum name) );
Figure 5: Annotated schema for securing access to posts in phpBB. A user has access to see the content of
posts in a forum if any of the groups that the user is part of has such permissions, indicated by optionid 20 in
the aclgroups table for the corresponding forumid and groupid. Similarly, optionid 14 enables users to see
the forum’s name.
the decryption key.1 The reason is that the SQL predicate “NoConflict” checks if a PC member is
conflicted with a paper and prevents the proxy from providing access to the PC chair in the key chain.
(We assume the PC chair does not modify the application to log the passwords of other PC members
to subvert the system.)
grad-apply is a graduate admissions system used by MIT EECS. We annotated its schema
to allow an applicant’s folder to be accessed only by the respective applicant and any faculty, using (reviewers.reviewer id reviewer), meaning all reviewers, SPEAKS FOR (candidate id
candidate) in table candidates, and ... SPEAKS FOR (letter id letter) in table letters. The
applicant can see all of her folder data except for letters of recommendation. Overall, grad-apply has
simple access control and therefore simple annotations.
Discussion
CryptDB’s design supports most relational queries and aggregates on standard data types, such as
integers and text/varchar types. Additional operations can be added to CryptDB by extending its
existing onions, or adding new onions for specific data types (e.g., spatial and multi-dimensional range
queries [43]). Alternatively, in some cases, it may be possible to map complex unsupported operation
to simpler ones (e.g., extracting the month out of an encrypted date is easier if the date’s day, month,
and year fields are encrypted separately).
There are certain computations CryptDB cannot support on encrypted data. For example, it
does not support both computation and comparison on the same column, such as WHERE salary >
age*2+10. CryptDB can process a part of this query, but it would also require some processing on the
proxy. In CryptDB, such a query should be (1) rewritten into a sub-query that selects a whole column,
SELECT age*2+10 FROM . . ., which CryptDB computes using HOM, and (2) re-encrypted in the
proxy, creating a new column (call it aux) on the DBMS server consisting of the newly encrypted
values. Finally, the original query with the predicate WHERE salary > aux should be run. We have
not been affected by this limitation in our test applications (TPC-C, phpBB, HotCRP, and grad-apply).
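The three-step rewrite can be sketched as follows. The UDF name `HOM_MUL_ADD`, the column names, and the `aux` column are all hypothetical stand-ins for how a proxy might stage this computation.

```python
# Sketch of splitting `WHERE salary > age*2+10` into HOM computation,
# proxy re-encryption, and a server-side OPE comparison.
def split_query(table):
    # 1) Server computes age*2+10 under HOM (Paillier supports adding a
    #    constant and multiplying by a plaintext constant).
    step1 = f"SELECT id, HOM_MUL_ADD(age_hom, 2, 10) FROM {table}"
    # 2) Proxy decrypts the HOM results and re-encrypts them under OPE,
    #    producing a new server-side column `aux`.
    def step2(rows, decrypt_hom, encrypt_ope):
        return [(rid, encrypt_ope(decrypt_hom(ct))) for rid, ct in rows]
    # 3) The original predicate now runs server-side against aux.
    step3 = f"SELECT * FROM {table} WHERE salary_ope > aux"
    return step1, step2, step3

s1, s2, s3 = split_query("emp")
# With identity stand-ins for the crypto, step 2 is a pass-through:
print(s2([(1, 30), (2, 90)], decrypt_hom=lambda x: x, encrypt_ope=lambda x: x))
```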
1 Fully implementing this policy would require setting up two PC chairs: a main chair, and a backup chair
responsible for reviews of the main chair’s papers. HotCRP allows the PC chair to impersonate other PC
members, so CryptDB annotations would be used to prevent the main chair from gaining access to keys of
reviewers assigned to her paper.
PRINCTYPE physical user EXTERNAL;
PRINCTYPE contact, review;
CREATE TABLE ContactInfo ( contactId int, email varchar(120),
(email physical user) SPEAKS FOR (contactId contact) );
CREATE TABLE PCMember ( contactId int );
CREATE TABLE PaperConflict ( paperId int, contactId int );
CREATE TABLE PaperReview (
paperId int,
reviewerId int ENC FOR (paperId review),
commentsToPC text ENC FOR (paperId review),
(PCMember.contactId contact) SPEAKS FOR
(paperId review) IF NoConflict(paperId, contactId) );
NoConflict (paperId, contactId): /* Define a SQL function */
(SELECT COUNT(*) FROM PaperConflict c WHERE c.paperId = paperId AND c.contactId = contactId) = 0;
Figure 6: Annotated schema for securing reviews in HotCRP. Reviews and the identity of reviewers
providing the review will be available only to PC members (table PCMember includes PC chairs) who are
not conflicted, and PC chairs cannot override this restriction.
In multi-principal mode, CryptDB cannot perform server-side computations on values encrypted
for different principals, even if the application has the authority of all principals in question, because
the ciphertexts are encrypted with different keys. For some computations, it may be practical for the
proxy to perform the computation after decrypting the data, but for others (e.g., large-scale aggregates)
this approach may be too expensive. A possible extension to CryptDB to support such queries may be
to maintain multiple ciphertexts for such values, encrypted under different keys.
Implementation
The CryptDB proxy consists of a C++ library and a Lua module. The C++ library consists of a query
parser; a query encryptor/rewriter, which encrypts fields or includes UDFs in the query; and a result
decryption module. To allow applications to transparently use CryptDB, we used MySQL proxy [47]
and implemented a Lua module that passes queries and results to and from our C++ module. We
implemented our new cryptographic protocols using NTL [44]. Our CryptDB implementation consists
of ∼18,000 lines of C++ code and ∼150 lines of Lua code, with another ∼10,000 lines of test code.
CryptDB is portable and we have implemented versions for both Postgres 9.0 and MySQL 5.1.
Our initial Postgres-based implementation is described in an earlier technical report [39]. Porting
CryptDB to MySQL required changing only 86 lines of code, mostly in the code for connecting to the
MySQL server and declaring UDFs. As mentioned earlier, CryptDB does not change the DBMS; we
implement all server-side functionality with UDFs and server-side tables. CryptDB’s design, and to a
large extent our implementation, should work on top of any SQL DBMS that supports UDFs.
Figure 7: Number of databases, tables, and columns on the MySQL server, used for trace
analysis, indicating the total size of the schema (“complete schema”) and the part of the schema seen in queries during the trace period (“used in query”).
Experimental Evaluation
In this section, we evaluate four aspects of CryptDB: the difficulty of modifying an application to
run on top of CryptDB, the types of queries and applications CryptDB is able to support, the level of
security CryptDB provides, and the performance impact of using CryptDB. For this analysis, we use
seven applications as well as a large trace of SQL queries.
We evaluate the effectiveness of our annotations and the needed application changes on the
three applications we described in §5 (phpBB, HotCRP, and grad-apply), as well as on a TPC-C
query mix (a standard workload in the database industry). We then analyze the functionality and
security of CryptDB on three more applications, on TPC-C, and on a large trace of SQL queries.
The additional three applications are OpenEMR, an electronic medical records application storing
private medical data of patients; the web application of an MIT class (6.02), storing students’ grades;
and PHP-calendar, storing people’s schedules. The large trace of SQL queries comes from a popular
MySQL server at MIT, This server is used primarily by web applications running on, a shared web application hosting service operated by MIT’s Student Information
Processing Board (SIPB). In addition, this SQL server is used by a number of applications that run
on other machines and use only to store their data. Our query trace spans about ten
days, and includes approximately 126 million queries. Figure 7 summarizes the schema statistics for; each database is likely to be a separate instance of some application.
Finally, we evaluate the overall performance of CryptDB on the phpBB application and on a query
mix from TPC-C, and perform a detailed analysis through microbenchmarks.
In the six applications (not counting TPC-C), we only encrypt sensitive columns, according
to a manual inspection. Some fields were clearly sensitive (e.g., grades, private message, medical
information), but others were only marginally so (e.g., the time when a message was posted). There
was no clear threshold between sensitive and non-sensitive fields, but it was clear to us which fields were definitely
sensitive. In the case of TPC-C, we encrypt all the columns in the database in single-principal mode
so that we can study the performance and functionality of a fully encrypted DBMS. All fields are
considered for encryption in the large query trace as well.
Application Changes
Figure 8 summarizes the amount of programmer effort required to use CryptDB in three multi-user
web applications and in the single-principal TPC-C queries. The results show that, for multi-principal
mode, CryptDB required between 11 and 13 unique schema annotations (29 to 111 in total), and 2 to 7
lines of code changes to provide user passwords to the proxy, in order to secure sensitive information
stored in the database. Part of the simplicity is because securing an additional column requires just
one annotation in most cases. For the single-principal TPC-C queries, using CryptDB required no
application annotations at all.
Functional Evaluation
To evaluate what columns, operations, and queries CryptDB can support, we analyzed the queries
issued by six web applications (including the three applications we analyzed in §8.1), the TPC-C
queries, and the SQL queries from the trace. The results are shown in the left half of Figure 9.
CryptDB supports most queries; the number of columns in the “needs plaintext” column, which
counts columns that cannot be processed in encrypted form by CryptDB, is small relative to the total
number of columns. For PHP-calendar and OpenEMR, CryptDB does not support queries on certain
sensitive fields that perform string manipulation (e.g., substring and lowercase conversions) or date
manipulation (e.g., obtaining the day, month, or year of an encrypted date). However, if these functions
were precomputed with the result added as standalone columns (e.g., each of the three parts of a date
were encrypted separately), CryptDB would support these queries.
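The workaround described above can be sketched concretely. In this illustrative Python/sqlite3 example (the schema, key, and HMAC-based stand-in for DET are assumptions, not CryptDB's actual implementation), the application stores each date part in its own deterministically encrypted column, so a query such as "events in June" becomes a plain equality check that the server can evaluate over ciphertexts:

```python
import hashlib
import hmac
import sqlite3

# CryptDB cannot compute MONTH() over an encrypted date on the server, but
# if the application stores day, month, and year in separate columns, each
# part can be encrypted on its own and queried by equality directly.
KEY = b"demo-column-key"  # illustrative key

def det(value) -> bytes:
    # Deterministic stand-in for DET: equal plaintexts yield equal
    # ciphertexts, enabling server-side equality without revealing values.
    return hmac.new(KEY, str(value).encode(), hashlib.sha256).digest()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (name TEXT, e_day BLOB, e_month BLOB, e_year BLOB)")
for name, (d, m, y) in [("kickoff", (3, 1, 2011)), ("deadline", (15, 6, 2011))]:
    db.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
               (name, det(d), det(m), det(y)))

# "Events in June" becomes an equality test on the encrypted month column:
rows = db.execute("SELECT name FROM events WHERE e_month = ?", (det(6),)).fetchall()
assert rows == [("deadline",)]
```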
The next two columns, “needs HOM” and “needs SEARCH”, reflect the number of columns for
which that encryption scheme is needed to process some queries. The numbers suggest that these
encryption schemes are important; without them, CryptDB would be unable to support those queries.
Based on an analysis of the larger trace, we found that CryptDB should be able
to support operations over all but 1,094 of the 128,840 columns observed in the trace. The "in-proxy processing" row shows analysis results where we assumed the proxy can perform some lightweight
operations on the results returned from the DBMS server. Specifically, this included any operations
that are not needed to compute the set of resulting rows or to aggregate rows (that is, expressions that
do not appear in a WHERE, HAVING, or GROUP BY clause, or in an ORDER BY clause with a LIMIT, and
are not aggregate operators). With in-proxy processing, CryptDB should be able to process queries
over encrypted data over all but 571 of the 128,840 columns, thus supporting 99.5% of the columns.
Of those 571 columns, 222 use a bitwise operator in a WHERE clause or perform bitwise aggregation,
such as the Gallery2 application, which uses a bitmask of permission fields and consults them in WHERE
clauses. Rewriting the application to store the permissions in a different way would allow CryptDB
to support such operations. Another 205 columns perform string processing in the WHERE clause,
such as comparing whether lowercase versions of two strings match. Storing a keyed hash of the
lowercase version of each string for such columns, similar to the JOIN-ADJ scheme, could support case-insensitive equality checks for ciphertexts. Another 76 columns are involved in mathematical transformations
in the WHERE clause, such as manipulating dates, times, scores, and geometric coordinates. 41 columns
invoke the LIKE operator with a column reference for the pattern; this is typically used to check a
particular value against a table storing a list of banned IP addresses, usernames, URLs, etc. Such a
query can also be rewritten if the data items are sensitive.
Security Evaluation
To understand the amount of information that would be revealed to the adversary in practice, we
examine the steady-state onion levels of different columns for a range of applications and queries. To
quantify the level of security, we define the MinEnc of a column to be the weakest onion encryption
scheme exposed on any of the onions of a column when onions reach a steady state (i.e., after the
application generates all query types, or after running the whole trace). We consider RND and HOM
to be the strongest schemes, followed by SEARCH, followed by DET and JOIN, and finishing with
the weakest scheme which is OPE. For example, if a column has onion Eq at RND, onion Ord at OPE
and onion Add at HOM, the MinEnc of this column is OPE.
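The MinEnc computation defined above can be sketched as a few lines of Python (the numeric strength ranking encodes the ordering given in the text; the function and map names are illustrative):

```python
# MinEnc: the weakest scheme exposed on any of a column's onions at steady
# state.  RND and HOM are strongest, then SEARCH, then DET and JOIN, and
# OPE is weakest, matching the ordering in the text.
STRENGTH = {"RND": 4, "HOM": 4, "SEARCH": 3, "DET": 2, "JOIN": 2, "OPE": 1}

def min_enc(onion_levels):
    """onion_levels maps onion name -> outermost scheme still exposed."""
    return min(onion_levels.values(), key=STRENGTH.__getitem__)

# The example from the text: Eq at RND, Ord at OPE, Add at HOM -> OPE.
assert min_enc({"Eq": "RND", "Ord": "OPE", "Add": "HOM"}) == "OPE"
```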
The right side of Figure 9 shows the MinEnc onion level for a range of applications and query
traces. We see that most fields remain at RND, which is the most secure scheme. For example,
OpenEMR has hundreds of sensitive fields describing the medical conditions and history of patients,
but these fields are mostly just inserted and fetched, and are not used in any computation. A number of
fields also remain at DET, typically to perform key lookups and joins. OPE, which leaks order, is used
the least frequently, and mostly for fields that are marginally sensitive (e.g., timestamps and counts of
messages). Thus, CryptDB’s adjustable security provides a significant improvement in confidentiality
over revealing all encryption schemes to the server.
To analyze CryptDB’s security for specific columns that are particularly sensitive, we define a new
security level, HIGH, which includes the RND and HOM encryption schemes, as well as DET for
columns having no repetitions (in which case DET is logically equivalent to RND). These are highly
secure encryption schemes leaking virtually nothing about the data. DET for columns with repeats
and OPE are not part of HIGH as they reveal relations to the DBMS server. The rightmost column in
Figure 9 shows that most of the particularly sensitive columns (again, according to manual inspection)
are at HIGH.
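The HIGH classification can likewise be sketched directly from its definition (a minimal illustration; the function name and flag are assumptions):

```python
# HIGH: RND and HOM always qualify, and DET qualifies only for columns
# whose values have no repetitions (where DET is logically equivalent to
# RND).  DET with repeats and OPE reveal relations, so they are excluded.
def is_high(min_enc, has_repeats=True):
    return min_enc in ("RND", "HOM") or (min_enc == "DET" and not has_repeats)

assert is_high("RND")
assert is_high("DET", has_repeats=False)
assert not is_high("DET", has_repeats=True)
assert not is_high("OPE", has_repeats=False)
```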
For the trace queries, approximately 6.6% of columns were at OPE even with
in-proxy processing; other encrypted columns (93%) remain at DET or above. Out of the columns
that were at OPE, 3.9% are used in an ORDER BY clause with a LIMIT, 3.7% are used in an inequality
comparison in a WHERE clause, and 0.25% are used in a MIN or MAX aggregate operator (some of
the columns are counted in more than one of these groups). It would be difficult to perform these
computations in the proxy without substantially increasing the amount of data sent to it.
Although we could not examine the schemas of the applications behind this trace to determine
what fields are sensitive—mostly due to its large scale—we measured the same statistics as above for
columns whose names are indicative of sensitive data. In particular, the last three rows of Figure 9
show columns whose name contains the word “pass” (which are almost all some type of password),
“content” (which are typically bulk data managed by an application), and “priv” (which are typically
some type of private message). CryptDB reveals much less information about these columns than an
average column: almost all of them are supported, and almost all are at RND or DET.
Finally, we empirically validated CryptDB’s confidentiality guarantees by trying real attacks on
phpBB that have been listed in the CVE database [32], including two SQL injection attacks (CVE-2009-3052 & CVE-2008-6314), bugs in permission checks (CVE-2010-1627 & CVE-2008-7143), and
a bug in remote PHP file inclusion (CVE-2008-6377). We found that, for users not currently logged in,
the answers returned from the DBMS were encrypted; even with root access to the application server,
proxy, and DBMS, the answers were not decryptable.
Performance Evaluation
To evaluate the performance of CryptDB, we used a machine with two 2.4 GHz Intel Xeon E5620
4-core processors and 12 GB of RAM to run the MySQL 5.1.54 server, and a machine with eight
2.4 GHz AMD Opteron 8431 6-core processors and 64 GB of RAM to run the CryptDB proxy
and the clients. The two machines were connected over a shared Gigabit Ethernet network. The
higher-provisioned client machine ensures that the clients are not the bottleneck in any experiment.
All workloads fit in the server’s RAM.
We compare the performance of a TPC-C query mix when running on an unmodified MySQL server
versus on a CryptDB proxy in front of the MySQL server. We trained CryptDB on the query set
(§3.5.2) so there are no onion adjustments during the TPC-C experiments. Figure 10 shows the
throughput of TPC-C queries as the number of cores on the server varies from one to eight. In all
cases, the server spends 100% of its CPU time processing queries. Both MySQL and CryptDB scale
well initially, but start to level off due to internal lock contention in the MySQL server, as reported
by SHOW STATUS LIKE 'Table%'. The overall throughput with CryptDB is 21–26% lower than
MySQL, depending on the exact number of cores.
To understand the sources of CryptDB’s overhead, we measure the server throughput for different
types of SQL queries seen in TPC-C, on the same server, but running with only one core enabled.
Figure 11 shows the results for MySQL, CryptDB, and a strawman design; the strawman performs
each query over data encrypted with RND by decrypting the relevant data using a UDF, performing
the query over the plaintext, and re-encrypting the result (if updating rows). The results show that
CryptDB’s throughput penalty is greatest for queries that involve a SUM (2.0× less throughput) and
for incrementing UPDATE statements (1.6× less throughput); these are the queries that involve HOM
additions at the server. For the other types of queries, which form a larger part of the TPC-C mix, the
throughput overhead is modest. The strawman design performs poorly for almost all queries because
the DBMS’s indexes on the RND-encrypted data are useless for operations on the underlying plaintext
data. It is pleasantly surprising that the higher security of CryptDB over the strawman also brings
better performance.
To understand the latency introduced by CryptDB’s proxy, we measure the server and proxy
processing times for the same types of SQL queries as above. Figure 12 shows the results. We can
see that there is an overall server latency increase of 20% with CryptDB, which we consider modest.
The proxy adds an average of 0.60 ms to a query; of that time, 24% is spent in MySQL proxy, 23% is
spent in encryption and decryption, and the remaining 53% is spent parsing and processing queries.
The cryptographic overhead is relatively small because most of our encryption schemes are efficient;
Figure 13 shows their performance. OPE and HOM are the slowest, but the ciphertext pre-computing
and caching optimization (§3.5) masks the high latency of queries requiring OPE and HOM. Proxy*
in Figure 12 shows the latency without these optimizations, which is significantly higher for the
corresponding query types. SELECT queries that involve a SUM use HOM but do not benefit from this
optimization, because the proxy performs decryption, rather than encryption.
In all TPC-C experiments, the proxy used less than 20 MB of memory. Caching ciphertexts for
the 30,000 most common values for OPE accounts for about 3 MB, and pre-computing ciphertexts
and randomness for 30,000 values at HOM required 10 MB.
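The HOM pre-computation exploits the structure of Paillier encryption: the expensive factor of a ciphertext does not depend on the plaintext, so it can be cached ahead of queries (the §3.5 optimization). The following toy Paillier implementation is a sketch under demo-sized parameters, not CryptDB's 1024-bit deployment; the function names and cache size are illustrative:

```python
import math
import random

# Toy Paillier cryptosystem illustrating HOM (the additive scheme).  The
# primes are tiny demo values; CryptDB uses 1024-bit parameters, which is
# why HOM expands a 32-bit integer into a 2048-bit ciphertext.
p, q = 1_000_003, 1_000_033
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)

def fresh_rn():
    # The expensive part of encryption, r^n mod n^2, is independent of the
    # plaintext, so it can be pre-computed and cached ahead of queries.
    while True:
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:
            return pow(r, n, n2)

rand_cache = [fresh_rn() for _ in range(8)]

def enc(m):
    rn = rand_cache.pop() if rand_cache else fresh_rn()
    return (1 + m * n) * rn % n2     # g = n+1, so g^m = 1 + m*n (mod n^2)

def dec(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# Multiplying ciphertexts adds the underlying plaintexts:
assert dec(enc(5) * enc(7) % n2) == 12
```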
Multi-User Web Applications
To evaluate the impact of CryptDB on application performance, we measure the throughput of phpBB
for a workload with 10 parallel clients, which ensured 100% CPU load at the server. Each client
continuously issued HTTP requests to browse the forum, write and read posts, as well as write and
read private messages. We pre-loaded forums and user mailboxes with messages. In this experiment,
we co-located the MySQL DBMS, the CryptDB proxy, and the web application server on a single-core
machine, to ensure we do not add additional resources for a separate proxy server machine to the
system in the CryptDB configuration. In practice, an administrator would likely run the CryptDB
proxy on another machine for security.
Figure 14 shows the throughput of phpBB in three different configurations: (1) connecting to
a stock MySQL server, (2) connecting to a stock MySQL server through MySQL proxy, and (3)
connecting to CryptDB, with notably sensitive fields encrypted as summarized in Figure 9, which in
turn uses a stock MySQL server to store encrypted data. The results show that phpBB incurs an overall
throughput loss of just 14.5%, and that about half of this loss comes from inefficiencies in MySQL
proxy unrelated to CryptDB. Figure 15 further shows the end-to-end latency for five types of phpBB
requests. The results show that CryptDB adds 7–18 ms (6–20%) of processing time per request.
CryptDB increases the amount of data stored in the DBMS, because it stores multiple onions
for the same field, and because ciphertexts are larger than plaintexts for some encryption schemes.
For TPC-C, CryptDB increased the database size by 3.76×, mostly due to cryptographic expansion
of integer fields encrypted with HOM (which expand from 32 bits to 2048 bits); strings and binary
data remains roughly the same size. For phpBB, the database size using an unencrypted system was
2.6 MB for a workload of about 1,000 private messages and 1,000 forum posts generated by 10 users.
The same workload on CryptDB had a database of 3.3 MB, about 1.2× larger. Of the 0.7 MB increase,
230 KB is for storage of access keys, 276 KB is for public keys and external keys, and 166 KB is due
to expansion of encrypted fields.
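The expansion figures above can be recomputed back-of-the-envelope (a sanity check on the quoted numbers, not new measurements):

```python
# Storage-expansion arithmetic from the measurements quoted above.
hom_expansion = 2048 / 32      # HOM: 32-bit integer -> 2048-bit ciphertext
phpbb_growth = 3.3 / 2.6       # phpBB database: 2.6 MB -> 3.3 MB overall
increase_kb = 230 + 276 + 166  # access keys + public/external keys + fields

assert hom_expansion == 64.0
assert 1.2 < phpbb_growth < 1.3
assert increase_kb == 672      # i.e., the ~0.7 MB total increase
```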
Adjustable Encryption
Adjustable query-based encryption involves decrypting columns to lower-security onion levels. Fortunately, decryption for the more-secure onion layers, such as RND, is fast, and needs to be performed
only once per column for the lifetime of the system (unless the administrator periodically re-encrypts data or columns). Removing a layer of RND requires AES
decryption, which our experimental machine can perform at ∼200 MB/s per core. Thus, removing an
onion layer is bottlenecked by the speed at which the DBMS server can copy a column from disk for
disk-bound databases.
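The in-place removal of an onion layer can be sketched as a server-side UDF invoked through an UPDATE, as the proxy does. In this illustration a keyed SHA-256 counter stream stands in for the block cipher CryptDB actually uses, and the key, nonce, and table are assumptions:

```python
import hashlib
import sqlite3

def stream(key, nonce, length):
    # Keyed counter-mode stream (stand-in for AES in this sketch).
    out, ctr = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:length]

def rnd_layer(key, nonce, data):
    # XOR with the keyed stream; applying it twice removes the layer.
    return bytes(a ^ b for a, b in zip(data, stream(key, nonce, len(data))))

KEY, NONCE = b"onion-key", b"row-1"
det_ct = b"inner DET ciphertext"   # the layer beneath RND (opaque here)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (c BLOB)")
db.execute("INSERT INTO t VALUES (?)", (rnd_layer(KEY, NONCE, det_ct),))

# The proxy registers the decryption UDF, then strips RND once per column:
db.create_function("DECRYPT_RND", 2, lambda nonce, c: rnd_layer(KEY, nonce, c))
db.execute("UPDATE t SET c = DECRYPT_RND(?, c)", (NONCE,))
assert db.execute("SELECT c FROM t").fetchone()[0] == det_ct
```

After the UPDATE, the column permanently exposes the inner (e.g., DET) layer, so the decryption cost is paid once rather than per query.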
Related Work
Search and queries over encrypted data. Song et al. [46] describe cryptographic tools for performing keyword search over encrypted data, which we use to implement SEARCH. Amanatidis et al. [2]
propose methods for exact searches that do not require scanning the entire database and could be
used to process certain restricted SQL queries. Bao et al. [3] extend these encrypted search methods
to the multi-user case. Yang et al. [51] run selections with equality predicates over encrypted data.
Evdokimov and Guenther present methods for the same selections, as well as Cartesian products and
projections [15]. Agrawal et al. develop a statistical encoding that preserves the order of numerical
data in a column [1], but it does not have sound cryptographic properties, unlike the scheme we use [4].
Boneh and Waters show public-key schemes for comparisons, subset checks, and conjunctions of such
queries over encrypted data [5], but these schemes have ciphertext lengths that are exponential in the
length of the plaintext, limiting their practical applicability.
When applied to processing SQL on encrypted data, these techniques suffer from some of the
following limitations: certain basic queries are not supported or are too inefficient (especially joins
and order checks), they require significant client-side query processing, users either have to build and
maintain indexes on the data at the server or to perform sequential scans for every selection/search,
and implementing these techniques requires unattractive changes to the innards of the DBMS.
Some researchers have developed prototype systems for subsets of SQL, but they provide no
confidentiality guarantees, require a significant DBMS rewrite, and rely on client-side processing [9,
12, 22]. For example, Hacigumus et al. [22] heuristically split the domain of possible values for
each column into partitions, storing the partition number unencrypted for each data item, and rely
on extensive client-side filtering of query results. Chow et al. [8] require trusted entities and two
non-colluding untrusted DBMSes.
Untrusted servers. SUNDR [28] uses cryptography to provide privacy and integrity in a file
system on top of an untrusted file server. Using a SUNDR-like model, SPORC [16] and Depot [30]
show how to build low-latency applications, running mostly on the clients, without having to trust a
server. However, existing server-side applications that involve separate database and application servers
cannot be used with these systems unless they are rewritten as distributed client-side applications to
work with SPORC or Depot. Many applications are not amenable to such a structure.
Companies like Navajo Systems and Ciphercloud provide a trusted application-level proxy that
intercepts network traffic between clients and cloud-hosted servers (e.g., IMAP), and encrypts sensitive
data stored on the server. These products appear to break up sensitive data (specified by application-specific rules) into tokens (such as words in a string), and encrypt each of these tokens using an
order-preserving encryption scheme, which allows token-level searching and sorting. In contrast,
CryptDB supports a richer set of operations (most of SQL), reveals only relations for the necessary
classes of computation to the server based on the queries issued by the application, and allows chaining
of encryption keys to user passwords, to restrict data leaks from a compromised proxy.
Disk encryption. Various commercial database products, such as Oracle’s Transparent Data
Encryption [34], encrypt data on disk, but decrypt it to perform query processing. As a result, the
server must have access to decryption keys, and an adversary compromising the DBMS software can
gain access to the entire data.
Software security. Many tools help programmers either find or mitigate mistakes in their code
that may lead to vulnerabilities, including static analysis tools like PQL [29, 31] and UrFlow [7],
and runtime tools like Resin [52] and CLAMP [36]. In contrast, CryptDB provides confidentiality
guarantees for user data even if the adversary gains complete control over the application and database
servers. These tools provide no guarantees in the face of this threat, but in contrast, CryptDB cannot
provide confidentiality in the face of vulnerabilities that trick the user’s client machine into issuing
unwanted requests (such as cross-site scripting or cross-site request forgery vulnerabilities in web
applications). As a result, using CryptDB together with these tools should improve overall application security.
Rizvi et al. [41] and Chlipala [7] specify and enforce an application’s security policy over SQL
views. CryptDB’s SQL annotations can capture most of these policies, except for result processing
being done in the policy’s view, such as allowing a user to view only aggregates of certain data.
Unlike prior systems, CryptDB enforces SQL-level policies cryptographically, without relying on
compile-time or run-time permission checks.
Privacy-preserving aggregates. Privacy-preserving data integration, mining, and aggregation
schemes are useful [26, 50], but are not usable by many applications because they support only
specialized query types and require a rewrite of the DBMS. Differential privacy [14] is complementary
to CryptDB; it allows a trusted server to decide what answers to release and how to obfuscate answers
to aggregation queries to avoid leaking information about any specific record in the database.
Query integrity. Techniques for SQL query integrity can be integrated into CryptDB because
CryptDB allows relational queries on encrypted data to be processed just like on plaintext. These
methods can provide integrity by adding a MAC to each tuple [28, 42], freshness using hash chains [38,
42], and both freshness and completeness of query results [33]. In addition, the client can verify the
results of aggregation queries [48], and provide query assurance for most read queries [45].
Outsourced databases. Curino et al. advocate the idea of a relational cloud [11], a context in
which CryptDB fits well.
Conclusion
We presented CryptDB, a system that provides a practical and strong level of confidentiality in
the face of two significant threats confronting database-backed applications: curious DBAs and
arbitrary compromises of the application server and the DBMS. CryptDB meets its goals using three
ideas: running queries efficiently over encrypted data using a novel SQL-aware encryption strategy,
dynamically adjusting the encryption level using onions of encryption to minimize the information
revealed to the untrusted DBMS server, and chaining encryption keys to user passwords in a way that
allows only authorized users to gain access to encrypted data.
Our evaluation on a large trace of 126 million SQL queries from a production MySQL server
shows that CryptDB can support operations over encrypted data for 99.5% of the 128,840 columns
seen in the trace. The throughput penalty of CryptDB is modest, resulting in a reduction of 14.5–26%
on two applications as compared to unmodified MySQL. Our security analysis shows that CryptDB
protects most sensitive fields with highly secure encryption schemes for six applications. The developer
effort consists of 11–13 unique schema annotations and 2–7 lines of source code changes to express
relevant privacy policies for 22–103 sensitive fields in three multi-user web applications.
The source code for our implementation is available for download at
Acknowledgments
We thank Martin Abadi, Brad Chen, Carlo Curino, Craig Harris, Evan Jones, Frans Kaashoek, Sam
Madden, Mike Stonebraker, Mike Walfish, the anonymous reviewers, and our shepherd, Adrian
Perrig, for their feedback. Eugene Wu and Alvin Cheung also provided useful advice. We also thank
Geoffrey Thomas, Quentin Smith, Mitch Berger, and the rest of the maintainers
for providing us with SQL query traces. This work was supported by the NSF (CNS-0716273 and
IIS-1065219) and by Google.
References
[1] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Order preserving encryption for numeric data.
In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data,
Paris, France, June 2004.
[2] G. Amanatidis, A. Boldyreva, and A. O’Neill. Provably-secure schemes for basic query support
in outsourced databases. In Proceedings of the 21st Annual IFIP WG 11.3 Working Conference
on Database and Applications Security, Redondo Beach, CA, July 2007.
[3] F. Bao, R. H. Deng, X. Ding, and Y. Yang. Private query on encrypted data in multi-user
settings. In Proceedings of the 4th International Conference on Information Security Practice
and Experience, Sydney, Australia, April 2008.
[4] A. Boldyreva, N. Chenette, Y. Lee, and A. O’Neill. Order-preserving symmetric encryption. In
Proceedings of the 28th Annual International Conference on the Theory and Applications of
Cryptographic Techniques (EUROCRYPT), Cologne, Germany, April 2009.
[5] D. Boneh and B. Waters. Conjunctive, subset, and range queries on encrypted data. In Proceedings of the 4th Conference on Theory of Cryptography, 2007.
[6] A. Chen. GCreep: Google engineer stalked teens, spied on chats. Gawker, September 2010.
[7] A. Chlipala. Static checking of dynamically-varying security policies in database-backed applications. In Proceedings of the 9th Symposium on Operating Systems Design and Implementation,
Vancouver, Canada, October 2010.
[8] S. S. M. Chow, J.-H. Lee, and L. Subramanian. Two-party computation model for privacy-preserving queries over distributed databases. In Proceedings of the 16th Network and Distributed
System Security Symposium, February 2009.
[9] V. Ciriani, S. D. C. di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, and P. Samarati. Keep a
few: Outsourcing data while maintaining confidentiality. In Proceedings of the 14th European
Symposium on Research in Computer Security, September 2009.
[10] M. Cooney. IBM touts encryption innovation; new technology performs calculations on encrypted
data without decrypting it. Computer World, June 2009.
[11] C. Curino, E. P. C. Jones, R. A. Popa, N. Malviya, E. Wu, S. Madden, H. Balakrishnan, and
N. Zeldovich. Relational cloud: A database-as-a-service for the cloud. In Proceedings of the 5th
Biennial Conference on Innovative Data Systems Research, pages 235–241, Pacific Grove, CA,
January 2011.
[12] E. Damiani, S. D. C. di Vimercati, S. Jajodia, S. Paraboschi, and P. Samarati. Balancing
confidentiality and efficiency in untrusted relational DBMSs. In Proceedings of the 10th ACM
Conference on Computer and Communications Security, Washington, DC, October 2003.
[13] A. Desai. New paradigms for constructing symmetric encryption schemes secure against chosen-ciphertext attack. In Proceedings of the 20th Annual International Conference on Advances in
Cryptology, pages 394–412, August 2000.
[14] C. Dwork. Differential privacy: a survey of results. In Proceedings of the 5th International
Conference on Theory and Applications of Models of Computation, Xi’an, China, April 2008.
[15] S. Evdokimov and O. Guenther. Encryption techniques for secure database outsourcing. Cryptology ePrint Archive, Report 2007/335.
[16] A. J. Feldman, W. P. Zeller, M. J. Freedman, and E. W. Felten. SPORC: Group collaboration
using untrusted cloud resources. In Proceedings of the 9th Symposium on Operating Systems
Design and Implementation, Vancouver, Canada, October 2010.
[17] T. Ge and S. Zdonik. Answering aggregation queries in a secure system model. In Proceedings
of the 33rd International Conference on Very Large Data Bases, Vienna, Austria, September 2007.
[18] R. Gennaro, C. Gentry, and B. Parno. Non-interactive verifiable computing: Outsourcing
computation to untrusted workers. In Advances in Cryptology (CRYPTO), Santa Barbara, CA,
August 2010.
[19] C. Gentry. Fully homomorphic encryption using ideal lattices. In Proceedings of the 41st Annual
ACM Symposium on Theory of Computing, Bethesda, MD, May–June 2009.
[20] O. Goldreich. Foundations of Cryptography: Volume I Basic Tools. Cambridge University Press, 2001.
[21] A. Greenberg. DARPA will spend 20 million to search for crypto's holy grail. Forbes, April 2011.
[22] H. Hacigumus, B. Iyer, C. Li, and S. Mehrotra. Executing SQL over encrypted data in the
database-service-provider model. In Proceedings of the 2002 ACM SIGMOD International
Conference on Management of Data, Madison, WI, June 2002.
[23] J. A. Halderman, S. D. Schoen, N. Heninger, W. Clarkson, W. Paul, J. A. Calandrino, A. J.
Feldman, J. Appelbaum, and E. W. Felten. Lest we remember: Cold boot attacks on encryption
keys. In Proceedings of the 17th Usenix Security Symposium, San Jose, CA, July–August 2008.
[24] S. Halevi and P. Rogaway. A tweakable enciphering mode. In Advances in Cryptology (CRYPTO), 2003.
[25] V. Kachitvichyanukul and B. W. Schmeiser. Algorithm 668: H2PEC: Sampling from the
hypergeometric distribution. ACM Transactions on Mathematical Software, 14(4):397–398, 1988.
[26] M. Kantarcioglu and C. Clifton. Security issues in querying encrypted data. In Proceedings
of the 19th Annual IFIP WG 11.3 Working Conference on Database and Applications Security,
Storrs, CT, August 2005.
[27] E. Kohler. Hot crap! In Proceedings of the Workshop on Organizing Workshops, Conferences,
and Symposia for Computer Systems, San Francisco, CA, April 2008.
[28] J. Li, M. Krohn, D. Mazières, and D. Shasha. Secure untrusted data repository (SUNDR). In
Proceedings of the 6th Symposium on Operating Systems Design and Implementation, pages
91–106, San Francisco, CA, December 2004.
[29] V. B. Livshits and M. S. Lam. Finding security vulnerabilities in Java applications with static
analysis. In Proceedings of the 14th Usenix Security Symposium, pages 271–286, Baltimore,
MD, August 2005.
[30] P. Mahajan, S. Setty, S. Lee, A. Clement, L. Alvisi, M. Dahlin, and M. Walfish. Depot: Cloud
storage with minimal trust. In Proceedings of the 9th Symposium on Operating Systems Design
and Implementation, Vancouver, Canada, October 2010.
[31] M. Martin, B. Livshits, and M. Lam. Finding application errors and security flaws using
PQL: a program query language. In Proceedings of the 2005 Conference on Object-Oriented
Programming, Systems, Languages and Applications, pages 365–383, San Diego, CA, October 2005.
[32] National Vulnerability Database. CVE statistics, February 2011.
[33] V. H. Nguyen, T. K. Dang, N. T. Son, and J. Kung. Query assurance verification for dynamic
outsourced XML databases. In Proceedings of the 2nd Conference on Availability, Reliability
and Security, Vienna, Austria, April 2007.
[34] Oracle Corporation. Oracle advanced security.
[35] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the 18th Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), Prague, Czech Republic, May 1999.
[36] B. Parno, J. M. McCune, D. Wendlandt, D. G. Andersen, and A. Perrig. CLAMP: Practical
prevention of large-scale data leaks. In Proceedings of the 30th IEEE Symposium on Security
and Privacy, Oakland, CA, May 2009.
[37] R. A. Popa, C. M. S. Redfield, N. Zeldovich, and H. Balakrishnan. CryptDB web site. http:
[38] R. A. Popa, J. R. Lorch, D. Molnar, H. J. Wang, and L. Zhuang. Enabling security in cloud
storage SLAs with CloudProof. In Proceedings of the 2011 USENIX Annual Technical Conference,
Portland, OR, 2011.
[39] R. A. Popa, N. Zeldovich, and H. Balakrishnan. CryptDB: A practical encrypted relational
DBMS. Technical Report MIT-CSAIL-TR-2011-005, MIT Computer Science and Artificial
Intelligence Laboratory, Cambridge, MA, January 2011.
[40] Privacy Rights Clearinghouse. Chronology of data breaches. http://www.privacyrights.
[41] S. Rizvi, A. Mendelzon, S. Sudarshan, and P. Roy. Extending query rewriting techniques for
fine-grained access control. In Proceedings of the 2004 ACM SIGMOD International Conference
on Management of Data, Paris, France, June 2004.
[42] E.-J. Goh, H. Shacham, N. Modadugu, and D. Boneh. SiRiUS: Securing remote untrusted storage. In
Proceedings of the 10th Network and Distributed System Security Symposium, 2003.
[43] E. Shi, J. Bethencourt, H. Chan, D. Song, and A. Perrig. Multi-dimensional range query over
encrypted data. In Proceedings of the 28th IEEE Symposium on Security and Privacy, Oakland,
CA, May 2007.
[44] V. Shoup. NTL: A library for doing number theory, August 2009.
[45] R. Sion. Query execution assurance for outsourced databases. In Proceedings of the 31st
International Conference on Very Large Data Bases, pages 601–612, Trondheim, Norway,
August–September 2005.
[46] D. X. Song, D. Wagner, and A. Perrig. Practical techniques for searches on encrypted data. In
Proceedings of the 21st IEEE Symposium on Security and Privacy, Oakland, CA, May 2000.
[47] M. Taylor. MySQL proxy.
[48] B. Thompson, S. Haber, W. G. Horne, T. Sander, and D. Yao. Privacy-preserving computation and
verification of aggregate queries on outsourced databases. Technical Report HPL-2009-119, HP
Labs, 2009.
[49] E. P. Wobber, M. Abadi, M. Burrows, and B. Lampson. Authentication in the Taos operating
system. ACM Transactions on Computer Systems, 12(1):3–32, 1994.
[50] L. Xiong, S. Chitti, and L. Liu. Preserving data privacy for outsourcing data aggregation services.
Technical Report TR-2007-013, Emory University, Department of Mathematics and Computer
Science, 2007.
[51] Z. Yang, S. Zhong, and R. N. Wright. Privacy-preserving queries on encrypted data. In European
Symposium on Research in Computer Security, 2006.
[52] A. Yip, X. Wang, N. Zeldovich, and M. F. Kaashoek. Improving application security with data
flow assertions. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles,
pages 291–304, Big Sky, MT, October 2009.
[Figure 8 data omitted: per-application annotation counts (31 (11 unique), 29 (12 unique), 111 (13 unique)), login/logout code (7, 2, and 2 lines), and sensitive fields secured (23: private messages (content, subject), posts, forums; 22: paper content and paper information, reviews; 103: student grades (61), scores (17), recommendations, reviews; 92: all the fields in all the tables encrypted); application names not recoverable.]
Figure 8: Number of annotations the programmer needs to add to secure sensitive fields, lines of code to be added to provide CryptDB with the passwords of users, and the number of sensitive fields that CryptDB secures with these annotations, for three different applications. We count as one annotation each invocation of our three types of annotations and any SQL predicate used in a SPEAKS FOR annotation. Since multiple fields in the same table are usually encrypted for the same principal (e.g., message subject and content), we also report unique annotations.
[Figure 9 data omitted: per-application and per-trace column counts (including TPC-C (single princ.) and MIT 6.02) not recoverable; the rightmost column reported 18/18, 94/94, and 525/540 most sensitive columns at HIGH.]
Figure 9: Steady-state onion levels for database columns required by a range of applications and traces. "Needs plaintext" indicates that CryptDB cannot execute the application's queries over encrypted data for that column. For the applications in the top group of rows, sensitive columns were determined manually, and only these columns were considered for encryption. For the bottom group of rows, all database columns were automatically considered for encryption. The rightmost column considers the application's most sensitive database columns, and reports the number of them that have MinEnc in HIGH (both terms are defined in §8.3).
Figure 10: Throughput (queries/sec) for TPC-C queries, for a varying number of cores on the underlying MySQL DBMS.
Figure 11: Throughput (queries/sec) of different types of SQL queries from the TPC-C query mix running under MySQL, CryptDB, and the strawman design. "Upd. inc" stands for UPDATE that increments a column, and "Upd. set" stands for UPDATE which sets columns to a constant.
[Figure 12 data omitted: per-query latencies for Select by = (DET), Select join (JOIN), Select range (OPE), Select sum (HOM), Update set, and Update inc (HOM); the numeric columns are not recoverable.]
Figure 12: Server and proxy latency for different types of SQL queries from TPC-C. For each query type, we show the predominant encryption scheme used at the server. Due to details of the TPC-C workload, each query type affects a different number of rows, and involves a different number of cryptographic operations. The left two columns correspond to server throughput, which is also shown in Figure 11. "Proxy" shows the latency added by CryptDB's proxy; "Proxy?" shows the proxy latency without the ciphertext pre-computing and caching optimization (§3.5). Bold numbers show where pre-computing and caching ciphertexts helps. The "Overall" row is the average latency over the mix of TPC-C queries. "Update set" is an UPDATE where the fields are set to a constant, and "Update inc" is an UPDATE where some fields are incremented.
Scheme              Encrypt     Decrypt     Special operation
Blowfish (1 int.)   0.0001 ms   0.0001 ms   n/a
OPE (1 int.)        0.008 ms    0.007 ms    Compare: 0
SEARCH (1 word)     0.016 ms    0.015 ms    Match: 0.001 ms
HOM (1 int.)        0.01 ms     0.004 ms    Add: 0.005 ms
JOIN-ADJ (1 int.)   0.52 ms     n/a         Adjust: 0.56 ms
Figure 13: Microbenchmarks of cryptographic schemes, per unit of data encrypted (one 32-bit integer, 1 KB, or one 15-byte word of text), measured by taking the average time over many iterations.
Figure 14: Throughput comparison for phpBB. “MySQL” denotes phpBB running directly on MySQL.
“MySQL+proxy” denotes phpBB running on an unencrypted MySQL database but going through MySQL
proxy. “CryptDB” denotes phpBB running on CryptDB with notably sensitive fields annotated and the
database appropriately encrypted. Most HTTP requests involved tens of SQL queries each. Percentages
indicate throughput reduction relative to MySQL.
Request       MySQL     CryptDB
(unlabeled)   60 ms     67 ms
R post        50 ms     60 ms
W post        133 ms    151 ms
R msg         61 ms     73 ms
W msg         237 ms    251 ms
Figure 15: Latency for HTTP requests that heavily use encrypted fields in phpBB for MySQL and CryptDB. R and W stand for read and write.
Intrusion Recovery for Database-backed Web Applications
Ramesh Chandra, Taesoo Kim, Meelap Shah,
Neha Narula, and Nickolai Zeldovich
WARP is a system that helps users and administrators of web applications recover from intrusions
such as SQL injection, cross-site scripting, and clickjacking attacks, while preserving legitimate user
changes. WARP repairs from an intrusion by rolling back parts of the database to a version before
the attack, and replaying subsequent legitimate actions. WARP allows administrators to retroactively
patch security vulnerabilities—i.e., apply new security patches to past executions—to recover from
intrusions without requiring the administrator to track down or even detect attacks. WARP’s time-travel
database allows fine-grained rollback of database rows, and enables repair to proceed concurrently
with normal operation of a web application. Finally, WARP captures and replays user input at the
level of a browser’s DOM, to recover from attacks that involve a user’s browser. For a web server
running MediaWiki, WARP requires no application source code changes to recover from a range of
common web application vulnerabilities with minimal user input at a cost of 24–27% in throughput
and 2–3.2 GB/day in storage.
Categories and Subject Descriptors:
H.3.5 [Information Storage and Retrieval]: Online Information Services—Web-based services.
General Terms:
Many web applications have security vulnerabilities that have yet to be discovered. For example,
over the past 4 years, an average of 3–4 previously unknown cross-site scripting and SQL injection
vulnerabilities were discovered every single day [27]. Even if a web application’s code contains no
vulnerabilities, administrators may misconfigure security policies, making the application vulnerable
to attack, or users may inadvertently grant their privileges to malicious code [8]. As a result, even
well-maintained applications can and do get compromised [4, 31, 33]. Furthermore, after gaining
unauthorized access, an attacker could use web application functionality such as Google Apps
Script [7, 9] to install persistent malicious code, and trigger it at a later time, even after the underlying
vulnerability has been fixed.
Despite this prevalence of vulnerabilities that allow adversaries to compromise web applications,
recovering from a newly discovered vulnerability is a difficult and manual process. Users or administrators must manually inspect the application for signs of an attack that exploited the vulnerability,
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. Copyrights for components of this work owned by others
than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on
servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP ’11, October 23–26, 2011, Cascais, Portugal.
Copyright 2011 ACM 978-1-4503-0977-6/11/10 . . . $10.00.
and if an attack is found, they must track down the attacker’s actions and repair the damage by hand.
Worse yet, this time-consuming process provides no guarantees that every intrusion was found, or that
all changes by the attacker were reverted. As web applications take on more functionality of traditional
desktop applications, intrusion recovery for web applications will become increasingly important.
This paper presents WARP,¹ a system that automates repair from intrusions in web applications.
When an administrator learns of a security vulnerability in a web application, he or she can use
WARP to check whether that vulnerability was recently exploited, and to recover from any resulting
intrusions. Users and administrators can also use WARP to repair from configuration mistakes, such as
accidentally giving permissions to the wrong user. WARP works by continuously recording database
updates, and logging information about all actions, such as HTTP requests and database queries, along
with their input and output dependencies. WARP constructs a global dependency graph from this
logged information, and uses it to retroactively patch vulnerabilities by rolling back parts of the system
to an earlier checkpoint, fixing the vulnerability (e.g., patching a PHP file, or reverting unintended
permission changes), and re-executing any past actions that may have been affected by the fix. This
both detects any intrusions that exploited the vulnerability and reverts their effects.
To illustrate the extent of challenges facing WARP in recovering from intrusions in a web application, consider the following worst-case attack on a company’s Wiki site that is used by both employees
and customers, where each user has privileges to edit only certain pages or documents. An attacker
logs into the Wiki site and exploits a cross-site scripting (XSS) vulnerability in the Wiki software
to inject malicious JavaScript code into one of the publicly accessible Wiki pages. When Alice, a
legitimate user, views that page, her browser starts running the attacker’s code, which in turn issues
HTTP requests to add the attacker to the access control list for every page that Alice can access, and
to propagate the attack code to some of those pages. The adversary now uses his new privileges to
further modify pages. In the meantime, legitimate users (including Alice) continue to access and edit
Wiki pages, including pages modified or infected by the attack.
Although the Retro system previously explored intrusion recovery for command-line workloads
on a single machine [14], WARP is the first system to repair from such attacks in web applications.
Recovering from intrusions such as the example above requires WARP to address three challenges not
answered by Retro, as follows.
First, recovering from an intrusion (e.g., in Retro) typically requires an expert administrator to
detect the compromise and to track down the source of the attack, by analyzing database entries
and web server logs. Worse yet, this process must be repeated every time a new security problem is
discovered, to determine if any attackers might have exploited the vulnerability.
Second, web applications typically handle data on behalf of many users, only a few of which may
have been affected by an attack. For a popular web application with many users, reverting all users’
changes since the attack or taking the application offline for repair is not an option.
Third, attacks can affect users’ browsers, making it difficult to track down the extent of the
intrusion purely on the server. In our example attack, when Alice (or any other user) visits an infected
Wiki page, the web server cannot tell if a subsequent page edit request from Alice’s browser was
caused by Alice or by the malicious JavaScript code. Yet an ideal system should revert all effects of
the malicious code while preserving any edits that Alice made from the same page in her browser.
To address these challenges, WARP builds on the rollback-and-reexecute approach to repair taken
by Retro, but solves a new problem—repair for distributed, web-based applications—using three novel
ideas. First, WARP allows administrators to retroactively apply security patches without having to
manually track down the source of each attack, or even having to decide whether someone already
exploited the newfound vulnerability. Retroactive patching works by re-executing past actions using
patched application code. If an action re-executes the same way as it did originally, it did not trigger
the vulnerability, and requires no further re-execution. Actions that re-execute differently on patched
application code could have been intrusions that exploited the original bug, and WARP repairs from
this potential attack by recursively re-executing any other actions that were causally affected.
Second, WARP uses a time-travel database to determine dependencies between queries, such as
finding the set of legitimate database queries whose results were influenced by the queries that an
adversary issued. WARP uses these dependencies to roll back just the affected parts of the database
¹WARP stands for Web Application RePair.
during repair. Precise dependencies are crucial to minimize the amount of rollback and re-execution
during repair; otherwise, recovering from a week-old attack that affected just one user would still
require re-executing a week’s worth of work. Precise dependency analysis and rollback is difficult
because database queries operate on entire tables that contain information about all users, instead of
individual data objects related to a single user. WARP addresses this problem by partitioning tables on
popular lookup keys and using partitions to determine dependencies at a finer granularity than entire
database tables. By tracking multiple versions of a row, WARP can also perform repair concurrently
with the normal operation of the web application.
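A toy version of such a multi-versioned, partitioned store might look like the following (the class and schema are invented for illustration; WARP's actual time-travel database lives inside the DBMS):

```python
import bisect
from collections import defaultdict

class TimeTravelTable:
    """Multi-versioned table partitioned on a lookup key.  Writes append
    versions in time order; reads resolve as-of a timestamp; rollback
    discards versions of a single partition, leaving the partitions of
    unaffected users untouched."""
    def __init__(self):
        self.times = defaultdict(list)   # key -> sorted write times
        self.rows = defaultdict(list)    # key -> row versions, same order
    def write(self, t, key, row):
        self.times[key].append(t)
        self.rows[key].append(row)
    def read_asof(self, t, key):
        # Latest version of this partition written at or before time t.
        i = bisect.bisect_right(self.times[key], t) - 1
        return self.rows[key][i] if i >= 0 else None
    def rollback(self, t, key):
        # Fine-grained rollback: drop only this partition's newer versions.
        i = bisect.bisect_right(self.times[key], t)
        del self.times[key][i:]
        del self.rows[key][i:]
```

Because dependencies and rollback are tracked per partition, undoing an attacker's writes to one user's rows does not disturb versions belonging to other keys.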
Third, to help users recover from attacks that involve client machines, such as cross-site scripting,
WARP performs DOM-level replay of user input. In our example, WARP’s repair process will first
roll back any database changes caused by Alice’s browser, then open a repaired (and presumably
no longer malicious) version of the Wiki page Alice visited, and replay the inputs Alice originally
provided to the infected page. Operating at the DOM level allows WARP to replay user input even if
the underlying page changed (e.g., the attack’s HTML and JavaScript is gone), and can often preserve
legitimate changes without any user input. WARP uses a client-side browser extension to record and
upload events to the server, and uses a browser clone on the server to re-execute them.
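The benefit of replaying at the DOM level rather than replaying raw HTTP can be illustrated with a small sketch (the page and event representations here are invented, not WARP's wire format):

```python
def replay_events(events, page):
    """Replay recorded user events against a (possibly repaired) page,
    targeting elements by id.  Replay survives changes to the page as
    long as the target element still exists; otherwise the event is
    queued as a conflict for the user to resolve."""
    conflicts = []
    for ev in events:
        el = page.get(ev["target"])        # page: element id -> attributes
        if el is None:
            conflicts.append(ev)           # element gone after repair
        elif ev["type"] == "input":
            el["value"] = ev["value"]
        elif ev["type"] == "click":
            el["clicked"] = True
    return conflicts
```

A user's edit to a surviving form field replays cleanly even though the attack markup is gone, while an event targeting an attacker-injected element surfaces as a conflict.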
To evaluate our ideas in practice, we built a prototype of WARP, and ported MediaWiki, a
popular Wiki application, to run on WARP. We show that an administrator using WARP can fully
recover from six different attacks on MediaWiki, either by retroactively applying a security patch (for
software vulnerabilities), or by undoing a past action (for administrator’s mistakes). WARP requires
no application changes, incurs a 24–27% CPU and 2–3.2 GB/day storage cost on a single server, and
requires little user input.
In the rest of this paper, we start with an overview of WARP’s design and assumptions in §2.
We describe the key aspects of WARP’s design—retroactive patching, the time-travel database, and
browser re-execution—in §3, §4, and §5 respectively. §6 presents our prototype implementation, and
§7 explains how all parts of WARP fit together in the context of an example. We evaluate WARP in
§8, and compare it to related work in §9. §10 discusses WARP's limitations and future work, and §11 concludes.
Figure 1: Overview of WARP’s design. Components introduced or modified by WARP are shaded; components borrowed from Retro are striped. Solid arrows are the original web application interactions that exist
without WARP. Dashed lines indicate interactions added by WARP for logging during normal execution, and
dotted lines indicate interactions added by WARP during repair.
The goal of WARP is to recover the integrity of a web application after it has been compromised by an
adversary. More specifically, WARP’s goal is to undo all changes made by the attacker to the system,
including all indirect effects of the attacker’s changes on legitimate actions of other users (e.g., through
cross-site scripting vulnerabilities), and to produce a system state as if all the legitimate changes still
occurred, but the adversary never compromised the application.
WARP’s workflow begins with the administrator deciding that he or she wants to make a retroactive
fix to the system, such as applying a security patch or changing a permission in the past. At a high
level, WARP then rolls back the system to a checkpoint before the intended time of the fix, applies the
fix, and re-executes actions that happened since that checkpoint, to construct a new system state. This
produces a repaired system state that would have been generated if all of the recorded actions happened
on the fixed system in the first place. If some of the recorded actions exploited a vulnerability that the
fix prevents, those actions will no longer have the same effect in the repaired system state, effectively
undoing the attack.
If the application is non-deterministic, there may be many possible repaired states, and WARP only
guarantees to provide one of them, which may not necessarily be the one closest to the pre-repair state.
In other words, non-deterministic changes unrelated to the attack may appear as a result of repair, and
non-determinism may increase the number of actions re-executed during repair, but the repaired state
is guaranteed to be free of effects of attack actions. Also, due to changes in system state during repair,
some of the original actions may no longer make sense during replay, such as when a user edits a Wiki
page created by the attacker and that page no longer exists due to repair. These actions are marked as
conflicts and WARP asks the user for help in resolving them.
WARP cannot undo disclosures of private data, such as if an adversary steals sensitive information
from Wiki pages, or steals a user’s password. However, when private data is leaked, WARP can still
help track down affected users. Additionally, in the case of stolen credentials, administrators can
use WARP to retroactively change the passwords of affected users (at the risk of undoing legitimate
changes), or revert just the attacker’s actions, if they can identify the attacker’s IP address.
The rest of this section first provides a short review of Retro, and then discusses how WARP builds
on the ideas from Retro to repair from intrusions in web applications, followed by a summary of the
assumptions made by WARP.
Review of Retro
Repairing from an intrusion in Retro, which operates at the operating system level, involves five steps.
First, during normal execution, Retro records a log of all system calls and periodically checkpoints the
file system. Second, the administrator must detect the intrusion, and track down the initial attack action
(such as a user accidentally running a malware binary). Third, Retro rolls back the files affected by the
attack to a checkpoint before the intrusion. Fourth, Retro re-executes legitimate processes that were
affected by the rolled-back file (e.g., any process that read the file in the past), but avoids re-executing
the attack action. Finally, to undo indirect effects of the attack, Retro finds any other processes whose
inputs may have changed as a result of re-execution, rolls back any files they modified, and recursively
re-executes them too.
A naïve system that re-executed every action since the attack would face two challenges. First,
re-execution is expensive: if the attack occurred a week ago, re-executing everything may take
another week. Second, re-execution may produce different results, for reasons that have nothing to
do with the attack (e.g., because some process is non-deterministic). A different output produced by
one process can lead to a conflict when Retro tries to re-execute subsequent processes, and would
require user input to resolve. For example, re-executing sshd can generate a different key for an ssh
connection, which makes it impossible to replay that connection’s network packets. Thus, while Retro
needs some processes to produce different outputs (e.g., to undo the effects of an attack), Retro also
needs to minimize re-execution in order to minimize conflicts that require user input, and to improve repair performance.
To reduce re-execution, Retro checks for equivalence of inputs to a process before and after repair,
to decide whether to re-execute a process. If the inputs to a process during repair are identical to the
inputs originally seen by that process, Retro skips re-execution of that process. Thus, even if some of
the files read by a process may have been changed during repair, Retro need not re-execute a process
that did not read the changed parts of the file.
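The idea of checking only the parts a process actually read can be sketched as follows (the byte-range bookkeeping is invented for illustration; Retro tracks such dependencies through its action history graph):

```python
def needs_reexecution(read_ranges, original, repaired):
    """Retro-style equivalence check: a past reader must be re-executed
    only if some byte range it actually read differs between the
    original and repaired file contents."""
    return any(original[s:e] != repaired[s:e] for s, e in read_ranges)
```

A process that read only the unchanged header of a patched file is skipped, while one that read the patched bytes is queued for re-execution.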
Retro’s design separates the overall logic of rollback, repair, and recursive propagation (the
repair controller) from the low-level details of file system rollback and process re-execution (handled
by individual repair managers). During normal execution, managers record information about
checkpoints, actions, and dependencies into a global data structure called an action history graph, and
periodically garbage-collect old checkpoints and action history graph entries. A node in the action
history graph logically represents the history of some part of the system over time, such as all versions
of a certain file or directory. The action history graph also contains actions, such as a process executing
for some period of time or issuing a system call. An action has dependencies to and from nodes at a
specific time, indicating the versions of a node that either influenced or were influenced by that action.
During repair, the repair controller consults the action history graph, and invokes the managers as
needed for rollback and re-execution. We refer the reader to Kim et al. [14] for further details.
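A minimal action-history-graph sketch is shown below (the structure is guessed from the description above; Retro's real graph also carries checkpoints and versioned node timelines):

```python
from collections import defaultdict

class ActionHistoryGraph:
    """Nodes are (object, version) pairs; actions have dependency edges
    from the versions they read and to the versions they wrote."""
    def __init__(self):
        self.deps_out = defaultdict(set)    # action -> {(obj, ver)} written
        self.readers = defaultdict(set)     # (obj, ver) -> actions that read it
    def record(self, action, inputs, outputs):
        self.deps_out[action] |= set(outputs)
        for node in inputs:
            self.readers[node].add(action)
    def affected(self, action):
        """Actions that transitively read anything this action wrote,
        i.e., the candidates for recursive re-execution."""
        out, frontier = set(), [action]
        while frontier:
            a = frontier.pop()
            for node in self.deps_out[a]:
                for r in self.readers[node] - out:
                    out.add(r)
                    frontier.append(r)
        return out
```

Starting from the attack action, `affected` yields exactly the downstream actions the repair controller must consider, leaving unrelated actions alone.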
Repairing web applications
WARP builds on Retro’s repair controller to repair from intrusions in web applications. Figure 1
illustrates WARP’s design, and its relation to components borrowed from Retro (in particular, the repair
controller, and the structure of the action history graph). WARP’s design involves the web browser,
HTTP server, application code, and database. Each of these four components corresponds to a repair
manager in WARP, which records enough information during normal operation to perform rollback
and re-execution during repair.
To understand how WARP repairs from an attack, consider the example scenario we presented in
§1, where an attacker uses a cross-site scripting attack to inject malicious JavaScript code into a Wiki
page. When Alice visits that page, her browser runs the malicious code, and issues HTTP requests
to propagate the attack to another page and to give the attacker access to Alice’s pages. The attacker
then uses his newfound access to corrupt some of Alice’s pages. In the meantime, other users continue
using the Wiki site: some users visit the page containing the attack code, other users visit and edit
pages corrupted by the attack, and yet other users visit unaffected pages.
Some time after the attack takes place, the administrator learns that a cross-site scripting vulnerability was discovered by the application’s developers, and a security patch for one of the source files—say,
calendar.php—is now available. In order to retroactively apply this security patch, WARP first
determines which runs of the application code may have been affected by a bug in calendar.php.
WARP then applies the security patch to calendar.php, and considers re-executing all potentially
affected runs of the application. In order to re-execute the application, WARP records sufficient
information during the original execution² about all of the inputs to the application, such as the HTTP
request. To minimize the chance that the application re-executes differently for reasons other than the
security patch, WARP records and replays the original return values from non-deterministic function
calls. §3 discusses how WARP implements retroactive patching in more detail.
Now consider what happens when WARP re-executes the application code for the attacker’s
initial request. Instead of adding the attacker’s JavaScript code to the Wiki page as it did during the
original execution, the newly patched application code will behave differently (e.g., pass the attacker’s
JavaScript code through a sanitization function), and then issue an SQL query to store the resulting
page in the database. This SQL query must logically replace the application’s original query that
stored an infected page, so WARP first rolls back the database to its state before the attack took place.
After the database has been rolled back, and the new query has executed, WARP must determine
what other parts of the system were affected by this changed query. To do this, during original
execution WARP records all SQL queries, along with their results. During repair, WARP re-executes
any queries it determines may have been affected by the changed query. If a re-executed query produces
results different from the original execution, WARP re-executes the corresponding application run as
well, such as Alice’s subsequent page visit to the infected page. §4 describes the design of WARP’s
time-travel database in more detail, including how it determines query dependencies, how it re-executes
queries in the past, and how it minimizes rollback.
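The query-level propagation step can be sketched as follows (the log-entry format is invented; WARP compares actual SQL result sets):

```python
def find_dirty_runs(query_log, db):
    """Re-run logged read queries against the rolled-back database and
    return the application runs whose results changed; only those runs
    need to be re-executed (e.g., Alice's visit to the formerly
    infected page)."""
    dirty = set()
    for entry in query_log:        # entry: {"run", "query", "recorded"}
        if entry["query"](db) != entry["recorded"]:
            dirty.add(entry["run"])
    return dirty
```

Runs whose queries return the same results as originally recorded are skipped, which is what keeps repair from cascading to the whole workload.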
When the application run for Alice’s visit to the infected page is re-executed, it generates a
different HTTP response for Alice’s browser (with the attack now gone). WARP must now determine
how Alice’s browser would behave given this new page. Simply undoing all subsequent HTTP requests
from Alice’s browser would needlessly undo all of her legitimate work, and asking Alice to manually
check each HTTP request that her browser made is not practical either. To help Alice recover from
such attacks, WARP provides a browser extension that records all events for each open page in her
browser (such as HTTP requests and user input) and uploads this information to the server. If WARP
determines that her browser may have been affected by an attack, it starts a clone of her browser on the
server, and re-executes her original input on the repaired page, without having to involve her. Since
Alice’s re-executed browser will no longer issue the HTTP requests from the XSS attack, WARP will
²We use the terms "original execution" and "normal execution" interchangeably.
recursively undo the effects of those requests as well. §5 explains how WARP’s browser extension
works in more detail.
If a user’s actions depend on changes by the attacker, WARP may be unable to replay the user’s
original inputs in the browser clone. For example, if the attacker created a new Wiki page, and a
curious user subsequently edited that page, WARP will not be able to re-execute the user’s actions
once the attack is undone. In this case, WARP signals a conflict and asks the user (or administrator) to
resolve it. WARP cannot rely on users being always online, so WARP queues the conflict, and proceeds
with repair.
When the user next logs in, WARP redirects the user to a conflict resolution page. To resolve a
conflict, the user is presented with the original page they visited, the newly repaired version of that
page, and the original action that the server is unable to replay on the new page, and is asked to specify
what actions they would like to perform instead. For example, the user can ask WARP to cancel that
page visit altogether. Users or administrators can also use the same mechanism to undo their own
actions from the past, such as if an administrator accidentally gave administrative privileges to a user.
§5 further discusses WARP’s handling of conflicts and user-initiated undo.
Assumptions
To recover from intrusions, WARP makes two key assumptions. First, WARP assumes that the adversary
does not exploit any vulnerabilities in the HTTP server, database, or the application’s language runtime,
does not cause the application code to execute arbitrary code (e.g., spawning a Unix shell), and does
not corrupt WARP’s log. Most web application vulnerabilities fall into this category [10], and §8 shows
how WARP can repair from common attacks such as cross-site scripting, SQL injection, cross-site
request forgery, and clickjacking.
Second, to recover from intrusions that involve a user’s browser, our prototype requires the user to
install a browser extension that uploads dependency information to WARP-enabled web servers. In
principle, the same functionality could be performed purely in JavaScript (see §10), but for simplicity,
our prototype uses a separate extension. WARP’s server trusts each browser’s log information only
as much as it trusts the browser’s HTTP requests. This ensures that a malicious user cannot gain
additional privileges by uploading a log containing user input that tries to issue different HTTP requests.
If one user does not have our prototype’s extension installed, but gets compromised by a cross-site
scripting attack, WARP will not be able to precisely undo the effects of malicious JavaScript code in
that user’s browser. As a result, server-side state accessible to that user (e.g., that user’s Wiki pages or
documents) may remain corrupted. However, WARP will still inform the user that his or her browser
might have received a compromised reply from the server in the past. At that point, the user can
manually inspect the set of changes made to his data from that point onward, and cancel his or her
previous HTTP requests, if unwanted changes are detected.
Retroactive patching
To implement retroactive patching, WARP's application repair manager must be able to determine
which runs of an application may have been affected by a given security patch, and to re-execute them
during repair. To enable this, WARP’s application repair manager interposes on the application’s language runtime (PHP in our current prototype) to record any dependencies to and from the application,
including application code loaded at runtime, queries issued to the database, and HTTP requests and
responses sent to or from the HTTP server.
Normal execution
During normal execution, the application repair manager records three types of dependencies for the
executing application code (along with the dependency’s data, used later for re-execution). First, the
repair manager records an input dependency to the HTTP request and an output dependency to the
HTTP response for this run of the application code (along with all headers and data). Second, for each
read or write SQL query issued by the application, the repair manager records, respectively, input or
output dependencies to the database. Third, the repair manager records input dependencies on the
source code files used by the application to handle its specific HTTP request. This includes the initial
PHP file invoked by the HTTP request, as well as any additional PHP source files loaded at runtime
through require or include statements.
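Such dependency recording can be sketched as a wrapper around the interposed calls (the decorator and entry format are invented for illustration; the prototype interposes on the PHP runtime rather than decorating functions):

```python
def logged(log, kind):
    """Decorator sketch: append a dependency-log entry recording each
    call's inputs and outputs, in the spirit of how the application
    repair manager logs SQL queries and HTTP request handling."""
    def wrap(fn):
        def inner(*args):
            out = fn(*args)
            log.append({"kind": kind, "fn": fn.__name__,
                        "in": args, "out": out})
            return out
        return inner
    return wrap
```

Wrapping the query and request paths this way yields exactly the input/output record that repair later needs to decide whether a re-executed call diverged.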
In addition to recording external dependencies, WARP’s application manager also records certain
internal functions invoked by the application code, to reduce non-determinism during re-execution.
This includes calls to functions that return the current date or time, functions that return randomness
(such as mt_rand in PHP), and functions that generate unique identifiers for HTTP sessions (such as
session_start in PHP). For each of these functions, the application manager records the arguments
and return value. This information is used to avoid re-executing these non-deterministic functions
during repair, as we will describe shortly.
Initiating repair
To initiate repair through retroactive patching, the administrator needs to provide the filename of
the buggy source code file, a patch to that file which removes the vulnerability, and a time at which
this patch should be applied (by default, the oldest time available in WARP’s log). In response, the
application repair manager adds a new action to WARP’s action history graph, whose re-execution
would apply the patch to the relevant file at the specified (past) time. The application repair manager
then requests that WARP’s repair controller re-execute the newly synthesized action. WARP will first
re-execute this action (i.e., apply the patch to the file in question), and then use dependencies recorded
by the application repair manager to find and re-execute all runs of the application that loaded the
patched source code file.
During re-execution, the application repair manager invokes the application code in much the same
way as during normal execution, with two differences. First, all inputs and outputs to and from the
application are handled by the repair controller. This allows the repair controller to determine when
re-execution is necessary, such as when a different SQL query is issued during repair, and to avoid
re-executing actions that are not affected or changed.
Second, the application repair manager tries to match up calls to non-deterministic functions during
re-execution with their counterparts during the original run. In particular, when a non-deterministic
function is invoked during re-execution, the application repair manager searches for a call to the same
function, from the same caller location. If a match is found, the application repair manager uses the
original return value in lieu of invoking the function. The repair manager matches non-deterministic
function calls from the same call site in-order (i.e., two non-deterministic function calls that happened
in some order during re-execution will always be matched up to function calls in that same order
during the original run).
One important aspect of this heuristic is that it is strictly an optimization. Even if the heuristic
fails to match up any of the non-deterministic function calls, the repair process will still be correct,
at the cost of increased re-execution (e.g., if the application code generates a different HTTP cookie
during re-execution, WARP will be forced to re-execute all page visits that used that cookie).
The job of WARP’s time-travel database is to checkpoint and roll back the application’s persistent data,
and to re-execute past SQL queries during repair. Its design is motivated by two requirements: first,
the need to minimize the number of SQL queries that have to be re-executed during repair, and second,
the need to repair a web application concurrently with normal operation. This section discusses how
WARP addresses these requirements.
Reducing re-execution
Minimizing the re-execution of SQL queries during repair is complicated by the fact that clients issue
queries over entire tables, and tables often contain data for many independent users or objects of the
same type.
There are two reasons why WARP may need to re-execute an SQL query. First, an SQL query
that modifies the database (e.g., an INSERT, UPDATE, or DELETE statement) needs to be re-executed in
order to re-apply legitimate changes to a database after rollback. Second, an SQL query that reads the
database (e.g., a SELECT statement, or any statement with a WHERE clause) needs to be re-executed if
the data read by that statement may have changed as a result of repair.
To minimize re-execution of write SQL queries, the database manager performs fine-grained
rollback, at the level of individual rows in a table. This ensures that, if one row is rolled back, it may
not be necessary to re-execute updates to other rows in the same table. One complication lies in the
fact that SQL has no inherent way of naming unique rows in a database. To address this limitation,
WARP introduces the notion of a row ID, which is a unique name for a row in a table. Many web
applications already use synthetic primary keys which can serve as row IDs; in this case, WARP uses
that primary key as a row ID in that table. If a table does not already have a suitable row ID column,
WARP’s database manager transparently adds an extra row_id column for this purpose.
To minimize re-execution of SQL queries that read the database, the database manager logically
splits the table into partitions, based on the values of one or more of the table’s columns. During
repair, the database manager keeps track of the set of partitions that have been modified (as a result of
either rollback or re-execution), and avoids re-executing SQL queries that read from only unmodified
partitions. To determine the partitions read by an SQL query, the database manager inspects the
query’s WHERE clause. If the database manager cannot determine what partitions a query might read
based on the WHERE clause, it conservatively assumes that the query reads all partitions.
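The partition analysis can be sketched as a conservative inspection of the WHERE clause. This is a simplified illustration, assuming equality predicates on string literals; WARP's actual analysis operates on parsed SQL, and the function name is hypothetical.

```python
import re

def partitions_read(where_clause, partition_columns):
    """Return the set of (column, value) partitions a query reads, or
    None when the clause cannot be analyzed, meaning the query must be
    conservatively assumed to read all partitions."""
    parts = set()
    for col in partition_columns:
        # Look for simple equality predicates like: col = 'value'
        for m in re.finditer(rf"\b{col}\s*=\s*'([^']*)'", where_clause):
            parts.add((col, m.group(1)))
    return parts or None  # None => all partitions

# A query keyed on a partitioning column touches one partition:
assert partitions_read("title = 'Main_Page'", ["title", "editor_id"]) == \
       {("title", "Main_Page")}
# An unanalyzable clause falls back to "all partitions":
assert partitions_read("length(content) > 10", ["title"]) is None
```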
In our current prototype, the programmer or administrator must manually specify the row ID
column for each table (if they want to avoid the overhead of an extra row_id column created by
WARP), and the partitioning columns for each table (if they want to benefit from the partitioning
optimization). A partitioning column need not be the same column as the row ID. For example, a
Wiki application may store Wiki pages in a table with four columns: a unique page ID, the page title,
the user ID of the last user who edited the page, and the contents of that Wiki page. Because the
title, the last editor’s user ID, and the content of a page can change, the programmer would specify
the immutable page ID as the row ID column. However, the application’s SQL queries may access
pages either by their title or by the last editor’s user ID, so the programmer would specify them as the
partitioning columns.
Re-executing multi-row queries
SQL queries can access multiple rows in a table at once, if the query’s WHERE clause does not guarantee
a unique row. Re-executing such queries—where WARP cannot guarantee by looking at the WHERE
clause that only a single row is involved—poses two challenges. First, in the case of a query that
may read multiple rows, WARP must ensure that all of those rows are in the correct state prior to
re-executing that query. For instance, if some of those rows have been rolled back to an earlier
version due to repair, but other rows have not been rolled back since they were not affected, naïvely
re-executing the multi-row query can produce incorrect results, mixing data from old and new rows.
Second, in the case of a query that may modify multiple rows, WARP must roll back all of those rows
prior to re-executing that query, and subsequently re-execute any queries that read those rows.
To re-execute multi-row read queries, WARP performs continuous versioning of the database, by
keeping track of every value that ever existed for each row. When re-executing a query that accesses
some rows that have been rolled back, and other rows that have not been touched by repair, WARP
allows the re-executed query to access the old value of the untouched rows from precisely the time that
query originally ran. Thus, continuous versioning allows WARP’s database manager to avoid rolling
back and reconstructing rows for the sole purpose of re-executing a read query on their old value.
To re-execute multi-row write queries, WARP performs two-phase re-execution by splitting the
query into two parts: the WHERE clause, and the actual write query. During normal execution, WARP
records the set of row IDs of all rows affected by a write query. During re-execution, WARP first
executes a SELECT statement to obtain the set of row IDs matching the new WHERE clause. These row
IDs correspond to the rows that would be modified by this new write query on re-execution. WARP
uses continuous versioning to precisely roll back both the original and new row IDs to a time just
before the write query originally executed. It then re-executes the write query on this rolled-back state.
To implement continuous versioning, WARP augments every table with two additional columns,
start_time and end_time, which indicate the time interval during which that row value was valid.
Each row R in the original table becomes a series of rows in the continuously versioned table, where
the end_time value of one version of R is the start_time value of the next version of R. The column
end_time can have the special value ∞, indicating that row version is the current value of R. During
normal execution, if an SQL query modifies a set of rows, WARP sets end_time for the modified
rows to the current time, with the rest of the columns retaining their old values, and inserts a new
set of rows with start_time set to the current time, end_time set to ∞, and the rest of the columns
containing the new versions of those rows. When a row is deleted, WARP simply sets end_time to
the current time. Read queries during normal execution always access rows with end_time = ∞.
Rolling back a row to time T involves deleting versions of the row with start_time ≥ T and setting
end_time ← ∞ for the version with the largest remaining end_time.
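These versioning rules can be sketched concretely. The following uses SQLite and a Python float infinity in place of ∞, purely for illustration; the WARP prototype is built on PostgreSQL, and the table and function names here are invented.

```python
import sqlite3

INF = float("inf")  # stands in for the special end_time value of infinity

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page (row_id INTEGER, title TEXT,"
           " start_time REAL, end_time REAL)")

def update_title(row_id, new_title, now):
    # Close out the current version and insert the new one.
    db.execute("UPDATE page SET end_time = ? WHERE row_id = ? AND end_time = ?",
               (now, row_id, INF))
    db.execute("INSERT INTO page VALUES (?, ?, ?, ?)",
               (row_id, new_title, now, INF))

def rollback(row_id, t):
    # Delete versions with start_time >= t, then make the latest
    # surviving version current again (end_time = infinity).
    db.execute("DELETE FROM page WHERE row_id = ? AND start_time >= ?",
               (row_id, t))
    db.execute("UPDATE page SET end_time = ? WHERE row_id = ? AND end_time ="
               " (SELECT MAX(end_time) FROM page WHERE row_id = ?)",
               (INF, row_id, row_id))

db.execute("INSERT INTO page VALUES (1, 'v1', 10, ?)", (INF,))
update_title(1, 'v2', 20)          # row now has versions [10,20) and [20,inf)
rollback(1, 15)                    # undoes the update made at time 20
cur = db.execute("SELECT title FROM page WHERE row_id = 1 AND end_time = ?",
                 (INF,))
assert cur.fetchone() == ('v1',)   # the original version is current again
```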
Since WARP’s continuous versioning database grows in size as the application makes modifications,
the database manager periodically deletes old versions of rows. Since repair requires that both the
old versions of database rows and the action history graph be available for rollback and re-execution,
the database manager deletes old rows in sync with WARP’s garbage-collection of the action history graph.
Concurrent repair and normal operation
Since web applications are often serving many users, it is undesirable to take the application offline
while recovering from an intrusion. To address this problem, WARP’s database manager introduces the
notion of repair generations, identified by an integer counter, which are used to denote the state of the
database after a given number of repairs. Normal execution happens in the current repair generation.
When repair is initiated, the database manager creates the next repair generation (by incrementing
the current repair generation counter by one), which creates a fork of the current database contents.
All database operations during repair are applied to the next generation. If, during repair, users make
changes to parts of the current generation that are being repaired, WARP will re-apply the users’
changes to the next generation through re-execution. Changes to parts of the database not under repair
are copied verbatim into the next generation. Once repair is near completion, the web server is briefly
suspended, any final requests are re-applied to the next generation, the current generation is set to the
next generation, and the web server is resumed.
To implement repair generations, WARP augments all tables with two additional columns,
start_gen and end_gen, which indicate the generations in which a row is valid. Much as with
continuous versioning, end_gen = ∞ indicates that the row has not been superseded in any later
generation. During normal execution, queries execute over rows that match start_gen ≤ current and
end_gen ≥ current. During repair, if a row with start_gen < next and end_gen ≥ next is about
to be updated or deleted (due to either re-execution or rollback), the existing row’s end_gen is set to
current, and, in case of updates, the update is executed on a copy of the row with start_gen = next.
Rewriting SQL queries
WARP intercepts all SQL queries made by the application, and transparently rewrites them to implement
database versioning and generations. For each query, WARP determines the time and generation in
which the query should execute. For queries issued as part of normal execution, WARP uses the current
time and generation. For queries issued as part of repair, WARP’s repair controller explicitly specifies
the time for the re-executed query, and the query always executes in the next generation.
To execute a SELECT query at time T in generation G, WARP restricts the query to run over
currently valid rows by augmenting its WHERE clause with AND start_time ≤ T ≤ end_time AND
start_gen ≤ G ≤ end_gen.
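A minimal sketch of this SELECT rewriting, assuming the query already has a WHERE clause and no subqueries (WARP's real rewriter handles general SQL; this helper is illustrative):

```python
def rewrite_select(query, T, G):
    """Append WARP-style versioning and generation predicates to a
    SELECT query's WHERE clause. Simplified sketch: assumes an existing
    WHERE clause and numeric time/generation values."""
    pred = (f" AND start_time <= {T} AND end_time >= {T}"
            f" AND start_gen <= {G} AND end_gen >= {G}")
    return query + pred

q = rewrite_select("SELECT * FROM page WHERE title = 'x'", T=100, G=2)
assert q.endswith("AND start_gen <= 2 AND end_gen >= 2")
```

During normal execution T and G are the current time and generation; during repair the repair controller supplies a past T and the next generation.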
During normal execution, on an UPDATE or DELETE query at time T (the current time), WARP
implements versioning by making a copy of the rows being modified. To do this, WARP sets the
end_time of rows being modified in the current generation to T, and inserts copies of the rows with
start_time ← T, end_time ← ∞, start_gen ← G, and end_gen ← ∞, where G = current.
WARP also restricts the WHERE clause of such queries to run over currently valid rows, as with SELECT
queries above. On an INSERT query, WARP sets the start_time, end_time, start_gen, and end_gen
columns of the inserted row as for UPDATE and DELETE queries above.
To execute an UPDATE or DELETE query during repair at time T, WARP must first preserve any
rows being modified that are also accessible from the current generation, so that they continue to be
accessible to concurrently executing queries in the current generation. To do so, WARP creates a copy
of all matching rows, with end_gen set to current, sets the start_gen of the rows to be modified
to next, and then executes the UPDATE or DELETE query as above, except in generation G = next.
Executing an INSERT query during repair does not require preserving any existing rows; in this case,
WARP simply performs the same query rewriting as for normal execution, with G = next.
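The generation-preserving step for a repair-time UPDATE can be sketched over in-memory rows. This is an illustrative model of the rule (rows as dicts, callables in place of SQL), not the prototype's query rewriter:

```python
def apply_repair_update(rows, match, update, current_gen, next_gen):
    """Sketch of WARP's repair-time UPDATE: a row visible to the current
    generation is first copied (with end_gen = current) so concurrent
    queries still see its old value, and the update is applied to a new
    copy valid from the next generation onward. Names are illustrative."""
    out = []
    for row in rows:
        visible_now = row["start_gen"] <= current_gen <= row["end_gen"]
        if match(row) and visible_now and row["end_gen"] >= next_gen:
            # Preserve the old value for readers in the current generation.
            out.append(dict(row, end_gen=current_gen))
            new = dict(row, start_gen=next_gen)
            update(new)
            out.append(new)
        else:
            out.append(row)
    return out

rows = [{"title": "a", "start_gen": 1, "end_gen": float("inf")}]
repaired = apply_repair_update(rows, lambda r: True,
                               lambda r: r.update(title="b"),
                               current_gen=1, next_gen=2)
# repaired now holds the preserved old row and the updated next-gen row
```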
To help users recover from attacks that took place in their browsers, WARP uses two ideas. First, when
WARP determines that a past HTTP response was incorrect, it re-executes the changed web page in
a cloned browser on the server, in order to determine how that page would behave as a result of the
change. For example, if a new HTTP response no longer contains an adversary’s JavaScript code
(e.g., because the cross-site scripting vulnerability was retroactively patched), re-executing the page in
a cloned browser will not generate the HTTP requests that the attacker’s JavaScript code may have
originally initiated, and will thus allow WARP to undo those requests.
Second, WARP performs DOM-level replay of user input when re-executing pages in a browser.
By recording and re-executing user input at the level of the browser’s DOM, WARP can better capture
the user’s intent as to what page elements the user was trying to interact with. A naïve approach that
recorded pixel-level mouse events and keystrokes may fail to replay correctly when applied to a page
whose HTML code has changed slightly. On the other hand, DOM elements are more likely to be
unaffected by small changes to an HTML page, allowing WARP to automatically re-apply the user’s
original inputs to a modified page during repair.
Tracking page dependencies
In order to determine what should be re-executed in the browser given some changes on the server,
WARP needs to be able to correlate activity on the server with activity in users’ browsers.
First, to correlate requests coming from the same web browser, WARP’s browser extension assigns
each client a unique client ID value. The client ID also helps WARP keep track of log information
uploaded to the server by different clients. The client ID is a long random value to ensure that an
adversary cannot guess the client ID of a legitimate user and upload logs on behalf of that user.
Second, WARP also needs to correlate different HTTP requests coming from the same page in a
browser. To do this, WARP introduces the notion of a page visit, corresponding to the period of time
that a single web page is open in a browser frame (e.g., a tab, or a sub-frame in a window). If the
browser loads a new page in the same frame, WARP considers this to be a new visit (regardless of
whether the frame navigated to a different URL or to the same URL), since the frame’s page starts
executing in the browser anew. In particular, WARP’s browser extension assigns each page visit a
visit ID, unique within a client. Each page visit can also have a dependency on a previous page visit.
For example, if the user clicks on a link as part of page visit #1, the browser extension creates page
visit #2, which depends on page visit #1. This allows WARP to check whether page visit #2 needs to
re-execute if page visit #1 changes. If the user clicks on more links, and later hits the back button to
return to the page from visit #2, this creates a fresh page visit #N (for the same page URL as visit #2),
which also depends on visit #1.
Finally, WARP needs to correlate HTTP requests issued by the web browser with HTTP requests
received by the HTTP server, for tracking dependencies. To do this, the WARP browser extension
assigns each HTTP request a request ID, unique within a page visit, and sends the client ID, visit ID,
and request ID along with every HTTP request to the server via HTTP headers.
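The ID scheme can be sketched as follows. The header names below are invented for illustration (the paper does not specify them), and the real logic lives in WARP's Firefox extension rather than in Python:

```python
import secrets
from itertools import count

class WarpClientIds:
    """Illustrative sketch of the client/visit/request ID scheme."""
    def __init__(self):
        # Long random client ID, so an adversary cannot guess the
        # client ID of a legitimate user and upload logs on their behalf.
        self.client_id = secrets.token_hex(16)
        self.visit_counter = count(1)
        self.start_visit()

    def start_visit(self):
        # A new page load in a frame starts a fresh visit; request IDs
        # are unique within a visit, so the counter resets.
        self.visit_id = next(self.visit_counter)
        self.request_counter = count(1)

    def headers(self):
        # Attached to every outgoing HTTP request.
        return {"X-Warp-Client": self.client_id,
                "X-Warp-Visit": str(self.visit_id),
                "X-Warp-Request": str(next(self.request_counter))}

ids = WarpClientIds()
h1, h2 = ids.headers(), ids.headers()
assert h1["X-Warp-Request"] == "1" and h2["X-Warp-Request"] == "2"
ids.start_visit()
assert ids.headers()["X-Warp-Visit"] == "2"
```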
On the server side, the HTTP server’s manager records dependencies between HTTP requests
and responses (identified by a ⟨client_id, visit_id, request_id⟩ tuple) and runs of application code
(identified by a ⟨pid, count⟩ tuple, where pid is the PID of the long-lived PHP runtime process, and
count is a unique counter identifying a specific run of the application).
Recording events
During normal execution, the browser extension performs two tasks. First, it annotates all HTTP
requests, as described above, with HTTP headers to help the server correlate client-side actions with
server-side actions. Second, it records all JavaScript events that occur during each page visit (including
timer events, user input events, and postMessage events). For each event, the extension records event
parameters, including time and event type, and the XPath of the event’s target DOM element, which
helps perform DOM-level replay during repair.
The extension uploads its log of JavaScript events for each page visit to the server, using a separate
protocol (tagged with the client ID and visit ID). On the server side, WARP’s HTTP server records the
submitted information from the client into a separate per-client log, which is subject to its own storage
quota and garbage-collection policy. This ensures that a single client cannot monopolize log space on
the server, and more importantly, cannot cause a server to garbage-collect recent log entries from other
users needed for repair.
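The per-client quota can be sketched as follows; this is an illustrative model (byte-counted entries, made-up class name), not the prototype's storage policy implementation:

```python
from collections import OrderedDict

class PerClientLog:
    """Sketch of per-client log storage with a per-client quota, so one
    client's uploads cannot evict other users' log entries."""
    def __init__(self, quota):
        self.quota = quota          # per-client quota, in bytes
        self.logs = {}              # client_id -> OrderedDict(visit_id -> entry)

    def append(self, client_id, visit_id, entry):
        log = self.logs.setdefault(client_id, OrderedDict())
        log[visit_id] = entry
        # Garbage-collect only this client's oldest entries.
        while sum(len(e) for e in log.values()) > self.quota:
            log.popitem(last=False)

store = PerClientLog(quota=10)
store.append("alice", 1, b"x" * 8)
store.append("bob", 1, b"y" * 8)    # does not touch alice's log
store.append("alice", 2, b"z" * 8)  # evicts only alice's oldest visit
assert list(store.logs["alice"]) == [2]
assert list(store.logs["bob"]) == [1]
```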
Although the current WARP prototype implements client-side logging using an extension, the
extension does not circumvent any of the browser’s privacy policies. All of the information recorded
by WARP’s browser extension can be captured at the JavaScript level by event handlers, and in future
work, we hope to implement an extension-less version of WARP’s browser logging by interposing on
all events using JavaScript rewriting.
Server-side re-execution
When WARP determines that an HTTP response changed during repair, the browser repair manager
spawns a browser on the server to re-execute the client’s uploaded browser log for the affected page
visit. This re-execution browser loads the client’s HTTP cookies, loads the same URL as during
original execution, and replays the client’s original DOM-level events. The user’s cookies are loaded
either from the HTTP server’s log, if re-executing the first page for a client, or from the last browser
page re-executed for that client. The re-executed browser runs in a sandbox, and only has access to
the client’s HTTP cookie, ensuring that it gets no additional privileges despite running on the server.
To handle HTTP requests from the re-executing browser, the HTTP server manager starts a separate
copy of the HTTP server, which passes any HTTP requests to the repair controller, as opposed to
executing them directly. This allows the repair controller to prune re-execution for identical requests
or responses.
During repair, WARP uses a re-execution extension in the server-side browser to replay the events
originally recorded by the user’s browser. For each event, the re-execution extension tries to locate the
appropriate DOM element using its XPath. For keyboard input events into text fields, the re-execution
extension performs a three-way text merge between the original value of the text field, the new value
of the text field during repair, and the user’s original keyboard input. For example, this allows the
re-execution extension to replay the user’s changes to a text area when editing a Wiki page, even if the
Wiki page in the text area is somewhat different during repair.
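The idea behind the three-way merge can be sketched with a naive per-line merge. This is not WARP's actual algorithm (the paper does not give one); it is a minimal illustration of merging the repaired field value with the user's recorded input, signalling a conflict when both sides changed the same line:

```python
def merge3(base, ours, theirs):
    """Minimal per-line three-way merge sketch. 'base' is the field's
    original value, 'ours' its repaired value, 'theirs' the value after
    the user's recorded input. Returns (merged_text, conflict_flag)."""
    if ours == theirs or base == theirs:
        return ours, False
    if base == ours:
        return theirs, False
    b, o, t = base.splitlines(), ours.splitlines(), theirs.splitlines()
    if not (len(b) == len(o) == len(t)):
        return None, True           # structure changed too much
    out = []
    for bl, ol, tl in zip(b, o, t):
        if ol != bl and tl != bl and ol != tl:
            return None, True       # both sides edited the same line
        out.append(tl if tl != bl else ol)
    return "\n".join(out), False

# Repair changed line 2, the user changed line 3: both edits survive.
assert merge3("a\nb\nc", "a\nB\nc", "a\nb\nC") == ("a\nB\nC", False)
# Both sides changed the same text differently: conflict.
assert merge3("x", "y", "z") == (None, True)
```

A conflict here corresponds to the case the next paragraph describes, where re-execution is stopped and the user is asked to resolve it.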
If, after repair, a user’s HTTP cookie in the cloned browser differs from the user’s cookie in his or
her real browser (based on the original timeline), WARP queues that client’s cookie for invalidation,
and the next time the same client connects to the web server (based on the client ID), the client’s
cookie will be deleted. WARP assumes that the browser has no persistent client-side state aside from
the cookie. Repair of other client-side state could be similarly handled at the expense of additional
logging and synchronization.
During repair, the server-side browser extension may fail to re-execute the user’s original inputs, if the
user’s actions somehow depended on the reverted actions of the attacker. For example, in the case of a
Wiki page, the user may have inadvertently edited a part of the Wiki page that the attacker modified.
In this situation, WARP’s browser repair manager signals a conflict, stops re-execution of that user’s
browser, and requires the user (or an administrator, in lieu of the user) to resolve the conflict.
Since users are not always online, WARP queues the conflict for later resolution, and proceeds
with repair, assuming, for now, that subsequent requests from that user’s browser do not change. When
the user next logs into the web application (based on the client ID), the application redirects the user
to a conflict resolution page, which tells the user about the page on which the conflict arose, and the
user’s input which could not be replayed. The user must then indicate how the conflict should be
resolved. For example, the user can indicate that they would like to cancel the conflicted page visit
altogether (i.e., undo all of its HTTP requests), and apply the legitimate changes (if any) to the current
state of the system by hand.
Component                        Lines of code
Firefox extension                2,000 lines of JavaScript / HTML
Apache logging module              900 lines of C
PHP runtime / SQL rewriter       1,400 lines of C and PHP
PHP re-execution support           200 lines of Python
Repair managers:                 4,300 lines of Python, total
  Retro’s repair controller        400 lines of Python
  PHP manager                      800 lines of Python
  Apache manager                   300 lines of Python
  Database manager               1,400 lines of Python and PHP
  Firefox manager                  400 lines of Python
  Retroactive patching manager     200 lines of Python
                                   800 lines of Python
Table 1: Lines of code for different components of the WARP prototype, excluding blank lines and comments.
While WARP’s re-execution extension flags conflicts that arise during replay of input from the user,
some applications may have important information that must be correctly displayed to the user. For
example, if an online banking application displayed $1,000 as the user’s account balance during the
original execution, but during repair it is discovered that the user’s balance should have been $2,000,
WARP will not raise a re-execution conflict. An application programmer, however, can provide a UI
conflict function, which, given the old and new versions of a web page, can signal a conflict even if all
of the user input events replay correctly. For the example applications we evaluated with WARP, we
did not find the need to implement such conflict functions.
User-initiated repair
In some situations, users or administrators may want to undo their own past actions. For example, an
administrator may have accidentally granted administrative privileges to a user, and later may want to
revert any actions that were allowed due to this mis-configuration. To recover from this mistake, the
administrator can use WARP’s browser extension to specify a URL of the page on which the mistake
occurred, find the specific page visit to that URL which led to the mistake, and request that the page
visit be canceled. Our prototype does not allow replacing one past action with another, although this is
mostly a UI limitation.
Allowing users to undo their own actions runs the risk of creating more conflicts, if other users’
actions depended on the action in question. To prevent cascading conflicts, WARP prohibits a regular
user (as opposed to an administrator) from initiating repair that causes conflicts for other users. WARP’s
repair generation mechanism allows WARP to try repairing the server-side state upon user-initiated
repair, and to abort the repair if any conflicts arise. The only exception to this rule is if the user’s repair
is a result of a conflict being reported to that user on that page, in which case the user is allowed to
cancel all actions, even if it propagates a conflict to another user.
We have implemented a prototype of WARP which builds on Retro. Our prototype works with the
Firefox browser on the client, and Apache, PostgreSQL, and PHP on the server. Table 1 shows the
lines of code for the different components of our prototype.
Our Firefox extension intercepts all HTTP requests during normal execution and adds WARP’s
client ID, visit ID, and request ID headers to them. It also intercepts all browser frame creations, and
adds an event listener to the frame’s window object. This event listener gets called on every event in
the frame, and allows us to record the event. During repair, the re-execution extension tries to match
up HTTP requests with requests recorded during normal execution, and adds the matching request ID
header when a match is found. Our current conflict resolution UI only allows the user to cancel the
conflicting page visit; other conflict resolutions must be performed by hand. We plan to build a more
comprehensive UI, but canceling has been sufficient for now.
In our prototype, the user’s client-side browser and the server’s re-execution browser use the
same version of Firefox. While this has simplified the development of our extension, we expect that
DOM-level events are sufficiently standardized in modern browsers that it would be possible to replay
events across different browsers, such as recent versions of Firefox and Chrome. We have not verified
this to date, however.
Our time-travel database and repair generations are implemented on top of PostgreSQL using
SQL query rewriting. After the application’s database tables are installed, WARP extends the schema
of all the tables to add its own columns, including row_id if no existing column was specified as the
row ID by the programmer. All database queries are rewritten to update these columns appropriately
when the rows are modified. The approach of using query rewriting was chosen to avoid modifying
the internals of the Postgres server, although an implementation inside of Postgres would likely have
been more efficient.
To allow multiple versions of a row from different times or generations to exist in the same table,
WARP modifies database uniqueness constraints and primary keys specified by the application to
include the end_time and end_gen columns. While this allows multiple versions of the same row over
time to co-exist in the same table, WARP must now detect dependencies between queries through
uniqueness violations. In particular, WARP checks whether the success (or failure) of each INSERT
query would change as a result of other rows inserted or deleted during repair, and rolls back that row
if so. WARP needs to consider INSERT statements only for partitions under repair. Our time-travel
database implementation does not support foreign keys, so it disables them. We plan to implement
foreign key constraints in the future using a database trigger. Our design is compatible with multi-statement
transactions; however, our current implementation does not support them, and we did not
need them for our current applications.
WARP extends Apache’s PHP module to log HTTP requests that invoke PHP scripts. WARP intercepts a PHP script’s calls to database functions, mt_rand, date and time functions, and session_start,
by rewriting all scripts to call a wrapper function that invokes the wrapped function and logs the
arguments and results.
We now illustrate how different components of WARP work together in the context of a simple Wiki
application. In this case, no attack takes place, but most of the steps taken by WARP remain the same
as in a case with an attack.
Consider a user who, during normal execution, clicks on a link to edit a Wiki page. The user’s
browser issues an HTTP request to edit.php. WARP’s browser extension intercepts this request, adds
client ID, visit ID, and request ID HTTP headers to it, and records the request in its log (§5.1). The
web server receives this request and dispatches it to WARP’s PHP module. The PHP module assigns
this request a unique server-side request ID, records the HTTP request information along with the
server-side request ID, and forwards the request to the PHP runtime.
As WARP’s PHP runtime executes edit.php, it intercepts three types of operations. First, for
each non-deterministic function call, it records the arguments and the return value (§3.1). Second, for
each operation that loads an additional PHP source file, it records the file name (§3.1). Third, for each
database query, it records the query, rewrites the query to implement WARP’s time-travel database,
and records the result set and the row IDs of all rows modified by the query (§4).
Once edit.php completes execution, the response is recorded by the PHP module and returned
to the browser. When the browser loads the page, WARP’s browser extension attaches handlers to
intercept user input, and records all intercepted actions in its log (§5.2). The WARP browser extension
periodically uploads its log to the server.
When a patch fixing a vulnerability in edit.php becomes available, the administrator instructs
WARP to perform retroactive patching. The WARP repair controller uses the action history graph to
locate all PHP executions that loaded edit.php and queues them for re-execution; the user edit action
described above would be among this set.
To re-execute this page in repair mode, the repair controller launches a browser on the server,
identical to the user’s browser, and instructs it to replay the user session. The browser re-issues the
same requests, and the WARP browser extension assigns the same IDs to the request as during normal
execution (§5.3). The WARP PHP module forwards this request to the repair controller, which launches
WARP’s PHP runtime to re-execute it.
During repair, the PHP runtime intercepts two types of operations. For non-deterministic function
calls, it checks whether the same function was called during the original execution, and if so, re-uses
the original return value (§3.3). For database queries, it forwards the query to the repair controller for
re-execution.
To re-execute a database query, the repair controller determines the rows and partitions that the
query depends on, rolls them back to the right version (for a write operation), rewrites the query to
support time-travel and generations, executes the resulting query, and returns the result to the PHP
runtime (§4).
After a query re-executes, the repair controller uses the action history graph to find other database
queries that depended on the partitions affected by the re-executed query (assuming it was a write).
For each such query, the repair controller checks whether their return values would now be different.
If so, it queues the page visits that issued those queries for re-execution.
After edit.php completes re-execution, the HTTP response is returned to the repair controller,
which forwards it to the re-executing browser via the PHP module. Once the response is loaded in the
browser, the WARP browser extension replays the original user inputs on that page (§5.3). If conflicts
arise, WARP flags them for manual repair (§5.4).
WARP’s repair controller continues repairing pages in this manner until all affected pages are
re-executed. Even though no attack took place in this example, this re-execution algorithm would
repair from any attack that exploited the vulnerability in edit.php.
Evaluation
In evaluating WARP, we answer several questions. §8.1 shows what it takes to port an existing web
application to WARP. §8.2 shows what kinds of attacks WARP can repair from, what attacks can
be detected and fixed with retroactive patching, how much re-execution may be required, and how
often users need to resolve conflicts. §8.3 shows the effectiveness of WARP’s browser re-execution in
reducing user conflicts. §8.4 compares WARP with the state-of-the-art work in data recovery for web
applications [1]. Finally, §8.5 measures WARP’s runtime cost.
We ported a popular Wiki application, MediaWiki [21], to use WARP, and used several previously
discovered vulnerabilities to evaluate how well WARP can recover from intrusions that exploit those
bugs. The results show that WARP can recover from six common attack types, that retroactive patching
detects and repairs all tested software bugs, and that WARP’s techniques reduce re-execution and user
conflicts. WARP’s overheads are 24–27% in throughput and 2–3.2 GB/day of storage.
Application changes
We did not make any changes to MediaWiki source code to port it to WARP. To choose row IDs for
each MediaWiki table, we picked a primary or unique key column whose value MediaWiki assigns
once during creation of a row and never overwrites. If there is no such column in a table, WARP adds
a new row id column to the table, transparent to the application. We chose partition columns for each
table by analyzing the typical queries made by MediaWiki and picking the columns that are used in
the WHERE clauses of a large number of queries on that table. In all, this required a total of 89 lines of
annotation for MediaWiki’s 42 tables.
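A hypothetical fragment of such annotations might look like the following; the format and the specific column choices are invented for illustration and are not WARP's or MediaWiki's actual annotation syntax:

```python
# Invented annotation format for two MediaWiki tables, for illustration
# only: each table names an immutable row-ID column and the columns
# used for partition-based dependency tracking.
ANNOTATIONS = {
    "page": {
        "row_id": "page_id",  # primary key, assigned once, never overwritten
        "partitions": ["page_namespace", "page_title"],
    },
    "user": {
        "row_id": "user_id",
        "partitions": ["user_name"],
    },
}

def partition_key(table, row):
    """Compute the partition values WARP would track for a row."""
    cols = ANNOTATIONS[table]["partitions"]
    return tuple(row[col] for col in cols)
```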
Recovery from attacks
To evaluate how well WARP can recover from intrusions, we constructed six worst-case attack scenarios
based on five recent vulnerabilities in MediaWiki and one configuration mistake by the administrator,
shown in Table 2. After each attack, users browse the Wiki site, both reading and editing Wiki pages.
Our scenarios purposely create significant interaction between the attacker’s changes and legitimate
users, to stress WARP’s recovery aspects. If WARP can disentangle these challenging attacks, it can
also handle any simpler attack.
In the stored XSS attack, the attacker injects malicious JavaScript code into a MediaWiki page.
When a victim visits that Wiki page, the attacker’s JavaScript code appends text to a second Wiki page
that the victim has access to, but the attacker does not. The SQL injection and reflected XSS attacks
are similar in design. Successful recovery from these three attacks requires deleting the attacker’s
JavaScript code; detecting what users were affected by that code; undoing the effects of the JavaScript
code in their browsers (i.e., undoing the edits to the second page); verifying that the appended text
did not cause browsers of users that visited the second page to misbehave; and preserving all users’
legitimate actions.
The CSRF attack is a login CSRF attack, where the goal of the attacker is to trick the victim into
making her edits on the Wiki under the attacker’s account. When the victim visits the attacker’s site,
the attack exploits the CSRF vulnerability to log the victim out of the Wiki site and log her back in
under the attacker’s account. The victim then interacts with the Wiki site, believing she is logged in as
herself, and edits various pages. A successful repair in this scenario would undo all of the victim’s
edits under the attacker’s account, and re-apply them under the victim’s own account.
In the clickjacking attack, the attacker’s site loads the Wiki site in an invisible frame and tricks
the victim into thinking she is interacting with the attacker’s site, while in fact she is unintentionally
interacting with the Wiki site, logged in as herself. Successful repair in this case would undo all
modifications unwittingly made by the user through the clickjacked frame.
We used retroactive patching to recover from all the above attacks, with patches implementing the
fixes shown in Table 2.
Finally, we considered a scenario where the administrator of the Wiki site mistakenly grants a
user access to Wiki pages she should not have been given access to. At a later point of time, the
administrator detects the misconfiguration, and initiates undo of his action using WARP. Meanwhile,
the user has used her elevated privileges to edit pages that she should not have been able to edit in the
first place. Successful recovery, in this case, would undo all the modifications by the unprivileged user.
For each of these scenarios we ran a workload with 100 users. For all scenarios except the ACL
error scenario, we have one attacker, three victims that were subject to attack, and 96 unaffected users.
For the ACL error scenario, we have one administrator, one unprivileged user that takes advantage of
the administrator’s mistake, and 98 other users. During the workloads, all users login, read, and edit
Wiki pages. In addition, in all scenarios except the ACL error, the victims visit the attacker’s web site,
which launches the attack from their browser.
Table 3 shows the results of repair for each of these scenarios. First, WARP can successfully repair
all of these attacks. Second, retroactive patching detects and repairs from intrusions due to all five
software vulnerabilities; the administrator does not need to detect or track down the initial attacks.
Finally, WARP has few user-visible conflicts. Conflicts arise either because a user was tricked by
the attacker into performing some browser action, or because the user should not have been able to
perform the action in the first place. The conflicts in the clickjacking scenario are of the first type;
we expect users would cancel their page visit on conflict, since they did not mean to interact with the
MediaWiki page on the attack site. The conflict in the ACL error scenario is of the second type, since
the user no longer has access to edit the page; in this case, the user’s edit has already been reverted,
and the user can resolve the conflict by, perhaps, editing a different page.
Browser re-execution effectiveness
We evaluated the effectiveness of browser re-execution in WARP by considering three types of attack
code, for an XSS attack. The first is a benign, read-only attack where the attacker’s JavaScript
code runs in the user’s browser but does not modify any Wiki pages. The second is an append-only
attack, where the malicious code appends text to the victim’s Wiki page. Finally, the overwrite attack
completely corrupts the victim’s Wiki page.
We ran these attacks under three configurations of the client browser: First, without WARP’s
browser extension; second, with WARP’s browser extension but without WARP’s text merging for user
input; and third, with WARP’s complete browser extension. Our experiment had one attacker and eight
victims. Each user logged in, visited the attack page to trigger one of the three above attacks, edited
Wiki pages, and logged out.
Table 4 shows the results when WARP is invoked to retroactively patch the XSS vulnerability.
Without WARP’s browser extension, WARP cannot verify whether the attacker’s JavaScript code was
benign or not, and raises a conflict for every victim of the XSS attack. With the browser extension but
without text-merging, WARP can verify that the read-only attack was benign, and raises no conflict, but
cannot re-execute the user’s page edits if the attacker did modify the page slightly, raising a conflict in
that scenario. Finally, WARP’s full browser extension is able to re-apply the user’s page edits despite
the attacker’s appended text, and raises no conflict in that situation. When the attacker completely
corrupts the page, applying the user's original changes in the absence of the attack is meaningless, and a
conflict is always raised.
Recovery comparison with prior work
Here we compare WARP with state-of-the-art work in data recovery for web applications by Akkuş and
Goel [1]. Their system uses taint tracking in web applications to recover from data corruption bugs.
In their system, the administrator identifies the request that triggered the bug, and their system uses
several dependency analysis policies to do offline taint analysis and compute dependencies between
the request and database elements. The administrator uses these dependencies to manually undo the
corruption. Each specific policy can output too many dependencies (false positives), leading to lost
data, or too few (false negatives), leading to incomplete recovery.
Akkuş and Goel used five corruption bugs from popular web applications to evaluate their system.
To compare WARP with their system, we evaluated WARP with four of these bugs—two each in Drupal
and Gallery2. The remaining bug is in WordPress, which does not support our Postgres database.
Porting the buggy versions of Drupal and Gallery2 to use WARP did not require any changes to
source code. We replicated each of the four bugs under WARP. Once we verified that the bugs were
triggered, we retroactively patched the bug. Repair did not require any user input, and after repair, the
applications functioned correctly without any corrupted data.
Table 5 summarizes this evaluation. WARP has three key advantages over Akkuş and Goel’s
system. First, unlike their system, WARP never incurs false negatives and always leaves the application
in an uncorrupted state. Second, WARP only requires the administrator to provide the patch that fixes
the bug, whereas Akkuş and Goel require the administrator to manually guide the dependency analysis
by identifying requests causing corruption, and by whitelisting database tables. Third, unlike WARP,
their system cannot recover from attacks on web applications, and cannot recover from problems that
occur in the browser.
Performance evaluation
In this subsection, we evaluate WARP’s performance under different scenarios. In these experiments,
we ran the server on a 3.07 GHz Intel Core i7 950 machine with 12 GB of RAM. WARP’s repair
algorithm is currently sequential. Running it on a machine with multiple cores makes it difficult to
reason about the CPU usage of various components of WARP; so we ran the server with only one core
turned on and with hyperthreading turned off. However, during normal execution, WARP can take full
advantage of multiple processor cores when available.
Logging overhead.
We first evaluate the overhead of using WARP by measuring the performance
of MediaWiki with and without WARP for two workloads: reading Wiki pages, and editing Wiki pages.
The clients were 8 Firefox browsers running on a machine different from the server, sending requests
as fast as possible; the server experienced 100% CPU load. The client and server machines were
connected with a 1 Gbps network.
Table 6 shows the throughput of MediaWiki with and without WARP, and the size of WARP’s
logs. For the reading and editing workloads, respectively, WARP incurs throughput overheads of 24%
and 27%, and storage costs of 3.71 KB and 7.34 KB per page visit (or 2 GB/day and 3.2 GB/day
under continuous 100% load). Many web applications already store similar log information; a 1 TB
drive could store about a year’s worth of logs at this rate, allowing repair from attacks within that time
period. We believe that this overhead would be acceptable to many applications, such as a company’s
Wiki or a conference reviewing web site.
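The retention claim can be cross-checked with simple arithmetic; the per-day storage figures come directly from the text:

```python
# Sanity-check the paper's storage numbers. The daily log sizes
# (2 GB/day reading, 3.2 GB/day editing) are from the text; the rest
# is arithmetic.
KB, GB, TB = 1e3, 1e9, 1e12

def days_of_logs(disk_bytes, bytes_per_day):
    """How many days of logs fit on a disk of the given size."""
    return disk_bytes / bytes_per_day

# At the editing workload's 3.2 GB/day under continuous full load, a
# 1 TB drive holds roughly 312 days of logs -- "about a year's worth".
retention_days = days_of_logs(1 * TB, 3.2 * GB)
```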
To evaluate the overhead of WARP’s browser extension, we measured the load times of a Wiki
page in the browser with and without the WARP extension. This experiment was performed with an
unloaded MediaWiki server. The load times were 0.21 secs and 0.20 secs with and without the WARP
extension respectively, showing that the WARP browser extension imposes negligible overhead.
Finally, WARP indexes its logs to support incremental loading of its dependency graph during
repair. In our current prototype, for convenience, indexing is implemented as a separate step after
normal execution. This indexing step takes 24–28 ms per page visit for the workloads we tested. If
done during normal execution, this would add less than an additional 12% overhead.
Repair performance.
We evaluate WARP’s repair performance by considering four scenarios.
First, we consider a scenario where a retroactive patch affects a small, isolated part of the action
history graph. This scenario evaluates WARP’s ability to efficiently load and redo only the affected
actions. To evaluate this scenario, we used the XSS, SQL injection, and ACL error workloads from
§8.2 with 100 users, and victim page visits at the end of the workload. The results are shown in the
first four rows of Table 7. The re-executed actions columns show that WARP re-executes only a small
fraction of the total number of actions in the workload, and a comparison of the original execution
time and total repair time columns shows that repair in these scenarios takes an order of magnitude
less time than the original execution time.
Second, we evaluate a scenario where the patch affects a small part of the action history graph
as before, but the affected actions in turn may affect several other actions. To test this scenario, we
used the reflected XSS workload with 100 users, but with victims at the beginning of the workload,
rather than at the end. Re-execution of the victims’ page visits in this case causes the database state to
change, which affects non-victims’ page visits. This scenario tests WARP’s ability to track database
dependencies and selectively re-execute database queries without having to re-execute non-victim
page visits. The results for this scenario are shown in the fifth row of Table 7.
A comparison of the results for both the reflected XSS attack scenarios shows that WARP re-executes the same number of page visits in both cases, but the number of database queries is significantly greater when victims are at the beginning. These extra database queries are queries from
non-victim page visits which depend on the database partitions that changed as a result of re-executing
victim pages. These queries are of two types: SELECT queries that need to be re-executed to check
whether their result has changed, and UPDATE queries that need to be re-executed to update the rolled-back database rows belonging to the affected database partitions. From the repair time breakdown
columns, we see that the graph loading for these database query actions and their re-execution are the
main contributors to the longer repair time for this scenario, as compared to when victims were at the
end of the workload. Furthermore, we see that the total repair time is about one-third of the time for
original execution, and so WARP’s repair is significantly better than re-executing the entire workload.
Third, we consider a scenario where a patch requires all actions in the history to be re-executed.
We use the CSRF and clickjacking attacks as examples of this scenario. The results are shown in
the last two rows of Table 7. WARP takes an order of magnitude more time to re-execute all the
actions in the graph than the original execution time. Our unoptimized repair controller prototype
is currently implemented in Python, and the step-by-step re-execution of the repaired actions is a
significant contributor to this overhead. We believe implementing WARP in a more efficient language,
such as C++, would significantly reduce this overhead.
Finally, we evaluate how WARP scales to larger workloads. We measure WARP’s repair performance for XSS, SQL injection, and ACL error workloads, as in the first scenario, but with 5,000 users
instead of 100. The results for this experiment are shown in Table 8. The number of actions affected by
the attack remain the same, and only those actions are re-executed as part of the repair. This indicates
WARP successfully avoids re-execution of requests that were not affected by the attack. Differences
in the number of re-executed actions (e.g., in the stored XSS attack) are due to non-determinism
introduced by MediaWiki object caching. We used a stock MediaWiki installation for our experiments,
in which MediaWiki caches results from past requests in an objectcache database table. During
repair, MediaWiki may invalidate some of the cache entries, resulting in more re-execution.
The repair time for the 5,000-user workload is only 3× the repair time for 100 users, for all
scenarios except SQL injection, despite the 50× increase in the overall workload. This suggests that
WARP’s repair time does not increase linearly with the size of the workload, and is mostly determined
by the number of actions that must be re-executed during repair. The SQL injection attack had a 10×
increase in repair time because the number of database rows affected by the attack increases linearly
with the number of users. The attack injects the SQL query UPDATE pagecontent SET old_text
= old_text || 'attack', which modifies every page. Recovering from this attack requires rolling
back all the users’ pages, and the time to do that increases linearly with the total number of users.
Concurrent repair overhead.
When repair is ongoing, WARP allows the web application to
continue normal operation using repair generations. To evaluate repair generations, we measured the
performance of MediaWiki for the read and edit workloads from §8.5 while repair is underway for the
CSRF attack.
The results are shown in the “During repair” column of Table 6. They demonstrate that WARP
allows MediaWiki to remain online and function normally while repair is ongoing, albeit at lower
performance: it serves 24% to 30% fewer page visits per second than if there were no repair
in progress. The drop in performance is due to both repair and normal execution sharing the same
machine resources. This can be alleviated if dedicated resources (e.g., a dedicated processor core)
were available for repair.
Related work
The two closest pieces of work related to WARP are the Retro intrusion recovery system [14] and the
web application data recovery system by Akkuş and Goel [1].
While WARP builds on ideas from Retro, Retro focuses on shell-oriented Unix applications on a
single machine. WARP extends Retro with three key ideas to handle web applications. First, Retro
requires an intrusion detection system to detect attacks, and an expert administrator to track down
the root cause of every intrusion; WARP’s retroactive patching allows an administrator to simply
supply a security patch for the application’s code. Second, Retro’s file- and process-level rollback and
dependency tracking cannot perform fine-grained rollback and dependency analysis for individual
SQL queries that operate on the same table, and cannot perform online repair, whereas WARP's time-travel
database can.³ Third, repairing any network I/O in Retro requires user input; in a web application, this
would require every user to resolve conflicts at the TCP level. WARP’s browser re-execution eliminates
the need to resolve most conflicts, and presents a meaningful UI for true conflicts that require user input.
Akkuş and Goel’s data recovery system uses taint tracking to analyze dependencies between HTTP
requests and database elements, and thereby recover from data corruption errors in web applications.
However, it can only recover from accidental mistakes, as opposed to malicious attacks (in part due to
relying on white-listing to reduce false positives), and requires administrator guidance to reduce false
positives and false negatives. WARP can fully recover from data corruptions due to bugs as well as
attacks, with no manual intervention (except when there are conflicts during repair). §8.4 compared
WARP to Akkuş and Goel’s system in more detail.
Provenance-aware storage systems [24, 26] record dependency information similar to WARP, and
can be used by an administrator to track down the effects of an intrusion or misconfiguration. Margo
and Seltzer’s browser provenance system [20] shows how provenance information can be extended to
web browsers. WARP similarly tracks provenance information across web servers and browsers, and
aggregates this information at the server, but WARP also records sufficient information to re-execute
browser events and user input in a new context during repair. However, our WARP prototype does not
help users understand the provenance of their own data.
Ibis [28] and PASSv2 [25] show how to incorporate provenance information across multiple layers
in a system. While WARP only tracks dependencies at a fixed level (SQL queries, HTTP requests,
and browser DOM events), we hope to adopt ideas from these systems in the future, to recover from
intrusions that span many layers (e.g., the database server or the language runtime).
WARP’s idea of retroactive patching provides a novel approach to intrusion detection, which can be
used on its own to detect whether recently patched vulnerabilities have been exploited before the patch
was applied. Work on vulnerability-specific predicates [13] is similar in its use of re-execution (at
the virtual machine level), but requires writing specialized predicates for each vulnerability, whereas
WARP only requires the patch itself.
Much of the work on intrusion detection and analysis [5, 11, 15, 16, 18, 32] is complementary
to WARP, and can be applied in parallel. When an intrusion is detected and found using an existing
intrusion detection tool, the administrator can use WARP to recover from the effects of that intrusion
in a web application.
Polygraph [19] recovers from compromises in a weakly consistent replication system. Unlike
WARP, Polygraph does not attempt to preserve legitimate changes to affected files, and does not
attempt to automate detection of compromises. Polygraph works well for applications that do not
operate on multiple files at once. In contrast, WARP deals with web applications, which frequently
access shared data in a single SQL database.
³ One of Retro's scenarios involved database repair, but it worked by rolling back the entire database file and re-executing every SQL query.
Tracking down and reverting malicious actions has been explored in the context of databases [2, 17].
WARP cannot rely purely on database transaction dependencies, because web applications tend to
perform significant amounts of data processing in the application code and in web browsers, and
WARP tracks dependencies across all those components. WARP’s time-travel database is in some ways
reminiscent of a temporal database [29, 30]. However, unlike a temporal database, WARP has no need
for more complex temporal queries; supports two time-like dimensions (wall-clock time and repair
generations); and allows partitioning rows for dependency analysis.
Many database systems exploit partitioning for performance; WARP uses partitioning for dependency analysis. The problem of choosing a suitable partitioning has been addressed in the context
of minimizing distributed transactions on multiple machines [3], and in the context of index selection [6, 12]. These techniques might be helpful in choosing a partitioning for tables in WARP.
Mugshot [22] performs deterministic recording and replay of JavaScript events, but cannot replay
events on a changed web page. WARP must replay user input on a changed page in order to re-apply
legitimate user changes after effects of the attack have been removed from a page. WARP’s DOM-level
replay matches event targets between record and replay even if other parts of the page differ.
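A minimal sketch of DOM-level target matching, assuming each recorded event carries the target's element id plus a child-index path as a fallback (both details are illustrative assumptions, not WARP's specified matching rules):

```python
# Sketch of re-targeting a recorded event on a possibly changed page.
# A DOM node is modeled as {"id": ..., "tag": ..., "children": [...]}.
# Matching by element id first, then by structural path, is an
# illustrative policy, not necessarily WARP's exact one.
def find_by_id(node, elem_id):
    if node.get("id") == elem_id:
        return node
    for child in node.get("children", []):
        hit = find_by_id(child, elem_id)
        if hit is not None:
            return hit
    return None

def find_by_path(node, path):
    """Follow a recorded child-index path, e.g. [0, 2] = first child,
    then that child's third child."""
    for idx in path:
        children = node.get("children", [])
        if idx >= len(children):
            return None  # page changed too much: caller reports a conflict
        node = children[idx]
    return node

def replay_target(root, event):
    target = find_by_id(root, event["target_id"])
    return target if target is not None else find_by_path(root, event["path"])
```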
Discussion
While our prototype depends on a browser extension to record client-side events and user input, we
believe it would be possible to do so in pure JavaScript as well. In future work, we plan to explore
this possibility, perhaps leveraging Caja [23] to wrap existing JavaScript code and record all browser
events and user input; the browser’s same-origin policy already allows JavaScript code to perform all
of the necessary logging. We also plan to verify that DOM-level events recorded in one browser can
be re-executed in a different standards-compliant browser. In the meantime, we note that operators of
complex web applications often already have an infrastructure of virtual machines and mobile phone
emulators for testing across browser platforms, and a similar infrastructure could be used for WARP's re-execution.
The client-side logs, uploaded by WARP’s extension to the server, can contain sensitive information.
For example, if a user enters a password on one of the pages of a web application, the user’s key
strokes will be recorded in this log, in case that page visit needs to be re-executed at a later time.
Although this information is accessible to web applications even without WARP, applications might
not record or store this information on their own, and WARP must safeguard this additional stored
information from unintended disclosure.
In future work, we plan to explore ways in which WARP-aware applications can avoid logging
known-sensitive data, such as passwords, by modifying replay to assume that a valid (or invalid)
password was supplied, without having to re-enter the actual password. The logs can also be encrypted
so that the administrator must provide the corresponding decryption key to initiate repair. An alternative
design—storing the logs locally on each client machine and relying on client machines to participate in
the repair process—would prevent a single point of compromise for all logs, but would make complete
repair a lengthy process, since each client machine will have to come online to replay its log.
WARP’s current design cannot re-execute mashup web applications (i.e., those involving multiple
web servers), since the event logs for each web application’s frame would be uploaded to a different
web server. We plan to explore re-execution of such multi-origin web applications, as long as all of
the web servers involved in the mashup support WARP. The approach we imagine taking is to have the
client sign each event that spans multiple origins (such as a postMessage between frames) with a
private key corresponding to the source origin. This would allow WARP re-executing at the source
origin’s server to convince WARP on the other frame’s origin server that it should be allowed to initiate
re-execution for that user.
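The cross-origin authentication idea could be sketched as follows. The paper envisions per-origin private-key signatures; for a self-contained standard-library example, this sketch uses an HMAC shared between the two origins' WARP servers, which is a stand-in for (not an instance of) the proposed public-key design:

```python
import hmac
import hashlib
import json

# Illustrative sketch of authenticating a cross-frame event (e.g. a
# postMessage) during repair. An HMAC shared between the two origins'
# servers stands in for the per-origin private-key signature the text
# actually envisions.
def sign_event(key: bytes, event: dict) -> str:
    msg = json.dumps(event, sort_keys=True).encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def verify_event(key: bytes, event: dict, sig: str) -> bool:
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(sign_event(key, event), sig)
```

With real signatures, the receiving origin's WARP server could verify the event without sharing any secret with the sender, which is what makes cross-origin re-execution trustworthy.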
Retroactive patching by itself cannot be used to recover from attacks that resulted from leaked
credentials. For example, an attacker can use an existing XSS vulnerability in an application to
steal a user’s credentials and use them to impersonate the user and perform unauthorized actions.
Retroactive patching of the XSS vulnerability cannot distinguish the actions of the attacker’s browser
from legitimate actions of the user’s browser, as both used the same credentials. However, if the user
is willing to identify the legitimate browsers, WARP can undo the actions performed by the attacker's browser.
We plan to explore tracking dependencies at multiple levels of abstraction, borrowing ideas from
prior work [14, 25, 28]. This may allow WARP to recover from compromises in lower layers of
abstraction, such as a database server or the application’s language runtime. We also hope to extend
WARP’s undo mechanism higher into the application, to integrate with application-level undo features,
such as MediaWiki’s revert mechanism.
In our current prototype, we instrument the web application server to log HTTP requests and
database queries. This requires that the application server be fully trusted to not tamper with WARP
logging, and requires modification of the application server software, which may not always be
possible. It also does not support replicated web application servers, as the logs for a replica contain
the local times at that replica, which are not directly comparable to local times at other replicas. In
future work, we plan to explore an alternative design with WARP proxies in front of the application’s
HTTP load balancer and the database, and perform logging in those proxies. This design addresses
the above limitations, but can lead to more re-execution during repair, as it does not capture the exact
database queries made for each HTTP request.
We also plan to explore techniques to further reduce the number of application runs re-executed
due to retroactive patching, by determining which runs actually invoked the patched function, instead
of the runs that just loaded the patched file.
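This finer-grained filtering could be sketched as an intersection between each run's recorded call trace and the set of functions the patch modifies; the data shapes below are assumptions for illustration:

```python
# Sketch of function-level retroactive-patch filtering. call_traces
# maps each recorded run to the functions it invoked; a run is queued
# for re-execution only if it actually called a patched function,
# rather than merely loading the patched file. The trace format is
# hypothetical.
def runs_to_reexecute(call_traces, patched_functions):
    patched = set(patched_functions)
    return {run for run, calls in call_traces.items() if patched & set(calls)}
```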
Our current prototype assumes that the application code does not change, other than through
retroactive patching. While this assumption is unrealistic, fixing it is straightforward. WARP’s
application repair manager would need to record each time the application’s source code changed.
Then, during repair, the application manager would roll back these source code changes (when rolling
back to a time before these changes were applied), and would re-apply these patches as the repaired
timeline progressed (in the process merging these original changes with any newly supplied retroactive patches).
Conclusion
This paper presented WARP, an intrusion recovery system for web applications. WARP introduced
three key ideas to make intrusion recovery practical. Retroactive patching allows administrators
to recover from past intrusions by simply supplying a new security patch, without having to even
know if an attack occurred. The time-travel database allows WARP to perform precise repair of just
the affected parts of the system. Finally, DOM-level replay of user input allows WARP to preserve
legitimate changes with no user input in many cases. A prototype of WARP can recover from attacks,
misconfigurations, and data loss bugs in three applications, without requiring any code changes, and
with modest runtime overhead.
Acknowledgments
We thank Victor Costan, Frans Kaashoek, Robert Morris, Jad Naous, Hubert Pham, Eugene Wu,
the anonymous reviewers, and our shepherd, Yuanyuan Zhou, for their feedback. This research was
partially supported by the DARPA Clean-slate design of Resilient, Adaptive, Secure Hosts (CRASH)
program under contract #N66001-10-2-4089, by NSF award CNS-1053143, by Quanta, and by Google.
Taesoo Kim is partially supported by the Samsung Scholarship Foundation. The opinions in this paper
do not necessarily represent DARPA or official US policy.
References
[1] İ. E. Akkuş and A. Goel. Data recovery for web applications. In Proceedings of the 40th Annual
IEEE/IFIP International Conference on Dependable Systems and Networks, Chicago, IL, Jun–Jul 2010.
[2] P. Ammann, S. Jajodia, and P. Liu. Recovery from malicious transactions. Transactions on
Knowledge and Data Engineering, 14:1167–1185, 2002.
[3] C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database
replication and partitioning. Proceedings of the VLDB Endowment, 3(1), 2010.
[4] D. Cortesi. Twitter StalkDaily worm postmortem.
[5] G. W. Dunlap, S. T. King, S. Cinar, M. Basrai, and P. M. Chen. ReVirt: Enabling intrusion
analysis through virtual-machine logging and replay. In Proceedings of the 5th Symposium on
Operating Systems Design and Implementation, pages 211–224, Boston, MA, Dec 2002.
[6] S. Finkelstein, M. Schkolnick, and P. Tiberio. Physical database design for relational databases.
ACM Transactions on Database Systems, 13(1):91–128, 1988.
[7] C. Goldfeder. Gmail snooze with apps script. http://googleappsdeveloper.blogspot.
[8] D. Goodin. Surfing Google may be harmful to your security. The Register, Aug 2008. http:
[9] Google, Inc. Google apps script.
[10] S. Gordeychik. Web application security statistics.
[11] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls.
Journal of Computer Security, 6:151–180, 1998.
[12] M. Y. L. Ip, L. V. Saxton, and V. V. Raghavan. On the selection of an optimal set of indexes.
IEEE Trans. Softw. Eng., 9(2):135–143, 1983.
[13] A. Joshi, S. King, G. Dunlap, and P. Chen. Detecting past and present intrusions through
vulnerability-specific predicates. In Proceedings of the 20th ACM Symposium on Operating
Systems Principles, pages 91–104, Brighton, UK, Oct 2005.
[14] T. Kim, X. Wang, N. Zeldovich, and M. F. Kaashoek. Intrusion recovery using selective
re-execution. In Proceedings of the 9th Symposium on Operating Systems Design and Implementation, pages 89–104, Vancouver, Canada, Oct 2010.
[15] S. T. King and P. M. Chen. Backtracking intrusions. ACM Transactions on Computer Systems,
23(1):51–76, Feb 2005.
[16] W. Lee, S. J. Stolfo, and P. K. Chan. Learning patterns from Unix process execution traces for
intrusion detection. In Proceedings of the AAAI Workshop on AI Approaches in Fraud Detection
and Risk Management, pages 50–56, Jul 1997.
[17] P. Liu, P. Ammann, and S. Jajodia. Rewriting histories: Recovering from malicious transactions.
Journal of Distributed and Parallel Databases, 8:7–40, 2000.
[18] B. Livshits and W. Cui. Spectator: Detection and containment of JavaScript worms. In
Proceedings of the 2008 USENIX Annual Technical Conference, Boston, MA, Jun 2008.
[19] P. Mahajan, R. Kotla, C. C. Marshall, V. Ramasubramanian, T. L. Rodeheffer, D. B. Terry, and
T. Wobber. Effective and efficient compromise recovery for weakly consistent replication. In
Proceedings of the ACM EuroSys Conference, Nuremberg, Germany, Mar 2009.
[20] D. W. Margo and M. Seltzer. The case for browser provenance. In Proceedings of the 1st
Workshop on the Theory and Practice of Provenance, San Francisco, CA, Feb 2009.
[21] MediaWiki. MediaWiki.
[22] J. Mickens, J. Elson, and J. Howell. Mugshot: Deterministic capture and replay for JavaScript
applications. In Proceedings of the 7th Symposium on Networked Systems Design and Implementation, San Jose, CA, Apr 2010.
[23] M. S. Miller, M. Samuel, B. Laurie, I. Awad, and M. Stay. Caja: Safe active content in sanitized
JavaScript, 2008.
[24] K.-K. Muniswamy-Reddy, D. Holland, U. Braun, and M. Seltzer. Provenance-aware storage
systems. In Proceedings of the 2006 USENIX Annual Technical Conference, Boston, MA,
May–Jun 2006.
[25] K.-K. Muniswamy-Reddy, U. Braun, D. Holland, P. Macko, D. Maclean, D. W. Margo, M. Seltzer,
and R. Smogor. Layering in provenance systems. In Proceedings of the 2009 USENIX Annual
Technical Conference, San Diego, CA, Jun 2009.
[26] K.-K. Muniswamy-Reddy, P. Macko, and M. Seltzer. Provenance for the cloud. In Proceedings
of the 8th Conference on File and Storage Technologies, San Jose, CA, Feb 2010.
[27] National Vulnerability Database. CVE statistics, Feb 2011.
[28] C. Olston and A. D. Sarma. Ibis: A provenance manager for multi-layer systems. In Proceedings
of the 5th Biennial Conference on Innovative Data Systems Research, Pacific Grove, CA, Jan
[29] Oracle Corporation. Oracle flashback technology.
[30] R. T. Snodgrass and I. Ahn. Temporal databases. IEEE Computer, 19(9):35–42, Sep 1986.
[31] J. Tyson. Recent Facebook XSS attacks show increasing sophistication. Apr 2011.
[32] C. Warrender, S. Forrest, and B. Pearlmutter. Detecting intrusions using system calls: Alternative
data models. In Proceedings of the 20th IEEE Symposium on Security and Privacy, Oakland,
CA, May 1999.
[33] K. Wickre. About that fake post.
Stored XSS: the user options (wgDB*) in the live web-based installer (config/index.php) are not HTML-escaped. Fix: sanitize all user options with htmlspecialchars() (r46889).
Reflected XSS: the name of the contribution link (Special:Block?ip) is not HTML-escaped. Fix: sanitize the ip parameter with htmlspecialchars() (r52521).
SQL injection: the language identifier, thelang, is not properly sanitized. Fix: sanitize the thelang parameter with wfStrencode().
CSRF: HTML/API login interfaces do not properly handle an unintended login attempt (login CSRF). Fix: include a random challenge token in a hidden form field for every login attempt (r64677).
Clickjacking: a malicious website can embed MediaWiki within an iframe. Fix: add X-Frame-Options:DENY to HTTP headers (r79566).
ACL error: administrator accidentally grants admin privileges to a user. Fix: revoke the user’s admin privileges.
Table 2: Security vulnerabilities and corresponding fixes for MediaWiki. Where available, we indicate the revision number of each fix in MediaWiki’s subversion repository, in parentheses.
Columns: attack scenario; initial repair; # users with conflicts.
Table 3: WARP repairs the attack scenarios listed in Table 2. The initial repair column indicates the method used to initiate repair.
Columns: attack action; number of users with conflict; no extension; no text merge.
Table 4: Effectiveness of WARP UI repair. Each entry indicates whether a user-visible conflict was observed during repair. This experiment involved eight victim users and one attacker.
FP without / with table-level white-listing:
Drupal – lost voting info: 89 / 0
Drupal – lost comments: 95 / 0
Gallery2 – removing perms: 82 / 10
Gallery2 – resizing images: 119 / 0
Table 5: Comparison of WARP with Akkuş and Goel’s system [1]. FP reports false positives. Akkuş and Goel can also incur false negatives, unlike WARP. False positives are reported for the best dependency policy in [1] that has no false negatives for these bugs, although there is no single best policy for that system. The numbers shown before and after the slash are without and with table-level white-listing, respectively.
Columns: page visits / second; data stored per page visit.
Table 6: Overheads for users browsing and editing Wiki pages in MediaWiki. The first numbers are page visits per second without WARP, with WARP installed, and with WARP while repair is concurrently underway. A single page visit in MediaWiki can involve multiple HTTP and SQL queries. Data stored per page visit includes all dependency information (compressed) and database checkpoints.
Columns: attack scenario; number of re-executed actions (page visits, app. runs, SQL queries); original exec. time; repair time breakdown.
Table 7: Performance of WARP in repairing attack scenarios described in Table 2 for a workload with 100 users. The “re-executed actions” columns show the number of re-executed actions out of the total number of actions in the workload. The execution times are in seconds. The “original execution time” column shows the CPU time taken by the web application server, including time taken by database queries. The “repair time breakdown” columns show, respectively, the total wall clock repair time, the time to initialize repair (including time to search for attack actions), the time spent loading nodes into the action history graph, the CPU time taken by the re-execution Firefox browser, the time taken by re-executed database queries that are not part of a page re-execution, time taken to re-execute page visits including time to execute database queries issued during page re-execution, time taken by WARP’s repair controller, and time for which the CPU is idle during repair.
Rows: Reflected XSS; Stored XSS; SQL injection; ACL error; Reflected XSS (victims at start).
Columns: number of re-executed actions (page visits, app. runs, SQL queries); exec. time; repair time breakdown.
Table 8: Performance of WARP in attack scenarios for workloads of 5,000 users. See Table 7 for a description of the columns.
Software fault isolation with
API integrity and multi-principal modules
Yandong Mao, Haogang Chen, Dong Zhou† , Xi Wang,
Nickolai Zeldovich, and M. Frans Kaashoek
MIT CSAIL, † Tsinghua University IIIS
The security of many applications relies on the kernel being secure, but history suggests that kernel
vulnerabilities are routinely discovered and exploited. In particular, exploitable vulnerabilities in
kernel modules are common. This paper proposes LXFI, a system which isolates kernel modules from
the core kernel so that vulnerabilities in kernel modules cannot lead to a privilege escalation attack. To
safely give kernel modules access to complex kernel APIs, LXFI introduces the notion of API integrity,
which captures the set of contracts assumed by an interface. To partition the privileges within a shared
module, LXFI introduces module principals. Programmers specify principals and API integrity rules
through capabilities and annotations. Using a compiler plugin, LXFI instruments the generated code
to grant, check, and transfer capabilities between modules, according to the programmer’s annotations.
An evaluation with Linux shows that the annotations required on kernel functions to support a new
module are moderate, and that LXFI is able to prevent three known privilege-escalation vulnerabilities.
Stress tests of a network driver module also show that isolating this module using LXFI does not hurt
TCP throughput but reduces UDP throughput by 35%, and increases CPU utilization by 2.2–3.7×.
Kernel exploits are not as common as Web exploits, but they do happen [2]. For example, for the
Linux kernel, a kernel exploit is reported about once per month, and often these exploits attack
kernel modules instead of the core kernel [5]. These kernel exploits are devastating because they
typically allow the adversary to obtain “root” privilege. For instance, CVE-2010-3904 reports on a
vulnerability in Linux’s Reliable Datagram Socket (RDS) module that allowed an adversary to write an
arbitrary value to an arbitrary kernel address because the RDS page copy function missed a check on a
user-supplied pointer. This vulnerability can be exploited to overwrite function pointers and invoke
arbitrary kernel or user code. The contribution of this paper is LXFI, a new software fault isolation
system to isolate kernel modules. LXFI allows a module developer to partition the privileges held by
a single shared module into multiple principals, and provides annotations to enforce API integrity
for complex, irregular kernel interfaces such as the ones found in the Linux kernel and exploited by
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
SOSP '11, October 23-26, 2011, Cascais, Portugal.
Copyright © 2011 ACM 978-1-4503-0977-6/11/10 ... $10.00.
Previous systems such as XFI [9] have used software isolation [26] to isolate kernel modules from
the core kernel, thereby protecting against a class of attacks on kernel modules. The challenge is that
modules need to use support functions in the core kernel to operate correctly; for example, they need
to be able to acquire locks, copy data, and so on, which requires invoking functions in the kernel core for these
abstractions. Since the kernel does not provide type safety for pointers, a compromised module can
exploit some seemingly “harmless” kernel API to gain privilege. For instance, the spin_lock_init
function in the kernel writes the value zero to a spinlock that is identified by a pointer argument. A
module that can invoke spin_lock_init could pass the address of the user ID value in the current
process structure as the spinlock pointer, thereby tricking spin_lock_init into setting the user ID
of the current process to zero (i.e., root in Linux), and gaining root privileges.
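To make the hazard concrete, the attack on spin_lock_init can be modeled in a few lines of user-space C. The structures below are simplified stand-ins for the kernel’s, not its real definitions:

```c
#include <assert.h>

/* Simplified stand-ins for the kernel's spinlock, cred, and task structures. */
typedef struct { unsigned int raw_lock; } spinlock_t;
struct cred { unsigned int uid; };
struct task { struct cred cred; };

/* Models the kernel's spin_lock_init: it blindly zeroes the word it is given. */
static void spin_lock_init(spinlock_t *lock) { lock->raw_lock = 0; }

/* A compromised module passes the address of the uid field instead of a real
 * lock, turning a seemingly harmless API into a privilege escalation: the
 * uid becomes 0, i.e., root. */
static void exploit(struct task *current_task) {
    spin_lock_init((spinlock_t *) &current_task->cred.uid);
}
```

Because the store performed by spin_lock_init is indistinguishable from a legitimate lock initialization, no check inside the function itself can catch this; the contract must be enforced at the call boundary.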
Two recent software fault isolation systems, XFI and BGI [4], have two significant shortcomings.
First, neither can deal with complex, irregular interfaces; as noted by the authors of XFI, attacks by
modules that abuse an over-permissive kernel routine that a module is allowed to invoke remain an
open problem [9, §6.1]. BGI tackles this problem in the context of Windows kernel drivers, which have
a well-defined regular structure amenable to manual interposition on all kernel/module interactions.
The Linux kernel, on the other hand, has a more complex interface that makes manual interposition
difficult. For example, Linux kernel interfaces often store function pointers to both kernel and module
functions in data structures that are updated by modules, and invoked by the kernel in many locations.
The second shortcoming of XFI and BGI is that they cannot isolate different instances of the same
module. For example, a single kernel module might be used to implement many instances of the same
abstraction (e.g., many block devices or many sockets). If one of these instances is compromised by
an adversary, the adversary gains access to the privileges of all other instances as well.
This paper’s goal is to solve both of these problems for Linux kernel modules. To partition the
privileges held by a shared module, LXFI extends software fault isolation to allow modules to have
multiple principals. Principals correspond to distinct instances of abstractions provided by a kernel
module, such as a single socket or a block device provided by a module that can instantiate many
of them. Programmers annotate kernel interfaces to specify what principal should be used when the
module is invoked, and each principal’s privileges are kept separate by LXFI. Thus, if an adversary
compromises one instance of the module, the adversary can only misuse that principal’s privileges
(e.g., being able to modify data on a single socket, or being able to write to a single block device).
To handle complex kernel interfaces, LXFI introduces API integrity, which captures the contract
assumed by kernel developers for a particular interface. To capture API integrity, LXFI uses capabilities
to track the privileges held by each principal, and introduces light-weight annotations that programmers
use to express the API integrity of an interface in terms of capabilities and principals. LXFI enforces
API integrity at runtime through software fault isolation techniques.
To test out these ideas, we implemented LXFI for Linux kernel modules. The implementation provides
the same basic security properties as XFI and BGI, using similar techniques, but also enforces API
integrity for multiple principals. To use LXFI, a programmer must first specify the security policy for
an API, using source-level annotations. LXFI enforces the specified security policy with the help of
two components. The first is a compile-time rewriter, which inserts checks into kernel code that, when
invoked at runtime, verify that security policies are upheld. The second is a runtime component, which
maintains the privileges of each module and checks whether a module has the necessary privileges
for any given operation at runtime. To enforce API integrity efficiently, LXFI uses a number of
optimizations, such as writer-set tracking. To isolate a module at runtime, LXFI sets up the initial
capabilities, manages the capabilities as they are added and revoked, and checks them on all calls
between the module and the core kernel according to the programmer’s annotations.
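For instance, the contract for the spin_lock_init function from §1 could be captured by requiring the caller to hold a write capability for the lock’s memory. The annotation below is a hypothetical sketch of such a policy, not LXFI’s concrete syntax (which is introduced later in §3):

```
/* Hypothetical annotation sketch, not LXFI's actual syntax: the caller
 * must hold a write capability covering the lock object it passes in. */
void spin_lock_init(spinlock_t *lock)
    pre(writes(lock, sizeof(*lock)));
```

Under such a contract, the attack from §1 fails because the module holds no write capability for the credential structure it tries to pass as a lock.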
An evaluation of 10 kernel modules shows that supporting a new module requires 8–133 annotations,
many of which are shared between modules. Furthermore, the evaluation shows that LXFI can prevent
exploits for three CVE-documented vulnerabilities in kernel modules (including the RDS module).
Stress tests with a network driver module show that isolating this module using LXFI does not hurt
TCP throughput, but reduces UDP throughput by 35%, and increases CPU utilization by 2.2–3.7×.
The contributions of the paper are as follows. First, this paper extends the typical module isolation
model to support multiple principals per code module. Second, this paper introduces the notion of API
integrity, and provides a light-weight annotation language that helps describe the security properties of
kernel and module interfaces in terms of capabilities. Finally, this paper demonstrates that LXFI is
practical in terms of performance, security, and annotation effort by evaluating it on the Linux kernel.
The rest of the paper is organized as follows. The next section defines the goal of this paper, and the
threat model assumed by LXFI. §3 gives the design of LXFI and its annotations. We describe LXFI’s
compile-time and runtime components in §4 and §5, and discuss how we expect kernel developers to
use LXFI in practice in §6. §7 describes the implementation details. We evaluate LXFI in §8, discuss
related work in §9, and conclude in §10.
LXFI’s goal is to prevent an adversary from exploiting vulnerabilities in kernel modules in a way
that leads to a privilege escalation attack. Many exploitable kernel vulnerabilities are found in kernel
modules. For example, Chen et al. find that two thirds of kernel vulnerabilities reported by CVE
between Jan 2010 and March 2011 are in kernel modules [1, 5].
When adversaries exploit bugs in the kernel, they trick the kernel code into performing operations
that the code would not normally do. For example, an adversary can trick the kernel into writing to
arbitrary memory locations, or invoking arbitrary kernel code. Adversaries can leverage this to gain
additional privileges (e.g., by running their own code in the kernel, or overwriting the user ID of the
current process), or to disclose data from the system. The focus of LXFI is on preventing integrity
attacks (e.g., privilege escalation), and not on data disclosure.
In LXFI, we assume that we will not be able to fix all possible underlying software bugs, but instead
we focus on reducing the possible operations the adversary can trick the kernel into performing to
the set of operations that code (e.g., a kernel module) would ordinarily be able to do. For example, if
a module does not ordinarily modify the user ID field in the process structure, LXFI should prevent
the module from doing so even if it is compromised by an adversary. Similarly, if a module does not
ordinarily invoke kernel functions to write blocks to a disk, LXFI should prevent a module from doing
so, even if it is compromised.
LXFI’s approach to prevent privilege escalation is to isolate the modules from each other and from
the core of the kernel, as described above. Of course, a module may legitimately need to raise the
privileges of the current process, such as through setuid bits in a file system, so this approach will not
prevent all possible privilege escalation exploits. However, most of the exploits found in practice take
advantage of the fact that every buggy module is fully privileged, and making modules less privileged
will reduce the number of possible exploits.
Another challenge in making module isolation work lies in knowing what policy rules to enforce at
module boundaries. Since the Linux kernel was written without module isolation in mind, all such
rules are implicit, and can only be determined by manual code inspection. One possible solution would
be to re-design the Linux kernel to be more amenable to privilege separation, and to have simpler
interfaces where all privileges are explicit; however, doing this would involve a significant amount of
work. LXFI takes a different approach that tries to make as few modifications to the Linux kernel as
possible. To this end, LXFI, like previous module isolation systems [4, 9, 26], relies on developers to
specify this policy.
In the rest of this section, we will discuss two specific challenges that have not been addressed in prior
work that LXFI solves, followed by the assumptions made by LXFI. Briefly, the challenges have to
do with a shared module that has many privileges on behalf of its many clients, and with concisely
specifying module policies for complex kernel interfaces like the ones in the Linux kernel.
Privileges in shared modules
The first challenge is that a single kernel module may have many privileges if that kernel module
is being used in many contexts. For example, a system may use the dm-crypt module to manage
encrypted block devices, including both the system’s main disk and any USB flash drives that may
be connected by the user. The entire dm-crypt module must have privileges to write to all of these
devices. However, if the user accidentally plugs in a malicious USB flash drive that exploits a bug
in dm-crypt, the compromised dm-crypt module will be able to corrupt all of the block devices it
manages. Similarly, a network protocol module, such as econet, must have privileges to write to all
of the sockets managed by that module. As a result, if an adversary exploits a vulnerability in the
context of his or her econet connection, the adversary will be able to modify the data sent over any
other econet socket as well.
Lack of API integrity
The second challenge is that kernel modules use complex kernel interfaces. These kernel interfaces
could be mis-used by a compromised module to gain additional privileges (e.g., by corrupting memory).
One approach is to re-design kernel interfaces to make it easy to enforce safety properties, such as in
Windows, as illustrated by BGI [4]. However, LXFI’s goal is to isolate existing Linux kernel modules,
where many of the existing interfaces are complex.
To prevent these kinds of attacks in Linux, we define API integrity as the contract that developers
intend for any module to follow when using some interface, such as the memory allocator, the PCI
subsystem, or the network stack. The set of rules that make up the contract between a kernel module
and the core kernel are different for each kernel API, and the resulting operations that the kernel
module can perform are also API-specific. However, by enforcing API integrity—i.e., ensuring that
each kernel module follows the intended contract for core kernel interfaces that it uses—LXFI will
ensure that a compromised kernel module cannot take advantage of the core kernel’s interfaces to
perform more operations than the API was intended to allow.
To understand the kinds of contracts that an interface may require, consider a PCI network device
driver for the Linux kernel, shown in Figure 1. In the rest of this section, we will present several
examples of contracts that make up API integrity for this interface, and how a kernel module may
violate those contracts.
Memory safety and control flow integrity. Two basic safety properties that all software fault
isolation systems enforce are memory safety, which guarantees that a module can only access memory
that it owns or has legitimate access to, and control flow integrity, which guarantees that a module can
only execute its own isolated code and external functions that it has legitimate access to. However,
memory safety and control flow integrity are insufficient to provide API integrity, and the rest of this
section describes other safety properties enforced by LXFI.
Function call integrity. The first aspect of API integrity deals with how a kernel module may
invoke the core kernel’s functions. These contracts are typically concerned with the arguments that the
module can provide, and the specific functions that can be invoked, as we will now illustrate.
Many function call contracts involve the notion of object ownership. For example, when the network
device driver module in Figure 1 calls pci_enable_device to enable the PCI device on line 35, it is
expected that the module will provide a pointer to its own pci_dev structure as the argument (i.e., the
one it received as an argument to module_pci_probe). If the module passes in some other pci_dev
structure to pci_enable_device, it may be able to interfere with other devices, and potentially cause
problems for other modules. Furthermore, if the module is able to construct its own pci_dev structure,
and pass it as an argument, it may be able to trick the kernel into performing arbitrary device I/O or
memory writes.

 1  struct pci_driver {
 2      int (*probe) (struct pci_dev *pcidev);
 3      ...
 4  };
 5
 6  struct net_device {
 7      struct net_device_ops *dev_ops;
 8      ...
 9  };
10
11  struct net_device_ops {
12      netdev_tx_t (*ndo_start_xmit)
13          (struct sk_buff *skb,
14           struct net_device *dev);
15      ...
16  };
17
18  /* Exported kernel functions */
19  void pci_enable_device(struct pci_dev *pcidev);
20  void netif_rx(struct sk_buff *skb);
21
22  /* In core kernel code */
23  netif_napi_add(struct net_device *dev,
24                 struct napi_struct *napi,
25                 int (*poll) (struct napi_struct *, int));
26  ...
27  dev->dev_ops->ndo_start_xmit(skb, ndev);
28  (*poll) (napi, 5);
29
30  /* In network device driver’s module */
31  int
32  module_pci_probe(struct pci_dev *pcidev) {
33      ...
34      ndev = alloc_etherdev(...);
35      pci_enable_device(pcidev);
36      ndev->dev_ops->ndo_start_xmit = myxmit;
37      netif_napi_add(ndev, napi, my_poll_cb);
38      return 0;
39  }
40
41  /* In network device driver’s code */
42  ... netif_rx(skb); ...

Figure 1: Parts of a PCI network device driver in Linux.
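A runtime check for this ownership contract can be sketched as follows. This is our simplified user-space model (a flat capability table and a guarded wrapper), not LXFI’s implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's pci_dev. */
struct pci_dev { int devno; };

#define MAX_CAPS 16
static const void *cap_table[MAX_CAPS];   /* objects the module principal owns */

/* Record that the module principal owns an object. */
static void grant_cap(const void *obj) {
    for (int i = 0; i < MAX_CAPS; i++)
        if (cap_table[i] == NULL) { cap_table[i] = obj; return; }
}

static bool has_cap(const void *obj) {
    for (int i = 0; i < MAX_CAPS; i++)
        if (cap_table[i] == obj) return true;
    return false;
}

static bool enabled[2];   /* stand-in for per-device hardware state */

/* Guarded entry point: the kind of check LXFI's rewriter inserts at the
 * module/kernel boundary. The call is denied unless the module holds a
 * capability for the pci_dev it passes. */
static bool checked_pci_enable_device(struct pci_dev *pcidev) {
    if (!has_cap(pcidev))
        return false;               /* deny: module does not own this device */
    enabled[pcidev->devno] = true;  /* stand-in for the real enable logic */
    return true;
}
```

In this model, the capability for the module’s own pci_dev would be granted when the kernel invokes module_pci_probe, so only that pointer passes the check.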
A common type of object ownership is write access to memory. Many core kernel functions write to a
memory address supplied by the caller, such as spin_lock_init from the example in §1. In these
cases, a kernel module should only be able to pass addresses of kernel memory it has write access to
(for a sufficient number of bytes) to such functions; otherwise, a kernel module may trick the kernel
into writing to arbitrary kernel memory. On the other hand, a kernel module can also have ownership
of an object without being able to write to its memory: in the case of the network device, modules
should not directly modify the memory contents of their pci_dev struct, since it would allow the
module to trick the kernel into controlling a different device, or dereferencing arbitrary pointers.
Another type of function call contract relates to callback functions. Several kernel interfaces involve
passing around callback functions, such as the netif_napi_add interface on line 23. In this case, the
kernel invokes the poll function pointer at a later time, and expects that this points to a legitimate
function. If the module is able to provide arbitrary function pointers, such as my_poll_cb on line 37,
the module may be able to trick the kernel into running arbitrary code when it invokes the callback on
line 28. Moreover, the module should be able to provide only pointers to functions that the module
itself can invoke; otherwise, it can trick the kernel into running a function that it is not allowed to call.
Function callbacks are also used in the other direction: for modules to call back into the core kernel.
Once the core kernel has provided a callback function to a kernel module, the module is expected to
invoke the callback, typically with a prescribed callback argument. The module should not invoke the
callback function before the callback is provided, or with a different callback argument.
Data structure integrity. In addition to memory safety, many kernel interfaces assume that the
actual data stored in a particular memory location obeys certain invariants. For example, an sk_buff
structure, representing a network packet in the Linux kernel, contains a pointer to packet data. When
the module passes an sk_buff structure to the core kernel on line 42, it is expected to provide a
legitimate data pointer inside of the sk_buff, and that pointer should point to memory that the kernel
module has write access to (in cases when the sk_buff’s payload is going to be modified). If this
invariant is violated, the kernel code can be tricked into writing to arbitrary memory.
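A minimal model of this invariant check, under the simplifying assumption that the module’s writable memory is a single contiguous region (unlike a real kernel), might look like:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's sk_buff: a data pointer and length. */
struct sk_buff { unsigned char *data; size_t len; };

/* The module principal's writable memory, modeled as one region. */
struct region { uintptr_t start, end; };
static struct region writable;

static bool module_may_write(const void *p, size_t n) {
    uintptr_t a = (uintptr_t) p;
    return a >= writable.start && a + n <= writable.end;
}

/* The invariant the kernel assumes when accepting an sk_buff from a module:
 * the payload pointer must lie in memory the module has write access to. */
static bool skb_ok(const struct sk_buff *skb) {
    return module_may_write(skb->data, skb->len);
}
```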
Another kind of data structure integrity deals with function pointers that are stored in shared memory.
The Linux kernel often stores callback function pointers in data structures. For example, the core
kernel invokes a function pointer from dev->dev_ops on line 27. The implicit assumption the kernel
is making is that the function pointer points to legitimate code that should be executed. However, if
the kernel module was able to write arbitrary values to the function pointer field, it could trick the core
kernel into executing arbitrary code. Thus, in LXFI, even though the module can write a legitimate
pointer on line 36, it should not be able to corrupt it later. To address this problem, LXFI checks
whether the function pointer value that is about to be invoked is a legitimate function address that
the pointer’s writer was itself allowed to invoke.
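The following sketch models this check in simplified form (ours, not LXFI’s actual writer-set tracking): each function-pointer slot records the principal that last wrote it, and an indirect call succeeds only if that writer held a call capability for the target:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*fn_t)(void);

static int calls;
static void legit_fn(void) { calls++; }   /* a known, callable target */

/* A function-pointer slot remembers which principal last wrote it. */
struct slot { fn_t fn; int writer; };

#define NPRINC 4
#define NFUNCS 4
static bool may_call[NPRINC][NFUNCS];   /* call capabilities per principal */
static fn_t func_tab[NFUNCS];           /* registry of legitimate functions */

static void slot_write(struct slot *s, fn_t fn, int principal) {
    s->fn = fn;
    s->writer = principal;
}

/* The kernel invokes the pointer only if it names a known function that
 * the slot's writer was itself allowed to call. */
static bool checked_invoke(const struct slot *s) {
    for (int f = 0; f < NFUNCS; f++)
        if (func_tab[f] == s->fn && may_call[s->writer][f]) {
            s->fn();
            return true;
        }
    return false;   /* unknown target, or writer lacked the capability */
}
```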
API integrity in Linux. In the general case, it is difficult to find or enumerate all of the contracts
necessary for API integrity. However, in our experience, kernel module interfaces in Linux tend to be
reasonably well-structured, and it is possible to capture the contracts of many interfaces in a succinct
manner. Even though these interfaces are not used as security boundaries in the Linux kernel, they are
carefully designed by kernel developers to support a range of kernel modules, and contain many sanity
checks to catch buggy behavior by modules (e.g., calls to BUG()).
LXFI relies on developers to provide annotations capturing the API integrity of each interface. LXFI
provides a safe default, in that a kernel function with no annotations (e.g., one that the developer
forgot to annotate) cannot be accessed by a kernel module. However, LXFI trusts any annotations that
the developer provides; if there is any mistake or omission in an annotation, LXFI will enforce the
policy specified in the annotation, and not the intended policy. Finally, in cases when it is difficult
to enforce API integrity using LXFI, re-designing the interface to fit LXFI’s annotations may be
necessary (however, we have not encountered any such cases for the modules we have annotated).
Threat model
LXFI makes two assumptions to isolate kernel modules. First, LXFI assumes that the core kernel, the
annotations on the kernel’s interfaces, and the LXFI verifier itself are correct.
Second, LXFI infers the initial privileges that a module should be granted based on the functions
that module’s code imports. Thus, we trust that the programmer of each kernel module only invokes
functions needed by that module. We believe this is an appropriate assumption because kernel
developers are largely well-meaning, and do not try to access unnecessary interfaces on purpose. Thus,
by capturing the intended privileges of a module, and by looking at the interfaces required in the
source code, we can prevent an adversary from accessing any additional interfaces at runtime.
At a high level, LXFI’s workflow consists of four steps. First, kernel developers annotate core kernel
interfaces to enforce API integrity between the core kernel and modules. Second, module developers
annotate certain parts of their module where they need to switch privileges between different module
principals. Third, LXFI’s compile-time rewriter instruments the generated code to perform API
integrity checks at runtime. Finally, LXFI’s runtime is invoked at these instrumented points, and
performs the necessary checks to uphold API integrity. If the checks fail, the kernel panics. The rest
of this section describes LXFI’s principals, privileges, and annotations.
Many modules provide an abstraction that can be instantiated many times. For example, the econet
protocol module provides an econet socket abstraction that can be instantiated to create a specific
socket. Similarly, device mapper modules such as dm-crypt and dmraid provide a layered block
device abstraction that can be instantiated for a particular block device.
To minimize the privileges that an adversary can take advantage of when they exploit a vulnerability
in a module, LXFI logically breaks up a module into multiple principals corresponding to each
instance of the module’s abstraction. For example, each econet socket corresponds to a separate
module principal, and each block device provided by dm-crypt also corresponds to a separate module
principal. Each module principal will have access to only the privileges needed by that instance of the
module’s abstraction, and not to the global privileges of the entire module.
To support this plan, LXFI provides three mechanisms. First, LXFI allows programmers to define
principals in a module. To avoid requiring existing code to keep track of LXFI-specific principals,
LXFI names module principals based on existing data structures used to represent an instance of the
module’s abstraction. For example, in econet, LXFI uses the address of the socket structure as the
principal name. Similarly, in device mapper modules, LXFI uses the address of the block device.
Second, LXFI allows programmers to define what principal should be used when invoking a module,
by providing annotations on function types (which we discuss more concretely in §3.3). For example,
when the kernel invokes the econet module to send a message over a socket, LXFI should execute
the module’s code in the context of that socket’s principal. To achieve this behavior, the programmer
annotates the message send function to specify that the socket pointer argument specifies the principal
name. At runtime, LXFI uses this annotation to switch the current principal to the one specified by the
function’s arguments when the function is invoked. These principal identifiers are stored on a shadow
stack, so that if an interrupt comes in while a module is executing, the module’s privileges are saved
before handling the interrupt, and restored on interrupt exit.
Third, a module may share some state between multiple instances. For example, the econet module
maintains a linked list of all sockets managed by that module. Since each linked list pointer is stored
in a different socket object, no single instance principal is able to add or remove elements from this
list. Performing these cross-instance operations requires global privileges of the entire module. In
these cases, LXFI allows programmers to switch the current principal to the module’s global principal,
which implicitly has access to the capabilities of all other principals in that module. For example, in
econet, the programmer would modify the function used to add or remove sockets from this linked
list to switch to running as the global principal. Conversely, a shared principal is used to represent
privileges accessible to all principals in a module, such as the privileges to invoke the initial kernel
functions required by that module. All principals in a module implicitly have access to all of the
privileges of the shared principal.
To ensure that a function that switches to the global principal cannot be tricked into misusing its global
privileges, programmers must insert appropriate checks before every such privilege change. LXFI’s
control flow integrity then ensures that these checks cannot be bypassed by an adversary at runtime. A
similar requirement arises for other privileged LXFI functions, such as manipulating principals. We
give an example of such checks in §3.4.
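The check-before-switch pattern can be sketched in ordinary user-space C. All names and data structures below are illustrative stand-ins, not LXFI's actual kernel API: a guarded list-insertion routine verifies that the caller holds a capability for the socket before raising its privilege to the global principal.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of guarded privilege raising. Each socket instance
 * is its own principal; cross-instance list surgery runs as the global
 * principal, and the switch is preceded by a check that the caller
 * holds a capability for the socket. All names are illustrative. */

enum principal { PRINC_INSTANCE, PRINC_GLOBAL };

static enum principal current_princ = PRINC_INSTANCE;
static void *socket_with_ref;   /* the one socket the caller may use */
static int list_len;            /* module-global socket list length */

static int has_ref(void *sock) { return sock == socket_with_ref; }

/* Returns 0 on success, -1 if the guard check fails. */
static int econet_list_add(void *sock) {
    if (!has_ref(sock))
        return -1;                    /* check BEFORE raising privilege */
    current_princ = PRINC_GLOBAL;     /* switch to the global principal */
    list_len++;                       /* cross-instance list update */
    current_princ = PRINC_INSTANCE;   /* drop back to instance privilege */
    return 0;
}
```

Because control flow integrity guarantees the check cannot be jumped over, the privileged list update is reachable only along the checked path.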
Capabilities
Modules do not explicitly define the privileges they require at runtime, such as what memory they may write or what functions they may call. Moreover, even for functions that a module may legitimately need, the function itself may expect the module to invoke it in certain ways, as described in §2.2 and Figure 1.
To keep track of module privileges, LXFI maintains a set of capabilities, similar to BGI, that track the
privileges of each module principal at runtime. LXFI supports three types of capabilities, as follows:
WRITE (ptr, size). This capability means that a module can write any values to memory region [ptr, ptr + size) in the kernel address space. It can also pass addresses inside the region to kernel routines that require writable memory. For example, the network device module in Figure 1 would have a WRITE capability for its sk_buff packets and their payloads, which allows it to modify those packets.
REF (t, a). This capability allows the module to pass a as an argument to kernel functions that require a REF capability of type t, capturing the object ownership idea from §2. Type t is often the C type of the argument, although it need not be the case, and we describe situations in which this happens in §6. Unlike the WRITE capability, REF (t, a) does not grant write access to memory at address a. For instance, in our network module, the module should receive a REF (pci_dev, pcidev) capability when the core kernel invokes module_driver->probe on line 20, if that code was annotated to support LXFI capabilities. This capability would then allow the module to call pci_enable_device on line 35.
CALL (a). The module can call or jump to a target memory address a. In our network module example, the module has a CALL capability for netif_rx, pci_enable_device, and others; this particular example has no instances of dynamic call capabilities provided to the module by the core kernel at runtime.
The basic operations on capabilities are granting a capability, revoking all copies of a capability, and checking whether a caller has a capability. To set up the basic execution environment for a module, LXFI grants a module initial capabilities when the module is loaded, which include: (1) a WRITE capability to its writable data section; (2) a WRITE capability to the current kernel stack (not including the shadow stack, which we describe later); and (3) CALL capabilities to all kernel routines that are imported in the module's symbol table.
A module can gain or lose additional capabilities when it calls support functions in the core kernel. For example, after a module calls kmalloc, it gains a WRITE capability to the newly allocated memory. Similarly, after calling kfree, LXFI's runtime revokes the corresponding WRITE capability from that module.
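This allocation-time capability flow can be modeled in user-space C. The wrapper and table below are hypothetical stand-ins (LXFI instruments real kmalloc/kfree calls through annotations, not explicit wrappers); revocation on kfree is modeled here by revoke_write.

```c
#include <assert.h>
#include <stdlib.h>

/* User-space model of WRITE-capability bookkeeping around allocation.
 * The wrapper name and the fixed-size capability table are illustrative. */

struct wcap { void *ptr; size_t size; int live; };
static struct wcap caps[16];

static void grant_write(void *p, size_t n) {
    for (int i = 0; i < 16; i++)
        if (!caps[i].live) { caps[i] = (struct wcap){p, n, 1}; return; }
}

/* Model of the transfer performed when the module calls kfree. */
static void revoke_write(void *p) {
    for (int i = 0; i < 16; i++)
        if (caps[i].live && caps[i].ptr == p)
            caps[i].live = 0;
}

/* Would an instrumented module write to [p, p+n) be allowed? */
static int check_write(void *p, size_t n) {
    for (int i = 0; i < 16; i++)
        if (caps[i].live &&
            (char *)p >= (char *)caps[i].ptr &&
            (char *)p + n <= (char *)caps[i].ptr + caps[i].size)
            return 1;
    return 0;
}

/* Model of the post-annotation on kmalloc: grant WRITE on success. */
static void *kmalloc_wrapper(size_t n) {
    void *p = malloc(n);
    if (p)
        grant_write(p, n);
    return p;
}
```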
Interface annotations
annotation ::= pre(action) | post(action) | principal(c-expr)
action ::= copy(caplist)
| transfer(caplist)
| check(caplist)
| if (c-expr) action
caplist ::= (c, ptr, [size])
| iterator-func(c-expr)
Figure 2: Grammar for LXFI annotations. A c-expr corresponds to a C expression that can
reference the annotated function’s arguments and its return value. An iterator-func is a name
of a programmer-supplied C function that takes a c-expr argument, and iterates over a set of
capabilities. c specifies the type of the capability (either WRITE, CALL, or REF, as described in
§3.2), and ptr is the address or argument for the capability. The size parameter is optional, and
defaults to sizeof(*ptr).
pre(copy(c, ptr, [size])): Check that caller owns capability c for [ptr, ptr + size) before calling function; copy capability c from caller to callee for [ptr, ptr + size) before the call.
post(copy(c, ptr, [size])): Check that callee owns capability c for [ptr, ptr + size) after the call; copy capability c from callee to caller for [ptr, ptr + size) after the call.
pre(transfer(c, ptr, [size])): Check that caller owns capability c for [ptr, ptr + size) before calling function; transfer capability c from caller to callee for [ptr, ptr + size) before the call.
post(transfer(c, ptr, [size])): Check that callee owns capability c for [ptr, ptr + size) after the call; transfer capability c from callee to caller for [ptr, ptr + size) after the call.
pre(check(c, ptr, [size])): Check that the caller has the (c, ptr, [size]) capability.
pre(check(skb_iter(skb))): Check that the caller has all capabilities returned by the programmer-supplied skb_iter function.
pre(if (c-expr) action), post(if (c-expr) action): Run the specified action if the expression c-expr is true; used for conditional annotations based on return value. LXFI allows c-expr to refer to function arguments, and (for post annotations) to the return value.
principal(p): Use p as the callee principal; in the absence of this annotation, LXFI uses the module's shared principal.
Figure 3: Examples of LXFI annotations, using the grammar shown in Figure 2, and their semantics.
Although the principal and capability mechanisms allow LXFI to reason about the privileges held
by each module principal, it is cumbersome for the programmer to manually insert calls to switch
principals, transfer capabilities, and verify whether a module has a certain capability for each kernel/module API function (as in BGI [4]). To simplify the programmer's job, LXFI allows programmers to annotate interfaces (i.e., prototype definitions in C) with principals and capability actions. LXFI leverages Clang's support for attributes to specify the annotations.
LXFI annotations are consulted when invoking a function, and can be associated (in the source code)
with function declarations, function definitions, or function pointer types. A single kernel function (or
function pointer type) can have multiple LXFI annotations; each one describes what action the LXFI
runtime should take, and specifies whether that action should be taken before the function is called, or
after the call finishes, as indicated by pre and post keywords. Figure 2 summarizes the grammar for
LXFI’s annotations.
There are three types of annotations supported by LXFI: pre, post, and principal. The first two
perform a specified action either before invoking the function or after it returns. The principal
annotation specifies the name of the module principal that should be used to execute the called
function, which we discuss shortly.
There are four actions that can be performed by either pre or post annotations. A copy action grants a
capability from the caller to the callee for pre annotations (and vice-versa for post). A transfer action
moves ownership of a capability from the caller to the callee for pre annotations (and vice-versa for
post). Both copy and transfer ensure that the capability is owned in the first place before granting it.
A check action verifies that the caller owns a particular capability; all check annotations are pre. To
support conditional annotations, LXFI supports if actions, which conditionally perform some action
(such as a copy or a transfer) based on an expression that can involve either the function’s arguments,
or, for post annotations, the function’s return value. For example, this allows transferring capabilities
for a memory buffer only if the return value does not indicate an error.
Transfer actions revoke the transferred capability from all principals in the system, rather than just
from the immediate source of the transfer. (As described above, transfers happen in different directions
depending on whether the action happens in a pre or post context.) Revoking a capability from all
principals ensures that no copies of the capability remain, and allows the object referred to by the
capability to be re-used safely. For example, the memory allocator’s kfree function uses transfer to
ensure no outstanding capabilities exist for free memory. Similarly, when a network driver hands a
packet to the core kernel, a transfer action ensures the driver—and any other module the driver could
have given capabilities to—cannot modify the packet any more.
The copy, transfer, and check actions take as argument the list of capabilities to which the action
should be applied. In the simple case, the capability can be specified inline, but the programmer can
also implement their own function that returns a list of capabilities, and use that function in an action
to iterate over all of the returned capabilities. Figure 3 provides several example LXFI annotations and
their semantics.
To specify the principal with whose privilege the function should be invoked, LXFI provides a
principal annotation. LXFI’s principals are named by arbitrary pointers. This is convenient because
Linux kernel interfaces often have an object corresponding to every instance of an abstraction that
a principal tries to capture. For example, a network device driver would use the address of its
net_device structure as the principal name to separate different network interfaces from each other.
Adding explicit names for principals would require extending existing Linux data structures to store
this additional name, which would require making changes to the Linux kernel, and potentially break
data structure invariants, such as alignment or layout.
One complication with LXFI's pointer-based principal naming scheme is that a single instance of a module's abstraction may have a separate data structure that is used for different interfaces. For
instance, a PCI network device driver may be invoked both by the network sub-system and by the
PCI sub-system. The network sub-system would use the pointer of the net_device structure as
the principal name, and the PCI sub-system would use the pointer of the pci_dev structure for the
principal. Even though these two names may refer to the same logical principal (i.e., a single physical
network card), the names differ.
To address this problem, LXFI separates principals from their names. This allows a single logical
principal to have multiple names, and LXFI provides a function called lxfi_princ_alias that a
module can use to map names to principals. The special values global and shared can be used as an
argument to a principal annotation to indicate the module’s global and shared principals, respectively.
For example, this can be used for functions that require access to the entire module’s privileges, such
as adding or removing sockets from a global linked list in econet.
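The name-to-principal indirection can be sketched as a small lookup table in user-space C. The table layout and helper names are invented for illustration; only the aliasing operation corresponds to lxfi_princ_alias (modeled here as princ_alias).

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of pointer-named principals with aliasing: two
 * names (e.g. the addresses of a pci_dev and a net_device) resolve to
 * one logical principal, so capabilities granted under either name are
 * shared. Layout and names are illustrative. */

struct principal { int ncaps; };                /* capability count only */
struct name_entry { void *name; struct principal *princ; };

static struct name_entry table[16];
static int nentries;

static struct principal *lookup(void *name) {
    for (int i = 0; i < nentries; i++)
        if (table[i].name == name)
            return table[i].princ;
    return NULL;
}

static void princ_new(void *name, struct principal *p) {
    table[nentries++] = (struct name_entry){name, p};
}

/* Make new_name another name for old_name's principal. */
static int princ_alias(void *old_name, void *new_name) {
    struct principal *p = lookup(old_name);
    if (!p)
        return -1;
    table[nentries++] = (struct name_entry){new_name, p};
    return 0;
}
```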
Annotation example
To give a concrete example of how LXFI’s annotations are used, consider the interfaces shown in
Figure 1, and their annotated version in Figure 4. LXFI’s annotations are underlined in Figure 4.
Although this example involves a significant number of annotations, we specifically chose it to illustrate
most of LXFI’s mechanisms.
To prevent modules from arbitrarily enabling PCI devices, the pci_enable_device function on
line 67 in Figure 4 has a check annotation that ensures the caller has a R EF capability for the
corresponding pci_dev object. When the module is first initialized for a particular PCI device, the
probe function grants it such a R EF capability (based on the annotation for the probe function pointer
on line 45). Note that if the probe function returns an error code, the post annotation on the probe
function transfers ownership of the pci_dev object back to the caller.
Once the network interface is registered with the kernel, the kernel can send packets by invoking the
ndo_start_xmit function. The annotations on this function, on line 60, grant the module access to
the packet, represented by the sk_buff structure. Note that the sk_buff structure is a complicated
object, including a pointer to a separate region of memory holding the actual packet payload. To
compute the set of capabilities needed by an sk_buff, the programmer writes a capability iterator
called skb_caps that invokes LXFI’s lxfi_cap_iterate function on all of the capabilities that
make up the sk_buff. This function in turn performs the requested operation (transfer, in this case)
based on the context in which the capability iterator was invoked. As with the PCI example above, the
annotations transfer the granted capabilities back to the caller in case of an error.
Note that, when the kernel invokes the device driver through ndo_start_xmit, it uses the pointer to the net_device structure as the principal name (line 60), even though the initial PCI probe function
used the pci_dev structure’s address as the principal (line 45). To ensure that the module has access
to the same set of capabilities in both cases, the module developer must create two names for the
corresponding logical principal, one using the pci_dev object, and one using the net_device object.
To do this, the programmer modifies the module’s code as shown in lines 72–73. This code creates a
new name, ndev, for an existing principal with the name pcidev on line 73. The check on line 72
ensures that this code will only execute if the current principal already has privileges for the pcidev
object. This ensures that an adversary cannot call the module_pci_probe function with some other
pcidev object and trick the code into setting up arbitrary aliases to principals. LXFI’s control flow
integrity ensures that an adversary is not able to transfer control flow directly to line 73. Moreover,
only direct control flow transfers to lxfi_princ_alias are allowed. This ensures that an adversary
cannot invoke this function by constructing and calling a function pointer at runtime; only statically
defined calls, which are statically coupled with a preceding check, are allowed.
When compiling the core kernel and modules, LXFI uses compiler plugins to insert calls and checks
into the generated code so that the LXFI runtime can enforce the annotations for API integrity and
principals. LXFI performs different rewriting for the core kernel and for modules. Since LXFI assumes
that the core kernel is fully trusted, it can omit most checks for performance. Modules are not fully
trusted, and LXFI must perform more extensive rewriting there.
Rewriting the core kernel
The only rewriting that LXFI must perform on core kernel code deals with invocation of function
pointers that may have been supplied by a module. If a module is able to supply a function pointer
that the core kernel will invoke, the module can potentially increase its privileges, if it tricks the kernel
struct pci_driver {
    int (*probe) (struct pci_dev *pcidev, ...)
        pre(copy(ref(struct pci_dev), pcidev))
        post(if (return < 0)
            transfer(ref(struct pci_dev), pcidev));
};

void skb_caps(struct sk_buff *skb) {
    lxfi_cap_iterate(write, skb, sizeof(*skb));
    lxfi_cap_iterate(write, skb->data, skb->len);
}

struct net_device_ops {
    netdev_tx_t (*ndo_start_xmit)
        (struct sk_buff *skb,
         struct net_device *dev)
        principal(dev)
        pre(transfer(skb_caps(skb)))
        post(if (return == NETDEV_TX_BUSY)
            transfer(skb_caps(skb)));
};

void pci_enable_device(struct pci_dev *pcidev)
    pre(check(ref(struct pci_dev), pcidev));

int module_pci_probe(struct pci_dev *pcidev) {
    ndev = alloc_etherdev(...);
    lxfi_check(ref(struct pci_dev), pcidev);
    lxfi_princ_alias(pcidev, ndev);
    ndev->dev_ops->ndo_start_xmit = myxmit;
    netif_napi_add(ndev, napi, my_poll_cb);
    return 0;
}
Figure 4: Annotations for parts of the API shown in Figure 1. The annotations follow the
grammar shown in Figure 2. Annotations and added code are underlined.
into performing a call that the module itself could not have performed directly. To ensure this is not
the case, LXFI performs two checks. First, prior to invoking a function pointer from the core kernel,
LXFI verifies that the module principal that supplied the pointer (if any) had the appropriate CALL
capability for that function. Second, LXFI ensures that the annotations for the function supplied by the
module and the function pointer type match. This ensures that a module cannot change the effective
annotations on a function by storing it in a function pointer with different annotations.
To implement this check, LXFI's kernel rewriter inserts a call to the checking function lxfi_check_indcall(void **pptr, unsigned ahash) before every indirect call in the core kernel, where pptr is the address of the module-supplied function pointer to be called, and ahash is the hash of the annotation for
the function pointer type. The LXFI runtime will validate that the module that writes function f to pptr has a CALL capability for f. To ensure that annotations match, LXFI compares the hash of the annotations for both the function and the function pointer type.
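A user-space sketch may clarify the two conditions, with the capability set and annotation hash reduced to single variables. Only the lxfi_check_indcall signature (modeled by check_indcall) is taken from the text; everything else is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* User-space sketch of the indirect-call check: before an indirect
 * call in the core kernel, (1) the module that wrote the function
 * pointer must hold a CALL capability for the target, and (2) the
 * annotation hash of the target must match the hash expected at the
 * call site. The single-capability state is illustrative. */

typedef void (*fn_t)(void);

static void f_good(void) {}   /* function the module may call */
static void f_evil(void) {}   /* function outside its CALL capabilities */

static fn_t module_call_cap;  /* the one CALL capability in this model */
static unsigned fn_ahash;     /* annotation hash of the permitted target */

static int check_indcall(fn_t *pptr, unsigned ahash) {
    fn_t f = *pptr;
    if (f != module_call_cap)
        return 0;             /* writer lacked a CALL capability */
    if (fn_ahash != ahash)
        return 0;             /* annotation mismatch */
    return 1;                 /* safe to make the indirect call */
}
```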
To optimize the cost of these checks, LXFI implements writer-set tracking. The runtime tracks the set of principals that have been granted a WRITE capability for each memory location after the last time that memory location was zeroed. Then, for each indirect-call check in the core kernel, the
LXFI runtime first checks whether any principal could have written to the function pointer about to be invoked. If not, the runtime can bypass the relatively expensive capability check for the function pointer.
handler_func_t handler;
handler = device->ops->handler;
lxfi_check_indcall(&device->ops->handler, ahash); /* not &handler */
Figure 5: Rewriting an indirect call in the core kernel. LXFI inserts checking code with the address of a module-supplied function pointer.
To detect the original memory location from which the function pointer was obtained, LXFI performs
a simple intra-procedural analysis to trace back the original function pointer. For example, as shown
in Figure 5, the core kernel may copy a module-supplied function pointer device->ops->handler
to a local variable handler, and then make a call using the local variable. In this case LXFI uses
the address of the original function pointer rather than the local variable for looking up the set of
writer principals. We have encountered 51 cases that our simple analysis cannot deal with, out of 7500
indirect call sites in the core kernel, in which the value of the called pointer originates from another
function. We manually verify that these 51 cases are safe.
Rewriting modules
LXFI inserts calls to the runtime when compiling modules based on annotations from the kernel
and module developers. The rest of this subsection describes the types of instrumentation that LXFI
performs for module C code.
Annotation propagation. To determine the annotations that should apply to a function, LXFI
first propagates annotations on a function pointer type to the actual function that might instantiate that
type. Consider the structure member probe in Figure 4, which is a function pointer initialized to the
module_pci_probe function. The function should get the annotations on the probe member. LXFI
propagates these annotations along initializations, assignments, and argument passing in the module’s
code, and computes the annotation set for each function. A function can obtain different annotations
from multiple sources. LXFI verifies that these annotations are exactly the same.
Function wrappers. At compile time, LXFI generates wrappers for each module-defined function, kernel-exported function, and indirect call site in the module. At runtime, when the kernel
calls into one of the module’s functions, or when the module calls a kernel-exported function, the
corresponding function wrapper is invoked first. Based on the annotations, the wrapper sets the
appropriate principal, calls the actions specified in pre annotations, invokes the original function, and
finally calls the actions specified in post annotations.
The function wrapper also invokes the LXFI runtime at its entry and exit, so that the runtime can
capture all control flow transitions between the core kernel and the modules. The relevant runtime
routines switch principals and enforce control flow integrity using a shadow stack, as we detail in the
next section (§5).
Module initialization. For each module, LXFI generates an initialization function that is invoked (without LXFI's isolation) when the module is first loaded, to grant an initial set of capabilities to the module. For each external function (except those defined in the LXFI runtime) imported in the module's symbol table, the initialization function grants a CALL capability for the corresponding function wrapper. Note that the CALL capabilities granted to the module are only for invoking wrappers. A module is not allowed to call any external functions directly, since that would bypass the annotations on those functions. For each external data symbol in the module's symbol table, the initialization function likewise grants a WRITE capability. The initial capabilities are granted to the module's shared principal, so that they are accessible to every other principal in the module.
Memory writes. LXFI inserts checking code before each memory write instruction to make sure that the current principal has the WRITE capability for the memory region being written to.
LXFI runtime
To enforce the specified API integrity, the LXFI runtime must track capabilities and ensure that the
necessary capability actions are performed on kernel/module boundaries. For example, before a
module invokes any kernel functions, the LXFI runtime validates whether the module has the privilege
(i.e., CALL capability) to invoke the function at that address, and whether the arguments passed by the module are safe for that call (i.e., the pre annotations allow it). Similarly, before the kernel invokes
any function pointer that was supplied by a module, the LXFI runtime verifies that the module had the
privileges to invoke that function in the first place, and that the annotations of the function pointer and
the invoked function match. These checks are necessary since the kernel is, in effect, making the call
on behalf of the module.
Figure 6 shows the design of the LXFI runtime. As the reference monitor of the system, it is invoked
on all control flow transitions between the core kernel and the modules (at instrumentation points
described in the previous section). The rest of this section describes the operations performed by the LXFI runtime.
Principals. The LXFI runtime keeps track of the principals for each kernel module, as well as
two special principals. The first is the module’s shared principal, which is initialized with appropriate
initial capabilities (based on the imports from the module’s symbol table); every other principal in the
module implicitly has access to the capabilities stored in this principal. The second is the module’s
global principal; it implicitly has access to all capabilities in all of the module’s principals.
Capability table. For each principal, LXFI maintains three capability tables (one per capability type), as shown in Figure 6. Efficiently managing capability tables is important to LXFI's performance. LXFI uses a hash table for each table to achieve constant lookup time. For CALL capabilities and REF capabilities, LXFI uses function addresses and referred addresses, respectively, as the hash keys.
WRITE capabilities do not naturally fit within a hash table, because they are identified by an address range, and capability checks can happen for any address within the range. To support fast range tests, LXFI inserts a WRITE capability into all possible hash table slots covered by its address range. LXFI reduces the number of insertions by masking the least significant bits of the address (the last 12 bits in practice) when calculating hash keys. Since kernel modules do not usually manipulate memory objects larger than a page (2^12 bytes), in our experience this data structure performs much better than a balancing tree, in which a lookup, commonly performed on WRITE capabilities, takes logarithmic time.
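The range-indexed hash scheme can be sketched in user-space C: a WRITE capability for [ptr, ptr + size) is inserted into the bucket of every page the range covers, so a later check for any address in the range scans exactly one bucket. Bucket counts and sizes below are arbitrary illustration values.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* User-space model of the range-indexed WRITE capability table. A
 * capability is inserted under the hash key of every page its range
 * covers (low 12 bits masked), making lookups single-bucket scans. */

#define NBUCKETS 64
#define PAGE_SHIFT 12

struct wcap { uintptr_t ptr; size_t size; };

static struct wcap buckets[NBUCKETS][8];
static int bucket_len[NBUCKETS];

static void insert_write_cap(uintptr_t ptr, size_t size) {
    uintptr_t first = ptr >> PAGE_SHIFT;
    uintptr_t last = (ptr + size - 1) >> PAGE_SHIFT;
    for (uintptr_t pg = first; pg <= last; pg++) {
        unsigned b = (unsigned)(pg % NBUCKETS);
        buckets[b][bucket_len[b]++] = (struct wcap){ptr, size};
    }
}

static int check_write_cap(uintptr_t addr, size_t n) {
    unsigned b = (unsigned)((addr >> PAGE_SHIFT) % NBUCKETS);
    for (int i = 0; i < bucket_len[b]; i++)
        if (addr >= buckets[b][i].ptr &&
            addr + n <= buckets[b][i].ptr + buckets[b][i].size)
            return 1;
    return 0;
}
```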
Shadow stack. LXFI maintains a shadow stack for each kernel thread to record LXFI-specific
context. The shadow stack lies adjacent to the thread’s kernel stack in the virtual address space, but is
only accessible to the LXFI runtime. It is updated at the entry and the exit of each function wrapper.
Figure 6: An overview of the LXFI runtime. Shaded components are parts of LXFI. Striped
components indicate isolated kernel modules. Solid arrows indicate control flow; the LXFI runtime interposes on all control flow transfers between the modules and the core kernel. Dotted
arrows indicate metadata tracked by LXFI.
To enforce control flow integrity on function returns, the LXFI runtime pushes the return address onto the shadow stack at the wrapper's entry, and validates its value at the exit to make sure that the return address is not corrupted. The runtime also saves and restores the principal on the shadow stack at the wrapper's entry and exit.
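A minimal user-space model of the shadow stack shows both roles: catching a corrupted return address, and saving and restoring the caller's principal across a wrapper call. The frame layout and principal encoding are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of the per-thread shadow stack. Wrapper entry
 * pushes the return address and the current principal; wrapper exit
 * restores the principal and reports whether the return address is
 * intact. */

struct frame { void *retaddr; int principal; };

static struct frame sstack[32];
static int depth;
static int cur_principal;

static void wrapper_enter(void *retaddr, int callee_principal) {
    sstack[depth++] = (struct frame){retaddr, cur_principal};
    cur_principal = callee_principal;
}

/* Returns 1 if the return address matches the pushed one, 0 if it was
 * corrupted while the wrapped function ran. */
static int wrapper_exit(void *retaddr) {
    struct frame f = sstack[--depth];
    cur_principal = f.principal;   /* restore caller's principal */
    return f.retaddr == retaddr;
}
```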
Writer-set tracking. To optimize the cost of indirect call checks, LXFI implements light-weight
writer-set tracking (as described in §4.1). LXFI keeps writer set information in a data structure similar
to a page table. The last level entries are bitmaps representing whether the writer set for a segment
of memory is empty or not. Checking whether the writer set for a particular address is empty takes
constant time. The actual contents of non-empty writer sets (i.e., what principal has WRITE access to a range of memory) are computed by traversing a global list of principals. Our experiments to date have involved a small number of distinct principals, leading to acceptable performance.
When a module is loaded, that module’s shared principal is added to the writer set for all of its writable
sections (including .data and .bss), because the section may contain writable function pointers that
the core kernel may try to invoke. The runtime adds additional entries to the writer set map as the
module executes and gains additional capabilities.
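The fast emptiness test can be modeled with a simple bitmap in user-space C. The segment granularity and flat table shape are invented for illustration; LXFI's actual structure resembles a page table with bitmap leaves.

```c
#include <assert.h>
#include <stdint.h>

/* User-space model of the writer-set fast path: one bit per memory
 * segment records whether any module principal may have written it,
 * so the common "no module writer" case is decided in constant time.
 * 64-byte segments and the flat bitmap are illustrative. */

#define SEG_SHIFT 6
#define NSEGS 1024

static uint8_t writer_bitmap[NSEGS / 8];

static void mark_module_write(uintptr_t addr) {
    uintptr_t seg = (addr >> SEG_SHIFT) % NSEGS;
    writer_bitmap[seg / 8] |= (uint8_t)(1u << (seg % 8));
}

/* Constant-time check: could a module have written addr's segment? */
static int maybe_module_written(uintptr_t addr) {
    uintptr_t seg = (addr >> SEG_SHIFT) % NSEGS;
    return (writer_bitmap[seg / 8] >> (seg % 8)) & 1;
}
```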
LXFI's writer-set tracking introduces both false positives and false negatives. A false positive arises when a WRITE capability for a function pointer was granted to some module's principal, but the principal did not write to the function pointer. This is benign, since it only introduces an unnecessary capability check. A false negative arises when the kernel copies pointers from a location that was modified by a module into its internal data structures, which were not directly modified by a module. At compile time, LXFI detects these cases, and we manually inspect such false negatives (see §4.1).
Guidelines
The most important step in enforcing API integrity is specifying the annotations on kernel/module
boundaries. If a programmer annotates APIs incorrectly, then an adversary may be able to exploit the
mistake to obtain increased privilege. We summarize guidelines for enforcing API integrity based on
our experience annotating 10 modules.
Guideline 1. Following the principle of least privilege, grant a REF capability instead of a WRITE capability whenever possible. This ensures that a module will be unable to modify the memory contents of an object unless absolutely necessary.
Guideline 2. For memory regions allocated by a module, grant WRITE capabilities to the module, and revoke them from the module on free. WRITE is needed because the module usually directly writes the memory it allocates (e.g., for initialization).
Guideline 3. If the module is required to pass a certain fixed value into a kernel API (e.g., an argument to a callback function, or an integer I/O port number to the inb and outb I/O functions), grant a REF capability for that fixed value with a special type, and annotate the function in question (e.g., the callback function, or inb and outb) to require a REF capability of that special type for its argument.
Guideline 4. When dealing with large data structures where the module only needs write access to a small number of the structure's members, modify the kernel API to provide stronger API integrity. For example, the e1000 network driver module writes to only five (out of 51) fields of the sk_buff structure. The current API design requires LXFI to grant the module a WRITE capability for the whole sk_buff structure. It would be safer to have the kernel provide functions to change the necessary fields in an sk_buff. Then LXFI could grant the module a REF capability, perhaps with a special type of sk_buff__fields, and have the annotation on the corresponding kernel functions require a REF capability of type sk_buff__fields.
Guideline 5. To isolate instances of a module from each other, annotate the corresponding interface with principal annotations. The pointer used as the principal name is typically the address of the main data structure associated with the abstraction, such as a socket, block device, network interface, etc.
Guideline 6. To manipulate privileges inside a module, make two types of changes to the
module’s code. First, in order to manipulate data shared between instances, insert a call to LXFI to
switch to the module’s global principal. Second, in order to create principal aliases, insert a similar call
to LXFI’s runtime. In both cases, the module developer needs to preface these privileged operations
with adequate checks to ensure that the functions containing these privileged operations are not abused
by an adversary at runtime.
Guideline 7. When APIs implicitly transfer privileges between the core kernel and modules, explicitly add calls from the core kernel to the module to grant the necessary capabilities. For example, the Linux network stack supports packet schedulers, represented by a struct Qdisc object. When
[Table: lines of code for the kernel rewriting plugin, the module rewriting plugin, and the runtime checker.]
Figure 7: Components of LXFI.
the kernel wants to assign a packet scheduler to a network interface, it simply changes a pointer in the network interface's struct net_device to point to the Qdisc object, and expects the module to access it.
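The change this guideline calls for can be sketched as follows (all names are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of Guideline 7: when the kernel hands a Qdisc-like object to a
 * module by storing a pointer into a device structure, the assignment is
 * paired with an explicit capability grant, making the implicit privilege
 * transfer visible to the LXFI runtime. */

struct toy_qdisc { int handle; };
struct toy_netdev { struct toy_qdisc *qdisc; };

static struct toy_qdisc *granted[8];
static size_t ngranted;

static void lxfi_grant(struct toy_qdisc *q) {
    if (ngranted < 8) granted[ngranted++] = q;
}

static int lxfi_has_cap(struct toy_qdisc *q) {
    for (size_t i = 0; i < ngranted; i++)
        if (granted[i] == q) return 1;
    return 0;
}

/* Before: dev->qdisc = q; transferred the object with no capability.
 * After: the kernel explicitly grants the module a capability as well. */
static void assign_qdisc(struct toy_netdev *dev, struct toy_qdisc *q) {
    dev->qdisc = q;
    lxfi_grant(q);   /* explicit grant added per Guideline 7 */
}
```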
We implemented LXFI for Linux 2.6.36 running on a single-core x86_64 system. Figure 7 shows
the components and the lines of code for each component. The kernel is compiled using gcc,
invoking the kernel rewriting plugin (the kernel rewriter). Modules are compiled using Clang with the
module rewriting plugin (the module rewriter), since Clang provides a more powerful infrastructure to
implement rewriting. The current implementation of LXFI has several limitations, as follows.
The LXFI rewriter implements an earlier version of the annotation language defined in §3. Both annotation languages can enforce the common idioms seen in the 10 annotated modules; however, we believe the new language is more succinct. We expect that the language will evolve further as we annotate more interfaces and discover other idioms.
The LXFI rewriter does not process assembly code, either in the core kernel or in modules. We manually inspected the assembly functions in the core kernel; none of them contains indirect calls. For modules, instrumentation is required if the assembly performs indirect calls or direct calls to an external function. In this case, the developer must manually instrument the assembly by inserting calls to the LXFI runtime checker. In our experience, modules use no assembly code that requires annotation.
LXFI requires all indirect calls in a module to be annotated to ensure API integrity. However, in some cases, the module rewriter fails to trace back to the function pointer declaration (e.g., because an earlier phase of the compiler optimized it away). In this case, the developer has to modify the module's source code (e.g., to avoid the compiler optimization). For the 10 modules we annotated, such cases are rare: we changed 18 lines of code.
API integrity requires a complete set of core kernel functions to be annotated. However, in some cases, the Linux kernel inlines kernel functions into modules. One approach is to annotate the inlined function, and let the module rewriter disable inlining of such functions. This approach, however, obscures the security boundary, because these functions are defined in the module but must be treated the same as kernel functions. LXFI requires the boundary between kernel and module to be in one location by making either all or none of the functions inlined. In our experience, Linux is already well-written in this regard, and we had to change fewer than 10 functions (by not inlining them into a module) to enforce API integrity on 10 modules.
As pointed out in §4.1, for indirect calls performed by the core kernel, LXFI checks that the annotation on the function pointer matches the annotation on the invoked function f. The current implementation of LXFI performs this check only when f has annotations, such as for module functions that are exported to the kernel through assignment. A stricter and safer check would be to require that f has annotations. Such a check is not implemented because when f is defined in the core kernel, f may be static and have no annotation. We plan to implement annotation propagation in the kernel rewriter to solve this problem.
This section experimentally evaluates the following four questions:
[Table: exploits in CAN_BCM [17], Econet [18], and RDS [19], with vulnerability types (integer overflow; NULL pointer dereference; missed privilege check; missed context resetting; missed check of a user-supplied pointer) and source locations.]
Figure 8: Linux kernel module vulnerabilities that result in 3 privilege escalation exploits, all of which are prevented by LXFI.
• Can LXFI stop exploits of kernel modules that have led to privilege escalation?
• How much work is required to annotate kernel/module interfaces?
• How much does LXFI slow down the SFI microbenchmarks?
• How much does LXFI slow down a Linux kernel module?
To answer the first question, we inspected 3 privilege escalation exploits based on 5 vulnerabilities in Linux kernel modules revealed in 2010. Figure 8 shows the three exploits and the corresponding vulnerabilities. LXFI successfully prevents all of the listed exploits, as described next.
CAN_BCM. Jon Oberheide posted an exploit that gains root privilege by exploiting an integer overflow vulnerability in the Linux CAN_BCM module [17]. The overflow is in the bcm_rx_setup function, which is triggered when the user tries to send a carefully crafted message through CAN_BCM. In particular, bcm_rx_setup allocates nframes*16 bytes of memory from a slab, where nframes is supplied by the user. By passing a large value, the allocation size overflows, and the module receives less memory than it asked for. This allows an attacker to write an arbitrary value into the slab object that directly follows the objects allocated to CAN_BCM. In the posted exploit, the author first arranges for the kernel to allocate a shmid_kernel slab object at a memory location directly following CAN_BCM's undersized buffer. The exploit then overwrites this shmid_kernel object through CAN_BCM, and finally tricks the kernel into calling a function pointer that is indirectly referenced by the shmid_kernel object, leading to root privilege escalation.
To test the exploit against LXFI, we ported Oberheide's exploit from x86 to x86_64, since it depends on the size of a pointer. LXFI prevents this exploit as follows. When the allocation size overflows, LXFI grants the module a WRITE capability for only the number of bytes corresponding to the actual allocation size, rather than what the module asked for. When the module tries to write to an adjacent object in the same slab, LXFI detects that the module has no WRITE capability and raises an error.
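The overflow arithmetic, and why a capability sized by the actual allocation catches the out-of-bounds write, can be sketched as follows (values and function names are illustrative, not the exploit's exact parameters):

```c
#include <assert.h>
#include <stdint.h>

/* nframes*16 computed in 32-bit arithmetic wraps around, so the slab
 * allocation is far smaller than the nframes the module later trusts. */
static uint32_t bcm_alloc_size(uint32_t nframes) {
    return nframes * 16u;            /* wraps modulo 2^32 */
}

/* LXFI grants WRITE for the actual (wrapped) allocation size, so a write
 * at any offset beyond it fails the capability check. */
static int lxfi_check_write(uint32_t cap_len, uint32_t off, uint32_t len) {
    return off <= cap_len && len <= cap_len - off;  /* overflow-safe bound */
}
```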
Econet. Dan Rosenberg posted a privilege escalation exploit [18] that takes advantage of three vulnerabilities found by Nelson Elhage [8]. Two of them lie in the Econet module, and one in the core kernel. The two Econet vulnerabilities allow an unprivileged user to trigger a NULL pointer dereference in Econet. The core kernel vulnerability is triggered when the kernel is temporarily in a context in which its check of user-provided pointers is omitted, which allows a user to write anywhere in kernel space. To prevent such vulnerabilities, the core kernel should always reset the context so that the check of user-provided pointers is enforced. Unfortunately, the kernel's do_exit fails to obey this rule. do_exit is called to kill a process when a NULL pointer dereference is caught in the kernel. Moreover, in do_exit the kernel writes a zero into a user-provided pointer (task->clear_child_tid). Combined with the NULL pointer dereference triggered by the Econet vulnerabilities, the attacker is able to write a zero into an arbitrary kernel-space address. By carefully arranging the kernel memory address for task->clear_child_tid, the attacker redirects econet_ops.ioctl to user space, and then gains root privilege in the same way as the RDS exploit described below. LXFI prevents the exploit by stopping the kernel from performing the indirect call through econet_ops.ioctl after it is overwritten with an illegal address.
RDS. Dan Rosenberg reported a vulnerability in the Linux RDS module in CVE-2010-3904 [19]. It is caused by a missing check of a user-provided pointer in the RDS page copying routine, allowing a local attacker to write arbitrary values to arbitrary memory locations. The vulnerability can be triggered by sending and receiving messages over an RDS socket. In the reported exploit, the attacker overwrites the rds_proto_ops.ioctl function pointer defined in the RDS module with the address of a user-space function. The exploit then tricks the kernel into indirectly calling rds_proto_ops.ioctl by invoking the ioctl system call. As a result, the local attacker can execute his own code in kernel mode.
LXFI prevents the exploit in two ways. First, LXFI does not grant the module WRITE capabilities for the module's read-only sections (the Linux kernel does). Thus, the exploit cannot overwrite rds_proto_ops.ioctl in the first place, since it is declared in a read-only structure. To see if LXFI can defend against vulnerabilities that allow corrupting a writable function pointer, we made this memory location writable. LXFI is still able to prevent the exploit, because it checks the core kernel's indirect call to rds_proto_ops.ioctl. The LXFI runtime detects that the function pointer is writable by the RDS module, and then checks whether RDS has a CALL capability for the target function. The LXFI runtime rejects the indirect call because the RDS module has no CALL capability for invoking a user-space function. It is worth mentioning that the LXFI runtime would also reject the indirect call if the user overwrote the function pointer with a core kernel function that the module does not have a CALL capability for.
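The guarded indirect call can be modeled as follows, with invented names standing in for the LXFI runtime:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of the guarded indirect call: before the kernel follows a
 * function pointer that a module can write, the runtime verifies that the
 * module holds a CALL capability for the target address. */

typedef int (*ioctl_fn)(int);

static int rds_ioctl_impl(int cmd) { return cmd + 1; }        /* legitimate target */
static int planted_by_attacker(int cmd) { (void)cmd; return 0; }

/* The set of functions the module holds CALL capabilities for. */
static ioctl_fn call_caps[] = { rds_ioctl_impl };

static int lxfi_has_call_cap(ioctl_fn f) {
    for (size_t i = 0; i < sizeof(call_caps) / sizeof(call_caps[0]); i++)
        if (call_caps[i] == f) return 1;
    return 0;
}

/* Kernel-side indirect call, guarded by the capability check. */
static int checked_indirect_call(ioctl_fn f, int cmd, int *rejected) {
    if (!lxfi_has_call_cap(f)) {
        *rejected = 1;          /* LXFI raises an error instead of calling */
        return -1;
    }
    *rejected = 0;
    return f(cmd);
}
```

A pointer planted by an attacker, whether to user-space code or to an unauthorized kernel function, is absent from the capability set and is rejected before the call is made.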
Other exploits. Vulnerabilities leading to privilege escalation are harmful in other ways as well: an attacker can typically mount other types of attacks by exploiting the same vulnerabilities. For example, they can be used to hide a rootkit. The Linux kernel uses a hash table (pid_hash) for process lookup. If a rootkit deletes a process from the hash table, the process will not be listed by the ps shell command, but will still be scheduled to run. Without LXFI, a rootkit can exploit the above vulnerability in RDS to unlink itself from the pid_hash table. Using the same technique as in the RDS exploit, we developed an exploit that successfully hides the exploiting process. The exploit runs as an unprivileged user. It overwrites rds_proto_ops.ioctl to point to a user-space function. When the vulnerability is triggered, the core kernel calls the user-space function, which calls detach_pid with current_task (both are exported kernel symbols). As before, LXFI prevents the exploit by disallowing the core kernel from invoking the function pointer into user-space code, because the RDS module has no CALL capability for that code. Even if the module overwrites the rds_proto_ops.ioctl function pointer to point directly to detach_pid, LXFI still prevents this exploit, because the RDS module does not have a CALL capability for detach_pid.
Annotation effort
To evaluate the work required to specify contracts for kernel/module APIs, we annotated 10 modules.
These modules include several device categories (network, sound, and block), different devices within
a category (e.g., two sound devices), and abstract devices (e.g., network protocols). The difficult
part in annotating is understanding the interfaces between the kernel and the module, since there is
little documentation. We typically follow an iterative process: we annotate the obvious parts of the
interfaces, and try to run the module under LXFI. When running the module, the LXFI runtime raises
alerts because the module attempts operations that LXFI forbids. We then iteratively go through these
alerts, understand what the module is trying to do, and annotate interfaces appropriately.
[Table: for each of the 10 modules (net device drivers, sound device drivers, net protocol drivers, and block device drivers), the number of annotated functions and function pointers.]
Figure 9: The numbers of annotated function prototypes and function pointers for 10 modules. An annotation is considered unique if it is used by only one module. The Total row reports the total number of distinct annotations.
To quantify the work involved, we count the number of annotations required to support a kernel
module. The number of annotations needed for a given module is determined by the number of
functions (either defined in the core kernel or other modules) that the module invokes directly, and the
number of function pointers that the core kernel and the module call. As Figure 9 shows, each module
calls 6–81 functions directly, and is called by (or calls) 7–52 function pointers. For each module,
the number of functions and function pointers that need annotating is much smaller. For example,
supporting the can module only requires annotating 7 extra functions after all other modules listed in
Figure 9 are annotated. The reason is that similar modules often invoke the same set of core kernel
functions, and that the core kernel often invokes module functions in the same way across multiple
modules. For example, the interface of the PCI bus is shared by all PCI devices. This suggests that the
effort to support a new module can be small as more modules are supported by LXFI.
Some functions require checking, copying, or transferring a large number of capabilities. LXFI's annotation language supports programmer-defined capability iterators for this purpose, such as skb_caps for handling all of the capabilities associated with an sk_buff, as shown in Figure 4. In our experience, most annotations are simple and do not require capability iterators. For the 10 modules, we wrote 36 capability iterators to handle idioms such as for loops or nested data structures. Each module required 3–11 capability iterators.
A second factor that affects the annotation effort is the rate of change of Linux kernel interfaces.
We have inspected Linux kernel APIs for 20 major versions of the kernel, from 2.6.20 to 2.6.39, by
counting the numbers of both functions that are directly exported from the core kernel and function
pointers that appear in shared data structures using ctags. Figure 10 shows our findings. The results
indicate that, although the number of kernel interfaces grows steadily, the number of interfaces changed
with each kernel version is relatively modest, on the order of several hundred functions. This is in
contrast to the total number of lines of code changed between major kernel versions, which is on the
order of several hundred thousand lines of code.
To evaluate the enforcement overhead, we measure how much LXFI slows down the SFI microbenchmarks [23]. To run the tests, we turn each benchmark into a Linux kernel module. We run the tests on a desktop equipped with an Intel Core i3-550 3.2 GHz CPU, 6 GB of memory, and an Intel 82540EM Gigabit Ethernet card. For these benchmarks, we might expect a slightly higher overhead than XFI because the stack optimizations used in XFI are not applicable to Linux kernel modules; on the other hand, LXFI, like BGI, uses a compile-time approach to instrumentation, which provides opportunities for compile-time optimizations. We cannot compare directly to BGI because it targets the Windows kernel and no numbers were reported for the SFI microbenchmarks, but we would expect
[Plot: number of exported kernel functions and of function pointers in structs, together with the number changed from the previous version, for each kernel release from 2.6.21 (04/2007) through 2.6.39 (05/2011).]
Figure 10: Rate of change for Linux kernel APIs, for kernel versions 2.6.21 through 2.6.39. The top curve shows the number of total and changed exported kernel functions; for example, 2.6.21 had a total of 5,583 exported functions, of which 272 were new or changed since 2.6.20. The bottom curve shows the number of total and changed function pointers in structs; for example, 2.6.21 had a total of 3,725 function pointers in structs, of which 183 were new or changed since 2.6.20.
BGI to be faster than LXFI, because BGI’s design carefully optimizes the runtime data structures to
enable low-overhead checking.
Figure 11: Code size and slowdown of the SFI microbenchmarks.
Figure 11 summarizes the results from the measurements. We compare our result with the slowpath
write-only overhead in XFI (Table 1 in [9]). For all benchmarks, the code size is 1.1x-1.2x larger
with LXFI instrumentation, while with XFI the code size is 1.1x-3.9x larger. We believe that LXFI’s
[Table: throughput (TCP_STREAM, UDP_STREAM), transaction rates (TCP_RR, UDP_RR, including 1-switch latency configurations), and CPU utilization for the stock and LXFI-enabled e1000 driver.]
Figure 12: Performance of the netperf benchmark with the stock and LXFI-enabled e1000 driver.
instrumentation inserts less code because LXFI does not employ fastpath checks (inlining memory-range tests for the module's data section to handle common cases [9]) as XFI does. Moreover, LXFI targets x86_64, which provides more registers, allowing the inserted instruction sequences to be shorter.
Like XFI, LXFI adds almost no overhead for hotlist, because hotlist performs mostly read-only
operations over a linked list, which LXFI does not instrument.
The performance of lld under LXFI (11% slowdown) is much better than for XFI (93% slowdown).
This is because the code of lld contains a few trivial functions, and LXFI’s compiler plugin effectively
inlined them, greatly reducing the number of guards at function entries and exits. In contrast, XFI uses
binary rewriting and therefore is unable to perform this optimization. Since BGI also uses a compiler
plug-in, we would expect BGI to do as well or better than LXFI.
The slowdown of MD5 is also negligible (2% compared with 27% for XFI). oprofile shows that most
of the memory writes in MD5 target a small buffer, residing in the module’s stack frame. By applying
optimizations such as inlining and loop unrolling, LXFI’s compiler plugin detects that these writes
are safe because they operate on constant offsets within the buffer’s bound, and can avoid inserting
checks. Similar optimizations are difficult to implement in XFI’s binary rewriting design, but BGI
again should be as fast or faster than LXFI.
To evaluate the overhead of LXFI on an isolated kernel module, we run netperf [14] to exercise the
Linux e1000 driver as a kernel module. We run LXFI on the same desktop described in §8.3. The
other side of the network connection runs stock Linux 2.6.35 SMP on a desktop equipped with an
Intel(R) Core(TM) i7-980X 3.33 GHz CPU, 24 GB memory, and a Realtek RTL8111/8168B PCIE
Gigabit Ethernet card. The two machines are connected via a switched Gigabit network. In this section,
“TX” means that the machine running LXFI sends packets, and “RX” means that the machine running
LXFI receives packets from the network.
Figure 12 shows the performance of netperf. Each test runs for 10 seconds. The “CPU %” column
reports the CPU utilization on the desktop running LXFI. The first test, TCP_STREAM, measures the
TCP throughput of the e1000 driver. The test uses a send buffer of 16,384 bytes, and a receive buffer
of 87,370 bytes. The message size is 16,384 bytes. As shown in Figure 12, for both “TX” and “RX”
workloads, LXFI achieves the same throughput as the stock e1000 driver; the CPU utilization increases
by 3.7× and 2.2× with LXFI, respectively, because of the added cost of capability operations.
UDP_STREAM measures UDP throughput. The UDP socket size is 126,976 bytes on the send side,
and 118,784 bytes on the receive side. The test sends messages of 64 bytes. The two performance
numbers report the number of packets that get sent and received. LXFI achieves 65% of the throughput
of the stock version for TX, and achieves the same throughput for RX. The LXFI version cannot
achieve the same throughput for TX because the CPU utilization reaches 100%, so the system cannot
generate more packets. We expect that using a faster CPU would improve the throughput for TX
(although the CPU overhead would remain high).
We run TCP_RR and UDP_RR to measure the impact of LXFI on latency, using the same message size and the same send and receive buffer sizes as above. We conducted two tests, each with a different network configuration. In the first configuration, the two machines are connected to the same subnet, and there are a few switches
between them (but no routers). As shown in the middle rows of Figure 12, with LXFI, the throughput
of TCP_RR is almost the same as the stock version, and the CPU utilization increases by 2.6×. For
UDP_RR, the throughput decreases by 14%, and the CPU utilization increases by 2.2×.
Part of the latency observed in the above test comes from the network switches connecting the two machines. To understand how LXFI performs in a configuration with lower network latency, we connect the two machines to a dedicated switch and run the test again. As Figure 12 shows, the CPU utilization and the throughput increase for both versions. The relative overhead of LXFI increases because the network latency is so low that the processing of the next incoming packet is delayed by capability actions, slowing down the rate of packets received per second. We expect that few real systems use a network with such low latencies, and LXFI provides good throughput even when only a small amount of latency is available for overlap.
[Table: for each guard type (annotation action, function entry, function exit, mem-write check, kernel ind-call all, kernel ind-call e1000), the number of guards per packet, the time per guard (ns), and the time per packet (ns).]
Figure 13: Average number of guards executed by the LXFI runtime per packet, the average cost of each guard, and the total time spent in runtime guards per packet for the UDP_STREAM TX benchmark.
To understand the sources of LXFI’s overheads, we measure the average number of guards per packet
that the LXFI runtime executes, and the average time for each guard. We report the numbers for the
UDP_STREAM TX benchmark, because LXFI performs worst for this workload (not considering the
1-switch network configuration). Figure 13 shows the results. As expected, LXFI spends most of the
time performing annotation actions (grant, revoke, and check), and checking permissions for memory
writes. Both of them are the most frequent events in the system. “Kernel ind-call all” and “Kernel
ind-call e1000” show that the core kernel performs 9.2 indirect function calls per packet, around
1/3 of which are calls to the e1000 driver that involve transmitting packets. This suggests that our
writer-set tracking optimization is effective at eliminating 2/3 of checks for indirect function calls.
The results suggest that LXFI works well for the modules that we annotated. The amount of annotation work is modest, requiring 8–133 annotations per module, including annotations that are shared between multiple modules. Instrumenting a network driver with LXFI increases CPU usage by 2.2–3.7×, and achieves the same TCP throughput as an unmodified kernel. However, UDP throughput drops by 35%. It is likely that we can use design ideas for runtime data structures from BGI to reduce the overhead of checking. In terms of security, LXFI is less beneficial to modules that must perform privileged operations; an adversary who compromises such a module will be able to invoke the privileged operations that the module is allowed to perform. It would be interesting to explore how to refactor such modules to separate privileges. Finally, some modules have complicated semantics that the LXFI annotation language is not rich enough to capture; for example, file systems have setuid and file permission invariants that are difficult to express with LXFI annotations. We would like to explore how to increase LXFI's applicability in future work.
LXFI is inspired by XFI [9] and BGI [4]; XFI, BGI, and LXFI all use SFI [26] to isolate modules. XFI assumes that the interface between the module and its support functions is simple and static, and does not handle overly permissive support functions. BGI extends XFI to handle more complex interfaces
by manually interposing on every possible interaction between the kernel and module, and uses access
control lists to restrict the operations a module can perform. Manual interposition for BGI is feasible
because the Windows Driver Model (WDM) only allows drivers to access kernel objects, or register
callbacks, through well-defined APIs. In contrast, the Linux kernel exposes its internal data objects
to module developers. For example, a buggy module may overwrite function pointers in the kernel
object to trick the kernel into executing arbitrary code. To provide API integrity for these complex
interfaces, LXFI provides a capability and annotation system that programmers can use to express
the necessary contracts for API integrity. LXFI’s capabilities are dual to BGI’s access control lists.
Another significant difference between LXFI and BGI is LXFI’s support for principals to partition
the privileges held by a shared module. Finally, LXFI shows that it can prevent real and synthesized
attacks, whereas the focus of BGI is high-performance fault isolation.
Mondrix [27] shows how to implement fault isolation for several parts of the Linux kernel, including the
memory allocator, several drivers, and the Unix domain socket module. Mondrix relies on specialized
hardware not available in any processor today, whereas LXFI uses software-based techniques to run
on commodity x86 processors. Mondrix also does not protect against malicious modules, which drives much of LXFI's design. For example, malicious kernel modules in Mondrix can invoke core kernel functions with incorrect arguments, or simply reload the page table register, to take over the entire kernel.
Loki [28] shows how to privilege-separate the HiStar kernel into mutually distrustful “library kernels”.
Loki’s protection domains correspond to user or application protection domains (defined by HiStar
labels), in contrast with LXFI’s domains which are defined by kernel component boundaries. Loki
relies on tagged memory, and also relies on HiStar’s simple kernel design, which has no complex
subsystems like network protocols or sound drivers in Linux that LXFI supports.
Full formal verification along the lines of seL4 [15] is not practical for Linux, both because of its
complexity, and because of its ill-defined specification. It may be possible to use program analysis
techniques to check some limited properties of LXFI itself, though, to ensure that an adversary cannot
subvert LXFI.
Driver isolation techniques such as Sud [3], Termite [21], Dingo [20], and Microdrivers [10] isolate
device drivers at user-level, as do microkernels [7, 11]. This requires significantly re-designing the
kernel interface, or restricting user-mode drivers to well-defined interfaces that are amenable to expose
through IPC. Many kernel subsystems, such as protocol modules like RDS, make heavy use of
shared memory that would not work well over IPC. Although there has been a lot of interest in fault
containment in the Linux kernel [16, 24], fault tolerance is a weaker property than stopping attackers.
A kernel runtime that provides type safety and capabilities by default, such as Singularity [13], can
provide strong API contracts similar to LXFI. However, most legacy OSes including Linux cannot
benefit from it since they are not written in a type-safe language like C#.
SecVisor [22] provides kernel code integrity, but does not guarantee data protection or API integrity; as a result, code integrity alone is not enough to prevent privilege escalation exploits. OSck [12] detects kernel rootkits by enforcing type safety and data integrity for operating system data at the hypervisor level, but does not address API safety and capability issues among kernel subsystems.
Overshadow [6] and Proxos [25] provide security by interposing on kernel APIs from a hypervisor. The granularity at which these systems can isolate features is coarser than LXFI's; for example, Overshadow can interpose on the file system as a whole, but not on a single protocol module like RDS. Furthermore, techniques similar to LXFI's would be helpful to prevent privilege escalation exploits in the hypervisor's kernel.
This paper presents an approach to help programmers capture and enforce API integrity of complex,
irregular kernel interfaces like the ones found in Linux. LXFI introduces capabilities and annotations
to allow programmers to specify these rules for any given interface, and uses principals to isolate
privileges held by independent instances of the same module. Using software fault isolation techniques,
LXFI enforces API integrity at runtime. Using a prototype of LXFI for Linux, we instrumented a
number of kernel interfaces with complex contracts to run 10 different kernel modules with strong
security guarantees. LXFI succeeds in preventing privilege escalation attacks through 5 known
vulnerabilities, and imposes moderate overhead for a network-intensive benchmark.
We thank the anonymous reviewers and our shepherd, Sam King, for their feedback. This research was partially supported by the DARPA Clean-slate design of Resilient, Adaptive, Secure Hosts (CRASH) program under contract #N66001-10-2-4089, by the DARPA UHPC program, and by NSF award CNS-1053143. Dong Zhou was supported by China 973 program 2007CB807901 and NSFC 61033001. The opinions in this paper do not necessarily represent DARPA or official US policy.
[1] Common vulnerabilities and exposures. From
[2] J. Arnold, T. Abbott, W. Daher, G. Price, N. Elhage, G. Thomas, and A. Kaseorg. Security impact
ratings considered harmful. In Proceedings of the 12th Workshop on Hot Topics in Operating
Systems, Monte Verita, Switzerland, May 2009.
[3] S. Boyd-Wickizer and N. Zeldovich. Tolerating malicious device drivers in Linux. In Proceedings
of the 2010 USENIX Annual Technical Conference, pages 117–130, Boston, MA, June 2010.
[4] M. Castro, M. Costa, J. P. Martin, M. Peinado, P. Akritidis, A. Donnelly, P. Barham, and R. Black.
Fast byte-granularity software fault isolation. In Proceedings of the 22nd ACM Symposium on
Operating Systems Principles, Big Sky, MT, October 2009.
[5] H. Chen, Y. Mao, X. Wang, D. Zhou, N. Zeldovich, and M. F. Kaashoek. Linux kernel
vulnerabilities: State-of-the-art defenses and open problems. In Proceedings of the 2nd Asia-Pacific Workshop on Systems, Shanghai, China, July 2011.
[6] X. Chen, T. Garfinkel, E. C. Lewis, P. Subrahmanyam, C. A. Waldspurger, D. Boneh, J. Dwoskin,
and D. R. K. Ports. Overshadow: A virtualization-based approach to retrofitting protection
in commodity operating systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, Seattle, WA, March 2008.
[7] F. M. David, E. M. Chan, J. C. Carlyle, and R. H. Campbell. CuriOS: Improving reliability
through operating system structure. In Proceedings of the 8th Symposium on Operating Systems
Design and Implementation, San Diego, CA, December 2008.
[8] N. Elhage. CVE-2010-4258: Turning denial-of-service into privilege escalation. http://blog., December 2010.
[9] U. Erlingsson, M. Abadi, M. Vrable, M. Budiu, and G. C. Necula. XFI: Software guards for
system address spaces. In Proceedings of the 7th Symposium on Operating Systems Design and
Implementation, Seattle, WA, November 2006.
[10] V. Ganapathy, M. Renzelmann, A. Balakrishnan, M. Swift, and S. Jha. The design and implementation of microdrivers. In Proceedings of the 13th International Conference on Architectural
Support for Programming Languages and Operating Systems, Seattle, WA, March 2008.
[11] J. N. Herder, H. Bos, B. Gras, P. Homburg, and A. Tanenbaum. Fault isolation for device
drivers. In Proceedings of the 2009 IEEE Dependable Systems and Networks Conference, Lisbon,
Portugal, June–July 2009.
[12] O. Hofmann, A. Dunn, S. Kim, I. Roy, and E. Witchel. Ensuring operating system kernel integrity
with OSck. In Proceedings of the 16th International Conference on Architectural Support for
Programming Languages and Operating Systems, Newport Beach, CA, March 2011.
[13] G. C. Hunt, J. R. Larus, M. Abadi, M. Aiken, P. Barham, M. Fahndrich, C. Hawblitzel, O. Hodson,
S. Levi, N. Murphy, B. Steensgaard, D. Tarditi, T. Wobber, and B. Zill. An overview of the
Singularity project. Technical Report MSR-TR-2005-135, Microsoft, Redmond, WA, October 2005.
[14] R. Jones. Netperf: A network performance benchmark, version 2.4.5. http://www.netperf.
[15] G. Klein, K. Elphinstone, G. Heiser, J. Andronick, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt, M. Norrish, R. Kolanski, T. Sewell, H. Tuch, and S. Winwood. seL4: Formal verification
of an OS kernel. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles,
Big Sky, MT, October 2009.
[16] A. Lenharth, V. S. Adve, and S. T. King. Recovery domains: An organizing principle for recoverable operating systems. In Proceedings of the 14th International Conference on Architectural
Support for Programming Languages and Operating Systems, pages 49–60, Washington, DC,
March 2009.
[17] J. Oberheide. Linux kernel CAN SLUB overflow.
09/10/linux-kernel-can-slub-overflow/, September 2010.
[18] D. Rosenberg. Econet privilege escalation exploit.
security.full-disclosure/76457, December 2010.
[19] D. Rosenberg. RDS privilege escalation exploit.
tools/linux-rds-exploit.c, October 2010.
[20] L. Ryzhyk, P. Chubb, I. Kuz, and G. Heiser. Dingo: Taming device drivers. In Proceedings of
the ACM EuroSys Conference, Nuremberg, Germany, March 2009.
[21] L. Ryzhyk, P. Chubb, I. Kuz, E. Le Sueur, and G. Heiser. Automatic device driver synthesis with
Termite. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles, Big Sky,
MT, October 2009.
[22] A. Seshadri, M. Luk, N. Qu, and A. Perrig. SecVisor: A tiny hypervisor to provide lifetime
kernel code integrity for commodity OSes. In Proceedings of the 21st ACM Symposium on
Operating Systems Principles, Stevenson, WA, October 2007.
[23] C. Small and M. I. Seltzer. MiSFIT: Constructing safe extensible systems. IEEE Concurrency, 6:34–41, 1998.
[24] M. M. Swift, B. N. Bershad, and H. M. Levy. Improving the reliability of commodity operating
systems. ACM Transactions on Computer Systems, 22(4), November 2004.
[25] R. Ta-Min, L. Litty, and D. Lie. Splitting interfaces: Making trust between applications and
operating systems configurable. In Proceedings of the 7th Symposium on Operating Systems
Design and Implementation, pages 279–292, Seattle, WA, November 2006.
[26] R. Wahbe, S. Lucco, T. E. Anderson, and S. L. Graham. Efficient software-based fault isolation.
In Proceedings of the 14th ACM Symposium on Operating Systems Principles, pages 203–216,
Asheville, NC, December 1993.
[27] E. Witchel, J. Rhee, and K. Asanovic. Mondrix: Memory isolation for Linux using Mondriaan
memory protection. In Proceedings of the 20th ACM Symposium on Operating Systems Principles,
Brighton, UK, October 2005.
[28] N. Zeldovich, H. Kannan, M. Dalton, and C. Kozyrakis. Hardware enforcement of application
security policies. In Proceedings of the 8th Symposium on Operating Systems Design and
Implementation, pages 225–240, San Diego, CA, December 2008.
Thialfi: A Client Notification Service for
Internet-Scale Applications
Atul Adya
Gregory Cooper
Daniel Myers
Michael Piatek
{adya, ghc, dsmyers, piatek}@google.com
Google, Inc.
Abstract
Ensuring the freshness of client data is a fundamental problem for applications that rely on cloud
infrastructure to store data and mediate sharing. Thialfi is a notification service developed at Google
to simplify this task. Thialfi supports applications written in multiple programming languages and
running on multiple platforms, e.g., browsers, phones, and desktops. Applications register their
interest in a set of shared objects and receive notifications when those objects change. Thialfi servers
run in multiple Google data centers for availability and replicate their state asynchronously. Thialfi’s
approach to recovery emphasizes simplicity: all server state is soft, and clients drive recovery and
assist in replication. A principal goal of our design is to provide a straightforward API and good
semantics despite a variety of failures, including server crashes, communication failures, storage
unavailability, and data center failures.
Evaluation of live deployments confirms that Thialfi is scalable, efficient, and robust. In production
use, Thialfi has scaled to millions of users and delivers notifications with an average delay of less
than one second.
Categories and Subject Descriptors
C.2.4 [Computer-Communication Networks]: Distributed Systems; D.4.5 [Operating Systems]: Reliability
General Terms
Distributed Systems, Scalability, Reliability, Performance
Introduction
Many Internet-scale applications are structured around data shared between multiple users, their devices, and cloud infrastructure. Client applications maintain a local cache of their data that must be
kept fresh. For example, if a user changes the time of a meeting on a calendar, that change should be
quickly reflected on the devices of all attendees. Such scenarios arise frequently at Google. Although
infrastructure services provide reliable storage, there is currently no general-purpose mechanism to
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
SOSP '11, October 23-26, 2011, Cascais, Portugal.
Copyright © 2011 ACM 978-1-4503-0977-6/11/10 ... $10.00.
notify clients that shared data has changed. In practice, many applications periodically poll to detect
changes, which results in lengthy delays or significant server load. Other applications develop custom
notification systems, but these have proven difficult to generalize and cumbersome to maintain.
This paper presents Thialfi, a highly scalable notification system developed at Google for user-facing
applications with hundreds of millions of users and billions of objects. Thialfi provides sub-second
notification delivery in the common case and clear semantics despite failures, even of entire data
centers. Thialfi supports applications written in a variety of languages (C++, Java, JavaScript) and
running on a diverse set of platforms such as web browsers, mobile phones, and desktops. To achieve
reliability, Thialfi relies on clients to drive recovery operations, avoiding the need for hard state at the
server, and our API is structured so that error handling is incorporated into the normal operation of
the application.
Thialfi models shared data as versioned objects, which are stored at a data center and cached at clients.
Clients register with Thialfi to be notified when an object changes, and the application’s servers notify
Thialfi when updates occur. Thialfi propagates notifications to registered clients, which synchronize
their data with application servers. Crucially, Thialfi delivers only the latest version number to clients,
not application data, which simplifies our design and promotes scalability.
Thialfi’s implementation consists of a library embedded in client applications and two types of servers
that run in Google data centers. Matchers are partitioned by object and receive and forward notifications; Registrars are partitioned by client and manage client registration and presence state. The
client library communicates with the servers over a variety of application-specific channels; Thialfi
protocols provide end-to-end reliability despite channel losses or message reordering. Finally, a best-effort replication protocol runs between Thialfi data centers, and clients correct out-of-date servers
during migration.
A principal feature of Thialfi’s design is reliability in the presence of a wide variety of faults. The
system ensures that clients eventually learn of the latest version of each registered object, even if
the clients were unreachable at the time the update occurred. At large scale, ensuring even eventual
delivery is challenging—Thialfi is designed to operate at the scale of hundreds of millions of clients,
billions of objects, and hundreds of thousands of changes per second. Since applications are replicated across data centers for reliability, notifications may need to be routed over multiple unreliable
communication channels to reach all clients. During propagation, a client may become unavailable
or change its server affinity. Clients may be offline. Servers, storage systems, or even entire data
centers may become temporarily unavailable. Thialfi handles these issues internally, freeing application developers from the need to cope with them as special cases. Indeed, Thialfi remains correct
even when all server state is discarded. In our API, all failures manifest as signals that objects or
registrations have become stale and should be refreshed, and this process reconstructs state at the
server if necessary.
Like many infrastructure services, Thialfi is designed for operational simplicity: the same aspects of
our design that provide reliability (e.g., tolerating data center failures) also make the system easier to
run in production. Our techniques emphasize simplicity but do not provide perfect availability. While
Thialfi remains correct, recovering from some failures results in partial unavailability, and we discuss
these scenarios in our design.
Thialfi is a production service that is in active use by millions of people running a diverse set of
Google’s applications. We focus on two: Chrome and Contacts. These show the diversity of Thialfi
usage, which includes desktop applications synchronizing data with the cloud (Chrome) as well as
web/mobile applications sharing data between devices (Contacts). In both cases, Thialfi has simplified application design and improved efficiency substantially.
Further evaluation of Thialfi confirms its scalability, efficiency, and robustness. In production use,
Thialfi has scaled to millions of users. Load testing shows that Thialfi's resource consumption scales
directly with usage. Injecting failures shows that the cost of recovery is modest; despite the failure
of an entire data center, Thialfi can rapidly migrate clients to remaining data centers with limited disruption.
Figure 1: An abstraction for a client notification service.
To summarize, we make the following contributions:
• We provide a system robust to the full and partial failures common to infrastructure services.
Thialfi is one of the first systems to demonstrate robustness to the complete failure of a data center
and to the partial unavailability of infrastructure storage.
• Our design provides reliability at Internet scale without hard server state. Thialfi ensures that
clients eventually learn the latest versions of registered objects even if all server state is dropped.
• Thialfi’s API unifies error recovery with ordinary operation. No separate error-handling code paths
are required, greatly simplifying integration and reasoning about correctness.
• We integrate Thialfi with several Google applications and demonstrate the performance, scalability, and robustness of our design for millions of users and thousands of notifications per second.
This section describes an abstraction for a notification service with requirements drawn from our
experience at Google. Figure 1 shows the abstraction. Since Internet applications are separated into
server and client components, the service includes both an infrastructure component and a client
library. At the client, developers program against the library’s API and make updates that modify
shared data. At the server, applications publish notifications, which the service routes to appropriate
clients. The remainder of this section describes how we arrived at this abstraction.
A Case for a Notification Service
Applications that share data among users and devices have a common need for notifications when
data has changed. For example, the Google Contacts application allows users to create, edit, and
share contact information through web, mobile, and desktop interfaces that communicate with servers
running in Google’s data centers. If a contact changes, other devices should learn of the change
quickly. This is the essence of a notification service: informing interested parties of changes to data
in a reliable and timely manner.
Throughout the paper, we refer to application data as objects: named, versioned entities for which
users may receive notifications. For example, a contacts application might model each user’s address
book as an object identified by that user’s email address, or the application may model each contact
as a separate object. Contacts may be shared among users or a user’s devices. When the contact list
is changed, its version number increases, providing a simple mechanism to represent changes.
In the absence of a general service, applications have developed custom notification mechanisms. A
widely used approach is for each client to periodically poll the server for changes. While conceptually
simple and easy to implement, polling creates an unfortunate tension between timeliness and resource
consumption. Frequent polling allows clients to learn of changes quickly but imposes significant load
on the server. And, most requests simply indicate that no change has occurred.
Table 1: Configurations supported by Thialfi.
Channels: HTTP, XMPP, internal RPC (in DC)
Languages: Java, C++, JavaScript
Platforms: Web, mobile, native desktop apps
Storage: with inter-DC sync or async replication
An alternative is to push notifications to clients. However, ensuring reliability in a push system is
difficult: a variety of storage, network, and server failures are common at Internet scale. Further,
clients may be disconnected when updates occur and remain offline for days. Buffering messages
indefinitely is infeasible. The server’s storage requirements must be bounded, and clients should not
be overwhelmed by a flood of messages upon wakeup.
As a result of these challenges, push systems at Google are generally best-effort; developers must detect and recover from errors. This is typically done via a low-frequency, backup polling mechanism,
again resulting in occasional, lengthy delays that are difficult to distinguish from bugs.
Summarizing our discussion above, a general notification service should satisfy at least four requirements.
• Tracking. The service should track which clients are interested in what data. Particularly for
shared data, tracking a mapping between clients and objects is a common need.
• Reliability. Notifications should be reliable. To the extent possible, application developers should
not be burdened with error detection and recovery mechanisms such as polling.
• End-to-end. Given an unreliable channel, the service must provide reliability in an end-to-end
manner; i.e., it must include a client-side component.
• Flexibility. To be widely applicable, a notification service must impose few restrictions on developers. It should support web, desktop, and mobile applications written in a variety of languages for
a variety of platforms. At the server, similar diversity in storage and communication dependencies
precludes tight integration with a particular software stack. We show the variety of configurations
that Thialfi supports in Table 1.
Design Alternatives
Before describing our system in detail, we first consider alternative designs for a notification service.
Integrating notifications with the storage layer: Thialfi treats each application’s storage layer as
opaque. Updates to shared objects must be explicitly published, and applications must explicitly
register for notifications on shared objects. An alternative would be to track object sharing at the
storage layer and automatically generate notifications when shared objects change. We avoid this for
two reasons. The first is diversity: while many applications share a common need for notifications,
applications use storage systems with diverse semantics, data models, and APIs customized to particular application requirements. We view the lack of a one-size-fits-all storage system as fundamental,
leading us to design notifications as a separate component that is loosely coupled with the storage
layer. The second reason is complexity. Even though automatically tracking object dependencies [22]
may simplify the programming model when data dependencies are complex (e.g., constructing webpages on-the-fly with data joins), such application structures are difficult to scale and rare at Google.
Requiring explicit object registrations and updates substantially simplifies our design, and our experience has been that reasoning about object registrations in our current applications is straightforward.
Reliable messaging from servers to clients: Reliable messaging is a familiar primitive for developers. We argue for a different abstraction: a reliable notification of the latest version number of an
object. Why not reliable messaging? First, reliable messaging is inappropriate when clients are often
unavailable. Lengthy queues accumulate while clients are offline, leading to a flood of messages
upon wakeup, and server resources are wasted if offline clients never return. Second, message delivery is often application-specific. Delivering application data requires adhering to diverse security and
privacy requirements, and different client devices require delivery in different formats (e.g., JSON
for browsers, binary for phones). Instead of reliable messaging, Thialfi provides reliable signaling—
the queue of notifications for each object is collapsed to a single message, and old clients may be
safely garbage-collected without sacrificing reliability. Moreover, such an abstraction allows Thialfi
to remain loosely coupled with applications.
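The collapsing of a per-object notification queue into a single pending signal can be sketched as follows. The class and method names (PendingSignals, notify, pending, ack) are our own for illustration, not part of the Thialfi API; the point is only that storage per client stays bounded because older versions are absorbed by newer ones.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: all notifications for an object collapse into a
// single "latest version" entry, so an offline client never accumulates
// a queue and old entries can be dropped without losing reliability.
public class PendingSignals {
    private final Map<String, Long> latest = new HashMap<>();

    // Record a notification; an older version is absorbed by a newer one.
    public void notify(String objectId, long version) {
        latest.merge(objectId, version, Math::max);
    }

    // The single version to deliver for an object, if a signal is pending.
    public Optional<Long> pending(String objectId) {
        return Optional.ofNullable(latest.get(objectId));
    }

    // Client acknowledged delivery at `version`; drop the signal only if
    // no newer version arrived in the meantime.
    public void ack(String objectId, long version) {
        latest.computeIfPresent(objectId, (id, v) -> v <= version ? null : v);
    }
}
```

Whatever the update rate while a client is offline, at most one signal per registered object is retained and redelivered.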
This section gives an overview of the Thialfi architecture and its programming interface.
Model and Architecture
Thialfi models data in terms of object identifiers and their version numbers. Objects are stored in each
application’s backend servers, not by Thialfi. Each object is named using a variable length byte string
of the application’s choosing (typically less than 32 bytes), which resides in a private namespace for
that application. Version numbers (currently 64-bit) are chosen by applications and included in the
update published to Thialfi.
Application backends are required to ensure that version numbers are monotonically increasing to
ensure reliable delivery; i.e., in order for Thialfi to reliably notify a client of an object’s latest version, the latest version must be well-defined. Synchronous stores can achieve this by incrementing
a version number after every update, for example. Asynchronous stores typically have some method
of eventually reconciling updates and reaching a commit point; such stores can issue notifications
to Thialfi afterwards. At Google, to avoid modifying existing asynchronous backend stores, some
services simply inform Thialfi when updates reach one of the storage replicas, using the current time
at that replica as the version number. Although such services run the risk of missing updates due
to clock skew and conflicts, this is rare in practice. Clock skew in the data center is typically low,
conflicts are infrequent for many applications, and replication delay is low (seconds).
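As a rough illustration of the replica-timestamp approach above, a store can derive version numbers from its local clock while still keeping them monotonically increasing. This is a sketch under our own assumptions (the VersionSource class is invented, not Google's implementation), showing only why the scheme tolerates small backward clock adjustments:

```java
// Illustrative only: an asynchronous store without native versioning can
// publish its replica's current time as the version number. Guarding with
// max() keeps versions monotonic within the replica even if the clock
// steps backwards (e.g., an NTP adjustment).
public class VersionSource {
    private long last = 0;

    public synchronized long next(long nowMillis) {
        last = Math.max(last + 1, nowMillis);
        return last;
    }
}
```

Cross-replica skew and write conflicts can still reorder versions, which matches the paper's caveat that such services accept a small risk of missed updates.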
As shown in Figure 1, Thialfi comprises a client library and server infrastructure. We describe
these components in turn.
Client library: The client library provides applications with a programmatic interface for registering for shared objects and receiving notifications. The library speaks the Thialfi protocol and
communicates with the Thialfi infrastructure service running in data centers. An application uses the
Thialfi library to register for objects, and the library invokes callbacks to inform the application of
registration changes and to deliver notifications. For each notification, Thialfi informs the application of the modified object’s identifier and the latest version known. When the application receives a
notification, it synchronizes object data by talking directly with its servers: Thialfi does not provide
data synchronization.
Server infrastructure: In the data center, application servers apply updates and notify Thialfi when
objects change. We provide a Publisher library that application backends can embed. The publisher
library call:
Publish(objectId, version, source)
ensures that all Thialfi data centers are notified of the change. When present, the optional source
parameter identifies the client that made the change. (This ID is provided by the application client
at startup and is referred to as its application ID.) As an optimization, Thialfi omits delivery of the
notification to this client, since the client already knows about the change.
// Client actions
interface NotificationClient {
  void Start(byte[] persistentState);
  void Register(ObjectId objectId, long version);
  void Unregister(ObjectId objectId);
}

// Client library callbacks
interface NotificationListener {
  void Notify(ObjectId objectId, long version);
  void NotifyUnknown(ObjectId objectId);
  void RegistrationStatusChanged(ObjectId objectId, boolean isRegistered);
  void RegistrationFailure(ObjectId objectId, boolean isTransient);
  void ReissueRegistrations();
  void WriteState(byte[] persistentState);
}
Figure 2: The Thialfi client API.
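The source-suppression optimization amounts to filtering the registered set by the publisher's application ID. A minimal sketch (Delivery and targets are hypothetical names, not Thialfi API):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: when Publish() carries a `source` application ID,
// delivery skips that client, since it already knows about its own change.
public class Delivery {
    public static List<String> targets(List<String> registered, String source) {
        List<String> out = new ArrayList<>();
        for (String client : registered) {
            if (!client.equals(source)) {
                out.add(client);
            }
        }
        return out;
    }
}
```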
Thialfi supports multiple communication channels to accommodate application diversity. For example, native applications may use XMPP [27], while web applications typically use persistent
HTTP connections [17]. This support allows Thialfi to reuse an application’s existing communication channel, an important capability given the high cost of maintaining a channel in certain contexts (e.g., mobile- or browser-based applications). Other than non-corruption, Thialfi imposes few
requirements—messages may be dropped, reordered, or duplicated. Although rare, the channels most
commonly used by applications exhibit all of these faults.
Given the diversity of authorization and authentication techniques used by applications, Thialfi does
not dictate a particular scheme for securing notifications. Instead, we provide hooks for applications
to participate in securing their data at various points in the system. For example, Thialfi can make
RPCs to application backends to authorize registrations. If required, Thialfi can also make authorization calls before sending notifications to clients.
Similarly, applications must provide a secure client-server channel if confidentiality is required. Thialfi does not mandate a channel security policy.
Client API and Usage
The Thialfi client library provides applications with the API shown in Figure 2, and we refer to these
calls throughout our discussion.
The NotificationClient interface lists the actions available via the client library. The Start() method
initializes the client, and the Register() and Unregister() calls can be used to register/unregister for
object notifications. We point out that the client interface does not include support for generating
notifications. Publish() calls must be made by the application backend.
The NotificationListener interface defines callbacks invoked by the client library to notify the user
application of status changes. Application programmers using Thialfi's library implement these methods. When the library receives a notification from the server, it calls Notify() with that object's ID and
new version number. In scenarios where Thialfi does not know the version number of the object (e.g.,
if Thialfi has never received any update for the object or has deleted the last known version value for
it), the client library uses the NotifyUnknown() call to inform the application that it should refetch
the object from the application store regardless of its cached version. Internally, such notifications are
assigned a sequence number by the server so that they can be reliably delivered and acknowledged in
the protocol.
The client library invokes RegistrationStatusChanged() to inform the application of any registration
information that it receives from the server. It uses RegistrationFailure() to indicate a registration
operation failure to the application. A boolean, isTransient, indicates whether the application should
attempt to retry the operation. ReissueRegistrations() allows the client library to request all registrations from the application. This call can be used to ensure that Thialfi state matches the application’s
intent, e.g., after a loss of server state.
The WriteState() call is an optional method that provides Thialfi with persistent storage on the client,
if available. Client data storage is application-specific; e.g., some applications have direct access to
the filesystem while others are limited to a browser cookie. When a client receives its identifier from
the server, the client library invokes WriteState() with an opaque byte string encoding the identifier,
which is then stored by the application and provided to Thialfi during subsequent invocations of
Start(). This allows clients to resume using existing registrations and notification state. Clients that
do not support persistence are treated as new clients after each restart.
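The persistence contract above can be sketched as follows: WriteState() hands the application an opaque byte string, the application stores it however it can (a file here; a browser would use a cookie), and replays it into Start() after a restart. TokenStore and its file-based storage are our illustration, not part of the Thialfi client library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical application-side storage backing WriteState()/Start().
public class TokenStore {
    private final Path path;

    public TokenStore(Path path) { this.path = path; }

    // Called from NotificationListener.WriteState(byte[] persistentState):
    // persist the opaque state Thialfi handed us.
    public void write(byte[] persistentState) {
        try {
            Files.write(path, persistentState);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // State to pass into NotificationClient.Start(); an empty array means
    // no persistence, so the client is treated as new after restart.
    public byte[] readForStart() {
        try {
            return Files.exists(path) ? Files.readAllBytes(path) : new byte[0];
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```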
Design and Implementation
This section describes the design and implementation of Thialfi. We highlight several key techniques.
No hard server state: Thialfi operates on registration state (i.e., which clients care about which
objects) and notification state (the latest known version of each object). The Thialfi client library is
responsible for tracking the registration state and updating servers in the event of a discrepancy, so
loss of server-side state does not jeopardize correctness. Moreover, while Thialfi makes a substantial effort to deliver “useful” notifications at specific version numbers, it is free to deliver spurious
notifications, and notifications may be associated with an unknown version. This flexibility allows
notification state to be discarded, provided the occurrence of the drop is noted.
Efficient I/O through multiple views of state: The registration and notification state in Thialfi
consists of relations between clients and objects. There is no clear advantage to choosing either
client ID or object ID as the primary key for this state: notifications update a single object and multiple
clients, while registrations update a single client and multiple objects. To make processing of each
operation type simple and efficient, we maintain two separate views of the state, one indexed by client
ID and one by object ID, allowing each type of operation to be performed via a single write to one
storage location in one view. The remaining view is brought up-to-date asynchronously.
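The two-view idea can be made concrete with a toy sketch (all names are ours): a registration commits with one write to the client-indexed view, and a background step later brings the object-indexed view up to date.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative dual-view registration state: the Registrar reads/writes by
// client ID, the Matcher by object ID. Propagation is asynchronous and
// idempotent (adding to a set twice is harmless), so it can be retried.
public class DualViews {
    public final Map<String, Set<String>> byClient = new HashMap<>(); // Registrar view
    public final Map<String, Set<String>> byObject = new HashMap<>(); // Matcher view
    private final Deque<String[]> pending = new ArrayDeque<>();

    // A registration commits after a single write to the client view.
    public void register(String clientId, String objectId) {
        byClient.computeIfAbsent(clientId, k -> new HashSet<>()).add(objectId);
        pending.add(new String[] {clientId, objectId});
    }

    // Background propagation (the Registrar Propagator's role in Thialfi).
    public void propagate() {
        String[] op;
        while ((op = pending.poll()) != null) {
            byObject.computeIfAbsent(op[1], k -> new HashSet<>()).add(op[0]);
        }
    }
}
```

Between register() and propagate(), the Matcher view is stale; this mirrors the paper's point that the remaining view is brought up to date asynchronously.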
Idempotent operations only: Thialfi is designed so that any server-side operation can be safely
repeated. Every operation commits at the server after a single write to storage, allowing aggressive batching of writes. Any dependent changes are performed in the background, asynchronously.
Avoiding overwrites fosters robustness; operations are simply retried until they succeed.
Buffering to cope with partial storage availability: While data corruption is uncommon, large-scale storage systems do not have perfect availability. Writes to some storage regions may fail transiently. To prevent this transient storage unavailability from cascading to application backends, Thialfi buffers failed notification writes at available storage locations, migrating buffered state to its
appropriate location when possible.
Figure 3: Overall architecture of Thialfi.
Figure 3 shows the major components of Thialfi. Bridge servers are stateless, randomly load-balanced tasks that consume a feed of application-specific update messages from Google's infrastructure pub/sub service, translate them into a standard notification format, and assemble them into
batches for delivery to Matcher tasks. Matchers consume notifications for objects, match them with
the set of registered clients, and forward them to the Registrar for reliable delivery to clients. Matchers
are partitioned over the set of objects and maintain a view of state indexed by object ID. Registrars
track clients, process registrations, and reliably deliver notifications using a view of state indexed by
client ID.
The remainder of this section describes our design in stages, starting with a simplified version of
Thialfi that operates entirely in memory and in one data center only. We use this simplified design
to explain the Thialfi protocol and to describe why discarding Thialfi’s server state is safe. We then
extend the in-memory design to use persistent storage, reducing the cost of recovering failed servers.
Finally, we add replication in order to improve recovery from the failure of entire data centers.
In-memory Design
An in-memory version of Thialfi stores client and object state in the memory of the Registrar and
Matcher servers. As mentioned above, clients are partitioned over Registrar servers, and objects are
partitioned over Matcher servers. In order to ensure roughly uniform distribution of load, each client
and object is assigned a partitioning key. This key is computed by prepending a hash of the client or
object ID to the ID itself. We statically partition this keyspace into contiguous ranges; one range is
assigned to each server. If a server crashes or reboots, its state is lost and must be reconstructed from scratch.
Aside from lack of persistence and support for multiple data centers, this design is identical to that
deployed at Google. We next describe the specific state maintained.
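The partitioning key scheme above can be sketched like this. CRC32 stands in for whatever hash function a real deployment uses, and the range-to-server mapping shown (equal slices of the leading byte) is our simplification of static contiguous ranges:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative partitioning: prepend a hash of the client/object ID to the
// ID itself so keys spread roughly uniformly, then assign contiguous
// ranges of the key space to servers.
public class Partitioning {
    public static String partitionKey(String id) {
        CRC32 crc = new CRC32();
        crc.update(id.getBytes(StandardCharsets.UTF_8));
        // 8 hex digits of hash, then the original ID.
        return String.format("%08x:%s", crc.getValue(), id);
    }

    // Split the key space into `servers` equal ranges by the key's
    // leading byte and return the owning server's index.
    public static int serverFor(String key, int servers) {
        int lead = Integer.parseInt(key.substring(0, 2), 16); // 0..255
        return lead * servers / 256;
    }
}
```

Because the key is deterministic, a client or object always maps to the same range, while the hash prefix prevents hot spots from adjacent IDs.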
In-memory State
Registrar: For each client, the Registrar servers maintain two sets: 1) registrations (objects of interest to the client) and 2) pending notifications (notifications not yet acknowledged by the client). They
also maintain a monotonically-increasing sequence number for each client, used to pick an ordering
for registration operations and to generate version numbers for unknown-version notifications.
Matcher: For each object, Matcher servers store the latest version number provided by the application backend. Matcher servers also maintain a copy of the registered clients for each object from the
Registrar; this copy is updated asynchronously. We refer to the combined Matcher and Registrar state
as the C/O-Cache (Client and Object cache).
Thialfi components that we call Propagators asynchronously propagate state between Matchers and
Registrars. The Registrar Propagator copies client registrations to the Matcher, and the Matcher
Propagator copies new notifications to the Registrar.
Both Matchers and Registrars maintain a set of pending operations to perform for objects and clients;
i.e., propagation and delivery of (un)registrations and notifications. The state maintained by each
server thus decomposes into two distinct parts: the C/O-Cache and a pending operation set.
Client Token Management
Thialfi identifies clients using client tokens issued by Registrars. Tokens are composed of two parts:
client identifiers and session identifiers. Tokens are opaque to clients, which store them for inclusion
in each subsequent message. A client identifier is unique and persists for the lifetime of the client’s
state. A session identifier binds a client to a particular Thialfi data center and contains the identity of
the data center that issued the token.
A client acquires tokens via a handshake protocol, in which the Registrar creates an entry for the
client’s state. If the client later migrates to another data center, the Registrar detects that the token
was issued elsewhere and informs the client to repeat the handshake protocol with the current data
center. When possible, the new token reuses the existing client identifier. A client may thus acquire
many session identifiers during its interactions with Thialfi, although it holds only one client token
(and thus one session identifier) at any given time.
The Thialfi client library sends periodic heartbeat messages to the Registrar to indicate that it is online
(a Registrar only sends notifications to online clients). In the current implementation, the heartbeat
interval is 20 minutes, and the Registrar considers a client to be offline if it has not received any message from the client for 80 minutes. Certain channels inform Thialfi in a best-effort manner when a
client disconnects, allowing the Registrar to mark the client offline more quickly. Superficially, these
periodic heartbeats might resemble polling. However, they are designed to be extremely lightweight:
the messages are small, and processing only requires a single in-memory operation in the common
case when the client is already online. Thus, unlike application-level polling, they do not pose a
significant scalability challenge.
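The lightweight heartbeat bookkeeping can be sketched as a single in-memory map, using the 20-minute and 80-minute figures from the text; the `OnlineTracker` class is a hypothetical simplification that ignores persistence and channel disconnect hints.

```python
HEARTBEAT_INTERVAL = 20 * 60   # seconds between client heartbeats
OFFLINE_THRESHOLD = 80 * 60    # no message for this long => client considered offline

class OnlineTracker:
    def __init__(self):
        self.last_seen = {}  # client_id -> timestamp of last message

    def on_message(self, client_id, now):
        # Common case: a single in-memory update per heartbeat.
        self.last_seen[client_id] = now

    def is_online(self, client_id, now):
        seen = self.last_seen.get(client_id)
        return seen is not None and now - seen < OFFLINE_THRESHOLD
```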
Registration Operation
Once a client has completed the initial handshake, it is able to execute registrations. When an application calls Register(), the client library queues a message to send to the Registrar. (As with all
protocol messages, the application dispatches outgoing registrations asynchronously using its channel.) An overview of registration is shown in Figure 4.
1. The client library sends a registration message to the Registrar with the object identifier.
2. The Registrar picks an ordering for the registration by assigning it a sequence number, using the
sequence number it maintains for the issuing client. The Registrar writes the registration to the
client record and adds a new entry to the pending operation set.
3. Subsequently, the Registrar Propagator attempts to forward the registration and the application
ID of the registering client to the Matcher responsible for the object via an RPC, and the Matcher
updates the copy of the registration in its object cache. The Registrar Propagator repeats this
until either propagation succeeds or its process crashes.
4. After propagation succeeds, the Registrar reads the latest version of the object from the Matcher
(which reads the versions from its object cache) and writes a pending notification for it into the
client cache (i.e., updates its copy of the latest version). We call this process Registrar post-propagation. If no version is known, the Registrar generates an unknown-version notification
for the object with the version field set using the sequence number maintained for the client.
Figure 4: Object registration in Thialfi.
5. The Registrar sends a message to the client confirming the registration and removes the operation
from the pending set.
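Steps 2–4 above can be condensed into the following sketch. It is illustrative only: the `Matcher` and `Registrar` classes are hypothetical stand-ins, and the persistence, pending-operation set, and retry-until-success propagation loop are omitted.

```python
class Matcher:
    def __init__(self):
        self.latest_version = {}   # oid -> latest version from the backend
        self.registrations = {}    # oid -> set of registered client ids

    def apply_registration(self, oid, client_id):
        self.registrations.setdefault(oid, set()).add(client_id)
        return self.latest_version.get(oid)  # None if no version is known

class Registrar:
    def __init__(self, matcher):
        self.matcher = matcher
        self.seqno = {}            # client_id -> last sequence number issued
        self.pending_notifs = {}   # client_id -> {oid: pending notification}

    def register(self, client_id, oid):
        # Step 2: order the registration with a per-client sequence number.
        seq = self.seqno.get(client_id, 0) + 1
        self.seqno[client_id] = seq
        # Step 3: propagate to the Matcher (retried until success in Thialfi).
        version = self.matcher.apply_registration(oid, client_id)
        # Step 4: post-propagation writes a pending notification at the latest
        # known version, or an unknown-version notification using the seqno.
        notif = version if version is not None else ("unknown", seq)
        self.pending_notifs.setdefault(client_id, {})[oid] = notif
        return notif
```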
Clients unregister using an analogous process. To keep the registrations at the client and the Registrar in sync, Thialfi uses a Registration Sync Protocol. Each message from the client contains a
digest of the client’s registered objects, and each message from the server contains the digest of the
client’s registrations known to the server (in our current implementation, we compute the digest using
HMAC-SHA1 [10]). If the client or the server detects a discrepancy at any point, the client resends its
registrations to the server. If the server detects the problem, it requests that the client resend them. To
support efficient synchronization for large numbers of objects, we have implemented optional support
for Merkle Trees [18], but no application currently using Thialfi has required this mechanism.
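The digest comparison at the heart of the Registration Sync Protocol can be sketched as below. The paper specifies only that HMAC-SHA1 is used; the key choice and the sorted-iteration scheme here are assumptions for illustration.

```python
import hashlib
import hmac

DIGEST_KEY = b"thialfi-digest"  # hypothetical; the actual key is not specified

def registration_digest(object_ids):
    """Summarize a set of registered objects; sorting makes the digest
    independent of registration order."""
    mac = hmac.new(DIGEST_KEY, digestmod=hashlib.sha1)
    for oid in sorted(object_ids):
        mac.update(oid.encode())
    return mac.hexdigest()

def needs_sync(client_objects, server_objects):
    """Any digest mismatch triggers the client to resend its registrations."""
    return registration_digest(client_objects) != registration_digest(server_objects)
```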
The client library keeps track of the application’s intended registrations via registration/unregistration
API calls. To preserve the registration state across application restarts, the library could write all
registrations to the local disk using the WriteState() call (Section 3.3). To simplify persistence requirements, however, Thialfi relies on applications to restate intended registrations on restart. When
a client restarts, the client library invokes ReissueRegistrations(). The library then recomputes the
digest and sends it as part of the regular communication with the server (e.g., in heartbeats). Any discrepancy in the registrations is detected and resolved using the Registration Sync Protocol discussed
above. In the normal case when digests match, no registrations are resent to the server.
Notification Operation
As users modify data, client applications send updates to application servers in the data center. Application servers apply the updates and publish notifications to be delivered by Thialfi. Figure 5 shows
the sequence of operations by which Thialfi delivers notifications to registered clients.
1. The application server updates its authoritative copy of user data and notifies Thialfi of the new
version number. Applications publish notifications using a library that ensures each published
notification is received by all data centers running Thialfi. Currently, we use an internal Google
infrastructure publish/subscribe service to disseminate messages to data centers. The pub/sub
service acknowledges the Publisher library only after a reliable handoff, ensuring eventual delivery. (During periods of subscriber unavailability, the pub/sub service buffers notifications in
a persistent log.)
2. Thialfi’s Bridge component consumes the feed of published notifications in each data center
and processes them in small batches. The Bridge delivers the update to the Matcher server
responsible for the object.
3. The Matcher updates its record for the object with the new version number. Subsequently, using
its copy of the registered client list, the Matcher propagator determines which Registrar servers
have clients registered for the object. It sends RPCs to each Registrar server with (client, oid,
version) tuples indicating which clients need to be notified. The client identifiers are used to
index the Registrar’s C/O-Cache efficiently.
Figure 5: Notification delivery in Thialfi.
Registrar Table — row key: client identifier; columns: Client State, Object State, Propagation State.
Matcher Table — row key: object identifier; columns: Object State, Client State, Propagation State.
Table 2: Bigtable layout for server-side state. a@b indicates a value a at timestamp b. seqno
refers to the sequence number assigned by the Registrar for that particular client.
4. Each Registrar receiving a message stores the pending notification for the appropriate clients
and responds to the RPC.
5. When all Registrars have responded, the operation is removed from the Matcher pending operation set.
6. Periodically, the Registrars resend unacknowledged notifications for online clients. Currently,
we use a 60-second retransmission interval.
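The Matcher-to-Registrar fanout in steps 3–5 can be sketched as follows. This is a hypothetical simplification: `RegistrarShard` and `matcher_fanout` are illustrative names, and the retry logic for failed RPCs is omitted.

```python
class RegistrarShard:
    def __init__(self):
        self.pending = {}  # client_id -> {oid: version awaiting acknowledgement}

    def store_notifications(self, tuples):
        # Step 4: record the pending notification for each client, then ack.
        for client_id, oid, version in tuples:
            self.pending.setdefault(client_id, {})[oid] = version
        return True

def matcher_fanout(registered_clients, oid, version, shard_for):
    """Fan a new version out to the Registrar shards responsible for the
    registered clients; the operation stays pending until all shards ack."""
    by_shard = {}
    for client_id in registered_clients:
        by_shard.setdefault(shard_for(client_id), []).append((client_id, oid, version))
    acks = [shard.store_notifications(tuples) for shard, tuples in by_shard.items()]
    # Step 5: remove from the Matcher pending set only when every RPC succeeded.
    return all(acks)
```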
Handling Server Failures
We now discuss how a server reconstructs its in-memory state after a restart (an independent infrastructure system at Google monitors and restarts services that have crashed or become unresponsive).
For simplicity, consider a brute-force approach: if any server fails, all servers restart, and the data
center identifier is changed to a new value. Subsequent messages from clients with old tokens are
detected by the Registrars, triggering a token update as described in §4.1.2. The Registration Sync
Protocol then ensures that the clients reissue their registrations.
Client registration messages are sufficient to reconstruct the registration state at the Registrar. The
latest-version data at the Matcher is not recovered (and pending notifications are lost) since there is
no mechanism to fetch version information from the application backend. Nonetheless, correctness
is not compromised. When processing client registrations, the Registrar will send unknown-version
notifications for each registered object. This triggers client requests to the application backend to
learn the latest version. Such an approach is conservative since the data may not have changed, but
Thialfi cannot easily confirm this. After restart, Thialfi resumes normal processing of updates.
Handling Network Failures
There are three types of messages sent between the client and server: client token requests, registration changes, and notifications / acks. Any of these may be lost, reordered, or duplicated. Notifications are acknowledged and hence reliably delivered, and reordering and duplication are explicitly
permitted by the semantics of Thialfi. All other messages are retried by the client as needed. Clients
detect and ignore duplicate or reordered token grant messages from the Registrar using a nonce, and
the Registration Sync Protocol ensures that client and server registration state eventually converge.
Persistent Storage
At the scale of millions of clients, recovering from failures by flushing and reconstructing state is
impractical. Some retention of state is required to reduce work during recovery. In this section, we
describe how Thialfi currently uses Bigtable [7] to address this issue. The main idea guiding our use
of persistent storage is that updates to the C/O-Cache in the memory-only design translate directly
into blind writes into a Bigtable; i.e., updating state without reading it. Because Bigtable is based on
a log-structured storage system, writes are efficient and fast.
Bigtable Layout
Storage locations in a Bigtable (Bigtable cells) are named by {row key, column, version} tuples, and
Bigtables may be sparse; i.e., there may be many cells with no value. We exploit this property in our
storage layout to avoid overwrites. For example, in the Registrar table, for a particular client/object
registration pair, we use a distinct row key (based on the client ID), column (based on the object ID),
and version (based on the registration sequence number). When querying the registration status for
that client/object pair, we simply read the latest version.
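The blind-write pattern over sparse cells can be sketched with an in-memory stand-in for Bigtable; the `SparseTable` class and the `reg:` column naming are hypothetical, but the (row, column, version) cell addressing mirrors the layout described above.

```python
class SparseTable:
    """Minimal stand-in for a sparse Bigtable: cells are named by
    (row key, column, version) and written blindly, with no read-modify-write."""
    def __init__(self):
        self.cells = {}  # (row, column) -> {version: value}

    def write(self, row, column, version, value):
        self.cells.setdefault((row, column), {})[version] = value

    def read_latest(self, row, column):
        versions = self.cells.get((row, column))
        if not versions:
            return None
        return versions[max(versions)]

# A distinct cell per (client, object, seqno): registrations and
# unregistrations never overwrite each other; the latest version wins.
table = SparseTable()
table.write("client-1", "reg:obj-A", version=1, value="registered")
table.write("client-1", "reg:obj-A", version=2, value="unregistered")
```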
Adapting our in-memory representation to Bigtable is straightforward. Registrar and Matcher state
is stored in separate Bigtables. The partitioning keys used in the in-memory system become the
row keys used in the Bigtables, distributing load uniformly. We continue to statically partition the
keyspace over the Registrar and Matcher servers. Each server is thus assigned a contiguous range of
Bigtable rows.
The Bigtable schema is summarized in Table 2. Each row of the Matcher table stores the latest
known version for an object, the application ID of the client that created that version, and the set
of clients registered for that object. Each Registrar row stores the client’s application ID, the latest
sequence number that was generated for the client by the Registrar, a channel-specific address if the
client is online, the object IDs that the client is registered for, and the objects for which the client
has an unacknowledged notification. Each table also contains a column for tracking which rows
have pending information to propagate to the other table. Note that a cell is written in the last-seqno
column whenever a sequence number is used for the client. This ensures that sequence numbers
always increase.
In-memory State
In order to improve performance, we cache a small amount of state from Bigtable in Registrar and
Matcher server memory. The Registrars cache the registration digest of each online client (but not
the full set of registrations). The Matchers and Registrars also cache their pending operation sets.
We rely on Bigtable’s memory cache for fast reads of the registrations and pending notifications.
Since our working set currently fits in Bigtable’s memory cache, this has not created a performance
problem. (We may revisit this decision if emerging workloads change our Bigtable memory cache behavior.)
The outcome of these properties is that the in-memory state of Thialfi servers corresponds to in-progress operations and limited data for online clients only.
Pushing Notifications to Clients
As with the in-memory design, reliable notification delivery to clients is achieved by scanning for
unacknowledged notifications. Instead of memory, the scan is over the Registrar Bigtable. For efficiency and performance, we also introduce a fast path: we unreliably send notifications to online
clients during Matcher propagation. While channels are unreliable, message drops are rare, so this
fast path typically succeeds. We confirm this in our evaluation (§6).
Realizing that a lengthy periodic scan adversely impacts the tail of the notification latency distribution, we are currently implementing a scheme that buffers undelivered notifications in Registrar
memory to more quickly respond to failures.
Client Garbage Collection
If a client remains offline for an extended period (e.g., several days), Thialfi garbage-collects its
Bigtable state. This involves deleting the client’s row in the Registrar Bigtable and deleting any
registration cells in the Matcher Bigtable. If the client later comes back online, our use of blind
writes means that the client’s row may be inadvertently recreated. Although rare, some mechanism is
required to detect such an entry, remove it, and notify the client that it must restart with a fresh client
ID.
In order to detect client resurrection after garbage collection, Thialfi maintains a created cell in the
client’s Registrar row (Table 2). The Registrar writes this cell when it assigns an ID for a client, and
the garbage collector deletes it; no other operations modify this cell. If a garbage collected client
comes back online as described above, its created cell will be absent from the recreated row. An
asynchronous process periodically scans the Registrar Table for rows without created cells. When
encountered, the ‘zombie’ client row is deleted. Also, if the client is online, it is informed that its
ID is invalid. Upon receiving this message, the client discards its ID and reconnects as a new client.
This message may be lost without compromising correctness; it will be resent by the asynchronous
process if the client attempts further operations.
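The created-cell mechanism can be sketched as below. The `RegistrarRows` class is a hypothetical in-memory stand-in for the Registrar Bigtable, and the asynchronous scanner is shown as a synchronous method for clarity.

```python
class RegistrarRows:
    def __init__(self):
        self.rows = {}  # client_id -> set of cells present in the row

    def create_client(self, client_id):
        # Only ID assignment writes the created cell.
        self.rows[client_id] = {"created"}

    def blind_write(self, client_id, cell):
        # Blind writes may inadvertently resurrect a garbage-collected row,
        # recreating it without its created cell.
        self.rows.setdefault(client_id, set()).add(cell)

    def garbage_collect(self, client_id):
        self.rows.pop(client_id, None)

    def scan_for_zombies(self):
        """Rows lacking a created cell were resurrected after garbage
        collection; delete them and report the clients so they can be told
        to reconnect with a fresh ID."""
        zombies = [c for c, cells in self.rows.items() if "created" not in cells]
        for c in zombies:
            del self.rows[c]
        return zombies
```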
Recovery from Server Failures
We now describe how persistent storage reduces the burden of failure recovery. The server caches
of Bigtable state and of pending operations are write-through caches, so they may be restored after a
restart by simply scanning the Bigtable. Since each server is assigned a contiguous range, this scan is
efficient. Additionally, scanning to recover pending operations yields a straightforward strategy for
shedding load during periods of memory pressure: a server aborts in-progress propagations, evicts
items from its pending operation set, and schedules a future scan to recover.
If required, all Bigtable state can be dropped, with recovery proceeding as in the in-memory design.
In practice, this has simplified service administration significantly; e.g., when performing a Bigtable
schema change, we simply drop all data, avoiding the complexity of migration.
Tolerating Storage Unavailability
A consequence of storing state in Bigtable is that Thialfi’s overall availability is limited by that of
Bigtable. While complete unavailability is extremely rare, a practical reality of large-scale storage
is partial unavailability—the temporary failure of I/O operations for some rows, but not all. In our
experience, minor Bigtable unavailability occurs several times per day. Our asynchronous approach
to data propagation accommodates storage unavailability. I/O failures are skipped and retried, but do
not prevent partial progress; e.g., clients corresponding to available regions will continue to receive notifications.
This covers the majority of Thialfi I/O with two exceptions: 1) the initial write when accepting a
client operation, e.g., a registration, and 2) the write accepting a new version of an object at the
Matcher. In the first case, the client simply retries the operation.
However, accepting new versions is more complex. One possibility is to have the Bridge delay
the acknowledgement of a notification to the publish/subscribe service until the Matcher is able to
perform the write. This approach quickly results in a backlog being generated for all notifications
destined for the unavailable Matcher rows. Once a large backlog accumulates, the pub/sub service
no longer delivers new messages, delaying notifications for all clients in the data center. Even in
the absence of our particular pub/sub system, requiring application backends to buffer updates due to
partial Thialfi storage unavailability would significantly increase their operational complexity.
Given the prevalence of such partial storage unavailability in practice, we have implemented a simple
mechanism to prevent a backlog from being generated. To acknowledge a notification, the Bridge
needs to record the latest version number somewhere in stable storage. It need not be written to
the correct location immediately, so long as it is eventually propagated there. To provide robustness
during these periods, we reissue failed writes to a distinct, scratch Bigtable. A scanner later retries
the writes against the Matcher Bigtable. The Everest system [19] uses a similar technique to spread
load; in Thialfi, such buffering serves to reduce cascading failures.
Specifically, for a given object, we deterministically compute a sequence of retry locations in a scratch
Bigtable. These are generated by computing a salted hash over the object ID, using the retry count
as the salt. This computation exploits Thialfi’s relaxed semantics to reduce the amount of scratch
storage required; successive version updates to the same object overwrite each other in the scratch
table when the first scratch write succeeds. Storing failed updates in random locations—a simple
alternative—would retain and propagate all updates instead of only the latest. While correct, this is
inefficient, particularly for hot objects. Our scheme efficiently supports the common case: a series of
Matcher writes fails, but the first attempt of each corresponding scratch write succeeds.
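The deterministic retry-location computation can be sketched as follows; the exact hash function and key format are assumptions, but the salted-by-retry-count structure matches the description above.

```python
import hashlib

def scratch_row_key(object_id, retry_count):
    """Deterministic location in the scratch Bigtable: a hash of the object
    ID salted with the retry count. Successive version updates for the same
    object at the same retry count land on the same key and overwrite each
    other, so only the latest update is retained and propagated."""
    salted = "%d:%s" % (retry_count, object_id)
    return hashlib.sha1(salted.encode()).hexdigest()
```

In the common case named in the text, every Matcher write fails but the first scratch write succeeds, so each object occupies a single scratch cell at retry count zero.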
Supporting Multiple Data Centers
To meet availability requirements at Google, Thialfi must be replicated in multiple data centers. In
this section, we describe the extensions required to support replication, completing the description of
Thialfi’s design. Our goal is to ensure that a site failure does not degrade reliability; i.e., notifications
may be delayed, but not dropped. Clients migrate when a failure or load balancing event causes
protocol messages to be routed from the Thialfi data center identified in the client’s session token to
a Thialfi instance in another data center.
We require that the application’s channel provide client affinity; i.e., Thialfi messages from a given
client should be routed to the same data center over short time scales (minutes). Over longer time
scales, clients may migrate among data centers depending on application policies and service availability. Also, when a Thialfi data center fails, we require the application channel to re-route messages
from clients to other data centers. These characteristics are typical for commonly used channels.
Even without replication of registration state, Thialfi can automatically migrate clients among data
centers. When a client connects to a new data center, the Registrar instructs it to repeat the token-assignment handshake, by which it obtains a new token (§4.1.2). Since the new data center has no
information about the client’s registrations, the client and server registration digests will not match,
triggering the Registration Sync Protocol. The client then reissues all of its registrations. While
correct, this is expensive; a data center failure causes a flood of re-registrations. Thus, replication is
designed as an optimization to decrease such migration load.
State Replication
Thialfi uses two forms of state replication: 1) reliable replication of notifications to all data centers
and 2) best-effort replication of registration state. The pub/sub service acknowledges the Publisher
library after a reliable handoff and ensures that each notification is reliably delivered to all Thialfi
data centers; the Thialfi Matchers in each data center acknowledge the notification only after it has
been written to stable storage.
When replicating registration state, we use a custom, asynchronous protocol that replicates only the
state we must reconstruct during migration. Specifically, we replicate three Registrar operations between Thialfi data centers: 1) client ID assignment, 2) registrations, and 3) notification acknowledgements. Whenever a Registrar processes one of these operations, it sends best-effort RPC messages to
the Registrars in other data centers. At each data center, replication agents in the Registrar consume
these messages and replay the operations. (While we have implemented and evaluated this scheme,
we have not yet deployed it in production.)
We initially attempted to avoid designing our own replication scheme. A previous design of Thialfi
used a synchronous, globally consistent storage layer called Megastore [2]. Megastore provides
transactional storage with consistency guarantees spanning data centers. Building on such a system
is appealingly straightforward: simply commit a transaction that updates relevant rows in all data
centers before acknowledging an operation. Unfortunately, micro-benchmarks show that Megastore
requires roughly 10 times more operations per write to its underlying Bigtables than a customized
approach. For a write-intensive service like Thialfi, this overhead is prohibitive.
Although the Thialfi replication protocol is designed to make migration efficient, an outage still
causes a spike in load. During a planned outage, we use an anti-storm technique to spread load.
During a migration storm, Thialfi silently drops messages from a progressively-decreasing fraction
of migrated clients at the surviving data centers, trading short-term unavailability for reduced load.
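One way to realize the anti-storm admission decision is sketched below. The linear ramp and the `ramp_minutes` knob are hypothetical; the paper specifies only that a progressively decreasing fraction of migrated clients is dropped.

```python
import hashlib

def admit_migrated_client(client_id, minutes_since_outage, ramp_minutes=60):
    """During a migration storm, serve only a fraction of migrated clients,
    growing over time until all are admitted. Hashing the client ID gives
    each client a stable position in [0, 1), so the admitted set only grows."""
    admitted_fraction = min(1.0, minutes_since_outage / ramp_minutes)
    bucket = int(hashlib.sha1(client_id.encode()).hexdigest(), 16) % 1000 / 1000
    return bucket < admitted_fraction
```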
Reliability
In this section, we describe Thialfi’s notion of reliability and argue that our mechanisms provide it.
We define reliable delivery as follows:
Reliable delivery property: If a well-behaved client registers for an object X, Thialfi ensures that the client will
always eventually learn of the latest version of X.
A well-behaved client is one that faithfully implements Thialfi’s API and remains connected long
enough to complete required operations, e.g., registration synchronization. In our discussion, we
make further assumptions regarding integrity and liveness of dependent systems. First, we assume
that despite transitory unavailability, Bigtable tablets will eventually be accessible and will not corrupt
stored data. Second, we assume that the communication channel will not corrupt messages and will
eventually deliver them given sufficient retransmissions.
As is typical for many distributed systems, Thialfi’s reliability goal is one-sided. By this we mean
that, while clients will learn the latest version of registered objects, notifications may be duplicated
or reordered, and intermediate versions may be suppressed.
Thialfi achieves end-to-end reliability by ensuring that state changes in one component eventually
propagate to all other relevant components of the system. We enumerate these components and their
interactions below and discuss why state transfer between them eventually succeeds. We have not
developed a formal model of Thialfi nor complete proofs of its safety or liveness; these are left as
future work.
Registration state is determined by the client, from which it propagates to the Registrar and Matcher
(subject to access control policies). The following mechanisms ensure the eventual synchronization
of registration state across the three components:
• Client ↔ Registrar: Every message from the client includes a digest that summarizes all client
registration state (§4.1.3). If the client-provided digest disagrees with the state at the Registrar, the
synchronization protocol runs, after which client and server agree. Periodic heartbeat messages
include the registration digest, ensuring that any disagreement will be detected.
• Registrar → Matcher: When the Registrar commits a registration state change to Bigtable, a
pending work marker is also set atomically. This marker is cleared only after all dependent writes
to the Matcher Bigtable have completed successfully. All writes are retried by the Registrar Propagator if any failure occurs. (Because all writes are idempotent, this repetition is safe.)
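The pending-work-marker pattern in the Registrar → Matcher item can be sketched as follows; the `PropagatorSketch` name and structure are hypothetical, but the set-atomically / clear-after-success / retry-safely discipline is the one described above.

```python
class PropagatorSketch:
    """Pending-work marker: set atomically with the registration write and
    cleared only after the dependent Matcher write succeeds. Retrying is
    safe because every write is idempotent."""
    def __init__(self, matcher_write):
        self.pending = set()          # rows with un-propagated changes
        self.matcher_write = matcher_write

    def commit_registration(self, row):
        self.pending.add(row)         # marker set with the registration

    def run_once(self):
        for row in list(self.pending):
            try:
                self.matcher_write(row)   # idempotent dependent write
            except IOError:
                continue                  # leave the marker set; retry later
            self.pending.discard(row)     # clear only after success
```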
Notification state comes from the Publisher, which provides a reliable feed of object-version pairs
via the pub/sub service. These flow reliably through the Bridge, Matcher, and Registrar to the client
using the following mechanisms:
• Bridge → Matcher: Notifications are removed from the update feed by the Bridge only after
they have been successfully written to either their appropriate location in the Matcher Bigtable
or buffered in the Matcher scratch Bigtable. A periodic task in the Bridge reads the scratch table
and resends the notifications to the Matcher, removing entries from the scratch table only after a
successful Matcher write.
• Matcher → Registrar: When a notification is written to the Matcher Bigtable, a pending work
marker is used to ensure eventual propagation. This mechanism is similar to that used for Registrar
→ Matcher propagation of registration state.
Notification state also flows from the Matcher to the Registrar in response to registration state
changes. After a client registers for an object, Registrar post-propagation will write a notification
at the latest version into the client’s Registrar row (§4.1.3). This ensures that the client learns of
the latest version even if the notification originally arrived before the client’s registration.
• Registrar → Client: The Registrar retains a notification for a client until either the client acknowledges it or a subsequent notification supersedes it. The Registrar periodically retransmits
any outstanding notifications while the client is online, ensuring eventual delivery.
Taken together, local state propagation among components provides end-to-end reliability. Specifically:
• A client’s registration eventually propagates to the Matcher, ensuring that the latest notification
received for the registered object after the propagation will be sent to the client.
• Registrar post-propagation ensures that a client learns the version of the object known to Thialfi
when its registration reached the Matcher. If no version was present at the Matcher, the client
receives a notification at unknown version.
The preceding discussion refers to system operation within a single data center. In the case of multiple data centers, our Publisher Library considers notification publication complete only after the
notification has been accepted by the Matcher or buffered in the persistent storage of Google’s infrastructure publish/subscribe service in all data centers. Thus, each application’s notifications are
reliably replicated to all data centers. This is in contrast to Thialfi’s registration state, which is replicated on a best-effort basis. However, so long as a client is not interacting with a given data center,
there is no harm in the registration state being out-of-sync there. When the client migrates to a new
data center, the Registration Sync Protocol (§4.1.3) ensures that the new Registrar obtains the client’s
current registration state. The propagation and post-propagation mechanisms described above also
apply in the new data center, ensuring that the new Registrar will reliably inform the client of the
latest version of each registered object. Taken together, these mechanisms provide reliable delivery
when operating with multiple data centers.
Evaluation
Thialfi is a production service that has been in active use at Google since the summer of 2010.
We report performance from this deployment. Additionally, we evaluate Thialfi’s scalability and
fault tolerance for synthetic workloads at the scale of millions of users and thousands of updates per
second. Specifically, we show:
• Ease of adoption: Applications can adopt Thialfi with minimal design and/or code changes. We
describe a representative case study, the Chrome browser, for which a custom notification service
was replaced with Thialfi. (§6.1)
• Scalability: In production use, Thialfi has scaled to millions of users. Load testing shows that
resource consumption scales linearly with active users and notification rate while maintaining
stable notification latencies. (§6.2)
• Performance: Measurements of our production deployment show that Thialfi delivers 88% of
notifications in less than one second. (§6.3)
• Fault-tolerance: Thialfi is robust to the failure of an entire data center. In a synthetic fail-over experiment, we rapidly migrate over 100,000 clients successfully and quantify the over-provisioning
required at remaining instances in order to absorb clients during fail-over. We also provide measurements of transient unavailability in production that demonstrate the practical necessity of coping with numerous short-term faults. (§6.4)
Chrome Sync Deployment
Chrome supports synchronizing client bookmarks, settings, extensions, and so on among all of
a user’s installations. Initially, this feature was implemented by piggy-backing on a previously-deployed chat service. Each online client registered its presence with the chat service and would
broadcast a chat metadata message notifying online replicas that a change had committed to the
back-end storage infrastructure. Offline clients synchronized data on startup. While appealingly
simple, this approach has three drawbacks:
• Costly startup synchronization: The combined load of synchronizing clients on startup is significant at large scale. Ideally, synchronization of offline clients would occur only after a change in
application data, but no general-purpose signaling mechanism was available.
• Unreliable chat delivery: Although generally reliable, chat message delivery is best-effort. Even
when a client is online, delivery is not guaranteed, and delivery failures may be silent. In some
cases, this resulted in a delay in synchronization until the next browser restart.
• Lack of fate-sharing between updates and notifications: Since clients issue both updates and
change notifications, the update may succeed while the notification fails, leading to stale replicas.
Ensuring eventual broadcast of the notification with timeout and retry at the client is challenging;
e.g., a user may simply quit the program before it completes.
While these issues might have been addressed with specific fixes, the complexity of maintaining a reliable push-based architecture is substantial. Instead, Chrome adopted a hybrid approach: best-effort
push with periodic polling for reliability. Unfortunately, the back-end load arising from frequent
polling was substantial. To control resource consumption, clients polled only once every few hours.
This again gave rise to lengthy, puzzling delays for a small minority of users and increased complexity
from maintaining separate code paths for polling and push updates.
These issues drove Chrome’s adoption of Thialfi, which addresses the obstacles above. Thialfi clients
are persistent; offline clients receive notifications on startup only if a registered object has changed
or the client has been garbage collected. This eliminates the need for synchronization during every
startup. Thialfi provides end-to-end reliability over the best-effort communication channel used by
Chrome, thereby easing the porting process. Finally, Thialfi servers receive notifications directly from
Chrome’s storage service rather than from clients, ensuring that notification delivery is fate-shared
with updates to persistent storage.
Migrating from custom notifications to Thialfi required modest code additions and replaced both
the previous push and polling notification support. Chrome includes Thialfi’s C++ client library,
implements our API (Figure 2), and routes Thialfi notifications to appropriate Chrome components.
Figure 6: Resource consumption and notification latency as active users increase.
In full, Chrome’s Thialfi-specific code is 1,753 lines of commented C++ code (535 semicolons).
Scalability
We evaluate Thialfi’s scalability in terms of resource consumption and performance. We show that
resource consumption increases proportionally with increases in load. With respect to performance,
we show that notification latencies are stable as load increases, provided sufficient resources. These
measurements confirm our practical experience. To support increasing usage of Thialfi, we need only
allocate an incremental amount of additional infrastructure resources. The two main contributors to
Thialfi’s load are 1) the number of active users and 2) the rate at which notifications are published.
We consider each in turn, measuring synthetic workloads on shared Google clusters. While our
experiments are not performance-isolated, the results presented are consistent over multiple trials.
Increasing active users: Increasing the number of active users exercises registration, heartbeat processing, and client / session assignment. To measure this, we recorded the resource consumption of
Thialfi in a single data center while adding 2.3 million synthetic users. Each user had one client (the
number of clients per user does not impact performance in Thialfi). Clients arrived at a constant rate
of 570 per second. Each registered for five distinct objects and issued a random notification every 8
minutes and a heartbeat message every 20 minutes. The version of each notification was set to the
current time, allowing registered clients to measure the end-to-end latency upon receipt.
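The synthetic workload above can be sketched as a small generator. This is a hedged illustration, not the actual load-test harness; the function and field names are hypothetical. It captures the described parameters (five objects per client, notifications every 8 minutes, heartbeats every 20 minutes) and the trick of setting each notification's version to the publish time so receivers can compute end-to-end latency.

```python
import random

def make_synthetic_client(client_id, num_objects=5,
                          notify_period_s=480, heartbeat_period_s=1200):
    """Sketch of one synthetic load-test client (names hypothetical)."""
    objects = [f"client{client_id}/obj{i}" for i in range(num_objects)]

    def next_events(now):
        # Schedule the client's periodic traffic relative to `now`.
        return {
            "register": objects,                               # once, at arrival
            "notify_at": now + random.uniform(0, notify_period_s),
            "heartbeat_at": now + heartbeat_period_s,
        }
    return next_events

def latency_on_receipt(version_timestamp, receive_time):
    # The version carries the publish wall-clock time, so end-to-end
    # latency is simply receipt time minus version.
    return receive_time - version_timestamp
```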
Figure 6 shows the results. As a proxy for overall resource consumption, we show the increasing CPU consumption as users arrive. Demand for other resources (network traffic, RPCs, memory)
grows similarly. The CPU data is normalized by the amount required to support a baseline of 100,000
users. Overall, increasing active users 23-fold (from 100,000 to 2.3 million) requires ∼3× the resources. Throughout this increase, median notification delays are stable, ranging from 0.6 to 0.7
seconds. (Because these synthetic clients are local to the data center, delays do not include wide-area
messaging latency.)
Increasing notification rate: Increasing the notification rate stresses Matcher to Registrar propagation. In this case, we measure resource consumption while varying the notification rate for a fixed set
of 1.4 million synthetic clients that have completed registrations and session assignment; all clients
were online simultaneously for the duration of the experiment. As in the previous measurements,
each client registered for five objects and each user had one client.
Figure 7 shows the results of scaling the notification rate. We report CPU consumption normalized by the amount required to support a baseline notification rate of 1,000 per second and increase the rate by 1,000 up to 13,000. As before, median notification delays remain stable with proportional resource consumption.
Figure 7: Resource consumption and notification latency as the notification rate increases.
Figure 8: Cumulative distribution of notification latencies randomly sampled from our live deployment.
The previous measurements quantify median performance for synthetic workloads. We next examine
the distribution of notification latencies observed in our production deployment. Each Thialfi component tracks internal propagation delays by appending a log of timestamps to each notification as it
flows through the system.
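The per-stage tracking just described can be sketched as a timestamp log carried with each notification. This is a hypothetical reconstruction of the mechanism (field and function names are ours): each component appends a (component, timestamp) pair, and pairwise differences between consecutive stamps recover the per-stage delays.

```python
import time

def stamp(notification, component, now=None):
    """Append a (component, timestamp) pair to the notification's
    trace log as it passes through a component (names hypothetical)."""
    notification.setdefault("trace", []).append(
        (component, time.time() if now is None else now))
    return notification

def stage_delays(notification):
    # Differences between consecutive stamps give each stage's delay.
    t = notification["trace"]
    return [(t[i][0] + "->" + t[i + 1][0], t[i + 1][1] - t[i][1])
            for i in range(len(t) - 1)]
```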
Figure 8 shows a CDF of 2,514 notifications sampled over a 50-minute period from an active Thialfi cell. 88% of notifications are dispatched in less than one second. However, as is typical in
asynchronous distributed systems operating on shared infrastructure, a minority of messages may be
delayed for much longer, exceeding two seconds in our measurements.
We point out that these delays do not include delivery and acknowledgements from clients themselves; we measure only the delay within Thialfi from the receipt of a notification to the first attempt
to send it to an online client. End-to-end delays vary significantly due to the variable quality of channels and the lengthy delays incurred by offline clients. In practice, network propagation adds between
30–100 ms to overall notification latency.
Most of Thialfi’s delay, however, is self-imposed. Our current implementation aggressively
batches Bigtable operations and RPC dispatch to increase efficiency. This is illustrated in Figure 9,
which shows the delay for each stage of notification delivery averaged over a 10-minute interval. This data is drawn from our production deployment.
Figure 9: The average contribution to overall notification delay of each Thialfi system component (in milliseconds): Publisher to Thialfi bridge (82), Bridge to Matcher batched RPC (265), Matcher Bigtable batched write (268), read client list (3), batched RPC to Registrar (82).
The Publisher library appends an initial timestamp
when the notification is generated by the application, and its propagation delay to Thialfi’s bridge
is fundamental. Once received, the RPC sending a notification from the bridge to the Matcher is
batched with a maximum delay of 500 ms. Matcher Bigtable writes are similarly batched. During
propagation, the Matcher reads the active client list—this data is typically retrieved directly from
Bigtable’s in-memory cache. Finally, the propagation RPC to the Registrar has a batch delay of
200 ms.
The majority of our current applications use Thialfi as a replacement for lengthy polling, and the
sub-second delays associated with batching are acceptable. But, as Figure 9 shows, we can further
reduce Thialfi’s delay by simply reducing the batching delay of relevant components. This increases
resource demands but does not introduce any fundamental scalability bottlenecks.
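The delay-bounded batching described above can be sketched as follows. This is a minimal illustration of the technique, not Thialfi's implementation; the class name and parameters are hypothetical. Items accumulate until either the batch fills or the oldest item has waited the maximum delay (e.g., 500 ms for the bridge-to-Matcher RPC, 200 ms for Matcher-to-Registrar propagation).

```python
class Batcher:
    """Sketch of delay-bounded batching: buffer items and flush when the
    batch is full or the oldest item has waited max_delay (hypothetical)."""

    def __init__(self, max_delay, max_size, flush_fn):
        self.max_delay, self.max_size, self.flush_fn = max_delay, max_size, flush_fn
        self.items, self.oldest = [], None

    def add(self, item, now):
        if not self.items:
            self.oldest = now          # start the latency clock
        self.items.append(item)
        if len(self.items) >= self.max_size:
            self._flush()              # flush on size

    def tick(self, now):
        # Called periodically; enforces the per-batch latency bound.
        if self.items and now - self.oldest >= self.max_delay:
            self._flush()

    def _flush(self):
        self.flush_fn(self.items)
        self.items, self.oldest = [], None
```

Shrinking max_delay reduces notification latency at the cost of more, smaller batches, which is exactly the trade-off the text describes.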
Fault Tolerance
We evaluate fault tolerance in two ways. First, we examine fail-over of clients between data centers.
This exercises our synchronization protocol and quantifies the over-provisioning required to cope
with data center failure in practice. Second, we present a month-long trace of how often Thialfi
buffers incoming notifications to cope with small periods of partial Matcher unavailability. This
shows the practical necessity for our techniques.
Data center fail-over: The failure of a data center requires that clients be migrated to a new instance and their state synchronized with new servers. Migration can be expensive at the server; it
requires reading the set of registered objects, computing the digest, sending pending notifications,
and processing registration requests (if any). Applications with few updates and/or lengthy heartbeat
intervals naturally spread migration load over time. Here, we consider a more challenging case: rapidly migrating tens of thousands of clients with very frequent heartbeats to ensure rapid failure detection.
We instantiated 380,000 clients spread uniformly across three distinct Thialfi data centers with a
heartbeat interval of 30 seconds. Each client registered for five objects and generated random notifications yielding an incoming notification rate of roughly 11,000/sec across all clients. After allowing
the system to stabilize, we halted the Thialfi instance of one data center while measuring the CPU
consumption of the remaining two as well as the overall client notification rate. The failed data center was not restored for the duration of the experiment. Note that this experiment was performed
using a prior version of the Registration Sync Protocol; rather than including the registration digest
in each message, clients request the full registration list during migration. This modification has not
significantly changed resource consumption in practice.
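The digest used by the current Registration Sync Protocol can be sketched as follows. The exact construction is not specified here, so this is a hedged illustration (the paper's references suggest SHA-1 and Merkle-style hashing, but the details below are assumptions): a hash over the sorted registered-object IDs travels with each client message, and only on a mismatch does the client resynchronize its full registration list.

```python
import hashlib

def registration_digest(object_ids):
    """Sketch of a registration digest (construction hypothetical):
    a hash over the sorted registered-object IDs, sent with each
    client message so servers can detect divergence cheaply."""
    h = hashlib.sha1()
    for oid in sorted(object_ids):
        h.update(oid.encode("utf-8"))
    return h.hexdigest()

def needs_resync(client_objects, server_digest):
    # Only on mismatch does the client upload its full registration list.
    return registration_digest(client_objects) != server_digest
```

During migration, a matching digest lets the new data center skip the expensive full-list exchange entirely.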
Figure 10: CPU usage and notification rate during the sudden failure of a Thialfi data center.
Figure 11: A month-long trace of notification buffering during Matcher unavailability or
Matcher storage unavailability.
Figure 10 shows the results. We normalize CPU usage by the first observation taken in steady state.
After several minutes, we fail one data center, which clients detect after three failed heartbeats. This
is reflected by increased CPU consumption at the remaining instances and a sudden drop in notification receive rate corresponding to clients in the failed data center. As clients migrate, accumulated
notifications are discharged as clients are brought up-to-date. Shortly after, the system stabilizes. To
migrate 33% of clients over several minutes, Thialfi requires over-provisioning by a factor of ∼1.6.
Matcher unavailability: Thialfi’s provisions for fault tolerance arise from practical experience. For
example, our implementation buffers notifications to a temporary Bigtable to cope with transient unavailability (§4.2.6). This mechanism was added after our initial deployment in response to frequent
manual intervention to respond to failures. Figure 11 shows a month-long trace of notification buffering, confirming the need for error handling in practice. After deploying this solution, the number
of alerts that occurred due to a backlog disappeared completely. We point out that buffering occurs
not only during storage unavailability but any unavailability of the Matcher, e.g., during software
upgrades or restarts. Support for automatically buffering notifications without manual action during
these periods has greatly simplified service administration.
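The buffering behavior described above can be sketched as follows. This is an illustrative stand-in, not the production code; the class and method names are hypothetical, and a Python list stands in for the temporary Bigtable. The essential property is that a failed Matcher write parks the notification rather than dropping it or paging an operator, and a later drain replays the backlog.

```python
class BufferingBridge:
    """Sketch of buffering during Matcher unavailability (names
    hypothetical): failed writes are parked and replayed later."""

    def __init__(self, matcher_write):
        self.matcher_write = matcher_write  # returns True on success
        self.buffer = []                    # stands in for the temporary Bigtable

    def publish(self, notification):
        if not self.matcher_write(notification):
            self.buffer.append(notification)  # park it; no manual intervention

    def drain(self):
        # Retry buffered notifications, e.g. after an upgrade or restart.
        pending, self.buffer = self.buffer, []
        for n in pending:
            self.publish(n)
```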
Related Work
The problem of scalable event notification has received significant attention in the distributed systems
community, which we draw on in our design. Thialfi differs from existing work in two principal ways.
The first is the constraints of our environment. Thialfi’s design stems from the unique requirements
of Internet applications, infrastructure services, and the failures they exhibit. The second difference is
our goal. Our API and semantics provide developers with reliability that simplifies development, but
Thialfi does not impose significant restrictions on an application’s runtime environment or software stack.
Thialfi builds on existing infrastructure services widely used at Google. We use Bigtable [7] to
store object and client data. The Chubby lock service [4] provides reliable, consistent naming and
configuration of our server processes. While specific to Google, the functionality of these systems is
being increasingly replicated by open source alternatives for which Thialfi’s design could be adapted.
For example, HBase [12] provides Bigtable-like structured storage atop the HDFS block store [13],
and Zookeeper [15] provides a highly reliable group coordination service.
Thialfi’s provisions for fault-tolerance draw on emerging practical experience with infrastructure services [3, 9, 11, 21]. Our experience with performance variability and communications failures is
consistent with these observations. But, unlike many existing infrastructure services, Thialfi is explicitly designed to cope with the failure of entire data centers. Megastore [2] shares this goal, using
synchronous replication with Paxos [16] to provide consistent structured data storage. While early
designs of Thialfi were built atop Megastore to inherit its robustness to data center failure, we eventually adopted replication and fault-tolerance techniques specific to a notification service; these increase
efficiency substantially.
Our goal of providing a scalable notification service is shared by a number of P2P notification and
publish / subscribe systems, e.g., Bayeux [29], Scribe [23], and Siena [6]. These systems construct
multicast trees on overlay routing substrates in order to efficiently disseminate messages. While
Thialfi addresses a similar problem, differences between P2P and infrastructure environments necessitate radical differences in our design. For example, P2P message delivery requires direct browser-to-browser communication that is precluded by fundamental security policies [24]. Also, message
delivery is best-effort, departing from our goal of maintaining reliable delivery of notifications. Significant additional work exists on publish / subscribe systems (e.g. [1, 20, 25, 26]), but these systems
provide richer semantics and target lower scale.
For web applications, Thialfi addresses a longstanding limitation of HTTP—the need for polling to
refresh data. Others have observed these problems; e.g., Cao and Liu [5] advocate the use of invalidations as an alternative to polling to maintain the freshness of web documents, but their proposed
protocol extensions were not taken up. Yin et al. [28] study the efficiency of HTTP polling and propose an invalidation protocol that is conceptually similar to Thialfi, although it operates on a single
HTTP server only. We reexamine these problems at much larger scale. Cowling et al. [8] mention
the applicability of Census, a Byzantine-fault-tolerant group membership system, to the problem of
large-scale cache invalidation, but they leave the design to future work.
More recently, practitioners have developed a number of techniques to work around the request / reply
limitations of HTTP [17]. Many approaches rely on a common technique: each client maintains an
in-flight request to the server, which replies to this outstanding request only when new data is available. Web sockets [14] have since been proposed as a standard enabling full-duplex HTTP
messaging. Thialfi supports these channels transparently, separating the implementation details of
achieving push messages from the semantics of the notification service.
Lessons Learned
In the process of designing, implementing, and supporting Thialfi we learned several lessons about
our design.
For many applications, the signal is enough. Our choice to provide applications with only a notification signal was contentious. In particular, developers have almost universally asked for richer
features than Thialfi provides: e.g., support for data delivery, message ordering, and duplicate suppression. Absent these more compelling features, some developers are hesitant to adopt Thialfi. We
have avoided these features, however, as they would significantly complicate both our implementation and API. Moreover, we have encountered few applications with a fundamental need for them.
For example, applications that would prefer to receive data directly from Thialfi typically store the
data in their servers and retrieve it after receiving a notification. While developers often express
consternation over the additional latency induced by the retrieval, for many applications this does
not adversely affect the user experience. In our view, reliable signaling strikes a balance between
complexity and system utility.
Client library rather than client protocol. Perhaps more than any other component in the system,
Thialfi’s client library has undergone significant evolution since our initial design. Initially, we had
no client library whatsoever, opting instead to expose our protocol directly. Engineers, however,
strongly prefer to develop against native-language APIs. And, a high-level API has allowed us to
evolve our client-server protocol without modifying application code.
Initially, the client library provided only a thin shim around RPCs, e.g., register, unregister, acknowledge. This API proved essentially unusable. While seemingly simple, this initial design exposed too
many failure cases to application programmers, e.g., server crashes and data center migration. This
experience led us to our goal of unifying error handling with normal operations in Thialfi’s API.
Complexity at the server, not the client. The presence of a client library creates a temptation to
improve server scalability by offloading functionality. Our second client library took exactly this
approach. For example, it detected data center switchover and drove the recovery protocol, substantially simplifying the server implementation. In many systems, this design would be preferable:
server scalability is typically the bottleneck, and client resources are plentiful. But, a sophisticated
client library is difficult to maintain. Thialfi’s client library is implemented in multiple languages,
and clients may not upgrade their software for years, if ever. In contrast, bug and performance fixes
to data center code can be deployed in hours. Given these realities, we trade server resources for
client simplicity in our current (third) client library.
Asynchronous events, not callbacks. Developers are accustomed to taking actions that produce
results, and our initial client libraries tried to satisfy this expectation. For example, the register call
took a registration callback for success or failure. Experience showed callbacks are not sufficient;
e.g., a client may become spontaneously unregistered during migration. Given the need to respond
to asynchronous events, callbacks are unnecessary and often misleading. Clients only need to know
current state, not the sequence of operations leading to it.
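The contrast between callbacks and state events can be sketched as follows. This is a hypothetical illustration of the design point, not Thialfi's actual client API: rather than a per-call success/failure callback, the library reports the current registration state whenever it changes, so a single handler covers explicit calls, migration, and garbage collection alike.

```python
class RegistrationListener:
    """Sketch of a state-event API (names hypothetical): the library
    reports current registration state, not per-call outcomes."""

    def __init__(self):
        self.state = {}  # object id -> "REGISTERED" | "UNREGISTERED"

    def on_registration_state(self, object_id, state):
        # One handler for all causes: explicit calls, migration, GC.
        self.state[object_id] = state

    def is_registered(self, object_id):
        return self.state.get(object_id) == "REGISTERED"
```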
Initial workloads have few objects per client. A key feature of Thialfi is its support for tens of
thousands of objects per client. At present, however, no client application has more than tens of
objects per client. We suspect this is because existing client applications were initially designed
around polling solutions that work best with few objects per client. Emerging applications make use
of fine-grained objects, and we anticipate workloads with high fanout and many objects per client.
Conclusion
We have presented Thialfi, an infrastructure service that provides web, desktop, and mobile client applications with timely (sub-second) notifications of updates to shared state. To make Thialfi generally
applicable, we provide a simple object model and client API that permit developers flexibility in communication, storage, and runtime environments. Internally, Thialfi uses a combination of server-side
soft state, asynchronous replication, and client-driven recovery to tolerate a wide range of failures
common to infrastructure services, including the failure of entire data centers. The Thialfi API is
structured so that these failures are handled by the same application code paths used for normal op-
eration. Thialfi is in production use by millions of people daily, and our measurements confirm its
scalability, performance, and robustness.
Acknowledgments
We would like to thank the anonymous reviewers and our shepherd, Robert Morris, for their valuable
feedback. We are also grateful to many colleagues at Google. John Pongsajapan and John Reumann
offered valuable wisdom during design discussions, Shao Liu and Kyle Marvin worked on the implementation, and Fred Akalin and Rhett Robinson helped with application integration. Brian Bershad
and Dan Grove have provided support and resources over the life of the project, and Steve Lacey provided encouragement and connections with application developers. Finally, we thank James Cowling,
Xiaolan Zhang, and Elisavet Kozyri for helpful comments on the paper.
References
[1] Y. Amir and J. Stanton. The Spread Wide Area Group Communication System. Technical
Report CNDS 98-4, 1998.
[2] J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Léon, Y. Li, A. Lloyd,
and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive
services. In Proc. of CIDR, 2011.
[3] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the
wild. In Proc. of IMC, 2010.
[4] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proc. of
OSDI, 2006.
[5] P. Cao and C. Liu. Maintaining strong cache consistency in the world wide web. IEEE Trans.
Comput., 47:445–457, April 1998.
[6] A. Carzaniga, D. S. Rosenblum, and A. L. Wolf. Design and evaluation of a wide-area event
notification service. ACM Trans. Comput. Syst., 19:332–383, August 2001.
[7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra,
A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In Proc.
of OSDI, 2006.
[8] J. Cowling, D. R. K. Ports, B. Liskov, R. A. Popa, and A. Gaikwad. Census: Location-aware
membership management for large-scale distributed systems. In Proc. of USENIX, 2009.
[9] J. Dean. Designs, lessons and advice from building large distributed systems. In LADIS
Keynote, 2009.
[10] D. E. Eastlake and P. E. Jones. US secure hash algorithm 1 (SHA1). Internet RFC 3174, 2001.
[11] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and
S. Quinlan. Availability in globally distributed storage systems. In Proc. of OSDI, 2010.
[12] HBase.
[13] Hadoop Distributed File System.
[14] I. Hickson. The WebSocket API.
[15] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: Wait-free coordination for
Internet-scale systems. In Proc. of USENIX, 2010.
[16] L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16:133–169, May 1998.
[17] P. McCarthy and D. Crane. Comet and Reverse Ajax: The Next-Generation Ajax 2.0. Apress,
[18] R. Merkle. Secrecy, authentication and public key systems. PhD thesis, Dept. of Electrical
Engineering, Stanford University, 1979.
[19] D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, and A. Rowstron. Everest: Scaling down
peak loads through i/o off-loading. In Proc. of OSDI, 2008.
[20] P. R. Pietzuch and J. Bacon. Hermes: A distributed event-based middleware architecture. In
Proc. of ICDCS Workshops (ICDCSW ’02), pages 611–618, Washington, DC, USA, 2002. IEEE Computer Society.
[21] E. Pinheiro, W.-D. Weber, and L. A. Barroso. Failure trends in a large disk drive population. In
Proc. of FAST, 2007.
[22] D. R. K. Ports, A. T. Clements, I. Zhang, S. Madden, and B. Liskov. Transactional consistency
and automatic management in an application data cache. In Proc. of OSDI, 2010.
[23] A. I. T. Rowstron, A.-M. Kermarrec, M. Castro, and P. Druschel. SCRIBE: The design of a
large-scale event notification infrastructure. In Networked Group Communication, pages
30–43, 2001.
[24] J. Ruderman. Same origin policy for JavaScript.
[25] R. Strom, G. Banavar, T. Chandra, M. Kaplan, K. Miller, B. Mukherjee, D. Sturman, and
M. Ward. Gryphon: An information flow based approach to message brokering. In Proc. Intl.
Symposium on Software Reliability Engineering, 1998.
[26] R. van Renesse, K. P. Birman, and S. Maffeis. Horus: a flexible group communication system.
Commun. ACM, 39:76–83, April 1996.
[27] Extensible Messaging and Presence Protocol.
[28] J. Yin, L. Alvisi, M. Dahlin, and A. Iyengar. Engineering server-driven consistency for large
scale dynamic web services. In Proc. of WWW, 2001.
[29] S. Q. Zhuang, B. Y. Zhao, A. D. Joseph, R. H. Katz, and J. D. Kubiatowicz. Bayeux: An
architecture for scalable and fault-tolerant wide-area data dissemination. In Proc. of
NOSSDAV, 2001.
Windows Azure Storage:
A Highly Available Cloud Storage Service
with Strong Consistency
Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold,
Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu,
Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri,
Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi,
Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq,
Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli,
Marvin McNett, Sriram Sankaran, Kavitha Manivannan, Leonidas Rigas
Abstract
Windows Azure Storage (WAS) is a cloud storage system that provides customers the ability to
store seemingly limitless amounts of data for any duration of time. WAS customers have access to
their data from anywhere at any time and only pay for what they use and store. In WAS, data is
stored durably using both local and geographic replication to facilitate disaster recovery. Currently,
WAS storage comes in the form of Blobs (files), Tables (structured storage), and Queues (message
delivery). In this paper, we describe the WAS architecture, global namespace, and data model, as
well as its resource provisioning, load balancing, and replication systems.
Categories and Subject Descriptors
D.4.2 [Operating Systems]: Storage Management—Secondary storage; D.4.3 [Operating
Systems]: File Systems Management—Distributed file systems; D.4.5 [Operating Systems]:
Reliability—Fault tolerance; D.4.7 [Operating Systems]: Organization and Design—Distributed
systems; D.4.8 [Operating Systems]: Performance—Measurements
General Terms
Algorithms, Design, Management, Measurement, Performance, Reliability.
Keywords
Cloud storage, distributed storage systems, Windows Azure.
1. Introduction
Windows Azure Storage (WAS) is a scalable cloud storage system that has been in production
since November 2008. It is used inside Microsoft for applications such as social networking
search, serving video, music and game content, managing medical records, and more. In addition,
there are thousands of customers outside Microsoft using WAS, and anyone can sign up over the
Internet to use the system.
WAS provides cloud storage in the form of Blobs (user files), Tables (structured storage), and
Queues (message delivery). These three data abstractions provide the overall storage and workflow
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
SOSP '11, October 23-26, 2011, Cascais, Portugal.
Copyright © 2011 ACM 978-1-4503-0977-6/11/10 ... $10.00.
for many applications. A common usage pattern we see is incoming and outgoing data being
shipped via Blobs, Queues providing the overall workflow for processing the Blobs, and
intermediate service state and final results being kept in Tables or Blobs.
An example of this pattern is an ingestion engine service built on Windows Azure to provide near
real-time Facebook and Twitter search. This service is one part of a larger data processing pipeline
that provides publicly searchable content (via our search engine, Bing) within 15 seconds of a
Facebook or Twitter user’s posting or status update. Facebook and Twitter send the raw public
content to WAS (e.g., user postings, user status updates, etc.) to be made publicly searchable.
This content is stored in WAS Blobs. The ingestion engine annotates this data with user auth,
spam, and adult scores; content classification; and classification for language and named entities.
In addition, the engine crawls and expands the links in the data. While processing, the ingestion
engine accesses WAS Tables at high rates and stores the results back into Blobs. These Blobs are
then folded into the Bing search engine to make the content publicly searchable. The ingestion
engine uses Queues to manage the flow of work, the indexing jobs, and the timing of folding the
results into the search engine. As of this writing, the ingestion engine for Facebook and Twitter
keeps around 350TB of data in WAS (before replication). In terms of transactions, the ingestion
engine has a peak traffic load of around 40,000 transactions per second and performs between two
and three billion transactions per day (see Section 7 for discussion of additional workload profiles).
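The Blob/Queue/Table usage pattern described above can be sketched with in-memory stand-ins. This is a hedged illustration of the workflow shape, not WAS code: a dict stands in for Blob storage, a deque for a Queue driving the processing, and a second dict for the Table holding results.

```python
from collections import deque

def run_pipeline(blobs, process):
    """Sketch of the common WAS usage pattern (stand-in structures):
    incoming data lands in blobs, a queue drives the workflow, and
    results are kept in a table."""
    blob_store = dict(blobs)        # stands in for Blob storage
    work_queue = deque(blob_store)  # queue messages name blobs to process
    table = {}                      # stands in for Table storage
    while work_queue:
        name = work_queue.popleft()
        table[name] = process(blob_store[name])
    return table
```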
In the process of building WAS, feedback from potential internal and external customers drove
many design decisions. Some key design features resulting from this feedback include:
Strong Consistency – Many customers want strong consistency, especially enterprise customers
moving their line of business applications to the cloud. They also want the ability to perform
conditional reads, writes, and deletes for optimistic concurrency control [12] on the strongly
consistent data. For this, WAS provides three properties that the CAP theorem [2] claims are
difficult to achieve at the same time: strong consistency, high availability, and partition tolerance
(see Section 8).
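The conditional operations mentioned above can be sketched with a version-matched write. This is an illustrative model of ETag-style optimistic concurrency control, not the WAS API itself (the class and method names are hypothetical): a write succeeds only if the caller's expected version still matches, so concurrent updaters detect conflicts instead of silently overwriting each other.

```python
class ConditionalStore:
    """Sketch of conditional writes for optimistic concurrency
    (names hypothetical): write-if-version-matches semantics."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def write_if_match(self, key, expected_version, value):
        version, _ = self.data.get(key, (0, None))
        if version != expected_version:
            return False                      # precondition failed: caller retries
        self.data[key] = (version + 1, value)
        return True
```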
Global and Scalable Namespace/Storage – For ease of use, WAS implements a global namespace
that allows data to be stored and accessed in a consistent manner from any location in the world.
Since a major goal of WAS is to enable storage of massive amounts of data, this global namespace
must be able to address exabytes of data and beyond. We discuss our global namespace design in
detail in Section 2.
Disaster Recovery – WAS stores customer data across multiple data centers hundreds of miles
apart from each other. This redundancy provides essential data recovery protection against
disasters such as earthquakes, wildfires, tornadoes, nuclear reactor meltdowns, etc.
Multi-tenancy and Cost of Storage – To reduce storage cost, many customers are served from the
same shared storage infrastructure. WAS combines the workloads of many different customers with
varying resource needs together so that significantly less storage needs to be provisioned at any one
point in time than if those services were run on their own dedicated hardware.
We describe these design features in more detail in the following sections. The remainder of this
paper is organized as follows. Section 2 describes the global namespace used to access the WAS
Blob, Table, and Queue data abstractions. Section 3 provides a high level overview of the WAS
architecture and its three layers: Stream, Partition, and Front-End layers. Section 4 describes the
stream layer, and Section 5 describes the partition layer. Section 6 shows the throughput
experienced by Windows Azure applications accessing Blobs and Tables. Section 7 describes
some internal Microsoft workloads using WAS. Section 8 discusses design choices and lessons
learned. Section 9 presents related work, and Section 10 summarizes the paper.
2. Global Partitioned Namespace
A key goal of our storage system is to provide a single global namespace that allows clients to
address all of their storage in the cloud and scale to arbitrary amounts of storage needed over time.
To provide this capability we leverage DNS as part of the storage namespace and break the storage
namespace into three parts: an account name, a partition name, and an object name. As a result, all
data is accessible via a URI of the form:
http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName
The AccountName is the customer selected account name for accessing storage and is part of the
DNS host name. The AccountName DNS translation is used to locate the primary storage cluster
and data center where the data is stored. This primary location is where all requests go to reach the
data for that account. An application may use multiple AccountNames to store its data across
different locations.
In conjunction with the AccountName, the PartitionName locates the data once a request reaches
the storage cluster. The PartitionName is used to scale out access to the data across storage nodes
based on traffic needs.
When a PartitionName holds many objects, the ObjectName identifies individual objects within
that partition. The system supports atomic transactions across objects with the same
PartitionName value. The ObjectName is optional since, for some types of data, the
PartitionName uniquely identifies the object within the account.
This naming approach enables WAS to flexibly support its three data abstractions. For Blobs, the
full blob name is the PartitionName. For Tables, each entity (row) in the table has a primary key
that consists of two properties: the PartitionName and the ObjectName. This distinction allows
applications using Tables to group rows into the same partition to perform atomic transactions
across them. For Queues, the queue name is the PartitionName and each message has an
ObjectName to uniquely identify it within the queue.
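The per-abstraction mapping above can be sketched as a small parser. This is an illustrative helper, not part of the WAS API (the function name and example account are hypothetical): the account comes from the DNS host name, and the path yields the PartitionName and optional ObjectName, except that for Blobs the full blob name serves as the PartitionName.

```python
from urllib.parse import urlparse

def parse_was_name(uri, service):
    """Sketch of splitting a WAS URI into its three namespace parts
    (helper hypothetical). `service` is "blob", "table", or "queue"."""
    parsed = urlparse(uri)
    account = parsed.netloc.split(".")[0]   # AccountName.<service>.core...
    path = parsed.path.lstrip("/")
    if service == "blob":
        return account, path, None          # full blob name is the PartitionName
    partition, _, obj = path.partition("/")
    return account, partition, obj or None  # ObjectName is optional
```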
3. High Level Architecture
Here we present a high level discussion of the WAS architecture and how it fits into the Windows
Azure Cloud Platform.
3.1 Windows Azure Cloud Platform
The Windows Azure Cloud platform runs many cloud services across different data centers and
different geographic regions. The Windows Azure Fabric Controller is a resource provisioning and
management layer that provides resource allocation, deployment/upgrade, and management for
cloud services on the Windows Azure platform. WAS is one such service running on top of the
Fabric Controller.
The Fabric Controller provides node management, network configuration, health monitoring,
starting/stopping of service instances, and service deployment for the WAS system. In addition,
WAS retrieves network topology information, physical layout of the clusters, and hardware
configuration of the storage nodes from the Fabric Controller. WAS is responsible for managing
the replication and data placement across the disks and load balancing the data and application
traffic within the storage cluster.
3.2 WAS Architectural Components
An important feature of WAS is the ability to store and provide access to an immense amount of
storage (exabytes and beyond). We currently have 70 petabytes of raw storage in production and
are in the process of provisioning a few hundred more petabytes of raw storage based on customer
demand for 2012.
The WAS production system consists of Storage Stamps and the Location Service (shown in Figure 1).
(&lt;service&gt; specifies the service type, which can be blob, table, or queue. The APIs for Windows Azure Blobs, Tables, and Queues can be found in the online documentation.)
[Figure 1 here: the Location Service and Account Management sit above two Storage Stamps; a DNS lookup for the account routes Blob, Table, and Queue access to a stamp, and each stamp contains a Partition Layer and a Stream Layer performing Intra-Stamp Replication.]
Figure 1: High-level architecture
Storage Stamps – A storage stamp is a cluster of N racks of storage nodes, where each rack is built
out as a separate fault domain with redundant networking and power. Clusters typically range from
10 to 20 racks with 18 disk-heavy storage nodes per rack. Our first generation storage stamps hold
approximately 2PB of raw storage each. Our next generation stamps hold up to 30PB of raw
storage each.
To provide low cost cloud storage, we need to keep the storage provisioned in production as highly
utilized as possible. Our goal is to keep a storage stamp around 70% utilized in terms of capacity,
transactions, and bandwidth. We try to avoid going above 80% because we want to keep 20% in
reserve for (a) disk short stroking to gain better seek time and higher throughput by utilizing the
outer tracks of the disks and (b) to continue providing storage capacity and availability in the
presence of a rack failure within a stamp. When a storage stamp reaches 70% utilization, the
location service migrates accounts to different stamps using inter-stamp replication (see Section 3.4).
Location Service (LS) – The location service manages all the storage stamps. It is also responsible
for managing the account namespace across all stamps. The LS allocates accounts to storage
stamps and manages them across the storage stamps for disaster recovery and load balancing. The
location service itself is distributed across two geographic locations for its own disaster recovery.
WAS provides storage from multiple locations in each of the three geographic regions: North
America, Europe, and Asia. Each location is a data center with one or more buildings in that
location, and each location holds multiple storage stamps. To provision additional capacity, the LS
has the ability to easily add new regions, new locations to a region, or new stamps to a location.
Therefore, to increase the amount of storage, we deploy one or more storage stamps in the desired
location’s data center and add them to the LS. The LS can then allocate new storage accounts to
those new stamps for customers as well as load balance (migrate) existing storage accounts from
older stamps to the new stamps.
Figure 1 shows the location service with two storage stamps and the layers within the storage
stamps. The LS tracks the resources used by each storage stamp in production across all locations.
When an application requests a new account for storing data, it specifies the location affinity for the
storage (e.g., US North). The LS then chooses a storage stamp within that location as the primary
stamp for the account using heuristics based on the load information across all stamps (which
considers the fullness of the stamps and other metrics such as network and transaction utilization).
The LS then stores the account metadata information in the chosen storage stamp, which tells the
stamp to start taking traffic for the assigned account. The LS then updates DNS to allow requests
to route from the account's DNS host name to that storage stamp's
virtual IP (VIP, an IP address the storage stamp exposes for external traffic).
3.3 Three Layers within a Storage Stamp
Also shown in Figure 1 are the three layers within a storage stamp. From bottom up these are:
Stream Layer – This layer stores the bits on disk and is in charge of distributing and replicating
the data across many servers to keep data durable within a storage stamp. The stream layer can be
thought of as a distributed file system layer within a stamp. It understands files, called “streams”
(which are ordered lists of large storage chunks called “extents”), how to store them, how to
replicate them, etc., but it does not understand higher level object constructs or their semantics. The
data is stored in the stream layer, but it is accessible from the partition layer. In fact, partition
servers (daemon processes in the partition layer) and stream servers are co-located on each storage
node in a stamp.
Partition Layer – The partition layer is built for (a) managing and understanding higher level data
abstractions (Blob, Table, Queue), (b) providing a scalable object namespace, (c) providing
transaction ordering and strong consistency for objects, (d) storing object data on top of the stream
layer, and (e) caching object data to reduce disk I/O.
Another responsibility of this layer is to achieve scalability by partitioning all of the data objects
within a stamp. As described earlier, all objects have a PartitionName; they are broken down into
disjoint ranges based on the PartitionName values and served by different partition servers. This
layer manages which partition server is serving what PartitionName ranges for Blobs, Tables, and
Queues. In addition, it provides automatic load balancing of PartitionNames across the partition
servers to meet the traffic needs of the objects.
Front-End (FE) layer – The Front-End (FE) layer consists of a set of stateless servers that take
incoming requests. Upon receiving a request, an FE looks up the AccountName, authenticates and
authorizes the request, then routes the request to a partition server in the partition layer (based on
the PartitionName). The system maintains a Partition Map that keeps track of the PartitionName
ranges and which partition server is serving which PartitionNames. The FE servers cache the
Partition Map and use it to determine which partition server to forward each request to. The FE
servers also stream large objects directly from the stream layer and cache frequently accessed data
for efficiency.
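As a rough sketch of the routing step, an FE could hold the cached Partition Map as a sorted list of range start keys and binary-search it per request. The data structure and names here are assumptions for illustration, not the WAS implementation.

```python
# Hypothetical cached Partition Map: sorted, disjoint PartitionName ranges
# mapped to partition servers; an FE routes each request with a binary search.
import bisect

class PartitionMap:
    def __init__(self, ranges):
        # ranges: list of (low_key_inclusive, server), sorted by low key;
        # each range extends up to the next entry's low key.
        self.lows = [low for low, _ in ranges]
        self.servers = [srv for _, srv in ranges]

    def route(self, partition_name: str) -> str:
        # Rightmost range whose low key is <= partition_name.
        i = bisect.bisect_right(self.lows, partition_name) - 1
        return self.servers[max(i, 0)]

pmap = PartitionMap([("", "ps-1"), ("g", "ps-2"), ("p", "ps-3")])
print(pmap.route("customers"))  # falls in ["", "g") -> ps-1
print(pmap.route("photos"))     # falls in ["p", end) -> ps-3
```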
3.4 Two Replication Engines
Before describing the stream and partition layers in detail, we first give a brief overview of the two
replication engines in our system and their separate responsibilities.
Intra-Stamp Replication (stream layer) – This system provides synchronous replication and is
focused on making sure all the data written into a stamp is kept durable within that stamp. It keeps
enough replicas of the data across different nodes in different fault domains to keep data durable
within the stamp in the face of disk, node, and rack failures.
Intra-stamp replication is done
completely by the stream layer and is on the critical path of the customer’s write requests. Once a
transaction has been replicated successfully with intra-stamp replication, success can be returned
back to the customer.
Inter-Stamp Replication (partition layer) – This system provides asynchronous replication and
is focused on replicating data across stamps. Inter-stamp replication is done in the background and
is off the critical path of the customer’s request. This replication is at the object level, where either
the whole object is replicated or recent delta changes are replicated for a given account. Inter-stamp replication is used for (a) keeping a copy of an account's data in two locations for disaster
recovery and (b) migrating an account’s data between stamps. Inter-stamp replication is configured
for an account by the location service and performed by the partition layer.
Inter-stamp replication is focused on replicating objects and the transactions applied to those
objects, whereas intra-stamp replication is focused on replicating blocks of disk storage that are
used to make up the objects.
We separated replication into intra-stamp and inter-stamp at these two different layers for the
following reasons. Intra-stamp replication provides durability against hardware failures, which
occur frequently in large scale systems, whereas inter-stamp replication provides geo-redundancy
against geo-disasters, which are rare. It is crucial to provide intra-stamp replication with low
latency, since that is on the critical path of user requests; whereas the focus of inter-stamp
replication is optimal use of network bandwidth between stamps while achieving an acceptable
level of replication delay. They are different problems addressed by the two replication schemes.
Another reason for creating these two separate replication layers is the namespace each of these
two layers has to maintain. Performing intra-stamp replication at the stream layer allows the
amount of information that needs to be maintained to be scoped by the size of a single storage
stamp. This focus allows all of the meta-state for intra-stamp replication to be cached in memory
for performance (see Section 4), enabling WAS to provide fast replication with strong consistency
by quickly committing transactions within a single stamp for customer requests. In contrast, the
partition layer combined with the location service controls and understands the global object
namespace across stamps, allowing it to efficiently replicate and maintain object state across data centers.
4. Stream Layer
The stream layer provides an internal interface used only by the partition layer. It provides a file
system like namespace and API, except that all writes are append-only. It allows clients (the
partition layer) to open, close, delete, rename, read, append to, and concatenate these large files,
which are called streams. A stream is an ordered list of extent pointers, and an extent is a sequence
of append blocks.
Figure 2 shows stream “//foo”, which contains (pointers to) four extents (E1, E2, E3, and E4).
Each extent contains a set of blocks that were appended to it. E1, E2, and E3 are sealed extents,
meaning that they can no longer be appended to; only the last extent in a stream (E4) can be appended
to. If an application reads the data of the stream from beginning to end, it would get the block
contents of the extents in the order of E1, E2, E3 and E4.
[Figure 2 here: stream //foo holds pointers to four extents: E1 (sealed; blocks B11, B12, …, B1x), E2 (sealed; B21, B22, …, B2y), E3 (sealed; B31, B32, …, B3z), and E4 (unsealed; blocks B41, B42, B43).]
Figure 2: Example stream with four extents
In more detail these data concepts are:
Block – This is the minimum unit of data for writing and reading. A block can be up to N bytes
(e.g. 4MB). Data is written (appended) as one or more concatenated blocks to an extent, where
blocks do not have to be the same size. The client does an append in terms of blocks and controls
the size of each block. A client read gives an offset to a stream or extent, and the stream layer
reads as many blocks as needed at the offset to fulfill the length of the read. When performing a
read, the entire contents of a block are read. This is because the stream layer stores its checksum
validation at the block level, one checksum per block. The whole block is read to perform the
checksum validation, and it is checked on every block read. In addition, all blocks in the system
are validated against their checksums once every few days to check for data integrity issues.
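A minimal sketch of the per-block checksum idea, assuming an in-memory extent (the real on-disk format is not described at this level of detail): each append stores a checksum alongside the block, and every read validates the whole block against it.

```python
# Sketch only: per-block checksums, stored at append time and validated on
# every block read, as described above. Not the actual WAS on-disk layout.
import zlib

class Extent:
    def __init__(self):
        self.blocks = []  # list of (data, checksum)

    def append_block(self, data: bytes):
        self.blocks.append((data, zlib.crc32(data)))

    def read_block(self, i: int) -> bytes:
        data, stored = self.blocks[i]
        # The whole block is read so the checksum can be validated.
        if zlib.crc32(data) != stored:
            raise IOError(f"checksum mismatch in block {i}")
        return data

ext = Extent()
ext.append_block(b"record-1")
assert ext.read_block(0) == b"record-1"
```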
Extent – Extents are the unit of replication in the stream layer, and the default replication policy is
to keep three replicas within a storage stamp for an extent. Each extent is stored in an NTFS file
and consists of a sequence of blocks. The target extent size used by the partition layer is 1GB. To
store small objects, the partition layer appends many of them to the same extent and even in the
same block; to store large TB-sized objects (Blobs), the object is broken up over many extents by
the partition layer. As part of its index, the partition layer keeps track of the streams, extents, and
byte offsets within those extents at which objects are stored.
Streams – Every stream has a name in the hierarchical namespace maintained at the stream layer,
and a stream looks like a big file to the partition layer. Streams are appended to and can be
randomly read from. A stream is an ordered list of pointers to extents which is maintained by the
Stream Manager. Concatenated together, the extents represent the full contiguous address space of
the stream, which can be read in the order the extents were added to the stream. A new
stream can be constructed by concatenating extents from existing streams, which is a fast operation
since it just updates a list of pointers. Only the last extent in the stream can be appended to. All of
the prior extents in the stream are immutable.
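Because a stream is just an ordered list of extent pointers, concatenation is a pointer-copy operation with no data movement. A toy sketch (hypothetical types, reusing the extent names from Figure 2):

```python
# Hypothetical sketch: a stream as an ordered list of extent pointers, so a
# new stream built from existing sealed extents only copies pointers.
class Stream:
    def __init__(self, name, extent_ids):
        self.name = name
        self.extent_ids = list(extent_ids)  # only the last extent is appendable

old = Stream("//foo", ["E1", "E2", "E3", "E4"])
# Compose a new stream from sealed extents of an existing one: O(pointers).
new = Stream("//bar", old.extent_ids[:3] + ["E9"])
print(new.extent_ids)  # ['E1', 'E2', 'E3', 'E9']
```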
4.1 Stream Manager and Extent Nodes
The two main architecture components of the stream layer are the Stream Manager (SM) and
Extent Node (EN) (shown in Figure 3).
[Figure 3 here: the partition-layer client (A) asks the Stream Manager to create an extent and (B) the SM allocates the extent replica set across Extent Nodes; numbered steps show the subsequent append path.]
Figure 3: Stream Layer Architecture
Stream Manager (SM) – The SM keeps track of the stream namespace, what extents are in each
stream, and the extent allocation across the Extent Nodes (EN). The SM is a standard Paxos cluster
[13] as used in prior storage systems [3], and is off the critical path of client requests. The SM is
responsible for (a) maintaining the stream namespace and state of all active streams and extents, (b)
monitoring the health of the ENs, (c) creating and assigning extents to ENs, (d) performing the
lazy re-replication of extent replicas that are lost due to hardware failures or unavailability, (e)
garbage collecting extents that are no longer pointed to by any stream, and (f) scheduling the
erasure coding of extent data according to stream policy (see Section 4.4).
The SM periodically polls (syncs) the state of the ENs and what extents they store. If the SM
discovers that an extent is replicated on fewer than the expected number of ENs, a re-replication of
the extent will lazily be created by the SM to regain the desired level of replication. For extent
replica placement, the SM randomly chooses ENs across different fault domains, so that they are
stored on nodes that will not have correlated failures due to power, network, or being on the same rack.
The SM does not know anything about blocks, just streams and extents. The SM is off the critical
path of client requests and does not track each block append, since the total number of blocks can
be huge and the SM cannot scale to track those. Since the stream and extent state is only tracked
within a single stamp, the amount of state can be kept small enough to fit in the SM’s memory.
The only client of the stream layer is the partition layer, and the partition layer and stream layer are
co-designed so that they will not use more than 50 million extents and no more than 100,000
streams for a single storage stamp given our current stamp sizes. This parameterization can
comfortably fit into 32GB of memory for the SM.
Extent Nodes (EN) – Each extent node maintains the storage for a set of extent replicas assigned to
it by the SM. An EN has N disks attached, which it completely controls for storing extent replicas
and their blocks. An EN knows nothing about streams, and only deals with extents and blocks.
Internally on an EN server, every extent on disk is a file, which holds data blocks and their
checksums, and an index which maps extent offsets to blocks and their file location. Each extent
node contains a view about the extents it owns and where the peer replicas are for a given extent.
This view is a cache kept by the EN of the global state the SM keeps. ENs only talk to other ENs
to replicate block writes (appends) sent by a client, or to create additional copies of an existing
replica when told to by the SM. When an extent is no longer referenced by any stream, the SM
garbage collects the extent and notifies the ENs to reclaim the space.
4.2 Append Operation and Sealed Extent
Streams can only be appended to; existing data cannot be modified. The append operations are
atomic: either the entire data block is appended, or nothing is. Multiple blocks can be appended at
once, as a single atomic “multi-block append” operation. The minimum read size from a stream is
a single block. The “multi-block append” operation allows us to write a large amount of sequential
data in a single append and to later perform small reads. The contract used between the client
(partition layer) and the stream layer is that the multi-block append will occur atomically, and if the
client never hears back for a request (due to failure) the client should retry the request (or seal the
extent). This contract implies that the client needs to expect the same block to be appended more
than once in face of timeouts and correctly deal with processing duplicate records. The partition
layer deals with duplicate records in two ways (see Section 5 for details on the partition layer
streams). For the metadata and commit log streams, all of the transactions written have a sequence
number and duplicate records will have the same sequence number. For the row data and blob data
streams, for duplicate writes, only the last write will be pointed to by the RangePartition data
structures, so the prior duplicate writes will have no references and will be garbage collected later.
An extent has a target size, specified by the client (partition layer), and when it fills up to that size
the extent is sealed at a block boundary, and then a new extent is added to the stream and appends
continue into that new extent. Once an extent is sealed it can no longer be appended to. A sealed
extent is immutable, and the stream layer performs certain optimizations on sealed extents like
erasure coding cold extents. Extents in a stream do not have to be the same size, and they can be
sealed at any time and can even grow arbitrarily large.
4.3 Stream Layer Intra-Stamp Replication
The stream layer and partition layer are co-designed to provide strong consistency at the object
transaction level. The correctness of the partition layer providing strong consistency is built upon
the following guarantees from the stream layer:
1. Once a record is appended and acknowledged back to the client, any later reads of that record
from any replica will see the same data (the data is immutable).
2. Once an extent is sealed, any reads from any sealed replica will always see the same contents of
the extent.
The data center, Fabric Controller, and WAS have security mechanisms in place to guard against
malicious adversaries, so the stream replication does not deal with such threats. We consider faults
ranging from disk and node errors to power failures, network issues, bit-flip and random hardware
failures, as well as software bugs. These faults can cause data corruption; checksums are used to
detect such corruption. The rest of the section discusses the intra-stamp replication scheme within
this context.
4.3.1 Replication Flow
As shown in Figure 3, when a stream is first created (step A), the SM assigns three replicas for the
first extent (one primary and two secondary) to three extent nodes (step B), which are chosen by the
SM to randomly spread the replicas across different fault and upgrade domains while considering
extent node usage (for load balancing). In addition, the SM decides which replica will be the
primary for the extent. Writes to an extent are always performed from the client to the primary EN,
and the primary EN is in charge of coordinating the write to two secondary ENs. The primary EN
and the location of the three replicas never change for an extent while it is being appended to (while
the extent is unsealed). Therefore, no leases are used to represent the primary EN for an extent,
since the primary is always fixed while an extent is unsealed.
When the SM allocates the extent, the extent information is sent back to the client, which then
knows which ENs hold the three replicas and which one is the primary. This state is now part of
the stream’s metadata information held in the SM and cached on the client. When the last extent in
the stream that is being appended to becomes sealed, the same process repeats. The SM then
allocates another extent, which now becomes the last extent in the stream, and all new appends now
go to the new last extent for the stream.
For an extent, every append is replicated three times across the extent’s replicas. A client sends all
write requests to the primary EN, but it can read from any replica, even for unsealed extents. The
append is sent to the primary EN for the extent by the client, and the primary is then in charge of
(a) determining the offset of the append in the extent, (b) ordering (choosing the offset of) all of the
appends if there are concurrent append requests to the same extent outstanding, (c) sending the
append with its chosen offset to the two secondary extent nodes, and (d) only returning success for
the append to the client after a successful append has occurred to disk for all three extent nodes.
The sequence of steps during an append is shown in Figure 3 (labeled with numbers). Only when
all of the writes have succeeded for all three replicas will the primary EN then respond to the client
that the append was a success. If there are multiple outstanding appends to the same extent, the
primary EN will respond success in the order of their offset (commit them in order) to the clients.
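The append path above can be sketched as follows. This is a simplification with assumed class names: replication is shown as synchronous in-process calls, and durability and failure handling are omitted.

```python
# Simplified sketch (hypothetical classes) of steps (a)-(d) above: the primary
# picks the offset, forwards the block with that offset to both secondaries,
# writes locally, and acks only once all three replicas hold the data.
class SecondaryEN:
    def __init__(self):
        self.data = bytearray()

    def write_at(self, offset: int, block: bytes):
        assert offset == len(self.data)  # appends commit in order
        self.data += block

class PrimaryEN:
    def __init__(self, secondaries):
        self.secondaries = secondaries   # two secondary replicas
        self.data = bytearray()
        self.commit_length = 0

    def append(self, block: bytes) -> int:
        offset = self.commit_length           # (a)/(b): choose and order offset
        for s in self.secondaries:            # (c): send append with its offset
            s.write_at(offset, block)
        self.data += block                    # local write (durability elided)
        self.commit_length = offset + len(block)
        return offset                         # (d): ack only after all three

p = PrimaryEN([SecondaryEN(), SecondaryEN()])
print(p.append(b"abc"), p.append(b"de"))  # 0 3
```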
As appends commit in order for a replica, the last append position is considered to be the current
commit length of the replica. We ensure that the bits are the same between all replicas by the fact
that the primary EN for an extent never changes, it always picks the offset for appends, appends for
an extent are committed in order, and how extents are sealed upon failures (discussed in Section 4.3.2).
When a stream is opened, the metadata for its extents is cached at the client, so the client can go
directly to the ENs for reading and writing without talking to the SM until the next extent needs to
be allocated for the stream. If during writing, one of the replica’s ENs is not reachable or there is a
disk failure for one of the replicas, a write failure is returned to the client. The client then contacts
the SM, and the extent that was being appended to is sealed by the SM at its current commit length
(see Section 4.3.2). At this point the sealed extent can no longer be appended to. The SM will then
allocate a new extent with replicas on different (available) ENs, which makes it now the last extent
of the stream. The information for this new extent is returned to the client. The client then
continues appending to the stream with its new extent. This process of sealing by the SM and
allocating the new extent is done on average within 20ms. A key point here is that the client can
continue appending to a stream as soon as the new extent has been allocated, and it does not rely on
a specific node to become available again.
For the newly sealed extent, the SM will create new replicas to bring it back to the expected level
of redundancy in the background if needed.
4.3.2 Sealing
From a high level, the SM coordinates the sealing operation among the ENs; it determines the
commit length of the extent used for sealing based on the commit length of the extent replicas.
Once the sealing is done, the commit length will never change again.
To seal an extent, the SM asks all three ENs their current length. During sealing, either all replicas
have the same length, which is the simple case, or a given replica is longer or shorter than another
replica for the extent. This latter case can only occur during an append failure where some but not
all of the ENs for the replica are available (i.e., some of the replicas get the append block, but not
all of them). We guarantee that the SM will seal the extent even if the SM may not be able to reach
all the ENs involved. When sealing the extent, the SM will choose the smallest commit length
based on the available ENs it can talk to. This will not cause data loss, since the primary EN does
not return success until the append has been written to disk on all three ENs. This means the
smallest commit length is sure to contain all the writes that have been acknowledged to the client.
In addition, it is also fine if the final length contains blocks that were never acknowledged back to
the client, since the client (partition layer) correctly deals with these as described in Section 4.2.
During the sealing, all of the extent replicas that were reachable by the SM are sealed to the commit
length chosen by the SM.
Once the sealing is done, the commit length of the extent will never be changed. If an EN was not
reachable by the SM during the sealing process but later becomes reachable, the SM will force the
EN to synchronize the given extent to the chosen commit length. This ensures that once an extent
is sealed, all its available replicas (the ones the SM can eventually reach) are bitwise identical.
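The sealing rule reduces to taking the minimum commit length over the reachable replicas; a tiny sketch with hypothetical lengths:

```python
# Sketch of the sealing rule described above. Lengths are illustrative.
def seal_length(reachable_replica_lengths):
    # Safe because the primary only acks a write once ALL three replicas have
    # it, so the minimum still contains every acknowledged block.
    return min(reachable_replica_lengths)

# All replicas reachable and equal -- the simple case:
print(seal_length([1024, 1024, 1024]))  # 1024
# An append failure left one replica short and one EN unreachable:
print(seal_length([1024, 896]))         # 896
```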
4.3.3 Interaction with Partition Layer
An interesting case is when, due to network partitioning, a client (partition server) is still able to
talk to an EN that the SM could not talk to during the sealing process. This section explains how
the partition layer handles this case.
The partition layer has two different read patterns:
1. Read records at known locations. The partition layer uses two types of data streams (row and
blob). For these streams, it always reads at specific locations (extent+offset, length). More
importantly, the partition layer will only read these two streams using the location information
returned from a previous successful append at the stream layer. That will only occur if the append
was successfully committed to all three replicas. The replication scheme guarantees such reads
always see the same data.
2. Iterate all records sequentially in a stream on partition load. Each partition has two
additional streams (metadata and commit log). These are the only streams that the partition layer
will read sequentially from a starting point to the very last record of a stream. This operation only
occurs when the partition is loaded (explained in Section 5). The partition layer ensures that no
useful appends from the partition layer will happen to these two streams during partition load.
Then the partition and stream layer together ensure that the same sequence of records is returned on
partition load.
At the start of a partition load, the partition server sends a “check for commit length” to the primary
EN of the last extent of these two streams. This checks whether all the replicas are available and
that they all have the same length. If not, the extent is sealed and reads are only performed, during
partition load, against a replica sealed by the SM. This ensures that the partition load will see all of
its data and the exact same view, even if we were to repeatedly load the same partition reading
from different sealed replicas for the last extent of the stream.
4.4 Erasure Coding Sealed Extents
To reduce the cost of storage, WAS erasure codes sealed extents for Blob storage. WAS breaks an
extent into N roughly equal sized fragments at block boundaries. Then, it adds M error correcting
code fragments using Reed-Solomon for the erasure coding algorithm [19]. As long as it does not
lose more than M fragments (across the data fragments + code fragments), WAS can recreate the
full extent.
Erasure coding sealed extents is an important optimization, given the amount of data we are
storing. It reduces the cost of storing data from three full replicas within a stamp, which is three
times the original data, to only 1.3x – 1.5x the original data, depending on the number of fragments
used. In addition, erasure coding actually increases the durability of the data when compared to
keeping three replicas within a stamp.
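The overhead figures follow directly from the fragment counts: an N-data, M-code layout stores (N + M)/N times the original data. The fragment counts below are illustrative, not the production parameters.

```python
# Storage overhead of an (N data + M code) erasure-coded layout vs. the data.
def overhead(n_data: int, m_code: int) -> float:
    return (n_data + m_code) / n_data

print(overhead(12, 4))  # ~1.33x -- within the 1.3x-1.5x range cited
print(overhead(6, 3))   # 1.5x
print(overhead(1, 2))   # 3.0x -- equivalent cost of three full replicas
```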
4.5 Read Load-Balancing
When reads are issued for an extent that has three replicas, they are submitted with a “deadline”
value which specifies that the read should not be attempted if it cannot be fulfilled within the
deadline. If the EN determines the read cannot be fulfilled within the time constraint, it will
immediately reply to the client that the deadline cannot be met. This mechanism allows the client
to select a different EN to read that data from, likely allowing the read to complete faster.
This method is also used with erasure-coded data. When reads cannot be serviced in a timely
manner because the spindle holding the needed data fragment is heavily loaded, the read may be
serviced faster by doing a reconstruction rather than reading that data fragment. In this case,
reads (for the range of
the fragment needed to satisfy the client request) are issued to all fragments of an erasure coded
extent, and the first N responses are used to reconstruct the desired fragment.
4.6 Spindle Anti-Starvation
Many hard disk drives are optimized to achieve the highest possible throughput, and sacrifice
fairness to achieve that goal. They tend to prefer reads or writes that are sequential. Since our
system contains many streams that can be very large, we observed in developing our service that
some disks would lock into servicing large pipelined reads or writes while starving other
operations. On some disks we observed this could lock out non-sequential IO for as long as 2300
milliseconds. To avoid this problem we avoid scheduling new IO to a spindle when there is over
100ms of expected pending IO already scheduled or when there is any pending IO request that has
been scheduled but not serviced for over 200ms. Using our own custom IO scheduling allows us to
achieve fairness across reads/writes at the cost of slightly increasing overall latency on some
sequential requests.
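The scheduling rule can be stated as a simple predicate using the two thresholds from the text (the queueing model around it is assumed):

```python
# Sketch of the anti-starvation rule: skip a spindle if it already has more
# than 100ms of expected pending IO, or if any scheduled request has waited
# more than 200ms unserviced. Thresholds are from the text; the rest is assumed.
def can_schedule(pending_expected_ms: float, oldest_wait_ms: float) -> bool:
    return pending_expected_ms <= 100 and oldest_wait_ms <= 200

print(can_schedule(40, 50))    # True: the spindle is keeping up
print(can_schedule(150, 50))   # False: too much pending IO
print(can_schedule(40, 250))   # False: a request is being starved
```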
4.7 Durability and Journaling
The durability contract for the stream layer is that when data is acknowledged as written by the
stream layer, there must be at least three durable copies of the data stored in the system. This
contract allows the system to maintain data durability even in the face of a cluster-wide power
failure. We operate our storage system in such a way that all writes are made durable to power safe
storage before they are acknowledged back to the client.
As part of maintaining the durability contract while still achieving good performance, an important
optimization for the stream layer is that on each extent node we reserve a whole disk drive or SSD
as a journal drive for all writes into the extent node. The journal drive [11] is dedicated solely for
writing a single sequential journal of data, which allows us to reach the full write throughput
potential of the device. When the partition layer does a stream append, the data is written by the
primary EN while in parallel sent to the two secondaries to be written. When each EN performs its
append, it (a) writes all of the data for the append to the journal drive and (b) queues up the append
to go to the data disk where the extent file lives on that EN. Once either succeeds, success can be
returned. If the journal succeeds first, the data is also buffered in memory while it goes to the data
disk, and any reads for that data are served from memory until the data is on the data disk. From
that point on, the data is served from the data disk. This also enables the combining of contiguous
writes into larger writes to the data disk, and better scheduling of concurrent writes and reads to get
the best throughput. It is a tradeoff for good latency at the cost of an extra write off the critical path.
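The dual-write path can be sketched as follows (a minimal model under our own naming: the append is acknowledged as soon as either the journal write or the data-disk write completes, and reads are served from memory until the data disk has the bytes):

```python
# Sketch of the journaled append path on an extent node. This models the
# state machine described in the text, not the actual WAS implementation.
class ExtentNodeAppend:
    def __init__(self, data):
        self.data = data
        self.in_memory = None      # buffered copy while only the journal has it
        self.on_data_disk = False
        self.acked = False

    def journal_write_done(self):
        # The journal usually wins the race; buffer the data so reads can
        # be served from memory until the data disk catches up.
        self.in_memory = self.data
        self.acked = True

    def data_disk_write_done(self):
        self.on_data_disk = True
        self.acked = True
        self.in_memory = None      # from now on, reads go to the data disk

    def read(self):
        return self.data if self.on_data_disk else self.in_memory
```

Queuing the data-disk write also allows contiguous writes to be combined, as the text notes.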
Even though the stream layer is an append-only system, we found that adding a journal drive
provided important benefits, since the appends do not have to contend with reads going to the data
disk in order to commit the result back to the client. The journal allows the append times from the
partition layer to have more consistent and lower latencies. Take for example the partition layer’s
commit log stream, where an append is only as fast as the slowest EN for the replicas being
appended to. For small appends to the commit log stream without journaling we saw an average
end-to-end stream append latency of 30ms. With journaling we see an average append latency of
6ms. In addition, the variance of latencies decreased significantly.
5. Partition Layer
The partition layer stores the different types of objects and understands what a transaction means
for a given object type (Blob, Table, or Queue). The partition layer provides the (a) data model for
the different types of objects stored, (b) logic and semantics to process the different types of
objects, (c) massively scalable namespace for the objects, (d) load balancing to access objects
across the available partition servers, and (e) transaction ordering and strong consistency for access
to objects.
5.1 Partition Layer Data Model
The partition layer provides an important internal data structure called an Object Table (OT). An
OT is a massive table which can grow to several petabytes. Object Tables are dynamically broken
up into RangePartitions (based on traffic load to the table) and spread across Partition Servers
(Section 5.2) in a stamp. A RangePartition is a contiguous range of rows in an OT from a given
low-key to a high-key. All RangePartitions for a given OT are non-overlapping, and every row is
represented in some RangePartition.
The following are the Object Tables used by the partition layer. The Account Table stores
metadata and configuration for each storage account assigned to the stamp. The Blob Table stores
all blob objects for all accounts in the stamp. The Entity Table stores all entity rows for all
accounts in the stamp; it is used for the public Windows Azure Table data abstraction. The
Message Table stores all messages for all accounts’ queues in the stamp. The Schema Table keeps
track of the schema for all OTs. The Partition Map Table keeps track of the current
RangePartitions for all Object Tables and what partition server is serving each RangePartition.
This table is used by the Front-End servers to route requests to the corresponding partition servers.
Each of the above OTs has a fixed schema stored in the Schema Table. The primary key for the
Blob Table, Entity Table, and Message Table consists of three properties: AccountName,
PartitionName, and ObjectName. These properties provide the indexing and sort order for those
Object Tables.
5.1.1 Supported Data Types and Operations
The property types supported for an OT’s schema are the standard simple types (bool, binary,
string, DateTime, double, GUID, int32, int64). In addition, the system supports two special types –
DictionaryType and BlobType. The DictionaryType allows for flexible properties (i.e., without a
fixed schema) to be added to a row at any time. These flexible properties are stored inside of the
dictionary type as (name, type, value) tuples. From a data access standpoint, these flexible
properties behave like first-order properties of the row and are queryable just like any other
property in the row. The BlobType is a special property used to store large amounts of data and is
currently used only by the Blob Table. BlobType avoids storing the blob data bits with the row
properties in the “row data stream”. Instead, the blob data bits are stored in a separate “blob data
stream” and a pointer to the blob’s data bits (list of “extent + offset, length” pointers) is stored in
the BlobType’s property in the row. This keeps the large data bits separated from the OT’s
queryable row property values stored in the row data stream.
OTs support standard operations including insert, update, and delete operations on rows as well as
query/get operations. In addition, OTs allow batch transactions across rows with the same
PartitionName value. The operations in a single batch are committed as a single transaction.
Finally, OTs provide snapshot isolation to allow read operations to happen concurrently with writes.
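The batch rule above can be illustrated with a hypothetical validation helper (`validate_batch` is our name, not a WAS API): every operation in a batch must share the same AccountName and PartitionName, and the batch commits atomically.

```python
# Hypothetical helper illustrating the same-PartitionName batch rule.
def validate_batch(ops):
    """ops: list of dicts with 'AccountName', 'PartitionName', 'ObjectName'.
    Returns the (account, partition) the batch targets, or raises."""
    if not ops:
        raise ValueError("empty batch")
    partition = (ops[0]["AccountName"], ops[0]["PartitionName"])
    for op in ops:
        if (op["AccountName"], op["PartitionName"]) != partition:
            raise ValueError("batch spans multiple PartitionNames")
    return partition
```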
5.2 Partition Layer Architecture
The partition layer has three main architectural components as shown in Figure 4: a Partition
Manager (PM), Partition Servers (PS), and a Lock Service.
Partition Manager (PM) – Responsible for keeping track of and splitting the massive Object
Tables into RangePartitions and assigning each RangePartition to a Partition Server to serve access
to the objects. The PM splits the Object Tables into N RangePartitions in each stamp, keeping
track of the current RangePartition breakdown for each OT and to which partition servers they are
assigned. The PM stores this assignment in the Partition Map Table. The PM ensures that each
RangePartition is assigned to exactly one active partition server at any time, and that two
RangePartitions do not overlap. It is also responsible for load balancing RangePartitions among
partition servers. Each stamp has multiple instances of the PM running, and they all contend for a
leader lock that is stored in the Lock Service (see below). The PM with the lease is the active PM
controlling the partition layer.
Partition Server (PS) – A partition server is responsible for serving requests to a set of
RangePartitions assigned to it by the PM. The PS stores all the persistent state of the partitions into
streams and maintains a memory cache of the partition state for efficiency. The system guarantees
that no two partition servers can serve the same RangePartition at the same time by using leases
with the Lock Service. This allows the PS to provide strong consistency and ordering of concurrent
transactions to objects for a RangePartition it is serving. A PS can concurrently serve multiple
RangePartitions from different OTs. In our deployments, a PS serves on average ten
RangePartitions at any time.
Lock Service – A Paxos Lock Service [3,13] is used for leader election for the PM. In addition,
each PS also maintains a lease with the lock service in order to serve partitions. We do not go into
the details of the PM leader election, or the PS lease management, since the concepts used are
similar to those described in the Chubby Lock [3] paper.
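The lease-based leader election can be sketched as follows (a toy model under our assumptions; the real system uses a Paxos-based lock service as cited): all PM instances contend for one named lock, and whichever instance holds an unexpired lease is the active PM.

```python
# Toy sketch of leader election through a lock-service lease.
# Class and method names are ours; the real Lock Service is Paxos-based.
import time

class LockService:
    def __init__(self, lease_secs=10):
        self.lease_secs = lease_secs
        self.holder = None
        self.expiry = 0.0

    def try_acquire(self, candidate, now=None):
        """Grant (or renew) the lease if it is free, expired, or already
        held by this candidate. Returns True if candidate is now leader."""
        now = time.monotonic() if now is None else now
        if self.holder is None or now >= self.expiry or self.holder == candidate:
            self.holder, self.expiry = candidate, now + self.lease_secs
            return True
        return False
```

A standby PM keeps calling `try_acquire`; it only wins after the current leader's lease lapses, which prevents two active PMs.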
Figure 4: Partition Layer Architecture (the figure shows the PM holding a leader lease with the Lock Service, partition servers renewing their leases, Front-End servers looking up partitions in the Partition Map Table, partition assignment and load balancing by the PM, and partition state persisted to and read from streams in the stream layer)
On partition server failure, all N RangePartitions served by the failed PS are assigned to available
PSs by the PM. The PM will choose N (or fewer) partition servers, based on the load on those
servers. The PM will assign a RangePartition to a PS, and then update the Partition Map Table
specifying what partition server is serving each RangePartition. This allows the Front-End layer to
find the location of RangePartitions by looking in the Partition Map Table (see Figure 4). When
the PS gets a new assignment it will start serving the new RangePartitions for as long as the PS
holds its partition server lease.
5.3 RangePartition Data Structures
A PS serves a RangePartition by maintaining a set of in-memory data structures and a set of
persistent data structures in streams.
5.3.1 Persistent Data Structure
A RangePartition uses a Log-Structured Merge-Tree [17,4] to maintain its persistent data. Each
Object Table’s RangePartition consists of its own set of streams in the stream layer, and the streams
belong solely to a given RangePartition, though the underlying extents can be pointed to by
multiple streams in different RangePartitions due to RangePartition splitting. The following are the
set of streams that comprise each RangePartition (shown in Figure 5).
Metadata Stream – The metadata stream is the root stream for a RangePartition. The PM assigns
a partition to a PS by providing the name of the RangePartition’s metadata stream. The metadata
stream contains enough information for the PS to load a RangePartition, including the name of the
commit log stream and data streams for that RangePartition, as well as pointers (extent+offset) into
those streams for where to start operating in those streams (e.g., where to start processing in the
commit log stream and the root of the index for the row data stream). The PS serving the
RangePartition also writes in the metadata stream the status of outstanding split and merge
operations that the RangePartition may be involved in.
Commit Log Stream – Is a commit log used to store the recent insert, update, and delete operations applied to the RangePartition since the last checkpoint was generated for the RangePartition.
Row Data Stream – Stores the checkpoint row data and index for the RangePartition.
Blob Data Stream – Is only used by the Blob Table to store the blob data bits.
Each of the above is a separate stream in the stream layer owned by an Object Table's RangePartition.
Figure 5: RangePartition Data Structures (the figure shows the RangePartition memory data module – row page cache, index cache, bloom filters, load metrics, and adaptive range profiling – alongside the persistent data stored in the stream layer: the metadata stream with extent pointers, the commit log stream, the row data stream with its checkpoints, and the blob data stream)
Each RangePartition in an Object Table has only one data stream, except the Blob Table. A
RangePartition in the Blob Table has a “row data stream” for storing its row checkpoint data (the
blob index), and a separate “blob data stream” for storing the blob data bits for the special
BlobType described earlier.
5.3.2 In-Memory Data Structures
A partition server maintains the following in-memory components as shown in Figure 5:
Memory Table – This is the in-memory version of the commit log for a RangePartition, containing
all of the recent updates that have not yet been checkpointed to the row data stream. When a
lookup occurs the memory table is checked to find recent updates to the RangePartition.
Index Cache – This cache stores the checkpoint indexes of the row data stream. We separate this
cache out from the row data cache to make sure we keep as much of the main index cached in
memory as possible for a given RangePartition.
Row Data Cache – This is a memory cache of the checkpoint row data pages. The row data cache
is read-only. When a lookup occurs, both the row data cache and the memory table are checked,
giving preference to the memory table.
Bloom Filters – If the data is not found in the memory table or the row data cache, then the
index/checkpoints in the data stream need to be searched. It can be expensive to blindly examine
them all. Therefore a bloom filter is kept for each checkpoint, which indicates if the row being
accessed may be in the checkpoint.
We do not go into further details about these components, since these are similar to those in [17,4].
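The lookup order described above can be sketched as one function (a simplification with our own names; the bloom filter is modeled as a set, which, like a real bloom filter, has no false negatives):

```python
# Simplified read path for a RangePartition lookup: memory table first,
# then the row data cache, then bloom-filtered checkpoint searches.
def lookup(key, memory_table, row_cache, checkpoints):
    """checkpoints: list of (bloom, rows) pairs, newest first, where
    bloom is a set standing in for a real bloom filter."""
    if key in memory_table:
        return memory_table[key]      # most recent uncheckpointed update
    if key in row_cache:
        return row_cache[key]         # cached checkpoint row page
    for bloom, rows in checkpoints:
        if key in bloom:              # bloom says "maybe present"
            if key in rows:           # pay for the checkpoint search
                return rows[key]
    return None
```

The bloom filters matter because without them every miss in memory would have to search every checkpoint's index.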
5.4 Data Flow
When the PS receives a write request to the RangePartition (e.g., insert, update, delete), it appends
the operation into the commit log, and then puts the newly changed row into the memory table.
Therefore, all the modifications to the partition are recorded persistently in the commit log, and
also reflected in the memory table. At this point success can be returned back to the client (the FE
servers) for the transaction. When the size of the memory table reaches its threshold size or the size
of the commit log stream reaches its threshold, the partition server will write the contents of the
memory table into a checkpoint stored persistently in the row data stream for the RangePartition.
The corresponding portion of the commit log can then be removed. To control the total number of
checkpoints for a RangePartition, the partition server will periodically combine the checkpoints
into larger checkpoints, and then remove the old checkpoints via garbage collection.
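The write path above can be sketched as follows (our simplification: streams are modeled as in-process lists, and the memory-table threshold is an illustrative constant, not a WAS value):

```python
# Sketch of the RangePartition write path: append to the commit log,
# apply to the memory table, checkpoint when the table grows too large.
MEMTABLE_LIMIT = 4   # illustrative threshold only

class RangePartitionWriter:
    def __init__(self):
        self.commit_log = []     # models the commit log stream
        self.memory_table = {}
        self.checkpoints = []    # models checkpoints in the row data stream

    def write(self, key, value):
        self.commit_log.append((key, value))   # persisted first
        self.memory_table[key] = value         # then visible in memory
        if len(self.memory_table) >= MEMTABLE_LIMIT:
            self.checkpoint()
        return "ok"                            # success back to the FE

    def checkpoint(self):
        self.checkpoints.append(dict(self.memory_table))
        self.memory_table.clear()
        self.commit_log.clear()   # corresponding log portion removable
```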
For the Blob Table’s RangePartitions, we also store the Blob data bits directly into the commit log
stream (to minimize the number of stream writes for Blob operations), but those data bits are not
part of the row data so they are not put into the memory table. Instead, the BlobType property for
the row tracks the location of the Blob data bits (extent+offset, length). During checkpoint, the
extents that would be removed from the commit log are instead concatenated to the
RangePartition’s Blob data stream. Extent concatenation is a fast operation provided by the stream
layer since it consists of just adding pointers to extents at the end of the Blob data stream without
copying any data.
A PS can start serving a RangePartition by “loading” the partition. Loading a partition involves
reading the metadata stream of the RangePartition to locate the active set of checkpoints and
replaying the transactions in the commit log to rebuild the in-memory state. Once these are done,
the PS has the up-to-date view of the RangePartition and can start serving requests.
5.5 RangePartition Load Balancing
A critical part of the partition layer is breaking these massive Object Tables into RangePartitions
and automatically load balancing them across the partition servers to meet their varying traffic demands.
The PM performs three operations to spread load across partition servers and control the total
number of partitions in a stamp:
Load Balance – This operation identifies when a given PS has too much traffic and reassigns one
or more RangePartitions to less loaded partition servers.
Split – This operation identifies when a single RangePartition has too much load and splits the
RangePartition into two or more smaller and disjoint RangePartitions, then load balances
(reassigns) them across two or more partition servers.
Merge – This operation merges together cold or lightly loaded RangePartitions that together form a
contiguous key range within their OT. Merge is used to keep the number of RangePartitions within
a bound proportional to the number of partition servers in a stamp.
WAS keeps the total number of partitions between a low watermark and a high watermark
(typically around ten times the partition server count within a stamp). At equilibrium, the partition
count will stay around the low watermark. If there are unanticipated traffic bursts that concentrate
on a single RangePartition, it will be split to spread the load. When the total RangePartition count
is approaching the high watermark, the system will increase the merge rate to eventually bring the
RangePartition count down towards the low watermark. Therefore, the number of RangePartitions
for each OT changes dynamically based upon the load on the objects in those tables.
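The watermark rule can be sketched as a simple control decision (the low-watermark factor below is our illustrative choice; the paper specifies only that the high watermark is roughly ten times the partition server count):

```python
# Illustrative control rule for the RangePartition count in a stamp.
def partition_action(count, servers, has_hot_partition,
                     low_factor=5, high_factor=10):
    low, high = servers * low_factor, servers * high_factor
    if has_hot_partition and count < high:
        return "split"    # spread a traffic burst across more partitions
    if count >= high:
        return "merge"    # raise merge rate back toward the low watermark
    return "steady"
```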
Having a high watermark of RangePartitions ten times the number of partition servers (a storage
stamp has a few hundred partition servers) was chosen based on how big we can allow the stream
and extent metadata to grow for the SM, and still completely fit the metadata in memory for the
SM. Keeping many more RangePartitions than partition servers enables us to quickly distribute a
failed PS or rack’s load across many other PSs. A given partition server can end up serving a
single extremely hot RangePartition, tens of lightly loaded RangePartitions, or a mixture in between, depending upon the current load to the RangePartitions in the stamp. The number of
RangePartitions for the Blob Table vs. Entity Table vs. Message Table depends upon the load on
the objects in those tables and is continuously changing within a storage stamp based upon traffic.
For each stamp, we typically see 75 splits and merges and 200 RangePartition load balances per day.
5.5.1 Load Balance Operation Details
We track the load for each RangePartition as well as the overall load for each PS. For both of these
we track (a) transactions/second, (b) average pending transaction count, (c) throttling rate, (d) CPU
usage, (e) network usage, (f) request latency, and (g) data size of the RangePartition. The PM
maintains heartbeats with each PS. This information is passed back to the PM in responses to the
heartbeats. If the PM sees a RangePartition that has too much load based upon the metrics, then it
will decide to split the partition and send a command to the PS to perform the split. If instead a PS
has too much load, but no individual RangePartition seems to be too highly loaded, the PM will
take one or more RangePartitions from the PS and reassign them to a more lightly loaded PS.
To load balance a RangePartition, the PM sends an offload command to the PS, which will have the
RangePartition write a current checkpoint before offloading it. Once complete, the PS acks back to
the PM that the offload is done. The PM then assigns the RangePartition to its new PS and updates
the Partition Map Table to point to the new PS. The new PS loads and starts serving traffic for the
RangePartition. The loading of the RangePartition on the new PS is very quick since the commit
log is small due to the checkpoint prior to the offload.
5.5.2 Split Operation
WAS splits a RangePartition due to too much load as well as the size of its row or blob data
streams. If the PM identifies either situation, it tells the PS serving the RangePartition to split based
upon load or size. The PM makes the decision to split, but the PS chooses the key (AccountName,
PartitionName) where the partition will be split. To split based upon size, the RangePartition
maintains the total size of the objects in the partition and the split key values where the partition
can be approximately halved in size, and the PS uses that to pick the key for where to split. If the
split is based on load, the PS chooses the key based upon Adaptive Range Profiling [16]. The PS
adaptively tracks which key ranges in a RangePartition have the most load and uses this to
determine on what key to split the RangePartition.
To split a RangePartition (B) into two new RangePartitions (C,D), the following steps are taken.
1. The PM instructs the PS to split B into C and D.
2. The PS in charge of B checkpoints B, then stops serving traffic briefly during step 3 below.
3. The PS uses a special stream operation “MultiModify” to take each of B’s streams (metadata,
commit log and data) and creates new sets of streams for C and D respectively with the same
extents in the same order as in B. This step is very fast, since a stream is just a list of pointers to
extents. The PS then appends the new partition key ranges for C and D to their metadata streams.
4. The PS starts serving requests to the two new partitions C and D for their respective disjoint
PartitionName ranges.
5. The PS notifies the PM of the split completion, and the PM updates the Partition Map Table and
its metadata information accordingly. The PM then moves one of the split partitions to a different PS.
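Step 3 can be sketched with streams modeled as lists of extent pointers (our simplification; `MultiModify` here just copies pointer lists, which is why the real operation is fast and moves no data):

```python
# Sketch of splitting RangePartition B into C and D. Streams are modeled
# as dicts of stream name -> list of extent pointers.
def split_range_partition(b_streams, b_range, split_key):
    lo, hi = b_range

    def multimodify():
        # New streams that point at B's existing extents, in order.
        return {name: list(extents) for name, extents in b_streams.items()}

    c, d = multimodify(), multimodify()
    c_range, d_range = (lo, split_key), (split_key, hi)   # disjoint ranges
    return (c, c_range), (d, d_range)
```

C and D initially share B's extents; their key ranges, recorded in their metadata streams, keep the served rows disjoint.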
5.5.3 Merge Operation
To merge two RangePartitions, the PM will choose two RangePartitions C and D with adjacent
PartitionName ranges that have low traffic. The following steps are taken to merge C and D into a
new RangePartition E.
1. The PM moves C and D so that they are served by the same PS. The PM then tells the PS to
merge (C,D) into E.
2. The PS performs a checkpoint for both C and D, and then briefly pauses traffic to C and D
during step 3.
3. The PS uses the MultiModify stream command to create a new commit log and data streams for
E. Each of these streams is the concatenation of all of the extents from their respective streams in
C and D. This merge means that the extents in the new commit log stream for E will be all of C’s
extents in the order they were in C’s commit log stream followed by all of D’s extents in their
original order. This layout is the same for the new row and Blob data stream(s) for E.
4. The PS constructs the metadata stream for E, which contains the names of the new commit log
and data stream, the combined key range for E, and pointers (extent+offset) for the start and end of
the commit log regions in E’s commit log derived from C and D, as well as the root of the data
index in E’s data streams.
5. At this point, the new metadata stream for E can be correctly loaded, and the PS starts serving
the newly merged RangePartition E.
6. The PM then updates the Partition Map Table and its metadata information to reflect the merge.
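The concatenation in step 3 can be sketched the same way (streams as lists of extent pointers, our simplification): E's streams are C's extents in their original order followed by D's.

```python
# Sketch of merging RangePartitions C and D into E via extent-pointer
# concatenation; no data is copied.
def merge_range_partitions(c_streams, d_streams, c_range, d_range):
    e_streams = {name: c_streams[name] + d_streams[name]
                 for name in c_streams}
    # E covers the combined, contiguous key range of C and D.
    e_range = (min(c_range[0], d_range[0]), max(c_range[1], d_range[1]))
    return e_streams, e_range
```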
5.6 Partition Layer Inter-Stamp Replication
Thus far we have talked about an AccountName being associated (via DNS) to a single location
and storage stamp, where all data access goes to that stamp. We call this the primary stamp for an
account. An account actually has one or more secondary stamps assigned to it by the Location
Service, and this primary/secondary stamp information tells WAS to perform inter-stamp
replication for this account from the primary stamp to the secondary stamp(s).
One of the main scenarios for inter-stamp replication is to geo-replicate an account’s data between
two data centers for disaster recovery. In this scenario, a primary and secondary location is chosen
for the account. Take, for example, an account, for which we want the primary stamp (P) to be
located in US South and the secondary stamp (S) to be located in US North. When provisioning
the account, the LS will choose a stamp in each location and register the AccountName with both
stamps such that the US South stamp (P) takes live traffic and the US North stamp (S) will take
only inter-stamp replication (also called geo-replication) traffic from stamp P for the account. The
LS updates DNS to have the account's hostname point to the storage
stamp P’s VIP in US South. When a write comes into stamp P for the account, the change is fully
replicated within that stamp using intra-stamp replication at the stream layer then success is
returned to the client. After the update has been committed in stamp P, the partition layer in stamp
P will asynchronously geo-replicate the change to the secondary stamp S using inter-stamp
replication. When the change arrives at stamp S, the transaction is applied in the partition layer and
this update fully replicates using intra-stamp replication within stamp S.
Since the inter-stamp replication is done asynchronously, recent updates that have not been inter-stamp replicated can be lost in the event of disaster. In production, changes are geo-replicated and
committed on the secondary stamp within 30 seconds on average after the update was committed
on the primary stamp.
Inter-stamp replication is used for both account geo-replication and migration across stamps. For
disaster recovery, we may need to perform an abrupt failover where recent changes may be lost, but
for migration we perform a clean failover so there is no data loss. In both failover scenarios, the
Location Service makes an active secondary stamp for the account the new primary and switches
DNS to point to the secondary stamp’s VIP. Note that the URI used to access the object does not
change after failover. This allows the existing URIs used to access Blobs, Tables and Queues to
continue to work after failover.
6. Application Throughput
For our cloud offering, customers run their applications as a tenant (service) on VMs. For our
platform, we separate computation and storage into their own stamps (clusters) within a data center
since this separation allows each to scale independently and control their own load balancing. Here
we examine the performance of a customer application running from their hosted service on VMs
in the same data center as where their account data is stored. Each VM used is an extra-large VM
with full control of the entire compute node and a 1Gbps NIC. The results were gathered on live
shared production stamps with internal and external customers.
Figure 6 shows the WAS Table operation throughput in terms of the entities per second (y-axis) for
1-16 VMs (x-axis) performing random 1KB single entity get and put requests against a single
100GB Table. It also shows batch inserts of 100 entities at a time – a common way applications
insert groups of entities into a WAS Table. Figure 7 shows the throughput in megabytes per second
(y-axis) for randomly getting and putting 4MB blobs vs. the number of VMs used (x-axis). All of
the results are for a single storage account.
Figure 6: Table Entity Throughput for 1-16 VMs
Figure 7: Blob Throughput for 1-16 VMs
These results show a linear increase in scale is achieved for entities/second as the application scales
out the amount of computing resources it uses for accessing WAS Tables. For Blobs, the
throughput scales linearly up to eight VMs, but tapers off as the aggregate throughput reaches the
network capacity on the client side where the test traffic was generated. The results show that, for
Table operations, batch puts offer about three times more throughput compared to single entity
puts. That is because the batch operation significantly reduces the number of network roundtrips
and requires fewer stream writes. In addition, the Table read operations have slightly lower
throughput than write operations. This difference is due to the particular access pattern of our
experiment, which randomly accesses a large key space on a large data set, minimizing the effect of
caching. Writes on the other hand always result in sequential writes to the journal.
7. Workload Profiles
Usage patterns for cloud-based applications can vary significantly. Section 1 already described a
near-real time ingestion engine to provide Facebook and Twitter search for Bing. In this section we
describe a few additional internal services using WAS, and give some high-level metrics of their usage.
The XBox GameSaves service was announced at E3 this year and will provide a new feature in Fall
2011 for providing saved game data into the cloud for millions of XBox users. This feature will
enable subscribed users to upload their game progress into the WAS cloud storage service, which
they can then access from any XBox console they sign into. The backing storage for this feature
leverages Blob and Table storage.
The XBox Telemetry service stores console-generated diagnostics and telemetry information for
later secure retrieval and offline processing. For example, various Kinect related features running
on Xbox 360 generate detailed usage files which are uploaded to the cloud to analyze and improve
the Kinect experience based on customer opt-in. The data is stored directly into Blobs, and Tables
are used to maintain metadata information about the files. Queues are used to coordinate the
processing and the cleaning up of the Blobs.
Microsoft’s Zune backend uses Windows Azure for media file storage and delivery, where files are
stored as Blobs.
Table 1 shows the relative breakdown among Blob, Table, and Queue usage across all (All)
services (internal and external) using WAS as well as for the services described above. The table
shows the breakdown of requests, capacity usage, and ingress and egress traffic for Blobs, Tables
and Queues.
Notice that the percentage of requests for all services shows that about 17.9% of all requests are
Blob requests, 46.88% of the requests are Table operations and 35.22% are Queue requests for all
services using WAS. But in terms of capacity, 70.31% of capacity is in Blobs, 29.68% of capacity
is used by Tables, and 0.01% used by Queues. “%Ingress” is the percentage breakdown of
incoming traffic (bytes) among Blob, Table, and Queue; “%Egress” is the same for outbound traffic
(bytes). The results show that different customers have very different usage patterns. In terms of
capacity usage, some customers (e.g., Zune and Xbox GameSaves) have mostly unstructured data
(such as media files) and put those into Blobs, whereas other customers like Bing and XBox
Telemetry that have to index a lot of data have a significant amount of structured data in Tables.
Queues use very little space compared to Blobs and Tables, since they are primarily used as a
communication mechanism instead of storing data over a long period of time.
Table 1: Usage Comparison for (Blob/Table/Queue) – per-service breakdown of %Requests, %Capacity, %Ingress, and %Egress among Blob, Table, and Queue (rows include XBox GameSaves and XBox Telemetry)
8. Design Choices and Lessons Learned
Here, we discuss a few of our WAS design choices and relate some of the lessons we have learned
thus far.
Scaling Computation Separate from Storage – Early on we decided to separate customer VM-based computation from storage for Windows Azure. Therefore, nodes running a customer's
service code are separate from nodes providing their storage. As a result, we can scale our supply
of computation cores and storage independently to meet customer demand in a given data center.
This separation also provides a layer of isolation between compute and storage given its multi-tenancy usage, and allows both of the systems to load balance independently.
Given this decision, our goal from the start has been to allow computation to efficiently access
storage with high bandwidth without the data being on the same node or even in the same rack. To
achieve this goal we are in the process of moving towards our next generation data center
networking architecture [10], which flattens the data center networking topology and provides full
bisection bandwidth between compute and storage.
Range Partitions vs. Hashing – We decided to use range-based partitioning/indexing instead of
hash-based indexing (where the objects are assigned to a server based on the hash values of their
keys) for the partition layer’s Object Tables. One reason for this decision is that range-based
partitioning makes performance isolation easier since a given account’s objects are stored together
within a set of RangePartitions (which also provides efficient object enumeration). Hash-based
schemes have the simplicity of distributing the load across servers, but lose the locality of objects
for isolation and efficient enumeration. The range partitioning allows WAS to keep a customer’s
objects together in their own set of partitions to throttle and isolate potentially abusive accounts.
For these reasons, we took the range-based approach and built an automatic load balancing system
(Section 5.5) to spread the load dynamically according to user traffic by splitting and moving
partitions among servers.
A downside of range partitioning is scaling out access to sequential access patterns. For example,
if a customer is writing all of their data to the very end of a table's key range (e.g., insert key 2011-06-30:12:00:00, then key 2011-06-30:12:00:02, then key 2011-06-30:12:00:10), all of the writes go
to the very last RangePartition in the customer’s table. This pattern does not take advantage of the
partitioning and load balancing our system provides. In contrast, if the customer distributes their
writes across a large number of PartitionNames, the system can quickly split the table into multiple
RangePartitions and spread them across different servers to allow performance to scale linearly
with load (as shown in Figure 6). To address this sequential access pattern for RangePartitions, a
customer can always use hashing or bucketing for the PartitionName, which avoids the above
sequential access pattern issue.
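One way to apply that advice (our example, not a WAS API) is to prefix the PartitionName with a hash-derived bucket, so that sequential timestamp keys spread across many RangePartitions:

```python
# Hypothetical bucketing of a timestamp-based PartitionName so sequential
# inserts land on different RangePartitions.
import hashlib

def bucketed_partition_name(timestamp_key, buckets=16):
    digest = hashlib.md5(timestamp_key.encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    # The bucket prefix dominates the sort order, scattering the writes.
    return "%02d-%s" % (bucket, timestamp_key)
```

The cost is that range scans over time must now fan out across all buckets, which is the usual tradeoff of this pattern.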
Throttling/Isolation – At times, servers become overloaded by customer requests. A difficult
problem was identifying which storage accounts should be throttled when this happens and making
sure well-behaving accounts are not affected.
Each partition server keeps track of the request rate for AccountNames and PartitionNames.
Because there are a large number of AccountNames and PartitionNames it may not be practical to
keep track of them all. The system uses a Sample-Hold algorithm [7] to track the request rate
history of the top N busiest AccountNames and PartitionNames. This information is used to
determine whether an account is well-behaving or not (e.g., whether the traffic backs off when it is
throttled). If a server is getting overloaded, it uses this information to selectively throttle the
incoming traffic, targeting accounts that are causing the issue. For example, a PS computes a
throttling probability of the incoming requests for each account based on the request rate history for
the account (those with high request rates will have a larger probability of being throttled, whereas
accounts with little traffic will not). In addition, based on the request history at the AccountName
and PartitionName levels, the system determines whether the account has been well-behaving.
Load balancing will try to keep the servers within an acceptable load, but when access patterns
cannot be load balanced (e.g., high traffic to a single PartitionName, high sequential access traffic,
repetitive sequential scanning, etc.), the system throttles requests of such traffic patterns when they
are too high.
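A simplified sketch of this tracking-and-throttling approach, loosely following the Sample-Hold idea of [7]: once an account is sampled into a bounded table, every subsequent request from it is counted. The capacity, sampling probability, and throttling formula below are illustrative assumptions:

```python
import random

class RequestRateTracker:
    """Simplified Sample-Hold sketch: sampled accounts are counted exactly
    from admission onward; accounts never sampled stay untracked."""

    def __init__(self, capacity=100, sample_prob=0.01):
        self.capacity = capacity
        self.sample_prob = sample_prob
        self.counts = {}  # account -> requests counted since admission

    def record(self, account):
        if account in self.counts:
            self.counts[account] += 1      # "hold": count every request
        elif len(self.counts) < self.capacity and random.random() <= self.sample_prob:
            self.counts[account] = 1       # "sample": admit to the table

    def throttle_probability(self, account, allowed_rate):
        """Accounts with high tracked rates get a larger probability of
        being throttled; untracked or low-traffic accounts are left alone."""
        rate = self.counts.get(account, 0)
        if rate <= allowed_rate:
            return 0.0
        return min(1.0, (rate - allowed_rate) / rate)

tracker = RequestRateTracker(sample_prob=1.0)  # always sample, for the demo
for _ in range(50):
    tracker.record("hot-account")
tracker.record("quiet-account")
```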
Automatic Load Balancing – We found it crucial to have efficient automatic load balancing of
partitions that can quickly adapt to various traffic conditions. This enables WAS to maintain high
availability in this multi-tenancy environment as well as deal with traffic spikes to a single user’s
storage account. Gathering the adaptive profile information, discovering what metrics are most
useful under various traffic conditions, and tuning the algorithm to be smart enough to effectively
deal with different traffic patterns we see in production were some of the areas we spent a lot of
time working on before achieving a system that works well for our multi-tenancy environment.
We started with a system that used a single number to quantify “load” on each RangePartition and
each server. We first tried the product of request latency and request rate to represent the load on a
PS and each RangePartition. This product is easy to compute and reflects the load incurred by the
requests on the server and partitions. This design worked well for the majority of the load
balancing needs (moving partitions around), but it did not correctly capture high CPU utilization
that can occur during scans or high network utilization. Therefore, we now take into consideration
request, CPU, and network loads to guide load balancing. However, these metrics are not sufficient
to correctly guide splitting decisions.
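The evolution of the load metric can be illustrated as follows. The latency-times-rate product is stated in the text; the weighted combination of request, CPU, and network load is an illustrative assumption, not WAS's actual formula:

```python
def request_load(latency_ms, request_rate):
    """The original single-number metric: request latency times request
    rate, cheap to compute per server and per RangePartition."""
    return latency_ms * request_rate

def combined_load(latency_ms, request_rate, cpu_util, net_util,
                  w_req=1.0, w_cpu=1.0, w_net=1.0):
    """Folds CPU and network utilization into the decision; the weighted
    sum here is an assumption for illustration only."""
    return (w_req * request_load(latency_ms, request_rate)
            + w_cpu * cpu_util + w_net * net_util)

baseline = request_load(5.0, 200)                              # latency x rate
total = combined_load(5.0, 200, cpu_util=50.0, net_util=30.0)  # adds CPU/net
```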
For splitting, we introduced separate mechanisms to trigger splits of partitions, where we collect
hints to find out whether some partitions are reaching their capacity across several metrics. For
example, we can trigger partition splits based on request throttling, request timeouts, the size of a
partition, etc. Combining split triggers and the load balancing allows the system to quickly split
and load balance hot partitions across different servers.
From a high level, the algorithm works as follows. Every N seconds (currently 15 seconds) the PM
sorts all RangePartitions based on each of the split triggers. The PM then goes through each
partition, looking at the detailed statistics to figure out if it needs to be split using the metrics
described above (load, throttling, timeouts, CPU usage, size, etc.). During this process, the PM
picks a small number to split for this quantum, and performs the split action on those.
After doing the split pass, the PM sorts all of the PSs based on each of the load balancing metrics: request load, CPU load, and network load. It then uses this to identify which PSs are overloaded
versus lightly loaded. The PM then chooses the PSs that are heavily loaded and, if there was a
recent split from the prior split pass, the PM will offload one of those RangePartitions to a lightly
loaded server. If there are still highly loaded PSs (without a recent split to offload), the PM
offloads RangePartitions from them to the lightly loaded PSs.
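One PM quantum might look like the following sketch: split a few partitions whose triggers fired, then pair the hottest servers with the coldest ones. The trigger thresholds and record layouts are illustrative assumptions, not the production algorithm:

```python
def pm_quantum(partitions, servers, max_splits=3, high_load=0.8, low_load=0.4):
    """One partition-manager pass (run every ~15 seconds in the text):
    first a split pass, then a load-balance pass."""
    # Split pass: any trigger (throttling, timeouts, partition size, ...)
    # can mark a partition for splitting this quantum.
    splits = [p for p in partitions
              if p["throttle_rate"] > 0.1
              or p["timeouts"] > 5
              or p["size_gb"] > 100][:max_splits]

    # Load-balance pass: plan one RangePartition move per hot/cold pair.
    moves = []
    ordered = sorted(servers, key=lambda s: s["load"])
    i, j = 0, len(ordered) - 1
    while i < j and ordered[j]["load"] > high_load and ordered[i]["load"] < low_load:
        moves.append((ordered[j]["name"], ordered[i]["name"]))  # hot -> cold
        i, j = i + 1, j - 1
    return splits, moves

# One throttled partition should split; hot server "a" sheds load to "b".
partitions = [
    {"throttle_rate": 0.2, "timeouts": 0, "size_gb": 1},
    {"throttle_rate": 0.0, "timeouts": 0, "size_gb": 1},
]
servers = [{"name": "a", "load": 0.9}, {"name": "b", "load": 0.1}]
splits, moves = pm_quantum(partitions, servers)
```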
The core load balancing algorithm can be dynamically “swapped out” via configuration updates.
WAS includes scripting language support that enables customizing the load balancing logic, such
as defining how a partition split can be triggered based on different system metrics. This support
gives us flexibility to fine-tune the load balancing algorithm at runtime as well as try new
algorithms according to various traffic patterns observed.
Separate Log Files per RangePartition – Performance isolation for storage accounts is critical in
a multi-tenancy environment. This requirement is one of the reasons we used separate log streams
for each RangePartition, whereas BigTable [4] uses a single log file across all partitions on the
same server. Having separate log files enables us to isolate the load time of a RangePartition to just
the recent object updates in that RangePartition.
Journaling – When we originally released WAS, it did not have journaling. As a result, we
experienced many hiccups with reads and writes contending with each other on the same drive,
noticeably affecting performance. We did not want to write to two log files (six replicas) like
BigTable [4] due to the increased network traffic. We also wanted a way to optimize small writes,
especially since we wanted separate log files per RangePartition. These requirements led us to the
journal approach with a single log file per RangePartition. We found this optimization quite
effective in reducing the latency and providing consistent performance.
Append-only System – Having an append-only system and sealing an extent upon failure have
greatly simplified the replication protocol and handling of failure scenarios. In this model, the data
is never overwritten once committed to a replica, and, upon failures, the extent is immediately
sealed. This model allows the consistency to be enforced across all the replicas via their commit lengths.
Furthermore, the append-only system has allowed us to keep snapshots of the previous states at
virtually no extra cost, which has made it easy to provide snapshot/versioning features. It also has
allowed us to efficiently provide optimizations like erasure coding. In addition, append-only has
been a tremendous benefit for diagnosing issues as well as repairing/recovering the system in case
something goes wrong. Since the history of changes is preserved, tools can easily be built to
diagnose issues and to repair or recover the system from a corrupted state back to a prior known
consistent state. When operating a system at this scale, we cannot emphasize enough the benefit
we have seen from using an append-only system for diagnostics and recovery.
An append-based system comes with certain costs. An efficient and scalable garbage collection
(GC) system is crucial to keep the space overhead low, and GC comes at a cost of extra I/O. In
addition, the data layout on disk may not be the same as the virtual address space of the data
abstraction stored, which led us to implement prefetching logic for streaming large data sets back to
the client.
End-to-end Checksums – We found it crucial to keep checksums for user data end to end. For
example, during a blob upload, once the Front-End server receives the user data, it immediately
computes the checksum and sends it along with the data to the backend servers. Then at each layer,
the partition server and the stream servers verify the checksum before continuing to process it. If a
mismatch is detected, the request is failed. This prevents corrupted data from being committed into
the system. We have seen cases where a few servers had hardware issues, and our end-to-end
checksum caught such issues and helped maintain data integrity. Furthermore, this end-to-end
checksum mechanism also helps identify servers that consistently have hardware issues so we can
take them out of rotation and mark them for repair.
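The end-to-end flow can be sketched as follows, with CRC32 standing in for whatever checksum WAS actually uses; the function names are illustrative:

```python
import zlib

def frontend_receive(data):
    """The front-end computes the checksum as soon as it receives the user
    data and sends it along with the data to the backend."""
    return data, zlib.crc32(data)

def verify_at_layer(data, checksum, layer):
    """Each layer (partition server, stream servers) re-verifies before
    processing; on a mismatch the request fails instead of committing
    corrupted data."""
    if zlib.crc32(data) != checksum:
        raise IOError(f"checksum mismatch at {layer}; failing request")
    return data, checksum

blob = b"user blob bytes"
data, crc = frontend_receive(blob)
data, crc = verify_at_layer(data, crc, "partition server")
data, crc = verify_at_layer(data, crc, "stream server")

# A corrupted byte is caught before commit.
caught = False
try:
    verify_at_layer(b"user blob byteX", crc, "stream server")
except IOError:
    caught = True
```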
Upgrades – A rack in a storage stamp is a fault domain. A concept orthogonal to fault domain is
what we call an upgrade domain (a set of servers briefly taken offline at the same time during a
rolling upgrade). Servers for each of the three layers are spread evenly across different fault and
upgrade domains for the storage service. This way, if a fault domain goes down, we lose at most
1/X of the servers for a given layer, where X is the number of fault domains. Similarly, during a
service upgrade at most 1/Y of the servers for a given layer are upgraded at a given time, where Y
is the number of upgrade domains. To achieve this, we use rolling upgrades, which enable us to
maintain high availability when upgrading the storage service, and we upgrade a single upgrade
domain at a time. For example, if we have ten upgrade domains, then upgrading a single domain
would potentially upgrade ten percent of the servers from each layer at a time.
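The placement property can be illustrated with a simple round-robin assignment. This is a sketch of the invariant (at most ~1/X of a layer per fault domain, ~1/Y per upgrade domain), not the actual WAS placement code:

```python
def assign_domains(servers, num_fault_domains, num_upgrade_domains):
    """Spread servers evenly across fault domains (X) and upgrade
    domains (Y) so that any single domain holds a bounded fraction."""
    return {server: (i % num_fault_domains,
                     (i // num_fault_domains) % num_upgrade_domains)
            for i, server in enumerate(servers)}

# 40 servers, 4 fault domains, 10 upgrade domains: each upgrade domain
# ends up with exactly 10% of the servers, matching the example above.
placement = assign_domains([f"s{i}" for i in range(40)], 4, 10)
upgrade_counts = {}
for _fault, upgrade in placement.values():
    upgrade_counts[upgrade] = upgrade_counts.get(upgrade, 0) + 1
```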
During a service upgrade, storage nodes may go offline for a few minutes before coming back
online. We need to maintain availability and ensure that enough replicas are available at any point
in time. Even though the system is built to tolerate isolated failures, these planned (massive)
upgrade “failures” can be more efficiently dealt with instead of being treated as abrupt massive
failures. The upgrade process is automated so that it is tractable to manage a large number of these
large-scale deployments. The automated upgrade process goes through each upgrade domain one
at a time for a given storage stamp. Before taking down an upgrade domain, the upgrade process
notifies the PM to move the partitions out of that upgrade domain and notifies the SM to not
allocate new extents in that upgrade domain. Furthermore, before taking down any servers, the
upgrade process checks with the SM to ensure that there are sufficient extent replicas available for
each extent outside the given upgrade domain. After upgrading a given domain, a set of validation
tests are run to make sure the system is healthy before proceeding to the next upgrade domain.
This validation is crucial for catching issues during the upgrade process and stopping it early
should an error occur.
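The automated upgrade loop described above can be sketched as follows; `pm`, `sm`, and the domain objects are hypothetical stand-ins for the partition manager, stream manager, and upgrade-domain handles:

```python
class UpgradeAbort(Exception):
    pass

def rolling_upgrade(upgrade_domains, pm, sm, validate):
    """One upgrade domain at a time: drain it, confirm replica safety,
    upgrade it, and validate before moving on (sketch)."""
    for ud in upgrade_domains:
        pm.move_partitions_out(ud)         # drain RangePartitions first
        sm.stop_allocating_extents(ud)     # no new extents land here
        if not sm.enough_replicas_outside(ud):
            raise UpgradeAbort(f"insufficient replicas outside {ud.name}")
        ud.upgrade()                       # take the domain down, upgrade it
        if not validate():                 # stop early on any regression
            raise UpgradeAbort(f"validation failed after {ud.name}")

# Minimal fakes to exercise the loop.
class FakeManager:
    def move_partitions_out(self, ud): pass
    def stop_allocating_extents(self, ud): pass
    def enough_replicas_outside(self, ud): return True

class FakeDomain:
    def __init__(self, name): self.name, self.upgraded = name, False
    def upgrade(self): self.upgraded = True

domains = [FakeDomain("ud0"), FakeDomain("ud1"), FakeDomain("ud2")]
rolling_upgrade(domains, FakeManager(), FakeManager(), validate=lambda: True)

# Validation failure stops the rollout early.
bad = [FakeDomain("ud0"), FakeDomain("ud1")]
stopped = False
try:
    rolling_upgrade(bad, FakeManager(), FakeManager(), validate=lambda: False)
except UpgradeAbort:
    stopped = True
```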
Multiple Data Abstractions from a Single Stack – Our system supports three different data
abstractions from the same storage stack: Blobs, Tables, and Queues. This design enables all data
abstractions to use the same intra-stamp and inter-stamp replication, use the same load balancing
system, and realize the benefits from improvements in the stream and partition layers. In addition,
because the performance needs of Blobs, Tables, and Queues are different, our single stack
approach enables us to reduce costs by running all services on the same set of hardware. Blobs use
the massive disk capacity, Tables use the I/O spindles from the many disks on a node (but do not
require as much capacity as Blobs), and Queues mainly run in memory. Therefore, we are not only
blending different customers’ workloads together on shared resources, we are also blending
together Blob, Table, and Queue traffic across the same set of storage nodes.
Use of System-defined Object Tables – We chose to use a fixed number of system defined Object
Tables to build Blob, Table, and Queue abstractions instead of exposing the raw Object Table
semantics to end users. This decision reduces management by our system to only the small set of
schemas of our internal, system defined Object Tables. It also provides for easy maintenance and
upgrade of the internal data structures and isolates changes of these system defined tables from end
user data abstractions.
Offering Storage in Buckets of 100TBs – We currently limit the amount of storage for an account
to be no more than 100TB. This constraint allows all of the storage account data to fit within a
given storage stamp, especially since our initial storage stamps held only two petabytes of raw data
(the new ones hold 20-30PB). To obtain more storage capacity within a single data center,
customers use more than one account within that location. This ended up being a reasonable
tradeoff for many of our large customers (storing petabytes of data), since they are typically already
using multiple accounts to partition their storage across different regions and locations (for local
access to data for their customers). Therefore, partitioning their data across accounts within a given
location to add more storage often fits into their existing partitioning design. Even so, it does
require large services to have account level partitioning logic, which not all customers naturally
have as part of their design. Therefore, we plan to increase the amount of storage that can be held
within a given storage account in the future.
CAP Theorem – WAS provides high availability with strong consistency guarantees. This
combination seems to violate the CAP theorem [2], which says a distributed system cannot have
availability, consistency, and partition tolerance at the same time. However, our system, in
practice, provides all three of these properties within a storage stamp. This situation is made
possible through layering and designing our system around a specific fault model.
The stream layer has a simple append-only data model, which provides high availability in the face
of network partitioning and other failures, whereas the partition layer, built upon the stream layer,
provides strong consistency guarantees. This layering allows us to decouple the nodes responsible
for providing strong consistency from the nodes storing the data with availability in the face of
network partitioning. This decoupling and targeting a specific set of faults allows our system to
provide high availability and strong consistency in face of various classes of failures we see in
practice. For example, the types of network partitioning we have seen within a storage stamp are
node failures and top-of-rack (TOR) switch failures. When a TOR switch fails, the given rack will
stop being used for traffic — the stream layer will stop using that rack and start using extents on
available racks to allow streams to continue writing. In addition, the partition layer will reassign its
RangePartitions to partition servers on available racks to allow all of the data to continue to be
served with high availability and strong consistency. Therefore, our system is designed to be able
to provide strong consistency with high availability for the network partitioning issues that are
likely to occur in our system (at the node level as well as TOR failures).
High-performance Debug Logging – We used an extensive debug logging infrastructure
throughout the development of WAS. The system writes logs to the local disks of the storage
nodes and provides a grep-like utility to do a distributed search across all storage node logs. We do
not push these verbose logs off the storage nodes, given the volume of data being logged.
When bringing WAS to production, reducing logging for performance reasons was considered.
The utility of verbose logging though made us wary of reducing the amount of logging in the
system. Instead, the logging system was optimized to vastly increase its performance and reduce
its disk space overhead by automatically tokenizing and compressing output, achieving a system
that can log 100’s of MB/s with little application performance impact per node. This feature allows
retention of many days of verbose debug logs across a cluster. The high-performance logging
system and associated log search tools are critical for investigating any problems in production in
detail without the need to deploy special code or reproduce problems.
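A minimal sketch of tokenize-and-compress logging: each distinct format string is assigned a small token id once, records carry only the token and arguments, and batches are compressed before being written. The record encoding here is an assumption, not WAS's actual log format:

```python
import zlib

class TokenizedLogger:
    """Buffer (token, args) records instead of full strings, then
    compress each batch before it hits the local disk."""

    def __init__(self):
        self.tokens = {}   # format string -> token id
        self.records = []  # buffered (token, args) records

    def log(self, fmt, *args):
        tok = self.tokens.setdefault(fmt, len(self.tokens))
        self.records.append((tok, args))

    def flush(self):
        """Serialize and compress the buffered batch."""
        payload = "\n".join(f"{tok}|{args!r}" for tok, args in self.records)
        self.records.clear()
        return zlib.compress(payload.encode())

logger = TokenizedLogger()
for i in range(1000):
    logger.log("request %s took %d ms", "GET", i)
blob = logger.flush()  # one token, 1000 small records, compressed once
```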
Pressure Point Testing – It is not practical to create tests for all combinations of all complex
behaviors that can occur in a large scale distributed system. Therefore, we use what we call
Pressure Points to aid in capturing these complex behaviors and interactions. The system provides a
programmable interface for all of the main operations in our system as well as the points in the
system to create faults.
Some examples of these pressure point commands are: checkpoint a
RangePartition, combine a set of RangePartition checkpoints, garbage collect a RangePartition,
split/merge/load balance RangePartitions, erasure code or un-erasure code an extent, crash each
type of server in a stamp, inject network latencies, inject disk latencies, etc.
The pressure point system is used to trigger all of these interactions during a stress run in specific
orders or randomly. This system has been instrumental in finding and reproducing issues from
complex interactions that might have taken years to naturally occur on their own.
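A pressure-point harness might look like the following sketch: named hooks at the main operations and fault sites that a stress run can fire in a specific order or randomly. The command names mirror the examples above, but the interface itself is an assumption:

```python
import random

class PressurePoints:
    """Registry of named fault/operation hooks for stress runs."""

    def __init__(self, seed=None):
        self.commands = {}
        self.rng = random.Random(seed)
        self.fired = []  # ordered record of triggered pressure points

    def register(self, name, action):
        self.commands[name] = action

    def trigger(self, name):
        """Fire one pressure point in a specific order."""
        self.fired.append(name)
        return self.commands[name]()

    def trigger_random(self):
        """Fire a randomly chosen pressure point during a stress run."""
        return self.trigger(self.rng.choice(sorted(self.commands)))

pp = PressurePoints(seed=42)
pp.register("checkpoint_rangepartition", lambda: "checkpointed")
pp.register("inject_disk_latency", lambda: "latency injected")
result = pp.trigger("checkpoint_rangepartition")
pp.trigger_random()
```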
9. Related Work
Prior studies [9] revealed the challenges in achieving strong consistency and high availability in a
poorly-connected network environment. Some systems address this by reducing consistency
guarantees to achieve high availability [22,14,6]. But this shifts the burden to the applications to
deal with conflicting views of data. For instance, Amazon’s SimpleDB was originally introduced
with an eventual consistency model and more recently added strongly consistent operations [23].
Van Renesse et al. [20] have shown, via Chain Replication, the feasibility of building large-scale
storage systems providing both strong consistency and high availability, which was later extended
to allow reading from any replica [21]. Given our customer needs for strong consistency, we set
out to provide a system that can provide strong consistency with high availability along with
partition tolerance for our fault model.
As in many other highly-available distributed storage systems [6,14,1,5], WAS also provides geo-redundancy. Some of these systems put geo-replication on the critical path of the live application
requests, whereas we made a design trade-off to take a classical asynchronous geo-replication
approach [18] and leave it off the critical path. Performing the geo-replication completely
asynchronously allows us to provide better write latency for applications, and allows more
optimizations, such as batching and compaction for geo-replication, and efficient use of cross-data
center bandwidth. The tradeoff is that if there is a disaster and an abrupt failover needs to occur,
then there is unavailability during the failover and a potential loss of recent updates to a customer’s data.
The closest system to ours is GFS [8,15] combined with BigTable [4]. A few differences from
these prior publications are: (1) GFS allows relaxed consistency across replicas and does not
guarantee that all replicas are bitwise the same, whereas WAS provides that guarantee, (2)
BigTable combines multiple tablets into a single commit log and writes them to two GFS files in
parallel to avoid GFS hiccups, whereas we found we could work around both of these by using
journaling in our stream layer, and (3) we provide a scalable Blob storage system and batch Table
transactions integrated into a BigTable-like framework. In addition, we describe how WAS
automatically load balances, splits, and merges RangePartitions according to application traffic demands.
10. Conclusions
The Windows Azure Storage platform implements essential services for developers of cloud based
solutions. The combination of strong consistency, global partitioned namespace, and disaster
recovery has been an important set of customer features in WAS’s multi-tenancy environment. WAS runs a
disparate set of workloads with various peak usage profiles from many customers on the same set
of hardware. This significantly reduces storage cost since the amount of resources to be provisioned
is significantly less than the sum of the peak resources required to run all of these workloads on
dedicated hardware.
As our examples demonstrate, the three storage abstractions, Blobs, Tables, and Queues, provide
mechanisms for storage and workflow control for a wide range of applications. Not mentioned,
however, is the ease with which the WAS system can be utilized. For example, the initial version of
the Facebook/Twitter search ingestion engine took one engineer only two months from the start of
development to launching the service. This experience illustrates our service's ability to empower
customers to easily develop and deploy their applications to the cloud.
Additional information on Windows Azure and Windows Azure Storage is available online.
We would like to thank Geoff Voelker, Greg Ganger, and anonymous reviewers for providing
valuable feedback on this paper.
We would like to acknowledge the creators of Cosmos (Bing’s storage system): Darren Shakib,
Andrew Kadatch, Sam McKelvie, Jim Walsh and Jonathan Forbes. We started Windows Azure 5
years ago with Cosmos as our intra-stamp replication system. The data abstractions and append-only extent-based replication system presented in Section 4 were created by them. We extended
Cosmos to create our stream layer by adding mechanisms to allow us to provide strong consistency
in coordination with the partition layer, stream operations to allow us to efficiently split/merge
partitions, journaling, erasure coding, spindle anti-starvation, read load-balancing, and other improvements.
We would also like to thank additional contributors to Windows Azure Storage: Maneesh Sah, Matt
Hendel, Kavitha Golconda, Jean Ghanem, Joe Giardino, Shuitao Fan, Justin Yu, Dinesh Haridas,
Jay Sreedharan, Monilee Atkinson, Harshawardhan Gadgil, Phaneesh Kuppahalli, Nima Hakami,
Maxim Mazeev, Andrei Marinescu, Garret Buban, Ioan Oltean, Ritesh Kumar, Richard Liu, Rohit
Galwankar, Brihadeeshwar Venkataraman, Jayush Luniya, Serdar Ozler, Karl Hsueh, Ming Fan,
David Goebel, Joy Ganguly, Ishai Ben Aroya, Chun Yuan, Philip Taron, Pradeep Gunda, Ryan
Zhang, Shyam Antony, Qi Zhang, Madhav Pandya, Li Tan, Manish Chablani, Amar Gadkari,
Haiyong Wang, Hakon Verespej, Ramesh Shankar, Surinder Singh, Ryan Wu, Amruta Machetti,
Abhishek Singh Baghel, Vineet Sarda, Alex Nagy, Orit Mazor, and Kayla Bunch.
Finally we would like to thank Amitabh Srivastava, G.S. Rana, Bill Laing, Satya Nadella, Ray
Ozzie, and the rest of the Windows Azure team for their support.
References
[1] J. Baker et al., "Megastore: Providing Scalable, Highly Available Storage for Interactive
Services," in Conf. on Innovative Data Systems Research, 2011.
[2] Eric A. Brewer, "Towards Robust Distributed Systems. (Invited Talk)," in Principles of
Distributed Computing, Portland, Oregon, 2000.
[3] M. Burrows, "The Chubby Lock Service for Loosely-Coupled Distributed Systems," in
OSDI, 2006.
[4] F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data," in OSDI, 2006.
[5] B. Cooper et al., "PNUTS: Yahoo!'s Hosted Data Serving Platform," VLDB, vol. 1, no. 2, 2008.
[6] G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store," in SOSP, 2007.
[7] Cristian Estan and George Varghese, "New Directions in Traffic Measurement and
Accounting," in SIGCOMM, 2002.
[8] S. Ghemawat, H. Gobioff, and S.T. Leung, "The Google File System," in SOSP, 2003.
[9] J. Gray, P. Helland, P. O'Neil, and D. Shasha, "The Dangers of Replication and a
Solution," in SIGMOD, 1996.
[10] Albert Greenberg et al., "VL2: A Scalable and Flexible Data Center Network,"
Communications of the ACM, vol. 54, no. 3, pp. 95-104, 2011.
[11] Y. Hu and Q. Yang, "DCD—Disk Caching Disk: A New Approach for Boosting I/O
Performance," in ISCA, 1996.
[12] H.T. Kung and John T. Robinson, "On Optimistic Methods for Concurrency Control,"
ACM Transactions on Database Systems, vol. 6, no. 2, pp. 213-226, June 1981.
[13] Leslie Lamport, "The Part-Time Parliament," ACM Transactions on Computer Systems,
vol. 16, no. 2, pp. 133-169, May 1998.
[14] A. Malik and P. Lakshman, "Cassandra: a decentralized structured storage system,"
SIGOPS Operating System Review, vol. 44, no. 2, 2010.
[15] M. McKusick and S. Quinlan, "GFS: Evolution on Fast-forward," ACM Queue, vol. 7, no. 7, 2009.
[16] S. Mysore, B. Agrawal, T. Sherwood, N. Shrivastava, and S. Suri, "Profiling over Adaptive
Ranges," in Symposium on Code Generation and Optimization, 2006.
[17] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil, "The Log-Structured Merge-Tree (LSM-tree)," Acta Informatica, vol. 33, no. 4, 1996.
[18] H. Patterson et al., "SnapMirror: File System Based Asynchronous Mirroring for Disaster
Recovery," in USENIX-FAST, 2002.
[19] Irving S. Reed and Gustave Solomon, "Polynomial Codes over Certain Finite Fields,"
Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 2, pp. 300-304, 1960.
[20] R. Renesse and F. Schneider, "Chain Replication for Supporting High Throughput and
Availability," in USENIX-OSDI, 2004.
[21] J. Terrace and M. Freedman, "Object Storage on CRAQ: High-throughput chain replication
for read-mostly workloads," in USENIX'09, 2009.
[22] D. Terry, K. Petersen, M. Theimer, A. Demers, M. Spreitzer, and C. Hauser, "Managing
Update Conflicts in Bayou, A Weakly Connected Replicated Storage System," in ACM
SOSP, 1995.
[23] W. Vogels, "Choosing Consistency," allthingsdistributed.com, 2010.
An Empirical Study on Configuration Errors in Commercial and Open Source Systems
Zuoning Yin∗ , Xiao Ma∗ , Jing Zheng† , Yuanyuan Zhou† ,
Lakshmi N. Bairavasundaram‡ , and Shankar Pasupathy‡
∗Univ. of Illinois at Urbana-Champaign, †Univ. of California, San Diego, ‡NetApp, Inc.
Configuration errors (i.e., misconfigurations) are among the dominant causes of system
failures. Their importance has inspired many research efforts on detecting, diagnosing,
and fixing misconfigurations; such research would benefit greatly from a real-world characteristic study on misconfigurations. Unfortunately, few such studies have been conducted
in the past, primarily because historical misconfigurations usually have not been recorded
rigorously in databases.
In this work, we undertake one of the first attempts to conduct a real-world misconfiguration characteristic study. We study a total of 546 real-world misconfigurations, including
309 misconfigurations from a commercial storage system deployed at thousands of customers, and 237 from four widely used open source systems (CentOS, MySQL, Apache
HTTP Server, and OpenLDAP). Some of our major findings include: (1) A majority of
misconfigurations (70.0%∼85.5%) are due to mistakes in setting configuration parameters; however, a significant number of misconfigurations are due to compatibility issues
or component configurations (i.e., not parameter-related). (2) 38.1%∼53.7% of parameter
mistakes are caused by illegal parameters that clearly violate some format or rules, motivating the use of an automatic configuration checker to detect these misconfigurations.
(3) A significant percentage (12.2%∼29.7%) of parameter-based mistakes are due to inconsistencies between different parameter values. (4) 21.7%∼57.3% of the misconfigurations
involve configurations external to the examined system, some even on entirely different
hosts. (5) A significant portion of misconfigurations can cause hard-to-diagnose failures,
such as crashes, hangs, or severe performance degradation, indicating that systems should
be better-equipped to handle misconfigurations.
Categories and Subject Descriptors: D.4.5 [Operating Systems]: Reliability
General Terms: Reliability, Management
Keywords: Misconfigurations, characteristic study
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
SOSP '11, October 23-26, 2011, Cascais, Portugal.
Copyright © 2011 ACM 978-1-4503-0977-6/11/10 ... $10.00.
1.1 Motivation
Configuration errors (i.e., misconfigurations) have a great impact on system availability.
For example, a recent misconfiguration at Facebook prevented its 500 million users from
accessing the website for several hours [15]. Last year, a misconfiguration brought down
the entire “.se” domain for more than an hour [6], affecting almost 1 million hosts.
Not only do misconfigurations have high impact, they are also prevalent. Gray’s pioneering paper on system faults [11] stated that administrator errors were responsible for 42%
of system failures in high-end mainframes. Similarly, Patterson et al. [30] observed that
more than 50% of failures were due to operator errors in telephone networks and Internet
systems. Studies have also observed that a majority of operator errors (or administrator errors) are misconfigurations [23, 29]. Further, of the issues reported in COMP-A’s1
customer-support database (used in this study), around 27% are labeled as configuration-related (as shown later in Figure 1(a) in Section 3). This percentage is second only to
hardware failures and is much bigger than that of software bugs.
Moreover, configuration errors are also expensive to troubleshoot. Kapoor [16] found that
17% of the total cost of ownership of today’s desktop computers goes toward technical
support, and a large fraction of that is troubleshooting misconfigurations.
Given the data on the prevalence and impact of misconfigurations, several recent research
efforts [3, 17, 18, 35, 38, 41] have proposed ideas to detect, diagnose, and automatically fix
misconfigurations. For example, PeerPressure [38] uses statistics methods on a large set of
configurations to identify single configuration parameter errors. Chronus [41] periodically
checkpoints disk state and automatically searches for configuration changes that may have
caused the misconfiguration. ConfAid [3] uses data flow analysis to trace the configuration
error back to a particular configuration entry. AutoBash [35] leverages a speculative OS
kernel to automatically try out fixes from a solution database in order to find a proper solution for a configuration problem. Further, ConfErr [17] provides a useful framework with
which users can inject configuration errors of three types: typos, structural mistakes, and
semantic mistakes. In addition to research efforts, various tools are available to aid users in
managing system configuration; for example, storage systems have provisioning tools [13,
14, 25, 26], misconfiguration-detection tools [24], and upgrade assistants that check for
compatibility issues [24]. The above research directions and tools would benefit greatly
from a characteristic study of real-world misconfigurations. Moreover, understanding the
major types and root causes of misconfigurations may help guide developers to better
design configuration logic and requirements, and testers to better verify user interfaces,
thereby reducing the likelihood of configuration mistakes by users.
Unfortunately, in comparison to software bugs that have well-maintained bug databases
and have benefited from many software bug characteristic studies [5, 19, 36, 37], a misconfiguration characteristic study is much harder, mainly because historical misconfigurations
usually have not been recorded rigorously in databases. For example, developers record
information about the context in the code for bugs, the causes of bugs, and how they were
fixed; they also focus on eliminating or coalescing duplicate bug reports. On the other
hand, the description of misconfigurations is user-driven, the fixes may be recorded simply
as pointers to manuals and best-practice documents, and there is no duplicate elimination.
As a result, analyzing and understanding misconfigurations is a much harder, and more
importantly, manual task.
We are required to keep the company anonymous.
1.2 Our Contributions
In this paper, we perform one of the first characteristic studies of misconfigurations in commercial and open-source systems, using a total of 546 misconfiguration cases. The commercial system is a storage system from COMP-A deployed at thousands of customers. It has a well-maintained customer-issues database. The open-source systems include widely used system software: CentOS, MySQL, Apache, and OpenLDAP. The misconfiguration issues we examine are primarily user-reported. Therefore, our study is a manual analysis of user descriptions of misconfigurations, aided by discussions with developers, support engineers, and system architects of these systems to ensure correct understanding of these cases. Our study took approximately 21 person-months of effort, excluding the help from several COMP-A engineers and open-source developers.
We study the types, patterns, causes, system reactions, and impact of misconfigurations:
• We examine the prevalence and reported severity of configuration issues (including,
but not limited to, misconfigurations) as compared to other support issues in COMP-A's customer-issues database.
• We develop a simple taxonomy of misconfiguration types: parameter, compatibility,
and component, and identify the prevalence of each type. Given the prevalence
of parameter-based misconfigurations, we further analyze their types and observable patterns.
• We identify how systems react to misconfigurations: whether error messages are provided, whether systems experience failures or severe performance issues, etc. Given
that error messages are important for diagnosis and fixes, we also investigate the
relationship between message clarity and diagnosis time.
• We study the frequency of different causes of misconfigurations such as first-time use,
software upgrades, hardware changes, etc.
• Finally, we examine the impact of misconfigurations, including the impact on system
availability and performance.
The major findings of the study are summarized in Table 1. While we believe that the
misconfiguration cases we examined are fairly representative of misconfigurations in large
system software, we do not intend to draw any general conclusions about all applications.
In particular, we remind readers that all of the characteristics and findings in this study
should be taken with the specific system types and our methodology in mind (discussed
in Section 2).
We will release our open-source misconfiguration cases to share with the research community.
2 Methodology
This section describes our methodology for analyzing misconfigurations. There are unique
challenges in obtaining and analyzing a large set of real-world misconfigurations. Historically, unlike bugs, which usually have repositories such as Bugzilla, misconfigurations have not been
recorded rigorously. Much of the information is in the form of unstructured textual descriptions, and there is no systematic way to report misconfiguration cases. Therefore, to
overcome these challenges, we manually analyzed reported misconfiguration cases
by studying manuals, instructions, source code, and knowledge bases of each system. For
some hard cases, we contacted the corresponding engineers through emails or phone calls
to understand them thoroughly.
Major Findings on Prevalence and Severity of Configuration Issues (Section 3)
Similar to results from previous studies [11, 29, 30], data from COMP-A shows that a significant
portion (27%) of customer cases are related to configuration issues.
Configuration issues cause the largest percentage (31%) of high-severity support requests.
Major Findings on Misconfiguration Types (Section 4)
Configuration-parameter mistakes account for the majority (70.0%∼85.5%) of the examined
misconfigurations.
However, a significant portion (14.5%∼30.0%) of the examined misconfigurations are caused
by software compatibility issues and component configuration, which are not well addressed
in the literature.
38.1%∼53.7% of parameter misconfigurations are caused by illegal parameters that violate
formats or semantic rules defined by the system, and can be potentially detected by checkers
that inspect against these rules.
A significant portion (12.2%∼29.7%) of parameter mistakes are due to value-based inconsistency, calling for an inconsistency checker or a better configuration design that does not require
users to worry about such error-prone consistency constraints.
Although most misconfigurations are located within each examined system, still a significant
portion (21.7%∼57.3%) involve configurations beyond the system itself or span over multiple
hosts.
Major Findings on System Reactions to Misconfigurations (Section 5)
Only 7.2%∼15.5% of the studied misconfiguration problems provide explicit messages that
pinpoint the configuration error.
Some misconfigurations have caused the systems to crash, hang or have severe performance
degradation, making failure diagnosis a challenging task.
Messages that pinpoint configuration errors can shorten the diagnosis time by 3 to 13 times as
compared to the cases with ambiguous messages or by 1.2 to 14.5 times as compared to cases
with no messages.
Major Findings on Causes of Misconfigurations (Section 6)
The majority of misconfigurations are related to first-time use of desired functionality. For more
complex systems, a significant percentage (16.7%∼32.4%) of misconfigurations were introduced
into systems that used to work.
By looking into the 100 used-to-work cases (32.4% of the total) at COMP-A, 46% of them
are attributed to configuration parameter changes due to routine maintenance, configuring for
new functionality, system outages, etc., and can benefit from tracking configuration changes.
The remainder are caused by non-parameter-related issues such as hardware changes (18%),
external environmental changes (8%), resource exhaustion (14%), and software upgrades (14%).
Major Findings on Impact of Misconfigurations (Section 7)
Although most studied misconfiguration cases only lead to partial unavailability of the system,
16.1%∼47.3% of them make the systems fully unavailable or cause severe performance
degradation.
Table 1: Major findings on misconfiguration characteristics. Please take our methodology into consideration when interpreting these findings and drawing any conclusions.
2.1 Data Sets
We examine misconfiguration data for one commercial system and four open-source systems. The commercial system is a storage system from COMP-A. The core software
running in such systems is proprietary to COMP-A. The four open-source systems include
CentOS, MySQL, Apache HTTP server, and OpenLDAP. We select these software systems
for two reasons: (1) they are mature and widely used, and (2) they have a large set of
misconfiguration cases reported by users. While we cannot draw conclusions about any
general system, our examined systems are representative of large, server-based systems.
We focus only on software misconfigurations; we do not have sufficient data for hardware
misconfigurations on systems running the open-source software.
COMP-A storage systems consist of multiple components including storage controllers,
disk shelves, and interconnections between them (e.g., switches). These systems can be
configured in a variety of ways for customers with different degrees of expertise. For
instance, COMP-A offers tools that simplify system configuration. We cannot ascertain
from the data whether users configured the systems directly or used tools for configuration.

Table 2: The systems we studied and the number of misconfiguration cases we
identified for each of them (rows: Total Cases, Sampled Cases, Used Cases).
The misconfiguration cases we study are from COMP-A’s customer-issues database, which
records problems reported by customers. For accuracy, we considered only closed cases,
i.e., cases for which COMP-A has provided a solution to the user. Also, to be as relevant as
possible, we focused only on cases from the last two years. COMP-A's support process is
rigorous, especially in comparison to open-source projects. For example, when a customer
case is closed, the support engineer needs to record information about the root cause as
well as resolution. Such information is very valuable for our study. There are many cases
labeled as “Configuration-related” by support engineers and it is prohibitively difficult
to study all of them. Therefore, we randomly sampled 1,000 cases labeled as related to
configuration. Not all 1,000 cases are misconfigurations because more than half of them are
simply customer questions related to how the system should be configured. Hence, we did
not consider them as misconfigurations. We also pruned out a few cases for which we could not
determine whether a configuration error occurred. After careful manual examination, we
identified 309 cases as misconfigurations, as shown in Table 2.
Besides COMP-A storage systems, we also study four open-source systems: CentOS,
MySQL, Apache HTTP server, and OpenLDAP. All of them are mature software systems, well-maintained and widely used. CentOS is an enterprise-class Linux distribution,
MySQL is a database server, Apache is a web server, and OpenLDAP is a directory server.
For open-source software, the misconfiguration cases come from three sources: official user-support forums, mailing lists, and a large question-answering website
focusing on system administration. Whenever necessary, scripts were used to identify
cases related to systems of interest, as well as to remove those that were not confirmed by
users. We then randomly sampled from all the remaining candidate cases (the candidate
set sizes and the sample set sizes are also shown in Table 2) and manually examined
each case to check if it is a misconfiguration. Our manual examination yielded a total
of 237 misconfiguration cases from these four open-source systems. The yield ratio (used
cases/sampled cases) is low for these open-source projects because we observe a higher
ratio of cases that are customer questions among the samples from open source projects
as compared to the commercial data.
2.2 Threats to Validity and Limitations
Many characteristic studies suffer from limitations such as the systems or workloads not being representative of the entire population, the semantics of events such as failures differing
across different systems, and so on. Given that misconfiguration cases have considerably
less information than ideal to work with, and that we need to perform all of the analysis
manually, our study has a few more limitations. We believe that these limitations do not
invalidate our results; at the same time, we urge the reader to focus on overall trends and
not on precise numbers. We expect that most systems and processes for configuration
errors would have similar limitations to the ones we face. Therefore, we hope that the
limitations of our methodology would inspire techniques and processes that can be used to
record misconfigurations more rigorously and in a format amenable to automated analysis.
Sampling: To make the time and effort manageable, we sampled the data sets. As shown
in Table 2, our sample rates are statistically significant and our collections are also large
enough to be statistically meaningful [10]. In our result tables, we also show the confidence
interval on ratios with a 95% confidence level based on our sampling rates.
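The interval computation can be sketched as follows; the helper below is illustrative rather than the actual script used in our study. It computes the half-width of a 95% confidence interval (z = 1.96) for a sampled proportion, with an optional finite-population correction when the size of the candidate pool is known.

```python
import math

# Hypothetical helper (not part of the study's tooling): half-width of a 95%
# confidence interval for a proportion p estimated from n random samples,
# with a finite-population correction when the candidate pool size N is known.
def margin_of_error(p, n, N=None, z=1.96):
    se = math.sqrt(p * (1.0 - p) / n)          # standard error of the proportion
    if N is not None and N > n:
        se *= math.sqrt((N - n) / (N - 1.0))   # finite-population correction
    return z * se

# e.g., a ratio of 0.30 observed in 1,000 sampled cases:
print(round(margin_of_error(0.30, 1000), 3))   # prints 0.028
```

As expected, the interval tightens both with more samples and when the sample covers a large fraction of the candidate pool.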
Users: The sources from which we sample contain only user-reported cases. Users may
choose not to report trivial misconfigurations. Also, novice users are likely to report more
misconfiguration problems than experts. We do not have sufficient data to judge whether
a user is a novice or an expert. However, with new systems or major revisions of existing
systems deployed to the field, there will always be new users. Therefore, our findings
remain valid.
User environment: Some misconfigurations may have been prevented, or detected and
resolved automatically by the system or other tools. This scenario is particularly true for
COMP-A systems. At the same time, some, but not all, COMP-A customers use the tools
provided by COMP-A and we cannot distinguish the two in the data.
System versions: We do not differentiate between system versions. Given that software
is constantly evolving, it is possible that some of the reported configuration issues may
not apply to some versions, or have already been addressed in system development (e.g.,
automatically correcting configuration mistakes, providing better error messages, etc.).
Overall, our study is representative of user-reported misconfigurations that are more challenging, urgent, or important.
3 Prevalence and Severity of Configuration Issues
We first examine how prevalent configuration issues are in the field and how severely they
impact users, using data from the last two years of COMP-A's customer-issues database.
COMP-A engineers classify each customer-reported problem, after resolving it, into one of
five root causes: configuration (configuration-related), hardware failure, bug, customer
environment (cases caused by power supplies, cooling systems, or other environmental
issues), and user knowledge (cases where customers request information about the system).
Each case is also labeled with a severity level by customer-support engineers – from “1” to
“4,” based on how severe the problem is in the field; cases with severity level of “1” or “2”
are usually considered as high-severity cases that require prompt responses.
Figure 1: Root cause distribution among the customer problems reported to COMP-A.
(a) Categorization of problem causes on all the cases; (b) categorization of problem
causes on cases with high severity.
Figure 1(a) shows the distribution of customer cases based on different root causes. Figure 1(b) further shows the distribution of high-severity cases. We do not have the results
for the open source systems due to unavailability of such labeled data (i.e., customer issues
caused by hardware, software bugs, configurations, etc.).
Among all five categories, configuration-related issues contribute to 27% of the cases and
are the second-most pervasive root cause of customer problems. While this number is
potentially inflated by customer requests for information on configuration (as seen in our
manual analysis), it shows that system configuration is nontrivial and of particular concern
for customers. Furthermore, considering only high-severity cases, configuration-related
issues become the most significant contributor to support cases; they contribute to 31% of
high-severity cases. We expect that hardware issues are not as severe (smaller percentage
of high-severity cases than of all cases) due to availability of redundancy and ease of fixes
– the hardware can be replaced easily.
Finding 1.1: Similar to the results from previous studies [11, 30, 29], data from COMP-A shows that a significant percentage (27%) of customer cases are related to configuration issues.
Finding 1.2: Configuration issues cause the largest percentage (31%) of high-severity
support requests.
4 Misconfiguration Types
4.1 Distribution among Different Types
                COMP-A           CentOS          MySQL           Apache          OpenLDAP
Parameter       246 (79.6±2.4%)  42 (70.0±3.7%)  47 (85.5±2.3%)  50 (83.4±2.8%)  49 (79.0±3.0%)
Compatibility    31 (10.0±1.8%)  11 (18.3±3.1%)   0 (0.0%)        5 (8.3±2.1%)    7 (11.2±2.3%)
Component        32 (10.4±1.8%)   7 (11.7±2.6%)   8 (14.5±2.3%)   5 (8.3±2.1%)    6 (9.7±2.2%)
Table 3: The numbers of misconfigurations of each type. Their percentages and the
sampling errors are also shown.
To examine misconfigurations in detail, we first look at the different types of misconfigurations that occur in the real world and their distributions. We classify the examined
misconfiguration cases into three categories (as shown in Table 3). Parameter refers to
configuration parameter mistakes; a parameter could be either an entry in a configuration
file or a console command for configuring certain functionality. Compatibility refers to
misconfigurations related to software compatibility (i.e., whether different components or
modules are compatible with each other). Component refers to the remaining software
misconfigurations (e.g., a module is missing).
Finding 2.1: Configuration parameter mistakes account for the majority (70.0%∼85.5%)
of the examined misconfigurations.
Finding 2.2: However, a significant portion (14.5%∼30.0%) of the examined misconfigurations are caused by software compatibility and component configuration, which are not
well addressed in literature.
First, Finding 2.1 supports recent research efforts [3, 35, 38, 41] on detecting, diagnosing,
or fixing parameter-based misconfigurations. Second, this finding perhaps indicates that
system designers should expose fewer "knobs" (i.e., parameters) for users to configure and
tune. Whenever possible, auto-configuration [44] should be preferred because in many
cases users may not be experienced enough to set the knobs appropriately.
While parameter-based misconfigurations are the most common, Finding 2.2 calls for attention to solutions that deal with non-parameter-based configuration issues such as
software incompatibility. For example, software may need to be shipped as a complete
package, deployed as an appliance (either virtual or physical), or delivered as a service
(SaaS) to reduce these incompatibilities and general configuration issues.
Table 4: The distribution of different types of parameter mistakes for each application
(legal mistakes vs. illegal mistakes; illegal mistakes subdivide into format errors,
lexical and syntax, and value errors, value inconsistent with other values and value
inconsistent with the environment).
Figure 2: Examples of different types of configuration parameter related mistakes
(legal vs. illegal, lexical error, syntax error and inconsistency error).

(a) Illegal 1 – Format – Lexical (from COMP-A): "InitiatorName: iqn:DEV_domain".
Description: for COMP-A's iSCSI device, the name of the initiator (InitiatorName) can
only contain lowercase letters, while the user set the name with some capital letters ("DEV").
Impact: a storage share cannot be recognized.

(b) Illegal 2 – Format – Syntax (from Apache with PHP): one "extension =" entry must
be put before another "extension =" entry.
Description: when using PHP in Apache, one extension depends on another, so the order
between the entries matters. The user configured the order in a wrong way.
Impact: Apache cannot start due to a seg fault.

(c) Illegal 3 – Format – Syntax (from OpenLDAP): "include schema/ppolicy.schema"
must precede "overlay ppolicy", but this entry is missing.
Description: to use the password policy (i.e., ppolicy) overlay, the user needs to first
include the related schema in the configuration file. But the user did not do that.
Impact: the LDAP server fails to work.

(d) Illegal 4 – Value – Env Inconsistency (from COMP-A): "192.168.x.x system-e0",
but there is no interface named "system-e0".
Description: in the hosts file of COMP-A's system, the mapping from IP address to
interface name needs to be specified. However, the user mapped the IP 192.168.x.x to
a non-existent interface system-e0.
Impact: the host cannot be accessed.

(e) Illegal 5 – Value – Env Inconsistency (from MySQL): "datadir = /some/old/path",
but the path does not contain the data files any more.
Description: the parameter datadir specifies the directory that stores the data files.
After the data files were moved to another directory during migration, the user did not
update datadir to the new directory.
Impact: MySQL cannot start.

(f) Illegal 6 – Value – Value Inconsistency (from MySQL): "log=" contradicts
"log_output=".
Description: the parameter log_output controls how the log is stored (in a file or a
database table). The user wanted to store the log in the file query.log, but log_output
was incorrectly set to store the log in a database table.
Impact: the log is written to a table rather than a file.

(g) Illegal 7 – Value – Value Inconsistency (from MySQL with PHP): mysql's config has
"max_connections = 300" while php's config has "mysql.max_persistent = 400"; the max
allowed persistent connections specified in PHP is larger than the max connections
specified in MySQL.
Description: when using persistent connections, the mysql.max_persistent in PHP should
be no larger than the max_connections in MySQL. The user did not conform to this
constraint.
Impact: "too many connections" errors are generated.

(h) Illegal 8 – Value – Value Inconsistency (from Apache): "NameVirtualHost *:80"
with "<VirtualHost *>"; "*.80" does not match the "*" in <VirtualHost ...>.
Description: when setting a name-based virtual host, the parameter VirtualHost should
be set to the same host as NameVirtualHost. However, the user set NameVirtualHost to
be *.80 while setting VirtualHost to be *.
Impact: Apache loads virtual hosts in a wrong order.

(i) Legal 1 (from MySQL): "AutoCommit = True".
Description: the parameter AutoCommit controls whether data is flushed to disk
automatically after every insert. Either True or False is a legal value. However, the
user was experiencing an insert-intensive workload, so setting the value to True hurts
performance dramatically. When the user set this parameter to True, she was not aware
of the performance impact.
Impact: the performance of MySQL is very bad.
4.2 Parameter Misconfigurations
Given the prevalence of parameter-based mistakes, we study the different types of such
mistakes (as shown in Table 4), the number of parameters needed for diagnosing or fixing
a parameter misconfiguration, and the problem domain of these mistakes.
Types of mistakes in parameter configuration. First, we look at parameter mistakes
that clearly violate some implicit or explicit configuration rules related to format, syntax, or semantics. We call them illegal misconfigurations because they are unacceptable
to the examined system. Figures 2(a)∼(h) show eight such examples. These types of
misconfigurations may be detected automatically by checking against configuration rules.
In contrast, some other parameter mistakes are perfectly legal, but they are incorrect
simply because they do not deliver the functionality or performance desired by users,
like the example in Figure 2(i). These kinds of mistakes are difficult to detect unless
users’ expectation and intent can be specified separately and checked against configuration
settings. More user training may reduce these kinds of mistakes, as can simplified system
configuration logic, especially for things that can be auto-configured by the system.
Finding 3.1: 38.1%∼53.7% of parameter misconfigurations are caused by illegal parameters that clearly violate some format or semantic rules defined by the system, and can be
potentially detected by checkers that inspect against these rules.
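As a sketch of what such a rule checker might look like, consider the fragment below. The rule table is illustrative, modeled on the lowercase-only InitiatorName constraint of Figure 2(a); it is not an actual COMP-A or MySQL rule set, and a real checker would load its rules from each system's documented constraints.

```python
import re

# Illustrative format rules (hypothetical, not a real rule set): a checker
# rejects values that violate the lexical grammar of a single parameter.
RULES = {
    "InitiatorName": re.compile(r"^iqn[:.][a-z0-9._:-]+$"),  # lowercase only
    "max_connections": re.compile(r"^[0-9]+$"),              # integer value
}

def check_format(params):
    """Return the (name, value) pairs that violate a known format rule."""
    violations = []
    for name, value in params.items():
        rule = RULES.get(name)
        if rule is not None and not rule.match(value):
            violations.append((name, value))
    return violations

# The mistake from Figure 2(a): capital letters in an iSCSI initiator name.
print(check_format({"InitiatorName": "iqn:DEV_domain"}))
```

A checker of this shape can only catch illegal misconfigurations; legal-but-wrong settings, discussed next, are outside its reach.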
Finding 3.2: However, a large portion (46.3% ∼61.9%) of the parameter misconfigurations have perfectly legal parameters but do not deliver the functionality intended by users.
These cases are more difficult to detect by automatic checkers and may require more user
training or better configuration design.
We subcategorize illegal parameter misconfigurations into illegal format, in which some
parameters do not obey format rules such as lower case, field separators, etc.; and illegal
value, in which the parameter format is correct but the value violates some constraint,
e.g., the value of a parameter should be smaller than some threshold. We find that illegal-value misconfigurations are more common than illegal-format misconfigurations in most
systems, perhaps because format is easier to test against and thereby avoid.
Illegal format misconfigurations include both lexical and syntax mistakes. Similar to lexical
and syntax errors in programming languages, a lexical mistake violates the grammar of a single
parameter, like the example shown in Figure 2(a); a syntax mistake violates structural or
ordering constraints of the format, like the examples shown in Figures 2(b) and 2(c). As shown
in Table 4, up to 14.3% of the parameter misconfigurations are lexical mistakes, and up to
22.4% are syntax mistakes.
Illegal value misconfigurations mainly consist of two types of mistakes, "value inconsistency"
and "environment inconsistency". Value inconsistency means that a parameter setting violates some relationship constraint with other parameters, while environment
inconsistency means that a parameter's setting is inconsistent with the system environment (i.e., the physical configuration). Figures 2(d) and 2(e) are two environment-inconsistency
examples. As shown in Table 4, value inconsistency accounts for 12.2%∼29.7% of the
parameter misconfigurations, while environment inconsistency contributes 2.0%∼17.0%.
Both can be detected by well-designed checkers as long as the constraints are known
and enforceable.
Figures 2(f), 2(g), and 2(h) present three value-inconsistency examples. In the first example, the name of the log file is specified while the log output is directed to a database
table. In the second example, two parameters from two different but related configuration
files contradict each other. In the third example, two parameters, NameVirtualHost and
VirtualHost, have unmatched values ("*.80" vs. "*").
Finding 4: A significant portion (12.2%∼29.7%) of parameter mistakes are due to valuebased inconsistency, calling for an inconsistency checker or a better configuration design
that does not require users to worry about such error-prone consistency constraints.
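Such an inconsistency checker can be sketched minimally as below, using the PHP/MySQL persistent-connection constraint from Figure 2 as its one rule. The constraint catalog is illustrative; a real tool would need a curated catalog of such cross-file rules and a parser for each configuration format, and we assume here that both files have already been parsed into a single dictionary.

```python
# Illustrative catalog of cross-parameter constraints (hypothetical; a real
# checker would ship with many such rules). Each entry pairs a human-readable
# description with a predicate over the merged configuration dictionary.
CONSTRAINTS = [
    ("mysql.max_persistent <= max_connections",
     lambda cfg: int(cfg["mysql.max_persistent"]) <= int(cfg["max_connections"])),
]

def check_consistency(cfg):
    """Return descriptions of the violated cross-parameter constraints."""
    return [desc for desc, ok in CONSTRAINTS if not ok(cfg)]

# The misconfiguration from Figure 2: 400 persistent connections allowed in
# PHP against a MySQL limit of 300.
cfg = {"max_connections": "300", "mysql.max_persistent": "400"}
print(check_consistency(cfg))  # reports the violated constraint
```

The hard part in practice is not the checking but obtaining and maintaining the constraint catalog, which is exactly why configuration designs that avoid such cross-parameter couplings are preferable.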
Number of erroneous parameters. As some previous work on detecting or diagnosing
misconfigurations focuses only on single-parameter mistakes, we examine what percentage
of parameter mistakes involves only a single parameter.
Table 5: The number of parameters in the configuration parameter mistakes (number
of involved parameters and number of fixed parameters).
Table 5 shows the number of parameters involved in each configuration mistake as well as the number
of parameters that were changed to fix the misconfiguration. These numbers may not be
the same because a mistake may involve two parameters, but can be fixed by changing
only one parameter. Our analysis indicates that about 23.4%∼61.2% of the parameter
mistakes involve multiple parameters. Examples of cases where multiple parameters are
involved are cases with value inconsistencies (see above).
In comparison, about 14.9%∼34.7% of the examined misconfigurations require fixing multiple parameters. For example, the performance of a system could be influenced by several
parameters. To achieve the expected level of performance, all these parameters need to be
considered and set correctly.
Finding 5.1: The majority (36.7%∼74.5%) of parameter mistakes can be diagnosed by
considering only one parameter, and an even higher percentage (59.2%∼83.0%) of them
can be fixed by changing the value of only one parameter.
Finding 5.2: However, a significant portion (23.4%∼61.2%) of parameter mistakes involve more than one parameter, and 14.9%∼34.7% require fixing more than one parameter.
Problem domains of parameter mistakes. We also study the problem domain each
parameter mistake falls under. We determine the domain based on the functionality of the
involved parameter. Four major problem domains are observed: network, permission/privilege,
performance, and devices. Overall, 18.3% of the examined parameter mistakes relate
to how the network is configured; 16.8% relate to permission/privilege; 7.1% relate to
performance adjustment. For the COMP-A systems and CentOS (the OSes), 8.5%∼26.2% of
the examined parameter mistakes concern device configuration.
4.3 Software Incompatibility
Besides parameter-related mistakes, software incompatibility is another major cause of
misconfigurations (up to 18.3%, see Table 3). Software-incompatibility issues refer to
improper combinations of components or their versions. They could be caused by incompatible libraries, applications, or even operating system kernels.
One may think that system upgrades are more likely to cause software-incompatibility
issues, but we find that only 18.5% of the software-incompatibility issues are caused by
upgrades. One possible reason is that both developers and users already put significant
effort into the process of upgrades. For example, COMP-A provides a tool to help with
upgrades that creates an easy-to-understand report of all known compatibility issues, and
recommends ways to resolve them.
Some of the misconfiguration cases we analyzed show that package-management systems
(e.g., RPM [34] and Debian dpkg [8]) can help address many software-incompatibility
issues. For example, in one of the studied cases, the user failed to install the mod_proxy_html
module because the existing libxml2 library was not compatible with this module.
Package-management systems may work well for systems with a standard set of packages.
For systems that require multiple applications from different vendors to work together,
it is more challenging. An alternative to package-management systems is self-contained
packaging, i.e., integrating dependent components into one installation package
and minimizing the requirements on the target system. To further reduce dependencies,
one could deliver a system as a virtual machine image (e.g., an Amazon Machine Image) or
as an appliance (e.g., COMP-A's storage systems). The latter may even eliminate
hardware-compatibility issues.
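The kind of dependency check a package manager performs can be sketched minimally as follows, modeled loosely on the mod_proxy_html/libxml2 case above. The version numbers and the minimum-version constraint are illustrative stand-ins, not the actual requirements of these packages.

```python
# Hypothetical installed-package and requirement tables; version numbers are
# illustrative, not the real packages' actual constraints.
INSTALLED = {"libxml2": (2, 6, 16)}

REQUIRES = {
    "mod_proxy_html": {"libxml2": (2, 6, 27)},  # minimum required version
}

def missing_deps(package):
    """Return (dep, required, installed) tuples that block installation."""
    problems = []
    for dep, minimum in REQUIRES.get(package, {}).items():
        have = INSTALLED.get(dep)
        if have is None or have < minimum:   # tuple comparison = version order
            problems.append((dep, minimum, have))
    return problems

# The installed libxml2 is too old for the module, so installation is refused.
print(missing_deps("mod_proxy_html"))
```

Rejecting the installation up front, as package managers do, converts a latent incompatibility misconfiguration into an immediate, explicit error.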
4.4 Component Misconfiguration
Table 6: Subtypes of component misconfigurations (missing component, placement,
file format, insufficient resource, stale data) and the number of cases of each.
Component misconfigurations are configuration errors that are neither parameter mistakes
nor compatibility problems. They are more related to how the system is organized and how
resources are supplied. A sizable portion (8.3%∼14.5%) of our examined misconfigurations
are of this category. Here, we further classify them into the following five subtypes based on
root causes: (1) Missing component: certain components (modules or libraries) are missing;
(2) Placement: certain files or components are not in the place expected by the system; (3)
File format: the format of a certain file is not acceptable to the system. For example, an
Apache web server on a Linux host cannot load a configuration file because it is in the MS-DOS format with unrecognized newline characters. (4) Insufficient resource: the available
resources are not enough to support the system functionality (e.g., not enough disk space);
(5) Stale data: stale data in the system prevents the new configuration. Table 6 shows
the distribution of the subtypes of component misconfigurations. Missing components,
placement issues, and insufficient resources are equally prominent.
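The file-format subtype above can often be detected mechanically; the sketch below checks a file for MS-DOS (CRLF) line endings, the problem in the Apache example. It is an illustrative fragment, not a tool from the study.

```python
import os
import tempfile

def has_dos_line_endings(path):
    """Detect MS-DOS (CRLF) line endings in a config file."""
    with open(path, "rb") as f:   # binary mode, so newlines are not translated
        return b"\r\n" in f.read()

# Example: simulate a config file saved in MS-DOS format, then check it.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"ServerName example.com\r\n")
print(has_dos_line_endings(path))  # True
os.remove(path)
```

A system that ran such a check at load time could replace a quiet parse failure with a pinpoint error message.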
4.5 Mistake Location
Table 7: The location of errors. "Inside": inside the target application. "FS": in the file
system. "OS-Module": in some OS module like SELinux. "Network": in network
settings. "Other App": in other applications. "Environment": other environment such as
the DNS service.
Table 7 shows the distribution of configuration error locations. Naturally, most misconfigurations are contained in the target application itself. However, many misconfigurations
also extend beyond the application. Administrators also need to consider other
parts of the system, including file-system permissions/capacities, operating-system modules, other applications running in the system, network configuration, etc. So looking at
only the application itself is not enough to diagnose and fix many configuration errors.
Finding 6: Although most misconfigurations are located within each examined application, still a significant portion (21.7%∼57.3%) of cases involve configurations beyond the
application itself or span across multiple hosts.
5 System Reactions to Misconfigurations
In this section, we examine system reactions to misconfigurations, focusing on whether the
system detects the misconfiguration and on the error messages issued by the system.
5.1 Do Systems Detect and Report Configuration Errors?
Proactive detection and informative reporting can help diagnose misconfigurations more
easily. Therefore, we wish to understand whether systems detect and report configuration
errors. We divide the examined cases into three categories based on how well the system
handles configuration errors (Table 8). Cases where the systems and associated tools detect, report, recover from (or help the user correct) misconfigurations may not be reported
by users. Therefore, the results in this section may be especially skewed by the available
data. Nevertheless, there are interesting findings that arise from this analysis.
Table 8: How do systems react to misconfigurations? Table (a) presents the number
of cases in each category of system reaction. Table (b) presents the number of cases
that cause mysterious crashes, hangs, etc. but do not provide any messages.
We classify system reactions into pinpoint reaction, indeterminate reaction, and quiet failure.
A pinpoint reaction is one of the best system reactions to misconfigurations: the system
not only detects a configuration error but also pinpoints the exact root cause in the error
message (see the COMP-A example in Figure 3). As shown in Table 8 (a), more than
85% of the cases do not belong to this category, indicating that systems often do not react
to misconfigurations in a user-friendly way. As previously discussed, the study includes
only reported cases; some misconfigurations with good error messages may
have been solved by the users themselves and thus never reported, so in reality the
percentage of pinpoint reactions may be higher. However, considering that
the total number of misconfigurations in the sources we selected is very large, there are still
a significant number of misconfigurations that the examined systems do not pinpoint.
[COMP-A – dir.size.max:warning]:
Directory /vol/vol1/xxx/data/ reached
the maxdirsize Limit. Reduce the number
of files or use the vol options command
to increase this limit
Figure 3: A misconfiguration case where the error message pinpoints the root cause
and tells the user how to fix it.
An indeterminate reaction is one in which the system provides some information about
the failure symptoms (i.e., the manifestation of the misconfiguration) but does not pinpoint
the root cause or guide the user on how to fix the problem. 45.2%∼55.0% of our studied
cases belong to this category.
A quiet failure refers to cases where the system does not function properly yet provides
no information about the failure or its root cause. 22.6%∼26.7% of
the cases belong to this category, and diagnosing them is very difficult.
Finding 7: Only 7.2%∼15.5% of the studied misconfiguration problems provide explicit
messages that pinpoint the configuration error.
Quiet failures can be even worse when the misconfiguration causes the system to misbehave
in a mysterious way (crash, hang, etc.) just like software bugs. We find that such behavior
occurred in 5%∼8% of the cases (Table 8 (b)).
Why would misconfigurations cause a system to crash or hang unexpectedly? The reason
is intuitive: configuration parameters are also a form of input, so if
a system does not check their validity and prepare for illegal configurations, they can
lead to system misbehavior. We describe two such scenarios below.
Crash example: A web application used both the mod_python and mod_wsgi modules in an
Apache httpd server. These two modules used two different versions of Python, which
caused segmentation faults when the web page was accessed.
Hang example: A server was configured to authenticate via LDAP with the hard bind
policy, which made it keep connecting to the LDAP server until it succeeded. However,
the LDAP server was not working, so the server hung when the user added new accounts.
Such misbehavior is very challenging to diagnose because users and support engineers may
suspect these unexpected failures were caused by a bug in the system rather than by
a configuration issue (of course, one may argue that, in a way, it can also be considered
a bug). If systems were built to perform more thorough configuration validity checking
and thereby avoid misconfiguration-induced misbehavior, both the cost of support and the diagnosis
time could be reduced.
Finding 8: Some misconfigurations have caused the systems to crash, hang, or have severe
performance degradation, making failure diagnosis a challenging task.
We further study whether there is a correlation between the type of misconfiguration and the
difficulty of reacting to it appropriately. We find that software-incompatibility issues are the hardest to react to: only 9.3% of all incompatibility issues receive a
pinpoint reaction, while the same ratio for parameter mistakes and component misconfigurations is 14.3% and 15.5%, respectively. This result is reasonable, since global knowledge
(e.g., the configuration of different applications) is often required to decide whether an
incompatibility exists.
5.2 System Reaction to Illegal Parameters
Cases with illegal configuration parameters (defined in Section 4.2) are usually easier to
check and pinpoint automatically. For example, Figure 4 shows a patch from MySQL
that prints a warning message when the user sets illegal (inconsistent) parameters.
+if (opt_logname && !(log_output_options & LOG_FILE)
+    && !(log_output_options & LOG_NONE))
+  sql_print_warning("Although a path was specified for "
+                    "the --log option, log tables are used. "
+                    "To enable logging to files use the "
+                    "--log-output option.");
Figure 4: A patch from MySQL that adds an explicit warning message when an
illegal configuration is detected. If the parameter log_output (value stored in variable
log_output_options) is set to neither “FILE” (i.e., output logs to files) nor “NONE”
(i.e., do not output logs), but the parameter log (value stored in variable opt_logname) is
specified with the name of a log file, a warning is issued because these two
parameters contradict each other.
Unfortunately, systems do not detect and pinpoint a majority of these configuration mistakes, as shown in Table 9.
Finding 9: Among the 220 cases with illegal parameters that could be easily detected and fixed,
only 4.3%∼26.9% provide explicit messages. Up to 31.3% provide no
message at all, unnecessarily complicating the diagnosis process.
Table 9: How do systems react to illegal parameters? The reaction category is the
same as in Table 8 (a).
5.3 Impact of Messages on Diagnosis Time
Do good error messages help engineers diagnose misconfiguration problems more efficiently? To answer this question, we calculate the diagnosis time, in hours, from the
time when a misconfiguration problem was posted to the time when the correct answer
was provided.
Table 10: The median of diagnosis time for cases with and without messages (time is
normalized for confidentiality reasons). Explicit message means that the error message
directly pinpoints the location of the misconfiguration. The median diagnosis time
of the cases with explicit messages is used as base. Ambiguous message means there
are messages, but they do not directly identify the misconfiguration. No message is
for cases where no messages are provided.
Table 10 shows that misconfiguration cases with explicit messages are diagnosed much
faster. Otherwise, engineers have to spend much more time on diagnosis: the median
diagnosis time is up to 14.5 times longer.
Finding 10: Messages that pinpoint configuration errors can shorten the diagnosis time
3 to 13 times as compared to the cases with ambiguous messages.