Master’s Thesis Nr. 84
Systems Group, Department of Computer Science, ETH Zurich
Investigating OS/DB co-design with SharedDB and Barrelfish
by
Zaheer Chothia
Supervised by
Prof. Dr. Timothy Roscoe
Dr. Kornilios Kourtis
Jana Gičeva
December 2012–June 2013
Abstract
Databases and operating systems both operate over the same hardware, but they have differing views of how its resources should be managed. As a result, databases forgo the system services and implement their own scheduling and memory management, which have been refined over several decades. Co-design is an area of research that addresses this by improving the collaboration between the two systems. The idea is to integrate application knowledge into the operating system’s decisions, and for the database to receive notifications and adapt to changes in system state.
This thesis presents work in this area which builds upon two research systems. One is the Barrelfish OS, a realisation of the multikernel model which treats the machine as a distributed system. The other is SharedDB, an in-memory relational database designed around sharing of computation, which delivers predictable performance on large and complex workloads. Our experimental results on Linux show that less than a quarter of the cores are needed to achieve peak performance when running a TPC-W workload. We are interested in resource consolidation and present an approach for utilising application knowledge in the spatial and temporal scheduling of operators. We ported SharedDB to run on top of Barrelfish and conclude with a discussion of why the combined system makes a good foundation for testing new ideas and optimisations which cut across several layers of the system stack.
Contents

1 Introduction
  1.1 Context
  1.2 Problem statement
  1.3 Outline

2 Background
  2.1 Barrelfish
  2.2 SharedDB

3 Porting SharedDB to Barrelfish
  3.1 Overview
  3.2 SharedDB
    3.2.1 Project structure
    3.2.2 Build system
    3.2.3 Static linking
    3.2.4 Modifications
  3.3 Barrelfish: Porting
    3.3.1 C++ support
    3.3.2 Synchronisation primitives
  3.4 Barrelfish: Diagnosing
    3.4.1 Library OS design
    3.4.2 Network stack
    3.4.3 Summary

4 Linux Experiments
  4.1 Introduction
  4.2 Methodology
    4.2.1 Workload
    4.2.2 Metrics
    4.2.3 Factors
    4.2.4 Environment
    4.2.5 Modifications
  4.3 Results
    4.3.1 Baseline
    4.3.2 NUMA awareness
    4.3.3 Deployment strategy
    4.3.4 Operator characterisation
    4.3.5 Partitioned tables
  4.4 Summary

5 Co-Design
  5.1 Operator consolidation
    5.1.1 Temporal scheduling
  5.2 Non-cache coherent architectures

6 Conclusion & Outlook
  6.1 Discussion
  6.2 Future work

A Appendix
  A.1 NUMA awareness
  A.2 Performance Counters

Bibliography
Chapter 1
Introduction
1.1 Context
In our day-to-day interactions we encounter numerous systems which
store and transform vast quantities of data. Such applications have
demanding requirements and need to support many end users with
predictable performance. Databases are therefore especially conscious of
their resource usage. The problem is that some operating system policies
are rigid and unsuited to the applications’ needs [34]. Consequently,
databases make limited use of the system services and instead allocate
and manage resources on their own. For instance, they request large
pools of memory from the system and sub-divide this internally. Another
example is that they know whether I/O access is sequential or random
and can implement custom buffer pool strategies and pre-fetching logic.
Database / operating system co-design is an area of research which aims
to address some of these needs by exploiting application knowledge and
improving the collaboration between both systems. One such project,
Infokernel [4], provides visibility into the kernel’s state and algorithms.
In this way, default policies are transformed to mechanisms which can be
controlled from user space. Another project, Cod [15], demonstrates the
value of interaction between a database storage engine and an operating
system’s policy engine. The OS can incorporate database cost models
when reasoning about scheduling and placement decisions. The database
also receives notifications on changes to the system state and can adapt
to meet SLA guarantees.
1.2 Problem statement
In this thesis we build upon two large systems and investigate synergies
of combining them in a single scenario. On one side we have an operating
system (Barrelfish [5]) which treats the machine as a distributed system;
on the other is a database (SharedDB [14]) intended for predictability
which leverages shared computation. Both projects are being developed
within the Systems Group at ETH Zürich and in collaboration with
external partners (Microsoft Research and Amadeus respectively). They
are part of a larger goal of rethinking the application stack to better
cope with increasingly challenging application requirements and modern
innovations in hardware (SwissBox [1]).
The two systems make a natural fit because they both share the common
design principle of minimising shared state and avoiding synchronisation,
allowing them to fully exploit multicore hardware. SharedDB presents
an interesting use case to port on top of Barrelfish because it is a large
application which spans the system. Furthermore, it has demanding performance requirements which stress several subsystems, from memory
management through networking.
The main achievement of this thesis is to make Barrelfish and SharedDB
work together. Whilst there were a number of challenges along the way,
this now opens new possibilities for future research. Beyond this, we conducted some experiments to understand the performance characteristics
of SharedDB. These results demonstrate resource consolidation which
still meets the application’s SLA. We present some early work in this
direction and discuss several avenues for future investigation.
1.3 Outline
Chapter 3 goes into more detail of the porting process and some of
the challenges we faced. In Chapter 4 we present experimental results
on specific aspects of SharedDB’s performance. Chapter 5 returns to
the topic of co-design and presents two interesting research questions.
Finally Chapter 6 concludes with a discussion of what we have learned
and provides outlook for further research.
Chapter 2
Background
In this section we introduce the key ideas and overall design of the
systems being used throughout this thesis. Barrelfish incorporates ideas
from distributed computing and applies these to operating system design.
SharedDB is a relational database which processes queries in batches and
is built around the concept of shared computation.
2.1 Barrelfish
Barrelfish is a research operating system designed to scale to modern
multicore architectures. It is a realisation of the multi-kernel model
[5] which treats the machine as a distributed system by placing an
independent node on each core and replicating state by explicitly passing
messages rather than sharing memory. This is based on the observation
that the underlying hardware resembles that of a network. Consider for
instance the cache coherence protocol which is realised as an exchange
of messages over a complex interconnect topology. Structuring the OS
in such a manner offers not only performance benefits but makes it
easier to reason about the system as a whole and facilitates adapting
to new hardware developments, for instance heterogeneous and non-cache-coherent machines.
This is in stark contrast with conventional monolithic operating systems
in which all services (e.g. drivers, file system) are implemented as part of
the kernel. These are built as a shared memory program which spans the
machine along with global data structures protected by coarse-grained
locks. Monolithic operating systems date from the single-processor era
and increasingly parallel hardware has revealed performance problems
due to contention. Overcoming this has involved considerable effort to
replace with more fine-grained sharing.
Another point in the design space is the microkernel [23], which provides a minimal set of mechanisms in the kernel, with traditional services running in user space to implement specific policies. The main OS abstractions in such a system are address spaces, threads and inter-process
communication. Exokernel [12] took this idea further by separating
resource protection from management. The kernel exposes just a thin
interface over the hardware and the conventional abstractions are implemented in an untrusted library OS. More recently there have also
been projects such as Tornado, K42 [13, 18] and Corey [6] which target
multicore hardware. They demonstrate the benefits of avoiding sharing
across cores through careful design of data structures.
Figure 2.1: Barrelfish OS Structure. On each core a CPU driver runs in kernel space, with a monitor and dispatchers above it in user space; applications link against libbarrelfish and span dispatchers. Cores of potentially different architectures (x86, x64, ARM, GPU) communicate via URPC, inter-processor interrupts and the hardware’s cache-coherence protocol and interrupts.
Figure 2.1 depicts the key components and the overall structure of the
multi-kernel design as implemented in Barrelfish:
• CPU driver: This implements a limited subset of the features in
a traditional kernel, namely performing privileged operations on
behalf of applications and interacting with hardware (e.g. MMU,
APIC, timers). Another task that it is responsible for is multiplexing
the processor among dispatchers, which is done with an RBED
scheduler [7] that integrates real-time and best-effort tasks. The
CPU driver is written to run on a single core and is a small component which is specialised to a particular architecture and can be
easily adapted.
• Monitor: CPU drivers operate in isolation on a single core; the monitor
handles cross-core operations and replication of state, for example
coordination of virtual memory mappings. This runs in user space
but is a privileged process which has access to data structures such
as the capabilities database. In contrast with the CPU driver most of
the monitor code is not hardware specific and thus portable across
machines.
• libbarrelfish provides a standard set of primitives (threads, memory management, notification). Applications are free to link against
their own library and thereby customise the OS personality as desired. Barrelfish adopts this principle of pushing complexity and
policy into user-space from Exokernel and Nemesis [12, 19].
• Application: Programs are statically-linked ELF binaries and code
can access shared memory as usual, provided this is supported
by the hardware. There are two important kernel abstractions to
point out, specifically domains which represent an address space
and dispatchers which are the unit of scheduling. There is a single dispatcher per core and applications can explicitly span their
domain across several cores if desired.
Barrelfish includes a realisation of scheduler activations [3, 24].
When threads block or upon allocation of the processor, the dispatcher receives an upcall and can decide how to use its time
slice. This mechanism permits an implementation of threads which
combines the flexibility of kernel threads with the performance of
user-level threads.
• Message passing is one of the central themes in a multi-kernel and
Barrelfish has several primitives to support this. Flounder is an
interface-definition language which generates communication stubs.
These provide a uniform interface for sending and receiving messages independent of the underlying transport mechanism. A server
domain exports some set of functions; clients can invoke these in
either blocking or non-blocking manner.
There are several interconnect drivers which implement different
transport mechanisms. LMP (local message passing) provides fast
kernel-mediated communication with other dispatchers on the same
core. UMP (user-level message passing) is a protocol which takes
advantage of cache coherency to transport cache lines between cores
using a shared memory buffer.
2.2 SharedDB
SharedDB [14] implements an in-memory relational database which
processes thousands of queries concurrently by leveraging shared computation. Incoming queries are not processed immediately but instead
enqueued and processed at regular intervals. Processing an entire batch
of queries at once provides the opportunity for shared execution, for
instance costly table scans and joins. Through appropriate capacity planning it is also possible to bound worst case execution time: a query
spends at most one cycle waiting and one being processed.
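To make this bound concrete: if one batch cycle takes at most T_cycle, a query that arrives just after a batch has started waits at most one full cycle before being admitted and is then processed within the following cycle, so its processing time at that operator is bounded by 2 · T_cycle.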
This is very different from traditional database architectures in which
each query is parsed, planned and optimised in a separate execution
context. This incurs sizeable costs and transactions also contend for locks
which limits scalability and causes considerable cross-interaction as load
on the system increases.
It should be noted that both systems have their merits and fit different use
cases. The query-at-a-time model has been refined over many years and
delivers high performance for a specific set of queries, provided indexes
and materialised views are appropriately selected. By contrast,
batching gives consistent throughput and predictable performance even
on unknown and changing workloads. This is possible because the cost
of computation is shared across a batch of queries.
A key abstraction of SharedDB is that of a global query network which
is a dataflow graph of relational operators. This implies there is a fixed
set of queries defined ahead-of-time, although these may have arbitrarily
complex predicates. The workload itself will consist of many instances
of these common templates which permits sharing. An example of a
query network is shown in Figure 4.1 on Page 19 which we will discuss
in more detail later. Currently the query plan is built and implemented
by hand, although there is ongoing work to extend SharedDB with a
declarative SQL-like interface to express the schema and queries more
flexibly. At runtime a cost-based optimiser builds a query plan and
generates specialised code to drive the operators.
Operators receive a batch of queries, push work down the query network
and conduct their processing once results are returned from these sub-queries. Finally the processed data is propagated back up the tree. From
an abstract perspective an operator can be viewed as a black box with two
queues: one for incoming queries, another with results of sub-queries.
There are two categories of operators: blocking and non-blocking. The
former (e.g. sorting) need to materialise the full input stream before
processing can begin, whereas the latter (e.g. projection or filtering) can
immediately process and forward tuples as they arrive.
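As a rough illustration of this black-box view, the following sketch shows the cycle such an operator might run. All type and function names here are our own placeholders, not SharedDB’s interfaces, and the nested loop stands in for whatever shared algorithm (e.g. a join) the operator actually implements:

    #include <vector>

    // Placeholder types standing in for SharedDB's internals.
    struct Query {};
    struct Tuple {};
    template <typename T> struct Queue {
        std::vector<T> items;
        std::vector<T> drain() { std::vector<T> out; out.swap(items); return out; }
        void push(const T& t) { items.push_back(t); }
    };

    struct SharedOperator {
        Queue<Query> incoming;     // queries enqueued for the next cycle
        Queue<Tuple> sub_results;  // tuples returned by sub-queries

        // Whether a tuple satisfies a query's predicate (stubbed out).
        static bool matches(const Query&, const Tuple&) { return true; }

        // One batch cycle: a single pass over the sub-query results
        // serves the whole batch of queries at once.
        void run_cycle(Queue<Tuple>& parent) {
            std::vector<Query> batch = incoming.drain();
            for (const Tuple& t : sub_results.drain())
                for (const Query& q : batch)
                    if (matches(q, t))
                        parent.push(t);
        }
    };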
SharedDB achieves parallelism by running each operator in the context
of a kernel thread. Most operators are single-threaded, although intra-operator parallelism is employed for a few cases such as the hash join
operator. Operators are bound to cores with hard affinity and this
deployment remains fixed. Note that the thread-per-operator model may
limit the number of cores which can be utilised depending on the size of
the query network.
All data resides in main memory and (optional) durability is supported
by logging all updates to disk, with regular checkpoints for recovery.
There are two storage backends: Key-Value and Crescando. Tables can be
accessed with index probes or full table scans. In addition a considerable
portion of processing can be done directly in the storage engine, for
example predicate evaluation, string matching and aggregation.
Crescando [35] is a storage engine designed to support large request
rates and deliver predictable performance on unknown workloads. A
large part of its scalability is due to the Clock Scan algorithm which
implements a cache-aware join between queries and data. It builds an
index over the query predicates and matches these against the data
as it streams through. Two cursors are used: updates are applied in
arrival-order followed by queries. In this way queries see a consistent
snapshot of the data. Crescando also maintains a B-Tree index which
is used to support scan operations and for index nested-loop joins. For
especially large tables, data can also be partitioned over several disjoint
segments using a hashing strategy or in round-robin fashion. In this case,
a separate controller operator distributes requests among the scan threads
and merges their results.
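A minimal sketch of the Clock Scan idea follows, assuming illustrative types rather than Crescando’s API; note that the real engine builds an index over the query predicates instead of the inner loops shown here:

    #include <vector>

    struct Record {};
    struct Update { void apply(Record&) const {} };                       // stub
    struct ScanQuery { bool matches(const Record&) const { return true; } };

    // One pass over a segment: the update cursor runs first, applying
    // pending updates in arrival order, then the query cursor matches
    // the batched queries, so queries see a consistent snapshot.
    void clock_scan(std::vector<Record>& segment,
                    const std::vector<Update>& updates,
                    const std::vector<ScanQuery>& queries,
                    std::vector<Record>& results) {
        for (Record& r : segment) {
            for (const Update& u : updates) u.apply(r);
            for (const ScanQuery& q : queries)
                if (q.matches(r)) results.push_back(r);
        }
    }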
SharedDB is implemented in heavily-optimised C++ and makes extensive use of template metaprogramming. There is also a schema compiler
which takes a table definition and generates specialised code for processing tuples. On the client side, the main logic for orchestrating an
experiment is written in Java along with JNI (Java Native Interface) bindings to low-level code for serialising and sending requests. Client and
server communicate over TCP sockets using a custom protocol. An important design choice is that one thread is used per terminal; as we shall
see this presented some challenges when benchmarking.
Chapter 3
Porting SharedDB to Barrelfish
The previous chapter gave an overview of the database (SharedDB) and
the operating system (Barrelfish) being used in this work. In this chapter
we present the modifications required and some problems encountered
during porting.
The discussion below is broken down into three sections:
the first gives a broad perspective of the porting process, the second
discusses specific aspects from the perspective of SharedDB and the third
focuses on points related to Barrelfish.
3.1 Overview
In general the porting process consists of two stages:
• Compile-time: initially we had to set up the environment and deal
with missing or differing functionality. These issues can be easily isolated and involve many small and straightforward changes
across the project.
• Runtime: thereafter we needed to investigate and diagnose localised
issues in specific components. By comparison, this is what made
up the bulk of the effort.
Our plan was to gradually build up the port, starting with the utilities
library and some unit tests. Subsequent to that would follow a small-scale Crescando demo and finally the full TPC-W workload running on
SharedDB.
3.2 SharedDB

3.2.1 Project structure
SharedDB is a large codebase and it took some time to understand the
key abstractions and overall structure. The papers [35, 14] are intended
for the research community and accordingly focus on the algorithms
and processing model. The most important interfaces (such as Table,
Operator, Result) are documented, and for a newcomer it would help to
have a guide with pointers to these.
To illustrate, here are some notes on how the different components relate
to one another. SharedDB’s codebase is structured as several projects
which progressively build on the functionality of one another:
• Utilities: Wrappers over system-specific functions (memory management and threading) and primitives such as logging, queuing
and reference counting.
• Crescando: Implementation of the storage engine. More specifically,
this contains the logic corresponding to the scan operation, indexing, statistics, durability and partitioning. Clients make use of this
via a generic table interface which provides functions to store tuples
and execute queries.
• Schema compiler: Takes an SQL-like table definition as input and
generates schema-specific code to process data within Crescando.
• SharedDB: Implements the relational operators (e.g. joins, grouping and sorting) and provides a unified interface over the storage
engines (Crescando, key-value).
• TPC-W driver: Server-side code which implements a specific workload. This consists of several schema files and code to build the
query network, populate tables and query logic to drive the operators.
3.2.2 Build system
The existing build system is based on the GNU Autotools suite. Although
this tool includes support for cross-compiling, more complex scenarios
involving code generators require additional attention. We encountered
the problem that the Crescando schema compiler is built for the host
platform (in our case Barrelfish) and therefore fails when executed on
the native platform during the build process. Beyond this, we needed all
code to be statically linked with the main executable.
Barrelfish’s build system (Hake) is designed around similar needs, however it is not aware of the external toolchain and would require further
modification. We therefore opted to switch to CMake with which we had
prior experience. CMake has support for adding custom commands to
the build graph and we wrote a macro to encapsulate the dependencies
and actions needed to run the schema compiler. The primary advantage
this offers is that the entire codebase can be built in a single invocation
and when dependencies are modified all appropriate targets are rebuilt
automatically. Furthermore, built artifacts are laid out in the appropriate
layout to facilitate deployment to the cluster and we could also support
several build configurations from a single source tree.
3.2.3 Static linking
Crescando generates code at compile-time which is specific to a schema
definition and compiles this to a shared library which is loaded at runtime. A function named csd_storage_table_create is then used to create
an instance of a specific table. Dynamic linking is not currently supported
on Barrelfish so we needed to make some modifications to support static
linking. The first change was to resolve conflicting symbols by including
the table name in the function’s name (e.g. csd_storage_author_table_create).
A second change added an interface to transparently use the schema-specific libraries in a portable manner. Based on a configuration macro
this will either call dlsym or return a fixed function pointer depending
on whether dynamic or static linking is used.
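For illustration, such a shim might look as follows. The macro name, the Table type and the lookup helper are hypothetical; the csd_storage_*_table_create symbols follow the naming described above:

    #include <cstring>
    #include <string>

    struct Table;                                // opaque placeholder
    typedef Table* (*table_create_fn)();

    #ifdef CSD_DYNAMIC_LINKING
    #include <dlfcn.h>
    table_create_fn lookup_table_factory(const char* table) {
        // Resolve the schema-specific factory symbol at runtime.
        std::string sym = std::string("csd_storage_") + table + "_table_create";
        void* self = dlopen(nullptr, RTLD_NOW);  // handle to the program itself
        return reinterpret_cast<table_create_fn>(dlsym(self, sym.c_str()));
    }
    #else
    extern "C" Table* csd_storage_author_table_create();
    table_create_fn lookup_table_factory(const char* table) {
        // One branch per schema compiled into the static binary.
        if (std::strcmp(table, "author") == 0)
            return &csd_storage_author_table_create;
        return nullptr;
    }
    #endif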
3.2.4 Modifications
SharedDB is developed on Linux and depends primarily on the standard
C and C++ libraries and a small set of POSIX functions, such as pthreads
and BSD-style sockets. In addition, it makes use of Boost which is an
extensive C++ framework to complement the standard library. Only a
select few header files are needed and no code changes were necessary.
There is also an optional dependency on libnuma although this was
omitted from the port.
Barrelfish uses newlib as C standard library and complements this
with libposixcompat which offers a limited set of POSIX interfaces
to ease porting existing applications. The vast majority of the functions SharedDB needs were already provided by these libraries. A few
Linux-specific functions are used in SharedDB but these could be easily
replaced with an alternate implementation. As an example, we broke
the dependency on the system type CPU_SET, which is used extensively throughout, and substituted those usages with a portable CPU affinity
mask. Aside from this there were two other classes of changes. The
first added feature tests and macros to disable features which are not
available, for instance the socket option SO_REUSEADDR or structs with
differing fields. A second group of changes was needed to fix missing
#include directives. The tool include-what-you-use [16] was used to make
these changes automatically.
In order to have some assurance of correctness we also wrote and ported
some unit tests. In retrospect these had limited utility because they only
cover a small portion of the basic utilities and some Crescando data
structures. Although the unit tests helped to isolate one or two issues,
the more complex problems we encountered were better served by full
end-to-end integration tests. A sample issue we encountered is stack
overflow. SharedDB often makes use of alloca and places structures
on the stack (e.g. a trie for prefix queries). The overflow happens because of the small default stack size on Barrelfish (64 KB) compared with Linux’s (8 MB). This is easily remedied by allocating a larger stack. Nonetheless,
we added a warning to the page fault error which has been especially
helpful in other situations.
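A sketch of the remedy, using the POSIX thread API (the 8 MB figure mirrors the Linux default mentioned above; the wrapper itself is ours):

    #include <pthread.h>

    // Create a worker with an explicitly enlarged stack instead of the
    // platform default (64 KB on Barrelfish at the time of writing).
    pthread_t spawn_with_large_stack(void* (*worker)(void*), void* arg) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);  // 8 MB
        pthread_t tid;
        pthread_create(&tid, &attr, worker, arg);
        pthread_attr_destroy(&attr);
        return tid;
    }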
Overall, few invasive code changes were needed to support Barrelfish, and those that were are largely confined to the system portability
layer and network I/O. We discuss the latter in more detail shortly.
3.3 Barrelfish: Porting

3.3.1 C++ support
Barrelfish is not self-hosting. Instead you develop on a remote machine,
cross-compile to static binaries and then run on a simulator or deploy
onto a rack machine. The build process makes use of the host’s C
compiler. This works when compiling freestanding C programs because
Barrelfish provides its own libraries and startup objects.
For C++ we made use of a cross-compiler which had been previously
ported to target Barrelfish and updated it to the latest version as needed
by SharedDB. The toolchain port itself entails modifications to GNU binutils, which provides the assembler and linker, and GCC (GNU Compiler
Collection) for the compilers and C++ standard library. All the important language constructs work including virtual functions, templates,
thread locals and inline assembler. Exception handling does not work
but this was not a hindrance because SharedDB does not make use of
this language feature.
Whilst C and C++ are mostly compatible at the source level, minor
changes were required to Barrelfish’s header files. C++ is more strict
about function casts and does not support some C99 features, for instance
designated initialisers. In addition, function definitions needed to be
wrapped in extern "C" blocks to demarcate the linking mode.
In Barrelfish each subsystem, such as the file system layer (VFS), needs to
be initialised before usage. SharedDB’s logging system reads a configuration file during its initialisation; however, this is triggered by constructors
of global objects which run before the main() function. Individual functions can be decorated as constructors with an integer to indicate their
priority. Using this we could arrange for VFS initialisation to occur at the
appropriate point.
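As a sketch, using GCC’s constructor priorities: lower priorities run earlier, before the default-priority global C++ constructors. The vfs_init call stands in for Barrelfish’s VFS setup entry point and is an assumption here:

    extern "C" void vfs_init(void);   // assumed Barrelfish VFS entry point

    // Runs before global objects with default constructor priority, so
    // the file system is usable by the time logging reads its config.
    __attribute__((constructor(101)))
    static void early_vfs_init(void) {
        vfs_init();
    }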
3.3.2 Synchronisation primitives
Barrelfish’s threading library provides mutexes, condition variables and
semaphores but was lacking barriers and thread-safe initialisation (‘once’).
These routines were easily implemented. What was a little more involved
is support for blocking waits with a timeout, which we now describe
briefly. This functionality is used within SharedDB to allow for clean
shutdown. More importantly though, lwIP (the network stack) uses this
as the basis to portably implement timers.
The existing synchronisation primitives have functions to block a thread
and wait indefinitely until it becomes ready, but in this instance what
was needed is the ability to re-awaken the thread if a timeout elapses.
Deferred events are not appropriate for this purpose because the thread
needs to run an event loop but it is blocked. We added an analogous
mechanism which is triggered automatically by the dispatcher upcall.
If a wait timeout expires, the corresponding thread is removed from
the queue where it is waiting and added back to the runnable queue.
Conversely, if the thread is awoken in the interim its deferred event is
cancelled. All of this involves manipulating doubly-linked lists which
can be done in constant time.
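The following is a conceptual sketch of that mechanism, not Barrelfish’s actual code; every type and helper here (WaitQueue, block_current, make_runnable, now_us) is a placeholder:

    #include <cstdint>

    struct Thread;
    struct Waiter {
        Thread* t;
        std::uint64_t deadline;
        bool timed_out = false;
        Waiter *prev = nullptr, *next = nullptr;
    };
    struct WaitQueue { Waiter* head = nullptr; };

    void enqueue(WaitQueue&, Waiter*);   // O(1) doubly-linked insert
    void dequeue(WaitQueue&, Waiter*);   // O(1) unlink
    void block_current();                // yield; returns once runnable again
    void make_runnable(Thread*);
    std::uint64_t now_us();

    bool timed_wait(WaitQueue& q, Thread* self, std::uint64_t timeout_us) {
        Waiter w{self, now_us() + timeout_us};
        enqueue(q, &w);
        block_current();   // woken by a signal or by an expired deadline
        return !w.timed_out;
    }

    // Driven from the dispatcher upcall: move expired waiters back to
    // the runnable queue.
    void expire_timeouts(WaitQueue& q) {
        for (Waiter* w = q.head; w != nullptr; ) {
            Waiter* next = w->next;   // save before unlinking
            if (now_us() >= w->deadline) {
                dequeue(q, w);
                w->timed_out = true;
                make_runnable(w->t);
            }
            w = next;
        }
    }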
The implementation of threads in Barrelfish is very elegant, minimal
and straightforward to understand. Having this implemented in a user
library makes it easy to prototype new features which underscores a
benefit of the library OS design.
3.4 Barrelfish: Diagnosing

3.4.1 Library OS design
In keeping with the microkernel design Barrelfish implements a large
portion of its services as unprivileged code in user-space. Examples
of this include device drivers and the network stack. When porting
SharedDB we had no concrete plans to tailor the OS flavour, however
simply having this possibility has been useful in a number of situations.
One major benefit is that this leads to a structure with many self-contained modules which can be readily understood and modified. This
makes it considerably easier to troubleshoot unexpected behaviour due
to the ability to add custom instrumentation. For example the dispatcher
upcall can be used to implement a rudimentary profiler. To give another
example, we also encountered an implementation-specific issue which
prevented allocating more than a few gigabytes of memory. Barrelfish has
an extensible virtual memory system which runs in user-level modules
and so our application could be modified to request memory pages from
the system and map them into its own address space.
3.4.2 Network stack
Barrelfish’s network stack consists of an Ethernet driver, several services (queue, port and filter managers) which run as separate domains
and lwIP which is linked against applications. The latter implements
the TCP/IP protocols and interested readers are directed to [11] which
describes its design and implementation in further detail.
At the edge of SharedDB’s query plan are server operators which interact
with external clients. These are written using conventional BSD-style
sockets with blocking calls and a select() loop. Although this interface
is offered by lwIP, we encountered stability issues under load and when used
from several threads. To resolve this we instead modified SharedDB to
use lwIP’s event-based ‘raw’ API. The major change this involved is that
all network operations need to occur from a single thread. The server
operators therefore call stubs that forward processing to a centralised
thread which performs the operations on their behalf. This potentially
involves cross-core communication which could affect performance and
is something we may need to re-visit later.
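A sketch of this pattern using lwIP’s raw API follows; tcp_recv, tcp_recved and tcp_close are lwIP’s documented calls, while the hand-off function net_queue_push is our placeholder for the forwarding to server operators:

    #include "lwip/tcp.h"

    void net_queue_push(void* server_op, struct pbuf* p);  // placeholder

    // Receive callback, invoked on the single network thread.
    static err_t on_recv(void* arg, struct tcp_pcb* pcb,
                         struct pbuf* p, err_t err) {
        (void)err;
        if (p == NULL) {              // remote side closed the connection
            tcp_close(pcb);
            return ERR_OK;
        }
        tcp_recved(pcb, p->tot_len);  // re-open the TCP receive window
        net_queue_push(arg, p);       // hand the payload to an operator
        return ERR_OK;
    }
    // Registered once per connection: tcp_recv(pcb, on_recv);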
3.4.3 Summary
This section presented some of the work involved in porting SharedDB to
Barrelfish. Whilst there were some challenges and modifications needed
these were primarily implementation issues and using existing facilities
in scenarios where they had not previously been tested. Nevertheless,
the design of both systems complement one another well and they will
serve as a good foundation for further co-design. As of the submission
of this thesis, the port of SharedDB to Barrelfish is functional but there
is a minor unresolved issue related to cross-core messaging. We expect
this problem to be resolved soon, and will compare its performance
characteristics against a run on Linux. We do not foresee any bottlenecks
although there may be differences in performance due to the network
stack and scheduler. In the next chapter we establish an initial baseline
of SharedDB running on Linux.
Chapter 4
Linux Experiments
4.1 Introduction
At this point we understood the inner workings of SharedDB; however,
we lacked an intuition for its performance characteristics. SharedDB
departs radically from the design found in conventional databases and
in fact may fare better under load because it is designed to leverage
sharing among queries. This section describes a series of experiments we
conducted on Linux to quantitatively analyze its behaviour and better
understand its performance under a number of different configurations.
In particular we were interested in investigating questions such as the
following:
• Scalability: What happens as the number of cores is varied? When
there are more cores than operators does it matter which subset is
used? On the contrary, how does the system react to the processor
time being sliced among several operators?
• NUMA awareness: SharedDB makes provision for NUMA architectures in its placement of operators and when allocating buffers. To
what extent does this boost performance?
• Deployment: SharedDB explicitly pins threads and has a handcrafted mapping of operators onto cores. A natural question to ask
is why this is done and how it affects performance.
• Which portions of the query network are most active? Further, is it
possible to characterise the resource usage of each operator?
In the coming sections we shall first provide an overview of the environment and then zoom in on each question, present our expectations and
discuss the results we found. As we shall see, on our specific workload
SharedDB already achieves its peak performance after exhausting only a
small fraction of the resources available on a large multi-core server.
4.2 Methodology

4.2.1 Workload
TPC-W [25] is a standardised benchmark set in the context of an online
book store which implements a typical three-tier application architecture.
Web browsers, which assume the role of customers, interact with the store
by browsing, ordering and shopping for books. These interactions are sent to a web
server which in turn generates requests to the data layer. These requests
are made up of primarily point queries and short-running transactions.
The specification defines fourteen types of interactions which can be
broadly classified as either browsing (searching and requesting details of
products) or ordering (adding items to the cart and purchasing). Beyond
this, there are three workload mixes which govern the proportion of each
interaction type effectively sent to the system: browsing, shopping and
ordering. The first has a breakdown of 95% browsing to 5% ordering
which makes it mostly query-intensive. By contrast the last has an equal
split between the two categories and consists of more updates. The
middle has an 80-to-20 ratio and is a compromise with a mix of queries
and updates.
Figure 4.1 shows the query plan corresponding to the TPC-W workload
in SharedDB. At the base of the network are 10 storage operators whose
tables store the details of the inventory, customers and their orders. There
are 12 server operators at the edge of the network which communicate
with client machines and the query logic is realised by a further 16
relational operators. You may notice these perform joins, sorting and
grouping but there is a distinct lack of primitive filtering operations such
as LIKE and LIMIT. These operations are pushed down and evaluated
directly inside the storage engine together with predicate matching.
There are two different storage engines in use: Crescando is used for the
Address, Author, Country and Item tables; the remainder are backed by
key-value stores with indexes on the key attributes. Two-way partitioning
is also employed on the Address and Item tables.
Figure 4.1: TPC-W: Global Query Plan. Server operators (Buy Request, New Products, Product Detail, Search Item, Best Sellers, Admin Confirm, Related Items, View Cart and others) sit at the edge; hash and nested-loop joins, sorts (by Date, by Title) and group-bys implement the query logic over the COUNTRY, CUSTOMER, ADDRESS, AUTHOR, ORDER_LINE, ITEM, ORDERS, SHOPPING_CART_LINE, SHOPPING_CART and CC_XACTS tables.
4.2.2 Metrics
There are two primary metrics we consider during the experiments:
• Median response time (milliseconds)
• Throughput (WIPS: web interactions per second)
In our context, response time is measured end-to-end from the client
perspective – i.e. from the time a request is sent until the full response
is received including network latency. The metrics we present are aggregated over all clients, but there remains the option to zoom in to a
specific transaction type if needed.
Each query has an associated time limit. These are generally between
3 and 5 seconds with two exceptions: 10 seconds (Search Results) and
20 seconds (Admin Confirm). Responses are considered valid as long
as their 90th-percentile latency does not exceed the given limit for that
transaction type.
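A minimal sketch of this validity rule (all names here are ours):

    #include <algorithm>
    #include <vector>

    // A transaction type is valid if its 90th-percentile latency does
    // not exceed the time limit defined for that type.
    bool within_limit(std::vector<double> latencies_ms, double limit_ms) {
        if (latencies_ms.empty()) return true;
        std::sort(latencies_ms.begin(), latencies_ms.end());
        std::size_t i = static_cast<std::size_t>(0.9 * (latencies_ms.size() - 1));
        return latencies_ms[i] <= limit_ms;
    }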
4.2.3 Factors
The following is a non-exhaustive list of the factors influencing our
experiment:
• Load (number of emulated browsers).
• Placement of operators onto cores.
• Total number of cores allocated.
• Operating system.
• Network topology.
• Hardware (in particular caches and interconnect structure).
• Workload mix.
We will thoroughly investigate the first three factors and keep the remainder fixed. Given that there are interactions – for example between
deployment strategy and hardware – each experiment will target a single factor. Unless mentioned otherwise we used the default placement
strategy and all the cores on the machine.
4.2.4 Environment
In the subsequent experiments we used a single machine to host the
database server and twelve client machines to generate load. Their
specifications are as follows:
• Server (”appenzeller”): quad-processor, 12 cores per processor (AMD Opteron 6174 ”Magny Cours” running at 2.2 GHz; 48 cores in total), 128 GB RAM over 8 NUMA domains, 64 KB L1-d, 512 KB L2 and 12 MB shared L3 cache.
• Client (”dryad”): 8 cores (AMD Opteron 2376 ”Shanghai” running
at 2.3GHz) and 16 GB RAM.
The server and client machines are not attached to the same switch but
are rather separated by three network hops with average round-trip time
of 0.22 ms.
The code itself is compiled using GCC 4.7.2 at optimisation level 3,
for the native architecture and with assertions disabled. The runtime
environment consists of 64-bit Debian ”Sid” running Linux kernel
3.6.10 and glibc 2.13.
For the purposes of these experiments all data was held in main memory
and durability was disabled so there was no disk I/O involved. The
cardinality of the tables is controlled by two factors: number of EBs
(emulated browsers) and number of items (1K, 10K, 100K, 1M or 10M).
The actual values used were 100 EBs and 10K items respectively, which
results in a total dataset size of less than a gigabyte. These numbers
may seem small relative to the size of main memory, but recall that
SharedDB’s primary use case is sharing which is especially effective
when processing a vast quantity of queries in parallel. There is nothing
inherent in its design that would prevent use of larger datasets and this
is the subject of ongoing experiments.
4.2.5 Modifications
There are a few specific aspects of our experimental setup worth noting
as they deviate from the specification:
• Given that we are interested in benchmarking SharedDB, we will
exclude the web server layer and clients will instead issue their requests directly to the database. For the purpose of our experiments
the system-under-test (SUT) is defined as the database in isolation.
For comparison, TPC-W defines the SUT rather broadly to comprise
all the web servers, databases and network infrastructure used in
the implementation.
• To reduce the scope of our experiments we will only consider a
single workload mix: browsing. The ordering mix has a larger proportion of updates and primarily stresses durability and the disk
I/O subsystem. This is a reasonable choice because it involves complex queries with several table scans, joins and sorting/grouping
which exercise a considerable portion of the query network.
• Emulated browsers wait for a short interval between the arrival
of a response and sending their next request. This is referred to
as the think-time and is defined to be Poisson distributed with an
average of 7 seconds. One specific quirk we needed was to scale
this by a factor of roughly 0.6, which reduces the time between
interactions and consequently has the effect of simulating more
emulated browsers. This was necessary because each terminal runs
in the context of a Java thread and we observed undesired garbage
collection effects with a few thousand threads.
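One way to realise the scaled think-time sampling from the last bullet, assuming it means the waiting times of a Poisson process (i.e. exponentially distributed delays with a 7-second mean):

    #include <random>

    // Sample the delay before a terminal's next interaction, shrunk by
    // the ~0.6 factor described above.
    double next_think_time_s(std::mt19937_64& rng) {
        std::exponential_distribution<double> think(1.0 / 7.0);  // mean 7 s
        return 0.6 * think(rng);
    }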
4.3 Results

4.3.1 Baseline
Our first experiment attempts to reproduce the results from the VLDB’12 paper [14]. Based on those figures we expected to reach peak performance of about 1500 WIPS with 12000 emulated browsers (on the same hardware) and deployed accordingly.

Figure 4.2: Baseline Performance, Varying Load. (a) Median response time in seconds and (b) throughput in web interactions per second, against emulated browsers (in thousands).
The results we obtained are shown in Figure 4.2. We observe that throughput rises steadily until roughly 27000 browsers and the system is able
to cope with the increasing load. Response times also increase very
gradually and the median response time remains below half a second
until this point. Beyond this threshold the system becomes unstable
and exhibits a decline in throughput and fluctuations in response time.
This is largely attributable to the browsing workload mix which has
several heavy search interactions. In particular the Best Sellers query
involves several joins, sorting and grouping and the system spends time
processing these but responses eventually exceed their time limit.
To summarise, SharedDB scales in proportion to the load and is able to
do so because it shares processing over a large number of concurrent
queries.
4.3.2 NUMA awareness
SharedDB includes some logic for NUMA hardware (described in more
detail in Appendix A.1). We will now conduct an experiment to compare
whether this affects performance.
Hypothesis: At this stage it is unclear what quantity of data is shared between operators and whether these accesses could saturate the interconnect. In the extreme case of data shuffling, NUMA awareness can yield a dramatic improvement (3x) [22]. In SharedDB’s setting, however, the overall effect on our two metrics is likely to be on the order of several percent because these effects are minimal compared to the cost of network transmission.

Figure 4.3: NUMA-awareness Comparison, Varying Load. (a) Median response time in seconds and (b) throughput in web interactions per second, against emulated browsers (in thousands), with NUMA awareness enabled and disabled.
The results we observed are shown in Figure 4.3 which compares performance with NUMA-awareness both enabled and disabled using the
hand-tuned deployment strategy. Counter-intuitively, disabling NUMA
awareness actually gave better performance beyond 25000 browsers: peak
throughput increases by just under ten percent. The reason for this behaviour is not clear, although it may be attributable to the deployment
strategy which was devised by analysing the communication patterns
between operators. Linux has a policy of allocating memory on the
NUMA node where it was first touched. Another possible explanation is
that this results in better memory placement than the hard-coded policy of allocating on a specific NUMA node, as SharedDB implements. In
any case, we did not investigate further but simply omitted this aspect
from the Barrelfish port and can revisit this decision if we observe a
performance anomaly.
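To illustrate the two placements being compared (numa_alloc_onnode is libnuma’s real call; the surrounding function and parameter names are ours):

    #include <numa.h>
    #include <cstdlib>

    void* allocate_buffer(std::size_t bytes, int node, bool numa_aware) {
        if (numa_aware)
            return numa_alloc_onnode(bytes, node);  // pin pages to 'node'
        // Otherwise rely on Linux's first-touch default: pages land on
        // the node of the thread that first writes to them.
        return std::malloc(bytes);
    }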
4.3.3 Deployment strategy
In this series of experiments we extensively investigate how the placement
of operators onto cores affects performance. We will compare three
different policies: hand-tuned, ”largest-first” heuristic and not pinning
threads. As a second dimension we will vary the number of cores and
explore how the database behaves as resources become more scarce.
SharedDB’s hand-tuned deployment strategy was devised by analysing
access patterns and data transferred between operators and we expect
it to perform well. It incorporates the heuristic of scattering Crescando
scan threads and gathering SharedDB operators. The reasoning behind
this is to have good locality between operator queues by placing scan
threads as ‘far’ apart as possible and thus minimise conflicts in the shared last-level
cache.
Operator placement is specified in a configuration file which defines the
core number for each operator. The effective core is derived by taking
these numbers modulo the actual number of cores. As we shall see, in
specific cases performance is degraded with the hand-tuned strategy due
to conflicts between sensitive operators. To address this we devised a
simple strategy which places large table scans and critical join operators
first. When devising this we considered the number of tuples passing
through and the CPU usage of each operator.
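For illustration, the modulo rule and hard affinity might be realised as follows on Linux (pthread_setaffinity_np is the glibc call; the wrapper itself is ours):

    #include <pthread.h>   // pulls in sched.h for cpu_set_t on glibc

    // Pin an operator thread to its configured core, wrapped modulo
    // the number of cores actually allocated.
    void pin_operator(pthread_t tid, int configured_core, int num_cores) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(configured_core % num_cores, &set);
        pthread_setaffinity_np(tid, sizeof(set), &set);
    }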
Finally we also compare against not setting hard affinity on operator
threads and instead relying on the operating system’s policy. We refer to
this deployment as no-pin.
For this experiment we used a fixed load of 18000 browsers, varied the
number of cores and tried different deployment strategies. The results
are presented in Figure 4.4. All three strategies perform identically when
using the maximum number of cores (although the machine has 48 cores, the query network only has 44 operators in total, so a few cores are unutilised). However, as the number of
cores is reduced we observe three distinct trends corresponding to each
deployment strategy. The hand-tuned strategy performs worst overall,
our heuristic improves on this and, surprisingly, not pinning threads yields a striking improvement. In particular take note of the steepness of the respective curves.

Figure 4.4: Effect of Deployment on Performance. (a) Median response time in seconds and (b) throughput in web interactions per second, against the number of CPU cores, for the hand-tuned, heuristic and not-pinned strategies.
As expected, performance decreases when there are fewer resources
allocated but the point at which this occurs is of note. When leaving
decisions to the kernel, peak performance can be reached using just 8
cores and beyond this the situation does not change. Viewed differently,
using only one tenth of the cores gives just a 5% decrease in throughput.
The same effect can be observed with the other deployment strategies
although less pronounced. This may be partly attributable to the size of
workload we use. A further explanation for this behaviour is that there
are only a few active operators and the remainder are comparatively dormant.
This is reflected in Table 4.1, which shows the mean CPU usage of the top five operators in descending order.
Operator                            Mean CPU usage (%)
Address (Crescando, Segment 2)      84.3
Address (Crescando, Segment 1)      77.6
Item (Crescando, Segment 1)         61.6
Item (Crescando, Segment 2)         57.9
Author ⋈ Item                       26.2
...
Total                               11.5

Table 4.1: Top-5 Operators by CPU Usage

The issue with the default deployment strategy is that it treats core numbers as abstract values whose order doesn’t matter. Consider the data point with 24 cores: the Orders table is co-located with one segment of the Address table, which is an instance of an undesirable conflict because both operators are hot. Our heuristic assigns operators largest-first which helps mitigate bad cases where two hot operators are forced to share a core. This is clearly reflected by the more stable behaviour and overall higher throughput. A typical justification for explicitly pinning
threads is to avoid potentially costly thread migrations which thrash the
cache state. Our experiments demonstrate that in SharedDB’s case this
is unwise and actually degrades performance. Leaving decisions to the
kernel gave the highest throughput and lowest response time in all cases.
It is challenging to gain insight into the kernel’s scheduling decisions,
but we believe this is due to having full flexibility to re-arrange threads
and quickly react as load fluctuates.
Aside from the spatial aspects we have explored, this experiment shows
what occurs when the cores are over-subscribed. It is not surprising
that performance suffers because several threads now compete for the
processors’ time. When using just two cores a third of the queries being
processed exceed their time limit and consequently throughput halves
and response time quadruples. This underscores the importance of
appropriate capacity planning.
This section explored the effect of operator placement on performance.
We found that deployment strategy does indeed matter, although it was
best not to set hard affinity but rather leave these decisions to the Linux
kernel. Further, our experiments show that on our specific workload
resource utilisation is low and it is possible to achieve peak performance
with less than a quarter of the resources!
4.3.4 Operator characterisation
In the previous section we established that resource utilisation is low and
we want to delve further and understand why this is the case.
Merkel et al. [26, 27] proposed the notion of Activity Vectors in prior
research in the context of energy- and temperature-aware scheduling.
These vectors are associated with a task and express the extent to which
functional units within a processor (e.g. TLB, ALU) are utilised. They are
derived using performance monitoring hardware in the processor, updated at regular intervals and are used to influence scheduling decisions.
Our aim is to apply the same technique to operators in the TPC-W query
network to characterise their resource usage and potentially identify
bottlenecks. We expect that operators can be broadly classified as CPU-,
IO- or memory-bound and if so, this should be reflected in their activity
vectors. Such a classification could be used to derive a suitable deployment strategy when there are more operators then cores. To measure
resource usage we will use performance counters; Appendix A.2 provides
additional background material.
In this series of experiments we use a fixed load of 18000 browsers and a
deployment with no overlap: no two operators share the same core, so
that their activity can be observed in isolation. We record performance
counters and our primary areas of focus are: CPU utilisation, memory
bandwidth and cache utilisation (private and shared). The output is a
matrix with a vector for each operator, where each vector consists of
twelve dimensions.
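A sketch of how such a vector might be formed (event names as in Figure 4.5; the peak rates used for normalisation are placeholders, not measured values):

    #include <array>
    #include <cstdint>

    // Normalise twelve event counts, sampled over an interval, to the
    // fraction of each unit's peak rate so the dimensions are comparable.
    std::array<double, 12> activity_vector(
            const std::array<std::uint64_t, 12>& counts,
            const std::array<double, 12>& peak_rate_per_s,
            double interval_s) {
        std::array<double, 12> v{};
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = (counts[i] / interval_s) / peak_rate_per_s[i];
        return v;
    }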
We faced some issues when trying to interpret this data because there are
many dimensions with few discernible patterns. Figure 4.5 shows a heat
map of the correlation matrix. The majority of the dimensions are highly
correlated and exhibit little differentiation. In an attempt to make sense
of these figures we applied dimensionality reduction techniques such
as PCA (principal component analysis) and k-means clustering. This
analysis revealed that the dataset has low intrinsic dimensionality; 99.93%
of the variance can be explained with two components. The issue, however, is that the actual directions of these components point somewhere
in the space. As such the projection of our data onto these directions
does not relate to the original metrics so it cannot be interpreted directly.
Visual inspection showed two clusters – ‘hot’ and ‘dormant’ operators
respectively. The hot operators we previously mentioned exhibited high
CPU usage, as expected, but in other dimensions such as cache miss rate
they did not exhibit much variation.
In summary, these experiments did not provide the insight we had
hoped for. It may be that a more thorough investigation of a larger set of
performance events is needed, in conjunction with more sophisticated
analysis techniques.
Figure 4.5: Correlation heat map for the performance counters (cycles, instructions, l1d_access, l1d_l2, l1d_system, l2_access, l2_miss, l3_request, l3_miss, mem_read, mem_write, mem_access).
4.3.5 Partitioned tables
As we observed in a prior experiment, resource utilisation is rather
low, yet we would like to leverage the full potential of the machine. One
possibility to overcome this is to focus on bottleneck operators or replicate
several copies of the entire query network. The former can be easily
realised by partitioning hot tables which we try in this experiment.
The results are depicted in Figure 4.6. Although this does result in a
throughput boost of a few percent, it is not as promising as initially
hoped. Tuples are distributed in a round-robin fashion over the segments
of a table and a central controller handles operations across partitions.
Subsequent investigation revealed that this operator is reaching the so-called ‘batch limit’. This is a knob which adjusts the granularity of
requests (queries or updates) which are passed around as a unit. The
choice of this parameter is guided by a trade-off: larger values can
improve sharing, but this should also be small enough that the tuples
28
2.0
Baseline
1.8
Item: 4 segments
Author: 2 segments
1.6
1.4
1.2
1.0
0.8
0.6
0.4
0.2
0.00 4 8 12 16 20 24 28 32 36 40
Emulated Browsers (in thousands)
(a) Response Time
Throughput [web interactions per second]
Median Response Time [seconds]
4.4. Summary
4000
3500
3000
2500
2000
1500
1000
500
00 4 8 12 16 20 24 28 32 36 40
Emulated Browsers (in thousands)
(b) Throughput
Figure 4.6: Partitioned Crescando Tables, Varying Load
fit into the cache. Resolving this is not within the scope of this work,
although it would be interesting to deploy several replicated engines
with a middleware layer for coordination, as in Multimed [30].
4.4 Summary
In this section we presented a series of experiments intended to explore
specific aspects of SharedDB’s performance. To a large extent it behaves
consistently with the behaviour we expected. Starting with the baseline,
we observed that SharedDB is able to sustain a large number of clients at
high throughput because of shared computation. A surprising finding
was that on our specific workload resource utilisation is low and it is
actually possible to achieve peak performance using only a small number
of cores. Digging deeper, we tried to quantify each operator’s resource
usage using performance counters and clustering but struggled to gain
meaningful insight. Finally we tried table partitioning in the hope of
boosting throughput but this brought limited gain. We believe this is due
to limitations of the current implementation.
Chapter 5
Co-Design
The following section discusses two interesting questions which arise
when consolidating operators onto fewer cores and concludes with a
discussion of how applicable it might be to run SharedDB on a non-cache
coherent architecture.
5.1 Operator consolidation
When running the TPC-W benchmark SharedDB is able to deliver high
throughput and sustain tens of thousands of clients, but resource utilisation is low on our particular workload. Our experiments demonstrated
that it is possible to consolidate to a smaller number of cores without
severely impacting performance. In this context there are two interesting
sub-problems related to operator scheduling.
• Spatial deployment: In our experiments we observed that different
operator placement strategies have an impact on performance (see
Section 4.3.3). This theme investigates how to group operators
onto cores. To illustrate this, some operators are cache-sensitive and could interfere with one another, while others could benefit from co-location because they share a working set residing in cache.
This problem is outside the scope of this thesis and we will not
describe it in more detail, but instead we summarise the approach.
Initially the database is run across all cores to gather performance
counter measurements which characterise the operators. These
are subsequently fed into an optimisation algorithm which applies
bin packing (a sketch of such a packing follows this list). The result is a deployment plan which uses fewer resources and still meets the application’s SLA. In a limited sense this is similar to the knapsack problem of job scheduling in datacenter deployments.
• Temporal scheduling: When using 8 instead of 44 cores, our results showed that total throughput is reduced by less than 5%; however, response time increases by roughly 30% and exhibits higher
variability. We believe this is attributable to time sharing and the
operating system’s scheduler not being aware of the application’s
requirements. Databases are aware of data dependencies and can
estimate the resources required to process a set of queries. The
question we pose is whether the database can make better scheduling decisions based on the knowledge it has.
Due to the limited scope of this thesis we were unable to implement and evaluate these ideas; however, we define the problem below and present some initial thoughts.
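As referenced in the spatial deployment item above, the following C++ sketch illustrates the kind of packing the optimisation step could apply. It uses first-fit-decreasing bin packing, a common heuristic we choose here for illustration, and assumes each operator is characterised by a single utilisation figure derived from performance counters; all names are hypothetical.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical operator profile: the fraction of one core's cycles the
// operator needs, derived from performance-counter measurements.
struct OpProfile { int id; double utilisation; };

// First-fit-decreasing bin packing: place operators onto as few cores as
// possible without exceeding a per-core capacity budget (e.g. 0.9).
std::vector<std::vector<int>> pack(std::vector<OpProfile> ops, double capacity) {
    std::sort(ops.begin(), ops.end(),
              [](const OpProfile& a, const OpProfile& b) {
                  return a.utilisation > b.utilisation;
              });
    std::vector<std::vector<int>> cores;  // operator ids assigned per core
    std::vector<double> load;             // accumulated utilisation per core
    for (const auto& op : ops) {
        std::size_t i = 0;
        while (i < cores.size() && load[i] + op.utilisation > capacity) ++i;
        if (i == cores.size()) { cores.emplace_back(); load.push_back(0.0); }
        cores[i].push_back(op.id);
        load[i] += op.utilisation;
    }
    return cores;
}

Note that this sketch ignores interference between co-located operators; the real optimisation would also have to model cache sensitivity and the SLA, as described above.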
5.1.1 Temporal scheduling
Consider the problem of time slicing among several tasks. This can be
broken down into three main components:
• Which task next?
• For how long?
• On which processor?
A simple scheduler may use a round-robin strategy and time slices on the
order of tens or hundreds of milliseconds. Modern operating systems are
more sophisticated and have variable time slices, dynamic priorities and
more complex scheduling disciplines which strive to make decisions in
constant time. Additional details on Barrelfish’s scheduling infrastructure
can be found in [29].
Operating system schedulers need to support a mix of workloads, including throughput-oriented batch jobs and interactive tasks, which forces them to make trade-offs. Consider an operator which is pre-empted at an
inopportune moment; this could pollute its working set and increase its
runtime. Some of the performance degradation we observed may stem
from poor scheduling decisions due to lack of coordination between OS
and database. We believe that by using application knowledge, such as
data dependencies and task estimates, it may be possible to improve on
this situation where the core is shared between several operators. An
initial goal would be to confirm that smart scheduling decisions can, for
example, reduce end-to-end response time for a query.
Problem definition
Inputs
• Spatial deployment: Static mapping of operators onto cores.
• Runtime statistics: Databases have cost models which can provide
an estimate of the execution cost for a batch of tuples. They also
maintain information such as operator selectivity, number of tuples
materialised, etc.
• Query network: Standard boxes and arrows model from data-flow
systems. More concretely this is a directed acyclic graph where
each node represents a database operator and edges indicate a data
dependency. Regarding fan-out, each node has exactly one parent and either zero, one or two children; examples are leaf table scans, sorting/grouping and joins, respectively. There are no constraints
on the depth of the query network but typically this will have no
more than 5 levels.
Output Using the inputs above, the algorithm should return a scheduling plan, that is, a serialisation of the tasks on a particular core which is optimised for some metric, perhaps latency or throughput. By tasks we
refer to operations such as a single table scan or the build and probe
phases of a join operator.
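To make the problem statement concrete, here is a minimal C++ sketch (all type and field names are hypothetical) encoding the inputs above. It serialises the operators mapped to one core with a naive greedy rule, dispatching the ready operator with the most pending work per unit of estimated cost; a real scheduler would of course also incorporate the cost model, data dependencies across batches and the QoS aspects discussed here.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical operator description; ids are assumed to be 0..n-1.
struct OperatorTask {
    int id;
    std::vector<int> children;   // data dependencies in the query network
    double cost_estimate;        // cost model estimate for one batch
    std::size_t pending_tuples;  // tuples queued at the operator's input
};

// Serialise the operators on one core: repeatedly dispatch the ready
// operator (all children already run) with the most pending work per
// unit of estimated cost. The acyclic network guarantees that a ready
// operator (ultimately a leaf table scan) always exists.
std::vector<int> schedule_core(const std::vector<OperatorTask>& ops) {
    std::vector<int> plan;
    std::vector<bool> done(ops.size(), false);
    while (plan.size() < ops.size()) {
        int best = -1;
        double best_score = -1.0;
        for (const auto& op : ops) {
            if (done[op.id]) continue;
            bool ready = std::all_of(op.children.begin(), op.children.end(),
                                     [&](int c) { return done[c]; });
            if (!ready) continue;
            double score = op.pending_tuples / op.cost_estimate;
            if (score > best_score) { best_score = score; best = op.id; }
        }
        done[best] = true;
        plan.push_back(best);
    }
    return plan;
}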
Simplifications As an initial starting point there are several simplifying
assumptions which can be made to reduce the problem scope:
• No external applications are allocated to the same set of cores, so
the scheduler has full visibility and control over what is happening.
• Operator placement remains static and each scheduler is only responsible for a single core. In this way one does not need to
consider thread migrations.
• Only consider pairwise dependencies between neighbouring operators, which considerably prunes down the multitude of possibilities.
• There needs to be some constraint on the time period during which
operators may potentially be active, such that an ideal scheduler
could actually fulfil the work. The spatial deployment will generate a reasonable placement such that no core is overloaded at peak.
Micro-benchmark
When switching between tasks there are two different costs incurred.
There is the direct cost of running the scheduler and dispatching to
another task, which involves saving and restoring the processor state.
There are also indirect penalties due to effects such as pollution of caches.
In order to judge whether our approach is viable we would need to gauge the magnitude of these costs and assess the ratio of useful work to wasted time. One approach would be to conduct an experiment which varies the time slice between two extremes: a regular tick and no pre-emption (co-scheduling). The aim would be to demonstrate a scenario
where either extreme affects performance and thus support the claim
that a compromise is needed.
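As a rough illustration of the indirect cost component, the following self-contained C++ sketch interleaves two working sets at varying granularity: small quanta force the sets to evict each other from cache, mimicking pre-emption at inopportune moments, whereas one quantum covering a whole set approximates co-scheduling. It captures only cache pollution, not the direct dispatch cost, and all parameters are illustrative assumptions.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

// Two tasks, each repeatedly summing its own working set, interleaved
// in quanta of a configurable size.
double run(const std::vector<long>& a, const std::vector<long>& b,
           std::size_t quantum, std::size_t rounds) {
    auto t0 = std::chrono::steady_clock::now();
    volatile long sink = 0;  // prevent the sums from being optimised away
    for (std::size_t r = 0; r < rounds; ++r)
        for (std::size_t i = 0; i + quantum <= a.size(); i += quantum) {
            sink += std::accumulate(a.begin() + i, a.begin() + i + quantum, 0L);
            sink += std::accumulate(b.begin() + i, b.begin() + i + quantum, 0L);
        }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::vector<long> a(1 << 22, 1), b(1 << 22, 2);  // ~32 MiB each
    for (std::size_t q : {1u << 10, 1u << 14, 1u << 18, 1u << 22})
        std::printf("quantum %zu elements: %.3f s\n", q, run(a, b, q, 8));
}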
Prior work in this area [21, 33] has shown that working set size and access stride play an important role, so it would make sense to select cache-sensitive operators (e.g. the Crescando scan) for full effect. Dispatch latency
figures alone would not be too informative and it would make sense to
instead measure the overall time for task completion on a representative
workload. It is expected that co-scheduling will perform best since
there is no interference between tasks, but this will also affect system
responsiveness because other tasks cannot make progress in the interim.
Related work
In [8], Carney et al. present work on operator scheduling from the realm
of data stream processing in Aurora. This makes a good comparison
because it shares similar requirements of supporting large volumes of
data with low latency. These techniques are also largely applicable to
databases, for instance the traversal models which optimise for throughput, latency or memory. The paper also presents benefits of batching of
operators (superboxes) and tuples (‘trains’). The latter closely resembles
the query processing model of SharedDB. The use of QoS specifications
to influence scheduling decisions is also intriguing because it permits the
system to shed load when demand exceeds capacity.
Another class of programs which implement their own scheduling are parallel runtimes. These are intended for a different scale, where there are a large number of fine-grained and possibly short-lived tasks. An example
thereof is Erlang which has a scheduler per core with work stealing. The
aim here is for low latency with soft-realtime guarantees. It achieves this
with accounting by assigning a reduction budget and by having separate
per-process heaps. Further information can be found in [36, 2].
5.2 Non-cache coherent architectures
Barrelfish’s design with explicit messaging lends itself to non-cache coherent architectures such as the Intel SCC (Single-chip Cloud Computer).
In this section we provide some perspective on what might be involved
to have SharedDB running on such a system.
Processors typically found in computing devices today are cache-coherent.
This refers to the fact that the contents of the caches are kept consistent
between all processors by means of a hardware protocol. Scaling such
a memory architecture is challenging as the number of cores in a chip
increases, so hardware designers are investigating alternative designs
where the individual cores are connected via a fast network fabric. This has implications for software, which has long been able to assume a shared address space and now needs to employ message passing.
The designs of Barrelfish and SharedDB are similar in that they both adopt a philosophy of minimising shared state. Where they differ, however,
is that the former assumes no sharing at its bottom layer, whereas the
latter is written as a conventional shared memory program. To illustrate
this, consider how results are passed around – as reference-counted
pointers over synchronised queues. From a structural perspective though,
sharing is not necessary because operators work independently of one
another and exchange data in a well-defined manner. As such, it should
in principle be possible to run SharedDB on a system without cache
coherency.
As a starting point it could make sense to adopt a multi-process design
and place each operator in a separate domain. Having several address
spaces would immediately flag any sharing issues. Operator queues
provide a typical get/put-style API which could be easily implemented
over messages. What could prove more challenging is the initialisation
code, which populates tables with data, builds the query network and
attaches operators to their neighbours. This is easiest to express in a single piece of code which has access to all the required entities; however, it could be broken down into localised operations. One potential downside
of carving SharedDB into standalone units is that there is no longer a
single point of control, which could make it more challenging to apply
techniques such as co-scheduling.
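To illustrate the direction, here is a minimal C++ sketch (hypothetical names, not SharedDB's actual interfaces) of how the get/put queue API could be backed by messages: batches are copied onto a channel rather than handed over as reference-counted pointers. The Channel below is an in-process stand-in for a real interconnect driver, included only so the sketch is self-contained.

#include <cstddef>
#include <deque>
#include <vector>

// A batch of result tuples packed into one contiguous buffer.
struct Batch { std::vector<char> bytes; };

// The get/put-style interface operators already program against.
struct ResultQueue {
    virtual void put(const Batch& b) = 0;
    virtual Batch get() = 0;
    virtual ~ResultQueue() = default;
};

// Stand-in message channel. On a real non-cache-coherent machine this
// would be an interconnect driver; recv() assumes a message is present.
class Channel {
    std::deque<std::vector<char>> mailbox_;
public:
    void send(const char* data, std::size_t len) {
        mailbox_.emplace_back(data, data + len);  // copy onto the "wire"
    }
    std::vector<char> recv() {
        std::vector<char> msg = std::move(mailbox_.front());
        mailbox_.pop_front();
        return msg;
    }
};

// Message-passing queue: put() copies the whole batch into a message
// and get() copies it back out on the receiving side.
class MessageQueue : public ResultQueue {
    Channel& ch_;
public:
    explicit MessageQueue(Channel& ch) : ch_(ch) {}
    void put(const Batch& b) override { ch_.send(b.bytes.data(), b.bytes.size()); }
    Batch get() override { return Batch{ch_.recv()}; }
};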
Communication is likely to play a central role in such work. On a
shared memory architecture, operators in SharedDB exchange results
by passing around pointers. A non-cache coherent version may need to
pass the actual tuples and for large datasets this could place a strain on
the shared interconnect. Some queries may have strict SLAs and it could
be interesting to explore whether ideas such as the network calculus
could be applied to provide QoS guarantees. There may also be some
interesting communication patterns which arise, for example sending the
results of a shared computation to several operators could be viewed as
multicast. Updates on replicated tables involve consensus algorithms; the existing literature discusses how to implement these efficiently over a network, and there is the question of whether such protocols are also suitable within a single machine.
It is conceivable that there may be hybrid architectures with ‘islands’ of
cache coherence. Parallelised operators could benefit from this by using
shared memory to coordinate their work. Further, on such machines one
could adopt an approach similar to Multimed [30] with several replicas
each running SharedDB in shared memory on a group of cores and
middleware to coordinate execution.
Chapter 6
Conclusion & Outlook
6.1 Discussion
The goal of this thesis was to investigate the topic of co-design in the
context of two research systems: an operating system (Barrelfish) and
a database (SharedDB). The main accomplishment is to bring the two
systems together and explore potential synergies. SharedDB is a large
application which exploits modern hardware and has demanding requirements. In particular, on the TPC-W workload we used, it is capable
of supporting tens of thousands of clients, thousands of concurrent transactions each second and response times of several hundred milliseconds.
This makes it a good use case for improving the collaboration between database and OS, and a good basis for further research. In this work we introduced the problem of consolidating resources whilst retaining the performance characteristics of the system, and presented some approaches using spatial and temporal scheduling.
The porting process presented some challenges and required changes to
both systems, such as modifying the memory management and dealing
with limitations of the network stack. Altogether, though, these were mostly issues of the current implementation and do not point to problems with the model itself. In addition to porting, we also conducted a series of
of the model itself. In addition to porting, we also conducted a series of
experiments to quantify aspects of SharedDB’s performance on Linux.
The results we presented indicate that NUMA awareness and explicit thread pinning actually diminish performance. Further, we found
resource utilisation to be low – peak performance could be achieved with
just 8 of 44 cores. This could be due to our choice of TPC-W workload
and is the subject of further investigation. We also presented some
work to characterise operator resource usage using performance counters, although this did not provide the insight we had hoped for.
To recap, we believe this work largely addressed our initial motivation and provides good potential to further the collaboration between databases and operating systems and to improve resource management. The systems used in this thesis make a good fit because they both adopt a shared-nothing design. Beyond the scalability benefits which have been extensively discussed in the literature, the resulting implementation is simple, elegant and thus easy to reason about, which is especially helpful when troubleshooting.
6.2 Future work
Now that SharedDB runs on Barrelfish there are numerous avenues
worthy of additional investigation. Having full control over the entire
stack and extensive experience in the group is also beneficial. Together,
this enables testing of new ideas for co-design and optimisations which
bisect several layers of the system.
• Once the problem with messaging between cores is resolved, the
next step is to conduct a thorough investigation of SharedDB’s
performance when running on Barrelfish. We do not expect any
major differences compared to Linux. Should there be any bottlenecks, it would be informative to understand whether these are
due to the multi-kernel model or rather specific artifacts of the
implementation.
• Barrelfish has a multi-level scheduling infrastructure and there
are several interesting topics on operator scheduling which we
discuss in Section 5.1. One aspect is whether time slicing on a core
could benefit from database knowledge. Another theme worth revisiting is parallelised operators and how they could benefit from
coordinated execution.
• Barrelfish has limited tools for debugging and performance analysis. There are a few ideas which come to mind that would make
development easier:
– A simple profiler could be built using the dispatcher upcall
and a stack unwinder. The existing tracing infrastructure is
useful for short-running programs; this would cater to a need
for aggregated statistics from a longer run.
– There is a family of ‘sanitiser’ tools [31, 32] which can be used to detect race conditions and memory access errors (e.g. buffer overflows). Unlike more complex tools which rely on dynamic instrumentation, these can be easily ported because they are implemented via compile-time instrumentation with calls to a small runtime library.
– The system call interface that libbarrelfish uses is quite
slim. Would it be possible to build a compatibility layer and
run Barrelfish applications in a Linux process? This would
give access to conventional tools and could be helpful when
porting.
• A common application structure on Barrelfish is to place a representative on each core and communicate explicitly using messages.
By contrast, SharedDB runs as a single process and makes use
of shared memory. What are the implications and design trade-offs if operators instead each run in a separate domain? This is also an approach which could be used to run SharedDB on a non-cache-coherent machine, as we discuss in Section 5.2.
• Energy is a valuable commodity. Can the OS – in its role as global
resource manager – encourage applications to be more conservative?
There are also architectures with heterogeneous cores (such as ARM
big.LITTLE). How would operators be deployed onto cores on such
a machine and what are the trade-offs when migrating threads?
• Server consolidation is widely used in industry, but this can be
problematic for performance-sensitive code. Imagine a co-located
application that competes for the shared cache, causing a deterioration in performance. Jails/containers provide containment on a
software level. Can techniques such as page colouring be used to
provide working sets which don’t conflict in cache? A use case for
this would be for the streaming query-data join in Crescando.
Appendix A
Appendix
A.1 NUMA awareness
This section presents a short summary of what NUMA architectures are and the measures SharedDB implements for such systems. This information complements the experiment described in Section 4.3.2.
NUMA (Non-Uniform Memory Access) describes a particular hardware
design where the cost varies for accessing different regions of memory.
This can be found in modern multi-core servers and is implemented to
improve scalability. Each group of cores (node) is attached to a private
area of local memory which can be accessed more cheaply. Processors are
still cache-coherent and have a view of the entire shared memory; however, accessing remote memory will incur a transfer across the interconnect link. From the software design perspective there are two main aspects to consider: the finite bandwidth of the interconnect and the latency penalty upon access (typically 50% or more).
libnuma [17] is a library which provides information on the hardware topology and allows memory to be allocated according to a number of policies.
SharedDB makes use of this when deciding where to place an operator
and when allocating memory buffers. More concretely, here are two scenarios of how this interface is used (a usage sketch follows the list):
i There are interfaces to explicitly allocate memory on the local node, on a particular node, or interleaved over all nodes. An operator’s stack and heap are allocated on the local node to minimise cross-node traffic. NUMA-aware allocation is an explicit operation, so quite a large fraction of memory is actually just delivered by the default allocator.
ii The order in which cores are numbered differs significantly between
vendors. SharedDB permutes this to encode its placement heuristic,
namely to spread operators across separate chips and thus maximise
utilisation of shared resources. The numbering scheme goes sequentially over the cores on a chip, interleaving the NUMA domains in a round-robin fashion. Hyper-threads are placed last to discourage their usage.
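For concreteness, a small example of the libnuma allocation interfaces mentioned in scenario i might look as follows; this is illustrative only and is not SharedDB’s code.

#include <numa.h>    // libnuma; link with -lnuma
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA is not supported on this system\n");
        return EXIT_FAILURE;
    }
    std::printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());

    const std::size_t size = 64 << 20;  // 64 MiB, an arbitrary example size

    // Allocate on the node the calling thread currently runs on, which is
    // what an operator's private buffers would use to avoid remote traffic.
    void* local = numa_alloc_local(size);

    // Pin an allocation to a specific node...
    void* on_node0 = numa_alloc_onnode(size, 0);

    // ...or interleave pages round-robin over all nodes to spread bandwidth.
    void* spread = numa_alloc_interleaved(size);

    numa_free(local, size);
    numa_free(on_node0, size);
    numa_free(spread, size);
    return EXIT_SUCCESS;
}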
A.2 Performance Counters
In this section we give a brief overview of performance counters and the
available tools. Consider also reading Section 4.3.4 which discusses how
we tried using these to characterise resource usage of operators.
Within the processor hardware there are a fixed number of performance
monitoring units (PMUs) which can record a vast number of events.
These are very useful for performance analysis because they operate with
low overhead and provide insight into micro-architectural details such as
requests to the different cache levels or why cycles are being wasted. The
processor vendors provide guides [10, 20] with details of the available
events and formulae to interpret the raw values.
A piece of software programs a register with a specific event and threshold, which arms the counter. Events are gathered, and once the desired threshold is reached an interrupt is generated. At this point the program can capture the execution context (e.g. the instruction pointer, or a stack unwind).
Later this is aggregated into a report which gives the percentage of events at a given source location. This mode of recording is similar to the operation of a profiler and is referred to as sampling. In some use cases it is more convenient to add explicit instrumentation, for which there are libraries such as PAPI [28].
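As a brief illustration of explicit instrumentation with PAPI, a minimal counting example could look as follows; the preset events used here are assumptions, as their availability varies by platform.

#include <papi.h>    // link with -lpapi
#include <cstdio>
#include <cstdlib>

// A hot loop whose resource usage we want to characterise.
static unsigned long work(long n) {
    unsigned long s = 0;
    for (long i = 0; i < n; ++i) s += (unsigned long)i * i;
    return s;
}

int main() {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        std::fprintf(stderr, "PAPI initialisation failed\n");
        return EXIT_FAILURE;
    }
    int events = PAPI_NULL;
    PAPI_create_eventset(&events);
    PAPI_add_event(events, PAPI_TOT_CYC);  // total cycles
    PAPI_add_event(events, PAPI_L2_TCM);   // L2 total cache misses

    long long counts[2];
    PAPI_start(events);
    unsigned long result = work(100000000L);
    PAPI_stop(events, counts);

    std::printf("result=%lu cycles=%lld l2_misses=%lld\n",
                result, counts[0], counts[1]);
    return EXIT_SUCCESS;
}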
There are a number of different tools available on Linux, for example OProfile, the ‘perf events’ framework and Intel PCM (Performance
Counter Monitor). We have found the first to be flexible and well suited
to our needs. In particular, it allows fine-grained control over the events
recorded and has several options to customise how reports are generated. Initially we tried the perf framework, and while the initial setup is easier we encountered some problems. For instance, it tries to help with platform diversity by providing abstract events; however, this makes it more challenging to interpret results because the event semantics are less precise. In our setting we also needed to zoom in on specific cores
but we found that events were already pre-aggregated in the trace. The
perf framework is undergoing active development and would be worth
re-visiting at some later stage. We made use of the last tool for a specific
use case which we now briefly describe.
Memory bandwidth and cross-NUMA traffic can be easily measured
on AMD systems because there are corresponding events. The same,
however, is not true for the Intel Nehalem platform. Such a chip consists
of the processing cores themselves and the ‘uncore’ portion: last-level
cache, memory controller and interconnect (QPI). Whilst there is a PMU
residing on the uncore, events need to be recorded explicitly over a time
duration rather than with sampling. Further, events cannot be attributed to the originating core, but only at the socket level. With the more
recent Sandy Bridge platform it is now possible to measure memory
requests and attribute this on a more fine-grained level [9].
Bibliography
[1] Gustavo Alonso, Donald Kossmann, and Timothy Roscoe. SWissBox:
An architecture for data processing appliances. In CIDR 2011, Fifth
Biennial Conference on Innovative Data Systems Research, Asilomar, CA,
USA, January 9-12, 2011, Online Proceedings, pages 32–37, 2011.
[2] Jesper Louis Andersen. How Erlang does scheduling. http://jlouisramblings.blogspot.dk/2013/01/how-erlang-does-scheduling.html, 2013. [Online; accessed 05-May-2013].
[3] Thomas E Anderson, Brian N Bershad, Edward D Lazowska, and
Henry M Levy. Scheduler activations: Effective kernel support for
the user-level management of parallelism. ACM Transactions on
Computer Systems (TOCS), 10(1):53–79, 1992.
[4] Andrea C Arpaci-Dusseau, Remzi H Arpaci-Dusseau, Nathan C
Burnett, Timothy E Denehy, Thomas J Engle, Haryadi S Gunawi,
James A Nugent, and Florentina I Popovici. Transforming policies
into mechanisms with Infokernel. In ACM SIGOPS Operating Systems
Review, volume 37, pages 90–105. ACM, 2003.
[5] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach,
and Akhilesh Singhania. The multikernel: a new OS architecture for
scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd
symposium on Operating systems principles, SOSP ’09, pages 29–44,
New York, NY, USA, 2009. ACM.
[6] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans
Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu,
Yuehua Dai, et al. Corey: An operating system for many cores. In
Proceedings of the 8th USENIX Symposium on Operating Systems Design
and Implementation, pages 43–57. San Diego, CA, 2008.
[7] Scott A Brandt, Scott Banachowski, Caixue Lin, and Timothy Bisson.
Dynamic integrated scheduling of hard real-time, soft real-time, and
non-real-time processes. In Real-Time Systems Symposium, 2003. RTSS
2003. 24th IEEE, pages 396–407. IEEE, 2003.
[8] Don Carney, Uğur Çetintemel, Alex Rasin, Stan Zdonik, Mitch
Cherniack, and Mike Stonebraker. Operator scheduling in a data
stream manager. In Proceedings of the 29th international conference on
Very large data bases - Volume 29, VLDB ’03, pages 838–849. VLDB
Endowment, 2003.
[9] Roman Dementiev. Monitoring integrated memory controller requests in the 2nd, 3rd and 4th generation Intel Core processors. http://software.intel.com/en-us/articles/monitoring-integrated-memory-controller-requests-in-the-2nd-3rd-and-4th-generation-intel, 2013. [Online; accessed 05-May-2013].
[10] Paul J Drongowski. Basic performance measurements for AMD Athlon 64, AMD Opteron and AMD Phenom processors. AMD whitepaper, 25, 2008.
[11] Adam Dunkels. Design and implementation of the lwIP TCP/IP
stack. Swedish Institute of Computer Science, 2:77, 2001.
[12] Dawson R Engler, M Frans Kaashoek, et al. Exokernel: An operating
system architecture for application-level resource management. In
ACM SIGOPS Operating Systems Review, volume 29, pages 251–266.
ACM, 1995.
[13] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm.
Tornado: Maximizing locality and concurrency in a shared memory
multiprocessor operating system. Operating systems review, 33:87–100,
1998.
[14] Georgios Giannikis, Gustavo Alonso, and Donald Kossmann.
SharedDB: Killing one thousand queries with one stone. Proc. VLDB
Endow., 5(6):526–537, February 2012.
[15] Jana Giceva, Tudor-Ioan Salomie, Adrian Schüpbach, Gustavo
Alonso, and Timothy Roscoe. COD: Database / operating system
co-design. In Proceedings of the 6th biennial Conference on Innovative
Data Systems Research (CIDR), Asilomar, CA, USA, January 2013.
[16] Google Inc. include-what-you-use - a tool for use with Clang to analyze #includes in C and C++ source files. https://code.google.com/p/include-what-you-use/, 2011. [Online; accessed 05-May-2013].
[17] Andi Kleen. A NUMA API for Linux. Technical report, Novell Inc.,
August 2004.
[18] Orran Krieger, Marc Auslander, Bryan Rosenburg, Robert W Wisniewski, Jimi Xenidis, Dilma Da Silva, Michal Ostrowski, Jonathan
Appavoo, Maria Butrico, Mark Mergen, et al. K42: building a complete operating system. In ACM SIGOPS Operating Systems Review,
volume 40, pages 133–145. ACM, 2006.
[19] Ian M. Leslie, Derek McAuley, Richard Black, Timothy Roscoe, Paul
Barham, David Evers, Robin Fairbairns, and Eoin Hyden. The
design and implementation of an operating system to support distributed multimedia applications. IEEE Journal on Selected Areas in
Communications, 14(7):1280–1297, September 1996.
[20] David Levinthal. Performance analysis guide for Intel Core i7 processor and Intel Xeon 5500 processors. http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf, 2008. [Online; accessed 14-March-2013].
[21] Chuanpeng Li, Chen Ding, and Kai Shen. Quantifying the cost of
context switch. In Proceedings of the 2007 workshop on Experimental
computer science, ExpCS ’07, New York, NY, USA, 2007. ACM.
[22] Yinan Li, Ippokratis Pandis, René Müller, Vijayshankar Raman,
and Guy M. Lohman. NUMA-aware algorithms: the case of data
shuffling. In CIDR, 2013.
[23] Jochen Liedtke et al. On µ-kernel construction. In Proceedings of the 15th ACM Symposium on OS Principles, pages 237–250, 1995.
[24] Brian D Marsh, Michael L Scott, Thomas J LeBlanc, and Evangelos P
Markatos. First-class user-level threads. ACM SIGOPS Operating
Systems Review, 25(5):110–121, 1991.
[25] Daniel A. Menascé. TPC-W: A benchmark for e-commerce. Internet
Computing, IEEE, 6(3):83–87, May 2002.
[26] Andreas Merkel and Frank Bellosa. Task activity vectors: a new
metric for temperature-aware scheduling. ACM SIGOPS Operating
Systems Review, 42(4):1–12, 2008.
[27] Andreas Merkel, Jan Stoess, and Frank Bellosa. Resource-conscious
scheduling for energy efficiency on multicore processors. In Proceedings of the 5th European conference on Computer systems, pages 153–166.
ACM, 2010.
[28] Philip J. Mucci, Shirley Browne, Christine Deane, and George Ho. PAPI: A portable interface to hardware performance counters. In Proceedings of the Department of Defense HPCMP Users Group Conference, pages 7–10, 1999.
[29] Simon Peter, Adrian Schüpbach, Paul Barham, Andrew Baumann,
Rebecca Isaacs, Tim Harris, and Timothy Roscoe. Design principles for end-to-end multicore schedulers. In Proceedings of the 2nd
USENIX Workshop on Hot Topics on Parallelism (HotPar ’10), June 2010.
[30] Tudor-Ioan Salomie, Ionut Emanuel Subasu, Jana Giceva, and Gustavo Alonso. Database engines on multicores, why parallelize when
you can distribute? In Proceedings of the sixth conference on Computer
systems, pages 17–30. ACM, 2011.
[31] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and
Dmitry Vyukov. AddressSanitizer: A fast address sanity checker. In
USENIX ATC 2012, 2012.
[32] Konstantin Serebryany, Alexander Potapenko, Timur Iskhodzhanov,
and Dmitry Vyukov. Dynamic race detection with LLVM compiler.
In Runtime Verification, pages 110–114. Springer, 2012.
[33] Benoît Sigoure. How long does it take to make a context switch? http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html, 2010. [Online; accessed 05-May-2013].
[34] Michael Stonebraker. Operating system support for database management. Commun. ACM, 24(7):412–418, July 1981.
[35] P. Unterbrunner, G. Giannikis, G. Alonso, D. Fauser, and D. Kossmann. Predictable performance for unpredictable workloads. Proc.
VLDB Endow., 2(1):706–717, August 2009.
[36] Jianrong Zhang. Characterizing the scalability of Erlang VM on
many-core processors. Master’s thesis, KTH, School of Information
and Communication Technology (ICT), January 2011.