MASTER’S THESIS:
NOMADIC OPERATING SYSTEMS
December 10, 2002
By
Jacob Gorm Hansen & Asger Kahl Henriksen
Department of Computer Science,
University of Copenhagen
Abstract
This thesis attempts to solve the configuration and reconfiguration difficulties encountered in
utility computing, by allowing problem instances to be submitted as running nomadic operating
systems. An environment suitable for hosting multiple nomadic operating systems and an
example nomadic operating system, based on Linux, are implemented. Performance of the
nomadic operating system is shown not to suffer significantly, compared to native Linux, and
the nomadic operating system is shown to be able to migrate between two hosts in less than
one tenth of a second.
Acknowledgements
We would like to thank the following people for their time and effort in making this thesis
possible:
Niels Elgaard Larsen, our counsellor from DIKU.
Peter Andreasen for proofreading and comments, and for supplying us with this nice TeX style.
Henrik Lund for helping create the illustrations.
Enterprise Brandhouse for letting us use their offices for our work.
The nice people on the L4-ka and L4-hackers mailing lists, for taking the time to explain in
depth the inner workings of L4.
Contents

1  Introduction
   Work description
   Focus of this thesis
   Hypotheses

2  Motivation
   Mobile computer use
      Laptop and palmtop computers
      Remote display technologies
   Trends in computing
      Utility computing
      Network file systems
      Removable media
      Faster, unstable networks
   The need for operating system migration

3  Present state
   System partitioning
      IBM VM/370
      Microkernels
      Disco and VMWare
      Jails
      Java Virtual Machine
      Choosing a technology
   Process migration
      Sprite
      MOSIX
      The residual dependency problem
      Why process migration has failed to take off
   Persistent systems

4  Proposed solution
   Interface design considerations
   Choosing a microkernel
      Mach 3.0
      L4
      L4 abstractions
      L4Linux
   Different migration schemes
   Guest interface considerations
      Address space layout
      Page fault and page request handling
      Checkpointing behaviour
      Hardware abstractions

5  Working with the L4 kernel
   L4 basics
   Checkpointing
   Virtual memory
   Different L4 versions

6  Implementation
   Host environment
      L4Linux as a host environment
      The NomadBIOS host environment
      Choice of host environment
   Sharing host resources
      L4 task numbers
      Physical memory
      Timer interrupt
      Realtime clock
      Network packet filtering
   Adapting L4Linux as a guest operating system
      Hardware abstraction layers
      Disabling interrupts
      Suspending L4Linux and its user processes
      Resuming L4Linux and its user processes
      Pure demand paging model
      Task identifier migration issues
   Remote control interface
   Future work
      Checkpointing server
      Load balancing
      Block device access

7  Performance measurements
   Latency benchmark
   Throughput benchmark
   Migration benchmark

8  Discussion
   Security
   Migration across network boundaries
   Other usage scenarios
      Workstation hotel
      Laptop replacement
   Related work
      The Fluke microkernel
      Denali
      vMatrix
   Conclusion

A  Guest interface specification
   Address space layout and page fault handling
   Checkpointing behaviour
   Hardware abstractions
   External thread identifiers
   Guest info page

B  Changes to the L4Linux source

C  Availability

D  Bibliography
Chapter 1: Introduction
The first two sections below are a slightly altered version of the original work description submitted for
this thesis before work began. Minor grammatical corrections, and a few clarifications, have been made.
The current trend towards distributed computing, in the form of clusters, or grids, introduces a demand for dynamic allocation and re-allocation of computational resources across administrative realms. A number of languages and APIs have been proposed for creating a unified environment on which applications may rely. This may be impractical for a number of reasons.

- Existing applications need to be rewritten to utilise the new APIs or language features. This may be very costly or even impossible, due to lack of access to source code or qualified personnel, or lack of bindings between the new APIs and old programming languages like FORTRAN.

- Applications may require features not provided by the new APIs, such as shared memory, or access to various network file systems.

- In the case of new and interpreted languages, such as Java, the performance overhead often negates whatever speedup might be had from distributing the computation in the first place.

- To provide a common environment, the new APIs or languages will need to implement or abstract already existing technologies, resulting in duplication of effort and code bloat. It is reasonable to expect the new APIs to grow in complexity, until they match the features already provided by modern operating systems.
From the above, the obvious yet naive way to achieve such a unified environment without
rewriting everything, would be to define a standard operating system configuration, that would
be adhered to by all participating organisations. Applications would then be able to rely on the
existence of various operating system features, such as access to network file systems, a common set of installed shared libraries, and so on.
However, the difficulty of agreeing on such a platform configuration, and of keeping it synchronised across all nodes, would grow exponentially with the number of participating organisations and the number of features, so this approach is not realistic.
One has to realize that the reliance of the application upon certain operating system features
makes the operating system and its configuration a part of the application. To solve this configuration problem, we propose that instead of distributing applications in the form of plain user
level programs, the entire operating system should be distributed along with the application,
allowing users to customise the operating system configuration according to the demands of
the application.
Most popular operating systems today serve two primary purposes. The first is to be a hardware abstraction layer, providing a unified interface to resources such as disk, network and
memory. The second is to provide a range of services to client programs, such as protection,
concurrency, persistence, naming etc. User level programs rely heavily on the particular semantics of the service layer, and changes to this layer may cause the programs to stop functioning.
These two layers are often tightly coupled, mainly for performance reasons. This tight coupling
has some drawbacks. For example, it is non-trivial to migrate a running operating system
from one physical computer to another, without restarting it, because the operating system has
committed itself to certain assumptions about the workings of the hardware, as determined at
boot time.
Since the hardware is already abstracted as seen from both user level applications and the kernel service layer, it becomes possible to cleanly separate the layers, leaving the operating system with
only the service layer responsibilities. With this separation in place, migrating entire operating systems with all of their running applications becomes possible, as does running multiple
(possibly different) operating systems concurrently on the same computer.
Work description
Building a foundation for migratable operating systems, and for running a number of these
concurrently on the same processor, will be the scope of our work.
Our goal will be to create a small program, much like the BIOS or firmware installed in computers today, that encapsulates not only the specifics of the installed hardware, but also the
processing units and memory. This can be achieved by using a microkernel to host and service
one or more guest operating systems.
We will do the following:
- Investigate the various microkernel designs and rate them with regards to our goals.
- Analyse the needs and challenges of hardware abstractions.
- Analyse problems of migration across network boundaries.
- Implement a migratable operating system based on Linux.
- Benchmark our implementation against alternative schemes.
Focus of this thesis
Today, 30 years after the invention of UNIX, operating systems are still the centre of much research attention. Most of the researchers in the field can be divided into two groups: the group
aiming to replace UNIX with something entirely different, and the group aiming to improve
on what is already popular.
An example of a replacement project is Eros [Shapiro, 1999], a checkpointing and capability-based operating system, and an example of an improvement project is the MOSIX [Barak and
La’adan, 1998] process migration system for Linux.
There is no inherent conflict between these two schools of thought, and many researchers are
working within both fields. Yet, when embarking on a research project on operating systems,
one has to choose whether to try and topple the old with something radically new, or take the
path of least resistance by attempting to improve upon the existing.
As relative novices in the field, this project’s authors feel in no shape to stray too much from
the beaten paths, so this project is one of evolution, not revolution. Therefore, the research and
experiments described here focus more on finding new ways of exploiting classic operating
systems technologies, than on developing concepts for next-generation operating systems.
Due to the limited time frame of this project (the work of two persons full-time for six months),
and the main goal of creating a real-life nomadic (migratable) operating system, it was decided
to build upon existing and proven technologies where possible, instead of implementing base
technologies from scratch. Notably, the project will build upon an existing microkernel foundation, rather than try and invent a new one. Some previous microkernel based approaches
have gotten to the point where functionality is comparable to traditional monolithic systems,
and stopped there. The real benefits of the microkernel approach are often not exploited fully,
but left as an exercise for the reader. This project can be described as the undertaking of one
such exercise; showing that the use of a microkernel allows multiple operating systems to run
concurrently, and allows them to migrate quickly and easily between physical and somewhat
heterogeneous machines.
With regard to the implementation of a nomadic operating system, the goals are restated in more technical detail as follows:

- Implement an environment suitable for hosting multiple concurrently running nomadic operating systems on commodity hardware, including shared and safe access to hardware devices such as network adapters.

- Implement a non-toy example nomadic operating system able to support standard applications with no recompilation or relinking necessary, and to transparently migrate between hosts.

- Implement a system of network protocols, clients and servers, for migrating running operating systems over a network, including measures to limit the downtime experienced by users.
The purpose of the host environment is similar to the original purpose of the IBM PC basic
input/output system (BIOS), in providing a standard foundation on which real operating systems can run. Therefore, we refer to it as the NomadBIOS. The example nomadic operating
system will be a version of Linux, referred to as NomadLinux.
Hypotheses
We aim to confirm the following hypotheses about nomadic operating systems:

Concurrent By adapting an existing microkernel-based operating system, the concurrent running of multiple fully protected operating systems on the same Intel IA32-based computer, without a fully virtualised environment [1], will be possible.

Migratable By being based on hardware abstractions, operating systems will be able to migrate between heterogeneous physical computers, and do so without becoming unresponsive during migration, and without experiencing the residual dependency problem [2].

Efficient By sacrificing the ability to host unmodified operating systems, the implemented system will be able to outperform fully virtualised systems. Performance of the adapted operating system, running within the abstracted environment, will equal that of the original microkernel-based version of the operating system, which in turn will be almost on par with that of traditional monolithic operating systems.

Scalable The performance of two program instances running under two concurrent operating system instances will be identical to the performance of the two instances running under one instance of the operating system.

[1] Such as the one present in VMWare, described on page 12.
[2] Residual dependencies are described in more detail on page 16.
Chapter 2: Motivation
This project deals with migration of running programs between physical hosts. This concept,
while elegant and beneficial in theory, has found only limited use in practice.
Our hypothesis is that this limited success is due to a wrong choice of migrational unit. Current schemes focus on the migration of processes rather than of complete systems. Processes
in operating systems such as UNIX have a large number of dependencies on the environment
in which they are running, for example open files, memory shared with other processes, reliance on access to various host-specific system features, and so on. Operating systems, on the
other hand, only require access to a certain amount of memory, a certain type of processor, and
perhaps a few standard peripheral devices such as harddrives or network cards. Furthermore,
while the interfaces exposed by operating systems to application processes are often complex,
vaguely defined and subject to change, the hardware interfaces exposed by computers to operating systems are simple, well-defined, and only slowly evolving. Thus, the operating system,
instead of just its processes, should be the unit of migration.
We make two initial assumptions about the migration of running operating systems. These
assumptions will be backed by more thorough argumentation later, but for now let us assume
that:
1. Open network connections to the outside world will always migrate seamlessly along
with the operating system.
2. All file systems mounted by the migrating operating system will be network file systems,
and thus remain accessible due to assumption 1.
Naturally, by changing the unit of migration, the range of useful applications of migration
changes as well. Process migration schemes such as MOSIX have proven useful mainly for balancing CPU and memory load across tightly connected clusters running long, non-I/O-intensive calculations. Operating system-level migration, on the other hand, will probably be more
useful when dealing with migration between clusters that are loosely connected via wide-area
networks such as the Internet, as well as for administrative tasks, such as a server being taken
down for hardware maintenance, where the operating system might be moved temporarily to
another host without users noticing. Running operating systems may also be checkpointed to
stable storage on another host for backup purposes.
Operating system migration may also change the way individual users work with computers.
Because the new unit of migration corresponds to the complete working environment of a
workstation user, mobility problems may be attacked from a new angle.
Mobile computer use
Computers aid us in our work, but also limit our work to specific locations such as the office
or the home. Mobile computing is a broad term, embodying various attempts to regain the
freedom to work where it is most practical (next to a knowledgeable coworker) or most pleasant
(in the park).
Today, mobile computing is generally solved in one of three different ways:

- Laptop computers - compress all of the components needed for a real computer system to a size where it becomes carryable.

- Palmtop computers - build a minimal computer system, and tell the user to limit his work to the options provided.

- Remote displays - free the user from his desk (but not from the network) by allowing him to promote his display to any terminal at hand in the company or on the campus.
We will look further into the advantages and disadvantages of the technologies below, and
introduce nomadic operating systems as an alternative in some situations.
Laptop and palmtop computers
Laptop computers are generally just desktop workstations using smaller (and thus often either
more expensive or less performant) components. Laptops cannot be made arbitrarily small,
due to user demands of a large monitor and a usable keyboard, and tend to be rather heavy as
well. Though they promise users the chance to work in the park or on the beach, it feels safe to
assume that most laptops reside safely on desks for most of their existence.
Palmtops trade functionality for size, and are small enough that they can be taken anywhere
just in case they might be needed. While good enough to handle a calendar and to do list, plus
a few specialised applications, they are not yet able to replace a real computer for traditional
computing uses.
Remote display technologies
Originally, remote displays were invented to allow multiple users to share one computer, but in
recent years, as personal workstations have become cheap, their main force from the individual
user’s perspective, lies in allowing the user to work from different locations, with access to the
same set of files and applications.
Remote display technologies offer a solution to mobility, by allowing the user to log in from
terminals geographically removed from the server on which applications are running. Some
technologies require the user to perform a new login when changing location, while others
allow a login session to be resumed on a new terminal. Some technologies, such as VT100,
support only a text display, while others, such as the X Window System, provide display of
bitmapped graphics as well.
Common to all such technologies is the need for a reliable network connection. If the network
is down, no work can be done. This dependence on network connectivity also means that while
a remote display will be usable from everywhere inside a company, or on a university campus,
it may become too slow for practical use from less connected sites, for example when working
from home. For such uses, most people still prefer laptops, or just an extra workstation, with
an extra set of applications, at home.
Operating system migration offers a solution to the mobility problem. By letting the user migrate his entire operating system along with him when he moves, the need for laptops, palmtops, and remote displays is reduced. Because the operating system only needs to migrate once,
network connectivity becomes less important.
Trends in computing
Below, we will examine three current trends in computing, trends that we believe underline
the practicality of migrating entire operating systems instead of just their processes.
Utility computing
The term utility computing, or computing grids, describes a trend towards sharing access to
computation much as access to the power grid is shared today. If electricity is needed, it is not
necessary to invest in a diesel generator, windmill, fuel cell, or nuclear reactor. Instead, it is possible to plug into the power grid via simple-to-use means, and retrieve electricity
from there.
Some people [Foster et al., 2002] believe that in the future, computing will become a utility
just like electricity is today. Not only will consumers be able to purchase processing power or
safe data storage from central entities, they will also be able to provide and sell these services
themselves.
Researchers in natural sciences, who occasionally have great needs of computational power
but not always the budgets for the necessary equipment, have been pushing grids as a way
of leveraging computer investments across multiple institutions, and a number of research
projects dealing with the creation of software for service discovery, security, and job scheduling are ongoing.
Network file systems
Since the late 1980’s, the trend has been for storage to become detached from computation.
Network filesystems, such as NFS, and distributed filesystems such as AFS [Satyanarayanan,
1990], allow data to be placed on dedicated servers. These servers may be tuned for better
I/O performance, security is easier to enforce, and data redundancy can be avoided. The main
drawback is that network partitions may render client computers useless, and that client-side
storage space may end up unused. AFS utilises client harddrives as cache space, allowing
near-local read performance when working on previously cached files, and even allows disconnected operation in certain cases where all needed files are in the cache. AFS also supports
replication of read-only filesets across multiple fileservers, mitigating the effects of network
partitions.
Coda [Satyanarayanan et al., 1990] goes one step further, by allowing read-write replication of
filesets across fileservers, and writes to client caches even when disconnected from fileservers,
tracking modifications in a log which is replayed on the fileserver upon reconnection.
Intermezzo [Braam et al., 1999] is a re-implementation of the central Coda ideas, attempting
to reuse preexisting functionality where possible. For example, Intermezzo uses the journaling
layer of the local filesystem for logging filesystem modifications, and the HTTP protocol for
bulk network file transfers, instead of inventing its own protocol.
The emergence of intelligent and efficient network file systems increases the feasibility of having
tasks run at remote locations without having to transfer the total filesystem along with the
application, especially if it is not known up front which parts of the total filesystem will be
accessed.
Removable media
The ubiquitous but technically outdated 3.5” floppy disk has for many years hindered the advent of new forms of removable media. However, with the introduction of the Universal Serial
Bus (USB) and IEEE-1394 (aka FireWire) interfaces as a standard feature in most computers, the
removable media market has been revitalised. Solid-state devices such as M-Systems DiskOnKey, or the IBM USBKey, and portable harddrives such as the Apple iPod music player, allow
vast quantities of data to be carried in a pocket, or even in a key-ring.
The combination of removable media and migrating operating systems would allow a user to
migrate a running desktop environment to passive removable media, and resume the session
on a different machine. This would enable the user to bring with him his favourite working
environment, without having to carry a heavy laptop around.
Faster, unstable networks
The main trends in networking over the last ten years have been a continuous growth in bandwidth, and an increasing amount of computer systems and devices connected to the Internet.
But even though networks are becoming faster and larger, network instabilities and partitions
have not yet ceased to occur. With interconnected networks, the points of failure causing partitions may lie outside the administrative realms of those affected, and so may be hard to rectify.
However, problems are rare enough that few sites employ preventive measures such as redundant wide-area connections.
For the Internet, it seems as if bandwidth and the number of connected hosts are rapidly increasing,
but availability remains constant at well below 100%. When planning Internet-scale distributed
systems, this should be kept in mind. Assuming almost infinite bandwidth for an application
may well make sense, while relying on constant and 100% reliable connectivity between any
two nodes in the network for the application to function, may be dangerous.
Operating systems typically run much longer than processes, and so will need to migrate less
frequently. They will be able to cache large amounts of data, and thus become less reliant on
constant network connectivity than process-migration systems.
The need for operating system migration
Storage is becoming more easily detachable from computation. In a network file system, running software can be moved to a new location without losing contact with its file state. With fast
removable media, migration may take place by simply checkpointing the entire running operating system, along with its file state to removable media, and carrying it to another machine
where it may resume operations.
Networking bandwidth and stability trends are relevant when suggesting remote display technologies as a solution to the mobility problem. As long as constant connectivity cannot be
guaranteed, users will be wary of betting their ability to work on the stability of the network.
If instead the operating system is migrated along with the user when he moves, connectivity is
only necessary during the period of migration.
It is still too early to tell whether computing grids will ever appeal to individual users, but it
appears evident that a system for migration of running calculations inside the grid will be a
necessity for efficient scheduling of resources.
With migratable operating systems (or nomadic operating systems as we like to call them), a
grid customer is able to submit a job as a set of running operating systems, for which the grid
may then allocate the necessary resources.
Overall, the above mentioned trends point to the feasibility of operating system migration, aka
nomadic operating systems.
Chapter 3: Present state
This project is about running multiple operating systems on one machine simultaneously, transparent migration of these operating systems to another machine, and, as a consequence of the
latter, also about checkpointing running operating systems. A lot of research has been performed within all of these three fields, which are generally described by the names system
partitioning, process migration, and persistent systems.
In this chapter, we briefly survey the most prominent system partitioning and process migration technologies, and provide a short introduction to persistent systems.
System partitioning
Various ways of running multiple operating systems on the same machine exist. In the 1970s,
hardware-assisted virtual machines running on IBM mainframes were the centre of much attention. In the late 1980s, microkernels such as Mach seemed like the right answer due to their
ability to run different operating system personalities concurrently, and in the mid 1990s the focus moved to safe programming languages such as Java. Lately, the hugely successful VMWare
has spurred renewed interest in the virtual machine concept.
IBM VM/370
The most well-known system for running multiple operating systems on the same machine, is
the IBM VM/370 [Creasy, 1981] time-sharing system. It builds on the concept of pure virtual
machines, emulating the complete instruction set of the physical machine for every operating
system. Resource sharing is managed by the Control Program (CP), supporting various sharing
policies, such as exclusive access to magnetic drives, or time-sharing of the CPU.
In the original VM/370 system, the Conversational Monitor System (CMS) was the most frequently used guest operating system, but more recent versions run both UNIX and Linux as well,
and are able to host several thousand virtual machines on a single host.
VM/370 runs on specialised hardware, built to aid virtualisation. While interesting for large
organisations able to afford such equipment, its design prohibits its use on cheap commodity
hardware, such as Intel systems, that lack proper virtualisation support [Robin and Irvine,
2001].
Microkernels
The microkernel concept was proposed in 1970 by Per Brinch Hansen [Hansen, 1970], for the
RC4000 system. The fundamental idea is to leave only the absolute essentials inside the operating system kernel (called a nucleus by Brinch Hansen), keeping all other functionality in normal
user level programs. The focus of the RC4000 nucleus was extensibility, as the hardware platform on which it was running did not support memory protection. The nucleus supported
buffered inter-process communication (IPC) as the primary means of communication between
processes. Processes could be either internal, what we understand as a process today, or external,
driving peripheral hardware. Since IPC could be performed towards both kinds of process, it
became a unifying abstraction for both process-to-process communication, as well as for access
to hardware.
Later, in the mid-1980s, this approach was revisited in the Mach kernel [Rashid et al., 1989].
Mach supports buffered IPC secured by a capability-like system of ports and port references, memory protection with user-implementable paging policies, and even virtual memory
with disk-paging. Critics argued that Mach was too slow, as performance when running UNIX
on top of Mach was lower than when running as one monolithic program, and today Mach is
considered by critics as a failed experiment and proof that the microkernel concept is fundamentally flawed.
Later again, in 1995, the idea was revisited by Jochen Liedtke [Liedtke, 1995], who argued that
the deficiencies of Mach were due mainly to bad design and the inclusion of too many features,
rather than a proof that the microkernel concept was bad. In contrast with Mach, Liedtke’s L4
kernel implemented only the basic abstractions which could not be implemented with equal
functionality at the user level. Only unbuffered IPC was supported, based on the argument
that buffered IPC might be implemented by user level threads. L4 had user level pagers but
no disk-based virtual memory. The original L4 implementation was written in assembly code
and tailored to the Intel x86 processor family. Liedtke showed that at the sacrifice of portability,
microkernel based systems could be made to perform almost as well as monolithic systems. For
example, an implementation of Linux on L4 ran only about 5-10% slower for practical scenarios
than the traditional monolithic Linux [Härtig et al., 1997].
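To make the user-level paging mechanism mentioned above concrete, the sketch below shows the control loop such a pager might run. The IPC and frame-allocation helpers (pager_wait_for_fault, pager_reply_with_mapping, allocate_or_lookup_frame) are invented names standing in for the real L4 primitives, so this is an illustration of the pattern rather than actual L4 code.

    /* Sketch of a user-level pager loop. The IPC and allocation helpers are
     * invented names, not actual L4 system calls. */
    struct fault { unsigned long addr; int write; int client; };

    extern struct fault pager_wait_for_fault(void);           /* blocks on IPC */
    extern void pager_reply_with_mapping(int client, unsigned long vaddr,
                                         unsigned long frame, int writable);
    extern unsigned long allocate_or_lookup_frame(int client, unsigned long vaddr);

    void pager_main_loop(void)
    {
        for (;;) {
            /* The kernel turns a client's page fault into an IPC message. */
            struct fault f = pager_wait_for_fault();

            /* Paging policy lives entirely at user level: pick a frame. */
            unsigned long frame = allocate_or_lookup_frame(f.client, f.addr);

            /* The reply carries a mapping; only the kernel touches the MMU. */
            pager_reply_with_mapping(f.client, f.addr, frame, f.write);
        }
    }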
With microkernels, most of the features usually associated with an operating system are implemented as user level programs. Because of this, it becomes a lot simpler to run multiple
operating systems concurrently and safely, though this is a feature which is not frequently exploited.
Disco and VMWare
The Stanford Disco [Bugnion et al., 1997] system, and its commercial successor VMWare, allow
multiple commodity operating systems to run on standard hardware without special support
for virtualisation.
Disco focuses mainly on better utilisation of machines with many processors, by running many
copies of the same operating system (SGI Irix) on the machine. Guest operating systems communicate via a machine-internal TCP/IP network, running atop a shared-memory layer for
maximum performance. Running multiple copies of Irix on a multiprocessor machine under
Disco is shown to yield better overall performance than running a single, multi-processor capable Irix, on the same machine.
VMWare is a derivative of Disco for the Intel Pentium family. Its main use is for letting different
operating systems, usually Windows and Linux, run concurrently on the same machine.
The basic idea of both systems is to let unprivileged code run directly on the CPU, at full speed,
while interpreting privileged instructions, to trick the guest operating system into believing it
is running directly on the real hardware.
Guest operating systems experience a standard hardware configuration, accessible using normal means such as memory-mapped I/O and Direct Memory Access (DMA). The underlying
host environment (known as the Virtual Machine Monitor (VMM)) simulates the semantics of
real hardware devices. For instance, VMWare simulates the popular AMD Lance network chip,
for which drivers are likely to exist for most operating systems.
As described in Robin and Irvine [2001], the Intel Pentium does not meet the requirements
for full virtualisation, because not all instructions reading or writing privileged processor state
cause processor traps. Therefore, VMWare has to audit and potentially modify all code before
allowing it to run directly on the CPU. The technique which is most likely used is described in
[Lawton, 1999], and will be briefly outlined below:
Initially, all memory pages corresponding to the pages containing guest code are filled with trap
instructions. When the CPU attempts to execute one of these instructions, a trap occurs, and
the trap handler fetches the original instruction, and checks whether it manipulates privileged
state or not. If not, the instruction is written to the code memory page and executed by the
CPU. If it does, the effects of the manipulation are simulated and the state of the virtual CPU
modified accordingly, and execution commences at the subsequent instruction. Only when all
contents of a memory page have been verified in this manner can the page run without any interpretation taking place, and thus at full speed.
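The sketch below illustrates the trap-handler logic just described; the helper names and the virtual CPU structure are invented here, and the real implementation is certainly more involved, so this captures only the general scan-before-execute idea.

    /* Conceptual sketch of scan-before-execute; all helper names and the vcpu
     * structure are invented and do not describe VMWare's real internals. */
    typedef unsigned char insn_t[16];

    struct vcpu { unsigned long pc; /* ... remaining virtual CPU state ... */ };

    extern int  fetch_guest_instruction(unsigned long guest_pc, insn_t out); /* returns length */
    extern int  is_privileged(const insn_t insn);
    extern void emulate_on_virtual_cpu(struct vcpu *v, const insn_t insn);
    extern void patch_into_shadow_page(unsigned long guest_pc, const insn_t insn);

    /* Called whenever the CPU hits one of the trap instructions that initially
     * fill every shadow code page. */
    void handle_code_trap(struct vcpu *v, unsigned long guest_pc)
    {
        insn_t insn;
        int len = fetch_guest_instruction(guest_pc, insn);

        if (is_privileged(insn)) {
            /* Simulate the instruction's effect on virtual CPU state and skip it. */
            emulate_on_virtual_cpu(v, insn);
            v->pc = guest_pc + len;
        } else {
            /* Harmless instruction: write it into the shadow page, so this
             * address runs natively (at full speed) from now on. */
            patch_into_shadow_page(guest_pc, insn);
            v->pc = guest_pc;   /* re-execute, now natively */
        }
    }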
Naturally, the process is more complicated than described, but one should notice the drawbacks
of this technique, namely that:
- It requires additional memory for partially-checked pages.

- It has a performance overhead when running unchecked code the first time.

- The interpretation of all CPU instructions requires precise knowledge of the underlying CPU architecture, and so is hard to implement and not at all portable to other hardware architectures.

- The VMM does not know if the guest operating system is busy or not, and risks wasting CPU cycles executing the guest idle loop. Disco relies on the guest issuing a special power-save instruction present on the MIPS CPU, but this approach is not very generic or portable.
The main advantage of the VMWare approach is that it allows unmodified guest operating systems to run, and thus is a pure virtual machine in the sense of Goldberg [1974]. As described in
Venkitachalam and Lim [2001], this purity is sometimes sacrificed for performance by modifying popular guest operating systems such as Linux specially for running under VMWare.
Jails
A simple way of partitioning a machine into several administrative realms, each with their
own set of users, and own set of processes, is the creation of virtual realms by performing extra
bookkeeping inside the same operating system. This is the approach taken in the FreeBSD “jail”
mechanism [Kamp and Watson, 2000]. The advantages to this approach compared to running
multiple operating systems on the same machine are, apart from its simplicity, that memory is
saved by only running one operating system, and that there is no virtualisation overhead on
hardware access.
Migration of entire jails between hosts is still a rather complex issue though, because running
processes are closely tied to the host operating system. It is also impossible for a virtual super
user to tailor the operating system to his needs, for example by installing extra kernel modules,
without affecting other users.
Java Virtual Machine
Interpreted languages running on a virtual machine, such as Java [Lindholm and Yellin, 1996],
may be easily and safely run in parallel in multiple virtual machines (VMs) on any computer
for which the VM is available.
As long as interpretation takes place entirely in software, and no just-in-time compilation is
applied, the process of suspending and moving virtual machines between physical hosts is
simple. Once just-in-time compilation is introduced, the process becomes more difficult [von
Laszewski et al., 2000]. Usually, Java virtual machines are able to access the local filesystem,
making residual dependencies a problem as well.
The biggest problem is the performance of interpreted or just-in-time compiled code, which is
still at least twice as slow as that of native code [Ertl et al., 2002]. Why distribute a computation
across multiple computers, if the speedup gain is largely negated by the overhead of using an
interpreted language?
Another problem of using interpreted languages is the need to rewrite legacy software, for
which source code or skilled personnel may not be available.
Choosing a technology
For the purposes of implementing nomadic operating systems, it would appear that microkernels have the largest potential. Because operating systems run as normal user level programs,
they may be subject to the microkernel’s scheduling policy, which solves the idle-loop problem
described for Disco above, and may be protected from each other by normal memory protection
methods.
Unlike the VMWare and Disco approaches, the ability to run unmodified commercial operating systems (such as Windows or Irix), is not an issue, due to the availability of source code
for commercial-quality operating systems such as Linux, NetBSD and FreeBSD, which can be
adapted with relative ease.
Because the ability to run existing applications is a goal, language based systems will not be
applicable. Neither will the pure virtual machine approach of VM/370, because it requires
special processor features not present in commodity hardware.
Process migration
Migration of entire operating systems has much in common with migration of single processes,
and lessons may be learned by looking at current process migration systems.
Process migration is the mechanism of transferring a running process from one physical machine to another. This is beneficial in theory because it allows standard applications to transparently utilise idle computing resources present on most networks, and frees human operators
from having to manually balance system loads. While not a new concept, process migration
has failed to become mainstream, due to a number of inherent problems and limitations.
To better understand the pros and cons of process migration, two of the most successful systems
are described below. The systems described are Sprite and MOSIX, in both of which process
migration is implemented as a part of the operating system, and is transparent to the end user
and the processes involved.
Finally, some of the problems hindering existing process migration systems from gaining wide
acceptance are discussed.
Sprite
Sprite [Ousterhout et al., 1988] was developed at the University of California at Berkeley. More
than just an operating system, Sprite creates a network-wide single system image, by providing specialised file services, network-unique process identifiers, built-in support for accessing remote files and services on the network, and transparent process migration [Douglis and
Ousterhout, 1991].
A process can exist in one of two places: On the home node, or on a foreign node. Initially all
processes start on the home node, but can then be migrated to another machine in the network
for completion. Often this is done at the process initiation time, where the environment of the
process is limited, and little information has to be transferred. Sprite attempts to utilise unused
computing resources available on the network due to workstation idle time. A workstation
becomes eligible for receiving foreign processes when it has been idle for a predetermined
amount of time. When the owner of the workstation reactivates it, all foreign processes are
evicted to their home nodes, from where they can be migrated again.
Sprite uses a specialised strategy for migrating virtual memory to the new node, utilising the
semantics of the Sprite network file system. When a process is migrated, it is frozen at the home
node, and all memory from its address space is flushed to a page file on disk. The process state
is then transferred to the foreign node, which can then page memory from the page file as
needed. The Sprite network file system ensures that in most cases the page file does not cause
disk operations, because the fileservers use their memory as cache for the pagefile, which will
be referenced again soon after it is written, when the process is resumed on the foreign node.
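As a rough outline of the migration sequence just described, the steps could be arranged as below; every function and path name here is hypothetical, since Sprite's actual kernel interfaces are not reproduced in this sketch.

    /* Hypothetical outline of Sprite-style process migration; every function
     * and path name below is invented for illustration. */
    #include <stdio.h>

    extern void freeze_process(int pid);
    extern void flush_address_space_to_pagefile(int pid, const char *pagefile);
    extern void transfer_kernel_state(int pid, const char *foreign_host);
    extern void resume_on_foreign_node(int pid, const char *foreign_host,
                                       const char *pagefile);

    void migrate_process(int pid, const char *foreign_host)
    {
        char pagefile[64];
        snprintf(pagefile, sizeof pagefile, "/pagefiles/%d", pid);  /* illustrative path */

        freeze_process(pid);                             /* stop execution on the home node */
        flush_address_space_to_pagefile(pid, pagefile);  /* dirty pages go to the page file */
        transfer_kernel_state(pid, foreign_host);        /* registers, open files, and so on */

        /* The foreign node demand-pages from the page file; the fileserver's
         * cache usually still holds the freshly written pages. */
        resume_on_foreign_node(pid, foreign_host, pagefile);
    }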
In Sprite, the home node of a process always contains residual dependencies of processes
migrated onto other hosts in the network. Some kernel calls have to be forwarded to the home
node for evaluation, while others, such as available physical memory, can be resolved directly
on the foreign node.
If a process has open files on the home node when migrating, any changed blocks are flushed
to the fileserver, prior to migration. If a process shares a file with another process on the same
node, i.e. a child process that has been forked, any caching of the file is disabled when one of
the processes migrate away from the node. Any access to the file will bypass the local cache,
and communicate directly with the fileserver, preserving UNIX file semantics.
Processes using memory-mapped I/O, for example for a frame-buffer, are not eligible for migration in
the Sprite system.
MOSIX
MOSIX [Barak and La’adan, 1998] is a system for transparent process migration, developed at
the Hebrew University in Jerusalem, Israel. The aim of MOSIX is to provide load balancing of
processes across a cluster of computers, without the need for a central scheduling server, and
without rewriting or recompiling applications. As in Sprite, the migrational unit in MOSIX is
the UNIX process.
Like Sprite, MOSIX has the concept of the home node of a process. For process migration
to become transparent to the individual process, all communication with local hardware or
file system is performed through a deputy process, which stays at the originating node after
the process itself has been migrated elsewhere. This limits the types of processes that can be
migrated without undue communications overhead, and MOSIX takes into account the amount
of local I/O performed, before deciding whether or not to migrate a given process.
MOSIX attempts to balance the processor and memory load of the nodes in the cluster, by probabilistically dispersing load information from each computer to a random subset of the cluster.
When a node running MOSIX decides that it has become too loaded, it will use information
previously received from other nodes to decide where to migrate one or more processes. When
recipient nodes have been selected, deputy processes are created at the home node, and processes migrated via the network. This probabilistic method of load balancing is suboptimal,
since the load information held at each node is incomplete, and may not be entirely up to date, but it eliminates the need for a central scheduling server, and thus a central bottleneck and single point of failure.
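A minimal sketch of this probabilistic load-dissemination scheme follows; the node table, the overload threshold and the messaging helpers are all invented for illustration and do not reflect MOSIX's actual data structures or algorithms.

    /* Sketch of probabilistic load dissemination and migration choice
     * (illustrative only; names and thresholds are invented). */
    #include <stdlib.h>

    #define MAX_NODES   64
    #define SUBSET_SIZE 4
    #define LOAD_LIMIT  1.5   /* arbitrary overload threshold */

    struct node_info { int id; double last_reported_load; };

    static struct node_info known[MAX_NODES];
    static int known_count;

    extern double local_load(void);
    extern void   send_load_report(int node_id, double load);
    extern void   migrate_some_process_to(int node_id);

    /* Periodically tell a small random subset of nodes about our load. */
    void disseminate_load(void)
    {
        for (int i = 0; i < SUBSET_SIZE && known_count > 0; i++)
            send_load_report(known[rand() % known_count].id, local_load());
    }

    /* When overloaded, migrate towards the least-loaded node we have heard from. */
    void maybe_migrate(void)
    {
        if (local_load() < LOAD_LIMIT || known_count == 0)
            return;

        int best = 0;
        for (int i = 1; i < known_count; i++)
            if (known[i].last_reported_load < known[best].last_reported_load)
                best = i;

        migrate_some_process_to(known[best].id);
    }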
The problem of traffic between home and foreign nodes has been addressed by the MOSIX
team by the development of the MOSIX File System [Amar et al., 2000] (MFS) and of Direct File
System Access (DFSA). MFS is a prototype network file system, with files arranged so that part
of the path name reveals on which node the file is physically located. MFS achieves UNIX file
semantics by removing all caching, so that it becomes safe to access files locally rather than via
the home node.
Because file locations are easy to determine from file names, MOSIX is able to migrate the
process to the file, rather than the other way round. MFS and DFSA are still at the research
stage, and suffer from the no-caching requirement needed to guarantee UNIX semantics on the
file system level.
The use of deputy processes on home nodes means that MOSIX processes have a higher risk
of failure than single-node processes, since either home or foreign node failing will take the
process down. MOSIX does however provide commands for expelling all foreign processes on
a node, which means nodes can be taken down without losing processes.
The residual dependency problem
Both Sprite and MOSIX suffer from the problem of residual dependencies. This problem occurs
when a process leaves behind open files or other state in the operating system on the originating host. Both systems attempt to solve this problem by leaving a proxy process on the
originating host, handling access to local resources on behalf of the migrated process. This solution is problematic for two reasons: performance and stability. If all access to a resource has
to go via the network, execution may be slowed, and by relying on resources on the originating host, the vulnerability of the migrated process to a machine crash is increased, because the
process will fail if just one of the two involved hosts crashes.
Note that under the assumptions made on page 5, operating system migration does not suffer
from the residual dependency problem.
Why process migration has failed to take off
In spite of the promises held by the technology of process migration, it has failed to become an
integrated part of any widely used operating system.
A number of non-technical explanations for this situation are listed in Milojicic et al. [2000],
including:
Lack of infrastructure Since no widely used operating system supports migration, a lot of
work is needed to gain the benefits of such a system.
Not a requirement Users have been content to work with remote invocation, and remote data
access.
Sociological factors In a Network of Workstations, it becomes a social issue of making the end
user accept that other users have access to his resources. Until recently, the typical workstation has had few resources to spare, but recent advances in CPU speed and memory
prices have changed this.
Another reason may be that the aforementioned systems are not as transparent as they would
like to be. In practice, the array of useful applications gaining from migration is very limited,
because of the residual dependency problem. In MOSIX for example, certain types of application (for example those using shared memory) cannot be migrated at all, and short running
processes such as compilers usually finish before the system has had time to decide whether
they should migrate or stay. These limitations force the application developer into redesigning
his application to better fit the needs of the system, and hence transparency is lost.
Add to this the complexity of implementing both schemes. The tight coupling to internal kernel data structures makes the schemes very vulnerable to changes elsewhere in the kernel. For
Sprite, the process migration code was easily and often broken by changes elsewhere in the kernel, and often the person causing the breakage did not realize the implications
of his or her changes to seemingly unrelated parts of the kernel. For MOSIX the problem is
twofold, since MOSIX is not part of the official Linux source tree, requiring MOSIX to catch up
on new features added, and changed kernel structures and semantics. Also, the sheer number
of system calls (over 200 for the Linux kernel), kernel interfaces and semantics to be managed,
makes it very difficult to validate that every single case is handled correctly.
This indicates to us that the single process as a migrational unit is not the correct choice, and if
migration as a concept is to succeed, it must be implemented against a simpler interface, and
be oblivious to changes to the operating system in general.
Persistent systems
The phrase checkpointing refers to the act of freezing a running program, so that it may continue
from the same stage, at a later time. Saving a document from a word processing program may
be thought of as manually checkpointing the state of the program to disk. When talking about
automatic checkpointing, or persistent, systems, the individual program takes no special action
to have its state checkpointed, but relies on an underlying system doing so automatically.
A checkpoint is a snapshot of all program state at a certain instant, from which the program can
continue. If the program runs independently of external input (i.e. all program variables are
bound), the result of continuing from a checkpoint should be equal to the result of running the
non-checkpointed program without interruption.
Traditional operating systems, such as UNIX, support only explicit checkpointing, through
filesystem operations. If a program wants its data to survive a system crash or restart, it has to
manually serialise its internal data structures and copy the serialised version to stable storage.
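As a concrete, if simplified, illustration of such explicit checkpointing, the snippet below serialises a small state structure to stable storage using plain POSIX calls; the state layout and file names are of course invented for the example.

    /* Minimal explicit checkpoint: write program state to disk and sync it.
     * The state structure and file names are illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    struct program_state {
        long iterations_done;
        double partial_result;
    };

    int checkpoint(const struct program_state *s)
    {
        int fd = open("checkpoint.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;

        if (write(fd, s, sizeof *s) != (ssize_t)sizeof *s ||
            fsync(fd) != 0) {                    /* force data to stable storage */
            close(fd);
            return -1;
        }
        close(fd);

        /* Atomically replace the previous checkpoint. */
        return rename("checkpoint.tmp", "checkpoint.dat");
    }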
A number of experimental persistent systems with built in automatic and transparent checkpointing have been suggested. If user level programs were able to rely on the operating system
automatically checkpointing their state to stable storage, they would be simpler to implement,
and the whole system would become more resilient to failure.
Some persistent systems rely on all applications being written in a special language, that inserts
checkpointing code at relevant places automatically at compile or interpretation time, while
others perform checkpointing of the whole system at regular intervals. Napier [Morrison et al.,
1990] is an example of the first kind, while the Eros single-level store [Shapiro and Adams,
2002] is an example of the latter. Interval-based checkpointing has also been suggested for L4
in Skoglund et al. [2001].
Because nomadic operating systems need to freeze all of their state in order to copy it to a
different machine, they will need a checkpointing mechanism, though not necessarily one that
is as efficient as if it were to be running constantly at short intervals. Other than as an enabler
of migration, the ability to checkpoint a running operating system is also practical for very long
running calculations, where a checkpoint may be backed up to another host every few hours,
so that months of work is not lost in case of a machine crash.
Chapter 4: Proposed solution
The configuration and reconfiguration problems of utility computing, and the problem of being
mobile without having to carry a laptop, may be solved by making operating systems nomadic.
By nomadic we mean able to migrate between physical hosts without loss of state and only
negligible downtime, and able to share a host with other nomadic operating systems.
Because fully virtualised systems are inefficient and complex on commodity hardware, we
propose to implement the nomadic operating system on top of a modern microkernel, running
a custom host environment, providing nomadic operating systems with shared abstracted access
to hardware resources.
Interface design considerations
If different operating systems are to run in the nomadic host environment, a standard way
of interfacing to the underlying environment must be defined, so that implementors will be
able to refer to this standard when creating or adapting new guest operating systems. As
whatever code lies outside this interface definition can be said to be “just implementation”, the
specification of this interface will be the final result of this work on nomadic operating systems.
Needless to say, its design must be well thought out.
This thesis is not about software engineering, but still, a few thoughts on when to design and
implement what, are in order.
Defining the way different software systems interface to one another is often a good way of
clarifying difficult design decisions about the internal workings, and of eliminating superfluous features and hacks. As such, designing the interface before implementing the actual code
behind it seems like a reasonable approach, since mistakes will then be eliminated by design.
However, when researching new areas, not all aspects of a given problem will be known beforehand.
In the book “Patterns of Software” [Gabriel, 1996], the author argues, basing his arguments
on those of famed architect Christopher Alexander, that software should be grown much like
classical farmhouses were grown for hundreds of years, by adding new features as they are
needed, rather than the way grand office complexes today are designed, trying to anticipate
any possible need up-front, which is often impossible in practice.
So on one side there is the argument that a good interface design is a necessary precursor to a
good implementation, and on the other the argument that things should only be designed as
they are needed.
However appealing the first argument may be, and however many times the authors of this
thesis have been told this by authoritative figures, it was not possible to possess all knowledge
required to design this interface (simple as it may be) before having implemented prototypes
of the algorithms that were ultimately going to hide behind it.
The resulting interface specification, detailed in appendix A, is the result of both planning and
experimentation, and has been revised all the way through the implementation process.
Choosing a microkernel
To achieve the stated goal of being able to run standard software on top of guest operating
systems (and “standard” here is considered UNIX or UNIX-like) at least the following features
need to be present in the chosen microkernel:
User level paging Guest operating systems cannot be trusted with access to the Memory Management Unit (MMU) [1] registers, because they would then be able to allow themselves
access to the address spaces of other guests. Instead, the microkernel must act as a middleman for manipulating MMU registers, and user tasks must be able to handle page faults
incurred by other user level tasks. This is necessary when running multi-address space
operating systems such as UNIX.
Interrupt multiplexing Interrupts are the basis of preemptive multitasking and of efficient
event-driven hardware access. The microkernel either needs to be able to multiplex interrupts between multiple operating systems, or it needs to support the implementation of
services for shared access to hardware by other means.
Efficient means of inter-address space communication Because guest operating systems have
to go through a hardware abstraction layer for any kind of I/O operation, a fast way of
communicating between the guest and the address space housing the abstraction is necessary.
Preexisting UNIX personality Because of limited time, it will not be possible to port an existing UNIX-like operating system to the chosen microkernel [2], while also implementing
features for concurrency and for migration. Therefore, a functioning port of a UNIX-like
operating system for the candidate microkernel must exist. Because operating systems
running on top of microkernels are not kernels, but rather normal user level programs
with operating system-like behaviour, they are often referred to as personalities.
Intel IA32 implementation While the Intel IA32 [3] architecture may not be the most elegant or
the most simple to program for, it is cheap, and plenty of machines are available, so the
candidates need to have an up to date IA32 version available.
[1] The purpose of the Memory Management Unit in a system is described in more detail on page 35.
[2] For example, the port of Linux to the L4 microkernel reportedly took 14 man-months to complete, while the entire time frame for this thesis project is 12 man-months.
[3] Meaning Intel 80386 and up. In reality, we are only really interested in the Pentium Pro and higher.
Source availability While not strictly a requirement, access to source code for the microkernel
will make it easier to understand and, if needed, debug.
These demands leave us with two prime candidates: Mach 3.0 and L4.
Mach 3.0
The Mach kernel, originally from Carnegie Mellon [Rashid et al., 1989], aimed at supporting a
number of different operating system personalities, by providing a uniform interface on which
operating systems could be run. The Mach kernel supported multiple threads within a shared
address space, secure IPC, virtual memory management and copy-on-write mechanisms for
lazy copying between address spaces.
Prior to version 3.0, the Mach kernel contained a full 4.2BSD UNIX implementation, but this
was moved outside the kernel in version 3.0, making Mach 3.0 a microkernel.
IPC in Mach is done through ports, where a port designates a receiver of the IPC, which can
grant other threads access to send messages to the object represented by the port by handing
out port capabilities. IPC under Mach is asynchronous, allowing tasks to simply deliver messages and not have to wait for the recipient to accept the message before continuing.
Under Mach it is possible for a parent task to insert shared libraries into the address space of
any children it spawns. The parent task may then notify the Mach kernel that any system call
traps in the child task are to be handled outside the kernel, by the shared libraries. These libraries
can then execute the system call, or forward it to the Mach kernel if needed. In this way, it
is possible for an operating system personality to handle syscalls from its own user processes.
Mach 3.0 has a working Linux implementation called MkLinux, originally developed by the
Open Software Foundation (OSF) in Grenoble, France in conjunction with Apple Computer,
in an attempt to port Linux to the PowerPC platform. MkLinux is based on the OSF’s own
implementation of the Mach 3.0 specification.
Recently, Apple has launched the MacOS X operating system, a modified Mach 3.0 kernel running a BSD personality.
L4
The purpose of the original L4 implementation was to prove that microkernels could be as
efficient as monolithic kernels. At the time, Mach was regarded by many as a failed experiment,
and Jochen Liedtke, the original creator of L4, set out to prove that the microkernel idea was
not in itself wrong.
L4 has the following relevant features:
Fast unbuffered IPC Current versions are able to perform inter address space IPC in as little
as 183 clock cycles (on a Pentium II, 400 MHz). Fast IPC is a requirement when building
hardware abstractions. Unbuffered IPC makes migration simpler, since less state has to
be extracted from the kernel.
Evolved and mature The L4 specification has evolved since 1996, with multiple implementations currently in existence. Work on improving it is ongoing, which is an important issue
when choosing a technology base for a new project.
Linux 2.2 port As mentioned, an existing port of a UNIX-like system, to serve as the base for an example nomadic operating system implementation, is a necessity. L4 currently only has a port of
Linux 2.2, but a port of version 2.4 is underway.
Limited kernel state While L4 is not a stateless kernel, the amount of kernel state relevant to
migration is very limited.
Source availability Current versions of L4 are available with source code under the GNU General Public License.
Due to these advantages, L4 was chosen as the base for the nomadic operating system implementation. A more detailed overview of L4 is presented below:
L4 abstractions
The L4 kernel provides a small set of abstractions for controlling and multiplexing hardware
resources, such as the central processing unit (CPU), the memory management unit (MMU),
memory, and peripherals such as network interface cards and hard drive controllers.
The CPU is abstracted via threads of control. A thread represents the execution of a program,
and when the thread is not running on the CPU, the L4 kernel stores its CPU state, namely the
instruction and stack pointers, as well as other general purpose registers. If a thread is waiting
(blocked) trying to perform IPC, information about the pending IPC message is stored in L4 as
well.
Threads are able to communicate and synchronise with each other by means of IPC. When
thread A wishes to communicate a message to thread B, it invokes the L4 IPC syscall. If B is
already waiting to receive a message from A, control is transferred along with the message to
B. If B is not already waiting to receive the message, A is blocked until B starts waiting for it.
Other than for inter-thread communication, IPC is also used in L4 for controlling the MMU,
and for delivering interrupts to programs that wish to handle them.
Multiple threads share the same address space. The combination of an address space and a
set of threads is known as a task in L4. A thread is able to manipulate the registers of other
threads in its own address space/task. When a new task is created, the first thread gets created
automatically by L4. This thread may then create a number of other threads if it desires.
Figure 1: The page fault and page reply mechanisms. The bent arrow pointing from the faulting thread's pager to L4 indicates IPC interception of the page reply message from the pager to the faulting thread.
When the MMU raises a page fault exception because a thread is trying to access memory for
which no mapping exists, L4 looks up the pager of the running (and thus faulting) thread.
The page fault is transformed into an IPC message describing the fault, and the message is
forwarded to the pager. The pager is also a thread, which may then decide how to react to
the page fault. It encodes its answer as a page mapping IPC message, with which it replies to
the faultee. L4 sees that this is a mapping message, intercepts it, and programs the MMU to
provide the corresponding mapping in the address space of the faultee. Figure 1 shows the two
steps of the process.
Interrupts are abstracted via IPC as well. A thread can ask L4 (again, via IPC) to become the
handler of a given type of interrupt, handed out by L4 on a first come basis. When an interrupt
of the given type occurs, it is transformed into an IPC message, which is then forwarded to the
handling thread.
IPC is the central concept in L4, and the L4 creators have put much effort into making IPC as
fast as possible. By restricting who may send IPC to whom in the system, various security
policies may be implemented.
Clans and Chiefs [Liedtke, 1992] is the security mechanism offered by L4. A clan is a set of tasks
that can communicate freely. These tasks are owned, or controlled, by a chief which is the only
member of the clan that can make IPC to outside tasks. Whenever a task within the clan tries to
send IPC to a task outside the clan, the chief task will automatically intercept it, and potentially
modify it before either dropping the IPC or forwarding it to the intended recipient.
A chief owns all tasks in its clan, meaning it is the only task able to create and delete them,
though tasks may be donated to other chiefs. Initially, all tasks are unowned, but the first
server to start will usually reserve them all for itself, to later hand them out when requested,
according to some security policy.
L4Linux
L4Linux is a Linux personality running as a single user level task on top of L4. We use the word
personality instead of kernel, so as not to confuse the different entities in the system.
Because L4Linux is an adaption of a traditional monolithic operating system to a microkernel,
the operating system itself runs as a single L4 task. If one were to implement an operating
system on L4 from the beginning, it might be a good idea to split the operating system into
multiple server tasks, but this was deemed too complex a job by the L4Linux implementors.
The L4Linux server runs in multiple threads, but in one common address space. Please see part
2 of figure 6 on page 38 for an overview of the structure of L4Linux.
After L4 and some L4-specific bookkeeping tasks, such as the backing pager for the entire system memory, have been started, L4Linux starts. L4Linux requests all interrupts and memory
in the system. This allows it to access hardware as normal, including using its standard set of
device drivers.
Figure 2: L4Linux signal handling mechanism
When the hardware resources have been initialised, L4Linux starts the first user level process [4].
The first process started in a UNIX-like system is traditionally called /sbin/init. This process may then start other processes, resulting in the end in a usable system.
When a process is started, an L4 task is set up to contain it, so that it gets its own private
address space. The new task contains two threads: one for the process’ own code, and
an additional thread for handling signals. The signal thread is necessary because only threads
in the same task are allowed to manipulate one another. The signal thread runs an infinite
loop waiting for IPC from L4Linux, telling it to deliver a signal to the main thread. A signal
is delivered by forcing (by manipulation of its instruction pointer) the main thread into the
appropriate signal handling code. See figure 2.
Linux syscall trap instructions are translated into IPCs by a user level trap handler [5], and then
forwarded to a thread in the L4Linux task. This thread handles all syscall IPCs, acts as a pager
for user processes, and is multiplexed internally by Linux’ own scheduler. User processes are
currently not scheduled by L4Linux, but by the L4 strict priority scheduler instead, though this
is likely to change in the future. See figure 3.
The original L4Linux implementation on the hand-optimised L4 x86 kernel has been shown to
run 5-10% slower than native Linux [Härtig et al., 1997].
[4] We distinguish here between a process, which is the Linux object representing a program and an address space, and a task, which is a set of L4 threads and an address space.
[5] As an optimisation, it is also possible to perform the syscall directly via IPC, without the need for the mediating user level trap handler, but requiring modifications to the original program or shared libraries such as libc.
Figure 3: L4Linux syscall handling mechanism.
Different migration schemes
Disregarding the survival of long running computational tasks, a simpler solution to
the problem of migrating a set of services from one machine to another would be to simply boot
the target machine from the same filesystem as the source. The downside of this approach is
mainly downtime, namely the time taken to boot a new system. If nomadic operating
systems are to compete in this scenario, downtime has to be smaller than what can be achieved
simply by rebooting on a different machine.
One problem with migration is the trail of residual dependencies often left behind, and how
the risk of failure of the migrated task is affected by it. While one of the points of migrating operating systems instead of just processes is to eliminate residual dependencies, some
migration schemes may in fact reintroduce them. If residual dependencies exist, the risk of task
failure as a result of a host machine crash grows with the length of the dependency trail, since the migrated task then depends on every host in the trail staying up.
A number of different approaches for migration, with various effects on downtime and risk of
failure, are described in Milojicic et al. [2000].
Simple copy The task is stopped, and memory as well as task state information is transferred
to the new host, where the state is restored and the task resumed. This scheme is the
simplest scheme, and transfers all data only once. The task is unresponsive during the
entire copy, but no residual dependencies exist.
Lazy copy The task is stopped and kernel and CPU task state is transferred to the new host,
where the task is resumed immediately. When the task pagefaults, the faulting page is
fetched from the old host before the task is allowed to continue. The task is only unresponsive for the short amount of time taken to transfer task state to the new host, but
performance of the task running at the new host suffers initially, since every pagefault
must be resolved across the network. Lazy copy leaves a residual dependency trail on
all machines on which the task has ever run, but ensures that only memory actually accessed by a task is copied across the network. As noted in Zayas [1987], typical behaviour
for a task is to access only 25-50% of the total amount of memory it has allocated when
running.
Precopy The task is left running at the source machine and all memory owned by the task is
mapped read-only and copied to the new host. When the still running task attempts to
modify one of its pages, a page fault is raised, and the page marked as dirty. The page is
then mapped read-write to the task, which continues running.
After the initial transfer, a subset of the pages will be dirty, and these are again mapped
read-only and copied to the new host. This goes on until the dirty subset is sufficiently
small, after which the task is suspended at the originating host, the remaining dirty pages
copied across, and the task is resumed on the new host.
The downtime of the task is reduced to the time taken to copy the last set of dirty pages
to the new host, but a number of pages will be copied more than once.
The precopy approach leaves no residual dependencies on the originating host, and thus
does not suffer from the increased risk of failure present in lazy copy.
Figure 4: The three schemes compared. The left side of each block designates the initiation of the migration procedure, and the marked point designates the time at which control is transferred to the new host. The transfer block in “Lazy copy” immediately after this point signifies the resolving of the initial page fault for the running program.

Figure 4 illustrates the differences between the three schemes.
Because of the residual dependency problems of lazy copy, only simple copy and precopy were
implemented in this project for migration via the network. Lazy copy may be interesting if
migrating via removable storage, as described at the end of this section.
It is possible to improve upon the precopy scheme, ensuring that frequently used memory
pages are sent as rarely as possible. This algorithm may be described as queued precopy, and
can be implemented by using a queue Q and a set W. Q is a queue of read-only marked pages
ready for transfer, and W is a set of writable pages which cannot be transferred. P is the total
set of pages to be migrated.

1. Initially, W = P.

2. Each page in W is removed from W, marked read-only in the MMU page tables, and is
queued on Q. This step is rerun at a certain interval, until migration completes, and does
not need to be atomic.

3. If the task incurs a page fault by attempting to write to a page in Q, the page is removed
from Q and is inserted into W. The page is marked writable in the MMU.

4. A background thread monitors Q. It continually dequeues the first page on Q and transfers it across to the target system. If Q becomes empty (other than initially), the size of
W may be gauged, and the time needed for a transfer (and thus the downtime of the
system) be estimated. If the downtime is acceptably small, the task is suspended, and the
remaining pages in W are marked read-only (thereby queuing them on Q). The pages now
in Q may be transferred across, and the migration completed.
The tendency should be for W to shrink in size rather than grow. If not, then the task is doing
a lot of sparse writes and must be slowed down, by delaying the mapping of writable pages in
step 3, or by decreasing the update interval of step 2.

The reason for Q being a queue rather than a set is that pages which are infrequently written
to will stay in Q longer and be transferred first, reducing the number of double page transfers.
In addition to support for queueing operations, Q must also allow for fast lookup and removal
of arbitrary elements, as needed by step 3.
The queued precopy algorithm has not yet been implemented. Currently a simple precopy,
performing a fixed number of iterations, is used instead.
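To make the bookkeeping behind queued precopy concrete, the following is a minimal C sketch of one possible arrangement of Q and W, assuming a fixed number of pages and reducing the MMU and network operations to placeholders (mark_read_only, mark_writable and send_page are our own names, not part of L4 or NomadBIOS). Removal of a page from Q on a write fault is implemented lazily, by skipping pages that have become writable again when the queue is drained.

    #include <stdbool.h>
    #include <stddef.h>

    #define NPAGES 1024                   /* P: total number of pages to migrate (assumed) */

    /* Placeholders for MMU manipulation and the network transfer; the real
     * system would issue L4 mapping IPCs and NomadBIOS network calls here. */
    static void mark_read_only(unsigned page) { (void)page; }
    static void mark_writable(unsigned page)  { (void)page; }
    static void send_page(unsigned page)      { (void)page; }

    static unsigned ring[NPAGES];          /* Q: read-only pages queued for transfer */
    static size_t   q_head, q_tail;
    static bool     in_w[NPAGES];          /* W: writable pages that cannot be sent  */

    static void enqueue(unsigned p) { ring[q_tail++ % NPAGES] = p; }

    /* Step 2: move every page in W onto Q, marking it read-only.            */
    static void requeue_dirty_pages(void)
    {
        for (unsigned p = 0; p < NPAGES; p++)
            if (in_w[p]) {
                in_w[p] = false;
                mark_read_only(p);
                enqueue(p);
            }
    }

    /* Step 3: a write fault on a queued page moves it back into W.          */
    void on_write_fault(unsigned p)
    {
        in_w[p] = true;                    /* the drain loop will skip it    */
        mark_writable(p);
    }

    /* Step 4: background transfer; sends every still-clean page on Q.       */
    static void drain_queue(void)
    {
        while (q_head != q_tail) {
            unsigned p = ring[q_head++ % NPAGES];
            if (!in_w[p])
                send_page(p);
        }
    }

    /* Driver: a fixed number of precopy rounds, as in the simple precopy
     * currently implemented; a full queued precopy would instead stop when
     * the estimated downtime for the pages remaining in W is small enough.  */
    void migrate(void)
    {
        for (unsigned p = 0; p < NPAGES; p++)
            in_w[p] = true;                /* step 1: initially W = P        */
        for (int round = 0; round < 4; round++) {
            requeue_dirty_pages();
            drain_queue();
        }
    }

The skip-on-drain approach gives the fast "removal" from Q that step 3 requires without needing a second data structure, at the cost of leaving stale entries in the ring until they are dequeued.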
When migrating via removable media, it will be desirable to be able to resume operation as
soon after the media is inserted as possible. For this scenario a variant of lazy copy, or on-demand paging, with pages being loaded from the media as they are needed, becomes interesting. This is similar to how programs are loaded in UNIX-like systems. Though the system
will run slowly at first, it should start responding almost immediately, provided the media is
of decent speed.
More challenging is the initial checkpointing of the operating system image onto the media.
The user should not have to wait overly long for this to happen. Infrequently changing pages
may be copied to the media ahead of time, hopefully leaving only a small percentage remaining
when the user decides to detach the removable media from the host system. An algorithm
similar to the one outlined for network migration above could be used, although it would be
kept running at all times. Continuously copying pages to the removable media may seem a
waste of CPU, but the utilisation of direct memory access (DMA) for copying pages should
limit the overhead.
Guest interface considerations
Since the nomadic operating system is just an application of the L4 microkernel architecture, most
of the interface will have to adhere to its specifications. The choices left to us mainly have
to do with address space layout, pagefault handling, checkpointing behaviour and hardware
abstraction semantics.
The different options available when designing the interface between host environment and
guest operating system are outlined below:
Address space layout
Since L4 allows arbitrarily nested address spaces, it is possible to define whatever layout appears suitable.
Most operating systems will expect a flat, contiguous address space spanning the total
amount of physical memory. Often, hardware architectures attach special meaning to
various memory areas, using them for providing access to hardware control registers, or for
information put there by the firmware at boot time, for example the total amount of RAM,
harddrive geometries, and so forth.

Figure 5: The proposed address space layout of the NomadBIOS system. NomadBIOS controls all available memory and maps it to the L4Linux tasks, which in turn map it to their user level processes.
Since guest operating systems will not be able to directly access any hardware, but will have
to go through various IPC based abstractions, theoretically no special memory areas will be
needed for hardware access. In practice, a few special areas will have to exist.
L4 provides a special kernel information page, from where values such as a global time tick
counter may be read. A special page with guest-specific information will also be provided by
the host environment. Shared pages for zero copy hardware access, for instance for incoming
or outgoing network datagrams, may also be necessary for performance. The locations of such
special pages in the guest address space should be determined by the guest, so that different
needs and traditions of guest operating systems may be catered to.
A model of the NomadBIOS address space layout is seen in figure 5.
Page fault and page request handling
If a thread in L4 tries to access a memory page for which no mapping exists, the resulting page
fault is translated into an IPC message, destined for the thread acting as pager for the faulting
thread. It then becomes up to the pager thread to decide whether a page should be provided,
and to return it. The faulting thread is blocked until a mapping has been provided for it. The
pager may also choose to take other action, such as to kill the faulting thread on illegal memory
access.
In L4, memory page mappings are sent via IPC from the pager to the faulting or requesting
thread. These special IPC messages are intercepted by the L4 kernel, which performs the actual
maintenance of the MMU page tables.
For L4 programs, an IPC protocol known as the σ0 protocol, named after the program which is
often used as the root pager in an L4 system, exists. The σ0 protocol allows a caller to request
mappings for special pages, such as the kernel info page, and for special 4MB “superpages”,
available on the Intel Pentium family of CPUs. The use of superpages lessens the number of
entries in MMU page tables, so that address space switches become faster.
When designing a pager interface for guests, the question becomes whether to rely entirely
upon page faults, lazily providing pages as they are needed, or to just provide all mappings up
front.
In favour of the lazy approach speaks the fact that not all the memory available to a guest may
actually be needed, and so can be saved for other guests in the system [6]. Also, the code for
resuming a suspended guest becomes simpler, because the normal page faulting mechanism
will make sure to request pages as they are needed.
Against the lazy strategy speaks the impossibility of requesting special pages by other means
than the attachment of various semantics to page locations, for example to agree that the kernel
info page is always at address 1000 hexadecimal, and that all other page faults result in superpages. Imposing such special rules onto the address space layout may hurt future uses of the
system, and should be limited as much as possible.
The best solution is probably to allow the guest to provide hints as to its wishes or demands of
special mappings, but not to provide any actual mappings before they are needed. In this manner, only the memory for recording the hinting information is wasted, as opposed to perhaps
almost an entire address space.
Checkpointing behaviour
Currently no stable implementation of L4 provides all the features needed for wholly transparent checkpointing of a running program. Even though user level pagers provide easy access to
all in-memory program state, some information about it is only available to L4 itself.
If an L4 thread is in the kernel doing IPC, there is no way for another thread to determine this,
and no way of suspending it in a manner that it may later be restarted safely. A thread that
is manipulated from the outside, for example as part of a suspend or other signal operation,
cancels any pending IPC, returning an error code. This means that only the thread itself (plus
possibly the thread at the other end of the IPC) will know that IPC was ongoing, and should
possibly be restarted for things to continue correctly.
[6] Even when providing mappings lazily for a guest, the guest must be guaranteed access to the entire address space if it needs it, or alternatively the guest will have to be suspended and relocated to a machine with more memory available. The amount of memory available to a traditional operating system does not change after the system has started.
Experimental versions of L4 have been implemented [Skoglund et al., 2001], wherein L4 itself uses an external pager thread to provide memory for kernel-internal thread control blocks
(TCBs), allowing a trusted thread, whose identifier is specified at compile time, to access, and
possibly checkpoint, L4 internals. This approach will only work if the kernel and the checkpointer have a common understanding of the internal structures, meaning that the kernel cannot be upgraded without also upgrading the external pager, and vice versa.
An alternative method is to handle checkpointing non-transparently, by requiring all threads in
the program to carefully check the error codes returned by the L4 IPC syscall, thereby becoming able to safely restart cancelled IPC operations, after the checkpointed program is revived.
Even though this approach comes with the overhead of a few machine instructions after each
relevant IPC in a guest operating system, it was deemed more elegant and better suited for this
project, than simply wrenching open the L4 internals.
Checkpointing will not be entirely transparent to the guest, which will have to respond to a
special call to wrap up all its running processes, so that all their states are entirely in main
memory, before it can be checkpointed safely.
A positive side effect of this non-transparency is the ability for the guest to flush unimportant
caches, for example those of network file systems, or take any other measures to compact the
guest memory image, before it is migrated over the network.
Hardware abstractions
Although L4 itself does not impose any restrictions on accessing the hardware devices available,
it does not make any effort to aid this either. A task can request access to specific hardware
interrupts, and to special memory mapped I/O ports. Other than this, the task has to communicate directly with the hardware itself. Access to hardware devices is handed out by L4 on a
first come basis, so if multiple tasks are to share the same devices, one task will have to act as a
proxy for the others.
For simplicity, only a shared Ethernet device will be implemented, because network abstractions for all important I/O types are present in Linux. Some considerations about how to best
share a network interface are presented on page 43.
Chapter 5: Working with the L4 kernel
This chapter introduces the central L4 concepts, and gives an overview of the most important
programming primitives, in order to provide the reader with a basic understanding of how one
might implement a classic full scale operating system on top of L4, and a feeling for the work
involved in implementing features such as operating system suspension and resumption.
L4 basics
L4 provides a minimal set of primitives (syscalls) for manipulation of address spaces and
threads, and for performing IPC between threads.
A task is a set of threads sharing the same address space. When a new task is created, its first
thread is started automatically. This thread may then start additional threads if necessary.
A system call (syscall) in L4, as in most other kernels with memory protection, is invoked by
means of a trap instruction. Traps are special instructions causing the CPU to enter supervisory
mode and jump to a predetermined address containing handling code for the specific trap
number, as specified in the trap instruction. The handler receives the general purpose registers
of the program triggering the trap, and will usually treat these as parameters for the syscall.
When the syscall is complete, any return values may be placed in the registers, and control
returned to the trapping program by means of a special return instruction, which also changes
the CPU state back to user mode.
The original L4 kernel implemented seven [1] syscalls, each with a separate trap handler invoked
(on Intel IA32) with the int instruction, with trap numbers 0x30 to 0x36 hexadecimal [2].
The L4 syscalls generally deal with task creation and deletion, thread register manipulation,
and IPC. Thread- and task identifiers are specified by means of a (task number, thread number) pair, stored in
one or two CPU registers, depending on implementation. Below are listed the syscalls most
regularly used:
task-new Creates or deletes a task with a specific number. Whether the call acts as creation or
deletion depends on the parameters given. For a new task, initial instruction and stack
pointers for its first thread are specified, as is an identifier for a thread acting as a pager
for the first thread. If the thread incurs a page fault by addressing unpaged memory, this
pager is contacted via IPC. When a task is initially created, it has an empty address space.
[1] In later specifications the number of syscalls has grown to 12, with new calls for optimised intra-address space IPC and improved scheduling and thread management.
[2] Note that this is a little different than for example Linux, which multiplexes all syscalls through a single trap vector.
ipc This syscall handles all communication and synchronisation between threads. Depending
on its parameters it can act as send, as wait, or as a combined send-and-wait. IPC
is unbuffered and thus always blocking. IPC messages may be entirely register-based,
or perform memory copying and memory mapping, and so no special virtual memory
manipulation syscalls are required. IPCs can be given a timeout value, and a thread can
be put to sleep simply by performing IPC with the destination parameter set to nil.
thread-ex-regs The thread-ex-regs syscall performs thread register manipulation. It allows modification of the instruction and stack pointer registers of a running thread, and creation of
threads by initial specification of these registers. The general purpose CPU registers of a
thread cannot be read directly, but instead the thread-ex-regs call may be used for redirecting a thread to a function which stores the registers into a memory location. Thread-ex-regs acts only on threads in the local task, so signal handlers and similar functionality
have to be implemented as extra threads in this task.
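As an illustration of thread-ex-regs, the following C sketch shows how a server might give a so-far-inactive thread in its own task an initial instruction pointer and stack, turning it into, for example, the per-process signal thread of L4Linux. The thread_ex_regs wrapper below is a simplified stand-in for the real syscall stub (which also exchanges pager identifiers and returns more state); all names and the stub body are our own assumptions.

    #include <stdint.h>

    typedef uintptr_t l4_addr_t;
    typedef uint32_t  l4_threadid_t;     /* simplified; real identifiers are larger */

    /* Simplified stand-in for the thread-ex-regs trap: set the instruction and
     * stack pointers of thread `tid` in the current task, returning the old
     * values. The body is a placeholder; the real stub executes the trap.      */
    static void thread_ex_regs(l4_threadid_t tid,
                               l4_addr_t ip, l4_addr_t sp,
                               l4_addr_t *old_ip, l4_addr_t *old_sp)
    {
        (void)tid; (void)ip; (void)sp;
        *old_ip = 0;
        *old_sp = 0;
    }

    static uint8_t signal_stack[4096];   /* private stack for the new thread */

    static void signal_thread_entry(void)
    {
        for (;;) {
            /* wait for signal IPCs from the L4Linux server and redirect the
             * main thread into its signal handler (not shown)               */
        }
    }

    /* Create the signal thread by giving an inactive thread of the task an
     * instruction pointer and a stack of its own.                           */
    void start_signal_thread(l4_threadid_t signal_tid)
    {
        l4_addr_t old_ip, old_sp;
        thread_ex_regs(signal_tid,
                       (l4_addr_t)signal_thread_entry,
                       (l4_addr_t)(signal_stack + sizeof signal_stack),
                       &old_ip, &old_sp);
    }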
Checkpointing
The L4 microkernel was not designed with checkpointing in mind, but provides a number of
features which make implementing checkpointing possible. The main feature is the possibility
of having user level threads handle page faults incurred by other user level threads.
When a user level program is able to handle page faults, and to manipulate page mappings
and page protection flags for other programs, it becomes possible for that program to copy
(and thus checkpoint) the data contained in these pages, and to restart the program from the
copied data. Other than the data which is kept in main memory, the CPU has a number of
registers, all of which must also be saved and restored for a checkpoint to be complete.
L4 has no direct support for transparently checkpointing the state of an entire set of running
tasks from another user level task. User level programs do not have access to kernel internals such as thread control blocks (TCBs), so there is no way to determine if a given thread is
engaged in an IPC operation at a given time, and thus no way to safely suspend it.
However, if a thread-ex-regs operation is performed on a thread performing IPC, the IPC
syscall will return a special error code, alerting the thread to the fact. The thread is then able
to handle this case correctly, for example by retrying the syscall. If all threads cooperate in this
manner, suspension still becomes possible, although not entirely transparent. One might argue
that malicious threads will be a problem, but threads that refuse to cooperate are only able to
hurt themselves, and may be killed by the system.
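The cooperative scheme can be sketched as a small wrapper around the IPC stub. The result code IPC_CANCELED and the stubs below are our own placeholders; the real L4 bindings define their own set of IPC error codes, of which "cancelled by an external thread-ex-regs" is the one of interest here.

    #include <stdbool.h>

    /* Placeholder result codes and IPC stub; the real stub performs the trap. */
    enum ipc_result { IPC_OK, IPC_CANCELED, IPC_ERROR };

    static enum ipc_result ipc_send(int dest, const void *msg)
    {
        (void)dest; (void)msg;
        return IPC_OK;
    }

    /* Called at a safe point so the checkpointer can copy the thread's state;
     * in L4Linux this role is played by the signal thread. Placeholder.       */
    static void wait_until_resumed(void) { }

    /* Send a message, transparently restarting it if it was cancelled because
     * an external thread (the suspender) ran thread-ex-regs on this thread.   */
    bool send_with_restart(int dest, const void *msg)
    {
        for (;;) {
            switch (ipc_send(dest, msg)) {
            case IPC_OK:
                return true;
            case IPC_CANCELED:
                wait_until_resumed();       /* checkpoint happens here         */
                continue;                   /* then reissue the same IPC       */
            default:
                return false;               /* genuine error: give up          */
            }
        }
    }

The overhead mentioned in the text is visible here: one extra comparison per IPC in the common, uncancelled case.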
In the case of Linux running atop L4, every Linux process is implemented as an L4 task,
consisting of a thread running the actual process code, as well as a signal handling thread; see
figure 2 on page 25. The signal handling thread can be extended to handle suspension of the
main process thread, without changing the Linux semantics of the system.
A Linux process is still able to stop or otherwise damage its L4 signal thread, either by will or
by accident, but this is equivalent to the program crashing, and may be handled similarly.
By using the signal thread to implement suspension, transparent checkpointing of Linux processes becomes possible. Similar techniques may be used for other guest operating system
personalities.
Virtual memory
All modern multi-user operating systems make use of the memory management unit (MMU),
today an integrated part of most full scale CPUs. Originally, there was a one-to-one correspondence between the memory addresses used by programs and the location of the memory bits
inside physical storage. Furthermore, any program would be able to access any location at will.
This was deemed impractical for a number of reasons, mainly:
Malicious programs were able to access arbitrary addresses, leaving innocent programs
open to attacks or as victims of other people’s buggy programs.
Swapping or paging unused program data to disk automatically was not possible, because there was no way to track what memory was being accessed. Programs too large
to fit inside physical memory would have to implement their own disk-paging code.
In systems using dynamic allocation, memory would become fragmented over time, leaving a reboot as the only means of defragmentation.
The MMU introduces a layer of indirection between the CPU and memory, allowing encapsulation of untrusted programs inside closed address spaces, demand-paging of memory from
disk, and easy defragmentation of memory.
The MMU uses a number of memory tables to dynamically translate memory accesses into
physical addresses, with the most frequently used combinations being cached for performance.
MMUs are implemented differently in different architectures, but they commonly handle memory in rather large chunks (pages), so that the translation tables do not have to take up the entire
physical memory. On the Intel 80386 and 80486 CPUs the page size is 4kB, whereas the Intel
Pentium supports 4MB pages as well. Large page sizes result in smaller page tables, at the cost
of coarser granularity.
Programming the MMU registers is a privileged operation, and most traditional monolithic
operating systems, such as Linux, only allow the user very limited control of the MMU. User-level tasks all reside in their own memory spaces with a fixed layout, and are unable to directly
page other programs. Microkernels, such as Mach and L4, on the other hand, often allow
user-level programs to manipulate the system-wide pagetables in some secure manner. This
approach is more flexible in that it allows programs to implement their own paging policies,
but has also been claimed by some to be less effective.
For the purposes of this project, the need for user-level paging is clear. Guest operating systems
are untrusted, and thus need to run at the user-level, but they also need to be able to implement
arbitrary paging policies for their own client programs, so they need to be able to program the
MMU themselves.
In L4, virtual memory is abstracted into the IPC mechanism. When a thread accesses memory
for which no mapping has been provided, the MMU triggers a page fault exception, which goes
to L4. L4 determines which thread caused the page fault (which is easy, since it must have been
the thread currently running) and looks up the corresponding pager thread. The page fault is
translated into an IPC message describing the page fault (faulting instruction pointer, faulting
address, read or write operation) which looks like it comes from the faultee, and is sent to the
pager. Since IPC is blocking, the faultee blocks. The pager may now decide how to handle
the page fault, for example by allocating some more memory for the faultee, and answer via
another IPC message. The answer is intercepted by L4, which performs the actual page table
and MMU manipulations. Figure 1 on page 23 illustrates the two steps of the process.
Since page faults look just like IPC, it is entirely possible for a program to contact a pager by
means of IPC instead of actually pagefaulting. This is sometimes useful, for example when setting up a large initial address space. It is also possible to receive page mappings from multiple
pager threads if desired.
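The pager side of this mechanism can be sketched as a simple receive/reply loop. The message layout, the IPC wrappers and the frame allocator below are assumed placeholder names, not the actual L4 bindings; they only mirror the fault information described above (faulting address, faulting instruction pointer, read or write).

    #include <stdint.h>
    #include <stdbool.h>

    /* Simplified view of a page fault IPC; the real L4 encodes this
     * information in the message words.                                     */
    struct fault_msg {
        uint32_t address;
        uint32_t ip;
        bool     write;
    };

    /* Placeholders for the real IPC-wait and mapping-reply syscall wrappers,
     * and for the host environment's frame allocator; all names are ours.   */
    static int  wait_for_fault(int *faulter, struct fault_msg *f)
    { (void)faulter; (void)f; return -1; }
    static void reply_with_mapping(int faulter, uint32_t virt, uint32_t phys,
                                   bool writable)
    { (void)faulter; (void)virt; (void)phys; (void)writable; }
    static uint32_t allocate_frame(void) { return 0; }
    static void     kill_task(int faulter) { (void)faulter; }

    #define GUEST_MEM_LIMIT (64u * 1024 * 1024)   /* assumed per-guest quota */

    /* The pager thread: block waiting for fault IPCs, decide how to react,
     * and reply with a map message, which L4 intercepts and turns into MMU
     * page table entries for the faultee.                                   */
    void pager_loop(void)
    {
        int faulter;
        struct fault_msg f;

        while (wait_for_fault(&faulter, &f) == 0) {
            if (f.address >= GUEST_MEM_LIMIT) {
                kill_task(faulter);               /* illegal memory access   */
                continue;
            }
            reply_with_mapping(faulter, f.address & ~0xfffu,
                               allocate_frame(), f.write);
        }
    }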
Different L4 versions
The L4 specification has undergone a number of changes since its initial release. The main traits
of each version are outlined below. The focus of the descriptions is the Intel IA32 platform,
since this is the platform initially chosen for this project. An overview of the plans for future
L4 development is given in Liedtke et al. [2001].
Version X.0 This is the original specification for the L4 x86 kernel, described in [Liedtke, 1999].
It specifies a 32 bit thread identifier, part of which is a (task number, thread number)
pair of 8 plus 6 bits for the identification of the thread itself, resulting in a maximum of
256 different tasks with 64 threads each. Because the thread identifier parameter only
takes up one 32 bit register, three registers are available for user data in IPC messages.
Aside from the original L4, the Hazelnut C++ implementation also follows the X.0
specification. (A sketch of this identifier layout is given after this overview.)
Version 2 This version is a derivative of X.0, and is used by the Fiasco C++ kernel. Version 2
[Hohmuth, 1998] sacrifices one of the user data IPC registers in exchange for having 64
bit thread identifiers, leading to 2048 different tasks with 128 threads each. Due to its
larger number of tasks, our work has been based mainly on Fiasco.
Version X.2 While no public version is available, the Pistachio L4 kernel is based on the X.2
specification [Dannowski et al., 2002]. X.2 does not employ the fixed size (task number,
thread number) pair from the earlier versions, but rather a single 18 bit thread identifier,
resulting in room for up to 2^18 threads. Because every thread can have its own address
space in X.2, it becomes possible to have 2^18 tasks as well. Version X.2 is also the first
version of L4 to specify multiprocessor support. Another new feature is a special fast-path
IPC call for use between two threads in the same address space, described in Liedtke and
Wenske [2001].
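The 8-plus-6-bit identifier layout of version X.0 mentioned above can be illustrated with a few lines of C. The exact field positions within the 32 bit word are an assumption made for illustration; only the field widths (8 bits of task number, 6 bits of thread number) are taken from the text.

    #include <stdint.h>
    #include <stdio.h>

    /* Version X.0: an 8-bit task number and a 6-bit thread number inside a
     * 32-bit identifier, giving 256 tasks of 64 threads each. The offsets
     * used here are illustrative, not the real layout.                      */
    #define TASK_BITS    8u
    #define THREAD_BITS  6u

    static uint32_t make_tid(uint32_t task, uint32_t thread)
    {
        return (task << THREAD_BITS) | (thread & ((1u << THREAD_BITS) - 1));
    }

    static uint32_t tid_task(uint32_t tid)
    {
        return (tid >> THREAD_BITS) & ((1u << TASK_BITS) - 1);
    }

    static uint32_t tid_thread(uint32_t tid)
    {
        return tid & ((1u << THREAD_BITS) - 1);
    }

    int main(void)
    {
        uint32_t tid = make_tid(42, 3);
        printf("task %u, thread %u\n", tid_task(tid), tid_thread(tid));
        return 0;
    }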
Chapter 6: Implementation
This chapter describes two approaches to implementing a workable host environment for nomadic operating systems, and the changes made to L4Linux to turn it into an example nomadic
operating system.
A number of things may go wrong when making fundamental changes to the workings of a
complex and evolved operating system like Linux, and the reader shall be spared many of the
details and tales of dead ends encountered on the way. Suffice it to say that not everything was
as simple to implement as it may seem from the text.
Host environment
The host environment is a user level program running on top of L4, functioning as a service
layer for guest operating systems, and as a control panel for controlling operations from the
outside.
The host environment implements both an interface with which guest operating systems may
interact to access machine resources, and a control interface by which the guests may be managed by a human operator.
Two different host environments were implemented, one based on the existing L4Linux, and
one made from scratch using the Utah OSKit [Ford et al., 1997].
Using L4Linux as a host environment has the advantage that management tools may be easily
implemented as user level processes, and that these tools can be remotely accessed by standard
means such as Telnet or Secure Shell. L4Linux also includes the full Linux 2.2 set of device
drivers. The downside is the overhead of running a full L4Linux kernel.
The native host environment – called NomadBIOS – does away with this overhead, at the
expense of having to implement network stacks and drivers for hardware, as well as a protocol
for remote management.
L4Linux as a host environment
Initially this environment was used as a development platform, in order to get multiple L4Linux
guests running at the same time.
Figure 6: 1) Native Linux interfaces hardware directly and runs in supervisor mode. 2) L4Linux
running as user level process communicates with L4 which in turn handles IRQ forwarding via
IPC. 3) NomadBIOS as middle tier, forwarding IPC messages to several guest Linuxes.
When using L4Linux as the host environment, the guests may be run encapsulated in ordinary
Linux processes, or alongside the host L4Linux, in a disjoint physical memory area.
Running guests as ordinary Linux processes would allow the host environment to take advantage of the built-in L4Linux to-disk paging algorithms, thereby increasing the perceived
amount of memory accessible.
However, the original L4Linux was hard to fit inside the restrictions normally faced by a user
level L4Linux process, and instead it was decided to run the guests alongside the host. The
current version of nomadic L4Linux would be better suited for running as an L4Linux process,
though, and the avid reader should feel free to experiment.
When running the guests alongside the host environment, resources are allocated from the
global pool of resources, rather than from the host environment’s pool of resources. In this configuration, the host environment must be instructed not to grab all physical memory during
boot, and not to allocate every L4 task, but rather just enough to get the host environment running. Because the host environment needs to be able to access unallocated resources
directly, it was implemented as a loadable module which could be inserted into an unmodified
L4Linux.
The Ethernet abstraction is the only modification necessary to L4Linux for it to serve as a functioning host environment. Filter and bridging code is inserted into the Linux network packet
handling loop, and packets destined for a guest are forwarded via L4 IPC.
The NomadBIOS host environment
The NomadBIOS native host environment is similar in nature to the L4Linux kernel module
solution, with regards to memory allocation and task allocation. The main difference is that
the network interface card, rather than being interfaced by the host L4Linux, must be accessed
by a native L4 server. The native host environment comprises a number of separate L4 server
threads, each responsible for one distinct feature of the host environment. The servers are
highly independent, and are rarely interfaced directly by other servers than the control interface server, so that theoretically they could be made to run in separate address spaces, protected
from each other.
Each server has, in addition to an IPC guest interface, a number of back-end interfaces allowing,
for instance, the control server to register a new client with the network server. These interfaces
serve to tie the BIOS servers together, and cannot be accessed by guest operating systems.
The network server needs direct access to the Ethernet network interface card (NIC), and
should preferably support a wide range of hardware. To this end, the Utah OSKit set of operating system components was chosen. Apart from supplying low-level NIC drivers, OSKit
also provides a TCP/IP stack, on top of which a remote management protocol has been implemented. This protocol allows NomadBIOS to be controlled from the outside by a small
command line tool called runclient.
Because the native host environment acts more like a firmware (or Basic Input Output System
– BIOS) providing abstracted and uniform hardware access, than like an operating system, it is
referred to as the NomadBIOS.
Figure 7: An overview of the services running in NomadBIOS.
Choice of host environment
While both environments were implemented during the course of this work, the L4Linux kernel module option was eventually abandoned, as it became clear that the benefits of running a
full L4Linux as a host were limited, and maintaining two separate interfaces became a burden.
No benchmarking was done using the L4Linux host environment. The L4Linux host environment primarily served as a convenient test-bench for experimentation, since the module could
be unloaded and re-loaded quickly, rather than having to reboot the entire system. When the
host environment interface was stable, focus shifted to the NomadBIOS native host environment solution, after which the L4Linux version became obsolete.
This is not to say that the L4Linux host environment would not serve a purpose. If revived,
it would make it possible to add nomadic Linux servers to an already running L4Linux on
demand, making it an optional service of the host L4Linux, rather than the dedicated purpose
of the host. One would also gain free access to the complete set of Linux device drivers.
Since both implementations provide the same interface, the choice of host environment may be
ultimately left to the end user.
Sharing host resources
As a rule, the hardware resources in a computer system do not support shared access. If, for
example, two programs write to a harddrive without some entity coordinating their access,
the contents of the drive are likely to end up in a corrupted state. The operating system traditionally fills out this coordinating role, arbitrating access to resources on behalf of multiple
users.
In the nomadic operating system setup, operating systems become applications, and the role
of the sole arbitrator is filled instead by the host environment. Below we list the resources that
the host environment needs to arbitrate access to. Apart from the physical hardware, kernel
resources such as task numbers must also be shared.
L4 task numbers
Current L4 kernels have a fixed number of tasks which must be shared between the host environment and the guest operating systems. For the Hazelnut kernel this number is 256, and for
the Fiasco kernel 2048. Each task represents a separate address space and a set of up to either
64 or 128 threads. Of the tasks, four are used for special L4 servers, and one is used for the
NomadBIOS. The remaining tasks must be shared by the guest operating systems, limiting the
number of guest kernels able to perform useful work. To make the best use of the scarce task
resource, various allocation policies can be implemented.
We considered two approaches to this issue:
Splitting the available tasks into equally sized consecutive ranges. If each guest were to
be given 255 tasks, eight guests would be able to run on a Fiasco-based system at once,
which would be satisfactory at the moment.
Alternatively, different guests could be allowed different amounts of tasks, similar to how
they are allowed to use different amounts of memory. Tasks could be requested from the
host environment on demand, so that guests needing few tasks would make more room
for guests needing many.
Because the former approach is simpler to implement than the latter, and since upcoming versions of L4 allow as many as 2^18 tasks, which will probably solve the problem for all practical
purposes, fixed size consecutive task ranges were chosen.
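A sketch of the fixed-range scheme follows, using the numbers given above (2048 Fiasco tasks, 255 tasks per guest, and four L4 servers plus NomadBIOS itself reserved at the bottom of the range). The function and constant names are our own.

    #include <stdio.h>

    #define TOTAL_TASKS     2048   /* Fiasco task limit                       */
    #define RESERVED_TASKS  5      /* four L4 servers plus NomadBIOS          */
    #define TASKS_PER_GUEST 255
    #define MAX_GUESTS      ((TOTAL_TASKS - RESERVED_TASKS) / TASKS_PER_GUEST)

    /* Return the first task number of guest slot g, or -1 if out of range.   */
    static int guest_task_base(int g)
    {
        if (g < 0 || g >= MAX_GUESTS)
            return -1;
        return RESERVED_TASKS + g * TASKS_PER_GUEST;
    }

    int main(void)
    {
        /* With these numbers, MAX_GUESTS works out to eight guests, matching
         * the estimate in the text.                                          */
        for (int g = 0; g < MAX_GUESTS; g++)
            printf("guest %d: tasks %d-%d\n", g, guest_task_base(g),
                   guest_task_base(g) + TASKS_PER_GUEST - 1);
        return 0;
    }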
Physical memory
The Fiasco L4 kernel, in its current incarnation for the Intel IA32, supports no more than 256MB
of physical memory, though this limitation is currently being addressed by the Fiasco team.
As with the tasks, there is a choice of pre-allocating physical memory to a guest, or handing it
out as required. If memory is only allocated when actually needed, more guests will be able to
run in less memory, in the case where not all guests actually use their full allowance. However,
most modern systems will quickly utilise all available memory for caching filesystem contents,
so unless that behaviour is changed there is no point in allocating memory on demand.
Memory for guests is always handled in 4MB superpage chunks; on the Intel Pentium the superpage
is an efficient unit for the MMU, resulting in small page tables and few page faults. A simple
bit vector is used for keeping track of reserved and free pages. The only exception is when a
guest is migrating. In that case, 4kB pages are used because of their finer granularity.
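The bit vector itself is small: with the 256MB Fiasco limit mentioned above there are only 64 superpages to track, so a single 64-bit word suffices. The following sketch uses our own names for the allocator; it is an illustration of the idea, not the NomadBIOS code.

    #include <stdint.h>
    #include <stdio.h>

    #define SUPERPAGE_SIZE (4u * 1024 * 1024)
    #define PHYS_MEM       (256u * 1024 * 1024)          /* Fiasco/IA32 limit */
    #define NSUPERPAGES    (PHYS_MEM / SUPERPAGE_SIZE)   /* 64 superpages     */

    static uint64_t used;                  /* one bit per 4MB superpage */

    /* Allocate the first free superpage; returns its index, or -1 if
     * physical memory is exhausted.                                    */
    static int superpage_alloc(void)
    {
        for (unsigned i = 0; i < NSUPERPAGES; i++)
            if (!(used & (1ull << i))) {
                used |= 1ull << i;
                return (int)i;
            }
        return -1;
    }

    static void superpage_free(int idx)
    {
        used &= ~(1ull << idx);
    }

    int main(void)
    {
        int a = superpage_alloc();
        int b = superpage_alloc();
        printf("first two superpages at %#x and %#x\n",
               a * SUPERPAGE_SIZE, b * SUPERPAGE_SIZE);
        superpage_free(a);
        return 0;
    }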
Timer interrupt
Since guest access to the physical hardware on the host machine is not allowed, it is not possible
for the guests to receive timer interrupts.
The timer interrupt in Linux is used to allow preemptive scheduling to occur, to update process
runtime information, and to update internal kernel timers.
Since every L4Linux process is a native L4 task, the scheduling of processes is handled natively
by L4’s own strict priority scheduler.
It is possible for a task in L4 to handle the scheduling of other L4 tasks, by donating the remainder of its own timeslice to another task. The current L4Linux does not attempt to control
scheduling of user level tasks though.
Rather than having the host signal each guest every time a timer interrupt occurs, idle guests
go to sleep for a short interval using the timeout feature of L4 IPC, polling for things to do
(for example running the bottom half handler for incoming network packets) when they wake
up, and then go back to sleep. Because preemption and scheduling are handled in L4, it should
be possible for guest operating systems to become entirely event-driven, with no need for any
external timer ticks. Unfortunately, L4Linux (which is derived from Linux 2.2) relies too much
on the presence of timer ticks to be easily turned into an entirely event-driven system.
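The polling idle loop can be sketched as follows. The ipc_receive_timeout wrapper and the work-polling helpers are placeholder names standing in for the timeout feature of the L4 IPC syscall and for the guest's own bottom-half handling; the 10 ms interval is an assumption.

    #include <stdbool.h>

    /* Placeholder for a blocking receive with a timeout, standing in for the
     * timeout feature of the L4 IPC syscall. Returns true if a message
     * arrived before the timeout expired.                                    */
    static bool ipc_receive_timeout(unsigned microseconds)
    {
        (void)microseconds;
        return false;
    }

    /* Placeholders for the work a woken guest performs.                      */
    static bool have_pending_work(void)        { return false; }
    static void run_bottom_half_handlers(void) { }   /* e.g. queued packets   */

    #define IDLE_SLEEP_US 10000u     /* assumed polling interval: 10 ms */

    /* Idle loop of a guest: instead of receiving forwarded timer interrupts,
     * sleep with an IPC timeout and poll for work on every wakeup.           */
    void guest_idle_loop(void)
    {
        for (;;) {
            bool got_ipc = ipc_receive_timeout(IDLE_SLEEP_US);
            if (got_ipc || have_pending_work())
                run_bottom_half_handlers();
            /* then go back to sleep */
        }
    }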
Realtime clock
Linux implements the gettimeofday system call, which returns the current time in seconds
since 1/1-1970 00:00:00, with millisecond accuracy. This is much higher accuracy than what
can be obtained from the motherboard real time clock.
The IA32 architecture exposes a high precision time stamp counter (TSC), a 64-bit value which
is incremented at every clock cycle. Linux uses this counter to calculate the current time, in order
to get the high accuracy. Since the TSC counts clock cycles, the speed of the actual processor
must be known in order to get a correct clock multiplier for converting between clock cycles
and wall time.
When an operating system migrates, a new value for the clock multiplier must be obtained in
order to avoid clock skew due to migrating to a faster or slower CPU.
NomadBIOS exposes the clock multiplier in the guest info page, and it is the guest’s responsibility to re-read this field after migration.
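The following C sketch illustrates how a guest might use that field. Only the existence of a clock multiplier field in the guest info page is taken from the text; the structure layout, the field names and the placeholder TSC reader are our own assumptions.

    #include <stdint.h>

    /* Assumed layout of (part of) the NomadBIOS guest info page; invented
     * for illustration, apart from the clock multiplier itself.             */
    struct guest_info_page {
        uint32_t cpu_khz;       /* clock multiplier: CPU frequency in kHz    */
        uint32_t guest_id;
        /* ... further fields, e.g. shared network buffer locations ...      */
    };

    /* Placeholder for reading the IA32 time stamp counter (the rdtsc
     * instruction); a real guest would use a short inline assembly stub.    */
    static uint64_t read_tsc(void) { return 0; }

    /* Convert elapsed TSC cycles into microseconds using the multiplier
     * published by the host. After a migration the guest must re-read
     * gip->cpu_khz, since the new host may run at a different clock speed.  */
    uint64_t elapsed_usec_since(const struct guest_info_page *gip,
                                uint64_t tsc_start)
    {
        return (read_tsc() - tsc_start) * 1000u / gip->cpu_khz;
    }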
Network packet filtering
The physical network interface has to be multiplexed between the guest operating systems,
as well as the host environment’s own network stack. If a network packet arrives at the host,
its destination address must be examined, and a decision made as to whom the packet should be
relayed, or whether it should simply be ignored.
This filtering can be done in several ways, depending on the tradeoffs one is willing to make in
the areas of flexibility, security, and performance.
The flexibility choice is about whether guests should be able to decide the networking protocols
supported dynamically, or if it is acceptable to hard code a fixed set of protocols into the
host environment.
The security choice is whether or not guests can be trusted to examine all incoming packets,
even packets not destined for themselves, or whether this should be considered a security breach. If
guests can be trusted, all packets may just be forwarded to all guests, deferring the filtering
decision to them.
The performance choice is related to the security choice above. Consulting every guest for
every packet will lead to poor performance if naively implemented, but on the other hand
operations may be sped up by allowing all guests to read packets from a memory buffer shared
between all of them, as opposed to having to copy packets from host to guest address spaces (or
grant or map memory pages, an operation which takes some time as well), in order to enforce
access restrictions.
If flexibility is important, it is not possible to implement a static filter in the host. Instead, either
all clients must be consulted for each packet, or clients must be able to specify filtering rules
in a general way. The DPF [Engler and Kaashoek, 1996] system, part of the Exokernel project,
allows filters to be specified in a domain-specific language, and downloaded into the host. DPF
employs dynamic code generation, aided by runtime knowledge of installed filters, and claims
performance on par with, or superior to, hand-crafted packet filters.
For the purposes of this project, flexibility is less important than ease of implementation and
performance. The only protocol currently supported is IP version 4, which means that the destination for an incoming packet can be determined solely by looking at the 32-bit destination IP address of the packet. Currently, the list of clients is searched linearly, as the number of clients is assumed to be small (less than ten), but this search could easily be implemented as a binary search instead, leading to O(log n) complexity. If more flexibility were to be desired, an approach like the one taken by DPF would be necessary.
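As an illustration of the lookup described above, a minimal sketch follows; the guest record and its fields are invented for the example and do not correspond to the actual NomadBIOS data structures.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical per-guest record kept by the host environment. */
    struct guest {
        uint32_t ip_addr;        /* guest IPv4 address, network byte order */
        /* ... packet queue, thread id of the guest network driver, ... */
    };

    /* Linear search over the (assumed small) list of guests; if the list were
     * kept sorted by address, this could become a binary search instead. */
    struct guest *lookup_guest(struct guest *guests, size_t nguests, uint32_t dst_ip)
    {
        for (size_t i = 0; i < nguests; i++)
            if (guests[i].ip_addr == dst_ip)
                return &guests[i];
        return NULL;    /* not for any guest: pass to the host stack or drop */
    }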
Packets are copied from the host to the destination client, incurring some overhead. This is
done for the security reasons described above, as well as to avoid garbage collection issues
when sharing memory between host and clients. To increase performance, packets are queued
up at both host and client, before being transferred via IPC in larger chunks, thereby amortising
address space switching costs. Performance might be further improved by avoiding copying,
but this is left for future work.
Adapting L4Linux as a guest operating system
In order to turn L4Linux into an example nomadic operating system, able to be seamlessly
migrated between hosts on a network, a number of changes were necessary. These changes
will be described in more detail shortly.
The only hardware abstraction implemented by the current host environment is an Ethernet
device. However, most Linux services like file systems (NFS), terminal access (SSH), graphical
display (X), and so on, are already abstracted via the network in Linux, so it is possible to run
a full blown operating system on this abstraction alone.
Besides dramatically simplifying the abstractions needed to get several L4Linuxes running side by side, this also reduces the complexity of migration, because only data inside the guest address
space has to be transferred. Network state in the abstracted layer does not necessarily have to
be transferred, since applications designed for the Internet should tolerate a few lost packets.
For some applications, this setup is fine, while for others it is inadequate. In the future, other
types of abstractions (mainly access to hard drives) will be implemented as well.
The nomadic L4Linux was configured to boot from an NFS-exported file system on another
server, and was accessed from the outside via Secure Shell (SSH). Both the X Window System
and Virtual Network Computing (VNC) were tested and found to work for access to applications with graphical user interfaces.
Because the actual guest operating system configuration is entirely customisable, other file
systems, for example AFS [Satyanarayanan, 1990] or Intermezzo [Braam and Nelson, 1999],
might be used instead of or as supplement to NFS.
Hardware abstraction layers
As described, standard L4Linux requests all available interrupts from L4 at startup. From this
point on, hardware is accessed as normal, using the standard Linux set of device drivers.
If multiple Linuxes are to share the same physical hardware, they cannot be allowed to allocate
any interrupt, so a hardware abstraction layer with multiplexing functionality must be created.
The use of a uniform abstraction layer has the added advantage of allowing guest operating
systems to migrate between systems with dissimilar peripheral hardware, without any trouble.
Below the abstraction layer, code must exist to access the physical hardware, either by implementing its own drivers, or by being based on a host operating system which provides the
drivers for it.
The only abstraction actually implemented for this project is a shared Ethernet device. This
device is implemented in NomadBIOS on top of the OSKit network drivers, and as a bridging
plug-in in the network layer of the host version of L4Linux. The L4 IPC mechanism is used
for passing network packets between guest and host address spaces. L4Linux already implemented a driver for an IPC-based Ethernet adapter, and this was used as a base for the client
side of the implementation, though over time it was heavily modified.
To lower the amount of context switches necessary when sending or receiving many packets in
short succession, packets are not forwarded immediately, but gathered into a buffer of up to 16
packets, before they are sent as a single IPC message. The buffer is flushed either if full, or if a
small timeout expires, to keep network latency low. This optimisation, inspired by the work done in VMWare [Venkitachalam and Lim, 2001], somewhat amortises the cost of the added context switches incurred by using an abstraction, and is applied to both incoming and outgoing traffic.
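A rough sketch of this batching scheme is shown below. The batch size of 16 and the flush conditions are taken from the description above; the packet type, the IPC helper and all other names are invented for illustration.

    #define BATCH_MAX 16                       /* packets per IPC message */

    struct packet;                             /* opaque in this sketch */
    void send_batch_ipc(struct packet **pkts, int n);   /* hypothetical: one IPC per batch */

    struct pkt_batch {
        struct packet *pkts[BATCH_MAX];
        int            count;
    };

    static void batch_flush(struct pkt_batch *b)
    {
        if (b->count == 0)
            return;
        send_batch_ipc(b->pkts, b->count);     /* one address space switch per batch */
        b->count = 0;
    }

    /* Queue a packet and flush when the batch is full.  A small timeout
     * (not shown) also flushes a partially filled batch, to keep latency low. */
    static void batch_enqueue(struct pkt_batch *b, struct packet *p)
    {
        b->pkts[b->count++] = p;
        if (b->count == BATCH_MAX)
            batch_flush(b);
    }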
The network multiplexing and demultiplexing (filtering) implementation is described in more
detail in its own section below.
Hardware abstraction layers for sharing other devices such as hard drives have not yet been
implemented, but it is reasonable to assume this to be a rather trivial task, and this should be
addressed in the future.
Figure 8: Handling IPC send cancellation (the send operation is simply retried until it completes without a cancellation or abortion error).
Figure 9: Handling IPC receive cancellation (the receive operation is retried in the same way).
Disabling interrupts
For critical sections, Linux (on Intel hardware) uses the cli and sti instructions to turn off
and on all interrupts. While the original L4Linux was allowed to use these instructions, this is clearly not acceptable when running multiple systems at once, as it would allow a guest to monopolise the CPU. Fortunately, the same problem arises for realtime applications, and this has
already been addressed in Härtig et al. [1998]. Recent versions of L4Linux can be configured to
emulate the effects of cli and sti using a queued lock instead. Because some sections of the
Linux timer calibration code need a real cli context to be precise, these had to be relocated to
the host environment.
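A highly simplified sketch of the idea behind such an emulation is given below; it is not the actual L4Linux mechanism, and the lock primitive and handler names are placeholders. The point is merely that a "cli" section holds a lock that the interrupt handler threads also need, so interrupt delivery queues up instead of being lost, and no guest ever disables interrupts machine-wide.

    /* Placeholder lock primitive; the real L4Linux implementation uses its own. */
    typedef struct { volatile int held; } lock_t;
    void lock(lock_t *l);
    void unlock(lock_t *l);
    void handle_irq(int irq);                  /* the guest's normal interrupt handling */

    static lock_t cli_lock;

    /* Critical sections take the lock instead of executing cli/sti. */
    static void emu_cli(void) { lock(&cli_lock); }
    static void emu_sti(void) { unlock(&cli_lock); }

    /* Interrupt handler threads take the same lock before delivering an event,
     * so delivery is deferred while a "cli" section is in progress. */
    static void irq_thread_deliver(int irq)
    {
        lock(&cli_lock);
        handle_irq(irq);
        unlock(&cli_lock);
    }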
Suspending L4Linux and its user processes
Before a guest L4Linux instance can be migrated to another host, it has to be suspended to a
safe state. This means that all threads, both in the L4Linux task and in all user processes, have
to save CPU and kernel state to main memory, from where it may later be retrieved.
The CPU state of a thread consists of its instruction pointer, stack pointer and general purpose registers, and also the state of the floating point unit if used.
The kernel state of a thread describes whether and how it is currently involved in IPC. A thread can be in one of the following five states:
Ready: Not doing IPC, ready to run when scheduled.
Waiting to receive: The thread is waiting for some other party to start a send operation.
Waiting to send: The thread has invoked a send operation, but the receiving party has not yet invoked a corresponding receive operation.
Receiving: A receive operation is in progress. IPC can perform large copying operations, and may be subject to scheduling and to interruption by other threads.
Sending: As above, in the opposite direction.
It is also possible to perform two-phase IPC in L4, combining a send operation and a subsequent receive operation in a single syscall.
Figure 10: Handling two-phase IPC cancellation (if the operation was interrupted before the send phase completed, the whole call is retried; otherwise only a single-phase receive is restarted).
There is no way to determine either the CPU or the kernel state of a thread from outside the thread, so this has to be done by the thread itself. The CPU state is known to the thread, because it knows the values of its own registers, and the IPC syscall returns error values corresponding to each of the last four states above when interrupted.
In the implementation, the thread-ex-regs syscall is used to force a given thread into special
suspension code which pushes all CPU state to a location in memory. If the thread was interrupted doing IPC, the eax register will contain the appropriate error value corresponding to
the states above. After resumption, each thread is responsible for handling this situation
correctly. This means that every IPC syscall invocation in L4Linux had to be transformed in the
following way:
In the case of cancellation or abortion of single-phase IPC, the IPC is restarted.
In the case of cancellation or abortion of two-phase IPC, the error code describes how far
the operation got before interruption. If the operation did not get past the send phase, it
may be restarted completely. Otherwise, a single-phase receive should be started instead.
Figures 8, 9 and 10 show the methods used to handle cancellation for send, receive and two-phase IPC respectively.
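Since the figure listings reproduce poorly, the same pattern can be paraphrased in C; the IPC_* functions and error codes below are stand-ins for the real L4 IPC bindings, not the actual API.

    /* Placeholder declarations standing in for the L4 IPC bindings. */
    enum ipc_err { IPC_OK, ERR_SEND_CANCELLED, ERR_SEND_ABORTED,
                   ERR_RECV_CANCELLED, ERR_RECV_ABORTED };

    enum ipc_err IPC_SEND(int dest, void *msg);
    enum ipc_err IPC_RECEIVE(int src, void *reply);
    enum ipc_err IPC_CALL(int dest, void *msg, void *reply);

    /* Single-phase send (figure 8): restart until no cancellation/abortion
     * error is returned.  Receive (figure 9) is handled the same way. */
    static void send_retry(int dest, void *msg)
    {
        enum ipc_err err;
        do {
            err = IPC_SEND(dest, msg);
        } while (err == ERR_SEND_CANCELLED || err == ERR_SEND_ABORTED);
    }

    /* Two-phase call (figure 10): retry the whole call if interrupted before
     * the send phase completed; otherwise restart only the receive phase. */
    static void call_retry(int dest, void *msg, void *reply)
    {
        enum ipc_err err;
        for (;;) {
            err = IPC_CALL(dest, msg, reply);
            if (err == ERR_SEND_CANCELLED || err == ERR_SEND_ABORTED)
                continue;                              /* send never happened */
            if (err == ERR_RECV_CANCELLED || err == ERR_RECV_ABORTED)
                do {
                    err = IPC_RECEIVE(dest, reply);
                } while (err == ERR_RECV_CANCELLED || err == ERR_RECV_ABORTED);
            return;
        }
    }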
In the case of user processes, the existing L4Linux signal handling thread was expanded to manipulate the main program thread into a suspended state.
In the L4Linux task, one thread takes care of suspending all the other threads in a similar fashion. This thread’s only purpose is to wait for the host environment to ask it to suspend. Before
suspending the various internal kernel threads, it IPCs the signal threads of all user processes
to suspend. When the signal threads of all user processes have responded that suspension is
complete, the corresponding tasks are deleted from L4 with the task-new syscall, and the kernel
threads are suspended. The suspender thread then IPCs back to the host environment that suspension is complete. This last IPC message contains an instruction pointer for code which may
be used to later revive the guest operating system from its suspension. The host environment
may then delete the L4Linux task from L4 at will.
At this point, all that is left of the once running guest is a contiguous chunk of memory, along
with an instruction pointer, which is easily copied to a new host where it may be revived, as
described below.
Resuming L4Linux and its user processes
From the viewpoint of the host environment, a suspended guest image is revived simply by
creating a new L4 task, beginning execution at its specified resumption address.
Inside the new task, the first thread that runs performs the following operations:
First it recreates all of the internal kernel threads, starting with the internal kernel pager, which
will handle page faults incurred by the rest of the kernel, usually by forwarding them to the
host pager.
The rest of the threads are restarted in code which makes them go to sleep until they receive a
go signal. This is necessary because of the chicken-and-egg situation which occurs when both
user processes and kernel threads expect to be able to communicate right away, and cannot
handle the other party not existing yet.
Each user process is then recreated as a new L4 task. The new tasks start at a special resumption
address within the user signal handling code. This code starts a new signal handling thread
for the process, restores the CPU state from the copy previously stored in memory, and jumps
to the location on which the process was originally interrupted.
At this point, the newly created user task will have incurred two page faults. The resumption
code resides in a page shared by all user tasks, and the CPU state resides in a page specific
to each task. If the process was involved in a syscall or page fault before suspension, special
care must be taken, because the L4Linux syscall and page fault server thread cannot handle
multiple syscalls or page faults at once from the same process (which makes sense because this
is usually not possible). To solve this problem, an extra pager thread used only by recovering
processes was added to L4Linux. Once a process is recovered, it switches back to the normal
L4Linux pager.
A specialised version of libc was made for L4Linux which performed syscalls via direct IPC
to L4Linux, eliminating the need to go through a user level trap handler. This is not compatible
with the above mentioned approach, since there is no simple way of knowing where the libc
syscall IPC code is placed in the address space of the user level process prior to suspension.
This could be addressed by adding additional functionality to the extra recovery pager, but doing so was considered to be beyond the scope of this project.
Once all user processes have been recreated, all server threads are given the go signal, and the
suspender thread goes back to sleep. The guest is now running as before.
Pure demand paging model
Because page faults in L4 are turned into simple IPC messages, it is possible for a task to explicitly request memory by making an IPC to its pager thread directly. The L4 documentation defines a page request protocol, known as the sigma0 protocol, to which pagers may adhere. The protocol, named after the sigma0 L4 backing pager, allows extended paging requests, for example requests for 4MB superpages instead of the normal 4kB pages. The L4Linux startup code
makes use of this protocol to explicitly request superpages for as much of its memory as possible, for performance reasons. After booting, standard L4Linux has requested all the memory
available, and will make no further page faults to its backing pager.
While this behaviour is beneficial in a traditional single-operating system setup, because the
use of superpages is faster than normal pages, it is problematic in a migration scenario, because
L4Linux assumes that page mappings do not disappear once requested. When resuming a
guest in an empty address space, this assumption does not hold. Furthermore, the guest should
not be allowed to decide or even know in what kind of pages it is running, as this may change
during the lifetime of the guest. For example, the precopy algorithm, described on page 28,
unmaps the entire guest address space and later temporarily maps back only the pages actually
needed by the guest to continue running. While superpages are used for the entire address
space for normal operation, the temporary precopy mappings are created as 4kB pages, because
of their finer granularity.
Because of this, the paging model in L4Linux was changed to be purely demand driven. No
explicit requests for memory are made, but instead page faults incurred by user processes and
kernel threads are implicitly forwarded to the host pager by the Linux pager touching the
faulting addresses. The host pager may then choose whether to map back superpages or normal pages at its own discretion. This also has the advantage that disk paging and similar
techniques may be applied at the host level, completely transparent to running guests.
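The forwarding trick can be pictured roughly as follows; the page fault message format and names are simplified placeholders, not the real L4Linux pager code.

    /* Hypothetical, simplified page-fault message from a user process. */
    struct pf_msg {
        unsigned long addr;    /* faulting virtual address */
        int           write;   /* non-zero for a write fault */
    };

    /* The Linux pager thread simply touches the faulting address itself; the
     * resulting fault in the Linux server is then served by the host pager,
     * which maps back either a 4kB page or a 4MB superpage at its discretion. */
    static void forward_fault_to_host(const struct pf_msg *pf)
    {
        volatile char *p = (volatile char *)pf->addr;

        if (pf->write)
            *p = *p;       /* write access forces a writable mapping */
        else
            (void)*p;      /* a read access suffices for a read fault */
    }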
Task identifier migration issues
Because of L4’s flat task identifier space, the task number for the Linux server, as well as for
the user level tasks, cannot be assumed constant across migrations, because another guest may
have already been allocated in the same task number range.
Guest operating systems will need to deal with situations in which there is only partial or no
overlap between task numbers before and after migration.
Part of the solution to this problem is to introduce a level of indirection, by using some type
of naming service to convert virtual task identifiers into actual L4 task identifiers. In the case
of L4Linux, this indirection already exists for user level processes, as Linux Process IDs (PIDs)
may be mapped via their kernel-internal task structs into L4 task identifiers.
Upon resumption after a migration, the L4Linux server will have to change these identifiers
into newly allocated ones. Conversely, the signal handling code of user space tasks stores an
identifier for the L4Linux server in a shared page, which will need to be updated as well.
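The indirection can be sketched as a simple remapping pass over a per-process table; the types and helper functions below are invented for the example and are not the actual L4Linux task_struct fields.

    typedef unsigned int l4_taskid_t;                      /* simplified */

    struct shared_sig_page { l4_taskid_t linux_server;     /* ... */ };

    struct nomad_proc {
        int          pid;          /* stable Linux process id */
        l4_taskid_t  l4_task;      /* L4 task identifier, reassigned on migration */
    };

    l4_taskid_t allocate_l4_task(void);                    /* hypothetical */
    struct shared_sig_page *shared_page_of(int pid);       /* hypothetical */

    /* After resumption, give every process a fresh task identifier and update
     * the server identifier stored in its shared signal page. */
    void remap_task_ids(struct nomad_proc *procs, int n, l4_taskid_t new_server)
    {
        for (int i = 0; i < n; i++) {
            procs[i].l4_task = allocate_l4_task();
            shared_page_of(procs[i].pid)->linux_server = new_server;
        }
    }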
Unfortunately, task identifiers may also be cached in temporary variables or CPU registers,
and used in IPC operations after resumption, and the only way of preventing use of such stale
identifiers across migration is to treat every identifier use as a critical section and protect it
with a lock or semaphore, inside which migration cannot occur. This solution does not come
without implementation and runtime costs though.
The Clans and Chiefs (C&C) (see page 24) security mechanism of L4 allows the L4Linux server,
which is chief of its clan of Linux user process tasks, to intercept messages going to out-of-clan
tasks, and redirect them to their new and correct destinations.
A thread trying to perform IPC using a stale identifier will then either reach the correct destination right away (because the recipient is still at its old location), or be intercepted by C&C (because the stale identifier points to a task which is not a member of the new clan).
Intercepted IPC will be forwarded to the L4Linux server, which may then look up the correct
destination, and redirect the message.
For this to work, new tasks need to be allocated in such a manner that any newly obtained task number which overlaps the old set of task numbers is assigned to the same Linux PID as before. The same goes for the L4Linux server.
IPC redirection has some overhead, but shortly after resumption all temporary identifiers should
have gone out of scope, and execution will be back to normal speed.
If the operating system is migrated again before all stale identifiers have gone out of scope,
there is a risk of task identifiers rejoining the clan, but for different PIDs this time, resulting in
invalid IPC going undetected by C&C. However, this may be solved for all practical purposes
by being a little careful about the lifetime of cached identifiers.
If this is not acceptable, a lifelong history of previous identifiers will need to be kept for each
task, ensuring that if a PID once mapped to a certain task identifier, and that identifier ever
becomes available, it will map to it again.
Unfortunately, no existing L4 version for Intel currently implements C&C redirection, though
forthcoming versions promise a new and improved security model to replace it. As a temporary
solution, the Fiasco kernel was modified to make out of clan IPCs fail with an error code, which
L4Linux reacts to by recalculating the destination identifier and retrying the IPC.
Another problem is that two guest Linux servers will normally be direct children of the host
server task, and thus members of the same clan, allowing them to IPC each other directly. From
a security viewpoint, this is not much of an issue, because L4Linux already takes care to discard
incoming IPC from unknown sources, but it is a problem when migrating. Suppose a Linux
server is running at task 256, and that its internal threads communicate by referring to task
256 and a specific local thread number. When the server is suspended and migrated to a host
where task 256 is taken by another Linux server, it is assigned task 512 instead. Internal threads
in task 512 still hold stale identifiers, pointing to threads in task 256, and because the two server
tasks are in the same clan, they are allowed to communicate directly to threads in task 256, with
undefined behaviour as a result. This can be solved by encapsulating each server in its own
clan, but this solution is inefficient because it deepens the C&C hierarchy, adding another layer
through which IPC will have to be manually forwarded.
The root of the problem is the redundant way IPC destinations are identified as (task, thread) pairs in L4, even when communicating inside the same task. If it were possible in L4 to specify a task-local tuple as the destination, naming only a recipient thread within the current task, the problem would be solved. The upcoming Pistachio L4 kernel is supposed to have a special
LIPC primitive for performing very fast intra-task IPC, which could be used. In the meantime,
the Fiasco L4 source was modified to allow IPC with task identifier 0 designating task-local
IPC.
Remote control interface
The control interface defines a number of actions used for managing guest operating systems
running on the machine. The interface server accepts TCP connections, making it possible to
remotely start and stop guests, as well as to migrate them between machines.
The following commands are defined for managing guest operating systems:
client-alloc Given the required amount of physical memory and the IP address, this call inserts the guest into the host environment. If the guest is transferred as an image for first-time boot, the image is parsed and set up ready for execution. It is possible to specify a guest-specific command line, made available to the guest via the guest info page.
suspend Suspends a guest designated by its IP address. This function goes through the suspension process for the relevant guest but does not release resources allocated to the
guest.
resume Resumes a previously suspended guest with the environment available to the guest
when it was suspended.
migrate Migrates a guest designated by its IP address onto a new host. The precopy algorithm is used to minimise downtime during migration. The task is initially copied to the
new server, after which it is suspended locally, and changed pages are copied to the new
server. Finally the old server sends a resume signal to the new server to restart the process
remotely. Should anything go wrong, the local task can be restarted with a resume command, making migration transactional by nature. When migration is complete, resources
may be freed.
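The migrate sequence described above can be summarised in a C sketch, seen from the old host; all function names are invented for illustration and do not correspond to the actual NomadBIOS implementation.

    struct guest; struct host;                        /* opaque in this sketch */

    int  send_all_pages(struct guest *g, struct host *t);
    int  send_dirty_pages(struct guest *g, struct host *t);
    void suspend_guest(struct guest *g);
    void resume_guest(struct guest *g);
    int  remote_resume(struct host *t, struct guest *g);
    void release_resources(struct guest *g);

    /* Precopy migration, as orchestrated by the old host. */
    int migrate_guest(struct guest *g, struct host *target)
    {
        send_all_pages(g, target);      /* copy the image while the guest still runs */
        suspend_guest(g);               /* brief local suspension */
        send_dirty_pages(g, target);    /* final transfer of the pages changed meanwhile */

        if (remote_resume(target, g) != 0) {
            resume_guest(g);            /* on failure, restart locally (transactional) */
            return -1;
        }
        release_resources(g);           /* success: the guest now runs on the target */
        return 0;
    }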
The client program for the control protocol is available as a stand-alone command line program, called runclient, and is also integrated into the host environment. Apart from the
initial migration command, the protocol steps involved in performing migration between host
environments are handled in a peer-to-peer fashion.
Future work
The current set of services provided by NomadBIOS suffices to demonstrate the feasibility
of the system. A number of additional services have been identified, which, if implemented,
would extend the usability of NomadBIOS in a cluster environment.
Checkpointing server
A checkpointing server can be implemented as a special version of the control interface running at a passive host, possibly a file server. A guest can be checkpointed by migrating to the
checkpoint server, which stores the data in persistent storage, and then reports failure to start
the process. This makes the original host environment restart the guest locally at the migration
point.
Should a host environment fail, the last known checkpoint can be restored by having the checkpoint server migrate the stored version of the guest back to an active host environment.
Load balancing
It would be possible to add MOSIX-like load balancing of guest operating systems to the host
environment. The service would be responsible for communicating load averages and guest
specifications between a number of hosts, making decisions about which hosts should offload
which guests onto other hosts. The load balancing server would use the existing control interface to migrate guests to other hosts. This would allow for MOSIX-style load balancing of a
cluster, or across a set of clusters, without residual dependency problems, though at a coarser
granularity level.
Block device access
Similar to the network abstraction, it would be possible to add a block device abstraction to
the host environment. This would give the guests access to a section of a local physical disk,
usable as swap-space or for creating file systems for storing temporary files. The block device
service would partition the local hard drive simply by recording an offset and length for each guest's physical volume, and do block remapping and bounds checking to ensure separation of
the volumes. During migration all blocks owned by a guest would be transferred to the target
machine, possibly using the precopy algorithm.
CHAPTER 7
Performance measurements
The primary objective of this project is the implementation of an environment for hosting nomadic operating systems, and a port of one non-toy operating system to the environment.
The secondary objective, as described on page 4, is showing that the primary objective can be achieved without a significant loss of performance. To confirm this, a number of benchmarks were run. The strategy is twofold. One is to demonstrate
that running nomadic L4Linux (NomadLinux) under NomadBIOS is not significantly slower
than running standard L4Linux (which has been shown in Härtig et al. [1997] to be 5-10%
slower than native Linux for practical scenarios), while the other is to compare performance of
NomadLinux under NomadBIOS against a similar setup using VMWare, the main alternative
for hosting multiple operating systems on commodity hardware.
Three types of benchmarks have been chosen:
Latencies By measuring the latencies of a series of operating system related tasks, such as
process creation and syscalls, it is possible to get indications of the performance overhead
(if any) incurred by the overall design.
Throughput Measuring the task-completion time of a CPU intensive process gives a clearer picture of the impact felt by an end user in a system based on NomadBIOS.
Migration While not a comparative benchmark, this benchmark measures the downtime incurred by migration as perceived by an external spectator, for example a remotely logged
in user.
All benchmarks were run on the same machine, a single-CPU 750MHz AMD Athlon, with 380 MB of RAM and a 100Mbit Intel EtherExpress 100 Pro network interface.
Latency benchmark
A series of micro benchmarks are run to determine the relative performance of various operating system elements. For this purpose the hbench-OS [Brown and Seltzer, 1997] benchmark suite was chosen. hbench-OS is a derivative of lmbench [McVoy and Staelin, 1996], used for the original
L4Linux benchmarks, but modified to provide more accurate results in a number of cases.
Hbench tests a number of operating system specific latencies and throughputs. Benchmarks
regarding local file system performance have been disabled, since the guest is only able to
access network file systems. The results of this benchmark will be compared to the measurements
of standard L4Linux, to give an indication of the overhead introduced in the nomadic version.
A subset of the total benchmark results was picked, akin to the subset used in the original
L4Linux benchmarks, but with parameters adjusted to run on faster hardware.
The latency benchmarks are described below. Results are specified in microseconds, and lower values indicate better performance.
getpid: Measures the time it takes for one getpid syscall.
write to /dev/null: Measures the time it takes to write one byte to the null device.
null process: Measures the time it takes to fork the current process.
simple process dynamic/static: Measures the time it takes to fork the current process and start either a dynamically or a statically linked program using the execve system call.
sh process dynamic/static: Measures the time it takes to fork the current process and start either a dynamically or a statically linked program using the shell.
pipe: Measures the time it takes to send one byte through a pipe between two processes.
ctx 0k 16: Measures the context switch latency, by sending a token through a series of pipes set up between 16 child processes.
ctx2 0k 16: As ctx, but provokes cache misses in the first and second level caches.
It would have been customary to also benchmark memory map latency together with the other latency tests, but unfortunately that benchmark tended to crash NomadLinux, so no results are available. The cause of the crash is still being investigated.
The bandwidth benchmarks report their results in MB/s; higher values indicate better performance.
pipe-64k: Measures the bandwidth of a pipe, by transferring 4MB through a pipe, in chunks of 64kB.
mem rd 2m: Measures the read bandwidth of the system memory, by touching every byte in a 2MB data range.
mem write 2m: Measures the write bandwidth of the system memory, by writing to every byte in a 2MB data range.
mem zero 2m: Measures the bandwidth of the system memory when clearing memory with the bzero library call.
mem copy 2m unrolled aligned: Measures the copy bandwidth of the system memory, by copying 2MB of data from one address to another. The copy code is loop unrolled for maximum performance, and the data is aligned on 4kB page boundaries.
tcp-128k: Measures the bandwidth of the network layer, by transferring at least 10MB to another host on the network, in chunks of 128kB.
The benchmark is run only once per system, since hbench itself performs repetitions to average
results.
For each run, the system benchmarked was allowed to use 64MB RAM and a root file system
mounted via NFS. The file server was a dual 350MHz Pentium II serving files from a Quantum
Atlas 10k3 ultrawide SCSI2 disk, over a 100Mbit network, and was not doing other work.
The results are compared against identical runs of the same benchmark suite under the following conditions:
VMWare on Windows XP: VMWare running under Microsoft Windows XP, the guest being a native Linux 2.2-20 kernel.
VMWare on Linux 2.4.18: As above, except with VMWare running under Linux 2.4.18.
L4Linux: The unmodified L4Linux, running on top of the same L4 kernel as the NomadLinux benchmark.
Linux 2.2-20: The native Linux kernel running directly on the hardware.
The version of VMWare used was VMWare Workstation 3.2.
The results of this benchmark are tabulated in tables 1 and 2, and it is seen that while VMWare
has better syscall latencies than NomadLinux, it suffers greatly in the process invocation tests.
This is consistent with Lawton [1999]’s description of how VMWare works. When compared
to native Linux, both NomadLinux and L4Linux are outperformed by quite a margin, especially regarding syscalls. In the original L4Linux performance benchmarks, L4Linux was benchmarked with two different syscall mechanisms, one of which involved a specialised libc using
direct L4 IPC instead of the trap mechanism described on page 25. Use of the specialised libc
almost halved the syscall overhead when compared to native Linux.
While this would have benefited NomadLinux as it did L4Linux, the current version of NomadLinux is unable to safely resume processes involved in syscalls by means of direct IPC,
and so this optimisation has not been applied. The problems leading to this decision are further described on page 48.
From the results, it is seen that NomadLinux equals L4Linux in most cases, while outperforming it in some. It is not evident what makes NomadLinux perform better than L4Linux in these
cases, but for this benchmark it is satisfactory to see that no unnecessary overhead was introduced when adapting L4Linux to run under NomadBIOS. When examining the bandwidth results, it seems that NomadLinux and L4Linux have the same throughput in most cases, while
still being outperformed by native Linux. Linux under VMWare comes in at a 5-20% performance
penalty compared to native Linux, and suffers a massive degradation in the pipe bandwidth
test, although reasons for this are not exactly clear. The degradation of the pipe bandwidth
result, when seen in conjunction with the high pipe latency result of VMWare, helps to explain
the relatively poor results of the context switch latency benchmarks, since these pass a token
through a series of pipes set up between a number of processes.
One area in which NomadLinux did not perform well was network bandwidth, at just 50% of
native Linux performance. VMWare performed almost as well as native Linux, indicating that
it should be possible to achieve similar performance for NomadLinux on NomadBIOS.
Test                        NomadLinux   VMWare/Windows   VMWare/Linux    L4Linux      Linux
getpid                            3.80             0.80           0.83       3.94       0.36
write to /dev/null                4.28             1.09           1.03       4.62       0.46
null process                    982.88          1485.70        1397.83    1060.44     178.42
Simple process dynamic         2644.74          4149.80        5197.21    3094.99     906.86
Simple process static          1706.60          7242.33        2973.43    1734.40     330.18
/bin/sh process dynamic       12907.32         30002.18       24193.52   14501.13    7707.26
/bin/sh process static        11898.97         28316.46       22546.51   12951.64    7084.71
ctx 0K 16                         2.46            27.93          24.50       3.41       3.39
ctx2 0K 16                        1.93            26.11          24.29       3.19       3.60
pipe                             19.21            40.67          40.56      24.31       5.30

Table 1: Selected latency results running hbench
Test                            NomadLinux   VMWare/Windows   VMWare/Linux    L4Linux      Linux
pipe-64k                            330.73            88.23          90.17     313.73     366.30
mem rd 2m                           393.18           332.00         371.16     394.48     408.06
mem write 2m                        290.94           278.61         281.06     292.61     304.69
mem zero 2m                         294.79           282.35         280.48     294.66     289.63
mem copy 2m unrolled aligned        149.76           168.76         156.78     168.62     165.16
tcp, 128k                            5.443            8.843          9.379       9.48       9.83

Table 2: Selected bandwidth results running hbench
Throughput benchmark
In order to measure system throughput, a CPU intensive task simulating a scientific workload was put together. The task consists of encoding a single uncompressed wave audio file into its ogg-encoded equivalent. To perform the encoding, the program oggenc was used. Each run measures one encoding from raw PCM wave file format to ogg format, and the result is the wall-clock completion time of the task. For each run, the file cache is warmed by running the benchmark once and discarding the result. The benchmark is run in
three scenarios:
One encoder The encoder is run alone, and the time measured is the time it takes to complete
the encoding. This mode measures the throughput of a single task with no contention for
resources.
Two encoders The encoder is started twice. The time measured is the completion time of the
last to finish.
One encoder on two kernels For the setups that support running multiple guests, this benchmark runs one encoder in each of two guests. This benchmark is relevant only to the
VMWare and NomadLinux setups, and the time measured is the last to finish.
The results are measured in wall-time rather than UNIX system time + user time, due to the
inability of L4Linux (and thus NomadLinux) to accurately report these numbers.
The benchmarks were run under the same environments as the latency test, with the same
guest configuration, and results are available in table 3. It is interesting to note that running
just one guest under NomadBIOS performs on a par with L4Linux itself, and that only a few seconds are lost when running two guests each doing one encoding, compared to running two encoders at once. The difference between the two measurements signifies the overhead of having a second operating
system running, which, when compared to the VMWare figures, is marginal. VMWare on
Windows suffers an 8% slowdown when running two virtual machines, whereas NomadBIOS
suffers just 1.5%.
Please note that the VMWare on Linux result for multiple guests was obtained with an independent timer, since VMWare was not able to share the real time clock correctly between multiple
virtual environments, making the clock drift 50 seconds during this three minute test.
Generally VMWare came in last, by a margin of 15-30% compared to NomadBIOS, except for
the syscall latency benchmark. As is seen, L4Linux and NomadBIOS both come within 5% of
native Linux performance.
Migration benchmark
This benchmark is relevant only for NomadLinux, since it is the only one of the test systems
with built-in support for migration.
This benchmark measures the time taken to migrate a guest environment from one host environment to another. This period of time can be split into two separate phases: the downtime
experienced by the guest, and the downtime experienced by peers on the network.
To get an indication of the first, the number of 4kB pages sent when transferring control of
a guest to a new host was measured. The guest was a full Linux system, running 14 user
processes, including inetd, sshd, and apache, though all idle at the time.
When using the simple copy algorithm, the downtime of a guest is the time taken to transfer
its whole address space to a new host. The migrated test guest contained 64MB of memory,
so the downtime of the guest should be 6-7 seconds on a 100Mbit network. Using the precopy
algorithm, the downtime was reduced to the time taken to transfer 29 4kB pages (a total of
116kB), which is about 10 ms. Running a recursive find command from the root of the NFS-mounted filesystem while being logged in via SSH resulted in a final transfer of 244 4kB pages,
or a downtime of about 100 ms.
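These figures match a back-of-the-envelope estimate of the raw transfer times on a 100Mbit link (ignoring protocol overhead):

    64\,\mathrm{MB} \approx 5.4 \times 10^{8}\ \mathrm{bits}, \qquad
      \frac{5.4 \times 10^{8}\ \mathrm{bits}}{10^{8}\ \mathrm{bits/s}} \approx 5.4\ \mathrm{s} \quad \text{(6-7 s with overhead)}

    29 \times 4\,\mathrm{kB} = 116\,\mathrm{kB} \approx 9.5 \times 10^{5}\ \mathrm{bits}, \qquad
      \frac{9.5 \times 10^{5}\ \mathrm{bits}}{10^{8}\ \mathrm{bits/s}} \approx 10\ \mathrm{ms}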
Since every user process in a NomadLinux system writes at least one page during suspension, the theoretical best-case final transfer is n + 1 pages, where n is the number of user processes in the system. The extra page is for the state of the operating system threads. In the worst case, an extremely busy system, the time for the final transfer, and thus the downtime, will be equal to that of the simple copy algorithm.
The downtime experienced by other machines is not just the transfer time. Since the IP address of the guest is now associated with a different Ethernet address, machines connected to the guest need to be made aware of this. Until this happens, they will continue sending packets to the old host, where the packets will be discarded. This can take a relatively long time; times of up to 20 seconds were seen. To counter this, the new host of the guest broadcasts a series of gratuitous ARP messages (described in detail on page 61), telling every other machine on the network to update their ARP table entry for the guest IP address. This technique brings the time for re-discovery of the guest down to virtually nil. This was confirmed by letting a machine on the network send ICMP echo packets to the guest before, during and after migration, at the rate of ten packets per second. Without gratuitous ARP, 24.9 seconds elapsed before the guest started responding to echo requests, while with gratuitous ARP not a single packet was lost.
The results of these three benchmarks support our hypotheses as described on page 4. Specifically, the throughput benchmark supports the Concurrent, Efficient and Scalable hypotheses, showing that NomadLinux suffers just 1.5% overhead when running two simultaneous guests and performs on a par with L4Linux, while the very low downtime shown in the migration benchmark supports the Migratable hypothesis.

Mode                          NomadLinux    L4Linux   VMWare/Windows   VMWare/Linux   Native Linux
One encoder                     1m 01.1s   1m 01.5s         1m 15.1s       1m 15.2s       0m 58.2s
Two encoders                    2m 07.6s   2m 06.7s         2m 35.9s       2m 34.5s       1m 57.6s
One encoder on two guests       2m 09.0s        N/A         2m 48.7s       2m 40.0s            N/A

Table 3: Results for converting a 44MB audio file into ogg-vorbis format
CHAPTER 8
Discussion
The original motivation for this project was to create a better solution to the configuration
management and load balancing needs of computing grids. However, most of the actual work
has been performed on the micro level, and tested in practice on only a few machines. This chapter deals briefly with some of the challenges one would meet when attempting to
implement a grid-scale system based on nomadic operating systems.
Security
In a system involving untrusted parties, which a large-scale computing grid will necessarily
be, there is a need for security measures to prevent abuse.
The current NomadBIOS implementation contains no security features. Any client on the Internet is able to connect to a NomadBIOS machine and make use of its services. Augmenting
NomadBIOS with support for Kerberos [Miller et al., 1987] or a similar network authentication protocol, as well as implementing an authorisation system on top of a directory access protocol such as the Lightweight Directory Access Protocol (LDAP), would be possible, and may be a topic
for future work.
When a host environment receives a new guest to execute, the host will need to validate the
authenticity of the guest, in order to match it against local access control and quota rules, and to
make sure it was not altered, either by network failures or by a malicious third party, while in transit.
A guest transfer via the network happens in the form of a header structure describing the
guest’s external attributes, and a set of memory pages. The external attributes describe the IP
address, memory quota, and other restrictions that the guest must obey and is unable to manipulate.
The external attributes of the guest are determined by its creator, and should be kept constant
once the guest is running. The creator may digitally sign these attributes, and the host environment verify this signature before accepting any guest pages into memory. The number of
distinctly addressed pages accepted into memory can then be bounded by the amount originally described by the creator. The rationale behind not letting a guest change its IP address,
is that hosting sites may employ various address-based security measures, which the guest
should not be allowed to circumvent. As described on page 61, Mobile IP could be used to aid
in keeping IP address constant across migrations.
The originating host might calculate a secure checksum for each page before transfer, and conclude the transferral of pages with a signed list of all checksums, to make sure everything has
come across correctly.
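As a purely illustrative sketch of the per-page checksum idea (NomadBIOS implements no such scheme, and a real implementation would use a cryptographic hash rather than the trivial FNV-1a placeholder below):

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096

    /* Placeholder hash (FNV-1a); a genuine scheme would use e.g. SHA-1. */
    static uint32_t page_hash(const uint8_t *page)
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < PAGE_SIZE; i++)
            h = (h ^ page[i]) * 16777619u;
        return h;
    }

    /* The originating host fills in one checksum per transferred page; the
     * resulting list would then be signed and sent after the pages. */
    void checksum_pages(const uint8_t *mem, size_t npages, uint32_t *out)
    {
        for (size_t i = 0; i < npages; i++)
            out[i] = page_hash(mem + i * PAGE_SIZE);
    }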
By their nature, most memory pages will be mutable, and it will make no sense for their creator
to sign them. By compromising a host system, an attacker will be able to alter the programs
running on a guest, or to extract secrets such as keys or passwords from their memory. If the
signature of the compromised host is trusted by other systems, they have no way of preventing
the receipt of a maliciously altered guest. Even though the pages containing the static parts
of the operating system code, or even of various user level programs, may be signed up front,
there is still the risk of an attacker manipulating vital data structures, leaving them in an unpredictable state.
It should be noted though, that even though a malicious host environment poses a threat to
innocent guests, a malicious guest should not pose any threat to a host environment, as the
host environment should ensure adequate protection.
Migration across network boundaries
The example nomadic operating system and host environment implemented for this project
employ Internet Protocol version 4 (IPv4) on top of Ethernet as the only channel for communications with the outside world. IPv6 support may be added in the future.
Guest operating systems may be transparently migrated within IP subnets, where their IP addresses stay valid. When a guest arrives at a new host, the new host broadcasts an Address
Resolution Protocol (ARP) reply message, also known as a gratuitous ARP message, to the local subnet, so that everyone in the subnet
becomes aware of the new Ethernet location of the guest’s IP address quickly. If this is not
done, IP connections to the migrated guest usually hang for a few seconds, until its peers have
resolved the new Ethernet address themselves.
The problem of migrating a running operating system to a different IP subnet is equivalent to
the problem of migrating a portable computer by hand. If losing all open network connections
is acceptable, the operating system may simply be reconfigured to use a new IP address, within
the target subnet, which is what most laptop users do today.
If connections are to survive migration between distinct subnets, a layer of indirection has to
be introduced at the protocol level. Because of the problem of migrating portable computers by
hand, this has already been addressed by the Mobile IP extension to IPv4 [Perkins and Myles,
1997].
In Mobile IP, a node on the home network (the home agent) forwards packets destined for the
client (mobile node) to a care-of address on the foreign network. The care-of address may be
handled by a foreign agent on the foreign network, or by the mobile node itself. The home and
foreign agent functionality will typically be embedded in routers on the networks.
Outgoing packets from the mobile node may generally be sent directly to their destination,
though not all firewalls will forward packets with non-local source addresses. If outgoing
packets cannot be sent directly, they also have to be tunnelled via the home agent.
The problem with Mobile IP is the triangular routing occurring because incoming packets have
to go through the home agent. Some Mobile IP implementations take the type of traffic into
account, and allow stateless protocols such as HTTP to be routed directly.
Because of the existence of Mobile IP, there is no need to implement anything special in order to support migration across network boundaries. The problem may be tackled simply by
employing Mobile IP in guest operating systems and on host networks.
The Internet Protocol version 6 specification contains some additional features which are beneficial to Mobile IP, and hopefully the popularity of laptops and wireless networks will result
in most routing devices supporting mobility in the near future.
By using the gratuitous ARP message to minimise the time taken for peers to discover the
migration of a guest to a new physical interface, and by pointing to the option of using Mobile
IP when migrating between different subnets, we can now conclude that the assumption made
on page 5, about being able to migrate network connections seamlessly, has been satisfied.
Other usage scenarios
Apart from the grid scenarios envisaged elsewhere, nomadic operating systems may prove
useful in some end user centric scenarios as well. Below a few imaginable “blue sky” examples
are described. Common to both of them is that they are not very useful in the current state of NomadBIOS, because abstracted access to human interface devices such as display and
keyboard has not yet been implemented.
Workstation hotel
When leaving his place of work, a user may still wish to be able to access his workspace from
home. Normally, this is solved by leaving the workstation on all night, in case it might be
needed.
Alternatively, the workstation’s operating system could migrate to a “hotel server” for the
night. If the user needs to access it, it may still be reached from the hotel server (though the actual location appears unchanged), while the workstation is powered down. Since only a few users are expected to be logged in at night, one hotel server replaces a large number of
workstations, saving huge amounts of energy.
If the user prefers the speed and reliability of his workstation at home, instead of using a terminal connection to the hotel server, the operating system may instead be migrated to his home
workstation when he leaves the office, racing the user home.
Laptop replacement
Spurred by the presence of Universal Serial Bus (USB) and IEEE-1394 (aka FireWire) ports in
almost all modern personal computers, a new crop of portable storage devices, such as flash
cards, the IBM USBKey, and the Apple iPod music player, have started to emerge.
Both solid-state devices currently storing up to one gigabyte, and small harddrives storing 20
gigabytes or more, are available at consumer level prices.
One use of the nomadic operating system would be to periodically checkpoint the entire operating system to, for example, a USBKey. When the user removes the USBKey from the computer, the operating system is paused, since there would no longer be anywhere to place dirty pages. When the USBKey was inserted anew, operation would resume.
If the USBKey is inserted into another computer running the base operating system, the guest
operating system would start operation from the checkpoint on the key.
At first, no mappings would exist, leading to page faults. These would be served from the
USBKey, and the system would soon be running at full speed, similar to the way normal disk-backed paging works.
The benefit is that the user might carry his or her entire operating system environment around
on a very small device, and be able to commence work at any location with a compatible computer present.
Such a solution would need Mobile IP to allow migration of network connections to file servers.
The root file system, as well as the various application binaries would be served from the network, perhaps with part of the USBKey acting as a disk cache. Intermezzo [Braam and Nelson,
1999] would be a good candidate for such a network file system, as it would deal gracefully
with disconnected operation.
When not in interactive use, i.e. when the lid is closed, most laptops today function only as
clunky storage devices. Even though the convenience of being able to work on your thesis
in the park during summer should not be underestimated, it feels safe to assume that many
laptops are actually running on desks, as desktop computer replacements, most of the time.
Instead of carrying a bulky and expensive laptop between desks, just to ensure access to personal data and applications, simple removable media technologies, which will fit in a pocket or
in a key ring, could be used instead. If used in conjunction with trusted, public terminals running the host environment, users would be able to migrate their operating system, data, and
applications between desks, eliminating the need for a laptop in many cases. Such terminals
could exist both in the home and at the office, or even in airports and on planes.
Related work
Other recent research projects deal with the subjects of partitioning and operating system migration on commodity hardware. This can be viewed as a natural result of personal computers
becoming fast enough that they can host several independent systems at once, much like mainframes have for several years.
The Fluke microkernel
The project perhaps most related to NomadBIOS is the Fluke [Ford et al., 1996] microkernel
from the University of Utah.
Fluke is an attempt at building a system supporting recursive virtual machines, of arbitrary
nesting depth, with exportable kernel state allowing easy checkpointing of single programs, entire guest operating systems, or even of systems distributed across multiple physical machines.
The main advantage of Fluke compared to L4, is its lack of global identifiers, simplifying the
task of migrating complex guest operating systems across machines.
Fluke is a synthesis of Mach and L4. IPC security is enforced via ports and port references like
in Mach, but IPC is unbuffered like in L4. The memory management system supports recursive
address spaces like L4, but with more detailed means of control.
Where Fluke is a microkernel architecture, NomadBIOS is only a microkernel application. NomadBIOS gives up some transparency in exchange for less duplication of effort and better
performance, due to its reliance on an established and quite mature kernel technology.
Compared to Fluke, NomadBIOS on L4 aligns better with the microkernel ideal, in that only the
bare-minimum features needed for protection reside within the kernel, whereas checkpointing
support and indirection of global identifiers are kept in user-space.
The most current version of Fluke was released in early 1999, and was described by its developers as unfinished at the time. However, the related OSKit driver framework, on which
NomadBIOS relies for networking, is still maintained and used in many current operating system research projects.
Denali
The Denali Fault Isolation Kernel project [Whitaker et al., 2002] shares several goals with NomadBIOS.
Denali aims to run multiple (hundreds or even thousands) of isolated guest operating systems
on the same machine, and, like NomadBIOS, sees transparency as a non-goal. Scalability is
more important to Denali than to NomadBIOS, but comes at the sacrifice of features such as
virtual memory, resulting in the loss of ability to host traditional operating systems, such as
Linux, within a single guest instance. Instead, Denali suggests the use of several cooperating guests to form a full operating system. This approach is similar to how L4Linux (and most
other microkernel based operating system implementations) runs multiple user level tasks,
served by one or more operating system tasks.
While Denali currently lacks NomadBIOS’ ability to host real operating systems, it has the
ability to swap guests to disk, which NomadBIOS currently does not. This makes
Denali better suited for hosting large numbers of rarely active guests.
vMatrix
vMatrix [Awadallah and Rosenblum, 2002] uses the proprietary VMWare Virtual Machine
Monitor (VMM) to implement a system much akin to NomadBIOS, though mainly focused
on better HTTP serving by migrating content closer to consumers. As described on page 11,
the VMWare approach to system partitioning is quite resource-wasteful, due to its goal of transparency.
vMatrix does not attempt to limit migration downtimes, and cites figures of 10 seconds or more
for resumption alone. Contrasted with NomadBIOS’ ability to perform complete migration in
less than one tenth of a second of downtime, this is rather slow.
Conclusion
As can be seen from the performance measurements, the hypotheses about nomadic operating systems have been verified. NomadBIOS generally outperforms VMWare, performs on a par with standard L4Linux (except for network performance), and lags 5-10% behind monolithic Linux, similar to the results reported in Härtig et al. [1997].
The OSKit-based network abstraction needs improvement. Its poor performance compared
to standard L4Linux and VMWare may be due to the use of unoptimised OSKit drivers, the
extra layers of copying introduced, or to the latency added by the queueing of packets to save
on context switches. All of these issues are addressable without making fundamental changes
to the overall design, and should be the focus of future work.
The limitations of the current L4 Fiasco kernel, mainly the 256MB RAM limitation, also have to be addressed. It is safe to assume that most nodes in clusters today contain much more memory than this, so this limitation needs to be removed, which the Fiasco team is currently working on. The limited number of tasks available to guest operating systems should not be a problem in most scientific scenarios, and in any case the upcoming L4 version 4 specification will solve that.
How close is the implemented system to being usable in a real cluster or grid setup? That depends on the type of application one wishes to run. Applications that iterate over very large datasets multiple times will need access to disks for temporary storage to perform adequately. As long as no block device abstraction has been implemented, these applications cannot be served efficiently. Applications that only iterate over their dataset once will run fine, because the input and output data has to be transferred via the network anyway, so a disk would be no real advantage.
Many current grid efforts attempt to use Java or other interpreted or just-in-time compiled
languages to provide the user with an architecture-independent programming environment.
Nomadic operating systems may be criticised for sacrificing architecture independence for performance and the ability to run legacy software. However, this choice is warranted by the very
nature of the grid itself. If one assumes the set of nodes in the grid to be large enough, finding
a suitable subset sporting one's desired processor type will always be possible.
Nomadic operating systems solve the problem of agreeing on a common configuration for
operating systems on grid nodes, by deferring this choice to the end user. They also allow
more efficient resource utilisation, by allowing load balancing on cluster or even grid level to
occur, by means of operating system migration. Compared to traditional process migration
schemes, they completely eliminate the problem of residual dependencies, though at the cost
of coarser granularity of the migrational unit.
The authors believe nomadic operating systems to be an enabling technology of grid computing. While the current implementation still needs a little work, it is stable enough for many
uses already, and the reader is encouraged to try it out.
APPENDIX A
Guest interface specification
The interface specification has been derived from the NomadBIOS implementation, and represents the minimum functionality needed for creating a migratable guest operating system. The
specification assumes that the guest is already an L4 program, so only the restrictions specific to
guests running under the NomadBIOS host environment are described. The reader is referred
to Hohmuth [1998] and Liedtke [1999] for further details on programming for L4.
Address space layout and page fault handling
May: Access read-only the L4 kernel info page at address 0x1000.
May: Access read-write a special guest info page, at address 0x2000 (see guest info page specification below).
Cannot: Access any memory below 0x400000 (the first 4MB superpage), except for the shared
kernel info page and the guest info page.
May: Access n distinct 4MB pages above 0x400000, with n signifying the amount of 4MB pages
specified in the guest info page.
May: Allocate memory from the NomadBIOS pager, either via IPC or simply by accessing it. Memory
will be returned as either 4kB or 4MB pages, at the sole discretion of NomadBIOS.
Must: Be able to handle repeated page faults to the same address idempotently, when acting
as a pager for its own tasks and threads.
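As an illustration of the last two requirements, the following C sketch shows a guest-side pager
loop that allocates memory simply by touching it and that handles repeated faults on the same
address idempotently. The helpers receive_pagefault() and reply_with_mapping() are hypothetical
placeholders rather than actual L4 or NomadBIOS calls; only the structure is meant to carry over.

    struct pagefault {
        unsigned long addr;   /* faulting address                */
        int           write;  /* non-zero for write access       */
    };

    /* Hypothetical placeholders; the real code uses L4 IPC directly. */
    extern void receive_pagefault(struct pagefault *pf);
    extern void reply_with_mapping(unsigned long addr, int write);

    /* Touch a page with the originally requested access type, so that a
     * read fault never results in a writable mapping further up. */
    static void touch(unsigned long addr, int write)
    {
        volatile char *p = (volatile char *)addr;

        if (write)
            /* write-touch without altering the contents: locked OR with 0 */
            __asm__ __volatile__("lock; orb $0, %0" : "+m" (*p));
        else
            (void)*p;  /* read-touch: a plain load is enough */
    }

    void guest_pager_loop(void)
    {
        struct pagefault pf;

        for (;;) {
            receive_pagefault(&pf);

            /* Touching re-establishes the mapping from the host pager if it
             * was revoked (for example during migration), and is a no-op if
             * the page is already present, so handling the same fault twice
             * is harmless. */
            touch(pf.addr, pf.write);

            reply_with_mapping(pf.addr, pf.write);
        }
    }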
Checkpointing behaviour
Must: Listen for and react to the suspend IPC signal sent by the BIOS, by storing all task state
in memory and returning an instruction address from which resumption may later occur.
Must: Be able to resume full operations from a previously suspended image.
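A minimal sketch of this checkpointing behaviour is given below. All functions are hypothetical
placeholders chosen to mirror the specification; a guest is free to implement the protocol in any
way, as long as the ordering of the three steps is preserved.

    /* Hypothetical placeholders -- not part of any real API. */
    extern void wait_for_suspend_request(void);          /* blocks on the suspend IPC   */
    extern void freeze_all_task_state(void);             /* store task state in memory  */
    extern void reply_resume_address(unsigned long eip); /* address to resume from      */
    extern void resume_entry(void);                      /* entry point after migration */

    void suspend_listener(void)
    {
        for (;;) {
            /* 1. Wait for the suspend signal from the host environment. */
            wait_for_suspend_request();

            /* 2. Store all task state in memory, so that the memory image
             *    alone is enough to reconstruct the guest on the target. */
            freeze_all_task_state();

            /* 3. Hand back an instruction address from which full operation
             *    can later be resumed. */
            reply_resume_address((unsigned long)&resume_entry);
        }
    }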
Hardware abstractions
Cannot: Allocate any hardware interrupts.
Cannot: Access any hardware directly.
May: Use the network IPC protocol for accessing network resources.
External thread identifiers
Must: Obtain any identifiers needed for communication with external service threads via the
guest info page at address 0x2000.
Guest info page
The guest info page is shared between the host environment and the guest, and is writable
by both. Its main purpose is the provision of parameters, such as boot parameter string, IP
and Ethernet addresses, maximum amount of memory available, accessible range of L4 tasks,
thread identifiers for the network services, clock multiplier and so forth. It is writable by the
guest so that in the future, information may be passed back to the host environment as well.
After suspension, values obtained from the guest info page may no longer be valid, and should
be reread. Information written to the guest info page by the guest should likewise be considered
lost after suspension and must be rewritten if still needed.
APPENDIX B
Changes to the L4Linux source
For those knowledgeable about the L4Linux implementation, the changes necessary to turn
L4Linux into an example nomadic operating system are described below. Some of the changes
rely on two small modifications made to the Fiasco kernel: Out-of-clan IPC now fails with an
error code, and intra-task IPC is possible by specifying task as the recipient.
arch/l4/x86/emulib/int_entry.S All IPCs are modified according to figures 8, 9 and 10. The out-of-clan IPC error code is checked as well, to see if task numbers have changed due to
migration. Functions have been added for storing the CPU state of the main process thread in the
shared page, and for recovering the process state from this page.
arch/l4/x86/emulib/user.c Special handling code for signal 100 added to the signal handling
thread. Signal 100 is the signal chosen to signify a suspend request from the L4Linux
server. When suspension is complete, the signal handler responds to L4Linux with eip
and esp of the frozen process.
arch/l4/x86/kernel/chead.c Thread 0, the first thread of the L4Linux task, normally creates a
new thread – thread 3 – which does most of the work, and then goes to sleep forever.
In Nomadic L4Linux, thread 0 listens for suspend requests from the host environment
instead of going back to sleep.
arch/l4/x86/kernel/irq.c Rather than trying to obtain the timer interrupt from L4, the function
timer_irq_thread waits for an IPC timeout every 10 ms, and now has the responsibility
of incrementing the jiffies counter.
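The resulting timer thread can be pictured roughly as follows. ipc_receive_timeout_10ms() is a
hypothetical stand-in for an L4 receive operation that is guaranteed to time out after 10 ms, and
run_timer_bottom_half() for the ordinary L4Linux timer processing; only the structure is meant
to be accurate.

    extern unsigned long volatile jiffies;   /* the usual Linux tick counter */

    /* Hypothetical placeholders for the L4 receive with timeout and for
     * the normal L4Linux timer processing. */
    extern void ipc_receive_timeout_10ms(void);
    extern void run_timer_bottom_half(void);

    void timer_irq_thread(void)
    {
        for (;;) {
            /* Nobody ever sends to this thread, so the receive always times
             * out after 10 ms and thereby acts as the periodic timer tick. */
            ipc_receive_timeout_10ms();

            jiffies++;               /* this thread now owns the jiffies counter */
            run_timer_bottom_half();
        }
    }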
arch/l4/x86/kernel/l4_idle.S A special page fault handler thread is installed to allow page faults
in the emulib pages while resuming a user process. All IPCs are modified according to
figures 8, 9 and 10. Note that only the assembler version of the idle thread has been modified, so the C version in dispatch.c is currently not supported.
arch/l4/x86/kernel/pagefault.S All pages are touched before being mapped to the faultee, to
provoke higher level page faults should the mapping have been removed by the host
pager. Pages are touched according to the original type of access, so that read requests do
not generate a writable mapping further up.
arch/l4/x86/kernel/setup.c The command line for the kernel is read from the guest info page.
All memory is now allocated on demand, so no explicit pager calls are performed.
arch/l4/x86/kernel/time.c Gets the clock multiplier from the guest info page, instead of performing its own calibration.
arch/l4/x86/lib/l4_pager.c Modified to allow multiple page faults on the same address over
time, by simply touching the memory page associated with the faulting address, instead
of explicitly propagating the request to the host pager. The Ping-Pong task is not necessary when running under Fiasco, and has been disabled.
arch/l4/x86/lib/task.c Modified to obtain tasks from the host environment instead of from RMGR.
init/suspend.c New file implementing the suspension and recovery functions used prior to
and after migration. First suspends all L4Linux server threads, then all processes, after
which it signals the host environment which will handle the actual migration. Upon
recovery, first all L4Linux server threads are restarted but sleeping, then all user processes
are recreated according to the Linux process table, and finally the L4Linux server threads
are given a “go” signal and start serving again.
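The ordering just described can be summarised in the following sketch; the function names are
placeholders that mirror the prose, not the actual names used in init/suspend.c.

    /* Placeholder names mirroring the description above. */
    extern void stop_l4linux_server_threads(void);
    extern void stop_all_user_processes(void);
    extern void signal_host_environment(void);
    extern void restart_server_threads_asleep(void);
    extern void recreate_user_processes(void);
    extern void wake_server_threads(void);

    void nomadic_suspend(void)
    {
        stop_l4linux_server_threads();   /* 1. quiesce the L4Linux server        */
        stop_all_user_processes();       /* 2. freeze every Linux process        */
        signal_host_environment();       /* 3. the host performs the actual move */
    }

    void nomadic_resume(void)
    {
        restart_server_threads_asleep(); /* 1. recreate server threads, kept asleep       */
        recreate_user_processes();       /* 2. rebuild tasks from the Linux process table */
        wake_server_threads();           /* 3. the "go" signal: start serving again       */
    }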
drivers/net/l4_ether.c Modified to use the host network interface. The receiver is running as
a separate thread, listening for incoming packets from the host. Multiple packets are
gathered in a queue, which is later flushed by an added bottom half handler, to amortise
the cost of context switches.
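A rough sketch of the receive path follows. wait_for_packet_from_host(), queue_rx_bottom_half()
and free_host_packet() are hypothetical placeholders for the NomadBIOS IPC side; the sk_buff
handling follows ordinary Linux driver conventions but is simplified (no locking or error handling),
so it should not be read as a copy of drivers/net/l4_ether.c.

    #include <linux/skbuff.h>
    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>
    #include <linux/string.h>

    struct host_pkt {
        struct host_pkt *next;
        unsigned int     len;
        unsigned char    data[1514];
    };

    /* Hypothetical placeholders for the NomadBIOS network IPC. */
    extern struct host_pkt *wait_for_packet_from_host(void);
    extern void queue_rx_bottom_half(void);
    extern void free_host_packet(struct host_pkt *p);

    static struct host_pkt *rx_head, *rx_tail;   /* FIFO of undelivered packets */

    /* Separate receiver thread: only enqueue, keeping the IPC path short. */
    void l4_ether_rx_thread(void)
    {
        struct host_pkt *p;

        for (;;) {
            p = wait_for_packet_from_host();
            p->next = NULL;
            if (rx_tail)
                rx_tail->next = p;
            else
                rx_head = p;
            rx_tail = p;
            queue_rx_bottom_half();   /* ask for a flush at a convenient time */
        }
    }

    /* Bottom half: flush the whole queue in one go, amortising the context
     * switch cost over many packets. */
    void l4_ether_rx_bh(struct net_device *dev)
    {
        struct host_pkt *p;
        struct sk_buff  *skb;

        while ((p = rx_head) != NULL) {
            rx_head = p->next;
            if (!rx_head)
                rx_tail = NULL;

            skb = dev_alloc_skb(p->len);
            memcpy(skb_put(skb, p->len), p->data, p->len);
            skb->dev      = dev;
            skb->protocol = eth_type_trans(skb, dev);
            netif_rx(skb);            /* hand the packet to the network stack */

            free_host_packet(p);
        }
    }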
include/l4linux/net.h New file, defining constants for the NomadBIOS network interface.
include/l4linux/x86/config.h Modified to run at lower priorities, to make room for the host
environment. The kernel now runs at priority 10.
include/l4linux/x86/sched.h IPCs are modified according to figures 8, 9 and 10.
include/l4linux/x86/shared_data.h The shared data structure used for syscall handling is extended to make room for storing CPU state upon suspension.
include/l4linux/guest_info.h New file that defines the layout of the guest info page.
APPENDIX C
Availability
Source code and binaries of the NomadBIOS and nomadic L4Linux may be downloaded from
the Internet at http://www.nomadbios.dk. More information about the Fiasco L4 kernel
can be found at http://os.inf.tu-dresden.de/fiasco/, and more information about
the Hazelnut and Pistachio L4 kernels, and about the upcoming version 4 L4 specification can
be found at http://www.l4ka.org.
APPENDIX D
Bibliography
Amar, L., A. Barak, A. Eizenberg and A. Shiloh. The MOSIX scalable cluster file systems for
Linux, 2000.
Awadallah, Amr and Mendel Rosenblum. The vMatrix: A network of virtual machine monitors for dynamic content distribution. Technical report, Computer Systems Lab,
Stanford University, 2002.
Barak, A. and O. La’adan. The MOSIX multicomputer operating system for high performance
cluster computing, 1998.
Braam, P., M. Callahan and Phil Schwan. The InterMezzo filesystem, 1999.
Braam, Peter J. and Phillip A. Nelson. Removing bottlenecks in distributed filesystems: Coda &
Intermezzo as examples. Proceedings of the 5th Annual Linux Expo, pages 131–139,
1999.
Brown, Aaron B. and Margo I. Seltzer. Operating system benchmarking in the wake of lmbench: A case study of the performance of NetBSD on the Intel x86 architecture.
Technical report, Harvard University, 1997.
Bugnion, Edouard, Scott Devine, Kinshuk Govil and Mendel Rosenblum. Disco: Running
commodity operating systems on scalable multiprocessors. ACM Transactions
on Computer Systems, 15(4):412–447, 1997.
Creasy, R. J. The origin of the VM/370 time-sharing system. IBM Journal of Research and Development, 25(5):483–490, 1981.
Dannowski, Uwe, Espen Skoglund and Volkmar Uhlig. L4 eXperimental Kernel Reference Manual,
Version X.2. Universität Karlsruhe, 2002.
Douglis, Fred and John K. Ousterhout. Transparent process migration: Design alternatives and
the Sprite implementation. Software - Practice and Experience, 21(8):757–785, 1991.
Engler, Dawson R. and M. Frans Kaashoek. DPF: Fast, flexible message demultiplexing using
dynamic code generation. In SIGCOMM, pages 53–59, 1996.
Ertl, M. Anton, David Gregg, Andreas Krall and Bernd Paysan. Vmgen — a generator of
efficient virtual machine interpreters. Software Practice and Experience, 32(3):265–
294, 2002.
Ford, Bryan, Godmar Back, Greg Benson, Jay Lepreau, Albert Lin and Olin Shivers. The Flux
OSKit: A substrate for kernel and language research. In Symposium on Operating
Systems Principles, pages 38–51, 1997.
Ford, Bryan, Mike Hibler, Jay Lepreau, Patrick Tullmann, Godmar Back and Stephen Clawson.
Microkernels meet recursive virtual machines. In Operating Systems Design and
Implementation, pages 137–151, 1996.
Foster, Ian, Carl Kesselman, Jeffrey M. Nick and Steven Tuecke. The Physiology of the Grid. An
open Grid services architecture for distributed systems integration. draft., 2002.
Gabriel, Richard P. Patterns of Software: Tales from the software community. OUP, 1996.
Goldberg, Robert P. Survey of virtual machine research. IEEE Computer, 7(6):34–45, 1974.
Hansen, Per Brinch. The nucleus of a multiprogramming system. Communications of the ACM,
13(4):238–250, 1970.
Härtig, Hermann, Michael Hohmuth, Jochen Liedtke, Sebastian Schönberg and Jean Wolter. The
performance of microkernel-based systems, 1997.
Härtig, Hermann, Michael Hohmuth and Jean Wolter. Taming Linux, 1998.
Hohmuth, Michael. The Fiasco Kernel: Requirements Definition. Technische Universität Dresden,
1998.
Kamp, Poul-Henning and Robert N. M. Watson. Jails: Confining the omnipotent root. In
Proceedings, SANE 2000 Conference, 2000.
Lawton, Kevin. Running multiple operating systems concurrently on the IA32 PC using virtualization techniques, 1999.
Liedtke, J. and H. Wenske. Lazy process switching, 2001.
Liedtke, Jochen. Clans & Chiefs. Technical report, German National Research Center for Computer Science, 1992.
Liedtke, Jochen. On micro-kernel construction. In Symposium on Operating Systems Principles,
pages 237–250, 1995.
Liedtke, Jochen. L4 Nucleus Version X Reference Manual. Universität Karlsruhe, 1999.
Liedtke, Jochen, Uwe Dannowski, Kevin Elphinstone, Gerd Liefländer, Espen Skoglund, Volkmar Uhlig, Christian Ceelen, Marcus Haeberlen and Marcus Völp. The L4Ka
vision. White paper, Universität Karlsruhe, 2001.
Lindholm, Tim and Frank Yellin. The Java Virtual Machine Specification. Addison-Wesley, 1996.
McVoy, Larry and Carl Staelin. lmbench: Portable tools for performance analysis. Technical
report, Silicon Graphics Inc. and Hewlett-Packard Laboratories, 1996.
Miller, S. P., B. C. Neuman, J. I. Schiller and J. H. Saltzer. Kerberos authentication and authorization system. Technical report, Massachusetts Institute of Technology, 1987.
Milojicic, D., F. Douglis, Y. Paindaveine, R. Wheeler and S. Zhou. Process migration. ACM
Computing Surveys, 32(3):241–299, 2000.
Morrison, R., A. L. Brown, R. Carrick, R. C. H. Connor, A. Dearle and M. P. Atkinson. The
Napier type system. In Rosenberg, J. and D M Koch, editors, Persistent Object
Systems, pages 3–18. Springer-Verlag, 1990.
Ousterhout, J. K., A. R. Cherenson, F. Douglis, M. N. Nelson and B. B. Welch. The Sprite
network operating system. Computer Magazine of the Computer Group News of the
IEEE Computer Group Society; ACM CR 8905-0314, 21(2), 1988.
Perkins, C. E. and A. Myles. Mobile IP. Proceedings of International Telecommunications Symposium, pages 415–419, 1997.
Rashid, Richard, Daniel Julin, Douglas Orr, Richard Sanzi, Robert Baron, Alessandro Forin,
David Golub and Michael B. Jones. Mach: a system software kernel. In Proceedings of the 1989 IEEE International Conference, COMPCON, pages 176–178, San
Francisco, CA, USA, 1989. IEEE Comput. Soc. Press.
Robin, John Scott and Cynthia E. Irvine. Analysis of the Intel Pentium's ability to support
a secure virtual machine monitor. In Proceedings of the 2001 USENIX Security
Symposium, 2001.
Satyanarayanan, M. Scalable, secure, and highly available distributed file access. IEEE Computer, 23(5), 1990.
Satyanarayanan, M., J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel and D. C. Steere. Coda:
A highly available file system for a distributed workstation environment. IEEE
Transactions on Computers, 39(4):447–459, 1990.
Shapiro, Jonathan S. EROS: A Capability System. PhD thesis, University of Pennsylvania, 1999.
Shapiro, Jonathan S. and Jonathan Adams. Design evolution of the EROS single-level store. In
Proceedings of the 2002 USENIX Annual Technical Conference, 2002.
Skoglund, Espen, Christian Ceelen and Jochen Liedtke. Transparent orthogonal checkpointing
through user-level pagers. Lecture Notes in Computer Science, 2135:201–??, 2001.
Venkitachalam, Ganesh and Beng-Hong Lim. Virtualizing I/O devices on VMWare Workstation’s hosted virtual machine monitor. In Proceedings of the 2001 USENIX Technical Conference, 2001.
Laszewski, Gregor von, Kazuyuki Shudo and Yoichi Muraoka. Grid-based asynchronous migration of execution context in Java virtual machines. Lecture Notes in Computer
Science, 1900:22–34, 2000.
Whitaker, Andrew, Marianne Shaw and Steven D. Gribble. Denali: A scalable isolation kernel.
Technical report, University of Washington, 2002.
Zayas, Edward R. Attacking the process migration bottleneck, 1987.