On µ-Kernel Construction

15th ACM Symposium on Operating System Principles (SOSP)
December 3-6, 1995, Copper Mountain Resort, Colorado

Jochen Liedtke
GMD, German National Research Center for Information Technology
[email protected]
Abstract
From a software-technology point of view, the µ-kernel concept is superior to large integrated kernels. On the other hand, it is widely believed that (a) µ-kernel based systems are inherently inefficient and (b) they are not sufficiently flexible. Contradictory to this belief, we show and support by documentary evidence that inefficiency and inflexibility of current µ-kernels is not inherited from the basic idea but mostly from overloading the kernel and/or from improper implementation.
Based on functional reasons, we describe some concepts which must be implemented by a µ-kernel and illustrate their flexibility. Then, we analyze the performance critical points. We show what performance is achievable, that the efficiency is sufficient with respect to macro-kernels and why some published contradictory measurements are not evident. Furthermore, we describe some implementation techniques and illustrate why µ-kernels are inherently not portable, although they improve portability of the whole system.
1 Rationale
µ-kernel based systems have been built long before the term itself was introduced, e.g. by Brinch Hansen [1970] and Wulf et al. [1974]. Traditionally, the word `kernel' is used to denote the part of the operating system that is mandatory and common to all other software. The basic
GMD SET-RS, 53754 Sankt Augustin, Germany
Copyright © 1995 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that new copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request Permissions from Publications Dept, ACM Inc., Fax +1 (212) 869-0481, or [email protected]
idea of the µ-kernel approach is to minimize this part, i.e. to implement outside the kernel whatever possible.
The software technological advantages of this approach are obvious:
(a) A clear µ-kernel interface enforces a more modular system structure.1
(b) Servers can use the mechanisms provided by the µ-kernel like any other user program. Server malfunction is as isolated as any other user program's malfunction.
(c) The system is more flexible and tailorable. Different strategies and APIs, implemented by different servers, can coexist in the system.
Although much effort has been invested in µ-kernel construction, the approach is not (yet) generally accepted. This is due to the fact that most existing µ-kernels do not perform sufficiently well. Lack of efficiency also heavily restricts flexibility, since important mechanisms and principles cannot be used in practice due to poor performance. In some cases, the µ-kernel interface has been weakened and special servers have been re-integrated into the kernel to regain efficiency.
It is widely believed that the mentioned inefficiency (and thus inflexibility) is inherent to the µ-kernel approach. Folklore holds that increased user-kernel mode and address-space switches are responsible. At first glance, published performance measurements seem to support this view.
In fact, the cited performance studies measured the performance of a particular µ-kernel based system without analyzing the reasons which limit efficiency. We can only guess whether it is caused by the µ-kernel approach, by the concepts implemented by this particular µ-kernel or by the implementation of the µ-kernel. Since it is known that conventional IPC, one of the traditional µ-kernel bottlenecks, can be implemented an order of magnitude faster2 than believed before, the question is still
1 Although many macro-kernels tend to be less modular, there are exceptions from this rule, e.g. Chorus [Rozier et al. 1988] and Peace [Schröder-Preikschat 1994].
2 Short user-to-user cross-address space IPC in L3 [Liedtke
2.1 Address Spaces
open. It might be possible that we are still not applying the appropriate construction techniques.
For the above reasons, we feel that a conceptual analysis is needed which derives µ-kernel concepts from pure functionality requirements (section 2) and that discusses achievable performance (section 4) and flexibility (section 3). Further sections discuss portability (section 5) and the chances of some new developments (section 6).
At the hardware level, an address space is a mapping which associates each virtual page with a physical page frame or marks it `non-accessible'. For the sake of simplicity, we omit access attributes like read-only and read/write. The mapping is implemented by TLB hardware and page tables.
The µ-kernel, the mandatory layer common to all subsystems, has to hide the hardware concept of address spaces, since otherwise, implementing protection would be impossible. The µ-kernel concept of address spaces must be tamed, but must permit the implementation of arbitrary protection (and non-protection) schemes on top of the µ-kernel. It should be simple and similar to the hardware concept.
The basic idea is to support recursive construction of address spaces outside the kernel. By magic, there is one address space σ0 which essentially represents the physical memory and is controlled by the first subsystem S0. At system start time, all other address spaces are empty. For constructing and maintaining further address spaces on top of σ0, the µ-kernel provides three operations:
Grant. The owner of an address space can grant any
of its pages to another space, provided the recipient
agrees. The granted page is removed from the granter's
address space and included into the grantee's address
space. The important restriction is that instead of physical page frames, the granter can only grant pages which
are already accessible to itself.
Map. The owner of an address space can map any of its
pages into another address space, provided the recipient
agrees. Afterwards, the page can be accessed in both
address spaces. In contrast to granting, the page is not
removed from the mapper's address space. Comparable to the granting case, the mapper can only map pages which it can already access itself.
Flush. The owner of an address space can flush any of its pages. The flushed page remains accessible in the flusher's address space, but is removed from all other address spaces which had received the page directly or indirectly from the flusher. Although explicit consent of the affected address-space owners is not required, the operation is safe, since it is restricted to own pages. The users of these pages already agreed to accept a potential flushing when they received the pages by mapping or granting.
Appendix A contains a more precise definition of address spaces and the above three operations.
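To make the recursive construction more concrete, the following sketch models an address space as a table of entries that refer either to a physical frame (in σ0) or to a (source space, source page) pair, together with the map, grant and flush operations described above. It is only an illustration of the model, not L3/L4 kernel code; all names (struct aspace, the entry fields, the enumeration of all spaces) are hypothetical, and the recipient's consent, obtained via IPC in a real system, is omitted.

    /* Hypothetical sketch of the recursive address-space model (not L3/L4 kernel code). */
    #include <stddef.h>

    #define PAGES 1024                     /* size of the virtual page set V (illustrative) */

    struct aspace;

    struct entry {                          /* one element sigma_v of the one-column table */
        struct aspace *src;                 /* source address space, or NULL in sigma0     */
        unsigned       src_page;            /* page v' in the source space                 */
        unsigned       frame;               /* physical frame, meaningful in sigma0 only   */
        int            mapped;              /* entry currently holds a page                */
    };

    struct aspace {
        struct entry e[PAGES];
    };

    /* Map: the recipient's entry refers to (mapper, v); the page stays in the mapper's space. */
    static void map_page(struct aspace *from, unsigned v, struct aspace *to, unsigned v2)
    {
        to->e[v2].src = from;
        to->e[v2].src_page = v;
        to->e[v2].mapped = 1;
    }

    /* Flush: recursively remove every mapping derived, directly or indirectly, from (s, v).
     * 'all' enumerates the existing address spaces (a real kernel keeps a mapping database). */
    static void flush_page(struct aspace *s, unsigned v, struct aspace *all[], size_t n)
    {
        for (size_t i = 0; i < n; i++)
            for (unsigned p = 0; p < PAGES; p++)
                if (all[i]->e[p].mapped && all[i]->e[p].src == s && all[i]->e[p].src_page == v) {
                    flush_page(all[i], p, all, n);   /* first remove what was derived from it */
                    all[i]->e[p].mapped = 0;         /* then remove the mapping itself        */
                }
    }

    /* Grant: the grantee inherits the granter's entry and the page disappears from the granter. */
    static void grant_page(struct aspace *from, unsigned v, struct aspace *to, unsigned v2)
    {
        to->e[v2] = from->e[v];
        from->e[v].mapped = 0;
    }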
2 Some µ-Kernel Concepts
In this section, we reason about the minimal concepts or "primitives" that a µ-kernel should implement.3 The determining criterion used is functionality, not performance. More precisely, a concept is tolerated inside the µ-kernel only if moving it outside the kernel, i.e. permitting competing implementations, would prevent the implementation of the system's required functionality.
We assume that the target system has to support interactive and/or not completely trustworthy applications, i.e. it has to deal with protection. We further assume that the hardware implements page-based virtual memory.
One inevitable requirement for such a system is that a programmer must be able to implement an arbitrary subsystem S in such a way that it cannot be disturbed or corrupted by other subsystems S′. This is the principle of independence: S can give guarantees independent of S′. The second requirement is that other subsystems must be able to rely on these guarantees. This is the principle of integrity: there must be a way for S1 to address S2 and to establish a communication channel which can neither be corrupted nor eavesdropped by S3.
Provided hardware and kernel are trustworthy, further security services, like those described by Gasser et al. [1989], can be implemented by servers. Their integrity can be ensured by system administration or by user-level boot servers. For illustration: a key server should deliver public-secret RSA key pairs on demand. It should guarantee that each pair has the desired RSA property and that each pair is delivered only once and only to the demander. The key server can only be realized if there are mechanisms which (a) protect its code and data, (b) ensure that nobody else reads or modifies the key and (c) enable the demander to check whether the key comes from the key server. Finding the key server can be done by means of a name server and checked by public key based authentication.
1993] is 22 times faster than in Mach, both running on a 486. On the R2000, the specialized Exo-tlrpc [Engler et al. 1995] is 30 times faster than Mach's general RPC.
3 Proving minimality, necessity and completeness would be nice but is impossible, since there is no agreed-upon metric and all is Turing-equivalent.
Reasoning
The described address-space concept leaves memory management and paging outside the µ-kernel; only the grant, map and flush operations are retained inside the kernel. Mapping and flushing are required to implement memory managers and pagers on top of the µ-kernel.
The grant operation is required only in very special situations: consider a pager F which combines two underlying file systems (implemented as pagers f1 and f2, operating on top of the standard pager) into one unified file system (see figure 1). In this example, f1 maps
pure memory mapped I/O, i.e., device ports can be controlled and mapped with 4K granularity.
Controlling I/O rights and device drivers is thus also
done by memory managers and pagers on top of the
µ-kernel.
2.2 Threads and IPC
A thread t is an activity executing inside an address space. A thread is characterized by a set of registers, including at least an instruction pointer, a stack pointer and a state information. A thread's state also includes the address space σ(t) in which t currently executes. This dynamic or static association to address spaces is the decisive reason for including the thread concept (or something equivalent) in the µ-kernel. To prevent corruption of address spaces, all changes to a thread's address space (σ(t) := σ′) must be controlled by the kernel. This implies that the µ-kernel includes the notion of some t that represents the above-mentioned activity. In some operating systems, there may be additional reasons for introducing threads as a basic abstraction, e.g. preemption. Note that choosing a concrete thread concept remains subject to further OS-specific design decisions.
Consequently, cross-address-space communication, also called inter-process communication (IPC), must be supported by the µ-kernel. The classical method is transferring messages between threads by the µ-kernel.
IPC always enforces a certain agreement between both parties of a communication: the sender decides to send information and determines its contents; the receiver determines whether it is willing to receive information and is free to interpret the received message. Therefore, IPC is not only the basic concept for communication between subsystems but also, together with address spaces, the foundation of independence.
Other forms of communication, remote procedure call
(RPC) or controlled thread migration between address
spaces, can be constructed from message-transfer based
IPC.
Note that the grant and map operations (section 2.1)
need IPC, since they require an agreement between
granter/mapper and recipient of the mapping.
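As a concrete illustration of message-transfer based IPC, the following sketch shows what a minimal synchronous send/receive interface between threads could look like. The names (thread_id, msg, ipc_send, ipc_receive, rpc_call) are hypothetical and do not denote the L3/L4 system-call interface; the sketch only reflects the agreement described above: the sender names a destination and supplies the contents, the receiver decides from whom it is willing to receive and how to interpret the message.

    /* Hypothetical user-level view of a minimal synchronous IPC interface (not the L3/L4 API). */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t thread_id;              /* unique identifier, see section 2.3 */

    struct msg {
        size_t  len;
        uint8_t data[64];                    /* short in-line message */
    };

    /* Blocks until the destination is ready to receive; returns 0 on success. */
    int ipc_send(thread_id dest, const struct msg *m);

    /* Blocks until a message from 'from' (or from anyone, if from == 0) arrives;
     * the kernel fills in *sender, so the receiver can rely on the source id.   */
    int ipc_receive(thread_id from, thread_id *sender, struct msg *m);

    /* A round-trip RPC is then simply a send followed by a receive from the callee. */
    static inline int rpc_call(thread_id server, struct msg *inout)
    {
        thread_id replier;
        if (ipc_send(server, inout) != 0) return -1;
        return ipc_receive(server, &replier, inout);
    }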
[Figure 1: A Granting Example. User A and user X run on top of the unifying pager F; F combines the file-system pagers f1 and f2, which operate on top of the standard pager and the disk. Arrows in the figure indicate the grant and map operations described in the text.]
one of its pages to F which grants the received page to user A. By granting, the page disappears from F so that it is then available only in f1 and user A; the resulting mappings are denoted by the thin line: the page is mapped in user A, f1 and the standard pager. Flushing the page by the standard pager would affect f1 and user A, flushing by f1 only user A. F is not affected by either flush (and cannot flush itself), since it used the page only transiently. If F had used mapping instead of granting, it would have needed to replicate most of the bookkeeping which is already done in f1 and f2. Furthermore, granting avoids a potential address-space overflow of F.
In general, granting is used when page mappings should be passed through a controlling subsystem without burdening the controller's address space by all pages mapped through it.
The model can easily be extended to access rights on pages. Mapping and granting copy the source page's access rights or a subset of them, i.e. they can restrict the access but not widen it. Special flushing operations may remove only specified access rights.
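The rights-restriction rule can be illustrated with a simple bit mask: the effective rights of a mapped or granted page are the intersection of the rights the mapper itself holds and the rights it wishes to pass on. The names below are hypothetical and not part of any µ-kernel interface.

    /* Hypothetical illustration of rights restriction on map/grant. */
    enum { PAGE_R = 1, PAGE_W = 2, PAGE_X = 4 };

    /* A mapper holding 'own' rights can pass on at most those rights. */
    static unsigned mapped_rights(unsigned own, unsigned requested)
    {
        return own & requested;     /* can restrict, never widen */
    }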
Supervising IPC
Architectures like those described by Yokote [1993] and Kühnhauser [1995] need not only supervise the memory of subjects but also their communication. This can be done by introducing either communication channels or Clans [Liedtke 1992] which allow supervision of IPC by user-defined servers. Such concepts are not discussed here, since they do not belong to the minimal set of concepts. We only remark that Clans do not burden the µ-kernel: their base cost is 2 cycles per IPC.
I/O
An address space is the natural abstraction for incorporating device ports. This is obvious for memory mapped I/O, but I/O ports can also be included. The granularity of control depends on the given processor. The 386 and its successors permit control per port (one very small page per port) but no mapping of port addresses (it enforces mappings with v = v′); the PowerPC uses
Interrupts
on top of the µ-kernel. In this section, we show the principal flexibility of a µ-kernel. Whether it is really as flexible in practice strongly depends on the achieved efficiency of the µ-kernel. The latter performance topic is discussed in section 4.
The natural abstraction for hardware interrupts is the
IPC message. The hardware is regarded as a set of
threads which have special thread ids and send empty
messages (only consisting of the sender id) to associated
software threads. A receiving thread concludes from the message source id whether the message comes from a hardware interrupt and from which interrupt:
Memory Manager. A server managing the initial address space σ0 is a classical main memory manager, but outside the µ-kernel. Memory managers can easily be stacked: M0 maps or grants parts of the physical memory (σ0) to σ1, controlled by M1, other parts to σ2, controlled by M2. Now we have two coexisting main memory managers.
driver thread:
  do
    wait for (msg, sender);
    if sender = my hardware interrupt
      then read/write io ports;
           reset hardware interrupt
      else ...
  od.

Pager. A pager may be integrated with a memory
manager or use a memory managing server. Pagers use the µ-kernel's grant, map and flush primitives. The remaining interfaces, pager-client, pager-memory server and pager-device driver, are completely based on IPC and are user-level defined.
Pagers can be used to implement traditional paged virtual memory and file/database mapping into user address spaces as well as unpaged resident memory for device drivers and/or real-time systems. Stacked pagers, i.e. multiple layers of pagers, can be used for combining access control with existing pagers or for combining various pagers (e.g. one per disk) into one composed object. User-supplied paging strategies [Lee et al. 1994; Cao et al. 1994] are handled at the user level and are in no way restricted by the µ-kernel. Stacked file systems [Khalidi and Nelson 1993] can be realized accordingly.
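To illustrate how such a user-level pager interacts with the µ-kernel primitives, the sketch below shows the skeleton of a pager's service loop: a client's page fault arrives as an IPC message, the pager picks a page by its own policy, and the reply carries a map operation back into the faulting address space. All names (pf_msg, ipc_wait, reply_and_map, choose_frame) are hypothetical; the concrete fault protocol of L3/L4 is not reproduced here.

    /* Hypothetical skeleton of a user-level pager (illustrative only). */
    #include <stdint.h>

    typedef uint64_t thread_id;

    struct pf_msg {                      /* page-fault message delivered by the kernel */
        uintptr_t fault_addr;            /* faulting virtual address in the client     */
        int       write_access;          /* 1 if the fault was caused by a write       */
    };

    /* Assumed primitives: wait for the next fault message, reply with a mapping. */
    int  ipc_wait(thread_id *client, struct pf_msg *pf);
    int  reply_and_map(thread_id client, uintptr_t client_addr,
                       uintptr_t own_addr, unsigned rights);
    uintptr_t choose_frame(const struct pf_msg *pf);   /* pager-specific policy */

    void pager_loop(void)
    {
        thread_id client;
        struct pf_msg pf;

        for (;;) {
            if (ipc_wait(&client, &pf) != 0)
                continue;
            /* fetch or allocate the page in the pager's own address space ...        */
            uintptr_t page = choose_frame(&pf);
            /* ... and hand it to the client via the kernel's map primitive
             * (rights: 3 = read/write, 1 = read-only, cf. the bitmask sketch above). */
            reply_and_map(client, pf.fault_addr, page, pf.write_access ? 3 : 1);
        }
    }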
Transforming the interrupts into messages must be done by the kernel, but the µ-kernel is not involved in device-specific interrupt handling. In particular, it does not know anything about the interrupt semantics. On some processors, resetting the interrupt is a device-specific action which can be handled by drivers at user level. The iret-instruction then is used solely for popping status information from the stack and/or switching back to user mode and can be hidden by the kernel. However, if a processor requires a privileged operation for releasing an interrupt, the kernel executes this action implicitly when the driver issues the next IPC operation.
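In terms of the hypothetical IPC interface sketched in section 2.2, the driver loop above could look roughly as follows; the interrupt's thread id (HW_IRQ_ID) and the helper functions are made-up names, not part of any real µ-kernel API.

    /* Hypothetical C rendering of the driver loop (illustrative only). */
    #include <stdint.h>

    typedef uint64_t thread_id;
    struct msg { uint64_t w0; };

    int  ipc_receive_any(thread_id *sender, struct msg *m);  /* assumed to exist       */
    void handle_io(void);                                    /* read/write I/O ports   */
    void reset_hw_interrupt(void);                           /* device-specific reset  */
    void handle_request(const struct msg *m);                /* ordinary client message*/

    #define HW_IRQ_ID ((thread_id)0x10)   /* made-up id of "our" hardware interrupt */

    void driver_thread(void)
    {
        thread_id sender;
        struct msg m;

        for (;;) {
            if (ipc_receive_any(&sender, &m) != 0)
                continue;
            if (sender == HW_IRQ_ID) {    /* empty message from the interrupt "thread" */
                handle_io();
                reset_hw_interrupt();
            } else {
                handle_request(&m);       /* normal IPC from other threads */
            }
        }
    }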
2.3 Unique Identiers
A µ-kernel must supply unique identifiers (uid) for something, either for threads or tasks or communication channels. Uids are required for reliable and efficient local communication. If S1 wants to send a message to S2, it needs to specify the destination S2 (or some channel leading to S2). Therefore, the µ-kernel must know which uid relates to S2. On the other hand, the receiver S2 wants to be sure that the message comes from S1. Therefore the identifier must be unique, both in space and time.
In theory, cryptography could also be used. In practice, however, enciphering messages for local communication is far too expensive and the kernel must be trusted anyway. S2 can also not rely on purely user-supplied capabilities, since S1 or some other instance could duplicate and pass them to untrusted subsystems without control of S2.
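One common way to obtain identifiers that are unique in both space and time is to combine a table index with a generation (version) counter that is incremented whenever the slot is reused. The concrete bit layout below is purely hypothetical; section 5.1 notes that the real layout is even influenced by control-block alignment.

    /* Hypothetical uid layout: unique in space (index) and time (generation). */
    #include <stdint.h>

    typedef uint64_t uid_t64;

    #define UID_INDEX_BITS 20u           /* up to ~1M threads (illustrative)        */
    #define UID_GEN_BITS   44u           /* remaining bits: generation counter      */

    static inline uid_t64 uid_make(uint32_t index, uint64_t generation)
    {
        return ((uint64_t)index) | (generation << UID_INDEX_BITS);
    }

    static inline uint32_t uid_index(uid_t64 u)      { return (uint32_t)(u & ((1u << UID_INDEX_BITS) - 1)); }
    static inline uint64_t uid_generation(uid_t64 u) { return u >> UID_INDEX_BITS; }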
Multimedia Resource Allocation. Multimedia
and other real-time applications require memory resources to be allocated in a way that allows predictable
execution times. The above mentioned user-level memory managers and pagers permit e.g. fixed allocation of physical memory for specific data or locking data in memory for a given time.
Note that resource allocators for multimedia and for timesharing can coexist. Managing allocation conflicts is part of the servers' jobs.
Device Driver. A device driver is a process which
directly accesses hardware I/O ports mapped into its
address space and receives messages from the hardware (interrupts) through the standard IPC mechanism.
Device-specific memory, e.g. a screen, is handled by
means of appropriate memory managers. Compared to
other user-level processes, there is nothing special about
a device driver. No device driver has to be integrated
into the µ-kernel.4
3 Flexibility
To illustrate the flexibility of the basic concepts, we
sketch some applications which typically belong to the
basic operating system but can easily be implemented
4 In general, there is no reason for integrating boot drivers into
the kernel. The booter, e.g. located in ROM, simply loads a bit
image into memory that contains the micro-kernel and perhaps
Second Level Cache and TLB. Improving the hit
11/780 or roughly a 486 at 50 MHz), he showed that most machines need 20-30 µs per getpid, one required even 63 µs. Corroborating these results, we measured 18 µs per Mach6 µ-kernel call get self thread. In fact, the measured kernel-call costs are high.
For analyzing the measured costs, our argument is based on a 486 (50 MHz) processor. We take an x86 processor, because kernel-user mode switches are extremely expensive on these processors. In contrast to the worst-case processor, we use a best-case measurement for discussion, 18 µs for Mach on a 486/50.
The measured costs per kernel call are 18 µs × 50 MHz = 900 cycles. The bare machine instruction for entering kernel mode costs 71 cycles, followed by an additional 36 cycles for returning to user mode. These two instructions switch between the user and kernel stack and push/pop flag register and instruction pointer. 107 cycles (about 2 µs) is therefore a lower bound on kernel-user mode switches. The remaining 800 or more cycles are pure kernel overhead. By this term, we denote all cycles which are solely due to the construction of the kernel, never mind whether they are spent in executing instructions (800 cycles ≈ 500 instructions) or in cache and TLB misses (800 cycles ≈ 270 primary cache misses ≈ 90 TLB misses). We have to conclude that the measured kernels do a lot of work when entering and exiting the kernel. Note that this work by definition has no net effect.
Is an 800 cycle kernel overhead really necessary? The answer is no. Empirical proof: L3 [Liedtke 1993] has a minimal kernel overhead of 15 cycles. If the µ-kernel call is executed infrequently enough, it may increase by up to 57 additional cycles (3 TLB misses, 10 cache misses). The complete L3 kernel call costs are thus 123 to 180 cycles, mostly less than 3 µs.
The L3 µ-kernel is process oriented, uses a kernel stack per thread and supports persistent user processes (i.e. the kernel can be exchanged without affecting the remaining system, even if a process actually resides in kernel mode). Therefore, it should be possible for any other µ-kernel to achieve comparably low kernel call overhead on the same hardware.
Other processors may require a slightly higher overhead, but they offer substantially cheaper basic operations for entering and leaving kernel mode. From
an architectural point of view, calling the kernel from
user mode is simply an indirect call, complemented by
a stack switch and setting the internal `kernel'-bit to
permit privileged operations. Accordingly, returning
from kernel mode is a normal return operation complemented by switching back to user stack and resetting the
`kernel'-bit. If the processor has different stack pointer
registers for user and kernel stack, the stack switching
costs can be hidden. Conceptually, entering and leaving
rates of a secondary cache by means of page allocation or reallocation [Kessler and Hill 1992; Romer et al. 1994] can be implemented by means of a pager which applies some cache-dependent (hopefully conflict reducing) policy when allocating virtual pages in physical memory.
In theory, even a software TLB handler could be implemented like this. In practice, the first-level TLB handler will be implemented in the hardware or in the µ-kernel. However, a second-level TLB handler, e.g. handling misses of a hashed page table, might be implemented as a user-level server.
Remote Communication. Remote IPC is implemented by communication servers which translate local messages to external communication protocols and vice versa. The communication hardware is accessed by device drivers. If special sharing of communication buffers and user address space is required, the communication server will also act as a special pager for the client. The µ-kernel is not involved.
Unix Server. Unix5 system calls are implemented by
IPC. The Unix server can act as a pager for its clients
and also use memory sharing for communicating with
its clients. The Unix server itself can be pageable or
resident.
Conclusion. A small set of µ-kernel concepts leads to abstractions which stress flexibility, provided they perform well enough. The only thing which cannot be implemented on top of these abstractions is the processor architecture, registers, first-level caches and first-level TLBs.
4 Performance, Facts & Rumors
4.1 Switching Overhead
It is widely believed that switching between kernel and
user mode, between address spaces and between threads
is inherently expensive. Some measurements seem to
support this belief.
4.1.1 Kernel-User Switches
Ousterhout [1990] measured the costs for executing the "null" kernel call getpid. Since the real getpid operation consists only of a few loads and stores, this method measures the basic costs of a kernel call. Normalized to a hypothetical machine with a 10 MIPS rating (10 VAX
some set of initial pagers and drivers (running at user mode and
not linked but simply appended to the kernel). Afterwards, the
boot drivers are no longer used.
5 Unix is a registered trademark of UNIX System Laboratories.
6 Mach 3.0, NORMA MK 13
this worst-case calculation shows that switching page tables may become critical in some situations.
Fortunately, this is not a problem, since on the Pentium and the PowerPC, address-space switches can be handled differently. The PowerPC architecture includes segment registers which can be controlled by the µ-kernel and offer an additional address translation facility from the local 2^32-byte address space to a global 2^52-byte space. If we regard the global space as a set of one million local spaces, address-space switches can be implemented by reloading the segment registers instead of switching the page table. With 29 cycles for 3.5 GB or 12 cycles for 1 GB segment switching, the overhead is low compared to a no longer required TLB flush. In fact, we have a tagged TLB.
Things are not quite as easy on the Pentium or the 486. Since segments are mapped into a 2^32-byte space, mapping multiple user address spaces into one linear space must be handled dynamically and depends on the actually used sizes of the active user address spaces. The corresponding implementation technique [Liedtke 1995] is transparent to the user and removes the potential performance bottleneck. Address-space switch overhead then is 15 cycles on the Pentium and 39 cycles on the 486.
To understand why the restriction to a 2^32-byte global space is not crucial to performance, one has to mention that address spaces which are used only for very short periods and with small working sets are effectively very small in most cases, say 1 MB or less for a device driver. For example, we can multiplex one 3 GB user address space with 8 user spaces of 64 MB and additionally 128 user spaces of 1 MB. The trick is to share the smaller spaces with all large 3 GB spaces. Then any address-space switch to a medium or small space is always fast. Switching between two large address spaces is uncritical anyway, since switching between two large working sets implies TLB and cache miss costs, never mind whether the two programs execute in the same or in different address spaces.
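A quick check of the arithmetic behind this multiplexing example (our own calculation, not stated explicitly in the text):

$$ 8 \times 64\,\mathrm{MB} + 128 \times 1\,\mathrm{MB} = 640\,\mathrm{MB}, \qquad 3\,\mathrm{GB} + 640\,\mathrm{MB} = 3.625\,\mathrm{GB} < 4\,\mathrm{GB} = 2^{32}\,\mathrm{B}, $$

so one large user space plus all small and medium spaces fit into a single 2^32-byte hardware address space, with the remaining 384 MB available, e.g. for the kernel.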
Table 1 shows the page table switch and segment
switch overhead for several processors. For a TLB miss,
the minimal and maximal cycles are given (provided
that no referenced or modified bits need updating). In
the case of 486, Pentium and PowerPC, this depends on
whether the corresponding page table entry is found in
the cache or not. As a minimal working set, we assume
4 pages. For the maximum case, we exclude 4 pages
from the address-space overhead costs, because at most
4 pages are required by the µ-kernel and thus would as
well occupy TLB entries when the address space would
not be switched.
kernel mode can perform exactly like a normal indirect call and return instruction (which do not rely on branch prediction). Ideally, this means 2 + 2 = 4 cycles on a 1-issue processor.
Conclusion. Compared to the theoretical minimum, kernel-user mode switches are costly on some processors. Compared to existing kernels, however, they can be improved 6 to 10 times by appropriate µ-kernel construction. Kernel-user mode switches are not a serious conceptual problem but an implementational one.
4.1.2 Address Space Switches
Folklore also considers address-space switches as costly.
All measurements known to the author and related to
this topic deal with combined thread and address-space
switch costs. Therefore, in this section, we analyze only
the architectural processor costs for pure address-space
switching. The combined measurements are discussed
together with thread switching.
Most modern processors use a physically indexed
primary cache which is not affected by address-space
switching. Switching the page table is usually very
cheap: 1 to 10 cycles. The real costs are determined
by the TLB architecture.
Some processors (e.g. Mips R4000) use tagged TLBs,
where each entry does not only contain the virtual page
address but also the address-space id. Switching the
address space is thus transparent to the TLB and costs
no additional cycles. However, address-space switching
may induce indirect costs, since shared pages occupy
one TLB entry per address space. Provided that the µ-kernel (shared by all address spaces) has a small working set and that there are enough TLB entries, the problem should not be serious. However, we cannot support this empirically, since we do not know an appropriate µ-kernel running on such a processor.
Most current processors (e.g. 486, Pentium, PowerPC and Alpha) include untagged TLBs. An address-space switch thus requires a TLB flush. The real costs are determined by the TLB load operations which are required to re-establish the current working set later. If the working set consists of n pages, the TLB is fully associative, has s entries and a TLB miss costs m cycles, at most min(n, s) × m cycles are required in total.
Apparently, larger untagged TLBs lead to a performance problem. For example, completely reloading the Pentium's data and code TLBs requires at least (32 + 64) × 9 = 864 cycles. Therefore, intercepting a program every 100 µs could imply an overhead of up to 9%. Although using the complete TLB is unrealistic7,
7 Both TLBs are 4-way set-associative. Working sets which are not compact in the virtual address space usually imply some conflicts so that only about half of the TLB entries are used simultaneously. Furthermore, a working set of 64 data pages will most likely lead to cache thrashing: in the best case, the cache supports 4 × 32 bytes per page. Since the cache is only 2-way set-associative, probably only 1 or 2 cache entries can be used per page in practice.
All times are user to user, cross-address space. They include system call, argument copy, stack and address-space switch costs. Exokernel, Spring and L3 show that communication can be implemented pretty fast and that the costs are heavily influenced by the processor architecture: Spring on Sparc has to deal with register windows, whereas L3 is burdened by the fact that a 486 trap is 100 cycles more expensive than a Sparc trap.
The effect of using segment-based address-space switch on the Pentium is shown in figure 2. One long-running application with a stable working set (2 to 64 data pages) executes a short RPC to a server with a small working set (2 pages). After the RPC, the application re-accesses all its pages. Measurement is done by 100,000 repetitions and comparing each run against running the application (100,000 times accessing all pages) without RPC. The given times are round-trip RPC times, user to user, plus the required time for re-establishing the application's working set.
               TLB      TLB miss     Page table       Segment
               entries  cycles       switch cycles    switch cycles
486            32       9...13       36...364         39
Pentium        96       9...13       36...1196        15
PowerPC 601    256      ?            ?                29
Alpha 21064    40       20...50 (a)  80...1800        n/a
Mips R4000     48       20...50 (a)  0 (b)            n/a

(a) Alpha and Mips TLB misses are handled by software.
(b) The R4000 has a tagged TLB.

Table 1: Address Space Switch Overhead
Conclusion. Properly constructed address-space switches are not very expensive, less than 50 cycles on modern processors. On a 100 MHz processor, the inherited costs of address-space switches can be ignored up to roughly 100,000 switches per second. Special optimizations, like executing dedicated servers in kernel space, are superfluous. Expensive context switching in some existing µ-kernels is due to implementation and not caused by inherent problems with the concept.
4.1.3 Thread Switches and IPC
Ousterhout [1990] also measured context switching in some Unix systems by echoing one byte back and forth through pipes between two processes. Again normalized to a 10 MIPS machine, most results are between 400 and
System      CPU, MHz           RPC time (round trip)   cycles/IPC (oneway)
full IPC semantics
L3          486, 50            10 µs                    250
QNX         486, 33            76 µs                    1254
Mach        R2000, 16.7        190 µs                   1584
SRC RPC     CVAX, 12.5         464 µs                   2900
Mach        486, 50            230 µs                   5750
Amoeba      68020, 15          800 µs                   6000
Spin        Alpha 21064, 133   102 µs                   6783
Mach        Alpha 21064, 133   104 µs                   6916
restricted IPC semantics
Exo-tlrpc   R2000, 16.7        6 µs                     53
Spring      SparcV8, 40        11 µs                    220
DP-Mach     486, 66            16 µs                    528
LRPC        CVAX, 12.5         157 µs                   981

Table 2: 1-byte-RPC performance

[Figure 2: Segmented Versus Standard Address-Space Switch in L4 on Pentium, 90 MHz. The plot shows round-trip RPC time plus the time for re-establishing the application's working set (in µs) against the application data working set (2 to 64 pages), comparing address-space switch by segment switch with switch by page table.]

Conclusion. IPC can be implemented fast enough to handle also hardware interrupts by this mechanism.

4.2 Memory Effects
Chen and Bershad [1993] compared the memory system behaviour of Ultrix, a large monolithic Unix system, with that of the Mach µ-kernel which was complemented with a Unix server. They measured memory cycle overhead per instruction (MCPI) and found that programs running under Mach + Unix server had a substantially
800 µs per ping-pong, one was 1450 µs. All existing µ-kernels are at least 2 times faster, but it is proved by construction that 10 µs, i.e. a 40 to 80 times faster RPC, is achievable. Table 2 gives the costs of echoing one byte by a round trip RPC, i.e. two IPC operations.8
8 The respective data is taken from [Liedtke 1993; Hildebrand 1992; Schroeder and Burroughs 1989; Draves et al. 1991; van Renesse et al. 1988; Liedtke 1993; Bershad et al. 1995; Engler et al. 1995; Hamilton and Kougiouris 1993; Bryce and Muller 1995; Bershad et al. 1989].
suggest a potential problem due to OS structure.
Chen and Bershad measured cache conflicts by comparing the direct mapped to a simulated 2-way cache.9 They found that system self-interference is more important than user/system interference, but the data also show that the ratio of conflict to capacity misses in Mach is lower than in Ultrix. Figure 4 shows the conflict (black) and capacity (white) system cache misses, both on an absolute scale (left) and as a ratio (right).
higher MCPI than running the same programs under Ultrix. For some programs, the differences were up to 0.25 cycles per instruction, averaged over the total program (user + system). Similar memory system degradation of Mach versus Ultrix is noticed by others [Nagle et al. 1994]. The crucial point is whether this problem is due to the way that Mach is constructed, or whether it is caused by the µ-kernel approach.
Chen and Bershad [1993, p. 125] state: "This suggests that microkernel optimizations focussing exclusively on IPC [...], without considering other sources of system overhead such as MCPI, will have a limited impact on overall system performance." Although one might suppose a principal impact of OS architecture, the mentioned paper exclusively presents facts "as is" about a specific implementation without analyzing the reasons for memory system degradation.
Careful analysis of the results is thus required. According to the original paper, we comprise under `system' either all Ultrix activities or the joined activities of the Mach µ-kernel, Unix emulation library and Unix server. The Ultrix case is denoted by U, the Mach case by M. We restrict our analysis to the samples that show a significant MCPI difference for both systems: sed, egrep, yacc, gcc, compress, espresso and the andrew benchmark ab.
In figure 3, we present the results of Chen's figure 2-1 in a slightly reordered way. We have colored MCPI
[Figure 3: Baseline MCPI for Ultrix (U) and Mach (M): total MCPI for sed, egrep, yacc, gcc, compress, ab and espresso, broken into system cache miss MCPI and other MCPI.]
[Figure 4: MCPI Caused by Cache Misses: system cache miss MCPI for the same programs under Ultrix (U) and Mach (M), broken into conflict misses and capacity misses.]
From this we can deduce that the increased cache misses are caused by higher cache consumption of the system (Mach + emulation library + Unix server), not by conflicts which are inherent to the system's structure.
The next task is to find the component which is responsible for the higher cache consumption. We assume that the used Unix single server behaves comparably to the corresponding part of the Ultrix kernel. This is supported by the fact that the samples spent even fewer instructions in Mach's Unix server than in the corresponding Ultrix routines. We also exclude Mach's emulation library, since Chen and Bershad report that only 3% or less of system overhead is caused by it. What remains is Mach itself, including trap handling, IPC and memory management, which therefore must induce nearly all of the additional cache misses.
Therefore, the mentioned measurements suggest that memory system degradation is caused solely by high cache consumption of the µ-kernel. Or in other words: drastically reducing the cache working set of a µ-kernel will solve the problem.
Since a µ-kernel is basically a set of procedures which are invoked by user-level threads or hardware, a high cache consumption can only10 be explained by a large number of very frequently used µ-kernel operations or
black that are due to system i-cache or d-cache misses. The white bars comprise all other causes: system write buffer stalls, system uncached reads, user i-cache and d-cache misses and user write buffer stalls. It is easy to see that the white bars do not differ significantly between Ultrix and Mach; the average difference is 0.00, the standard deviation is 0.02 MCPI.
We conclude that the differences in memory system behaviour are essentially caused by increased system cache misses for Mach. They could be conflict misses (the measured system used direct mapped caches) or capacity misses. A large fraction of conflict misses would
9 Although this method does not determine all conflict misses as defined by Hill and Smith [1989], it can be used as a first-level approximation.
10 We do not believe that the Mach kernel flushes the cache explicitly. The measured system was a uniprocessor with physically tagged caches. The hardware does not even require explicit cache flushes for DMA.
5.1 Compatible Processors
by high cache working sets of a few frequently used operations. According to section 2, the first case has to be considered as a conceptual mistake. Large cache working sets are also not an inherent feature of µ-kernels. For example, L3 requires less than 1 K for short IPC. (Recall: voluminous communication can be made by dynamic or static mapping so that the cache is not flooded by copying very long messages.)
For illustration, we briefly describe how a µ-kernel has to be conceptually modified even when "ported" from the 486 to the Pentium, i.e. to a compatible processor.
Although the Pentium processor is binary compatible to the 486, there are some differences in the internal
                       486               Pentium
TLB entries, ways      32 (u), 4         32 (i) + 64 (d), 4
Cache size, ways       8K (u), 4         8K (i) + 8K (d), 2
  line, write          16 B, through     32 B, back
fast instructions      1 cycle           0.5-1 cycle
segment register       9 cycles          3 cycles
trap                   107 cycles        69 cycles

Table 3: 486 / Pentium Differences
Mogul and Borg [1991] reported an increase in cache misses after preemptively-scheduled context switches between applications with large working sets. This depends mostly on the application load and the requirement for interleaved execution (timesharing). The type of kernel is almost irrelevant. We showed (sections 4.1.2 and 4.1.3) that µ-kernel context switches are not expensive in the sense that there is not much difference between executing application + servers in one or in multiple address spaces.
hardware architecture (see table 3) which influence the internal µ-kernel architecture:
Conclusion. The hypothesis that µ-kernel architectures inherently lead to memory system degradation is not substantiated. On the contrary, the quoted measurements support the hypothesis that properly constructed µ-kernels will automatically avoid the memory system degradation measured for Mach.
User-address-space implementation. As mentioned in section 4.1.2, a Pentium µ-kernel should use segment registers for implementing user address spaces so that each 2^32-byte hardware address space shares all small and one large user address space. Recall that this can be implemented transparently to the user.
Ford [1993] proposed a similar technique for the 486, and table 1 also suggests it for the 486. Nevertheless, the conventional hardware-address-space switch is preferable on this processor. Expensive segment register loads and additional instructions at various places in the kernel sum to roughly 130 cycles required in addition. Now look at the relevant situation: an address-space switch from a large space to a small one and back to the large one. Assuming cache hits, the costs of the segment register model would be (130 + 39) × 2 = 338 cycles, whereas the conventional address-space model would require 28 × 9 + 36 = 288 cycles in the theoretical case of 100% TLB use, 14 × 9 + 36 = 162 cycles for the more probable case that the large address space uses only 50% of the TLB, and only 72 cycles in the best case. In total, the conventional method wins.
On the Pentium, however, the segment register method pays. The reasons are several: (a) Segment register loads are faster. (b) Fast instructions are cheaper, whereas the overhead by trap and TLB misses remains nearly constant. (c) Conflict cache misses (which, relative to instruction execution, are anyway more expensive) are more likely because of reduced associativity. Avoiding TLB misses thus also reduces cache conflicts. (d) Due to the three times larger TLB, the flush costs can increase substantially. As a result, on the Pentium, the segment register method always pays (see figure 2).
5 Non-Portability
Older µ-kernels were built machine-independently on top of a small hardware-dependent layer. This approach has strong advantages from the software technological point of view: programmers did not need to know very much about processors and the resulting µ-kernels could easily be ported to new machines. Unfortunately, this approach prevented these µ-kernels from achieving the necessary performance and thus flexibility.
In retrospect, we should not be surprised, since building a µ-kernel on top of abstract hardware has serious implications:
- Such a µ-kernel cannot take advantage of specific hardware.
- It cannot take precautions to circumvent or avoid performance problems of specific hardware.
- The additional layer per se costs performance.
µ-kernels form the lowest layer of operating systems beyond the hardware. Therefore, we should accept that they are as hardware dependent as optimizing code generators. We have learned that not only the coding but even the algorithms used inside a µ-kernel and its internal concepts are extremely processor dependent.
The differences are orders of magnitude higher than between the 486 and the Pentium. We have to expect that a new processor requires a new µ-kernel design.
For illustration, we compare two different kernels on two different processors: the Exokernel [Engler et al. 1995] running on an R2000 and L3 running on a 486. Although this is similar to comparing apples with oranges, a careful analysis of the performance differences helps understanding the performance-determining factors and weighting the differences in processor architecture. Finally, this results in different µ-kernel architectures.
We compare Exokernel's protected control transfer (PCT) with L3's IPC. Exo-PCT on the R2000 requires about 35 cycles, whereas L3 takes 250 cycles on a 486 processor for an 8-byte message transfer. If this difference cannot be explained by different functionality and/or average processor performance, there must be an anomaly relevant to µ-kernel design.
Exo-PCT is a "substrate for implementing efficient IPC mechanisms. [It] changes the program counter to an agreed-upon value in the callee, donates the current time-slice to the callee's processor environment, and installs required elements of the callee's processor context." L3-IPC is used for secure communication between potentially untrusted partners; it therefore additionally checks the communication permission (whether the partner is willing to receive a message from the sender and whether no clan borderline is crossed), synchronizes both threads, supports error recovery by send and receive timeouts, and permits complex messages to reduce marshaling costs and IPC frequency. From our experience, extending Exo-PCT accordingly should require no more than 30 additional cycles. (Note that using PCT for a trusted LRPC already costs an additional 18 cycles, see table 2.) Therefore, we assume
about 65 cycles on the R2000. Finally, we must take into
consideration that the cycles of both processors are not
equivalent as far as most-frequently-executed instructions are concerned. Based on SpecInts, roughly 1.4
486-cycles appear to do as much work as one R2000cycle comparing the ve instructions most relevant in
this context (2-op-alu, 3-op-alu, load, branch taken and
not taken) gives 1.6 for well-optimized code. Thus we
estimate that the Exo-IPC would cost up to approx. 100
486-cycles being denitely less than L3's 250 cycles.
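Spelling out the estimate (our own arithmetic, using only the figures quoted above): 35 PCT cycles plus roughly 30 cycles for the added functionality give about 65 R2000-cycles, and converting with the cycle-equivalence factor of 1.6 yields

$$ (35 + 30)\ \text{R2000-cycles} \times 1.6 \ \approx\ 104\ \text{486-cycles} \ \approx\ 100\ \text{486-cycles}, $$

which is the approx. 100 486-cycles used in the comparison with L3's 250 cycles.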
This substantial difference in timing indicates an isolated difference between both processor architectures that strongly influences IPC (and perhaps other kernel mechanisms), but not average programs.
In fact, the 486 processor imposes a high penalty on entering/exiting the kernel and requires a TLB flush per IPC due to its untagged TLB. This costs at least 107 + 49 = 156 cycles. On the other hand, the R2000 has a tagged TLB, i.e. avoids the TLB flush, and needs less than 20 cycles for entering and exiting the kernel.
As a consequence, we have to implement an additional user-address-space multiplexer, we have to modify address-space switch routines, handling of user-supplied addresses, thread control blocks, task control blocks, the IPC implementation and the address-space structure as seen by the kernel. In total, the mentioned changes affect algorithms in about half of all µ-kernel modules.
IPC implementation. Due to reduced associativity, the Pentium caches tend to exhibit increased conflict misses. One simple way to improve cache behaviour during IPC is by restructuring the thread control block data such that it profits from the doubled cache line size. This can be adopted to the 486 kernel, since it has no effect on the 486 and can be implemented transparently to the user.
In the 486 kernel, thread control blocks (including kernel stacks) were page aligned. IPC always accesses 2 control blocks and kernel stacks simultaneously. The cache hardware maps the corresponding data of both control blocks to identical cache addresses. Due to its 4-way associativity, this problem could be ignored on the 486. However, the Pentium's data cache is only 2-way set-associative. A nice optimization is to align thread control blocks no longer on 4K but on 1K boundaries. (1K is the lower bound due to internal reasons.) Then there is a 75% chance that two randomly selected control blocks do not compete in the cache.
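The 75% figure can be checked as follows (our own back-of-the-envelope reasoning, assuming the 8K, 2-way data cache of table 3): each way covers 4K, so a 1K-aligned control block can start at one of 4K/1K = 4 distinct cache positions, and two randomly chosen control blocks collide with probability 1/4:

$$ P(\text{no competition}) = 1 - \frac{1\,\mathrm{K}}{4\,\mathrm{K}} = \frac{3}{4} = 75\%. $$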
Surprisingly, this affects the internal bit-structure of unique thread identifiers supplied by the µ-kernel (see [Liedtke 1993] for details). Therefore, the new kernel cannot simply replace the old one, since (persistent) user programs already hold uids which would become invalid.
5.2 Incompatible Processors
Processors of competing families differ in instruction set, register architecture, exception handling, cache/TLB architecture, protection and memory model. Especially the latter ones radically influence µ-kernel structure. There are systems with
- multi-level page tables,
- hashed page tables,
- (no) reference bits,
- (no) page protection,
- strange page protection11,
- single/multiple page sizes,
- 2^32-, 2^43-, 2^52- and 2^64-byte address spaces,
- flat and segmented address spaces,
- various segment models,
- tagged/untagged TLBs,
- virtually/physically tagged caches.
11 E.g. the 386 ignores write protection in kernel mode; the PowerPC supports read-only in kernel mode, but this implies that the page is seen in user mode as well.
Spin. Spin [Bershad et al. 1994; Bershad et al. 1995]
From the above example, we learn two lessons:
- For well-engineered µ-kernels on different processor architectures, in particular with different memory systems, we should expect isolated timing differences that are not related to overall processor performance.
- Different architectures require processor-specific optimization techniques that even affect the global µ-kernel structure.
To understand the second point, recall that the mandatory 486 TLB flush requires minimization of the number of subsequent TLB misses. The relevant techniques [Liedtke 1993, pp. 179, 182-183] are mostly based on proper address space construction: concentrating processor-internal tables and heavily used kernel data in one page (there is no unmapped memory on the 486), implementing control blocks and kernel stacks as virtual objects, lazy scheduling. In toto, these techniques save 11 TLB misses, i.e. at least 99 cycles on the 486, and are thus inevitable.
Due to its unmapped memory facility and tagged TLB, the mentioned constraint disappears on the R2000. Consequently, the internal structure (address space structure, page fault handling, perhaps control block access and scheduling) of a corresponding kernel can substantially differ from a 486 kernel. If other factors also imply implementing control blocks as physical objects, even the uids will differ between the R2000 (no × pointer size + x) and the 486 kernel (no × control block size + x).
is a new development which tries to extend the Synthesis idea: user-supplied algorithms are translated by a kernel compiler and added to the kernel, i.e. the user may write new system calls. By controlling branches and memory references, the compiler ensures that the newly generated code does not violate kernel or user integrity. This approach reduces kernel-user mode switches and sometimes address space switches. Spin is based on Mach and may thus inherit many of its inefficiencies, which makes it difficult to evaluate performance results. Rescaling them to an efficient µ-kernel with fast kernel-user mode switches and fast IPC is needed. The most crucial problem, however, is the estimation of how an optimized µ-kernel architecture and the requirements coming from a kernel compiler interfere with each other. Kernel architecture and performance might e.g. be affected by the requirement for larger kernel stacks. (A pure µ-kernel needs only a few hundred bytes per kernel stack.) Furthermore, the costs of safety-guaranteeing code have to be related to µ-kernel overhead and to optimal user-level code.
The first published results [Bershad et al. 1995] cannot answer these questions: on an Alpha 21064, 133 MHz, a Spin system call needs nearly twice as many cycles (1600, 12 µs) as the already expensive Mach system call (900, 7 µs). The application measurements show that Mach can be substantially improved by using a kernel compiler; however, it remains open whether this technique can reach or outperform a pure µ-kernel approach like that described here. For example, a simple user-level page-fault handler (1100 µs under Mach) executes in 17 µs under Spin. However, we must take into consideration that in a traditional µ-kernel, the kernel is invoked and left only twice: page fault (enter), message to pager (exit), reply map message (enter + exit). The Spin technique can save only one system call which on this processor should cost less than 1 µs; i.e. with 12 µs the actual Spin overhead is far beyond the ideal traditional overhead of 1+1 µs.
From our experience, we expect a notable gain if a kernel compiler eliminates nested IPC redirection, e.g. when using deep hierarchies of Clans or Custodians [Härtig et al. 1993]. Efficient integration of the kernel compiler technique and appropriate µ-kernel design might be a promising research direction.
Conclusion. µ-kernels form the link between a minimal "µ"-set of abstractions and the bare processor. The performance demands are comparable to those of earlier microprogramming. As a consequence, µ-kernels are inherently not portable. Instead, they are the processor-dependent basis for portable operating systems.
6 Synthesis, Spin, DP-Mach,
Panda, Cache and Exokernel
Synthesis. Henry Massalin's Synthesis operating system [Pu et al. 1988] is another example of a high-performing (and non-portable) kernel. Its distinguishing feature was a kernel-integrated "compiler" which generated kernel code at runtime. For example, when issuing a read pipe system call, the Synthesis kernel generated specialized code for reading out of this pipe and modified the respective invocation. This technique was highly successful on the 68030. However (a good example for non-portability), it would most probably no longer pay on modern processors, because (a) code inflation will degrade cache performance and (b) frequent generation of small code chunks pollutes the instruction cache.
Utah-Mach. Ford and Lepreau [1994] changed Mach IPC semantics to migrating RPC which is based on thread migration between address spaces, similar to the Clouds model [Bernabeu-Auban et al. 1988]. A substantial performance gain was achieved, a factor of 3 to 4.
DP-Mach. DP-Mach [Bryce and Muller 1995] implements multiple domains of protection within one user
address space and offers a protected inter-domain call. The performance results (see table 2) are encouraging. However, although this inter-domain call is highly specialized, it is twice as slow as achievable by a general RPC mechanism. In fact, an inter-domain call needs two kernel calls and two address-space switches. A general RPC requires two additional thread switches and argument transfers12. Apparently, the kernel call and address-space switch costs dominate. Bryce and Muller presented an interesting optimization for small inter-domain calls: when switching back from a very small domain, the TLB is only selectively flushed. Although the effects are rather limited on their host machine (a 486 with only 32 TLB entries), it might become more relevant on processors with larger TLBs. To analyze whether kernel enrichment by inter-domain calls pays, we need e.g. a Pentium implementation and then compare it with a general RPC based on segment switching.
Consequently, the Exokernel interface is architecture dependent, in particular dedicated to software-controlled TLBs. A further difference to our driver-less µ-kernel approach is that Exokernel appears to partially integrate device drivers, in particular for disks, networks and frame buffers.
We believe that dropping the abstractional approach could only be justified by substantial performance gains. Whether these can be achieved remains open (see discussion in section 5.2) until we have well-engineered exo- and abstractional µ-kernels on the same hardware platform. It might then turn out that the right abstractions are even more efficient than securely multiplexing hardware primitives or, on the other hand, that abstractions are too inflexible. We should try to decide these questions by constructing comparable µ-kernels on at least two reference platforms. Such a co-construction will probably also lead to new insights for both approaches.
Panda. The Panda system's [Assenmacher et al.
7 Conclusions
1993] µ-kernel is a further example of a small kernel which delegates as much as possible to user space. Besides its two basic concepts, protection domain and virtual processor, the Panda kernel handles only interrupts and exceptions.
A µ-kernel can provide higher layers with a minimal set of appropriate abstractions that are flexible enough to allow the implementation of arbitrary operating systems and allow exploitation of a wide range of hardware. The presented mechanisms (address space with map, flush and grant operation, threads with IPC and unique identifiers) form such a basis. Multi-level-security systems may additionally need clans or a similar reference monitor concept.
Cache-Kernel. The Cache-kernel [Cheriton and Duda 1994] is also a small and hardware-dependent kernel. In contrast to the Exokernel, it relies on a small fixed (non-extensible) virtual machine. It caches kernels, threads, address spaces and mappings. The term `caching' refers to the fact that the µ-kernel never handles the complete set of e.g. all address spaces, but only a dynamically selected subset. It was hoped that this technique would lead to a smaller µ-kernel interface and also to less µ-kernel code, since it no longer has to deal with special but infrequent cases. In fact, this could be done as well on top of a pure µ-kernel by means of corresponding pagers. (Kernel data structures, e.g. thread control blocks, could be held in virtual memory in the same way as other data.)
Choosing the right abstractions is crucial for both flexibility and performance. Some existing µ-kernels chose inappropriate abstractions, or too many or too specialized and inflexible ones.
Similar to optimizing code generators, µ-kernels must be constructed per processor and are inherently not portable. Basic implementation decisions, most algorithms and data structures inside a µ-kernel are processor dependent. Their design must be guided by performance prediction and analysis. Besides inappropriate basic abstractions, the most frequent mistakes come from insufficient understanding of the combined hardware-software system or inefficient implementation.
Exokernel. In contrast to Spin, the Exokernel En-
gler et al. 1994 Engler et al. 1995] is a small and
hardware-dependent -kernel. In accordance with our
processor-dependency thesis, the exokernel is tailored
to the R2000 and gets excellent performance values
for its primitives. In contrast to our approach, it is
based on the philosophy that a kernel should not provide abstractions but only a minimal set of primitives.
The presented design shows that it is possible to
achieve well performing -kernels through processorspecic implementations of processor-independent abstractions.
Availability

The source code of the L4 µ-kernel, a successor of the L3 µ-kernel, is available for examination and experimentation through the web:
http://borneo.gmd.de/RS/L4.

12 Sometimes, the argument transfer can be omitted. For implementing inter-domain calls, a pager can be used which shares the address spaces of caller and callee such that the trusted callee can access the parameters in the caller's space. E.g. LRPC [Bershad et al. 1989] and NetWare [Major et al. 1994] use a similar technique.
Acknowledgements

Many thanks to Hermann Härtig for discussion and Rich Uhlig for proofreading and stylistic help. Further thanks for reviewing remarks to Dejan Milojicic, some anonymous referees and Sacha Krakowiak for shepherding.

A Address Spaces

An Abstract Model of Address Spaces

We describe address spaces as mappings. σ₀ : V → R ∪ {∅} is the initial address space, where V is the set of virtual pages, R the set of available physical (real) pages and ∅ the nilpage, which cannot be accessed. Further address spaces are defined recursively as mappings σ : V → (Σ × V) ∪ {∅}, where Σ is the set of address spaces. It is convenient to regard each mapping as a one-column table which contains σ(v) for all v ∈ V and can be indexed by v. We denote the elements of this table by σ_v.

All modifications of address spaces are based on the replacement operation: we write σ_v ← x to describe a change of σ at v, precisely:

    σ_v ← x  ≡  flush(σ, v) ; σ_v := x .

A page potentially mapped at v in σ is flushed, and the new value x is copied into σ_v. This operation is internal to the µ-kernel. We use it only for describing the three exported operations.

A subsystem S with address space σ can grant any of its pages v to a subsystem S' with address space σ', provided S' agrees:

    σ'_{v'} ← σ_v ,  σ_v ← ∅ .

Note that S determines which of its pages should be granted, whereas S' determines at which virtual address v' the granted page should be mapped in σ'. The granted page is transferred to σ' and removed from σ.

A subsystem S with address space σ can map any of its pages v to a subsystem S' with address space σ', provided S' agrees:

    σ'_{v'} ← (σ, v) .

In contrast to grant, the mapped page remains in the mapper's space, and a link to the page in the mapper's address space, (σ, v), is stored in the receiving address space σ' instead of transferring the existing link from σ_v to σ'_{v'}. This operation permits constructing address spaces recursively, i.e. new spaces based on existing ones.

Flushing, the reverse operation, can be executed without explicit agreement of the mappees, since they agreed implicitly when accepting the prior map operation. S can flush any of its pages v:

    ∀ σ', v' with σ'_{v'} = (σ, v) :  σ'_{v'} ← ∅ .

Note that ← and flush are defined recursively. Flushing recursively affects also all mappings which are indirectly derived from σ_v.

No cycles can be established by these three operations, since ← flushes the destination prior to copying.

Implementing the Model

At a first glance, deriving the physical address of page v in address space σ seems to be rather complicated and expensive:

    σ(v) =  σ'(v')   if σ_v = (σ', v') ,
            r        if σ_v = r ,
            ∅        if σ_v = ∅ .

Fortunately, a recursive evaluation of σ(v) is never required. The three basic operations guarantee that the physical address of a virtual page will never change, except by flushing. For implementation, we therefore complement each σ by an additional table P, where P_v corresponds to σ_v and holds either the physical address of v or ∅. Mapping and granting then include

    P'_{v'} := P_v ,

and each replacement σ_v ← ∅ invoked by a flush operation includes

    P_v := ∅ .

P_v can always be used instead of evaluating σ(v). In fact, P is equivalent to a hardware page table. µ-kernel address spaces can thus be implemented straightforwardly by means of the hardware address-translation facilities.

The recommended implementation of σ is to use one mapping tree per physical page frame which describes all actual mappings of the frame. Each node contains (P, v), where v is the corresponding virtual page in the address space which is implemented by the page table P.

Assume that a grant, map or flush operation deals with a page v in address space σ to which the page table P is associated. In a first step, the operation selects the according tree by P_v, the physical page. In the next step, it selects the node of the tree that contains (P, v). (This selection can be done by parsing the tree, or in a single step if P_v is extended by a link to the node.) Granting then simply replaces the values stored in the node, and map creates a new child node for storing (P', v'). Flush leaves the selected node unaffected but parses and erases the complete subtree, where P'_{v'} := ∅ is executed for each node (P', v') in the subtree.
References
Assenmacher, H., Breitbach, T., Buhler, P., Hubsch, V., and Schwarz, R. 1993. The Panda system architecture – a pico-kernel approach. In 4th Workshop on Future Trends of Distributed Computing Systems, Lisboa, Portugal, pp. 470–476.
Bernabeu-Auban, J. M., Hutto, P. W., and Khalidi, Y. A. 1988. The architecture of the Ra kernel. Tech. Rep. GIT-ICS-87/35 (Jan.), Georgia Institute of Technology, Atlanta, GA.
Bershad, B. N., Anderson, T. E., Lazowska, E. D., and Levy, H. M. 1989. Lightweight remote procedure call. In 12th ACM Symposium on Operating System Principles (SOSP), Litchfield Park, AZ, pp. 102–113.
Bershad, B. N., Chambers, C., Eggers, S., Maeda, C., McNamee, D., Pardyak, P., Savage, S., and Sirer, E. G. 1994. Spin – an extensible microkernel for application-specific operating system services. In 6th SIGOPS European Workshop, Schloß Dagstuhl, Germany, pp. 68–71.
Bershad, B. N., Savage, S., Pardyak, P., Sirer, E. G., Fiuczynski, M., Becker, D., Eggers, S., and Chambers, C. 1995. Extensibility, safety and performance in the Spin operating system. In 15th ACM Symposium on Operating System Principles (SOSP), Copper Mountain Resort, CO, pp. xx–xx.
Brinch Hansen, P. 1970. The nucleus of a multiprogramming system. Commun. ACM 13, 4 (April), 238–241.
Bryce, C. and Muller, G. 1995. Matching micro-kernels to modern applications using fine-grained memory protection. In IEEE Symposium on Parallel Distributed Systems, San Antonio, TX.
Cao, P., Felten, E. W., and Li, K. 1994. Implementation and performance of application-controlled file caching. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 165–178.
Chen, J. B. and Bershad, B. N. 1993. The impact of operating system structure on memory system performance. In 14th ACM Symposium on Operating System Principles (SOSP), Asheville, NC, pp. 120–133.
Cheriton, D. R. and Duda, K. J. 1994. A caching model of operating system kernel functionality. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 179–194.
Digital Equipment Corp. 1992. DECChip 21064-AA RISC Microprocessor Data Sheet. Digital Equipment Corp.
Draves, R. P., Bershad, B. N., Rashid, R. F., and Dean, R. W. 1991. Using continuations to implement thread management and communication in operating systems. In 13th ACM Symposium on Operating System Principles (SOSP), Pacific Grove, CA, pp. 122–136.
Engler, D., Kaashoek, M. F., and O'Toole, J. 1994. The operating system kernel as a secure programmable machine. In 6th SIGOPS European Workshop, Schloß Dagstuhl, Germany, pp. 62–67.
Engler, D., Kaashoek, M. F., and O'Toole, J. 1995. Exokernel, an operating system architecture for application-level resource management. In 15th ACM Symposium on Operating System Principles (SOSP), Copper Mountain Resort, CO, pp. xx–xx.
Ford, B. 1993. Private communication.
Ford, B. and Lepreau, J. 1994. Evolving Mach 3.0 to a migrating thread model. In Usenix Winter Conference, CA, pp. 97–114.
Gasser, M., Goldstein, A., Kaufmann, C., and Lampson, B. 1989. The Digital distributed system security architecture. In 12th National Computer Security Conference (NIST/NCSC), Baltimore, pp. 305–319.
Hamilton, G. and Kougiouris, P. 1993. The Spring nucleus: A microkernel for objects. In Summer Usenix Conference, Cincinnati, OH, pp. 147–160.
Härtig, H., Kowalski, O., and Kühnhauser, W. 1993. The BirliX security architecture. Journal of Computer Security 2, 1, 5–21.
Hildebrand, D. 1992. An architectural overview of QNX. In 1st Usenix Workshop on Micro-kernels and Other Kernel Architectures, Seattle, WA, pp. 113–126.
Hill, M. D. and Smith, A. J. 1989. Evaluating associativity in CPU caches. IEEE Transactions on Computers 38, 12 (Dec.), 1612–1630.
Intel Corp. 1990. i486 Microprocessor Programmer's Reference Manual. Intel Corp.
Intel Corp. 1993. Pentium Processor User's Manual, Volume 3: Architecture and Programming Manual. Intel Corp.
Kane, G. and Heinrich, J. 1992. MIPS RISC Architecture. Prentice Hall.
Kessler, R. and Hill, M. D. 1992. Page placement algorithms for large real-indexed caches. ACM Transactions on Computer Systems 10, 4 (Nov.), 11–22.
Khalidi, Y. A. and Nelson, M. N. 1993. Extensible file systems in Spring. In 14th ACM Symposium on Operating System Principles (SOSP), Asheville, NC, pp. 1–14.
Kühnhauser, W. E. 1995. A paradigm for user-defined security policies. In Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany.
Lee, C. H., Chen, M. C., and Chang, R. C. 1994. HiPEC: high performance external virtual memory caching. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 153–164.
Liedtke, J. 1992. Clans & chiefs. In 12. GI/ITG-Fachtagung Architektur von Rechensystemen, Kiel, pp. 294–305. Springer.
Liedtke, J. 1993. Improving IPC by kernel design. In 14th ACM Symposium on Operating System Principles (SOSP), Asheville, NC, pp. 175–188.
Liedtke, J. 1995. Improved address-space switching on Pentium processors by transparently multiplexing user address spaces. Arbeitspapiere der GMD No. 933 (Sept.), GMD – German National Research Center for Information Technology, Sankt Augustin.
Major, D., Minshall, G., and Powell, K. 1994. An overview of the NetWare operating system. In Winter Usenix Conference, San Francisco, CA.
Mogul, J. C. and Borg, A. 1991. The effect of context switches on cache performance. In 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Santa Clara, CA, pp. 75–84.
Motorola Inc. 1993. PowerPC 601 RISC Microprocessor User's Manual. Motorola Inc.
Nagle, D., Uhlig, R., Mudge, T., and Sechrest, S. 1994. Optimal allocation of on-chip memory for multiple-API operating systems. In 21st Annual International Symposium on Computer Architecture (ISCA), Chicago, IL, pp. 358–369.
Ousterhout, J. K. 1990. Why aren't operating systems getting faster as fast as hardware? In Usenix Summer Conference, Anaheim, CA, pp. 247–256.
Pu, C., Massalin, H., and Ioannidis, J. 1988. The Synthesis kernel. Computing Systems 1, 1 (Jan.), 11–32.
Romer, T. H., Lee, D. L., Bershad, B. N., and Chen, B. 1994. Dynamic page mapping policies for cache conflict resolution on standard hardware. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 255–266.
Rozier, M., Abrossimov, A., Armand, F., Boule, I., Gien, M., Guillemont, M., Herrmann, F., Kaiser, C., Langlois, S., Leonard, P., and Neuhauser, W. 1988. Chorus distributed operating system. Computing Systems 1, 4, 305–370.
Schröder-Preikschat, W. 1994. The Logical Design of Parallel Operating Systems. Prentice Hall.
Schroeder, M. D. and Burroughs, M. 1989. Performance of the Firefly RPC. In 12th ACM Symposium on Operating System Principles (SOSP), Litchfield Park, AZ, pp. 83–90.
van Renesse, R., van Staveren, H., and Tanenbaum, A. S. 1988. Performance of the world's fastest distributed operating system. Operating Systems Review 22, 4 (Oct.), 25–34.
Wulf, W., Cohen, E., Corwin, W., Jones, A., Levin, R., Pierson, C., and Pollack, F. 1974. Hydra: The kernel of a multiprocessor operating system. Commun. ACM 17, 6 (July), 337–345.
Yokote, Y. 1993. Kernel-structuring for object-oriented operating systems: The Apertos approach. In International Symposium on Object Technologies for Advanced Software. Springer.