On µ-Kernel Construction
Jochen Liedtke
GMD — German National Research Center for Information Technology [email protected]
Abstract
From a software-technology point of view, the µ-kernel concept is superior to large integrated kernels. On the other hand, it is widely believed that (a) µ-kernel based systems are inherently inefficient and (b) they are not sufficiently flexible. Contrary to this belief, we show and support by documentary evidence that inefficiency and inflexibility of current µ-kernels is not inherited from the basic idea but mostly from overloading the kernel and/or from improper implementation.
Based on functional reasons, we describe some concepts which must be implemented by a µ-kernel and illustrate their flexibility. Then, we analyze the performance-critical points. We show what performance is achievable, that the efficiency is sufficient with respect to macro-kernels and why some published contradictory measurements are not evident. Furthermore, we describe some implementation techniques and illustrate why µ-kernels are inherently not portable, although they improve the portability of the whole system.
GMD SET–RS, 53754 Sankt Augustin, Germany
Copyright (c) 1995 by the Association for Computing Machinery, Inc. (ACM)
1 Rationale
µ-kernel based systems have been built long before the term itself was introduced, e.g. by Brinch Hansen [1970] and Wulf et al. [1974]. Traditionally, the word 'kernel' is used to denote the part of the operating system that is mandatory and common to all other software. The basic idea of the µ-kernel approach is to minimize this part, i.e. to implement outside the kernel whatever possible.
The software technological advantages of this approach are obvious:
(a) A clear µ-kernel interface enforces a more modular system structure.1
(b) Servers can use the mechanisms provided by the µ-kernel like any other user program. Server malfunction is as isolated as any other user program's malfunction.
(c) The system is more flexible and tailorable. Different strategies and APIs, implemented by different servers, can coexist in the system.
Although much effort has been invested in µ-kernel construction, the approach is not (yet) generally accepted. This is due to the fact that most existing kernels do not perform sufficiently well. Lack of efficiency also heavily restricts flexibility, since important mechanisms and principles cannot be used in practice due to poor performance. In some cases, the µ-kernel interface has been weakened and special servers have been re-integrated into the kernel to regain efficiency.
It is widely believed that the mentioned inefficiency (and thus inflexibility) is inherent to the µ-kernel approach. Folklore holds that increased user-kernel mode and address-space switches are responsible. At first glance, published performance measurements seem to support this view.

1 Although many macro-kernels tend to be less modular, there are exceptions from this rule, e.g. Chorus [Rozier et al. 1988] and Peace [Schröder-Preikschat 1994].
In fact, the cited performance studies measured the performance of a particular µ-kernel based system without analyzing the reasons which limit efficiency. We can only guess whether it is caused by the µ-kernel approach, by the concepts implemented by this particular µ-kernel or by the implementation of the µ-kernel. Since it is known that conventional IPC, one of the traditional µ-kernel bottlenecks, can be implemented an order of magnitude faster2 than believed before, the question is still open. It might be possible that we are still not applying the appropriate construction techniques.
For the above reasons, we feel that a conceptual analysis is needed which derives µ-kernel concepts from pure functionality requirements (section 2) and discusses achievable performance (section 4) and flexibility (section 3). Further sections discuss portability (section 5) and the chances of some new developments (section 6).
2 Some µ-Kernel Concepts
In this section, we reason about the minimal concepts or "primitives" that a µ-kernel should implement.3 The determining criterion used is functionality, not performance. More precisely, a concept is tolerated inside the µ-kernel only if moving it outside the kernel, i.e. permitting competing implementations, would prevent the implementation of the system's required functionality.
We assume that the target system has to support interactive and/or not completely trustworthy applications,
i.e. it has to deal with protection. We further assume that
the hardware implements page-based virtual memory.
One inevitable requirement for such a system is that a programmer must be able to implement an arbitrary subsystem S in such a way that it cannot be disturbed or corrupted by other subsystems S′. This is the principle of independence: S can give guarantees independent of S′. The second requirement is that other subsystems must be able to rely on these guarantees. This is the principle of integrity: there must be a way for S1 to address S2 and to establish a communication channel which can neither be corrupted nor eavesdropped by S′.

2 Short user-to-user cross-address space IPC in L3 [Liedtke 1993] is 22 times faster than in Mach, both running on a 486. On the R2000, the specialized Exo-tlrpc [Engler et al. 1995] is 30 times faster than Mach's general RPC.

3 Proving minimality, necessity and completeness would be nice but is impossible, since there is no agreed-upon metric and everything is Turing-equivalent.
Provided hardware and kernel are trustworthy, further security services, like those described by Gasser
et al. [1989], can be implemented by servers. Their
integrity can be ensured by system administration or by
user-level boot servers. For illustration: a key server
should deliver public-secret RSA key pairs on demand.
It should guarantee that each pair has the desired RSA
property and that each pair is delivered only once and
only to the demander. The key server can only be realized if there are mechanisms which (a) protect its code
and data, (b) ensure that nobody else reads or modifies
the key and (c) enable the demander to check whether
the key comes from the key server. Finding the key
server can be done by means of a name server and
checked by public key based authentication.
2.1 Address Spaces
At the hardware level, an address space is a mapping
which associates each virtual page to a physical page
frame or marks it ‘non-accessible’. For the sake of
simplicity, we omit access attributes like read-only and
read/write. The mapping is implemented by TLB hardware and page tables.
The µ-kernel, the mandatory layer common to all subsystems, has to hide the hardware concept of address spaces, since otherwise, implementing protection would be impossible. The µ-kernel concept of address spaces must be tamed, but must permit the implementation of arbitrary protection (and non-protection) schemes on top of the µ-kernel. It should be simple and similar to the hardware concept.
The basic idea is to support recursive construction of address spaces outside the kernel. By magic, there is one address space σ0 which essentially represents the physical memory and is controlled by the first subsystem S0. At system start time, all other address spaces are empty. For constructing and maintaining further address spaces on top of σ0, the µ-kernel provides three operations:
Grant. The owner of an address space can grant any
of its pages to another space, provided the recipient
agrees. The granted page is removed from the granter’s
address space and included into the grantee’s address
space. The important restriction is that instead of physical page frames, the granter can only grant pages which
are already accessible to itself.
Map. The owner of an address space can map any of its
pages into another address space, provided the recipient
agrees. Afterwards, the page can be accessed in both
address spaces. In contrast to granting, the page is not
removed from the mapper's address space. As in the granting case, the mapper can only map pages which it can already access itself.
Flush. The owner of an address space can flush any of
its pages. The flushed page remains accessible in the
flusher’s address space, but is removed from all other
address spaces which had received the page directly or
indirectly from the flusher. Although explicit consent
of the affected address-space owners is not required, the
operation is safe, since it is restricted to own pages. The
users of these pages already agreed to accept a potential
flushing, when they received the pages by mapping or
granting.
Appendix A contains a more precise definition of
address spaces and the above three operations.
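To make the interface concrete, the three operations can be pictured as a small set of kernel calls. The C declarations below are only an illustrative sketch; the names, the fpage type and the way the recipient's agreement is expressed are assumptions made here, not the actual L3/L4 interface.

  /* Illustrative prototypes for the three exported operations (hypothetical names). */
  #include <stdint.h>

  typedef uint64_t thread_id_t;        /* recipient that has agreed to receive the page */

  typedef struct {
      uintptr_t base;                  /* virtual base address in the caller's space */
      unsigned  log2_size;             /* size of the region, as a power of two      */
      unsigned  rights;                /* subset of the caller's access rights       */
  } fpage_t;

  /* Map: the page stays in the caller's space and additionally appears
     in the recipient's space at recv_base. */
  int as_map(thread_id_t recipient, fpage_t page, uintptr_t recv_base);

  /* Grant: like map, but the page is removed from the caller's space. */
  int as_grant(thread_id_t recipient, fpage_t page, uintptr_t recv_base);

  /* Flush: recursively withdraws the page from every space that received it,
     directly or indirectly, from the caller; the caller keeps the page. */
  int as_flush(fpage_t page);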
[Figure 1: A Granting Example. Pagers f1 and f2 run on top of the standard pager, which operates on the disk; the combining pager F serves user A and user X. A page is mapped from the standard pager to f1, mapped from f1 to F, and granted from F to user A.]
In general, granting is used when page mappings should be passed through a controlling subsystem without burdening the controller's address space with all pages mapped through it.
The model can easily be extended to access rights on pages. Mapping and granting copy the source page's access rights or a subset of them, i.e. they can restrict the access but not widen it. Special flushing operations may remove only specified access rights.
Reasoning
The described address-space concept leaves memory management and paging outside the µ-kernel; only the grant, map and flush operations are retained inside the kernel. Mapping and flushing are required to implement memory managers and pagers on top of the µ-kernel.
situations: consider a pager F which combines two
underlying file systems (implemented as pagers f1 and
f2 , operating on top of the standard pager) into one
unified file system (see figure 1). In this example, f1
maps one of its pages to F which grants the received
page to user A. By granting, the page disappears from
F so that it is then available only in f1 and user A; the
resulting mappings are denoted by the thin line: the
page is mapped in user A, f1 and the standard pager.
Flushing the page by the standard pager would affect f1
and user A, flushing by f1 only user A. F is not affected
by either flush (and cannot flush itself), since it used the
page only transiently. If F had used mapping instead
of granting, it would have needed to replicate most of
the bookkeeping which is already done in f1 and f2 .
Furthermore, granting avoids a potential address-space
overflow of F .
I/O

An address space is the natural abstraction for incorporating device ports. This is obvious for memory-mapped I/O, but I/O ports can also be included. The granularity of control depends on the given processor. The 386 and its successors permit control per port (one very small page per port) but no mapping of port addresses (they enforce mappings with v = v′); the PowerPC uses pure memory-mapped I/O, i.e. device ports can be controlled and mapped with 4K granularity.
Controlling I/O rights and device drivers is thus also done by memory managers and pagers on top of the µ-kernel.
2.2 Threads and IPC
A thread is an activity executing inside an address space. A thread is characterized by a set of registers, including at least an instruction pointer, a stack pointer and state information. A thread's state also includes the address space σ(t) in which t currently executes. This dynamic or static association to address spaces is the decisive reason for including the thread concept (or something equivalent) in the µ-kernel. To prevent corruption of address spaces, all changes to a thread's address space (σ(t) := σ′) must be controlled by the kernel. This implies that the µ-kernel includes the notion of some t that represents the above-mentioned activity. In some operating systems, there may be additional reasons for introducing threads as a basic abstraction, e.g. preemption. Note that choosing a concrete thread concept remains subject to further OS-specific design decisions.
Consequently, cross-address-space communication, also called inter-process communication (IPC), must be supported by the µ-kernel. The classical method is transferring messages between threads by the µ-kernel.
IPC always enforces a certain agreement between
both parties of a communication: the sender decides
to send information and determines its contents; the
receiver determines whether it is willing to receive information and is free to interpret the received message.
Therefore, IPC is not only the basic concept for communication between subsystems but also, together with
address spaces, the foundation of independence.
Other forms of communication, remote procedure
call (RPC) or controlled thread migration between address spaces, can be constructed from message-transfer
based IPC.
Note that the grant and map operations (section 2.1)
need IPC, since they require an agreement between
granter/mapper and recipient of the mapping.
Supervising IPC
Architectures like those described by Yokote [1993] and
Kühnhauser [1995] need not only supervise the memory
of subjects but also their communication. This can
be done by introducing either communication channels
or Clans [Liedtke 1992] which allow supervision of
IPC by user-defined servers. Such concepts are not
discussed here, since they do not belong to the minimal
set of concepts. We only remark that Clans do not burden the µ-kernel: their base cost is 2 cycles per IPC.
Interrupts
The natural abstraction for hardware interrupts is the
IPC message. The hardware is regarded as a set of
threads which have special thread ids and send empty
messages (only consisting of the sender id) to associated
software threads. A receiving thread concludes from
the message source id, whether the message comes from
a hardware interrupt and from which interrupt:
driver thread:
  do
    wait for (msg, sender) ;
    if sender = my hardware interrupt
      then read/write io ports ;
           reset hardware interrupt
      else ...
    fi
  od .
Transforming the interrupts into messages must be done by the kernel, but the µ-kernel is not involved in device-specific interrupt handling. In particular, it does not know anything about the interrupt semantics.
On some processors, resetting the interrupt is a device
specific action which can be handled by drivers at user
level. The iret-instruction then is used solely for popping status information from the stack and/or switching
back to user mode and can be hidden by the kernel.
However, if a processor requires a privileged operation for releasing an interrupt, the kernel executes this
action implicitly when the driver issues the next IPC
operation.
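The same loop can be sketched in C. The kernel-call names, the message layout and the interrupt id are assumptions for illustration; only the structure (interrupts delivered as messages from a special sender id, the acknowledgement deferred to the driver's next IPC) follows the text.

  /* User-level device driver: hardware interrupts arrive as ordinary IPC
     messages from a special "hardware" sender id (hypothetical API names). */
  #include <stdint.h>

  typedef uint64_t thread_id_t;
  typedef struct { uint32_t word[8]; } msg_t;

  /* assumed kernel interface (declarations only) */
  extern int         ipc_wait(thread_id_t *sender, msg_t *msg);
  extern thread_id_t MY_HW_INTERRUPT;    /* interrupt id assigned to this driver */

  static void handle_device_io(void)  { /* read/write the device's I/O ports */ }
  static void serve_client(thread_id_t c, const msg_t *m) { (void)c; (void)m; }

  void driver_thread(void)
  {
      thread_id_t sender;
      msg_t msg;

      for (;;) {
          ipc_wait(&sender, &msg);             /* blocks until a message arrives */
          if (sender == MY_HW_INTERRUPT) {
              handle_device_io();
              /* if a privileged "end of interrupt" is required, the kernel
                 performs it implicitly on this thread's next IPC */
          } else {
              serve_client(sender, &msg);      /* ordinary client request */
          }
      }
  }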
2.3 Unique Identifiers
A µ-kernel must supply unique identifiers (uid) for something, either for threads or tasks or communication channels. Uids are required for reliable and efficient local communication. If S1 wants to send a message to S2, it needs to specify the destination S2 (or some channel leading to S2). Therefore, the µ-kernel must know which uid relates to S2. On the other hand, the receiver S2 wants to be sure that the message comes from S1. Therefore the identifier must be unique, both in space and time.
In theory, cryptography could also be used. In practice, however, enciphering messages for local communication is far too expensive and the kernel must be trusted anyway. S2 can also not rely on purely user-supplied capabilities, since S1 or some other instance could duplicate and pass them to untrusted subsystems without control of S2.
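One common way to make identifiers unique in time as well as in space is to combine a table index with a generation count that is incremented whenever the slot is reused. The layout below is purely illustrative; the text only requires uniqueness, not this particular encoding.

  /* Hypothetical uid encoding: table index plus generation number. */
  #include <stdint.h>
  #include <stdbool.h>

  typedef struct {
      uint32_t index;          /* slot in the kernel's thread table        */
      uint32_t generation;     /* bumped each time the slot is reused      */
  } thread_uid_t;

  static inline bool uid_equal(thread_uid_t a, thread_uid_t b)
  {
      /* a stale uid held by some subsystem never matches a newly created thread */
      return a.index == b.index && a.generation == b.generation;
  }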
3 Flexibility
To illustrate the flexibility of the basic concepts, we sketch some applications which typically belong to the basic operating system but can easily be implemented on top of the µ-kernel. In this section, we show the principal flexibility of a µ-kernel. Whether it is really as flexible in practice strongly depends on the achieved efficiency of the µ-kernel. The latter performance topic is discussed in section 4.
Memory Manager. A server managing the initial address space σ0 is a classical main memory manager, but outside the µ-kernel. Memory managers can easily be stacked: M0 maps or grants parts of the physical memory (σ0) to σ1, controlled by M1, and other parts to σ2, controlled by M2. Now we have two coexisting main memory managers.
Pager. A pager may be integrated with a memory manager or use a memory managing server. Pagers use the µ-kernel's grant, map and flush primitives. The remaining interfaces, pager–client, pager–memory server and pager–device driver, are completely based on IPC and are user-level defined.
Pagers can be used to implement traditional paged virtual memory and file/database mapping into user address spaces, as well as unpaged resident memory for device drivers and/or real-time systems. Stacked pagers, i.e. multiple layers of pagers, can be used for combining access control with existing pagers or for combining various pagers (e.g. one per disk) into one composed object. User-supplied paging strategies [Lee et al. 1994; Cao et al. 1994] are handled at the user level and are in no way restricted by the µ-kernel. Stacked file systems [Khalidi and Nelson 1993] can be realized accordingly.
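A user-level pager then is little more than a loop that receives page-fault messages and answers them with a map. The sketch below assumes hypothetical call names, a 4K page size and a simple rights flag; it is not the real L3/L4 pager protocol.

  /* Sketch of a user-level pager thread: page faults arrive as IPC messages
     (sent by the kernel on behalf of the faulting thread) and are answered
     by mapping a page. */
  #include <stdint.h>

  typedef uint64_t thread_id_t;

  typedef struct {
      uintptr_t fault_addr;    /* faulting virtual address  */
      int       is_write;      /* nonzero for a write fault */
  } fault_msg_t;

  /* assumed interface (declarations only) */
  extern int ipc_wait_fault(thread_id_t *faulter, fault_msg_t *f);
  extern int ipc_reply_map(thread_id_t faulter, uintptr_t page_in_pager_space,
                           uintptr_t dest_addr, int writable);
  extern uintptr_t backing_store_fetch(thread_id_t t, uintptr_t addr);

  void pager_thread(void)
  {
      thread_id_t faulter;
      fault_msg_t f;

      for (;;) {
          ipc_wait_fault(&faulter, &f);
          /* the paging policy lives entirely at user level: choose or evict a
             frame, read the data from disk via a driver thread, etc. */
          uintptr_t page = backing_store_fetch(faulter, f.fault_addr);
          /* the reply carries a map item; the faulting thread resumes afterwards */
          ipc_reply_map(faulter, page, f.fault_addr & ~(uintptr_t)0xFFF, f.is_write);
      }
  }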
Multimedia Resource Allocation. Multimedia and
other real-time applications require memory resources
to be allocated in a way that allows predictable execution times. The above mentioned user-level memory managers and pagers permit e.g. fixed allocation
of physical memory for specific data or locking data in
memory for a given time.
Note that resource allocators for multimedia and for
timesharing can coexist. Managing allocation conflicts
is part of the servers’ jobs.
Device Driver. A device driver is a process which
directly accesses hardware I/O ports mapped into its
address space and receives messages from the hardware (interrupts) through the standard IPC mechanism.
Device-specific memory, e.g. a screen, is handled by
means of appropriate memory managers. Compared to
other user-level processes, there is nothing special about
a device driver. No device driver has to be integrated into the µ-kernel.4
Second Level Cache and TLB. Improving the hit
rates of a secondary cache by means of page allocation
or reallocation [Kessler and Hill 1992; Romer et al.
1994] can be implemented by means of a pager which
applies some cache-dependent (hopefully conflict reducing) policy when allocating virtual pages in physical
memory.
In theory, even a software TLB handler could be implemented like this. In practice, the first-level TLB handler will be implemented in the hardware or in the µ-kernel. However, a second-level TLB handler, e.g. handling misses of a hashed page table, might be implemented as a user-level server.
Remote Communication. Remote IPC is implemented by communication servers which translate local messages to external communication protocols and
vice versa. The communication hardware is accessed
by device drivers. If special sharing of communication
buffers and user address space is required, the communication server will also act as a special pager for the
client. The µ-kernel is not involved.
Unix Server. Unix5 system calls are implemented by
IPC. The Unix server can act as a pager for its clients
and also use memory sharing for communicating with
its clients. The Unix server itself can be pageable or
resident.
Conclusion. A small set of µ-kernel concepts leads to abstractions which stress flexibility, provided they perform well enough. The only things which cannot be implemented on top of these abstractions are the processor architecture, registers, first-level caches and first-level TLBs.
4 In general, there is no reason for integrating boot drivers into the kernel. The booter, e.g. located in ROM, simply loads a bit image into memory that contains the micro-kernel and perhaps some set of initial pagers and drivers (running in user mode and not linked but simply appended to the kernel). Afterwards, the boot drivers are no longer used.
5
Unix is a registered trademark of UNIX System Laboratories.
4 Performance, Facts & Rumors
4.1 Switching Overhead
It is widely believed that switching between kernel and
user mode, between address spaces and between threads
is inherently expensive. Some measurements seem to
support this belief.
4.1.1 Kernel–User Switches
Ousterhout [1990] measured the costs for executing the "null" kernel call getpid. Since the real getpid operation consists only of a few loads and stores, this method measures the basic costs of a kernel call. Normalized to a hypothetical machine with a 10-MIPS rating (10 × VAX 11/780, or roughly a 486 at 50 MHz), he showed that most machines need 20–30 µs per getpid; one required even 63 µs. Corroborating these results, we measured 18 µs per Mach6 µ-kernel call get_self_thread. In fact, the measured kernel-call costs are high.
For analyzing the measured costs, our argument is based on a 486 (50 MHz) processor. We take an x86 processor because kernel–user mode switches are extremely expensive on these processors. In contrast to the worst-case processor, we use a best-case measurement for discussion, 18 µs for Mach on a 486/50.
The measured costs per kernel call are 18 × 50 = 900 cycles. The bare machine instruction for entering kernel mode costs 71 cycles, followed by an additional 36 cycles for returning to user mode. These two instructions switch between the user and kernel stack and push/pop the flag register and instruction pointer. 107 cycles (about 2 µs) is therefore a lower bound on kernel–user mode switches. The remaining 800 or more cycles are pure kernel overhead. By this term, we denote all cycles which are solely due to the construction of the kernel, no matter whether they are spent in executing instructions (800 cycles ≈ 500 instructions) or in cache and TLB misses (800 cycles ≈ 270 primary cache misses or ≈ 90 TLB misses). We have to conclude that the measured kernels do a lot of work when entering and exiting the kernel. Note that this work by definition has no net effect.
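This kind of measurement is easy to reproduce with a cycle counter. The sketch below uses the x86 rdtsc instruction (available from the Pentium onwards; the original 486 measurements used other timers) and the Linux getpid syscall as the "null" call; it illustrates only the method, not the original measurement setup.

  /* Average cost of a "null" kernel call in cycles (Linux, x86, illustration only). */
  #include <stdio.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  static inline uint64_t rdtsc(void)
  {
      uint32_t lo, hi;
      __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
      return ((uint64_t)hi << 32) | lo;
  }

  int main(void)
  {
      enum { N = 100000 };
      uint64_t start = rdtsc();
      for (int i = 0; i < N; i++)
          syscall(SYS_getpid);                 /* the "null" system call */
      uint64_t cycles = rdtsc() - start;
      printf("about %llu cycles per kernel call\n",
             (unsigned long long)(cycles / N));
      return 0;
  }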
Is an 800 cycle kernel overhead really necessary? The answer is no. Empirical proof: L3 [Liedtke 1993] has a minimal kernel overhead of 15 cycles. If the µ-kernel call is executed infrequently enough, it may increase by up to 57 additional cycles (3 TLB misses, 10 cache misses). The complete L3 kernel call costs are thus 123 to 180 cycles, mostly less than 3 µs.

6 Mach 3.0, NORMA MK 13
The L3 µ-kernel is process oriented, uses a kernel stack per thread and supports persistent user processes (i.e. the kernel can be exchanged without affecting the remaining system, even if a process actually resides in kernel mode). Therefore, it should be possible for any other µ-kernel to achieve a comparably low kernel call overhead on the same hardware.
Other processors may require a slightly higher overhead, but they offer substantially cheaper basic operations for entering and leaving kernel mode. From an architectural point of view, calling the kernel from user mode is simply an indirect call, complemented by a stack switch and setting the internal 'kernel' bit to permit privileged operations. Accordingly, returning from kernel mode is a normal return operation complemented by switching back to the user stack and resetting the 'kernel' bit. If the processor has different stack pointer registers for user and kernel stack, the stack switching costs can be hidden. Conceptually, entering and leaving kernel mode can perform exactly like a normal indirect call and return instruction (which do not rely on branch prediction). Ideally, this means 2 + 2 = 4 cycles on a 1-issue processor.
Conclusion. Compared to the theoretical minimum, kernel–user mode switches are costly on some processors. Compared to existing kernels, however, they can be improved 6 to 10 times by appropriate µ-kernel construction. Kernel–user mode switches are not a serious conceptual problem but an implementational one.
4.1.2 Address Space Switches
Folklore also considers address-space switches as
costly. All measurements known to the author and
related to this topic deal with combined thread and
address-space switch costs. Therefore, in this section,
we analyze only the architectural processor costs for
pure address-space switching. The combined measurements are discussed together with thread switching.
Most modern processors use a physically indexed
primary cache which is not affected by address-space
switching. Switching the page table is usually very
cheap: 1 to 10 cycles. The real costs are determined by
the TLB architecture.
Some processors (e.g. the Mips R4000) use tagged TLBs, where each entry contains not only the virtual page address but also the address-space id. Switching the address space is thus transparent to the TLB and costs no additional cycles. However, address-space switching may induce indirect costs, since shared pages occupy one TLB entry per address space. Provided that the µ-kernel (shared by all address spaces) has a small working set and that there are enough TLB entries, the problem should not be serious. However, we cannot support this empirically, since we do not know an appropriate µ-kernel running on such a processor.
Most current processors (e.g. 486, Pentium, PowerPC and Alpha) include untagged TLBs. An address-space switch thus requires a TLB flush. The real costs are determined by the TLB load operations which are required to re-establish the current working set later. If the working set consists of n pages, the TLB is fully associative, has s entries and a TLB miss costs m cycles, at most min(n, s) · m cycles are required in total.
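The bound can be written down directly; the small helper below merely restates min(n, s) · m in code, as a sketch.

  /* Worst-case TLB-reload cost after an address-space switch. */
  static unsigned tlb_reload_bound(unsigned n /* working-set pages */,
                                   unsigned s /* TLB entries       */,
                                   unsigned m /* cycles per miss   */)
  {
      unsigned misses = (n < s) ? n : s;   /* at most min(n, s) entries to refill */
      return misses * m;
  }
  /* e.g. tlb_reload_bound(32 + 64, 96, 9) == 864, the Pentium example below */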
Apparently, larger untagged TLBs lead to a performance problem. For example, completely reloading the Pentium's data and code TLBs requires at least (32 + 64) × 9 = 864 cycles. Therefore, intercepting a program every 100 µs could imply an overhead of up to 9%. Although using the complete TLB is unrealistic7, this worst-case calculation shows that switching page tables may become critical in some situations.
Fortunately, this is not a problem, since on the Pentium and PowerPC, address-space switches can be handled differently. The PowerPC architecture includes segment registers which can be controlled by the µ-kernel and offer an additional address translation facility from the local 2³²-byte address space to a global 2⁵²-byte space. If we regard the global space as a set of one million local spaces, address-space switches can be implemented by reloading the segment registers instead of switching the page table. With 29 cycles for 3.5 GB or 12 cycles for 1 GB of segment switching, the overhead is low compared to the no longer required TLB flush. In fact, we have a tagged TLB.
Things are not quite as easy on the Pentium or the 486. Since segments are mapped into a 2³²-byte space, mapping multiple user address spaces into one linear space must be handled dynamically and depends on the actually used sizes of the active user address spaces. The corresponding implementation technique [Liedtke 1995] is transparent to the user and removes the potential performance bottleneck. The address-space switch overhead then is 15 cycles on the Pentium and 39 cycles on the 486.

7 Both TLBs are 4-way set-associative. Working sets which are not compact in the virtual address space usually imply some conflicts, so that only about half of the TLB entries are used simultaneously. Furthermore, a working set of 64 data pages will most likely lead to cache thrashing: in the best case, the cache supports 4 × 32 bytes per page. Since the cache is only 2-way set-associative, probably only 1 or 2 cache entries can be used per page in practice.
To understand why the restriction to a 2³²-byte global space is not crucial to performance, one has to note that address spaces which are used only for very short periods and with small working sets are effectively very small in most cases, say 1 MB or less for a device driver. For example, we can multiplex one 3 GB user address space with 8 user spaces of 64 MB and additionally 128 user spaces of 1 MB. The trick is to share the smaller spaces with all large 3 GB spaces. Then any address-space switch to a medium or small space is always fast. Switching between two large address spaces is uncritical anyway, since switching between two large working sets implies TLB and cache miss costs, no matter whether the two programs execute in the same or in different address spaces.
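As a quick check of the arithmetic (the text fixes only the 3 GB, 64 MB and 1 MB sizes; the exact split of the remaining space is our illustration):

  8 × 64 MB + 128 × 1 MB = 640 MB, and 3 GB + 640 MB = 3.625 GB,

which fits into one 2³²-byte (4 GB) linear space with room left over for the kernel.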
Table 1 shows the page-table switch and segment switch overhead for several processors. For a TLB miss, the minimal and maximal cycles are given (provided that no referenced or modified bits need updating). In the case of the 486, Pentium and PowerPC, this depends on whether the corresponding page-table entry is found in the cache or not. As a minimal working set, we assume 4 pages. For the maximum case, we exclude 4 pages from the address-space overhead costs, because at most 4 pages are required by the µ-kernel and would thus occupy TLB entries even if the address space were not switched.

                 TLB       TLB miss      page-table switch   segment switch
                 entries   cycles        cycles              cycles
  486            32        9...13        36...364            39
  Pentium        96        9...13        36...1196           15
  PowerPC 601    256       ?             ?                   29
  Alpha 21064    40        20...50 (a)   80...1800           n/a
  Mips R4000     48        20...50 (a)   0 (b)               n/a

  (a) Alpha and Mips TLB misses are handled by software.
  (b) The R4000 has a tagged TLB.

Table 1: Address Space Switch Overhead
Conclusion. Properly constructed address-space switches are not very expensive, less than 50 cycles on modern processors. On a 100 MHz processor, the inherited costs of address-space switches can be ignored up to roughly 100,000 switches per second. Special optimizations, like executing dedicated servers in kernel space, are superfluous. Expensive context switching in some existing µ-kernels is due to implementation and is not caused by inherent problems with the concept.
4.1.3 Thread Switches and IPC
Ousterhout [1990] also measured context switching in
some Unix systems by echoing one byte back and forth
through pipes between two processes. Again normalized to a 10 Mips machine, most results are between
400 and 800 µs per ping-pong; one was 1450 µs. All existing µ-kernels are at least 2 times faster, but it is proved by construction that 10 µs, i.e. a 40 to 80 times faster RPC, is achievable. Table 2 gives the costs of echoing one byte by a round-trip RPC, i.e. two IPC operations.8

  System      CPU, MHz           RPC time       cycles/IPC
                                 (round trip)   (one way)
  full IPC semantics
  L3          486, 50            10 µs          250
  QNX         486, 33            76 µs          1254
  Mach        R2000, 16.7        190 µs         1584
  SRC RPC     CVAX, 12.5         464 µs         2900
  Mach        486, 50            230 µs         5750
  Amoeba      68020, 15          800 µs         6000
  Spin        Alpha 21064, 133   102 µs         6783
  Mach        Alpha 21064, 133   104 µs         6916
  restricted IPC semantics
  Exo-tlrpc   R2000, 16.7        6 µs           53
  Spring      SparcV8, 40        11 µs          220
  DP-Mach     486, 66            16 µs          528
  LRPC        CVAX, 12.5         157 µs         981

Table 2: 1-byte-RPC performance

All times are user to user, cross-address space. They include system call, argument copy, stack and address-space switch costs. Exokernel, Spring and L3 show that communication can be implemented pretty fast and that the costs are heavily influenced by the processor architecture: Spring on Sparc has to deal with register windows, whereas L3 is burdened by the fact that a 486 trap is 100 cycles more expensive than a Sparc trap.

The effect of using segment-based address-space switching on the Pentium is shown in figure 2. One long-running application with a stable working set (2 to 64 data pages) executes a short RPC to a server with a small working set (2 pages). After the RPC, the application re-accesses all its pages. Measurement is done by 100,000 repetitions and by comparing each run against running the application (100,000 times accessing all pages) without RPC. The given times are round-trip RPC times, user to user, plus the required time for re-establishing the application's working set.

[Figure 2: Segmented Versus Standard Address-Space Switch in L4 on Pentium, 90 MHz. Round-trip RPC time plus working-set re-establishment [µs] over the application's data working set [pages]: with the page-table switch the time grows from about 6.3 µs at 2 pages to 12.7 µs at 64 pages, whereas with the segment switch it stays at roughly 3.2 to 3.6 µs.]

8 The respective data is taken from [Liedtke 1993; Hildebrand 1992; Schroeder and Burroughs 1989; Draves et al. 1991; van Renesse et al. 1988; Liedtke 1993; Bershad et al. 1995; Engler et al. 1995; Hamilton and Kougiouris 1993; Bryce and Muller 1995; Bershad et al. 1989].
Conclusion. IPC can be implemented fast enough to also handle hardware interrupts by this mechanism.
4.2 Memory Effects
Chen and Bershad [1993] compared the memory system behaviour of Ultrix, a large monolithic Unix system, with that of the Mach µ-kernel complemented by a Unix server. They measured memory cycle overhead per instruction (MCPI) and found that programs running under Mach + Unix server had a substantially higher MCPI than the same programs running under Ultrix. For some programs, the differences were up to 0.25 cycles per instruction, averaged over the total program (user + system). Similar memory system degradation of Mach versus Ultrix was noticed by others [Nagle et al. 1994]. The crucial point is whether this problem is due to the way Mach is constructed, or whether it is caused by the µ-kernel approach.
Chen and Bershad [1993, p. 125] state: "This suggests that microkernel optimizations focussing exclusively on IPC [...], without considering other sources of system overhead such as MCPI, will have a limited impact on overall system performance." Although one might suppose a principal impact of OS architecture, the mentioned paper exclusively presents facts "as is" about a specific implementation, without analyzing the reasons for the memory system degradation.
Careful analysis of the results is thus required. Following the original paper, we subsume under 'system' either all Ultrix activities or the combined activities of the Mach µ-kernel, Unix emulation library and Unix server. The Ultrix case is denoted by U, the Mach case by M. We restrict our analysis to the samples that show a significant MCPI difference for both systems: sed, egrep, yacc, gcc, compress, espresso and the andrew benchmark ab.
[Figure 3: Baseline MCPI for Ultrix (U) and Mach (M). Total MCPI per sample: sed 0.227 (U) / 0.495 (M), egrep 0.035/0.081, yacc 0.067/0.129, gcc 0.434/0.690, compress 0.250/0.418, ab 0.427/0.534, espresso 0.041/0.068. Black bars: system cache miss MCPI; white bars: other MCPI.]

In figure 3, we present the results of Chen's figure 2-1 in a slightly reordered way. We have colored black those MCPI that are due to system i-cache or d-cache misses.
The white bars comprise all other causes, system write
buffer stalls, system uncached reads, user i-cache and dcache misses and user write buffer stalls. It is easy to see
that the white bars do not differ significantly between
Ultrix and Mach; the average difference is 0.00, the
standard deviation is 0.02 MCPI.
We conclude that the differences in memory system behaviour are essentially caused by increased system cache misses for Mach. They could be conflict misses (the measured system used direct-mapped caches) or capacity misses. A large fraction of conflict misses would suggest a potential problem due to OS structure. Chen and Bershad measured cache conflicts by comparing the direct-mapped cache to a simulated 2-way cache.9 They found that system self-interference is more important than user/system interference, but the data also show that the ratio of conflict to capacity misses in Mach is lower than in Ultrix. Figure 4 shows the conflict (black) and capacity (white) system cache misses, both on an absolute scale (left) and as a ratio (right).
[Figure 4: MCPI Caused by Cache Misses, for Ultrix (U) and Mach (M). System cache-miss MCPI per sample: sed 0.170 (U) / 0.415 (M), egrep 0.024/0.069, yacc 0.039/0.098, gcc 0.130/0.388, compress 0.102/0.258, ab 0.230/0.382, espresso 0.012/0.037. Black: conflict misses; white: capacity misses.]
From this we can deduce that the increased cache misses are caused by higher cache consumption of the system (Mach + emulation library + Unix server), not by conflicts which are inherent to the system's structure.
The next task is to find the component which is responsible for the higher cache consumption. We assume
that the used Unix single server behaves comparably
to the corresponding part of the Ultrix kernel. This
is supported by the fact that the samples spent even
fewer instructions in Mach’s Unix server than in the
corresponding Ultrix routines. We also exclude Mach’s
emulation library, since Chen and Bershad report that
only 3% or less of system overhead is caused by it.
What remains is Mach itself, including trap handling,
IPC and memory management, which therefore must
induce nearly all of the additional cache misses.
Therefore, the mentioned measurements suggest that the memory system degradation is caused solely by high cache consumption of the µ-kernel. Or in other words: drastically reducing the cache working set of a µ-kernel will solve the problem.
Since a µ-kernel is basically a set of procedures which are invoked by user-level threads or hardware, a high cache consumption can only10 be explained by a large number of very frequently used µ-kernel operations or by high cache working sets of a few frequently used operations. According to section 2, the first case has to be considered a conceptual mistake. Large cache working sets are also not an inherent feature of µ-kernels. For example, L3 requires less than 1 K for short IPC. (Recall: voluminous communication can be made by dynamic or static mapping so that the cache is not flooded by copying very long messages.)

9 Although this method does not determine all conflict misses as defined by Hill and Smith [1989], it can be used as a first-level approximation.

10 We do not believe that the Mach kernel flushes the cache explicitly. The measured system was a uniprocessor with physically tagged caches. The hardware does not even require explicit cache flushes for DMA.
Mogul and Borg [1991] reported an increase in cache misses after preemptively-scheduled context switches between applications with large working sets. This depends mostly on the application load and the requirement for interleaved execution (timesharing). The type of kernel is almost irrelevant. We showed (sections 4.1.2 and 4.1.3) that µ-kernel context switches are not expensive in the sense that there is not much difference between executing application + servers in one or in multiple address spaces.

Conclusion. The hypothesis that µ-kernel architectures inherently lead to memory system degradation is not substantiated. On the contrary, the quoted measurements support the hypothesis that properly constructed µ-kernels will automatically avoid the memory system degradation measured for Mach.
5 Non-Portability
Older µ-kernels were built machine-independently on top of a small hardware-dependent layer. This approach has strong advantages from the software-technological point of view: programmers did not need to know very much about processors, and the resulting µ-kernels could easily be ported to new machines. Unfortunately, this approach prevented these µ-kernels from achieving the necessary performance and thus flexibility.

In retrospect, we should not be surprised, since building a µ-kernel on top of abstract hardware has serious implications:
Such a µ-kernel cannot take advantage of specific hardware.

It cannot take precautions to circumvent or avoid performance problems of specific hardware.

The additional layer per se costs performance.
µ-kernels form the lowest layer of operating systems beyond the hardware. Therefore, we should accept that they are as hardware dependent as optimizing code generators. We have learned that not only the coding but even the algorithms used inside a µ-kernel and its internal concepts are extremely processor dependent.

5.1 Compatible Processors

For illustration, we briefly describe how a µ-kernel has to be conceptually modified even when "ported" from the 486 to the Pentium, i.e. to a compatible processor.
Although the Pentium processor is binary compatible to the 486, there are some differences in the internal hardware architecture (see table 3) which influence the internal µ-kernel architecture:

                              486              Pentium
  TLB     entries, ways       32 (u), 4        32 (i) + 64 (d), 4
  Cache   size, ways          8K (u), 4        8K (i) + 8K (d), 2
          line, write         16B, through     32B, back
  fast instructions           1 cycle          0.5–1 cycle
  segment register            9 cycles         3 cycles
  trap                        107 cycles       69 cycles

Table 3: 486 / Pentium Differences
User-address-space implementation. As mentioned in section 4.1.2, a Pentium µ-kernel should use segment registers for implementing user address spaces so that each 2³²-byte hardware address space shares all small and one large user address space. Recall that this can be implemented transparently to the user.
Ford [1993] proposed a similar technique for the 486, and table 1 also suggests it for the 486. Nevertheless, the conventional hardware-address-space switch is preferable on this processor. Expensive segment register loads and additional instructions at various places in the kernel sum to roughly 130 cycles required in addition. Now look at the relevant situation: an address-space switch from a large space to a small one and back to the large one. Assuming cache hits, the costs of the segment register model would be (130 + 39) × 2 = 338 cycles, whereas the conventional address-space model would require 28 × 9 + 36 = 288 cycles in the theoretical case of 100% TLB use, 14 × 9 + 36 = 162 cycles for the more probable case that the large address space uses only 50% of the TLB, and only 72 cycles in the best case. In total, the conventional method wins.
On the Pentium, however, the segment register method pays. The reasons are several: (a) Segment register loads are faster. (b) Fast instructions are cheaper, whereas the overhead of traps and TLB misses remains nearly constant. (c) Conflict cache misses (which, relative to instruction execution, are anyway more expensive) are more likely because of the reduced associativity. Avoiding TLB misses thus also reduces cache conflicts. (d) Due to the three times larger TLB, the flush costs can increase substantially. As a result, on the Pentium, the segment register method always pays (see figure 2).
As a consequence, we have to implement an additional user-address-space multiplexer; we have to modify the address-space switch routines, the handling of user-supplied addresses, thread control blocks, task control blocks, the IPC implementation and the address-space structure as seen by the kernel. In total, the mentioned changes affect algorithms in about half of all µ-kernel modules.
IPC implementation. Due to reduced associativity,
the Pentium caches tend to exhibit increased conflict
misses. One simple way to improve cache behaviour
during IPC is by restructuring the thread control block
data such that it profits from the doubled cache line size.
This can also be adopted for the 486 kernel, since it has no effect on the 486 and can be implemented transparently to the user.
In the 486 kernel, thread control blocks (including kernel stacks) were page aligned. IPC always accesses 2 control blocks and kernel stacks simultaneously. The cache hardware maps the corresponding data of both control blocks to identical cache addresses. Due to its 4-way associativity, this problem could be ignored on the 486. However, the Pentium's data cache is only 2-way set-associative. A nice optimization is to align thread control blocks no longer on 4K but on 1K boundaries. (1K is the lower bound due to internal reasons.) Then there is a 75% chance that two randomly selected control blocks do not compete in the cache.
Surprisingly, this affects the internal bit-structure of the unique thread identifiers supplied by the µ-kernel (see [Liedtke 1993] for details). Therefore, the new kernel cannot simply replace the old one, since (persistent) user programs already hold uids which would become invalid.
5.2 Incompatible Processors
Processors of competing families differ in instruction set, register architecture, exception handling, cache/TLB architecture, protection and memory model. Especially the latter ones radically influence µ-kernel structure. There are systems with

multi-level page tables,
hashed page tables,
(no) reference bits,
(no) page protection,
strange page protection11,
single/multiple page sizes,
2³²-, 2⁴³-, 2⁵²- and 2⁶⁴-byte address spaces,
flat and segmented address spaces,
various segment models,
tagged/untagged TLBs,
virtually/physically tagged caches.

11 e.g. the 386 ignores write protection in kernel mode; the PowerPC supports read-only in kernel mode, but this implies that the page is seen in user mode as well.

The differences are orders of magnitude higher than between the 486 and Pentium. We have to expect that a new processor requires a new µ-kernel design.
For illustration, we compare two different kernels on two different processors: the Exokernel [Engler et al. 1995] running on an R2000 and L3 running on a 486. Although this is similar to comparing apples with oranges, a careful analysis of the performance differences helps in understanding the performance-determining factors and in weighting the differences in processor architecture. Finally, this results in different µ-kernel architectures.

We compare Exokernel's protected control transfer (PCT) with L3's IPC. Exo-PCT on the R2000 requires about 35 cycles, whereas L3 takes 250 cycles on a 486 processor for an 8-byte message transfer. If this difference cannot be explained by different functionality and/or average processor performance, there must be an anomaly relevant to µ-kernel design.
Exo-PCT is a "substrate for implementing efficient IPC mechanisms. [It] changes the program counter to an agreed-upon value in the callee, donates the current time-slice to the callee's processor environment, and installs required elements of the callee's processor context." L3-IPC is used for secure communication between potentially untrusted partners; it therefore additionally checks the communication permission (whether the partner is willing to receive a message from the sender and whether no clan borderline is crossed), synchronizes both threads, supports error recovery by send and receive timeouts, and permits complex messages to reduce marshaling costs and IPC frequency.
From our experience, extending Exo-PCT accordingly should require no more than 30 additional cycles. (Note that using PCT for a trusted LRPC already costs an additional 18 cycles, see table 2.) Therefore, we assume that a hypothetical L3-equivalent "Exo-IPC" would cost about 65 cycles on the R2000. Finally, we must take into consideration that the cycles of both processors are not equivalent as far as the most frequently executed instructions are concerned. Based on SpecInts, roughly 1.4 486-cycles appear to do as much work as one R2000 cycle; comparing the five instructions most relevant in this context (2-op alu, 3-op alu, load, branch taken and not taken) gives 1.6 for well-optimized code. Thus we estimate that the Exo-IPC would cost up to approximately 100 486-cycles, definitely less than L3's 250 cycles.
This substantial difference in timing indicates an isolated difference between both processor architectures that strongly influences IPC (and perhaps other µ-kernel mechanisms), but not average programs.

In fact, the 486 processor imposes a high penalty on entering/exiting the kernel and requires a TLB flush per IPC due to its untagged TLB. This costs at least 107 + 49 = 156 cycles. On the other hand, the R2000 has a tagged TLB, i.e. avoids the TLB flush, and needs less than 20 cycles for entering and exiting the kernel.
From the above example, we learn two lessons:

For well-engineered µ-kernels on different processor architectures, in particular with different memory systems, we should expect isolated timing differences that are not related to overall processor performance.

Different architectures require processor-specific optimization techniques that even affect the global µ-kernel structure.
To understand the second point, recall that the mandatory 486 TLB flush requires minimizing the number of subsequent TLB misses. The relevant techniques [Liedtke 1993, pp. 179, 182–183] are mostly based on proper address space construction: concentrating processor-internal tables and heavily used kernel data in one page (there is no unmapped memory on the 486), implementing control blocks and kernel stacks as virtual objects, and lazy scheduling. In toto, these techniques save 11 TLB misses, i.e. at least 99 cycles on the 486, and are thus inevitable.
Due to its unmapped memory facility and tagged TLB, the mentioned constraint disappears on the R2000. Consequently, the internal structure (address space structure, page fault handling, perhaps control block access and scheduling) of a corresponding kernel can substantially differ from a 486 kernel. If other factors also imply implementing control blocks as physical objects, even the uids will differ between the R2000 kernel (no × pointer size + x) and the 486 kernel (no × control block size + x).
Conclusion. µ-kernels form the link between a minimal ("µ") set of abstractions and the bare processor. The performance demands are comparable to those of earlier microprogramming. As a consequence, µ-kernels are inherently not portable. Instead, they are the processor-dependent basis for portable operating systems.
6 Synthesis, Spin, DP-Mach, Panda, Cache and Exokernel
Synthesis. Henry Massalin's Synthesis operating system [Pu et al. 1988] is another example of a high-performing (and non-portable) kernel. Its distinguishing feature was a kernel-integrated "compiler" which generated kernel code at runtime. For example, when issuing a read-pipe system call, the Synthesis kernel generated specialized code for reading out of this pipe and modified the respective invocation. This technique was highly successful on the 68030. However (a good example of non-portability), it would most probably no longer pay on modern processors, because (a) code inflation will degrade cache performance and (b) frequent generation of small code chunks pollutes the instruction cache.
Spin. Spin [Bershad et al. 1994; Bershad et al. 1995] is a new development which tries to extend the Synthesis idea: user-supplied algorithms are translated by a kernel compiler and added to the kernel, i.e. the user may write new system calls. By controlling branches and memory references, the compiler ensures that the newly-generated code does not violate kernel or user integrity. This approach reduces kernel–user mode switches and sometimes address-space switches. Spin is based on Mach and may thus inherit many of its inefficiencies, which makes it difficult to evaluate performance results. Rescaling them to an efficient µ-kernel with fast kernel–user mode switches and fast IPC is needed. The most crucial problem, however, is the estimation of how an optimized µ-kernel architecture and the requirements coming from a kernel compiler interfere with each other. Kernel architecture and performance might, e.g., be affected by the requirement for larger kernel stacks. (A pure µ-kernel needs only a few hundred bytes per kernel stack.) Furthermore, the costs of safety-guaranteeing code have to be related to µ-kernel overhead and to optimal user-level code.
The first published results [Bershad et al. 1995] cannot answer these questions: On an Alpha 21064, 133 MHz, a Spin system call needs nearly twice as many cycles (1600, 12 µs) as the already expensive Mach system call (900, 7 µs). The application measurements show that Mach can be substantially improved by using a kernel compiler; however, it remains open whether this technique can reach or outperform a pure µ-kernel approach like that described here. For example, a simple user-level page-fault handler (1100 µs under Mach) executes in 17 µs under Spin. However, we must take into consideration that in a traditional µ-kernel, the kernel is invoked and left only twice: page fault (enter), message to pager (exit), reply map message (enter + exit). The Spin technique can save only one system call, which on this processor should cost less than 1 µs; i.e. with 12 µs, the actual Spin overhead is far beyond the ideal traditional overhead of 1 + 1 µs.
From our experience, we expect a notable gain if a kernel compiler eliminates nested IPC redirection, e.g. when using deep hierarchies of Clans or Custodians [Härtig et al. 1993]. Efficient integration of the kernel-compiler technique and appropriate µ-kernel design might be a promising research direction.
Utah-Mach. Ford and Lepreau [1994] changed Mach
IPC semantics to migrating RPC which is based on
thread migration between address spaces, similar to the
Clouds model [Bernabeu-Auban et al. 1988]. Substantial performance gain was achieved, a factor of 3 to
4.
DP-Mach. DP-Mach [Bryce and Muller 1995] implements multiple domains of protection within one user address space and offers a protected inter-domain call. The performance results (see table 2) are encouraging. However, although this inter-domain call is highly specialized, it is twice as slow as achievable by a general RPC mechanism. In fact, an inter-domain call needs two kernel calls and two address-space switches. A general RPC requires two additional thread switches and argument transfers.12 Apparently, the kernel call and address-space switch costs dominate. Bryce and Muller presented an interesting optimization for small inter-domain calls: when switching back from a very small domain, the TLB is only selectively flushed. Although the effects are rather limited on their host machine (a 486 with only 32 TLB entries), it might become more relevant on processors with larger TLBs. To analyze whether kernel enrichment by inter-domain calls pays, we need e.g. a Pentium implementation, which we can then compare with a general RPC based on segment switching.
Panda. The Panda system's [Assenmacher et al. 1993] µ-kernel is a further example of a small kernel which delegates as much as possible to user space. Besides its two basic concepts, protection domain and virtual processor, the Panda kernel handles only interrupts and exceptions.
Cache-Kernel. The Cache-kernel [Cheriton and Duda 1994] is also a small and hardware-dependent µ-kernel. In contrast to the Exokernel, it relies on a small, fixed (non-extensible) virtual machine. It caches kernels, threads, address spaces and mappings. The term 'caching' refers to the fact that the µ-kernel never handles the complete set of, e.g., all address spaces, but only a dynamically selected subset. It was hoped that this technique would lead to a smaller µ-kernel interface and also to less µ-kernel code, since it no longer has to deal with special but infrequent cases. In fact, this could be done as well on top of a pure µ-kernel by means of appropriate pagers. (Kernel data structures, e.g. thread control blocks, could be held in virtual memory in the same way as other data.)

12 Sometimes, the argument transfer can be omitted. For implementing inter-domain calls, a pager can be used which shares the address spaces of caller and callee such that the trusted callee can access the parameters in the caller space. E.g. LRPC [Bershad et al. 1989] and NetWare [Major et al. 1994] use a similar technique.
Exokernel. In contrast to Spin, the Exokernel [Engler et al. 1994; Engler et al. 1995] is a small and hardware-dependent µ-kernel. In accordance with our processor-dependency thesis, the Exokernel is tailored to the R2000 and gets excellent performance values for its primitives. In contrast to our approach, it is based on the philosophy that a kernel should not provide abstractions but only a minimal set of primitives. Consequently, the Exokernel interface is architecture dependent, in particular dedicated to software-controlled TLBs. A further difference to our driver-less µ-kernel approach is that the Exokernel appears to partially integrate device drivers, in particular for disks, networks and frame buffers.

We believe that dropping the abstractional approach could only be justified by substantial performance gains. Whether these can be achieved remains open (see the discussion in section 5.2) until we have well-engineered exo- and abstractional µ-kernels on the same hardware platform. It might then turn out that the right abstractions are even more efficient than securely multiplexing hardware primitives or, on the other hand, that abstractions are too inflexible. We should try to decide these questions by constructing comparable µ-kernels on at least two reference platforms. Such a co-construction will probably also lead to new insights for both approaches.
7 Conclusions
A µ-kernel can provide higher layers with a minimal
set of appropriate abstractions that are flexible enough
to allow implementation of arbitrary operating systems
and allow exploitation of a wide range of hardware.
The presented mechanisms (address space with map,
flush and grant operation, threads with IPC and unique
identifiers) form such a basis. Multi-level-security systems may additionally need clans or a similar reference
monitor concept. Choosing the right abstractions is crucial for both flexibility and performance. Some existing
µ-kernels chose inappropriate abstractions, or too many
or too specialized and inflexible ones.
Similar to optimizing code generators, µ-kernels
must be constructed per processor and are inherently
not portable. Basic implementation decisions, most algorithms and data structures inside a µ-kernel are processor dependent. Their design must be guided by performance prediction and analysis. Besides inappropriate basic abstractions, the most frequent mistakes come from an insufficient understanding of the combined hardware-software system or from inefficient implementation.

The presented design shows that it is possible to achieve well-performing µ-kernels through processor-specific implementations of processor-independent abstractions.

Availability

The source code of the L4 µ-kernel, a successor of the L3 µ-kernel, is available for examination and experimentation through the web: http://borneo.gmd.de/RS/L4.

Acknowledgements

Many thanks to Hermann Härtig for discussion and Rich Uhlig for proofreading and stylistic help. Further thanks go to Dejan Milojicic and some anonymous referees for their reviewing remarks, and to Sacha Krakowiak for shepherding.

A Address Spaces

An Abstract Model of Address Spaces
We describe address spaces as mappings. $\sigma_0 : V \to R \cup \{\emptyset\}$ is the initial address space, where $V$ is the set of virtual pages, $R$ the set of available physical (real) pages and $\emptyset$ the nilpage which cannot be accessed. Further address spaces are defined recursively as mappings $\sigma : V \to (\Sigma \times V) \cup \{\emptyset\}$, where $\Sigma$ is the set of address spaces. It is convenient to regard each mapping as a one-column table which contains $\sigma(v)$ for all $v \in V$ and can be indexed by $v$. We denote the elements of this table by $\sigma_v$.
All modifications of address spaces are based on the replacement operation: we write $\sigma_v \leftarrow x$ to describe a change of $\sigma$ at $v$, precisely:

$$\mathrm{flush}(\sigma, v);\quad \sigma_v := x.$$

A page potentially mapped at $v$ in $\sigma$ is flushed, and the new value $x$ is copied into $\sigma_v$. This operation is internal to the µ-kernel. We use it only for describing the three exported operations.
A subsystem $S$ with address space $\sigma$ can grant any of its pages $v$ to a subsystem $S'$ with address space $\sigma'$ provided $S'$ agrees:

$$\sigma'_{v'} \leftarrow \sigma_v;\qquad \sigma_v \leftarrow \emptyset.$$

Note that $S$ determines which of its pages should be granted, whereas $S'$ determines at which virtual address the granted page should be mapped in $\sigma'$. The granted page is transferred to $\sigma'$ and removed from $\sigma$.
A subsystem $S$ with address space $\sigma$ can map any of its pages $v$ to a subsystem $S'$ with address space $\sigma'$ provided $S'$ agrees:

$$\sigma'_{v'} \leftarrow (\sigma, v).$$

In contrast to grant, the mapped page remains in the mapper's space: a link $(\sigma, v)$ to the page in the mapper's address space is stored in the receiving address space $\sigma'$, instead of transferring the existing link from $\sigma_v$ to $\sigma'_{v'}$. This operation permits address spaces to be constructed recursively, i.e. new spaces based on existing ones.
Flushing, the reverse operation, can be executed without explicit agreement of the mappees, since they agreed implicitly when accepting the prior map operation. $S$ can flush any of its pages:

$$\forall\, \sigma'_{v'} = (\sigma, v):\quad \sigma'_{v'} \leftarrow \emptyset.$$

Note that $\leftarrow$ and flush are defined recursively. Flushing recursively also affects all mappings which are indirectly derived from $\sigma_v$.

No cycles can be established by these three operations, since $\leftarrow$ flushes the destination prior to copying.
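To make the model concrete, the following C sketch restates the three exported operations over the recursive mapping structure defined above. All names (space_t, entry_t, NPAGES) and the scan-all-spaces flush are illustrative inventions, not the kernel's data structures; the implementation actually recommended is described in the next subsection.

/*
 * Minimal sketch (illustrative only): an address space sigma is a table
 * indexed by virtual page v whose entries are nil, a real page r, or a
 * link (sigma', v') into another address space.
 */
#include <stddef.h>

#define NPAGES 1024                       /* |V|, arbitrary for the sketch */

typedef struct space space_t;

typedef enum { ENT_NIL, ENT_REAL, ENT_LINK } etype_t;

typedef struct {
    etype_t  type;
    unsigned real;                        /* physical page r  (ENT_REAL) */
    space_t *src;                         /* mapper's sigma   (ENT_LINK) */
    unsigned srcv;                        /* mapper's page v  (ENT_LINK) */
} entry_t;

struct space { entry_t e[NPAGES]; };      /* the one-column table sigma_v */

/*
 * flush(sigma, v): remove every mapping directly or indirectly derived
 * from sigma_v. The naive sketch scans the whole set Sigma ('spaces');
 * the kernel instead walks a per-frame mapping tree (see below).
 */
void flush(space_t *spaces, size_t nspaces, space_t *s, unsigned v)
{
    for (size_t i = 0; i < nspaces; i++)
        for (unsigned w = 0; w < NPAGES; w++)
            if (spaces[i].e[w].type == ENT_LINK &&
                spaces[i].e[w].src  == s && spaces[i].e[w].srcv == v) {
                flush(spaces, nspaces, &spaces[i], w);   /* recursion */
                spaces[i].e[w].type = ENT_NIL;
            }
}

/* the internal replacement operation  sigma_v <- x :  flush, then assign */
static void replace(space_t *spaces, size_t n,
                    space_t *s, unsigned v, entry_t x)
{
    flush(spaces, n, s, v);               /* destination is flushed first, */
    s->e[v] = x;                          /* ...so no cycles can arise     */
}

/* grant:  sigma'_{v'} <- sigma_v ;  sigma_v <- nil  (the formulas taken literally) */
void grant(space_t *spaces, size_t n,
           space_t *s, unsigned v, space_t *s2, unsigned v2)
{
    replace(spaces, n, s2, v2, s->e[v]);
    replace(spaces, n, s, v, (entry_t){ .type = ENT_NIL });
}

/* map:  sigma'_{v'} <- (sigma, v)  -- the page stays in the mapper's space */
void map(space_t *spaces, size_t n,
         space_t *s, unsigned v, space_t *s2, unsigned v2)
{
    replace(spaces, n, s2, v2,
            (entry_t){ .type = ENT_LINK, .src = s, .srcv = v });
}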
Implementing the Model
At first glance, deriving the physical address of page $v$ in address space $\sigma$ seems to be rather complicated and expensive:

$$\sigma(v) = \begin{cases} \sigma'(v') & \text{if } \sigma_v = (\sigma', v') \\ r & \text{if } \sigma_v = r \\ \emptyset & \text{if } \sigma_v = \emptyset \end{cases}$$

Fortunately, a recursive evaluation of $\sigma(v)$ is never required. The three basic operations guarantee that the
physical address of a virtual page will never change,
except by flushing. For implementation, we therefore
complement each $\sigma$ by an additional table $P$, where $P_v$ corresponds to $\sigma_v$ and holds either the physical address of $v$ or $\emptyset$. Mapping and granting then include

$$P'_{v'} := P_v$$

and each replacement $\sigma_v \leftarrow \emptyset$ invoked by a flush operation includes

$$P_v := \emptyset.$$

$P_v$ can always be used instead of evaluating $\sigma(v)$. In fact, $P$ is equivalent to a hardware page table. µ-kernel address spaces can therefore be implemented straightforwardly by means of the hardware address-translation facilities.
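As a small, hypothetical illustration of these rules (all names invented), $P$ can be kept per address space as a plain array of frame numbers that is written in exactly two places: on map or grant, and on flush. Since these entries are precisely what a hardware page table stores, the MMU's page tables can play the role of $P$ directly.

/* Illustrative sketch of the shadow table P (names are not from the paper). */
#include <stdint.h>

#define NPAGES    1024
#define NIL_FRAME UINT32_MAX              /* represents the nilpage entry */

typedef struct { uint32_t P[NPAGES]; } ptab_t;   /* one P per address space */

/* mapping or granting (src, v) to (dst, v2):  P'_{v'} := P_v */
static inline void p_map_or_grant(ptab_t *dst, unsigned v2,
                                  const ptab_t *src, unsigned v)
{
    dst->P[v2] = src->P[v];
}

/* every replacement sigma_v <- nil issued by a flush:  P_v := nil */
static inline void p_flush(ptab_t *s, unsigned v)
{
    s->P[v] = NIL_FRAME;
}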
The recommended implementation of $\Sigma$ is to use one mapping tree per physical page frame which describes all actual mappings of the frame. Each node contains $(P, v)$, where $v$ is the corresponding virtual page in the address space which is implemented by the page table $P$.
Assume that a grant, map or flush operation deals with a page $v$ in an address space $\sigma$ with which the page table $P$ is associated. In a first step, the operation selects the corresponding tree via $P_v$, the physical page. In the next step, it selects the node of the tree that contains $(P, v)$. (This selection can be done by traversing the tree or in a single step, if $P_v$ is extended by a link to the node.) Granting then simply replaces the values stored in the node, and map creates a new child node for storing $(P', v')$. Flush leaves the selected node unaffected but traverses and erases the complete subtree, where $P'_{v'} := \emptyset$ is executed for each node $(P', v')$ in the subtree.
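One possible representation of such a mapping tree is sketched below; the node layout and function names are invented for illustration and are not taken from the actual kernel. Grant relabels a node, map adds a child, and flush erases the subtree while executing $P_v := \emptyset$ for every erased node.

/* Illustrative sketch of a per-frame mapping tree (names not from the paper). */
#include <stdlib.h>

#define NIL_FRAME (~0UL)                  /* stands for the nilpage entry  */

typedef struct { unsigned long P[1024]; } ptab_t;   /* stands for a page table */

typedef struct mapnode {
    ptab_t         *pt;                   /* the P of (P, v)                 */
    unsigned        v;                    /* the v of (P, v)                 */
    struct mapnode *child;                /* first mapping derived from here */
    struct mapnode *sibling;              /* next mapping of the parent      */
} mapnode_t;

/* grant: only the node's values change; its subtree is untouched.
 * (A real implementation would first flush whatever was previously
 * mapped at (P', v').) */
void tree_grant(mapnode_t *n, unsigned long frame, ptab_t *npt, unsigned nv)
{
    n->pt->P[n->v] = NIL_FRAME;           /* granter loses the page        */
    npt->P[nv]     = frame;               /* grantee now maps the frame    */
    n->pt = npt;                          /* node now denotes (P', v')     */
    n->v  = nv;
}

/* map: create a child node storing (P', v') below the mapper's node */
mapnode_t *tree_map(mapnode_t *parent, unsigned long frame,
                    ptab_t *npt, unsigned nv)
{
    mapnode_t *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->pt = npt;  c->v = nv;  c->child = NULL;
    c->sibling = parent->child;           /* link into the parent's children */
    parent->child = c;
    npt->P[nv] = frame;                   /* P'_{v'} := P_v                  */
    return c;
}

/* flush: erase the complete subtree below n, executing P_v := nil for
 * each erased node; n itself is left unaffected */
void tree_flush(mapnode_t *n)
{
    mapnode_t *c = n->child;
    while (c != NULL) {
        mapnode_t *next = c->sibling;
        tree_flush(c);                    /* recursively drop derived maps */
        c->pt->P[c->v] = NIL_FRAME;
        free(c);
        c = next;
    }
    n->child = NULL;
}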
References
Assenmacher, H., Breitbach, T., Buhler, P., Hübsch, V., and Schwarz, R.
1993. The Panda system architecture – a pico-kernel approach. In
4th Workshop on Future Trends of Distributed Computing Systems,
Lisboa, Portugal, pp. 470–476.
Bernabeu-Auban, J. M., Hutto, P. W., and Khalidi, Y. A. 1988. The
architecture of the Ra kernel. Tech. Rep. GIT-ICS-87/35 (Jan.),
Georgia Institute of Technology, Atlanta, GA.
Bershad, B. N., Anderson, T. E., Lazowska, E. D., and Levy, H. M.
1989. Lightweight remote procedure call. In 12th ACM Symposium
on Operating System Principles (SOSP), Litchfield Park, AZ, pp.
102–113.
Bershad, B. N., Chambers, C., Eggers, S., Maeda, C., McNamee, D.,
Pardyak, P., Savage, S., and Sirer, E. G. 1994. Spin – an extensible
microkernel for application-specific operating system services. In
6th SIGOPS European Workshop, Schloß Dagstuhl, Germany, pp.
68–71.
Bershad, B. N., Savage, S., Pardyak, P., Sirer, E. G., Fiuczynski, M.,
Becker, D., Eggers, S., and Chambers, C. 1995. Extensibility, safety
and performance in the Spin operating system. In 15th ACM Symposium on Operating System Principles (SOSP), Copper Mountain
Resort, CO, pp. xx–xx.
Brinch Hansen, P. 1970. The nucleus of a multiprogramming system.
Commun. ACM 13, 4 (April), 238–241.
Bryce, C. and Muller, G. 1995. Matching micro-kernels to modern applications using fine-grained memory protection. In IEEE Symposium
on Parallel Distributed Systems, San Antonio, TX.
Cao, P., Felten, E. W., and Li, K. 1994. Implementation and performance
of application-controlled file caching. In 1st USENIX Symposium on
Operating Systems Design and Implementation (OSDI), Monterey,
CA, pp. 165–178.
Chen, J. B. and Bershad, B. N. 1993. The impact of operating system
structure on memory system performance. In 14th ACM Symposium
on Operating System Principles (SOSP), Asheville, NC, pp. 120–
133.
Cheriton, D. R. and Duda, K. J. 1994. A caching model of operating
system kernel functionality. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA,
pp. 179–194.
Digital Equipment Corp. 1992. DECChip 21064-AA Risc Microprocessor
Data Sheet. Digital Equipment Corp.
Draves, R. P., Bershad, B. N., Rashid, R. F., and Dean, R. W. 1991. Using
continuations to implement thread management and communication
in operating systems. In 13th ACM Symposium on Operating System
Principles (SOSP), Pacific Grove, CA, pp. 122–136.
Engler, D., Kaashoek, M. F., and O’Toole, J. 1994. The operating system
kernel as a secure programmable machine. In 6th SIGOPS European
Workshop, Schloß Dagstuhl, Germany, pp. 62–67.
Engler, D., Kaashoek, M. F., and O’Toole, J. 1995. Exokernel, an operating system architecture for application-level resource management.
In 15th ACM Symposium on Operating System Principles (SOSP),
Copper Mountain Resort, CO, pp. xx–xx.
Ford, B. 1993. private communication.
Ford, B. and Lepreau, J. 1994. Evolving Mach 3.0 to a migrating thread
model. In Usenix Winter Conference, CA, pp. 97–114.
Gasser, M., Goldstein, A., Kaufmann, C., and Lampson, B. 1989. The
Digital distributed system security architecture. In 12th National
Computer Security Conference (NIST/NCSC), Baltimore, pp. 305–
319.
Hamilton, G. and Kougiouris, P. 1993. The Spring nucleus: A microkernel for objects. In Summer Usenix Conference, Cincinnati, OH, pp.
147–160.
Härtig, H., Kowalski, O., and Kühnhauser, W. 1993. The Birlix security
architecture. Journal of Computer Security 2, 1, 5–21.
Hildebrand, D. 1992. An architectural overview of QNX. In 1st Usenix
Workshop on Micro-kernels and Other Kernel Architectures, Seattle, WA, pp. 113–126.
Hill, M. D. and Smith, A. J. 1989. Evaluating associativity in CPU caches.
IEEE Transactions on Computers 38, 12 (Dec.), 1612–1630.
Intel Corp. 1990. i486 Microprocessor Programmer’s Reference Manual.
Intel Corp.
Intel Corp. 1993. Pentium Processor User’s Manual, Volume 3: Architecture and Programming Manual. Intel Corp.
Kane, G. and Heinrich, J. 1992. MIPS Risc Architecture. Prentice Hall.
Kessler, R. and Hill, M. D. 1992. Page placement algorithms for large
real-indexed caches. ACM Transactions on Computer Systems 10, 4
(Nov.), 11–22.
Khalidi, Y. A. and Nelson, M. N. 1993. Extensible file systems in Spring.
In 14th ACM Symposium on Operating System Principles (SOSP),
Asheville, NC, pp. 1–14.
Kühnhauser, W. E. 1995. A paradigm for user-defined security policies.
In Proceedings of the 14th IEEE Symposium on Reliable Distributed
Systems, Bad Neuenahr, Germany.
Lee, C. H., Chen, M. C., and Chang, R. C. 1994. HiPEC: high performance external virtual memory caching. In 1st USENIX Symposium
on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 153–164.
Liedtke, J. 1992. Clans & chiefs. In 12. GI/ITG-Fachtagung Architektur
von Rechensystemen, Kiel, pp. 294–305. Springer.
Liedtke, J. 1993. Improving IPC by kernel design. In 14th ACM Symposium on Operating System Principles (SOSP), Asheville, NC, pp.
175–188.
Liedtke, J. 1995. Improved address-space switching on Pentium processors by transparently multiplexing user address spaces. Arbeitspapiere der GMD No. 933 (Sept.), GMD — German National Research
Center for Information Technology, Sankt Augustin.
Major, D., Minshall, G., and Powell, K. 1994. An overview of the
NetWare operating system. In Winter Usenix Conference, San Francisco, CA.
Mogul, J. C. and Borg, A. 1991. The effect of context switches on cache performance. In 4th International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS), Santa Clara, CA, pp. 75–84.
Motorola Inc. 1993. PowerPC 601 RISC Microprocessor User’s Manual.
Motorola Inc.
Nagle, D., Uhlig, R., Mudge, T., and Sechrest, S. 1994. Optimal allocation of on-chip memory for multiple-API operating systems. In
21st Annual International Symposium on Computer Architecture
(ISCA), Chicago, IL, pp. 358–369.
Ousterhout, J. K. 1990. Why aren’t operating systems getting faster as
fast as hardware? In Usenix Summer Conference, Anaheim, CA,
pp. 247–256.
Pu, C., Massalin, H., and Ioannidis, J. 1988. The Synthesis kernel. Computing Systems 1, 1 (Jan.), 11–32.
Romer, T. H., Lee, D. L., Bershad, B. N., and Chen, B. 1994. Dynamic
page mapping policies for cache conflict resolution on standard
hardware. In 1st USENIX Symposium on Operating Systems Design
and Implementation (OSDI), Monterey, CA, pp. 255–266.
Rozier, M., Abrossimov, V., Armand, F., Boule, I., Gien, M., Guillemont, M., Herrmann, F., Kaiser, C., Langlois, S., Leonard, P., and
Neuhauser, W. 1988. Chorus distributed operating system. Computing Systems 1, 4, 305–370.
Schröder-Preikschat, W. 1994. The Logical Design of Parallel Operating
Systems. Prentice Hall.
Schroeder, M. D. and Burrows, M. 1989. Performance of the Firefly
RPC. In 12th ACM Symposium on Operating System Principles
(SOSP), Litchfield Park, AZ, pp. 83–90.
van Renesse, R., van Staveren, H., and Tanenbaum, A. S. 1988. Performance of the world’s fastest distributed operating system. Operating
Systems Review 22, 4 (Oct.), 25–34.
Wulf, W., Cohen, E., Corwin, W., Jones, A., Levin, R., Pierson, C., and
Pollack, F. 1974. Hydra: The kernel of a multiprocessing operating
system. Commun. ACM 17, 6 (July), 337–345.
Yokote, Y. 1993. Kernel-structuring for object-oriented operating systems: The Apertos approach. In International Symposium on Object
Technologies for Advanced Software. Springer.