The Design and Implementation of an Operating System to Support Distributed Multimedia Applications

Ian M. Leslie, Member, IEEE, Derek McAuley, Richard Black, Timothy Roscoe, Paul Barham, David Evers, Robin Fairbairns, and Eoin Hyden, Member, IEEE
Abstract: Support for multimedia applications by general purpose computing platforms has been the subject of considerable research. Much of this work is based on an evolutionary strategy in which small changes to existing systems are made. The approach adopted here is to start ab initio with no backward compatibility constraints. This leads to a novel structure for an operating system. The structure aims to decouple applications from one another and to provide multiplexing of all resources, not just the CPU, at a low level. The motivation for this structure, a design based on the structure, and its implementation on a number of hardware platforms are described.
Manuscript received May 1, 1995; revised March 1, 1996. I. M. Leslie, R. Black, P. Barham, and R. Fairbairns are with the University of Cambridge Computer Laboratory, Cambridge, U.K. D. McAuley is with the Department of Computer Science, University of Glasgow, Glasgow, U.K. T. Roscoe is with Persimmon IT Inc., Durham, NC 27703 USA. D. Evers is with Nemesys Research Ltd., Cambridge, U.K. E. Hyden is with AT&T Bell Labs, Murray Hill, NJ 07974 USA. Publisher Item Identifier S 0733-8716(96)06122-7.

I. INTRODUCTION

GENERAL purpose multimedia computing platforms should endow text, images, audio, and video with equal status: interpreting an audio or video stream should not be a privileged task of special functions provided by the operating system, but one of ordinary user programs. Support for such processing on a platform on which other user applications are running, some of which may also be processing continuous media, cannot be achieved using existing operating systems; it requires mechanisms that will consistently share out resources in a manner determined by both application requirements and user preferences.

In multimedia systems which capture, display, and store continuous media, the range of applications is constrained (although by no means currently exhausted). The introduction of the processing of continuous media in real time adds an important new dimension. If the current situation is typified as using processors to control continuous media, future systems will be typified as systems in which the data types operated on have been extended to include continuous media streams [2]. What we are concerned with here is to provide an environment in which such applications can be developed and run.

The underlying assumptions made in the Pegasus project are as follows.
1) General purpose computing platforms will process continuous media as well as simply capture, render, and store them.
2) Users will run many applications which manipulate continuous media simultaneously.
3) An application manipulating continuous media will make varying demands on resources during its execution.
4) The application mix and load will be dynamic.

Continuous media streams have two important properties. The first property is that their fidelity is often dependent upon the timeliness with which they are presented. This temporal property of continuous media imposes the requirement that code which manipulates the media data may need to be scheduled within suitable windows of time. The second property is that they are often tolerant of the loss of some of their information content, particularly if it is known how the data is to be used (e.g., many compression schemes rely on human factors to achieve high compression rates). This informational property, without regard to its exact nature, may be exploited by systems which handle continuous media.

The properties of the streams can be extended to the applications which process them; we have temporal requirements for such applications which are stronger than traditional data processing applications, and informational requirements which are weaker.

In order for an operating system to support both traditional and multimedia applications, a wider range of facilities than is found in current operating systems needs to be provided. This paper describes an operating system, called Nemesis, whose goal is to provide this range of facilities. This work was carried out as part of the Pegasus project [1], an ESPRIT Basic Research Activity.

Traditional general purpose operating systems support the notion of a virtual processor interface in which each application sees its own virtual processor; this provides a method for sharing the real processor. However, each virtual processor sees a performance which is influenced by the load on the other virtual processors, and mechanisms to control this interference are generally not available. Multimedia applications require such mechanisms.
One way of controlling this interference is by providing multiple real processors. For example, many multimedia applications (or parts thereof) run on processors on peripheral cards so that the main processor is not involved. Moreover, the code
running on the peripheral is likely to be embedded and there is no danger of competing applications using the peripheral at the same time. The same approach is also used in mainframes, where the use of channel processors reduces the I/O demands on the central processors, in particular ensuring that the central processors do not get overloaded by I/O interrupts.

Our aim in Nemesis is to allow a general purpose processor to be used to provide the functions one would find in a specialized DSP peripheral while providing the same control of interference across virtual processors as can be achieved with distinct hardware. We wish to retain the flexibility of the virtual processor system so that resources can be used more efficiently than in a dedicated-peripheral approach.

In approaching the design of an operating system with these goals, the immediate question of revolution versus evolution arises. Should one attempt to migrate a current operating system (or indeed use a current operating system) in order to meet these goals, or should one start afresh? The reasons why current general purpose operating systems are not appropriate are well established. Similarly, hard real-time solutions which require static analysis are not appropriate in a situation where the application mix is dynamic.

General purpose operating systems with "real-time threads" in which the real-time behavior is provided by static priority are also inappropriate, unless one is running a single multimedia application or can afford to perform an analysis of the complete system in order to assign priorities. A better solution might be to take an existing operating system and modify its scheduling system to support multimedia applications; perhaps one reason for the difficulty in performing such a scheduler transplant is that knowledge of the characteristics of the scheduler often migrates to other components, making the effect of replacement unpredictable.

This, together with our view that processor scheduling is not the only important aspect of operating system support for multimedia applications, has led us to start from scratch. As we describe below, providing a realization of a virtual processor that has the properties that we require has profound implications on the complete structure of the operating system.

The main theme guiding the design of Nemesis is multiplexing system resources at the lowest level; in the case of the processor, this multiplexing system is the scheduling algorithm. However, it is the multiplexing of all resources, real or virtual, which has determined the fundamental structure of Nemesis.

This has given rise to a system in which as much functionality as possible executes in the domain of the application.¹ This includes code that in a traditional microkernel would execute in a shared server. It should be emphasized that this need not change the interface seen by application programmers. The API seen by a programmer is often a thin layer of library code supplying a veneer over a set of kernel traps and messages to server processes, whereas in Nemesis the majority of the functionality would be provided by a shared library. As an example, a POSIX API in a Nemesis domain can be provided over a POSIX emulation which mostly runs within the application's domain.

¹The concept of domains in Nemesis will be explained in Section III; for the moment they can be thought of as analogous to UNIX processes.

In Nemesis, a service is provided as far as possible by shared library code, and the design of a service will aim to minimize the number of changes in protection domain. To aid in the construction of such services, references between the various parts of the code and data are simplified by the use of a single address space, with protection between domains provided by the access control fields of address translations.

After a discussion of quality-of-service (QoS) management and application crosstalk in Section II, the structure of the Nemesis kernel, the virtual processor model and the event mechanism, is described in detail in Section III. Events are a native concept in Nemesis. Events can be used to support an implementation of event counts and sequencers, and in practice all domains currently use this mapping. Other synchronization primitives can be built on top of event counts and sequencers when required.

Scheduling amongst and within domains is described in Section IV. Two domain scheduling algorithms are presented, one in detail, one briefly. Although scheduling is an important aspect of supporting multimedia applications, Nemesis does not take the view that there is a correct scheduling algorithm; indeed, the structure of Nemesis is designed to make the use of alternative scheduling algorithms straightforward.

Two aspects of the system are only briefly described: the linkage model for the single address space and the interdomain communication mechanisms. A system in which code implementing operating system services executes within the application domain gives rise to problems of linking the pieces of code and data required and of providing safe and efficient sharing of this code and data. These problems, as well as those directly attributable to the use of a single address space, are discussed in Section V.

Higher layer inter-domain communication (IDC) systems can be built over events. Section VI presents the system used for interdomain invocations based on a remote procedure call (RPC) model. This section also presents the bulk transfer mechanism used in Nemesis, in the context of support for networking I/O.

The current state of the implementation and systems built over it are described along with some early conclusions in Section VII.

II. THE MODEL OF QUALITY OF SERVICE MANAGEMENT

Managing QoS in an operating system can be done in a number of ways. At one extreme, one can make hard real-time guarantees to the applications, refusing to run them if the hard real-time guarantees cannot be made. At the other extreme, one can hope for the best by providing more resource than one expects to be used.

In between are a range of options which are more appropriate for multimedia systems. In general, the approach is to provide probabilistic guarantees and to expect applications to monitor their performance and adapt their behavior when resource allocation changes.

Some QoS architectures, for example [3], assume a context in which applications specify their QoS requirements to a layer
[Fig. 1. QoS feedback control: a QoS Controller, given the desired performance, directs the QoS Manager; the manager's allocations govern application execution, and the observed application performance feeds back to the controller, driving application adaptation.]
below them, which then determines how that requirement is to be met and in turn specifies derived QoS requirements to the next layer below. This is a particularly bad approach when the layers are performing multiplexing (e.g., a single thread operating on behalf of a number of applications), since great care must be taken to prevent QoS crosstalk. Even when the processing is not multiplexed, we cannot escape the need to have a recursive mapping of QoS requirements down the service stack. This is not a practical approach; providing this mapping is problematic, particularly when the application itself is unlikely to understand its QoS requirements, and when they change in time.

A. Feedback for QoS Control

Our approach is to introduce a notion of feedback control, an adaptive approach in which a controller adjusts application QoS demands in the light of the observed performance. This should be distinguished from the more usual type of feedback where applications degrade gracefully when resources are overcommitted. This is shown schematically in Fig. 1.

The QoS Controller dictates the policy to be followed and can be directly dictated by the user, by an agent running on the user's behalf, or more normally both. The QoS Manager implements the allocation of resources to try and achieve the policies dictated by the QoS Controller, and ensures their enforcement by informing the operating system and applications so they can adapt their behavior.

This scheme is directly analogous to many window systems, where the Window Manager and Server are the counterparts of the QoS Controller and Manager. In a window system, applications are made aware of the pixels they have been allocated by the server and adapt accordingly; the server enforces these allocations by clipping; and users, by using a preference profile and resizing windows directly, interact with the window manager (their agent) to express their desired policies.

This approach allows applications (and application writers) to be free from the problem of determining exactly what resources an application requires, at the cost of requiring them to implement adaptive algorithms. However, a useful side effect is that it thus simplifies the porting of applications to new platforms.

Where does this lead in relation to providing QoS guarantees within the operating system? Can everything be left to the controller, manager, and applications? A brief consideration of the feedback system leads to a conclusion: the forward performance function, that is, the application performance for a given set of resources and instruction streams, need not necessarily be predictable to obtain the desired performance, but it must be consistent. Note that while efficiency and speed of execution are desirable, they are not as important to the stability of the QoS control system as consistency.

Consistency, in turn, requires that resources are accounted correctly to the applications that consume them, or to be more accurate, to the applications that cause them to be consumed, and that QoS crosstalk between applications be kept to a minimum.

B. QoS Crosstalk

When dealing with time-related data streams in network protocol stacks, the problem of QoS crosstalk between streams has been identified. QoS crosstalk occurs because of contention for resources between different streams multiplexed onto a single lower-level channel. If the thread processing the channel has no notion of the component streams, it cannot apply resource guarantees to them, and statistical delays are introduced into the packets of each stream. To preserve the QoS allocated to a stream, scheduling decisions must be made at each multiplexing point.

When QoS crosstalk occurs, the performance of a given network association at the application level is unduly affected by the traffic pattern of other associations with which it is multiplexed. The solution advocated in [4] and [5] is to multiplex network associations at a single layer in the protocol stack immediately adjacent to the network point of attachment. This allows scheduling decisions to apply to single associations rather than to multiplexed aggregates. While this particular line of work grew out of the use of virtual circuits in ATM networks, it can also be employed in IP networks by the use of packet filters [6], [7], and fair queuing schemes [8].

Analogously, application QoS crosstalk occurs when operating system services and physical resources are multiplexed among client applications. In addition to network protocol processing, components such as device I/O, filing systems and directory services, memory management, link-loaders, and window systems are needed by client applications. These services must provide concurrency and access control to manage system state, and so are generally implemented in server processes or within the kernel.
This means that the performance of a client is dependent not only on how it is scheduled but also on the performance of any servers it requires, including the kernel. The performance of these servers is in turn dependent on the demand for their services by other clients. Thus one client's activity can delay invocations of a service by another. This is at odds with the resource allocation policy, which should be attempting to allocate resources among applications rather than servers. We can look upon scheduling as the act of allocating the real resource of the processor. Servers introduce virtual resources which must also be allocated in a manner consistent with application QoS.
C. Requirements on the Operating System

Taking this model of QoS management, including its extension to cover all resources, gives rise to the following requirements.
1) The operating system should provide facilities to allow the dynamic allocation of resources to applications.
2) The operating system should ensure that the consumption of resources is accounted to the correct application.
3) The operating system should not force applications to use shared servers where applications will experience crosstalk from other applications.

The first of these requirements can be met by what we have called the QoS Manager. This runs occasionally, as requests to change the allocation of resources are made. It does not run for any other reason and, to borrow a phrase from communications, can be said to run out of band with respect to the application computation.

The second and third of these requirements are strongly related. Both are concerned with in-band application computation and, again to use the language of communication systems, lead to a philosophy of low-level multiplexing of all resources within the system. This consideration gives rise to a novel structure for operating systems.

III. STRUCTURAL OVERVIEW

Nemesis is structured to provide fine-grained resource control and to minimize application QoS crosstalk. To meet these goals, it is important to account for as much of the time used by an application as possible, to keep the application informed of its resource use, and to enable the application to schedule its own subtasks. At odds with this desire is the need for code which implements concurrency and access control over shared state to execute in a different protection domain from the client (either the kernel or a server process).

A number of approaches have been taken to try and minimize the cost of interacting with such servers. One technique is to support thread migration; there are systems which allow threads to undergo protection domain switches, both in specialized hardware architectures [9] and conventional workstations [10]. However, such threads cannot easily be scheduled by their parent application, and must be implemented by a kernel which manages the protection domain boundaries. As a consequence, this kernel must provide synchronization mechanisms for its threads, and applications are no longer in control of their own resource trade-offs.

The alternative is to implement servers as separate schedulable entities. Some systems allow a client to transfer some of their resources to the server to preserve a given QoS across server calls. The processor capacity reserves mechanism [11] is the most prominent of these; the kernel implements objects called reserves which can be transferred from client threads to servers. This mechanism can be implemented with a reasonable degree of efficiency, but does not fully address the problem.
1) The state associated with a reserve must be transferred to a server thread when an IPC call is made. This adds to call overhead, and furthermore suffers from the kernel thread-related problems described above.
2) Crosstalk will still occur within servers, and there is no guarantee that a server will deal with clients fairly, or that clients will correctly "pay" for their service.
3) It is not clear how nested server calls are handled; in particular, the server may be able to transfer the reserve to an unrelated thread.

Nemesis takes the approach of minimizing the use of shared servers so as to reduce the impact of application QoS crosstalk: the minimum necessary functionality for a service is placed in a shared server, while as much processing as possible is performed within the application domain. Ideally, the server should only perform privileged operations, in particular access control and concurrency control.

A consequence of this approach is the desire to expose some server internal state in a controlled manner to client domains. Section V describes how the particular use of interfaces and modules in Nemesis supports a model where all text and data occupies a single virtual address space, facilitating this controlled sharing. It must be emphasized that this in no way implies a lack of memory protection between domains. The virtual to physical address translations in Nemesis are the same for all domains, while the protection rights on a given page may vary. What it does mean is that any area of memory in Nemesis can be shared, and virtual addresses of physical memory locations do not change between domains.

The minimal use of shared servers stands in contrast to recent trends in operating systems, which have been to move functionality away from client domains (and indeed the kernel) into separate processes. However, there are a number of examples in recent literature of services being implemented as client libraries instead of within a kernel or server. Efficient user-level threads packages have already been mentioned. Other examples of user-level libraries include network protocols [12], window system rendering [13], and UNIX emulation [14].

Nemesis is designed to use these techniques. In addition, most of the support for creating and linking new domains, setting up inter-domain communication, and networking is performed in the context of the application. The result is a "vertically integrated" operating system architecture, illustrated in Fig. 2.
[Fig. 2. Nemesis system architecture: unprivileged application and device driver domains running over a small privileged kernel.]

The system is organized as a set of domains, which are scheduled by a very small kernel.
A. The Virtual Processor Interface

The runtime interface between a domain and the kernel
serves two purposes:
1) it provides the application with information about when
and why it is being scheduled;
2) it supports user-level multiplexing of the CPU among
distinct subtasks within the domain.
The key concepts are activations by which the scheduler
invokes the domain, and events which indicate when and why
the domain has been invoked. If each domain is considered a
virtual processor, the activations are the virtual interrupts, the
events the virtual interrupt status.
An important data structure associated with the virtual
processor interface is the domain control block (DCB). This
contains scheduling information, communication end-points,
a protection domain identifier, an upcall entry point for the
domain, and a small initial stack. The DCB is divided into two
areas: one is writable by the domain itself, the other is readable
but not writable. A privileged service called the domain
manager creates DCBs and links them into the scheduler data
structures. The details of some of the fields in the DCB are
described below.
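As a concrete, though hypothetical, illustration, the following C sketch shows one possible layout for a DCB based solely on the description above; the field names, sizes, and the placement of fields in the writable and read-only areas are assumptions, not the actual Nemesis definitions.

    /* Hypothetical layout of a domain control block (DCB);
     * the real Nemesis structures differ. */
    #include <stdint.h>

    #define N_SLOTS 32               /* e.g., 32 context slots on Alpha/AXP */

    typedef struct {
        uint64_t gpr[31], fpr[31];   /* 31 integer and 31 floating-point registers */
        uint64_t pc, ps;             /* program counter, processor status word */
    } context_t;

    typedef struct dcb {
        /* area writable by the domain itself */
        int       activation_bit;    /* clear => "virtual interrupts" disabled */
        int       activation_slot;   /* slot designated for activations */
        int       resume_slot;       /* slot designated for resumptions */
        context_t slots[N_SLOTS];    /* processor contexts, e.g., one per thread */
        /* area readable but not writable by the domain */
        uint64_t  sched_info;        /* scheduling information (placeholder) */
        uint64_t  pdom;              /* protection domain identifier */
        void    (*activate)(void);   /* upcall entry point */
        char      stack[2048];       /* small initial stack for upcalls */
        /* ... communication end-points (event channels), etc. ... */
    } dcb_t;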
1) Activations: The concept of activations is similar to that
presented in [15]. When a domain is allocated the CPU by
the kernel, the domain is normally upcalled rather than being
resumed at the point where it lost the CPU. This allows the
domain to consider scheduling actions as soon as it obtains
CPU resource. The exceptional case of a resumption is only
used when the domain is operating within a critical section
where an activation would be difficult to cope with, entailing
re-entrant handlers. The domain controls whether it is activated
or resumed by setting or clearing the activation bit in the DCB.
This can be considered as disabling the virtual interrupts.
A Nemesis domain is provided with an array of slots in
the DCB, each of which can hold a processor context. For
example, in the case of the Alpha/AXP implementation, there
are 32 slots, each consisting of 31 integer and 31 floating-point
registers, plus a program counter and processor status word.
At any time, two of the slots are designated by the application
as the activation context and the resume context.
When a domain is descheduled, its processor context is
saved into the activation slot or the resume slot, depending
on whether the activation bit is set or not. When the domain is
once again scheduled, if its activation bit is clear, the resume
context is used; if the activation bit is set, the bit is cleared and
an upcall takes place to a routine specified in the DCB. This
entry point will typically be a user-level thread scheduler, but
domains are also initially entered this way. Fig. 3 illustrates
the two cases.
The upcall occurs on a dedicated stack (again in the DCB) and delivers information such as current system time, time of last deschedule, reason for upcall (e.g., event notification), and context slot used at last deschedule. Enough information is provided to give the domain a sufficient execution environment to schedule a thread. A threads package will typically use one context slot for each thread and change the designated activation context according to which thread is running. If more threads than slots are required, slots can be used as a cache for thread contexts. The activation bit can be used with appropriate exit checks to allow the thread scheduler to be non-reentrant, and therefore simpler.
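The save and upcall logic just described might be sketched as follows, reusing the hypothetical dcb_t and context_t above; save_context, resume_context, and upcall are invented helpers standing in for the real context-switch code.

    /* Sketch of the two cases of Fig. 3 (helper names invented). */
    extern void save_context(context_t *c);    /* save the current CPU context */
    extern void resume_context(context_t *c);  /* restore a saved context */
    extern void upcall(void (*entry)(void), char *stack);

    void deschedule(dcb_t *d)
    {
        /* save into the activation slot if the bit is set, else the resume slot */
        int slot = d->activation_bit ? d->activation_slot : d->resume_slot;
        save_context(&d->slots[slot]);
    }

    void allocate_cpu(dcb_t *d)
    {
        if (d->activation_bit) {
            d->activation_bit = 0;          /* disable "virtual interrupts" */
            upcall(d->activate, d->stack);  /* e.g., a user-level thread scheduler */
        } else {
            resume_context(&d->slots[d->resume_slot]);
        }
    }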
2) Events: In an operating system, one of the requirements is to provide a mechanism for the various devices and components to communicate. The Nemesis kernel provides events and event channels to provide the underlying notification mechanism on which a range of communications channels can be constructed. There were a number of important considerations for the event mechanism in Nemesis.
1) The mechanism must not force synchronous behavior on domains which would find an asynchronous mechanism more convenient.
2) It must be possible to communicate in a nonblocking
manner (for example, for device drivers or servers which
are QoS conscious).
[Fig. 3. Deschedules, activations, and resumptions. With the activation bit set, a deschedule saves the processor context into the designated activation slot (or a temporary activation context), and the next allocation of the CPU upcalls the domain with the bit cleared; with the bit clear, the context is saved into, and later resumed from, the resume slot.]
3) In loosely coupled multiprocessors, any explicit memory synchronization required by the communication should only have to be performed when the mechanism is invoked. This requirement is to enable portable use of partially ordered memory systems such as an Alpha AXP multiprocessor, or the Desk Area Network.
4) A thread scheduler within a domain can map communications activities to scheduling requirements efficiently; this necessitates that the communications primitives be designed in conjunction with the concurrency primitives.

These requirements dictate a solution which is asynchronous and nonblocking, and which can indicate that an arbitrary number of communications have occurred to the receiver.

The scheme is based on events, the value of which can be conveyed from a sending domain via the kernel to a recipient domain via an event channel. An event is a monotonically increasing integer which may be read and modified atomically by the sending domain. This domain can request the current value be conveyed to the recipient domain by performing the kernel system call, send(). The recipient domain holds a read-only copy of the event which is updated by the kernel as a result of a send().

As an example, Fig. 4 shows the value of event number n in domain A being propagated (by the send system call) to domain B, where it is event number m. The mapping table for event channels from A has the pair (B, m) for entry n, so the kernel copies the value from A's event table to the appropriate entry in B's event table and places B's index (in this case m) into B's circular buffer FIFO.
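A sketch of the dispatch that send() performs, following the example of Fig. 4; the table and FIFO representations, and all names, are assumptions for illustration only.

    /* Sketch of the kernel's send() dispatch (types and names invented).
     * dispatch[n] holds the (recipient, index) pair for the sender's channel n. */
    typedef struct domain domain_t;
    typedef struct { domain_t *dst; unsigned m; } chan_t;

    struct domain {
        unsigned long *events;    /* this domain's event values */
        chan_t        *dispatch;  /* kernel-protected per-domain channel table */
        /* ... plus a circular FIFO of pending event indices ... */
    };

    extern void fifo_put(domain_t *d, unsigned idx);  /* assumed helper */

    void sys_send(domain_t *a, unsigned n)
    {
        chan_t *c = &a->dispatch[n];          /* entry n: the pair (B, m) */
        c->dst->events[c->m] = a->events[n];  /* copy A's value to B's read-only copy */
        fifo_put(c->dst, c->m);               /* queue B's index m on B's FIFO */
    }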
For each domain, the kernel has a protected table of the destinations of the event channels originating at that domain. A management domain called the Binder, described in Section VI-A, is responsible for initializing these tables and thereby creating communication channels. Currently, only "point-to-point" events are implemented, although there is nothing to prevent "multicast" events if needed in the future.

Exactly what an event represents is not known by the kernel, but only by the domains themselves. The relationship between these Nemesis events and the event counts and sequencers [16] within the standard user thread library is discussed in Section IV-D. Events are also used as the underlying notification mechanism supporting interrupt dispatch and inter-domain communication facilities at a higher level of abstraction; currently support is provided for inter-domain invocations (Section VI) and streaming data operations (Section VI-C).

3) Time: In a multimedia environment, there is a particular need for a domain to know the current time, since it may need to schedule many of its activities related to time in the real world.

In many systems (e.g., UNIX), the current time of day clock is derived from a periodic system ticker. CPU scheduling is done based on this ticker, and the system time is updated when this ticker interrupt occurs, taking into account an adjustment to keep the time consistent with universal time (UCT) (e.g., using NTP [17]). In these circumstances, a domain may only be able to obtain a time value accurate to the time of the last timer interrupt, and even then the value actually read may be subject to significant skew due to adjustments.

To overcome these problems in Nemesis, scheduling time and UCT are kept separate. The former is kept as a number of nanoseconds since the system booted and is used for all scheduling and resource calculations and requests. The expected granularity of updates to this variable can be read by applications if required. Conversion between this number and UCT can be done by adding a system base value. It is this base value which can be adjusted to take account of the drift between the system and mankind's clock of convention. The scheduling time is available in memory readable by all domains, as well as being passed to a domain on activation.
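A minimal sketch of this split, assuming a shared read-only location exporting the two values; the names are illustrative.

    /* Sketch: scheduling time vs. UCT (names invented). */
    #include <stdint.h>
    typedef int64_t ns_t;

    extern volatile ns_t scheduling_time;  /* ns since boot, readable by all domains */
    extern ns_t uct_base;                  /* adjusted to track drift against UCT */

    ns_t now_uct(void)
    {
        return scheduling_time + uct_base; /* conversion is a single addition */
    }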
B. Kernel Structure

The Nemesis kernel consists almost entirely of interrupt and trap handlers; there are no kernel threads. When the kernel is entered from a domain due to a system call, a new kernel stack frame is constructed in a fixed (per processor) area of memory; likewise when fielding an interrupt.

Kernel traps are provided to send() events, to yield() the processor (with and without timeout), and for a set of three variations of a "return from activation" system call, rfa().
[Fig. 4. Example of sending an event update: entry n of process A's event table is copied, via the kernel's event dispatch table for process A, to entry m of process B's event table, and m is placed in B's FIFO.]
The return from activation system calls perform various forms of atomic context switches for the user level schedulers. Privileged domains can also register interrupt handlers, mask interrupts, and if necessary (on some processors) ask to be placed in a particular processor privileged mode (e.g., to access the TLB).

On the Alpha AXP implementation, the above calls are all implemented as PAL calls, with only the scheduler written in C (for reasons of comprehension).

Nemesis aims to schedule domains with a clear allocation of CPU time according to QoS specifications. In most existing operating systems, the arrival of an interrupt usually causes a task to be scheduled immediately to handle the interrupt, preempting whatever is running. The scheduler itself is usually not involved in this decision; the new task runs as an interrupt service routine.

The interrupt service routine (ISR) for a high interrupt rate device can therefore hog the processor for long periods, since the scheduler itself hardly gets a chance to run, let alone a user process. Such high frequency interruptions can be counterproductive; Dixon [18] describes a situation where careful prioritizing of interrupts led to high throughput, but with most interrupts disabled for a high proportion of the time.

Sensible design of hardware interfaces can alleviate this problem, but devices designed with this behavior in mind are still rare, and moreover they do not address the fundamental problem: scheduling decisions are being made by the interrupting device and interrupt dispatching code, and not by the system scheduler, effectively bypassing the policing mechanism.

The solution adopted in Nemesis decouples the interrupt itself from the domain which is handling the interrupt source. Device drivers are implemented as privileged domains; they can register an interrupt handler with the system, which is called by the interrupt dispatch code with a minimum of registers saved. This ISR typically clears the condition, masks the source of the interrupt, and sends an event to the domain responsible. This sequence is sufficiently short that it can be ignored from an accounting point of view. For example, the ISR for the LANCE Ethernet driver on the Sandpiper is 12 instructions long.
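Such an ISR does little more than the three steps just named; a hypothetical sketch, with all helper names invented:

    /* Sketch of a decoupled ISR: clear the device condition, mask the
     * source, and notify the responsible driver domain. */
    typedef struct device device_t;
    struct device { int irq; int event_channel; };

    extern void clear_condition(device_t *dev);  /* acknowledge the device */
    extern void mask_irq(int irq);
    extern void kernel_send(int event_channel);  /* as in Section III-A-2 */

    void device_isr(device_t *dev)
    {
        clear_condition(dev);             /* clear the interrupt condition */
        mask_irq(dev->irq);               /* no further interrupts until unmasked */
        kernel_send(dev->event_channel);  /* driver runs under the QoS scheduler */
    }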
Sometimes the recipient domain will actually be a specific application domain (e.g., an application which has exclusive access to a device). However, where the recipient domain of the event is a device driver domain providing a (de)multiplexing function (e.g., demultiplexing Ethernet frame types), this domain is under the control of the QoS-based scheduler like any other and can (and does) have resource limits attached.

A significant benefit of the single virtual address space approach for ISR's is that virtual addresses are valid regardless of which domain is currently scheduled. The maintenance of scatter-gather maps to enable devices to DMA data directly to and from virtual memory addresses in client domains is thus greatly simplified.

IV. SCHEDULING

A Nemesis scheduler has several goals: to reduce the QoS crosstalk between applications; to enable application-specific degradation under load; and to support applications which need some baseline resource by providing some real guarantees on CPU allocation.

[Fig. 5. Scheduling service architecture: the scheduler allocates the CPU to sdoms; an sdom is either a single contracted domain or a best-effort class containing several domains.]
A key concept is that applications should be allocated a
share of the processor. These shares are allocated by the QoS
Manager, based on an understanding of the available resources
and input from the QoS Controller (and hence from both
applications and the user). A key decision was made in order to
simplify the computation required by the scheduler on context
switches: the QoS Manager will ensure that the scheduler
can always meet its short-term demands by ensuring that less
than 100% of the processor is "contracted out" to domains
requiring QoS.
The QoS Manager takes a long-term view of the availability
of resources and uses algorithms with significant hysteresis to
provide a consistent guaranteed resource to the application.
However, this does not imply that the system is not
work-conserving: any "slack" time in the system is supplied to
those applications that request it with the information that this
is "optimistic" processor capacity and they should not adapt
their behavior to rely upon it.
A mechanism for specifying CPU time QoS must serve three
purposes: it must allow applications, users, or user agents to
specify an application's desired CPU time, enable the QoS
Manager to ensure the processor resource is not over-allocated,
and enable the scheduler to allocate processor time efficiently.
As described in Section II, Nemesis adopts an approach in
which users or user agents are expected to provide overall
control (by observation) of resource allocation. This leads to
a simple QoS specification. In the case of CPU time, there is
further advantage in a simple QoS specification: it reduces the
overhead for the scheduler in recalculating a schedule during
a context switch.
A. Scheduling Architecture and Service Model

As well as the (relatively simple) code to switch between running domains, the Nemesis scheduler has a variety of functions. It must:
1) account for the time used by each holder of a QoS guarantee and provide a policing mechanism to ensure domains do not overrun their allotted time;
2) implement a scheduling algorithm to ensure that each contract is satisfied;
3) block and unblock domains in response to their requests and the arrival of events;
4) present an interface to domains which makes them aware both of their own scheduling and of the passage of real time;
5) provide a mechanism supporting the efficient implementation of potentially specialized threads packages within domains.

Applications in Nemesis specify neither priorities nor deadlines. The scheduler deals with entities called scheduling domains (sdoms), to which it aims to provide a particular share of the processor over some short time frame. An sdom may correspond to a single Nemesis domain or a set of domains collectively allocated a share.

The service architecture is illustrated in Fig. 5. Sdoms usually correspond to contracted domains, but can also correspond to best-effort classes of domains. In the latter case, processor time allotted to the sdom is shared out among its domains according to one of several algorithms, such as simple round-robin or multilevel feedback queues. The advantage of this approach is that a portion of the total CPU time can be reserved for domains with no special timing requirements to ensure that they are not starved of processor time. Also, several different algorithms for scheduling best-effort domains can be run in parallel without impacting the performance of time-critical activities.

It has already been mentioned that within Nemesis scheduling using shares is a core concept; however, the particular scheduling algorithm is open to choice. The Atropos scheduler, now the "standard" Nemesis scheduler, is described in detail below.

B. The Atropos Scheduler

With the Atropos scheduler, shares are specified using an application dependent period. The share of the processor each sdom receives is specified by a tuple {s, p, x, l}. The slice s and period p together represent the processor bandwidth to the sdom: it will receive at least s ticks of CPU time (perhaps as several time slices) in each period of length p. x is a Boolean
value used to indicate whether the sdom is prepared to receive
"slack" CPU time. l, the latency hint, is described below.
The Atropos scheduler internally uses an earliest deadline first (EDF) algorithm to provide this share guarantee. However, the deadlines on which it operates are not available to or specified by the application; for the implementation of the scheduler to be simple and fast, it relies on the fact that the QoS Manager has presented it with a soluble problem. This could not be ensured if the applications were allowed to specify their own deadlines.
An sdom can be in one of five states and may be on a scheduler queue:

  Queue    State       Meaning
  Qr, dr   running     running in guaranteed time
  Qr, dr   runnable    guaranteed time available
  Qw, dw   waiting     awaiting a new allocation of time
  Qw, dw   optimistic  running in slack time
  Qb       blocked     awaiting an event
For each sdom, the scheduler holds a deadline d, which is the time at which the sdom's current period ends, and a value r which is the time remaining to the sdom within its current period. There are queues Qr and Qw of runnable and waiting sdoms, both sorted by deadline (with dx and px the deadlines and periods of the respective queue heads), and a third queue Qb of blocked sdoms.
The scheduler requires a hardware timer that will cause the scheduler to be entered at or very shortly after a specified time in the future, ideally with a microsecond resolution or better.² When the scheduler is entered at time t as a result of a timer interrupt or an event delivery:
1) the time for which the current sdom has been running is deducted from its value of r;
2) if r is now zero, the sdom is inserted in Qw;
3) for each sdom on Qw for which t ≥ d, r is set to s, its new deadline is set to d + p, and it is moved to Qr;
4) a time is calculated for the next timer interrupt depending on which of dr or dw + pw is the lower;
5) the scheduler runs the head of Qr, or if empty selects an element of Qw.
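The following C sketch restates steps 1)-5); the queue helpers and types are assumptions, and boundary cases (such as a completely idle system) are elided.

    /* Sketch of the Atropos reschedule path (steps 1-5 above). */
    #include <stdint.h>
    #include <stdbool.h>

    typedef int64_t ns_t;

    typedef struct sdom {
        ns_t r, d;         /* remaining time in period; deadline (period end) */
        ns_t s, p;         /* slice and period from the {s, p, x, l} tuple */
        ns_t last_start;   /* when this sdom was last given the CPU */
    } sdom_t;

    /* assumed: deadline-sorted queues Qr and Qw, and these helpers */
    extern sdom_t *qr_head(void), *qw_head(void), *current;
    extern bool    qr_empty(void), qw_empty(void);
    extern void    move_to_qw(sdom_t *), move_to_qr(sdom_t *);
    extern void    set_timer(ns_t), run(sdom_t *), run_slack(void);

    void reschedule(ns_t t)
    {
        current->r -= t - current->last_start;      /* 1) charge time used */
        if (current->r <= 0)
            move_to_qw(current);                    /* 2) slice exhausted */

        while (!qw_empty() && t >= qw_head()->d) {  /* 3) start new periods */
            sdom_t *s = qw_head();
            s->r = s->s;                            /*    r := s           */
            s->d += s->p;                           /*    d := d + p       */
            move_to_qr(s);
        }

        /* 4) next timer interrupt at the lower of dr and dw + pw */
        ns_t dr   = qr_empty() ? INT64_MAX : qr_head()->d;
        ns_t dwpw = qw_empty() ? INT64_MAX : qw_head()->d + qw_head()->p;
        set_timer(dr < dwpw ? dr : dwpw);

        if (!qr_empty())
            run(qr_head());                         /* 5) head of Qr ...   */
        else
            run_slack();                            /*    ... or slack time */
    }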
This basic algorithm will find a feasible schedule. This is seen by regarding a "task" as "the execution of an sdom for s nanoseconds": as the QoS Manager has ensured that the total share of the processor allocated is less than 100% (i.e., Σ s_i/p_i < 1), and slices can be executed at any point during their period, this approach satisfies the conditions required for an EDF algorithm to function correctly [19].
This argument relies on two simplifications: first, that scheduling overhead is negligible, and second, that the system is in a steady state. The first is addressed by ensuring there is sufficient slack time in the system to allow the scheduler to run and by not counting time in the scheduler as used by anybody.

²Such a timer is available on the DECchip EB64 board used to prototype Nemesis, but has to be simulated with a 122 μs periodic ticker on the Sandpiper workstations.
The second is concerned with moving an sdom with a share allocation from Qb to Qr. A safe option is to set d := t + p and r := s; this introduces the sdom with the maximum scheduling leeway, and since a feasible schedule exists no deadlines will be missed as a result. For most domains this is sufficient, and it is the default behavior.
In the limit, all sdoms can proceed simultaneously with an instantaneous share of the processor which is constant over time; this limit is often referred to as processor sharing. Moreover, the algorithm can efficiently support domains requiring a wide range of scheduling granularities.
1) Interrupts and Latency Hint: In fact, when unblocking an sdom which has been asleep for more than its period, the scheduler sets r := s and d := t + l, where l is the latency hint. The default behavior just described is then achieved by setting l := p for most domains. However, in the case of device drivers reacting to an interrupt, faster response is sometimes required. If the device domain is using less than its share of processor capacity, the unblocking latency hint l provides a means for a device driver domain to respond to an interrupt with low latency.
The consequences of reducing l in this way are that if such
an sdom is woken up when the complete system is under
heavy load, some sdoms may miss their deadline for one of
their periods. The scheduler's behavior in these circumstances
is to truncate the running time of the sdoms: they lose part of
their slice for that period. Thereafter, things settle down.
At a high interrupt rate from a given device, at most one
processor interrupt is taken per activation of the driver domain,
so that the scheduling mechanism is enforcing a maximum
interrupt and context switch rate. Hence, as the activity in the
device approaches the maximum that the driver domain has
time to process with its CPU allocation, the driver rarely has
time to block before the next action in the device that would
cause an interrupt, and so converges to a situation where the
driver polls the device whenever it has the CPU.
When device activity is more than the driver can process, overrun occurs. Device activity which would normally cause interrupts is ignored by the system, since the driver cannot keep up with the device. This is deemed to be more desirable than having the device schedule the processor: if the driver has all the CPU cycles, the "clients" of the device wouldn't be able to do anything with the data anyway. If they could, then the driver is not being given enough processor time by the domain manager. The system can detect such a condition over a longer period of time and reallocate processor bandwidth in the system to adapt to conditions.
2) Use of Slack Time: As long as Qr is nonempty, the
sdom at the head is due some contracted time and should
be run. If Qr becomes empty, the scheduler has fulfilled all
its commitments to sdoms until the head of Qw becomes
runnable. In this case, the scheduler can opt to run some sdom
in Qw for which x is true, i.e., one which has requested use of
slack time in the system. Domains are made aware of whether
they are running in this manner or in contracted time by a flag
in their DCB.
The current policy adopted by the scheduler is to run a
random element of Qw for a small, fixed interval or until the
head of Qw becomes runnable, whichever is sooner. Thus several sdoms can receive the processor "optimistically" before Qr becomes nonempty. The best policy for picking sdoms to run optimistically is a subject for further research. The current implementation allocates a very small time quantum (122 μs) to a member of Qw picked cyclically. This works well in most cases, but there have been situations in which unfair "beats" have been observed.

C. Other Schedulers

The Nemesis system does not prescribe a scheduler per se; the Atropos scheduler is simply the one in common use. Other schedulers can be used where appropriate.

An alternative scheduler, known as the Jubilee scheduler, has been developed. It differs from the Atropos scheduler in that CPU resources are allocated to applications using a single system defined frequency. The period of this system wide frequency is known as a Jubilee, and will typically be a few tens of milliseconds. The Jubilee scheduler has scheduling levels in a strict priority order, one for guaranteed CPU, the others for successively more speculative computations. Like Atropos, it has a mechanism for handing out slack time in the system. The use of priority is internal to the scheduler and not visible to client domains.

The fixed Jubilees remove the need for EDF scheduling, and the scheme is particularly suited to situations where the application load is well understood and where a single Jubilee size can be chosen. Complete details can be found in [20].

D. Intra-Domain Scheduling

This section considers the implementation of an intra-domain scheduler to provide a familiar threaded environment. The intra-domain scheduler is the code which sits above the virtual processor interface. The code is not privileged and can differ from domain to domain. It may be very simple (in the case of a single threaded domain), or more complex.

The base technique for synchronization that was adopted within domains was to extend the use of the core Nemesis events already present for interdomain communication, and provide event counts and sequencers [16]. These event counts and sequencers can be purely local within the domain or attached to either outbound events (those which can be propagated to another domain using send()) or inbound events (those which change asynchronously as a result of some other domain issuing a send()).

1) Event Counts and Sequencers: There are three operations available on an event count e and two on a sequencer s. These are the following.

read(e): Returns the current value of the event count e. More strictly, this returns some value of the event count between the start of this operation and its termination.
await(e, v): Blocks the calling thread until the event count e reaches or exceeds the value v.
advance(e, n): Increments the value of event count e by the amount n. This may cause other threads to become runnable.
read(s): Returns the current value of the sequencer s. More strictly, this returns some value of the sequencer between the start of this operation and its termination.
ticket(s): Returns the current member of a monotonically increasing sequence and guarantees that any subsequent calls to either ticket or read will return a higher value.

In fact, there is little difference between the underlying semantics of sequencers and event counts; the difference is that the ticket operation does not need to consider awaking threads, whereas the advance operation does (therefore, it is wrong for a thread to await on a sequencer). The initial value for sequencers and event counts is zero; this may be altered immediately after creation using the above primitives. An additional operation await_until(e, v, t) is supported, which waits until event e has value v or until time has value t.

By convention, an advance on an outbound event will cause the new value to be propagated by issuing the send() system call. Only read and await should be used on incoming events, as their value may be overwritten at any time. In this way, both local and interdomain synchronization can be achieved using the same interface and, unless required, a user level thread need not concern itself with the difference.

2) Concurrency Primitives Using Events: In contrast to many other systems, where implementing one style of concurrency primitives over another set can be expensive, it is very efficient to implement many schemes over event counts. The mutexes and condition variables of SRC threads [21], POSIX threads, and the semaphores used in the Wanda system have all been implemented straightforwardly and efficiently over event counts. Details can be found in [20].
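For example, a FIFO mutual-exclusion lock is obtained from one event count and one sequencer, following the classic construction of [16]; the sketch below assumes primitives with the semantics given above (the C signatures are hypothetical).

    /* Sketch: a mutex built from an event count and a sequencer. */
    typedef struct event_count event_count_t;  /* opaque; semantics as above */
    typedef struct sequencer   sequencer_t;

    extern void          await(event_count_t *e, unsigned long v);
    extern void          advance(event_count_t *e, unsigned long n);
    extern unsigned long ticket(sequencer_t *s);

    /* initially read(e) == 0 and the first ticket(s) returns 0 */
    typedef struct { event_count_t *e; sequencer_t *s; } mutex_t;

    void mutex_lock(mutex_t *m)
    {
        /* take a ticket, then wait until the event count reaches it */
        await(m->e, ticket(m->s));
    }

    void mutex_unlock(mutex_t *m)
    {
        /* admit the next ticket holder (may make a waiting thread runnable) */
        advance(m->e, 1);
    }

Because tickets are issued in order and await blocks until the count catches up, lockers enter in strict FIFO order; the first locker proceeds immediately since both values start at zero.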
Implementing threads packages over the upcall interface has proved remarkably easy. A Nemesis module implementing both preemptive and nonpreemptive threads packages, providing both an interface to the event mechanism and synchronization based on event counts and sequencers, comes to about 2000 lines of heavily commented C and about 20 assembler opcodes. For comparison, the POSIX threads library for OSF/1 achieves essentially the same functionality over OSF/1 kernel threads with over 6000 lines of code, with considerably inferior performance.

V. INTERFACES AND INVOCATION

The architecture introduced in Section III raises a number of questions concerning the structure of applications, how services traditionally provided by servers or a kernel are provided, and how applications process their own exceptions. In order to describe the inter-domain communication system of Nemesis, it is necessary to present some of the higher level constructs used in Nemesis to complement the single virtual address space approach. A full account can be found in [38].
The key aspects are the extensive use of typing, transparency, and modularity in the definition of the Nemesis interfaces, and the use of closures to provide comprehensible, safe, and extensive sharing of data and code.

Within the Nemesis programming model, there are concepts of an interface reference and an invocation reference, the latter being obtained from the former by binding. An interface reference is an object containing the information used as part of binding to build an invocation reference to a particular instance of an interface. The invocation reference will be a closure of the appropriate type and may be either a simple pointer to library code (and local state) or to a surrogate for a remote interface. In the local case, an interface reference and invocation reference have the same representation, a pointer to a closure, and binding is an implicit and trivial operation.

In Nemesis, as in Spring [14], all interfaces are strongly typed, and these types are defined in an interface definition language (IDL). The IDL used in Nemesis, called MIDDL, is similar in functionality to the IDLs used in object-based RPC systems, with some additional constructs to handle local and low-level operating system interfaces. A MIDDL specification defines a single abstract data type by declaring its supertype, if any, and giving the signatures of all the operations it supports. A specification can also include declarations of exceptions and concrete types.

The word "object" in Nemesis denotes what lies behind an interface: an object consists of state and code to implement the operations of the one or more interfaces it provides. A class is a set of objects which share the same underlying implementation, and the idea of object class is distinct from that of type, which is a property of interfaces rather than objects.³

³This is different from C++, where there is no distinction between class and type, and hence no clear notion of an interface. C++ abstract classes often contain implementation details, and were added as an afterthought [23, p. 277].

When an operation is invoked upon an object across one of its interfaces, the environment in which the operation is performed depends only on the internal state of the object and the arguments of the invocation. There are no global symbols in the programming model. Apart from the benefits of encapsulation this provides, it facilitates the sharing of code.

In order to overcome the awkwardness that the lack of global symbols might produce (consider having to pass a reference to a memory allocation heap on virtually every invocation), certain interfaces are treated as part of the thread context. These are known as pervasives. The programming model includes the notion of the currently executing thread, and the current pervasives are always available. These include exception handlers, thread operations, domain control operations, and the default memory allocation heap.

The programming model is supported by a linkage model. A stub compiler is used to map MIDDL type definitions to C language types. The compiler, known as middlc, processes an interface specification and generates a header file giving C type declarations for the concrete types defined in the interface, together with special types used to represent instances of the interface.

An interface is represented in memory as a closure: a record of two pointers, one to an array of function pointers and one to a state record. To invoke an operation on an interface, the client calls through the appropriate element of the operation table, passing as first argument the address of the closure itself.
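In C, this is the familiar operations-table pattern; the following sketch uses an invented "File" interface purely for illustration.

    /* Sketch of the closure convention, using a hypothetical interface. */
    typedef struct File_cl File_cl;

    typedef struct {
        long (*Read) (File_cl *self, void *buf, long n);
        long (*Write)(File_cl *self, const void *buf, long n);
    } File_op;

    struct File_cl {
        const File_op *op;   /* array of function pointers (operation table) */
        void          *st;   /* the instance's state record */
    };

    /* an invocation calls through the table, passing the closure itself: */
    long example(File_cl *f, void *buf, long n)
    {
        return f->op->Read(f, buf, n);
    }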
Nemesis provides a framework for building various inter­
domain communication mechanisms and abstractions using
events for notification and shared memory for data transfer.
One such model is inter-domain invocation; use is made of
the Nemcsis run-time type system to allow an arbitrary inter­
face to be made available for use by other domains. The basic
paradigm adopted is then dictated by the MIDDL interface
definition language RPC, with the addition of "announcement"
operations, which allow use of message passing semantics.
The use of an RPC paradigm for invocations in no way implies the traditional RPC implementation techniques (marshaling into a buffer, transmission of the buffer, unmarshaling and dispatching, etc.). There are cases where the RPC programming model is appropriate, but the underlying implementation can be radically different. In particular, with the rich sharing of data and text afforded by a single address space, a number of highly efficient implementation options are available.
Furthermore, there are situations where RPC is clearly not the ideal paradigm: for example, bulk data transfer or continuous media streams are often best handled using an RPC interface only for out-of-band control. This is the case with the Rbuf mechanism presented in Section VI-C, which employs the binding model described here and an interface-based control mechanism.
Operating systems research to date has tended to focus on optimizing the performance of the communication systems used for RPCs, with relatively little attention given to the process of binding to interfaces. By contrast, the field of distributed processing has sophisticated and well-established notions of interfaces and binding, for example, the "Trader" within the ANSA architecture [24]. The Nemesis binding model shares many features with the ANSA model.
This section briefly describes the Nemesis approach to inter-domain binding and invocation, including optimizations which make use of the single address space and the system's notion of real time to reduce synchronization overhead and the need for protection domain switches, followed by an outline of the support for stream-based IDC.
A. Binding
In order to invoke operations on a remote interface to which it has an interface reference, a client requires a local interface encapsulating the implementation needed for the remote invocation. This is what we have previously described as an invocation reference. In Nemesis IDC, an invocation reference is a closure pointer of the same type as the remote interface, in other words a surrogate for the remote interface.
An interface reference typically arrives in a domain as the result of a previous invocation. Name servers or traders provide services by which clients can request a service by specifying its properties. An interface reference is matched to the service request and then returned to the client.
In the local case (described in Section V), an interface ref­
erence is simply a pointer to the interface closure, and binding
is the trivial operation of reading the pointer. In the case
where communication has to occur across protection domain
boundaries (or across a network), the interface reference has
to include rather more information and the binding process is
correspondingly more complex.
1) Implicit Versus Explicit Binding: An implicit binding mechanism creates the state associated with a binding in a manner invisible to the client. An invocation which is declared to return an interface reference actually returns a closure for a valid surrogate for the interface. Creation of the surrogate can be performed at any stage between the arrival of the interface reference in an application domain and an attempt by the application to invoke an operation on the interface reference. Indeed, bindings can time out and then be re-established on demand.
The key feature of the implicit binding paradigm is that
information about the binding itself is hidden from the client,
who is presented with a surrogate interface indistinguishable
from the "real thing." This is the approach adopted by many
distributed object systems, for example Modula-3 Network Objects [25] and CORBA [26]. It is intuitive and easy to use
from the point of view of a client programmer, and for many
applications provides all the functionality required, provided
that a garbage collector is available to destroy the binding
when it is no longer in use.
On the other hand, traditional RPC systems have tended to
require clients to perform an explicit bind step due to the diffi-
culty of implementing generic implicit binding. The advent of
object-based systems has recently made the implicit approach
prominent for the reasons mentioned above. However, implicit
binding is inadequate in some circumstances, due to the hidden
nature of the binding mechanism. It assumes a single, "best
effort" level of service, and precludes any explicit control
over the duration of the binding. Implicit binding can thus
be ill-suited to the needs of time-sensitive applications.
For this reason, within Nemesis bindings can also be es-
tablished explicitly by the client when needed. If binding is
explicit, an operation which returns an interface reference does
not create a surrogate as part of the unmarshaling process, but
instead provides a local interface which can be later used to
create a binding. This interface can allow the duration and
qualities of the binding to be precisely controlled at bind time
with no loss in type safety or efficiency. The price of this level
of control is extra application complexity which arises both
from the need to parameterize the binding and from the loss of
transparency: acquiring an interface reference from a locally-
implemented interface can now be different from acquiring
one from a surrogate.
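The distinction can be sketched from the client's point of view as follows; all names are hypothetical, since the concrete Nemesis binding interfaces are not given here.

    /* An interface reference names a (possibly remote) interface;
     * an invocation reference is a local closure.  All names are
     * invented. */
    typedef struct IORef   IORef;
    typedef struct File_cl File_cl;     /* surrogate closure type */

    typedef struct BindSpec {           /* chosen at bind time    */
        long duration_ms;               /* lifetime of binding    */
        long qos_class;                 /* transport quality      */
    } BindSpec;

    /* Implicit style: unmarshaling an interface reference itself
     * yields a ready surrogate with default, best-effort service. */
    File_cl *result_of_invocation(void);

    /* Explicit style: the interface reference is returned as data
     * and the client parameterizes the binding itself. */
    File_cl *bind_explicitly(IORef *ref, const BindSpec *spec);

    /* Example client exercising per-binding control. */
    static File_cl *open_with_qos(IORef *ref)
    {
        BindSpec spec = { 5000 /* ms */, 1 /* class */ };
        return bind_explicitly(ref, &spec);
    }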
B. Remote Invocation
The invocation aspect of IDC (how invocations occur across
a binding) is independent of the binding model used. Ideally,
an IDC framework should be able to accommodate several
different methods of data transport within the computational
model.
RPC invocations have at least three aspects:
1) the transfer of information from sender to receiver,
whether client or server;
2) signaling the transfer of information;
3) the transfer of control from the sender to the receiver.
Current operating systems which support RPC as a local
communications mechanism tend to use one of two approaches
to the problem of carrying a procedure invocation across
domain boundaries: message passing and thread tunneling.
With care, a message passing system using shared memory
regions mapped pairwise between communicating protection
domains can provide high throughput, particularly by amortiz­
ing the cost of context switches over several invocations-in
other words by having many RPC invocations from a do­
main outstanding. This separation of information transfer from
control transfer is especially beneficial in a shared memory
multiprocessor, as described in [27].
The thread tunneling model achieves very low latency by
combining all components into one operation: the transfer of
the thread from client to server, using the kernel to simulate
the protected procedure calls implemented in hardware on, for
example, Multics [28] and some capability systems such as
the CAP [9]. An example is the replacement of the original
TAOS RPC mechanism by Lightweight RPC [10].
In these cases, the performance advantage of thread tun­
neling comes at a price; since the thread has left the client
domain, it has the same effect as having blocked as far as the
client is concerned. All threads must now be scheduled by the
kernel (since they cross protection domain boundaries), thus
applications can no longer reliably internally multiplex the
CPU. Accounting information must be tied to kernel threads,
leading to the crosstalk discussed in Section III.
1) Standard Mechanism: The "baseline" IDC invocation
transport mechanism in Nemesis operates very much like a
conventional RPC mechanism. The bind process creates a
pair of event channels between client and server. Each side
allocates a shared memory buffer and ensures that it is mapped
read-only into the other domain. The server creates a thread
which waits on the incoming event channel.
An invocation copies the arguments (and the operation to
be invoked) into the client's buffer and sends an event on
its outgoing channel, before waiting on the incoming event
channel. The server thread wakes up, unmarshals the argu­
ments, and calls the concrete interface. Results are marshalled
back into the buffer, or any exception raised by the server is
caught and marshalled. The server then sends an event on its
outgoing channel, causing the client thread to wake up. The
client unmarshals the results and reraises any exceptions.
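The client-side call path can be sketched as follows; the primitives and names are placeholders for the real Nemesis operations, which are not specified here.

    #include <stdint.h>

    /* Hypothetical primitives standing in for the real ones. */
    void event_send(int channel);             /* advance event count */
    void event_wait(int channel, uint64_t v); /* block until >= v    */

    struct idc_buf { uint8_t bytes[1024]; };  /* pairwise shared,    */
                                              /* read-only in peer   */
    struct idc_binding {
        struct idc_buf *tx_buf;   /* writable here                */
        struct idc_buf *rx_buf;   /* writable in the server       */
        int  tx_chan, rx_chan;    /* event channels from binding  */
        uint64_t calls;           /* completed call count         */
    };

    /* One synchronous invocation: marshal, signal, wait, unmarshal.
     * marshal()/unmarshal() would be generated per operation by the
     * stub compiler; here they are left abstract. */
    long idc_invoke(struct idc_binding *b,
                    long (*marshal)(struct idc_buf *),
                    long (*unmarshal)(const struct idc_buf *))
    {
        marshal(b->tx_buf);            /* args + operation index    */
        event_send(b->tx_chan);        /* wake the server thread    */
        b->calls++;
        event_wait(b->rx_chan, b->calls); /* sleep until results    */
        return unmarshal(b->rx_buf);   /* results, or re-raise exn  */
    }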
Stubs for this transport are entirely generated by the MIDDL
compiler, and the system is used for cases where perfor­
mance is not critical. Measurements have been taken of null RPC times between two domains on an otherwise unloaded DEC 3000/400 Sandpiper. Most calls take about 30 μs, which compares very favorably with those reported in [29] for Mach (88 μs) and Opal (122 μs) on the same hardware. Some calls (20% in the experiments) take between 55 μs and 65 μs; these have experienced more than one reschedule between event transmissions. Nemesis does not currently implement full memory protection for its domains; the cost of a full protection domain switch consists of a single instruction to flush the 21064 data translation buffer (DTB), followed by a few DTB misses. The cost of a DTB fill on the current hardware has been estimated at less than 1 μs.
C. Rbufs
The inter-process communication mechanism described above fits the needs of inter-domain invocation quite well, but is not appropriate for stream-based bulk transfer of data. Besides Pipes and Streams, schemes for controlling such transfers are more often integrated with network buffering and include Mbufs [30], IOBufs [18], Fbufs [31], and other schemes to support application data unit (ADU) transfer, such as the IP trailers [32] scheme found in some versions of BSD. A full discussion of these schemes can be found in [20].

The scheme presented here, Rbufs, is intended as the principal mechanism both for interdomain streams and for streams between devices and application domains. The main design considerations are based on the requirements for networking, and it is in that context that it is presented; however, as is demonstrated with the fileserver example, it is also intended for more general stream I/O use.
The requirements for an I/O buffering system in Nemesis are slightly different from all of the above systems. In Nemesis, applications can negotiate for resources which possess availability guarantees. This means that an application can have a certain amount of buffer memory which will not be paged; if the system is short of memory, then the QoS Manager will require the application to free a certain amount. Hence, like the Fbuf system, there is no need for highly dynamic reallocation of buffers between different I/O data paths. Also, it would be preferable if multirecipient data need not be copied.
1) Device Hardware Considerations: It is useful to distinguish between network interfaces which are self-selecting and those which are not. Self-selecting interfaces use the VCI (or similar) in the header of arriving data to access the correct buffer control information. These are typically high-bandwidth interfaces with DMA support. Nonself-selecting interfaces require software copying of data (e.g., Ethernet).

Examples of self-selecting interfaces include the Aurora TURBOchannel interface [33] and the Jetstream/Afterburner combination [34]. In Jetstream, arriving packets enter a special buffer memory based on the arriving VCI. The device driver then reads the headers and instructs a special DMA engine to copy the data to the final location. Knowledgeable applications may make special use of the buffer pools in the special memory.
It has been recent practice in operating systems to support a protocol-independent scheme for determining the process for which packets arriving at an interface are destined. This is known as packet filtering [6], and this technology is now highly advanced [7], [35]. For nonself-selecting interfaces, packet filtering can determine which I/O path the data will travel along as easily as it can determine which process will be the receiver. This property is assumed in the Nemesis buffer mechanism derived below.
On older hardware, many devices which used DMA required a single noninterrupted access to a contiguous buffer. On more recent platforms, such as the TURBOchannel [36], the bus architecture requires that a device burst for at most some maximum period before relinquishing the bus. This is to prevent the cache and write buffer being starved of memory bandwidth and halting the CPU. Devices are expected to have enough internal buffering to weather such gaps. Also, the high bandwidth that is available from DMA on typical workstations depends on accessing the DRAMs using page mode; such accesses mean that the DMA cycle must be re-initiated on crossing a DRAM page boundary. Furthermore, most workstations are designed for running UNIX with its noncontiguous Mbuf chains. The result is that most high-performance DMA hardware is capable of at least limited scatter-gather operation.
2) Protocol Software Considerations: Most commonly used protocols wish to operate on a data stream in three ways. These are:
1) to add a header (e.g., Ethernet, IP, TCP, UDP, XTP);
2) to add a trailer (e.g., XTP, AAL5);
3) to break up a request into smaller sizes.
Headers: Headers are usually used to ensure that data gets to the right place, or to signify that it came from a particular place. We can consider how such operations affect high-performance stream I/O, particularly in respect of security. In the Internet, much of the security which exists relies on secure port numbers. These are port numbers which are only available to the highest authority on a given machine, and receivers may assume that any such packet bears the full authority of the administrators of the source machine rather than an arbitrary user. It is similarly important that machines accurately report their own addresses. For this reason, the transmission of arbitrary packets must be prohibited; transmission must include the correct headers as authorized by the system. This has been one reason for placing networking protocols in the kernel or, in a microkernel, implementing them in a single "networking daemon." However, this is not a foregone conclusion.

It is possible instead to have protocol implementations within the user process, and still retain the required security. The device driver must then perform the security control. There is a broad spectrum of possible ways of engineering such a solution. At one extreme, the device driver actually includes code (via a trusted library) which "understands" the protocol and checks the headers; this is close to implementing the protocol itself in each device driver.

Alternatively, the device driver could include an "inverse" packet filter: code which determines whether a packet is valid for transmission (rather than reception). As with a packet filter for reception, this process can be highly optimized.
For any implementation, the volatility of the buffer memory must be taken into consideration; the driver must protect against the process corrupting the headers after they have been checked. This may entail copying the security-related fields of the header before checking them. Another solution may rely on caching the secure part of the header in the device driver's private memory and updating the per-packet fields.

For many other fields, such as checksums, the user process is the only one to suffer if they are not initialized correctly. Thus, for UDP and TCP only the port values need to be secured; for IP, all but the length and checksum fields must be secured; and for Ethernet, all the fields must be secured.

One final possible concern would be with respect to flow control or congestion avoidance; conceivably a user process could have private code which disobeyed the standards on TCP congestion control. There are various answers to this. First, a malevolent user process could simply use UDP, which has no congestion control, instead if it wished. Second, since the operating system is designed with QoS support, the system could easily limit the rate at which a process is permitted to transmit. Third, the application may in fact be able to make better use of the resources in the network due to application-specific knowledge, or by using advanced experimental code.
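As a concrete illustration of the inverse packet filter idea discussed above, the following sketch validates the authorized fields of an Ethernet/IP/UDP header against a per-channel template. The layout and names are invented; it assumes no IP options (a 20-byte IP header), and in practice the check would run on a copy of the header held in the driver's private memory, since the buffer is volatile.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Per-channel template of the header fields authorized at
     * bind time (hypothetical layout). */
    struct tx_template {
        uint8_t eth_hdr[14];   /* Ethernet: every field must match */
        uint8_t ip_src[4];     /* IP source address must match     */
        uint8_t udp_sport[2];  /* UDP source port must match       */
    };

    /* Inverse packet filter: nonzero iff the frame may be sent.
     * Length and checksum fields are deliberately not checked:
     * only the sender suffers if they are wrong.  A full filter
     * would also pin the remaining IP fields (version, protocol,
     * destination, etc.); only source-identifying fields are
     * shown here. */
    static int tx_filter(const struct tx_template *t,
                         const uint8_t *frame, size_t len)
    {
        if (len < 14 + 20 + 8)                          /* runt     */
            return 0;
        if (memcmp(frame, t->eth_hdr, 14) != 0)         /* Ethernet */
            return 0;
        if (memcmp(frame + 14 + 12, t->ip_src, 4) != 0) /* IP src   */
            return 0;
        if (memcmp(frame + 14 + 20, t->udp_sport, 2) != 0) /* port  */
            return 0;
        return 1;
    }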
Trailers: Unlike headers, trailers do not usually contain any security information. Trailers are most easily dealt with by requiring the user process to provide enough space (or the correct padding) for the packet on both receive and transmit. If there is not enough, the packet will simply be discarded, a loss to the user process. Providing this space is not difficult for a process once it is known how much is necessary; this value can be computed by a shared library or discovered using an IDC call.
Application data units: Many applications have application-specific basic data units which may be too large for individual network packets. For example, NFS blocks over Ethernet are usually fragmented at the IP level. Ideally, a system should permit the application to specify receive buffers in such a way that the actual data of interest to the application ends up in contiguous virtual addresses.

On the other hand, for some applications, the application's basic data unit (i.e., the unit over which the application considers loss of any sub-part to be loss of the total) may be very small. This may be found in multimedia streams such as audio over ATM, and compressed tiled video. For such streams, the application should not have to suffer very large numbers of interactions with the device driver; it should be able to handle the data stream only when an aggregate of many small data units is available.
3) Operation: The Rbuf design separates the three issues of I/O buffering, namely:
• the actual data;
• the offset/length aggregation mechanisms;
• the memory allocation and freeing concerns.
An I/O channel is comprised of a data area (for the actual data) and some control areas (for the aggregation information). The memory allocation is managed independently of the I/O channel by the owner of the memory.

Data area: The Rbuf data area consists of a small number of large contiguous regions of the virtual address space. These areas are allocated by the system and are always backed by physical memory. Revocation of this memory is subject to out-of-band discussion with the memory system. To as large an extent as possible, the memory allocator will keep these contiguous regions of virtual addresses backed by contiguous regions of physical addresses (this is clearly a platform-dependent factor).

The system provides a fast mechanism for converting Rbuf data area virtual addresses into physical addresses for use by drivers that perform DMA. On many platforms a page table mapping indexed by virtual page number exists for use by the TLB miss handler; on such platforms, this table can be made accessible to the device driver domain with read-only status.

Protection of the data area is determined by the use of the I/O channel. It must be at least writable in the domain generating the data and at least readable in the domain receiving the data. Other domains may also have access to the data area, especially when an I/O channel spanning multiple domains (see Section VI-C4) is in use.

One of the domains is logically the owner in the sense that it allocates the addresses within the data area which are to be used. The Rbuf data area is considered volatile and is always updateable by the domain generating the data.
Data aggregation: A collection of regions in the data area may be grouped together (e.g., to form a packet) using a data structure known as an I/O Record, or iorec. An iorec is closest in form to the UNIX concept of an iovec. It consists of a header followed by a sequence of base and length pairs. The header indicates the number of such pairs which follow it and is padded to make it the same size as a pair.
This padding could be used on some channels, where appropriate, to carry additional information: for example, the exact time at which the packet arrived, or partial checksum information if this is computed by the hardware. Reference [37] points out that for synchronization it is more important to know exactly when something happened than to get to process it immediately.
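In C, an iorec might be declared along the following lines; the field names and widths are illustrative only, not the actual Nemesis definitions.

    #include <stdint.h>

    /* One contiguous extent within the Rbuf data area. */
    struct rbuf_pair {
        uintptr_t base;   /* start address within the data area */
        uintptr_t len;    /* length of the region in bytes      */
    };

    /* An iorec: a header padded to the size of one pair, followed
     * by nrecs base/length pairs describing a single aggregate
     * (e.g., a packet). */
    struct iorec {
        uintptr_t nrecs;         /* number of pairs that follow  */
        uintptr_t pad;           /* padding; may carry, e.g., an
                                    arrival timestamp            */
        struct rbuf_pair recs[]; /* the base/length pairs        */
    };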
Control areas: A control area is a circular buffer used in
a producer/consumer arrangement. A pair of event channels
is provided between the domains to control access to this
circular buffer. One of these event channels (going from writer
to reader) indicates the head position and the other (going from
reader to writer) indicates the tail.
A circular buffer is given memory protection so that it is
writable by the writing domain and read-only to the reading
domain. A control area is used to transfer iorec information
in a simplex direction in an I/O channel. Two of these control
areas are thus required to form an I/O channel and their sizes
are chosen at the time that the I/O channel is established.
Fig. 6 shows a control area with two iorecs in it. The first iorec describes two regions within the Rbuf data area, whereas the second describes a single contiguous region.
Usage: Fig. 7 shows two domains A and B using control areas to send iorecs between them. Each control area, as described above, provides a FIFO queue of iorecs between the two ends of an I/O channel. Equivalently, an I/O channel is composed of two simplex control-area FIFOs forming a duplex management channel. The control areas are used indistinguishably no matter how the I/O channel is being used.
Fig. 6. Rbuf memory arrangement. (The figure shows a data area and a control area carrying iorecs from A to B.)
Fig. 7. Control areas for an I/O channel between A and B.

A typical I/O channel is, in fact, a simplex data channel operating in one of two modes. The purpose of these two modes is to allow for the support of ADUs in various contexts. Note that there is no requirement for either end of the I/O channel to process the data in a FIFO manner; that is merely how the buffering between the two ends is implemented.
In transmit master mode (TMM), the originator of the data chooses the addresses in the Rbuf data area, places the data into the Rbufs, and places the records into the control area. It then updates the head event for that control buffer, indicating to the receiver that there is at least one record present. As soon as the downstream side has read these records from the control buffer, it updates the other (tail) event, freeing the control buffer space for more records. When the downstream side is finished with the data, it places the records into the control area for the queue in the other direction and signals its head event on that control buffer. The originator likewise signals when it has read the returned acknowledgment from the control buffer. The originator is then free to reuse the data indicated in the returning control buffer.
In receive master mode (RMM), the operation of the control areas is indistinguishable from TMM; the difference is that the Rbuf data area is mapped with the permissions reversed, and the data is placed in the allocated areas by the downstream side. It is the receiver of the data which chooses the addresses in the Rbuf data area and passes iorecs indicating where it wishes the data to be placed to the downstream side. The downstream side uses the other control area to indicate when it has filled these areas with data.
The master end, which chooses the addresses, is responsible for managing the data area and keeping track of which parts of it are "free" and which are "busy." This can be done in whatever way is deemed appropriate. For some applications, where FIFO processing occurs at both ends, it may be sufficient to partition the data area into iorecs at the initiation of an I/O channel, performing no subsequent allocation management.
Table I presents a summary of the differences between TMM and RMM for the diagram shown in Fig. 7; without loss of generality, A is the master: it chooses the addresses within the data area.

TABLE I
TMM AND RMM PROPERTIES

                            TMM    RMM
  Chooses the addresses      A      A
  Manages data area          A      A
  Write access to data       A      B
  Read access to data        B      A
Since the event counts for both control areas are available to a user of an I/O channel, it is possible to operate in a nonblocking manner. By reading the event counts associated with the circular buffers, instead of blocking on them, a domain can ensure both that there is an Rbuf ready for collection and also that there will be space to dispose of it in the other buffer. This functions reliably because event counts never lose events. Routines for both blocking and nonblocking access are standard parts of the Rbuf library.
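The following sketch illustrates such nonblocking use of a control area. The types and names are invented; the event-count updates, which in Nemesis would be sends on the event channels, are shown as plain increments.

    #include <stdbool.h>
    #include <stdint.h>

    #define IOREC_MAX_PAIRS 4

    /* Fixed-size iorec slot for the circular control area (sketch;
     * the real layout is the header-plus-pairs record of Fig. 6). */
    struct iorec_slot {
        uint64_t nrecs;                           /* pairs in use */
        uint64_t pad;                             /* e.g. a stamp */
        struct { uint64_t base, len; } recs[IOREC_MAX_PAIRS];
    };

    /* One simplex control area viewed through its two event counts.
     * head is advanced by the writer, tail by the reader; since
     * event counts never lose events, comparing them is safe.    */
    struct ctrl_area {
        struct iorec_slot *slots;
        uint64_t           nslots;
        uint64_t           head;   /* writer's event count */
        uint64_t           tail;   /* reader's event count */
    };

    /* Nonblocking receive: take the next iorec if one is pending. */
    static bool ctrl_try_recv(struct ctrl_area *ca,
                              struct iorec_slot *out)
    {
        if (ca->tail == ca->head)          /* nothing outstanding */
            return false;
        *out = ca->slots[ca->tail % ca->nslots];
        ca->tail++;                  /* in Nemesis: send tail event */
        return true;
    }

    /* Nonblocking send: enqueue unless the circular buffer is full. */
    static bool ctrl_try_send(struct ctrl_area *ca,
                              const struct iorec_slot *in)
    {
        if (ca->head - ca->tail == ca->nslots)    /* buffer full  */
            return false;
        ca->slots[ca->head % ca->nslots] = *in;
        ca->head++;                  /* in Nemesis: send head event */
        return true;
    }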
4) Longer Channels: Sometimes an I/O channel is needed which spans more than two domains. An example may be a file serving application where data arrives from a network device driver, passes to the fileserver process, and then passes to the disk driver.

When such an I/O channel is set up, it is possible to share certain areas of Rbuf data memory which are already allocated to that domain for another I/O channel. A domain may wish to have some private Rbufs for each direction of the connection (i.e., ones which are not accessible to domains in the other direction) for passing privileged information. In the fileserver example, the file server may have Rbufs which are used for inode information and which are not accessible by the network device driver.

The management of the channel may either be at one end or in the middle. In the example of the file server, it is likely to be in TMM for communicating with the disk driver, and RMM for communicating with the network driver. The important point is that the data need not be copied in a longer chain, provided trust holds.
Fig. 8 shows the I/O channels for a fileserver. For simplicity, this only shows the control paths for writes. The iorecs used in the channel between the fileserver and the disk driver will contain references to both the network buffer data area and the private inode data area. Only the network data buffer area is used for receiving packets. The fileserver (operating in RMM) will endeavour to arrange the iorecs so that the disk blocks arriving (probably fragmented across multiple packets) will end up contiguous in the single address space, and hence in a suitable manner for writing to disk.
Fig. 8. A longer Rbuf channel: control path for file server writes. (The figure shows the network device driver, the fileserver, and the disk device driver; the fileserver's channel to the network driver operates in RMM and its channel to the disk driver in TMM.)

TABLE II
COMPARISON OF BUFFERING PROPERTIES

                              Mbufs   IOBuf   Fbufs   Rbufs
  page faults possible         No      No      Yes     No
  alignment                    No      Yes     ??      Yes
  copy to user process         Yes     No      No      No
  copy to clever device        Yes     No      No      No
  copy for multicast           Yes     Yes     Yes?    No
  copy for retransmission      Yes     Yes     No      No
  support for ADU's            No      No      No      Yes
  limit on resource use        Yes⁴    No      No      Yes
  must be cleared              No⁵     Yes     No⁶     No⁶

⁴This limit is actually a result of socket buffering.
⁵This is because of the copy to user process memory. However, some networking code rounds up the sizes of certain buffers without clearing the padding bytes thus included; this can cause an information leak of up to three bytes.
⁶Buffers must be cleared when the memory is first allocated. This allocation is not for every buffer usage in Fbufs, but is still more frequent in Fbufs than in Rbufs.
5) Complex Channels: In some cases the flow of data may not be along a simple I/O channel. This is the case for multicast traffic which is being received by multiple domains on the same machine. For such cases, the Rbuf memory is mapped readable by all the recipients, using TMM I/O channels to each recipient. The device driver places the records in the control areas of all the domains which should receive the packet, and reference counts the Rbuf areas so that the memory is not reused until all of the receivers have indicated via their control areas that they are finished with it.

Apart from the lack of copying, both domains benefit from the buffering memory provided by the other, compared with a scheme using copying.
A problem potentially arises if one of the receivers of such multicast data is slower at processing it than the others and falls behind. Ideally, it should not be able to have an adverse effect on the other receivers. This can be achieved by limiting the amount of memory in use by each I/O channel. When the limit is reached, the iorecs are not placed in that channel, and the reference count used is one less. The buffers are hence selectively dropped from channels where the receiver is unable to keep up. An appropriate margin may be configured based on the fan-out of the connection.
One approximate but very efficient way of implementing this margin is to limit the size of the circular control buffer. Iorecs are then dropped automatically when they cannot be inserted in the buffer in a nonblocking manner. Even if a more accurate implementation of the margin is required, the Rbuf scheme ensures that the cost is only paid for I/O channels where it is required, rather than in general.
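A sketch of this reference-counted delivery follows; the names are invented, and chan_try_send stands for the nonblocking insertion into a recipient's control area described above.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical types standing in for the real ones. */
    struct chan;                       /* one TMM channel per recipient  */
    struct rbuf_region { int refs; };  /* region of the shared data area */

    extern bool chan_try_send(struct chan *c, struct rbuf_region *r);
    extern void rbuf_free(struct rbuf_region *r); /* make reusable */

    /* Deliver one packet to n recipients.  A receiver whose channel
     * is over its margin fails the nonblocking insert and simply
     * does not get the packet; the reference count records only the
     * successful deliveries, so a slow receiver cannot hold the
     * memory indefinitely. */
    static void multicast_deliver(struct chan **chans, size_t n,
                                  struct rbuf_region *r)
    {
        r->refs = 0;
        for (size_t i = 0; i < n; i++)
            if (chan_try_send(chans[i], r))
                r->refs++;
        if (r->refs == 0)
            rbuf_free(r);          /* nobody took it: reuse at once */
    }

    /* Invoked as each receiver acknowledges via its return control
     * area; the region is reused once the last receiver finishes. */
    static void multicast_release(struct rbuf_region *r)
    {
        if (--r->refs == 0)
            rbuf_free(r);
    }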
VII. CONCLUSION
A. Current State
Nemesis is implemented on Alpha AXP, MIPS, and ARM platforms. C libraries have been ported to these platforms to allow programming in a familiar environment; this has required, amongst other things, the integration of pervasives rather than statics within the library code. The schedulers described have been developed; the Jubilee scheduler has so far been implemented only on the ARM platform.⁴
A window system which provides primitives similar to X has been implemented. For experimentation, this has been implemented both as a traditional shared server, to provide server-based rendering, and, in the more natural Nemesis fashion, as a library, to provide client-based rendering. Experiments have shown that the client-based rendering system reduces application crosstalk enormously when best-effort applications are competing with continuous media applications rendering video on the display. This work will be reported in detail in the near future.
The ATM device driver for the DEC ATMWorks 750 TURBOchannel adaptor takes full advantage of the self-selecting feature of the interface to direct AAL5 adaptation units directly into memory at the position desired by the receiving domain; rendering video from the network requires a single copy from main memory into the frame buffer. (The application must check the frames to discover where they should be placed in the frame buffer.) The combination of the ATMWorks interface and the device driver means that contention between outgoing ATM streams occurs only as a result of scheduling the processor and when cells hit the network wire.
B. Future Plans
Nemesis is immature with much to work still to be done.
It represents a fundamentally ditlerent way of structuring an
operating system; indeed, it could be regarded as a family
of operating systems. The core concepts of events, domains,
activations , binding, Ibufs, and minimal kernel do not define an
operating system. Our next task is to create the libraries, device
drivers, and system domains to provide a complete operating
system over the Nemesis substrate.
⁴This is used in an embedded application where the number of processes is small.
As with the window system (and indeed the filing system and the IP protocol stack), this will often be done by providing an initial server-based implementation, porting code from other platforms, and then moving toward a client-based implementation. Work with the filing system and protocol stack has just entered this second stage.

As these components become stable, we expect to develop (and port) applications to take advantage of the facilities, in particular the flexible QoS guarantees, available from the Nemesis architecture. We will also be using Nemesis as a means of instrumenting multimedia applications; we can trace resource usage directly to application domains and thereby obtain an accurate picture of application performance.
C. Conclusions

Nemesis represents an attempt to design an operating system to support multimedia applications which process continuous media. The consideration of QoS provision and application crosstalk led to a design in which applications execute their code directly rather than via shared servers; shared servers are used only for security or concurrency control. Such an organization gives rise to a number of problems of complexity, which are solved by the use of typing, transparency, and modularity in the definition of interfaces, and by the use of closures to provide comprehensible, safe, and extensive sharing of code and data. Application programmers can be protected from this paradigm shift; APIs need not change except when new facilities are required, the porting of the C library being a case in point.

The development of a new operating system is not a small task. The work here has been developed over at least four years with the help of a large number of people. We are indebted to all who worked in the Pegasus project.
REFERENCES

[1] S. J. Mullender, I. M. Leslie, and D. R. McAuley, "Operating-system support for distributed multimedia," in Proc. Summer 1994 USENIX Conf., Boston, MA, June 1994, pp. 209-220.
[2] C. J. Lindblad, D. J. Wetherall, and D. L. Tennenhouse, "The VuSystem: A programming system for visual processing of digital video," in Proc. ACM Multimedia, San Francisco, CA, Oct. 1994.
[3] G. Coulson, A. Campbell, P. Robin, G. Blair, M. Papathomas, and D. Shepherd, "The design of a QoS-controlled ATM-based communication system in Chorus," IEEE J. Select. Areas Commun., vol. 13, no. 4, pp. 686-699, May 1995.
[4] D. R. McAuley, "Protocol design for high speed networks," Ph.D. dissertation, Univ. Cambridge Computer Laboratory, Tech. Rep. 186, Jan. 1990.
[5] D. L. Tennenhouse, "Layered multiplexing considered harmful," in Protocols for High Speed Networks, Rudin and Williamson, Eds. New York: Elsevier, 1989.
[6] J. Mogul, "Efficient use of workstations for passive monitoring of local area networks," in Computer Comm. Rev., ACM SIGCOMM, vol. 20, Sept. 1990.
[7] S. McCanne and V. Jacobson, "The BSD packet filter: A new architecture for user-level packet capture," in USENIX Winter '93 Conf., Jan. 1993, pp. 259-269.
[8] A. Demers, S. Keshav, and S. Shenker, "Analysis and simulation of a fair queuing algorithm," J. Internetworking: Research and Experience, vol. 1, no. 1, 1990.
[9] M. V. Wilkes and R. M. Needham, The Cambridge CAP Computer and Its Operating System. London, U.K.: Elsevier, 1979.
[10] B. N. Bershad, T. E. Anderson, E. D. Lazowska, and H. M. Levy, "Lightweight remote procedure call," ACM Trans. Computer Syst., vol. 8, no. 1, pp. 37-55, Feb. 1990.
[11] C. A. Thekkath, T. D. Nguyen, E. May, and E. D. Lazowska, "Implementing network protocols at user level," Dept. Computer Sci. & Eng., Univ. Washington, Seattle, WA, Tech. Rep. 93-03-01, 1993.
[12] C. W. Mercer, S. Savage, and H. Tokuda, "Processor capacity reserves: Operating system support for multimedia applications," in Proc. IEEE Int. Conf. Multimedia Computing Syst., May 1994.
[13] P. Barham, M. Hayter, D. McAuley, and I. Pratt, "Devices on the desk area network," IEEE J. Select. Areas Commun., vol. 13, no. 4, May 1995.
[14] Y. A. Khalidi and M. N. Nelson, "An implementation of UNIX on an object-oriented operating system," Sun Microsystems Labs. Inc., Tech. Rep. 92-3, Dec. 1992.
[15] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy, "Scheduler activations: Effective kernel support for the user-level management of parallelism," ACM Trans. Computer Syst., vol. 10, no. 1, pp. 53-79, Feb. 1992.
[16] D. Reed and R. Kanodia, "Synchronization with eventcounts and sequencers," MIT Laboratory for Computer Science, Tech. Rep., 1977.
[17] D. Mills, "Internet time synchronization: The network time protocol," Internet Request for Comment Number 1129, Oct. 1989.
[18] M. J. Dixon, "System support for multi-service traffic," Ph.D. dissertation, Univ. Cambridge Computer Laboratory, Tech. Rep. 245, Sept. 1991.
[19] C. L. Liu and J. W. Layland, "Scheduling algorithms for multiprogramming in a hard-real-time environment," J. ACM, vol. 20, no. 1, pp. 46-61, Jan. 1973.
[20] R. J. Black, "Explicit network scheduling," Ph.D. dissertation, Univ. Cambridge Computer Laboratory, Tech. Rep. 361, Dec. 1994.
[21] A. D. Birrell and J. V. Guttag, "Synchronization primitives for a multiprocessor: A formal specification," Digital Equipment Corporation Systems Research Center, Palo Alto, CA, Tech. Rep. 20, 1987.
[22] T. Roscoe, "Linkage in the Nemesis single address space operating system," ACM Operat. Syst. Rev., vol. 28, no. 4, pp. 48-55, Oct. 1994.
[23] B. Stroustrup, The Design and Evolution of C++. Reading, MA: Addison-Wesley, 1994.
[24] D. Otway, "The ANSA binding model," Architecture Projects Management Limited, Poseidon House, Castle Park, Cambridge CB3 0RD, U.K., ANSA Phase III document APM.1314.01, Oct. 1994.
[25] A. Birrell, G. Nelson, S. Owicki, and T. Wobber, "Network objects," in Proc. 14th ACM SIGOPS Symp. Operating Syst. Principles, Operating Syst. Rev., Dec. 1993, pp. 217-230.
[26] The Common Object Request Broker: Architecture and Specification, Object Management Group, Draft 10th Dec. 1991, OMG Document Number 91.12.1, revision 1.1.
[27] B. N. Bershad, T. E. Anderson, E. D. Lazowska, and H. M. Levy, "User-level interprocess communication for shared memory multiprocessors," ACM Trans. Computer Syst., vol. 9, no. 2, pp. 175-198, May 1991.
[28] E. I. Organick, The Multics System: An Examination of Its Structure. Cambridge, MA: MIT Press, 1972.
[29] J. S. Chase, H. M. Levy, M. J. Feeley, and E. D. Lazowska, "Sharing and protection in a single address space operating system," Dept. Comp. Sci. Eng., Univ. Washington, Tech. Rep. 93-04-02, Apr. 1993, revised Jan. 1994.
[30] S. Leffler, M. McKusick, M. Karels, and J. Quarterman, The Design and Implementation of the 4.3BSD UNIX Operating System. Reading, MA: Addison-Wesley, 1989.
[31] P. Druschel and L. Peterson, "Fbufs: A high-bandwidth cross-domain transfer facility," in Proc. 14th ACM Symp. Operating Syst. Principles, Dec. 1993, pp. 189-202.
[32] S. Leffler and M. Karels, "Trailer encapsulations," Internet Request for Comment Number 893, Apr. 1984.
[33] P. Druschel, L. Peterson, and B. Davie, "Experiences with a high-speed network adaptor: A software perspective," in Computer Comm. Rev., ACM SIGCOMM, vol. 24, Sept. 1994, pp. 2-13.
[34] A. Edwards, G. Watson, J. Lumley, D. Banks, C. Calamvokis, and C. Dalton, "User-space protocols deliver high performance to applications on a low-cost Gb/s LAN," in Computer Comm. Rev., ACM SIGCOMM, vol. 24, Sept. 1994, pp. 14-23.
[35] M. Yuhara, C. Maeda, B. Bershad, and J. Moss, "The MACH packet filter: Efficient packet demultiplexing for multiple endpoints and large messages," in USENIX Winter 1994 Conf., Jan. 1994, pp. 153-165.
[36] TURBOchannel Industry Group, TURBOchannel Specifications Version 3.0, Digital Equipment Corporation, 1993.
[37] C. J. Sreenan, "Synchronization services for digital continuous media," Ph.D. dissertation, Univ. Cambridge Computer Laboratory, Tech. Rep. 292, Mar. 1993.
[38] T. Roscoe, "The structure of a multi-service operating system," Ph.D. dissertation, Univ. Cambridge Computer Laboratory, Tech. Rep. 376, Apr. 1995.
Ian M. Leslie (S'76-M'82), for a photograph and biography, see this issue, p. 1213.

Derek McAuley received the B.A. degree in mathematics from the University of Cambridge, Cambridge, U.K., in 1982, and the Ph.D. degree, addressing issues in interconnecting heterogeneous ATM networks, from the University of Cambridge Computer Laboratory, Cambridge, U.K., in 1989.
After a further five years as a Lecturer at the Computer Laboratory, University of Cambridge, he moved to a Chair in the Department of Computing Science, University of Glasgow, in 1995. His research interests include networking, distributed systems, and operating systems. His recent work has concentrated on the support of time-dependent mixed media types in both networks and operating systems. This has included the development of ATM switches and devices leading to the DAN architecture, and the development of operating systems that can exploit these systems components.

Richard Black received the Bachelor's degree in computer science and the Ph.D. degree in computer science from the University of Cambridge, Cambridge, U.K., in 1990 and 1995, respectively.
He has been awarded a Research Fellowship from Churchill College, Cambridge, to continue research at the University of Cambridge Computer Laboratory. His research interests are in the interaction between operating systems and networking. Previous activities have included work on the desk area network and the Fairisle ATM switch.

Robin Fairbairns received the B.A. degree in mathematics and the Diploma in computer science from the University of Cambridge, Cambridge, U.K., in 1967 and 1968, respectively.
He worked on the CAP Project at the University of Cambridge Computer Laboratory, Cambridge, U.K., from 1969 to 1975, and has worked on digital cartography and satellite remote sensing. His current work has been in the Pegasus and Measure projects, and he is working toward a Ph.D. investigating the provision of QoS within operating systems.

Paul Barham received the B.A. degree in computer science from the University of Cambridge, Cambridge, U.K., in 1992, where he is now working toward the Ph.D. degree.
He is currently involved in the Pegasus, Measure, and DCAN projects at the Computer Laboratory, investigating QoS provision in the operating system and particularly the I/O subsystem of multimedia workstations. His research interests include operating systems and workstation architecture, networking, and a distributed parallel Prolog. Recent work includes the PALcode, kernel, and device drivers for Nemesis on the DEC 3000 AXP platforms, a client-rendering window system, and an extent-based file system, both supporting QoS guarantees.

David Evers received the B.A. degree in physics and theoretical physics and the Ph.D. degree in computer science from the University of Cambridge, Cambridge, U.K., in 1988 and 1993, respectively.
He is currently a member of the staff at Nemesys Research Ltd., Cambridge, U.K., where he continues to work on software support, from devices to applications, for distributed multimedia in an ATM environment.

Timothy Roscoe received the B.A. degree in mathematics from the University of Cambridge, Cambridge, U.K., in 1989, and the Ph.D. degree from the University of Cambridge Computer Laboratory, Cambridge, U.K., in 1995.
At the University of Cambridge Computer Laboratory he was a principal architect of the Nemesis operating system and author of the initial Alpha/AXP implementation. He is currently Vice President of Research and Development for Persimmon IT, Inc., Durham, NC. His research interests include operating systems, programming language design, distributed naming and binding, network protocol implementation, the future of the World Wide Web, and the Chapel Hill music scene.

Eoin Hyden (S'81-M'82) received the B.Sc., B.E., and M.Eng.Sc. degrees from the University of Queensland, Australia, and the Ph.D. degree from the University of Cambridge Computer Laboratory, Cambridge, U.K.
Currently, he is a Member of Technical Staff in the Networked Computing Research department at AT&T Bell Laboratories, Murray Hill, NJ. His interests include operating systems, high speed networks, and multimedia systems.