Keith Bostic
Michael J. Karels
John S. Quarterman
The Design and Implementation of the
4.4BSD Operating System
Marshall Kirk McKusick
Keith Bostic
Berkeley Software Design, Inc.
Michael J. Karels
Berkeley Software Design, Inc.
John S. Quarterman
Texas Internet Consulting
San Francisco
New York Toronto
Singapore
Mexico City
This book is in the Addison-Wesley UNIX and Open Systems Series
Series Editors: Marshall Kirk McKusick and John S. Quarterman
Publishing Partner: Peter S. Gordon
Associate Editor: Deborah R. Lafferty
Associate Production Supervisor: Patricia A. Oduor
Marketing Manager: Bob Donegan
Senior Manufacturing Manager: Roy E. Logan
Cover Designer: Barbara Atkinson
Troff Macro Designer: Jaap Akkerhuis
Copy Editor: Lyn Dupre
Cover Art: John Lasseter
UNIX is a registered trademark of X/Open in the United States and other countries. Many of
the designations used by manufacturers and sellers to distinguish their products are claimed
as trademarks. Where those designations appear in this book, and Addison-Wesley was
aware of a trademark claim, the designations have been printed in initial caps or all caps.
The programs and applications presented in this book have been included for their
instructional value. They have been tested with care, but are not guaranteed for any
particular purpose. The publisher offers no warranties or representations, nor does it
accept any liabilities with respect to the programs or applications.
Library of Congress Cataloging-in-Publication Data
The design and implementation of the 4.4BSD operating system /
Marshall Kirk McKusick [et al.].
cm.
Includes bibliographical references and index.
ISBN 0-201-54979-4
(Computer file)
Operating systems (Computers)
McKusick, Marshall Kirk.
QA76.76.O63D4743 1996
005.4'3--dc20
96-2433
Copyright © 1996 by Addison-Wesley Longman, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted, in any form or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the publisher. Printed in
the United States of America. Published simultaneously in Canada.
Text printed on recycled and acid-free paper.
10 11 12 13 14 15 MA 03 02 01
10th Printing, February 2001
This book is dedicated to the BSD community.
Without the contributions of that community's members,
there would be nothing about which to write.
This book is an extensive revision of the first authoritative and full-length
description of the design and implementation of the research versions of the UNIX
system developed at the University of California at Berkeley. Most detail is given
about 4.4BSD, which incorporates the improvements of the previous Berkeley
versions. Although 4.4BSD includes nearly 500 utility programs in addition to the
kernel, this book concentrates almost exclusively on the kernel.
The UNIX System
The UNIX system runs on computers ranging from personal home systems to the
largest supercomputers. It is the operating system of choice for most
multiprocessor, graphics, and vector-processing systems, and is widely used for its
original purpose of timesharing. It is the most common platform for providing
network services (from FTP to WWW) on the Internet. It is the most portable
operating system ever developed. This portability is due partly to its
implementation language, C [Kernighan & Ritchie, 1978] (which is itself one of
the most widely ported languages), and partly to the elegant design of the system.
Many of the system's features are imitated in other systems [O'Dell, 1987].
Since its inception in 1969 [Ritchie & Thompson, 1978], the UNIX system
has developed in a number of divergent and rejoining streams. The original
developers continued to advance the state of the art with their Ninth and Tenth
Edition UNIX inside AT&T Bell Laboratories, and then their Plan 9 successor to
UNIX. Meanwhile, AT&T licensed UNIX System V as a product, before selling it
to Novell. Novell passed the UNIX trademark to X/OPEN and sold the source code
and distribution rights to Santa Cruz Operation (SCO). Both System V and Ninth
Edition UNIX were strongly influenced by the Berkeley Software Distributions
produced by the Computer Systems Research Group (CSRG) of the University of
California at Berkeley.
Berkeley Software Distributions
These Berkeley systems have introduced several useful programs and facilities to
the UNIX community:
• 2BSD (the Berkeley PDP-11 system): the text editor vi
• 3BSD (the first Berkeley VAX system): demand-paged virtual-memory support
• 4.0BSD: performance improvements
• 4.1BSD: job control, autoconfiguration, and long C identifiers
• 4.2BSD and 4.3BSD: reliable signals; a fast filesystem; improved networking,
including a reference implementation of TCP/IP; sophisticated
interprocess-communication (IPC) primitives; and more performance improvements
• 4.4BSD: a new virtual memory system; a stackable and extensible vnode
interface; a network filesystem (NFS); a log-structured filesystem; numerous
filesystem types, including loopback, union, and uid/gid mapping layers; an
ISO 9660 filesystem (e.g., CD-ROM); ISO networking protocols; support for 68K,
SPARC, MIPS, and PC architectures; POSIX support, including termios, sessions,
and most utilities; multiple IP addresses per interface; disk labels; and improved
4.2BSD, 4.3BSD, and 4.4BSD are the bases for the UNIX systems of many vendors,
and are used internally by the development groups of many other vendors. Many
of these developments have also been incorporated by System V, or have been
added by vendors whose products are otherwise based on System V.
The implementation of the TCP/IP networking protocol suite in 4.2BSD and
4.3BSD, and the availability of those systems, explain why the TCP/IP networking
protocol suite is implemented so widely throughout the world. Numerous vendors
have adapted the Berkeley networking implementations, whether their base system
is 4.2BSD, 4.3BSD, 4.4BSD, System V, or even Digital Equipment Corporation's
VMS or Microsoft's Winsock interface in Windows '95 and Windows/NT.
4BSD has also been a strong influence on the POSIX (IEEE Std 1003.1)
operating-system interface standard, and on related standards. Several features
(such as reliable signals, job control, multiple access groups per process, and the
routines for directory operations) have been adapted from 4.3BSD for POSIX.
Material Covered in this Book
This book is about the internal structure of 4.4BSD [Quarterman et al, 1985], and
about the concepts, data structures, and algorithms used in implementing 4.4BSD's
system facilities. Its level of detail is similar to that of Bach's book about UNIX
System V [Bach, 1986]; however, this text focuses on the facilities, data structures,
and algorithms used in the Berkeley variant of the UNIX operating system. The
book covers 4.4BSD from the system-call level down, from the interface to the
kernel to the hardware itself. The kernel includes system facilities, such as
process management, virtual memory, the I/O system, filesystems, the socket IPC
mechanism, and network protocol implementations. Material above the system-call
level (such as libraries, shells, commands, programming languages, and other
user interfaces) is excluded, except for some material related to the terminal
interface and to system startup. Like Organick's book about Multics [Organick,
1975], this book is an in-depth study of a contemporary operating system.
Where particular hardware is relevant, the book refers to the Hewlett-Packard
HP300 (Motorola 68000-based) architecture. Because 4.4BSD was developed on
the HP300, that is the architecture with the most complete support, so it provides a
convenient point of reference.
Readers who will benefit from this book include operating-system
implementors, system programmers, UNIX application developers, administrators,
and curious users. The book can be read as a companion to the source code of the
system, falling as it does between the manual [CSRG, 1994] and the code in detail
of treatment. But this book is specifically neither a UNIX programming manual
nor a user tutorial (for a tutorial, see [Libes & Ressler, 1988]). Familiarity with
the use of some version of the UNIX system (see, for example, [Kernighan & Pike,
1984]), and with the C programming language (see, for example, [Kernighan &
Ritchie, 1988]) would be extremely useful.
Use in Courses on Operating Systems
This book is suitable for use as a reference text to provide background for a
primary textbook in a second-level course on operating systems. It is not intended
for use as an introductory operating-system textbook; the reader should have
already encountered terminology such as memory management, process
scheduling, and I/O systems [Silberschatz & Galvin, 1994]. Familiarity with the
concepts of network protocols [Tanenbaum, 1988; Stallings, 1993; Schwartz,
1987] will be useful for understanding some of the later chapters.
Exercises are provided at the end of each chapter. The exercises are graded
into three categories indicated by zero, one, or two asterisks. The answers to
exercises that carry no asterisks can be found in the text. Exercises with a single
asterisk require a step of reasoning or intuition beyond a concept presented in the
text. Exercises with two asterisks present major design projects or open research
This text discusses both philosophical and design issues, as well as details of the
actual implementation. Often, the discussion starts at the system-call level and
descends into the kernel. Tables and figures are used to clarify data structures and
control flow. Pseudocode similar to the C language is used to display algorithms.
Boldface font identifies program names and filesystem pathnames. Italics font
introduces terms that appear in the glossary and identifies the names of system
calls, variables, routines, and structure names. Routine names (other than system
calls) are further identified by the name followed by a pair of parentheses (e.g.,
malloc() is the name of a routine, whereas argv is the name of a variable).
The book is divided into five parts, organized as follows:
• Part 1, Overview
Three introductory chapters provide the context for the
complete operating system and for the rest of the book. Chapter 1, History and
Goals, sketches the historical development of the system, emphasizing the
system's research orientation. Chapter 2, Design Overview of 4.4BSD, describes
the services offered by the system, and outlines the internal organization of the
kernel. It also discusses the design decisions that were made as the system was
developed. Sections 2.3 through 2.14 in Chapter 2 each give an overview of the
corresponding later chapter. Chapter 3, Kernel Services, explains how system calls
are done, and describes in detail several of the basic services of the kernel.
• Part 2, Processes
The first chapter in this part, Chapter 4, Process
Management, lays the foundation for later chapters by describing the structure
of a process, the algorithms used for scheduling the execution of processes, and
the synchronization mechanisms used by the system to ensure consistent access
to kernel-resident data structures. In Chapter 5, Memory Management, the
virtual-memory-management system is discussed in detail.
• Part 3, I/O System
First, Chapter 6, I/O System Overview, explains the
system interface to I/O and describes the structure of the facilities that support
this interface. Following this introduction are four chapters that give the details
of the main parts of the I/O system. Chapter 7, Local Filesystems, details the
data structures and algorithms that implement filesystems as seen by application
programs. Chapter 8, Local Filestores, describes how local filesystems are
interfaced with local media. Chapter 9, The Network Filesystem, explains the
network filesystem from both the server and client perspectives. Chapter 10,
Terminal Handling, discusses support for character terminals, and provides a
description of a character-oriented device driver.
• Part 4, Interprocess Communication
Chapter 11, Interprocess
Communication, describes the mechanism for providing communication between
related or unrelated processes. Chapters 12 and 13, Network Communication
and Network Protocols, are closely related, as the facilities explained in the
former are implemented by specific protocols, such as the TCP/IP protocol suite,
explained in the latter.
• Part 5, System Operation
Chapter 14, System Startup, discusses system
startup, shutdown, and configuration, and explains system initialization at the
process level, from kernel initialization to user login.
The book is intended to be read in the order that the chapters are presented,
but the parts other than Part 1 are independent of one another and can be read
separately. Chapter 14 should be read after all the others, but knowledgeable
readers may find it useful independently.
At the end of the book are a Glossary with brief definitions of major terms
and an Index. Each chapter contains a Reference section with citations of related
material.
Getting 4.4BSD
Current information about the availability of 4.4BSD source code can be found at
the sites listed below. At press time, the source code for the 4.4BSD-Lite Release
2 system, as well as that for the FreeBSD version of 4.4BSD, which is compiled
and ready to run on PC-compatible hardware, is available from Walnut Creek
CDROM. Contact Walnut Creek for more information at 1-800-786-9907, or use
orders @, or The NetBSD distribution is
compiled and ready to run on most workstation architectures. For more
information, contact the NetBSD Project at majordomo@NetBSD.ORG (send a
message body of "lists"), or http://www.NetBSD.ORG/. The OpenBSD
distribution is compiled and ready to run on a wide variety of workstation
architectures and has been extensively vetted for security and reliability. For more
information, visit the OpenBSD project's Web Site at
A fully supported commercial release, BSD/OS, is available from Berkeley
Software Design, Inc., at 1-800-800-4273, bsdi-info@bsdi.com, or
http://www.bsdi.com/. The 4.4BSD manuals are jointly published by Usenix and
O'Reilly. O'Reilly sells the five volumes individually or in a set (ISBN
1-56592-082-1): 1-800-889-8969, order@, or
For you diehards who actually read to the end of the preface, your reward is
finding out that you can get T-shirts that are a reproduction of the original
artwork drawn by John Lasseter for the cover of this book (yes, he is the John
Lasseter of Walt Disney/Pixar fame who masterminded the production of "Toy
Story"). These shirts were made available to the people who helped with the
creation, reviewing, and editing of the book and to those folks who first reported
errors in the book. A variation on these shirts that is clearly different from the
originals (so as not to diminish the rarity of the ones that people had to work to
get) is now available. For further information on purchasing a shirt, send a
self-addressed envelope (United States residents please include return postage) to
M. K. McKusick
1614 Oxford St.
Berkeley, CA 94709-1608
Alternatively, you can visit the "History of BSD T-shirts" web page at .
We extend special thanks to Mike Hibler (University of Utah) who coauthored
Chapter 5 on memory management, and to Rick Macklem (University of Guelph),
whose NFS papers provided much of the material on NFS for Chapter 9.
We also thank the following people who read and commented on nearly the
entire book: Paul Abrahams (Consultant), Susan LoVerso (Orca Systems), George
Neville-Neil (Wind River Systems), and Steve Stepanek (California State
University, Northridge).
We thank the following people, all of whom read and commented on early
drafts of the book: Eric Allman (Pangaea Reference Systems), Eric Anderson
(University of California at Berkeley), Mark Andrews (Alias Research), Mike
Beede (Secure Computing Corporation), Paul Borman (Berkeley Software
Design), Peter Collinson (Hillside Systems), Ben Cottrell (NetBSD user), Patrick
Cua (De La Salle University, Philippines), John Dyson (The FreeBSD Project),
Sean Eric Fagan (BSD developer), Mike Fester (Medicus Systems Corporation),
David Greenman (The FreeBSD Project), Wayne Hathaway (Auspex Systems),
John Heidemann (University of California at Los Angeles), Jeff Honig (Berkeley
Software Design), Gordon Irlam (Cygnus Support), Alan Langerman (Orca
Systems), Sam Leffler (Silicon Graphics), Casimir Lesiak (NASA/Ames Research
Center), Gavin Lim (De La Salle University, Philippines), Steve Lucco (Carnegie
Mellon University), Jan-Simon Pendry (Sequent, UK), Arnold Robbins (Georgia
Institute of Technology), Peter Salus (UNIX historian), Wayne Sawdon (Carnegie
Mellon University), Margo Seltzer (Harvard University), Keith Sklower
(University of California at Berkeley), Keith Smith (Harvard University), and
Humprey C. Sy (De La Salle University, Philippines).
This book was produced using James Clark's implementations of pic, tbl,
eqn, and groff. The index was generated by awk scripts derived from indexing
programs written by Jon Bentley and Brian Kernighan [Bentley & Kernighan,
1986]. Most of the art was created with xfig. Figure placement and widow
elimination were handled by the groff macros, but orphan elimination and
production of even page bottoms had to be done by hand.
We encourage readers to send us suggested improvements or comments about
typographical or other errors found in the book; please send electronic mail to
Bach, 1986.
M. J. Bach, The Design of the UNIX Operating System, Prentice-Hall,
Englewood Cliffs, NJ, 1986.
Bentley & Kernighan, 1986.
J. Bentley & B. Kernighan, "Tools for Printing Indexes," Computing
Science Technical Report 128, AT&T Bell Laboratories, Murray Hill, NJ,
1986.
CSRG, 1994.
CSRG, in 4.4 Berkeley Software Distribution, O'Reilly & Associates, Inc.,
Sebastopol, CA, 1994.
Kernighan & Pike, 1984.
B. W. Kernighan & R. Pike, The UNIX Programming Environment,
Prentice-Hall, Englewood Cliffs, NJ, 1984.
Kernighan & Ritchie, 1978.
B. W. Kernighan & D. M. Ritchie, The C Programming Language,
Prentice-Hall, Englewood Cliffs, NJ, 1978.
Kernighan & Ritchie, 1988.
B. W. Kernighan & D. M. Ritchie, The C Programming Language, 2nd ed,
Prentice-Hall, Englewood Cliffs, NJ, 1988.
Libes & Ressler, 1988.
D. Libes & S. Ressler, Life with UNIX, Prentice-Hall, Englewood Cliffs, NJ,
1988.
O'Dell, 1987.
M. O'Dell, "UNIX: The World View," Proceedings of the 1987 Winter
USENIX Conference, pp. 35-45, January 1987.
Organick, 1975.
E. I. Organick, The Multics System: An Examination of Its Structure, MIT
Press, Cambridge, MA, 1975.
Quarterman et al, 1985.
J. S. Quarterman, A. Silberschatz, & J. L. Peterson, "4.2BSD and 4.3BSD
as Examples of the UNIX System," ACM Computing Surveys, vol. 17, no. 4,
pp. 379-418, December 1985.
Ritchie & Thompson, 1978.
D. M. Ritchie & K. Thompson, "The UNIX Time-Sharing System," Bell
System Technical Journal, vol. 57, no. 6, Part 2, pp. 1905-1929,
July-August 1978. The original version [Comm. ACM vol. 17, no. 7, pp.
365-375 (July 1974)] described the 6th edition; this citation describes the
7th edition.
Schwartz, 1987.
M. Schwartz, Telecommunication Networks, Series in Electrical and
Computer Engineering, Addison-Wesley, Reading, MA, 1987.
Silberschatz & Galvin, 1994.
A. Silberschatz & P. Galvin, Operating System Concepts, 4th Edition,
Addison-Wesley, Reading, MA, 1994.
Stallings, 1993.
R. Stallings, Data and Computer Communications, 4th Edition, Macmillan,
New York, NY, 1993.
Tanenbaum, 1988.
A. S. Tanenbaum, Computer Networks, 2nd ed, Prentice-Hall, Englewood
Cliffs, NJ, 1988.
About the Authors
Left to right: Mike Karels, Keith Bostic, Kirk McKusick, and John Quarterman
together for the first time at a Usenix Conference in San Diego.
Marshall Kirk McKusick writes books and articles, consults, and teaches classes
on UNIX- and BSD-related subjects.
While at the University of California at
Berkeley, he implemented the 4.2BSD fast file system, and was the Research
Computer Scientist at the Berkeley Computer Systems Research Group (CSRG)
overseeing the development and release of 4.3BSD and 4.4BSD. His particular
areas of interest are the virtual-memory system and the filesystem. One day, he
hopes to see them merged seamlessly. He earned his undergraduate degree in
Electrical Engineering from Cornell University, and did his graduate work at the
University of California at Berkeley, where he received Masters degrees in
Computer Science and Business Administration, and a doctoral degree in Computer
Science. He is a past president of the Usenix Association, and is a member of ACM
and IEEE.
In his spare time, he enjoys swimming, scuba diving, and wine
collecting. The wine is stored in a specially constructed wine cellar (accessible
from the web at or using "telnet
winecellar.McKusick.COM 451") in the basement of the house that he shares with
Eric Allman, his domestic partner of 17-and-some-odd years.
Keith Bostic is a member of the technical staff at Berkeley Software Design, Inc.
He spent 8 years as a member of the CSRG, overseeing the development of over
400 freely redistributable UNIX-compatible utilities, and is the recipient of the
1991 Distinguished Achievement Award from the University of California,
Berkeley, for his work to make 4.4BSD freely redistributable. Concurrently, he
was the principal architect of the 2.10BSD release of the Berkeley Software
Distribution for PDP-11s, and the coauthor of the Berkeley Log Structured
Filesystem and the Berkeley database package (DB). He is also the author of the
widely used vi implementation, nvi. He received his undergraduate degree in
Statistics and his Masters degree in Electrical Engineering from George
Washington University. He is a member of the ACM, the IEEE, and several
POSIX working groups. In his spare time, he enjoys scuba diving in the South
Pacific, mountain biking, and working on a tunnel into Kirk and Eric's specially
constructed wine cellar. He lives in Massachusetts with his wife, Margo Seltzer,
and their cats.
Michael J. Karels is the System Architect and Vice President of Engineering at
Berkeley Software Design, Inc. He spent 8 years as the Principal Programmer of
the CSRG at the University of California, Berkeley as the system architect for
4.3BSD. Karels received his Bachelor's degree in Microbiology from the
University of Notre Dame. While a graduate student in Molecular Biology at the
University of California, he was the principal developer of the 2.9BSD UNIX
release of the Berkeley Software Distribution for the PDP-11. He is a member of
the ACM, the IEEE, and several POSIX working groups. He lives with his wife
Teri Karels in the backwoods of Minnesota.
John S. Quarterman is a partner in Texas Internet Consulting (TIC), which
consults in networks and open systems with particular emphasis on TCP/IP
networks, UNIX systems, and standards. He is the author of The Matrix:
Computer Networks and Conferencing Systems Worldwide (Digital Press, 1990),
and is a coauthor of UNIX, POSIX, and Open Systems: The Open Standards Puzzle
(1993), Practical Internetworking with TCP/IP and UNIX (1993), The Internet
Connection: System Connectivity and Configuration (1994), and The E-Mail
Companion: Communicating Effectively via the Internet and Other Global
Networks (1994), all published by Addison-Wesley. He is editor of Matrix News,
a monthly newsletter about issues that cross network, geographic, and political
boundaries, and of Matrix Maps Quarterly; both are published by Matrix
Information and Directory Services, Inc. (MIDS) of Austin, Texas. He is a partner
in Zilker Internet Park, which provides Internet access from Austin. He and his
wife, Gretchen Quarterman, split their time among his home in Austin, hers in
Buffalo, New York, and various other locations.
Part 1 Overview
Chapter 1
History and Goals
History of the UNIX System
Research UNIX
AT&T UNIX System III and System V
Other Organizations
Berkeley Software Distributions
UNIX in the World
BSD and Other Systems
The Influence of the User Community
Design Goals of 4BSD
4.2BSD Design Goals
4.3BSD Design Goals
4.4BSD Design Goals
Release Engineering
Chapter 2
Design Overview of 4.4BSD
4.4BSD Facilities and the Kernel
The Kernel
Kernel Organization
Kernel Services
Process Management
Signals
Process Groups and Sessions
Memory Management
BSD Memory-Management Design Decisions
Memory Management Inside the Kernel
I/O System
Descriptors and I/O
Descriptor Management
Socket IPC
Scatter/Gather I/O
Multiple Filesystem Support
Network Filesystem
Interprocess Communication
Network Communication
Network Implementation
System Operation
Chapter 3
Kernel Services
Kernel Organization
System Processes
System Entry
Run-Time Organization
Entry to the Kernel
Return from the Kernel
System Calls
Result Handling
Returning from a System Call
Traps and Interrupts
I/O Device Interrupts
Software Interrupts
Clock Interrupts
Statistics and Process Scheduling
User, Group, and Other Identifiers
Host Identifiers
Process Groups and Sessions
Memory-Management Services
Timing Services
Real Time
Adjustment of the Time
External Representation
Interval Time
Resource Services
Process Priorities
Resource Utilization
Resource Limits
Filesystem Quotas
System-Operation Services
Part 2 Processes
Chapter 4
Introduction to Process Management
Process Management
Process State
The Process Structure
The User Structure
Context Switching
Process State
Low-Level Context Switching
Voluntary Context Switching
Process Scheduling
Calculations of Process Priority
Process-Priority Routines
Process Run Queues and Context Switching
Process Creation
Process Termination
Comparison with POSIX Signals
Posting of a Signal
Delivering a Signal
Process Groups and Sessions
Job Control
Process Debugging
Chapter 5
Memory Management
Processes and Memory
Replacement Algorithms
Working-Set Model
Advantages of Virtual Memory
Hardware Requirements for Virtual Memory
Kernel Maps and Submaps
Kernel Address-Space Allocation
Kernel Malloc
Per-Process Resources
Shared Memory
Mmap Model
Shared Mapping
Private Mapping
Collapsing of Shadow Chains
Private Snapshots
4.4BSD Process Virtual-Address Space
Page-Fault Dispatch
Mapping to Objects
Objects to Pages
Overview of the 4.4BSD Virtual-Memory System
Kernel Memory Management
Creation of a New Process
Reserving Kernel Resources
Duplication of the User Address Space
Creation of a New Process Without Copying
Execution of a File
Process Manipulation of Its Address Space
Change of Process Size
File Mapping
Change of Protection
5.9 Termination of a Process
5.10 The Pager Interface
Vnode Pager
Device Pager
Swap Pager
5.11 Paging
5.12 Page Replacement
Paging Parameters
The Pageout Daemon
The Swap-In Process
5.13 Portability
The Role of the pmap Module
Initialization and Startup
Mapping Allocation and Deallocation
Change of Access and Wiring Attributes for Mappings
Management of Page-Usage Information
Initialization of Physical Pages
Management of Internal Data Structures
Part 3 I/O System
Chapter 6
I/O System Overview
I/O Mapping from User to Device
Device Drivers
I/O Queueing
Interrupt Handling
Block Devices
Entry Points for Block-Device Drivers
Sorting of Disk I/O Requests
Disk Labels
Character Devices
Raw Devices and Physical I/O
Character-Oriented Devices
Entry Points for Character-Device Drivers
Descriptor Management and Services
Open File Entries
Management of Descriptors
File-Descriptor Locking
Multiplexing I/O on Descriptors
Implementation of Select
Movement of Data Inside the Kernel
The Virtual-Filesystem Interface
Contents of a Vnode
Vnode Operations
Pathname Translation
Exported Filesystem Services
Filesystem-Independent Services
The Name Cache
Buffer Management
Implementation of Buffer Management
Stackable Filesystems
Simple Filesystem Layers
The Union Mount Filesystem
Other Filesystems
Chapter 7
Local Filesystems
Hierarchical Filesystem Management
Structure of an Inode
Inode Management
Finding of Names in Directories
Pathname Translation
File Locking
Other Filesystem Semantics
Large File Sizes
File Flags
Chapter 8
Local Filestores
Overview of the Filestore
The Berkeley Fast Filesystem
Organization of the Berkeley Fast Filesystem
Optimization of Storage Utilization
Reading and Writing to a File
Filesystem Parameterization
Layout Policies
Allocation Mechanisms
Block Clustering
Synchronous Operations
The Log-Structured Filesystem
Organization of the Log-Structured Filesystem
Index File
Reading of the Log
Writing to the Log
Block Accounting
The Buffer Cache
Directory Operations
Creation of a File
Reading and Writing to a File
Filesystem Cleaning
Filesystem Parameterization
Filesystem-Crash Recovery
The Memory-Based Filesystem
Chapter 9
Organization of the Memory-Based Filesystem
Filesystem Performance
Future Work
The Network Filesystem
History and Overview
NFS Structure and Operation
The NFS Protocol
The 4.4BSD NFS Implementation
Client-Server Interactions
RPC Transport Issues
Security Issues
Techniques for Improving Performance
Crash Recovery
Chapter 10 Terminal Handling
Terminal-Processing Modes
Line Disciplines
User Interface
The tty Structure
Process Groups, Sessions, and Terminal Control
RS-232 and Modem Control
Terminal Operations
Output Line Discipline
Output Top Half
Output Bottom Half
Input Bottom Half
Input Top Half
The stop Routine
The ioctl Routine
Modem Transitions
Closing of Terminal Devices
Other Line Disciplines
Serial Line IP Discipline
Graphics Tablet Discipline
Part 4 Interprocess Communication
Chapter 11
Interprocess Communication
Interprocess-Communication Model
Use of Sockets
Implementation Structure and Overview
Memory Management
Data Structures
Storage-Management Algorithms
Mbuf Utility Routines
Communication Domains
Socket Addresses
Connection Setup
Data Transfer
Transmitting Data
Receiving Data
Passing Access Rights
Passing Access Rights in the Local Domain
Socket Shutdown
Chapter 12 Network Communication
Internal Structure
Data Flow
Communication Protocols
Network Interfaces
Socket-to-Protocol Interface
Protocol-Protocol Interface
Protocol User-Request Routine
Internal Requests
Protocol Control-Output Routine
pr_input
Interface between Protocol and Network Interface
Packet Transmission
Packet Reception
Kernel Routing Tables
Routing Lookup
Routing Redirects
Routing-Table Interface
User-Level Routing Policies
User-Level Routing Interface: Routing Socket
Buffering and Congestion Control
Protocol Buffering Policies
Queue Limiting
Raw Sockets
Control Blocks
Input Processing
Output Processing
Additional Network-Subsystem Topics
Out-of-Band Data
Address Resolution Protocol
Chapter 13 Network Protocols
Internet Network Protocols
Internet Addresses
Broadcast Addresses
Internet Multicast
Internet Ports and Associations
Protocol Control Blocks
User Datagram Protocol (UDP)
Control Operations
Internet Protocol (IP)
45 1
Transmission Control Protocol (TCP)
TCP Connection States
Sequence Variables
TCP Algorithms
Estimation of Round-Trip Time
Connection Establishment
Connection Shutdown
TCP Input Processing
TCP Output Processing
Sending of Data
Avoidance of the Silly-Window Syndrome
Avoidance of Small Packets
Delayed Acknowledgments and Window Updates
47 1
Retransmit State
S low Start
Source-Quench Processing
Buffer and Window Sizing
Avoidance of Congestion with Slow Start
Fast Retransmission
Internet Control Message Protocol (ICMP)
OSI Implementation Issues
1 3.8
1 3 .9
1 3 . 1 0 Summary of Networking and Interprocess Communication
Creation of a Communication Channel
48 1
Sending and Receiving of Data
Termi nation of Data Transmission or Reception
Part 5 System Operation
Chapter 14 System Startup
14.1
14.2
The boot Program
14.3
Kernel Initialization
Assembly-Language Startup
Machine-Dependent Initialization
Message Buffer
System Data Structures
14.4
Device Probing
Device Attachment
New Autoconfiguration Data Structures
New Autoconfiguration Functions
Device Naming
14.5
14.6
Machine-Independent Initialization
User-Level Initialization
14.7
System-Startup Topics
Kernel Configuration
System Shutdown and Autoreboot
System Debugging
Passage of Information To and From the Kernel
Chapter 1
History and Goals
1.1 History of the UNIX System
The UNIX system has been in wide use for over 20 years, and has helped to define
many areas of computing. Although numerous organizations have contributed
(and still contribute) to the development of the UNIX system, this book will pri­
marily concentrate on the BSD thread of development:
• Bell Laboratories, which invented UNIX
• The Computer Systems Research Group (CSRG) at the University of California
at Berkeley, which gave UNIX virtual memory and the reference implementation
• Berkeley Software Design, Incorporated (BSDI), The FreeBSD Project, and The
NetBSD Project, which continue the work started by the CSRG
The first version of the UNIX system was developed at Bell Laboratories in 1969
by Ken Thompson as a private research project to use an otherwise idle PDP-7.
Thompson was joined shortly thereafter by Dennis Ritchie, who not only contributed
to the design and implementation of the system, but also invented the C
programming language. The system was completely rewritten into C, leaving
almost no assembly language. The original elegant design of the system [Ritchie,
1978] and developments of the past 15 years [Ritchie, 1984a; Compton, 1985]
have made the UNIX system an important and powerful operating system [Ritchie,
1987].
Ritchie, Thompson, and other early UNIX developers at Bell Laboratories had
worked previously on the Multics project [Peirce, 1985; Organick, 1975], which
had a strong influence on the newer operating system. Even the name UNIX is
merely a pun on Multics; in areas where Multics attempted to do many tasks,
UNIX tried to do one task well. The basic organization of the UNIX filesystem, the
idea of using a user process for the command interpreter, the general organization
of the filesystem interface, and many other system characteristics, come directly
from Multics.
Ideas from various other operating systems, such as the Massachusetts Institute
of Technology's (MIT's) CTSS, also have been incorporated. The fork operation
to create new processes comes from Berkeley's GENIE (SDS-940, later
XDS-940) operating system. Allowing a user to create processes inexpensively led
to using one process per command, rather than to commands being run as procedure
calls, as is done in Multics.
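The one-process-per-command model can be sketched with the POSIX descendants of these calls. The following is a minimal illustration, not code from any UNIX release; the function name run_command is hypothetical, and the use of execvp and waitpid is an assumption about how a present-day command interpreter would be written.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run one command in its own process and return its exit status,
 * or -1 if the process could not be created or did not exit normally.
 * (run_command is a hypothetical helper, used only for illustration.)
 */
int
run_command(char *const argv[])
{
	pid_t pid;
	int status;

	assert(argv != NULL && argv[0] != NULL);

	pid = fork();			/* create the new process */
	if (pid == -1)
		return (-1);
	if (pid == 0) {
		/* Child: replace this process image with the command. */
		execvp(argv[0], argv);
		_exit(127);		/* reached only if the exec failed */
	}
	/* Parent: wait for the command to finish, as a shell would. */
	if (waitpid(pid, &status, 0) == -1)
		return (-1);
	return (WIFEXITED(status) ? WEXITSTATUS(status) : -1);
}
```

Called with the argument vector {"sh", "-c", "exit 7", NULL}, the function returns the child's exit status; the command runs in a separate address space, so a failure in the command cannot corrupt the caller.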
There are at least three major streams of development of the UNIX system.
Figure 1.1 sketches their early evolution; Figure 1.2 (shown on page 6) sketches
their more recent developments, especially for those branches leading to 4.4BSD
and to System V [Chambers & Quarterman, 1983; Uniejewski, 1985]. The dates
given are approximate, and we have made no attempt to show all influences.
Some of the systems named in the figure are not mentioned in the text, but are
included to show more clearly the relations among the ones that we shall examine.
Research UNIX
The first major editions of UNIX were the Research systems from Bell Laboratories.
In addition to the earliest versions of the system, these systems include the
UNIX Time-Sharing System, Sixth Edition, commonly known as V6, which, in
1976, was the first version widely available outside of Bell Laboratories. Systems
are identified by the edition numbers of the UNIX Programmer's Manual that were
current when the distributions were made.
The UNIX system was distinguished from other operating systems in three
important ways:
1. The UNIX system was written in a high-level language.
2. The UNIX system was distributed in source form.
3. The UNIX system provided powerful primitives normally found in only those
operating systems that ran on much more expensive hardware.
Most of the system source code was written in C, rather than in assembly language.
The prevailing belief at the time was that an operating system had to be
written in assembly language to provide reasonable efficiency and to get access to
the hardware. The C language itself was at a sufficiently high level to allow it to
be compiled easily for a wide range of computer hardware, without its being so
complex or restrictive that systems programmers had to revert to assembly language
to get reasonable efficiency or functionality. Access to the hardware was
provided through assembly-language stubs for the 3 percent of the operating-system
functions, such as context switching, that needed them. Although the success
of UNIX does not stem solely from its being written in a high-level
language, the use of C was a critical first step [Ritchie et al, 1978; Kernighan &
Ritchie, 1978; Kernighan & Ritchie, 1988]. Ritchie's C language is descended
[Rosler, 1984] from Thompson's B language, which was itself descended from
BCPL [Richards & Whitby-Strevens, 1980]. C continues to evolve [Tuthill, 1985;
X3J11, 1988], and there is a variant, C++, that more readily permits data
abstraction [Stroustrup, 1984; USENIX, 1987].
Figure 1.1 The UNIX system family tree, 1969-1985.
Figure 1.2 The UNIX system family tree, 1986-1996.
The second important distinction of UNIX was its early release from Bell Laboratories
to other research environments in source form. By providing source, the
system's founders ensured that other organizations would be able not only to use
the system, but also to tinker with its inner workings. The ease with which new
ideas could be adopted into the system always has been key to the changes that
have been made to it. Whenever a new system that tried to upstage UNIX came
along, somebody would dissect the newcomer and clone its central ideas into
UNIX. The unique ability to use a small, comprehensible system, written in a
high-level language, in an environment swimming in new ideas led to a UNIX sys­
tem that evolved far beyond its humble beginnings.
The third important distinction of UNIX was that it provided individual users
with the ability to run multiple processes concurrently and to connect these processes
into pipelines of commands. At the time, only operating systems running
on large and expensive machines had the ability to run multiple processes, and the
number of concurrent processes usually was controlled tightly by a system administrator.
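To make the idea of a pipeline concrete, the sketch below connects two concurrently running processes so that the output of the first becomes the input of the second, and captures what the second one prints. It is an illustration under stated assumptions: the function name run_pipeline is hypothetical, and the POSIX calls pipe, fork, dup2, and execvp stand in for whatever a particular shell actually does.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run "left | right" as two concurrent processes and copy up to
 * outsz - 1 bytes of the pipeline's output into out (NUL-terminated).
 * Returns the number of bytes captured, or -1 on failure.
 * (run_pipeline is a hypothetical helper, used only for illustration.)
 */
ssize_t
run_pipeline(char *const left[], char *const right[], char *out, size_t outsz)
{
	int link[2];	/* left's standard output -> right's standard input */
	int result[2];	/* right's standard output -> the caller */
	pid_t lpid, rpid;
	ssize_t n, total = 0;

	assert(left != NULL && right != NULL && out != NULL);
	if (outsz == 0 || pipe(link) == -1)
		return (-1);
	if (pipe(result) == -1) {
		close(link[0]);
		close(link[1]);
		return (-1);
	}

	lpid = fork();
	if (lpid == 0) {		/* first command: writes into link */
		dup2(link[1], STDOUT_FILENO);
		close(link[0]); close(link[1]);
		close(result[0]); close(result[1]);
		execvp(left[0], left);
		_exit(127);
	}
	rpid = fork();
	if (rpid == 0) {		/* second command: reads from link */
		dup2(link[0], STDIN_FILENO);
		dup2(result[1], STDOUT_FILENO);
		close(link[0]); close(link[1]);
		close(result[0]); close(result[1]);
		execvp(right[0], right);
		_exit(127);
	}

	/* Parent: drop the unused ends so that end-of-file can be seen. */
	close(link[0]); close(link[1]); close(result[1]);
	while ((size_t)total < outsz - 1) {
		n = read(result[0], out + total, outsz - 1 - (size_t)total);
		if (n <= 0)
			break;
		total += n;
	}
	out[total] = '\0';
	close(result[0]);
	if (lpid > 0)
		waitpid(lpid, NULL, 0);
	if (rpid > 0)
		waitpid(rpid, NULL, 0);
	return ((lpid < 0 || rpid < 0) ? -1 : total);
}
```

With left set to {"echo", "hello", NULL} and right set to {"tr", "a-z", "A-Z", NULL}, the captured output is the uppercased line. Neither command needs to know that it is part of a pipeline; each simply reads its standard input and writes its standard output.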
Most early UNIX systems ran on the PDP-11, which was inexpensive and
powerful for its time. Nonetheless, there was at least one early port of Sixth Edition
UNIX to a machine with a different architecture, the Interdata 7/32 [Miller,
1978]. The PDP-11 also had an inconveniently small address space. The introduction
of machines with 32-bit address spaces, especially the VAX-11/780, provided
an opportunity for UNIX to expand its services to include virtual memory and networking.
Earlier experiments by the Research group in providing UNIX-like facilities
on different hardware had led to the conclusion that it was as easy to move
the entire operating system as it was to duplicate UNIX's services under another
operating system. The first UNIX system with portability as a specific goal was
UNIX Time-Sharing System, Seventh Edition (V7), which ran on the PDP-11
and the Interdata 8/32, and had a VAX variety called UNIX/32V Time-Sharing
System, Version 1.0 (32V). The Research group at Bell Laboratories has also
developed UNIX Time-Sharing System, Eighth Edition (V8), UNIX Time-Sharing
System, Ninth Edition (V9), and UNIX Time-Sharing System, Tenth Edition
(V10). Their 1996 system is Plan 9.
AT&T UNIX System III and System V
After the distribution of Seventh Edition in 1978, the Research group turned over
external distributions to the UNIX Support Group (USG). USG had previously distributed
internally such systems as the UNIX Programmer's Work Bench (PWB),
and had sometimes distributed them externally as well [Mohr, 1985].
USG's first external distribution after Seventh Edition was UNIX System III
(System III), in 1982, which incorporated features of Seventh Edition, of 32V,
and also of several UNIX systems developed by groups other than the Research
group. Features of UNIX/RT (a real-time UNIX system) were included, as were
many features from PWB. USG released UNIX System V (System V) in 1983;
that system is largely derived from System III. The court-ordered divestiture of
the Bell Operating Companies from AT&T permitted AT&T to market System V
aggressively [Wilson, 1985; Bach, 1986].
USG metamorphosed into the UNIX System Development Laboratory (USDL),
which released UNIX System V, Release 2 in 1984. System V, Release 2, Version 4
introduced paging [Miller, 1984; Jung, 1985], including copy-on-write and
shared memory, to System V. The System V implementation was not based on the
Berkeley paging system. USDL was succeeded by AT&T Information Systems
(ATTIS), which distributed UNIX System V, Release 3 in 1987. That system
included STREAMS, an IPC mechanism adopted from V8 [Presotto & Ritchie,
1985]. ATTIS was succeeded by UNIX System Laboratories (USL), which was
sold to Novell in 1993. Novell passed the UNIX trademark to the X/OPEN consortium,
giving the latter sole rights to set up certification standards for using the
UNIX name on products. Two years later, Novell sold UNIX to The Santa Cruz
Operation (SCO).
Other Organizations
The ease with which the UNIX system can be modified has led to development
work at numerous organizations, including the Rand Corporation, which is
responsible for the Rand ports mentioned in Chapter 11; Bolt Beranek and Newman
(BBN), who produced the direct ancestor of the 4.2BSD networking implementation
discussed in Chapter 13; the University of Illinois, which did earlier
networking work; Harvard; Purdue; and Digital Equipment Corporation (DEC).
Probably the most widespread version of the UNIX operating system, according
to the number of machines on which it runs, is XENIX by Microsoft Corporation
and The Santa Cruz Operation. XENIX was originally based on Seventh
Edition, but later on System V. More recently, SCO purchased UNIX from Novell
and announced plans to merge the two systems.
Systems prominently not based on UNIX include IBM's OS/2 and Microsoft's
Windows 95 and Windows/NT. All these systems have been touted as UNIX
killers, but none have done the deed.
Berkeley Software Distributions
The most influential of the non-Bell Laboratories and non-AT&T UNIX development
groups was the University of California at Berkeley [McKusick, 1985].
Software from Berkeley is released in Berkeley Software Distributions
(BSD), for example, as 4.3BSD. The first Berkeley VAX UNIX work was the
addition to 32V of virtual memory, demand paging, and page replacement in 1979
by William Joy and Ozalp Babaoglu, to produce 3BSD [Babaoglu & Joy, 1981].
The reason for the large virtual-memory space of 3BSD was the development of
what at the time were large programs, such as Berkeley's Franz LISP. This
memory-management work convinced the Defense Advanced Research Projects
Agency (DARPA) to fund the Berkeley team for the later development of a standard
system (4BSD) for DARPA's contractors to use.
A goal of the 4BSD project was to provide support for the DARPA Internet
networking protocols, TCP/IP [Cerf & Cain, 1983]. The networking implementation
was general enough to communicate among diverse network facilities, ranging
from local networks, such as Ethernets and token rings, to long-haul networks,
such as DARPA's ARPANET.
We refer to all the Berkeley VAX UNIX systems following 3BSD as 4BSD,
although there were really several releases: 4.0BSD, 4.1BSD, 4.2BSD, 4.3BSD,
4.3BSD Tahoe, and 4.3BSD Reno. 4BSD was the UNIX operating system of choice
for VAXes from the time that the VAX first became available in 1977 until the
release of System V in 1983. Most organizations would purchase a 32V license,
but would order 4BSD from Berkeley. Many installations inside the Bell System
ran 4.1BSD (and replaced it with 4.3BSD when the latter became available). A
new virtual-memory system was released with 4.4BSD. The VAX was reaching
the end of its useful lifetime, so 4.4BSD was not ported to that machine. Instead,
4.4BSD ran on the newer 68000, SPARC, MIPS, and Intel PC architectures.
The 4BSD work for DARPA was guided by a steering committee that included
many notable people from both commercial and academic institutions. The culmination
of the original Berkeley DARPA UNIX project was the release of 4.2BSD
in 1983; further research at Berkeley produced 4.3BSD in mid-1986. The next
releases included the 4.3BSD Tahoe release of June 1988 and the 4.3BSD Reno
release of June 1990. These releases were primarily ports to the Computer Consoles
Incorporated hardware platform. Interleaved with these releases were two
unencumbered networking releases: the 4.3BSD Net1 release of March 1989 and
the 4.3BSD Net2 release of June 1991. These releases extracted nonproprietary
code from 4.3BSD; they could be redistributed freely in source and binary form to
companies and individuals that were not covered by a UNIX source license.
The final CSRG release was to have been two versions of 4.4BSD, to be released
in June 1993. One was to have been a traditional full source and binary distribution,
called 4.4BSD-Encumbered, that required the recipient to have a UNIX
source license. The other was to have been a subset of the source, called
4.4BSD-Lite, that contained no licensed code and did not require the recipient to have a
UNIX source license. Following these distributions, the CSRG would be dissolved.
The 4.4BSD-Encumbered was released as scheduled, but legal action by
USL prevented the distribution of 4.4BSD-Lite. The legal action was resolved
about 1 year later, and 4.4BSD-Lite was released in April 1994. The last of the
money in the CSRG coffers was used to produce a bug-fixed version 4.4BSD-Lite,
release 2, that was distributed in June 1995. This release was the true final
distribution from the CSRG.
Nonetheless, 4BSD still lives on in all modern implementations of UNIX, and
in many other operating systems.
UNIX in the World
Dozens of computer manufacturers, including almost all the ones usually considered
major by market share, have introduced computers that run the UNIX system or
close derivatives, and numerous other companies sell related peripherals, software
packages, support, training, and documentation. The hardware packages involved
range from micros through minis, multis, and mainframes to supercomputers. Most
of these manufacturers use ports of System V, 4.2BSD, 4.3BSD, 4.4BSD, or mixtures.
We expect that, by now, there are probably no more machines running software
based on System III, 4.1BSD, or Seventh Edition, although there may well still
be PDP-11s running 2BSD and other UNIX variants. If there are any Sixth Edition
systems still in regular operation, we would be amused to hear about them (our contact
information is given at the end of the Preface).
The UNIX system is also a fertile field for academic endeavor. Thompson and
Ritchie were given the Association for Computing Machinery Turing award for
the design of the system [Ritchie, 1984b]. The UNIX system and related, specially
designed teaching systems, such as Tunis [Ewens et al, 1985; Holt, 1983], XINU
[Comer, 1984], and MINIX [Tanenbaum, 1987], are widely used in courses on
operating systems. Linus Torvalds reimplemented the UNIX interface in his freely
redistributable LINUX operating system. The UNIX system is ubiquitous in universities
and research facilities throughout the world, and is ever more widely used
in industry and commerce.
Even with the demise of the CSRG, the 4.4BSD system continues to flourish.
In the free software world, the FreeBSD and NetBSD groups continue to develop
and distribute systems based on 4.4BSD. The FreeBSD project concentrates on
developing distributions primarily for the personal-computer (PC) platform. The
NetBSD project concentrates on providing ports of 4.4BSD to as many platforms
as possible. Both groups based their first releases on the Net2 release, but
switched over to the 4.4BSD-Lite release when the latter became available.
The commercial variant most closely related to 4.4BSD is BSD/OS, produced
by Berkeley Software Design, Inc. (BSDI). Early BSDI software releases were
based on the Net2 release; the current BSDI release is based on 4.4BSD-Lite.
1.2 BSD and Other Systems
The CSRG incorporated features not only from UNIX systems, but also from other
operating systems. Many of the features of the 4BSD terminal drivers are from
TENEX/TOPS-20. Job control (in concept, not in implementation) is derived from
that of TOPS-20 and from that of the MIT Incompatible Timesharing System (ITS).
The virtual-memory interface first proposed for 4.2BSD, and since implemented
by the CSRG and by several commercial vendors, was based on the file-mapping
and page-level interfaces that first appeared in TENEX/TOPS-20. The current
4.4BSD virtual-memory system (see Chapter 5) was adapted from MACH, which
was itself an offshoot of 4.3BSD. Multics has often been a reference point in the
design of new facilities.
The quest for efficiency has been a major factor in much of the CSRG's work.
Some efficiency improvements have been made because of comparisons with the
proprietary operating system for the VAX, VMS [Kashtan, 1980; Joy, 1980].
Other UNIX variants have adopted many 4BSD features. AT&T UNIX System
V [AT&T, 1987], the IEEE POSIX.1 standard [P1003.1, 1988], and the related
National Bureau of Standards (NBS) Federal Information Processing Standard
(FIPS) have adopted
• Job control (Chapter 2)
• Reliable signals (Chapter 4)
• Multiple file-access permission groups (Chapter 6)
• Filesystem interfaces (Chapter 7)
The X/OPEN Group, originally comprising solely European vendors, but now
including most U.S. UNIX vendors, produced the X/OPEN Portability Guide
[X/OPEN, 1987] and, more recently, the Spec 1170 Guide. These documents
specify both the kernel interface and many of the utility programs available to
UNIX system users. When Novell purchased UNIX from AT&T in 1993, it transferred
exclusive ownership of the UNIX name to X/OPEN. Thus, all systems that
want to brand themselves as UNIX must meet the X/OPEN interface specifications.
The X/OPEN guides have adopted many of the POSIX facilities. The POSIX.1 standard
is also an ISO International Standard, named SC22 WG15. Thus, the POSIX
facilities have been accepted in most UNIX-like systems worldwide.
The 4BSD socket interprocess-communication mechanism (see Chapter 11)
was designed for portability, and was immediately ported to AT&T System III,
although it was never distributed with that system. The 4BSD implementation of
the TCP/IP networking protocol suite (see Chapter 13) is widely used as the basis
for further implementations on systems ranging from AT&T 3B machines running
System V to VMS to IBM PCs.
The CSRG cooperated closely with vendors whose systems are based on
4.2BSD and 4.3BSD. This simultaneous development contributed to the ease of
further ports of 4.3BSD, and to ongoing development of the system.
The Influence of the User Community
Much of the Berkeley development work was done in response to the user community.
Ideas and expectations came not only from DARPA, the principal direct-funding
organization, but also from users of the system at companies and universities.
The Berkeley researchers accepted not only ideas from the user community,
but also actual software. Contributions to 4BSD came from universities and other
organizations in Australia, Canada, Europe, and the United States. These contributions
included major features, such as autoconfiguration and disk quotas. A few
ideas, such as the fcntl system call, were taken from System V, although licensing
and pricing considerations prevented the use of any actual code from System III or
System V in 4BSD. In addition to contributions that were included in the distributions
proper, the CSRG also distributed a set of user-contributed software.
An example of a community-developed facility is the public-domain
time-zone-handling package that was adopted with the 4.3BSD Tahoe release. It was
designed and implemented by an international group, including Arthur Olson,
Robert Elz, and Guy Harris, partly because of discussions in the USENET newsgroup
comp.std.unix. This package takes time-zone-conversion rules completely
out of the C library, putting them in files that require no system-code changes to
change time-zone rules; this change is especially useful with binary-only distributions
of UNIX. The method also allows individual processes to choose rules,
rather than keeping one ruleset specification systemwide. The distribution
includes a large database of rules used in many areas throughout the world, from
China to Australia to Europe. Distributions of the 4.4BSD system are thus simplified
because it is not necessary to have the software set up differently for different
destinations, as long as the whole database is included. The adoption of the
time-zone package into BSD brought the technology to the attention of commercial
vendors, such as Sun Microsystems, causing them to incorporate it into their systems.
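The per-process choice of rules can be illustrated with a short sketch. It assumes the interface found on present-day systems descended from this package, in which a process names its rules through the TZ environment variable and the C library applies them; the text above does not name that mechanism, and the function hour_in_zone is hypothetical. The POSIX-style strings used in the example (such as "EST5") describe a fixed offset directly, so the sketch does not depend on the rule database being installed; a name such as "Australia/Melbourne" would instead select a file from that database.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/*
 * Return the local hour (0-23) of time t under the time-zone rules
 * named by tz, or -1 on failure.  Only this process is affected;
 * no systemwide setting is changed.
 * (hour_in_zone is a hypothetical helper, used only for illustration.)
 */
int
hour_in_zone(time_t t, const char *tz)
{
	struct tm tm;

	assert(tz != NULL);
	if (setenv("TZ", tz, 1) == -1)	/* choose rules for this process */
		return (-1);
	tzset();			/* make the C library reread TZ */
	if (localtime_r(&t, &tm) == NULL)
		return (-1);
	return (tm.tm_hour);
}
```

For the time value 0 (midnight UTC on January 1, 1970), the same process obtains hour 0 under "UTC0", hour 19 under "EST5", and hour 9 under "JST-9", without any change to the library or the kernel.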
Berkeley solicited electronic mail about bugs and the proposed fixes.
UNIX software house MT XINU distributed a bug list compiled from such submissions.
Many of the bug fixes were incorporated in later distributions. There is
constant discussion of UNIX in general (including 4.4BSD) in the USENET
comp.unix newsgroups, which are distributed on the Internet; both the Internet
and USENET are international in scope. There was another USENET newsgroup
dedicated to 4BSD bugs: comp.bugs.4bsd. Few ideas were accepted by Berkeley
directly from these newsgroups' associated mailing lists because of the difficulty
of sifting through the voluminous submissions. Later, a moderated newsgroup
dedicated to the CSRG-sanctioned fixes to such bugs, called
comp.bugs.4bsd.ucb-fixes, was created. Discussions in these newsgroups sometimes
led to new facilities being written that were later incorporated into the system.
1.3 Design Goals of 4BSD
4BSD is a research system developed for and partly by a research community,
and, more recently, a commercial community. The developers considered many
design issues as they wrote the system. There were nontraditional considerations
and inputs into the design, which nevertheless yielded results with commercial
importance.
The early systems were technology driven. They took advantage of current
hardware that was unavailable in other UNIX systems. This new technology
included
• Virtual-memory support
• Device drivers for third-party (non-DEC) peripherals
• Terminal-independent support libraries for screen-based applications; numerous
applications were developed that used these libraries, including the screen-based
editor vi
4BSD's support of numerous popular third-party peripherals, compared to the
AT&T distribution's meager offerings in 32V, was an important factor in 4BSD's
popularity. Until other vendors began providing their own support of 4.2BSD-based
systems, there was no alternative for universities that had to minimize hardware
costs.
Terminal-independent screen support, although it may now seem rather
pedestrian, was at the time important to the Berkeley software's popularity.
4.2BSD Design Goals
DARPA wanted Berkeley to develop 4.2BSD as a standard research operating system
for the VAX. Many new facilities were designed for inclusion in 4.2BSD.
These facilities included a completely revised virtual-memory system to support
processes with large sparse address spaces, a much higher-speed filesystem,
interprocess-communication facilities, and networking support. The high-speed
filesystem and revised virtual-memory system were needed by researchers doing
computer-aided design and manufacturing (CAD/CAM), image processing, and
artificial intelligence (AI). The interprocess-communication facilities were needed
by sites doing research in distributed systems. The motivation for providing networking
support was primarily DARPA's interest in connecting their researchers
through the 56-Kbit-per-second ARPA Internet (although Berkeley was also interested
in getting good performance over higher-speed local-area networks).
No attempt was made to provide a true distributed operating system [Popek,
1981]. Instead, the traditional ARPANET goal of resource sharing was used.
There were three reasons that a resource-sharing design was chosen:
1. The systems were widely distributed and demanded administrative autonomy.
At the time, a true distributed operating system required a central administrative
authority.
2. The known algorithms for tightly coupled systems did not scale well.
3. Berkeley's charter was to incorporate current, proven software technology,
rather than to develop new, unproven technology.
Therefore, easy means were provided for remote login (rlogin, telnet), file transfer
(rcp, ftp), and remote command execution (rsh), but all host machines retained
separate identities that were not hidden from the users.
Because of time constraints, the system that was released as 4.2BSD did not
include all the facilities that were originally intended to be included. In particular,
the revised virtual-memory system was not part of the 4.2BSD release. The CSRG
did, however, continue its ongoing work to track fast-developing hardware
technology in several areas. The networking system supported a wide range of
hardware devices, including multiple interfaces to 10-Mbit-per-second Ethernet,
to token ring networks, and to NSC's Hyperchannel. The kernel sources were
modularized and rearranged to ease portability to new architectures, including to
microprocessors and to larger machines.
4.3BSD Design Goals
Problems with 4.2BSD were among the reasons for the development of 4.3BSD.
Because 4.2BSD included many new facilities, it suffered a loss of performance
compared to 4.1BSD, partly because of the introduction of symbolic links. Some
pernicious bugs had been introduced, particularly in the TCP protocol implementation.
Some facilities had not been included due to lack of time. Others, such as
TCP/IP subnet and routing support, had not been specified soon enough by outside
parties for them to be incorporated in the 4.2BSD release.
Commercial systems usually maintain backward compatibility for many
releases, so as not to make existing applications obsolete. Maintaining compatibility
is increasingly difficult, however, so most research systems maintain little or
no backward compatibility. As a compromise for other researchers, the BSD
releases were usually backward compatible for one release, but had the deprecated
facilities clearly marked. This approach allowed for an orderly transition to the
new interfaces without constraining the system from evolving smoothly. In particular,
backward compatibility of 4.3BSD with 4.2BSD was considered highly desirable
for application portability.
The C language interface to 4.3BSD differs from that of 4.2BSD in only a few
commands to the terminal interface and in the use of one argument to one IPC
system call (select; see Section 6.4). A flag was added in 4.3BSD to the system
call that establishes a signal handler to allow a process to request the 4.1BSD
semantics for signals, rather than the 4.2BSD semantics (see Section 4.7). The
sole purpose of the flag was to allow existing applications that depended on the
old semantics to continue working without being rewritten.
The implementation changes between 4.2BSD and 4.3BSD generally were not
visible to users, but they were numerous. For example, the developers made
changes to improve support for multiple network-protocol families, such as
XEROX NS, in addition to TCP/IP.
The second release of 4.3BSD, hereafter referred to as 4.3BSD Tahoe, added
support for the Computer Consoles, Inc. (CCI) Power 6 (Tahoe) series of minicomputers
in addition to the VAX. Although generally similar to the original release of
4.3BSD for the VAX, it included many modifications and new features.
The third release of 4.3BSD, hereafter referred to as 4.3BSD-Reno, added
ISO/OSI networking support, a freely redistributable implementation of NFS, and
the conversion to and addition of the POSIX.1 facilities.
4.4BSD Design Goals
4.4BSD broadened the 4.3BSD hardware base, and now supports numerous architectures,
including Motorola 68K, Sun SPARC, MIPS, and Intel PCs.
The 4.4BSD release remedies several deficiencies in 4.3BSD. In particular,
the virtual-memory system needed to be and was completely replaced. The new
virtual-memory system provides algorithms that are better suited to the large
memories currently available, and is much less dependent on the VAX architecture.
The 4.4BSD release also added an implementation of networking protocols in the
International Organization for Standardization (ISO) suite, and further TCP/IP
performance improvements and enhancements.
The terminal driver had been carefu l l y kept compatible not only with Seventh
Edition, but even w i th S i xth Edi tion. Thi s feature had been usefu l , but i s increas­
ingly less so now, especially considering the l ack of orthogonal ity of its com­
mands and options. I n 4.4BSD, the CSRG repl aced it with a POS IX-compati ble
terminal driver; since System V i s compl i ant with POS I X , the terminal driver is
compati ble w i th S ystem V. POSI X compati b i l i ty i n general was a goal . POS I X
support i s not l i mi ted t o kernel fac i lities s u c h as termios a n d sessions, b u t rather
also incl udes most POSI X util ities.
The most critical shortcoming of 4.3BSD was the lack of support for multiple
filesystems. As is true of the networking protocols, there is no single filesystem
that provides enough speed and functionality for all situations. It is frequently
necessary to support several different filesystem protocols, just as it is necessary to
run several different network protocols. Thus, 4.4BSD includes an object-oriented
interface to filesystems similar to Sun Microsystems' vnode framework. This
framework supports multiple local and remote filesystems, much as multiple
networking protocols are supported by 4.3BSD [Sandberg et al, 1985]. The vnode
interface has been generalized to make the operation set dynamically extensible
and to allow filesystems to be stacked. With this structure, 4.4BSD supports
numerous filesystem types, including loopback, union, and uid/gid mapping
layers, plus an ISO9660 filesystem, which is particularly useful for CD-ROMs. It also
supports Sun's Network filesystem (NFS) Versions 2 and 3 and a new local
disk-based log-structured filesystem.
Original work on the flexible configuration of IPC processing modules was
done at Bell Laboratories in UNIX Eighth Edition [Presotto & Ritchie, 1985].
This stream I/O system was based on the UNIX character I/O system. It allowed a
user process to open a raw terminal port and then to insert appropriate
kernel-processing modules, such as one to do normal terminal line editing. Modules to
process network protocols also could be inserted. Stacking a terminal-processing
module on top of a network-processing module allowed flexible and efficient
implementation of network virtual terminals within the kernel. A problem with
stream modules, however, is that they are inherently linear in nature, and thus they
do not adequately handle the fan-in and fan-out associated with multiplexing in
datagram-based networks; such multiplexing is done in device drivers, below the
modules proper. The Eighth Edition stream I/O system was adopted in System V,
Release 3 as the STREAMS system.
The design of the networking facilities for 4.2BSD took a different approach,
based on the socket interface and a flexible multilayer network architecture. This
design allows a single system to support multiple sets of networking protocols
with stream, datagram, and other types of access. Protocol modules may deal with
multiplexing of data from different connections onto a single transport medium, as
well as with demultiplexing of data for different protocols and connections
received from each network device. The 4.4BSD release made small extensions to
the socket interface to allow the implementation of the ISO networking protocols.
Release Engineering
The CSRG was always a small group of software developers. This resource
limitation required careful software-engineering management. Careful coordination was
needed not only of the CSRG personnel, but also of members of the general
community who contributed to the development of the system. Even though the CSRG
is no more, the community still exists; it continues the BSD traditions with
FreeBSD, NetBSD, and BSDI.
Major CSRG distributions usually alternated between
• Major new facilities: 3BSD, 4.0BSD, 4.2BSD, 4.4BSD
• Bug fixes and efficiency improvements: 4.1BSD, 4.3BSD
This alternation allowed timely release, while providing for refinement and
correction of the new facilities and for elimination of performance problems produced
by the new facilities. The timely follow-up of releases that included new facilities
reflected the importance that the CSRG placed on providing a reliable and robust
system on which its user community could depend.
Developments from the CSRG were released in three steps: alpha, beta, and
final, as shown in Table 1.1. Alpha and beta releases were not true distributions;
they were test systems. Alpha releases were normally available to only a few
sites, most of those within the University. More sites got beta releases, but they
did not get these releases directly: a tree structure was imposed to allow bug
reports, fixes, and new software to be collected, evaluated, and checked for
redundancies by first-level sites before forwarding to the CSRG. For example,
4.1aBSD ran at more than 100 sites, but there were only about 15 primary beta
sites. The beta-test tree allowed the developers at the CSRG to concentrate on
actual development, rather than sifting through details from every beta-test site.
This book was reviewed for technical accuracy by a similar process.

Table 1.1 Test steps for the release of 4.2BSD.
Release steps
name:                4.1aBSD, 4.1bBSD, 4.1cBSD
major new facility:  fast filesystem, revised signals

Many of the primary beta-test personnel not only had copies of the release
running on their own machines, but also had login accounts on the development
machine at Berkeley. Such users were commonly found logged in at Berkeley
over the Internet, or sometimes via telephone dialup, from places far away, such as
Australia, England, Massachusetts, Utah, Maryland, Texas, and Illinois, and from
closer places, such as Stanford. For the 4.3BSD and 4.4BSD releases, certain
accounts and users had permission to modify the master copy of the system source
directly. Several facilities, such as the Fortran and C compilers, as well as
important system programs, such as telnet and ftp, include significant contributions from
people who did not work for the CSRG. One important exception to this approach
was that changes to the kernel were made by only the CSRG personnel, although
the changes often were suggested by the larger community.
People given access to the master sources were carefully screened beforehand,
but were not closely supervised. Their work was checked at the end of the
beta-test period by the CSRG personnel, who did a complete comparison of the
source of the previous release with the current master sources (for example, of
4.3BSD with 4.2BSD). Facilities deemed inappropriate, such as new options to the
directory-listing command or a changed return value for the fseek() library
routine, were removed from the source before final distribution.
This process illustrates an advantage of having only a few principal
developers: The developers all knew the whole system thoroughly enough to be able to
coordinate their own work with that of other people to produce a coherent final
system. Companies with large development organizations find this result difficult
to duplicate.
There was no CSRG marketing division. Thus, technical decisions were made
largely for technical reasons, and were not driven by marketing promises. The
Berkeley developers were fanatical about this position, and were well known for
never promising delivery on a specific date.
AT&T, 1987.
AT&T, The System V Interface Definition (SVID), Issue 2, American
Telephone and Telegraph, Murray Hill, NJ, January 1987.
Babaoglu & Joy, 1981.
O. Babaoglu & W. N. Joy, "Converting a Swap-Based System to Do Paging
in an Architecture Lacking Page-Referenced Bits," Proceedings of the
Eighth Symposium on Operating Systems Principles, pp. 78-86, December
1981.
Bach, 1986.
M. J. Bach, The Design of the UNIX Operating System, Prentice-Hall,
Englewood Cliffs, NJ, 1986.
Cerf & Cain, 1983.
V. Cerf & E. Cain, The DoD Internet Architecture Model, pp. 307-318,
Elsevier Science, Amsterdam, Netherlands, 1983.
Chambers & Quarterman, 1983.
J. B. Chambers & J. S. Quarterman, "UNIX System V and 4.1C BSD,"
USENIX Association Conference Proceedings, pp. 267-291, June 1983.
Comer, 1984.
D. Comer, Operating System Design: The Xinu Approach, Prentice-Hall,
Englewood Cliffs, NJ, 1984.
Compton, 1985.
M. Compton, editor, "The Evolution of UNIX," UNIX Review, vol. 3, no. 1,
January 1985.
Ewens et al, 1985.
P. Ewens, D. R. Blythe, M. Funkenhauser, & R. C. Holt, "Tunis: A
Distributed Multiprocessor Operating System," USENIX Association
Conference Proceedings, pp. 247-254, June 1985.
Holt, 1983.
R. C. Holt, Concurrent Euclid, the UNIX System, and Tunis, Addison-Wesley,
Reading, MA, 1983.
Joy, 1980.
W. N. Joy, "Comments on the Performance of UNIX on the VAX," Technical
Report, University of California Computer System Research Group,
Berkeley, CA, April 1980.
Jung, 1985.
R. S. Jung, "Porting the AT&T Demand Paged UNIX Implementation to
Microcomputers," USENIX Association Conference Proceedings, pp.
361-370, June 1985.
Kashtan, 1980.
D. L. Kashtan, "UNIX and VMS: Some Performance Comparisons," Technical
Report, SRI International, Menlo Park, CA, February 1980.
Kernighan & Ritchie, 1978.
B. W. Kernighan & D. M. Ritchie, The C Programming Language,
Prentice-Hall, Englewood Cliffs, NJ, 1978.
Kernighan & Ritchie, 1988.
B. W. Kernighan & D. M. Ritchie, The C Programming Language, 2nd ed,
Prentice-Hall, Englewood Cliffs, NJ, 1988.
McKusick, 1985.
M. K. McKusick, "A Berkeley Odyssey," UNIX Review, vol. 3, no. 1, p. 30,
January 1985.
Miller, 1978.
R. Miller, "UNIX-A Portable Operating System," ACM Operating System
Review, vol. 12, no. 3, pp. 32-37, July 1978.
Miller, 1984.
R. Miller, "A Demand Paging Virtual Memory Manager for System V,"
USENIX Association Conference Proceedings, pp. 178-182, June 1984.
Mohr, 1985.
A. Mohr, "The Genesis Story," UNIX Review, vol. 3, no. 1, p. 18, January
1985.
Organick, 1975.
E. I. Organick, The Multics System: An Examination of Its Structure, MIT
Press, Cambridge, MA, 1975.
P1003.1, 1988.
P1003.1, IEEE P1003.1 Portable Operating System Interface for Computer
Environments (POSIX), Institute of Electrical and Electronic Engineers,
Piscataway, NJ, 1988.
Peirce, 1985.
N. Peirce, "Putting UNIX In Perspective: An Interview with Victor
Vyssotsky," UNIX Review, vol. 3, no. 1, p. 58, January 1985.
Popek, 1981.
B. Popek, "Locus: A Network Transparent, High Reliability Distributed
System," Proceedings of the Eighth Symposium on Operating Systems
Principles, pp. 169-177, December 1981.
Presotto & Ritchie, 1985.
D. L. Presotto & D. M. Ritchie, "Interprocess Communication in the Eighth
Edition UNIX System," USENIX Association Conference Proceedings, pp.
309-316, June 1985.
Richards & Whitby-Strevens, 1980.
M. Richards & C. Whitby-Strevens, BCPL: The Language and Its Compiler,
Cambridge University Press, Cambridge, U.K., 1980, 1982.
Ritchie, 1978.
D. M. Ritchie, "A Retrospective," Bell System Technical Journal, vol. 57,
no. 6, pp. 1947-1969, July-August 1978.
Ritchie, 1984a.
D. M. Ritchie, "The Evolution of the UNIX Time-Sharing System," AT&T
Bell Laboratories Technical Journal, vol. 63, no. 8, pp. 1577-1593, October
1984.
Ritchie, 1984b.
D. M. Ritchie, "Reflections on Software Research," Comm ACM, vol. 27,
no. 8, pp. 758-760, 1984.
Ritchie, 1987.
D. M. Ritchie, "Unix: A Dialectic," USENIX Association Conference
Proceedings, pp. 29-34, January 1987.
Ritchie et al, 1978.
D. M. Ritchie, S. C. Johnson, M. E. Lesk, & B. W. Kernighan, "The C
Programming Language," Bell System Technical Journal, vol. 57, no. 6, pp.
1991-2019, July-August 1978.
Rosler, 1984.
L. Rosler, "The Evolution of C-Past and Future," AT&T Bell Laboratories
Technical Journal, vol. 63, no. 8, pp. 1685-1699, October 1984.
Sandberg et al, 1985.
R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, & B. Lyon, "Design and
Implementation of the Sun Network Filesystem," USENIX Association
Conference Proceedings, pp. 119-130, June 1985.
Stroustrup, 1984.
B. Stroustrup, "Data Abstraction in C," AT&T Bell Laboratories Technical
Journal, vol. 63, no. 8, pp. 1701-1732, October 1984.
Tanenbaum, 1987.
A. S. Tanenbaum, Operating Systems: Design and Implementation,
Prentice-Hall, Englewood Cliffs, NJ, 1987.
Tuthill, 1985.
B. Tuthill, "The Evolution of C: Heresy and Prophecy," UNIX Review, vol.
3, no. 1, p. 80, January 1985.
Uniejewski, 1985.
J. Uniejewski, UNIX System V and BSD4.2 Compatibility Study, Apollo
Computer, Chelmsford, MA, March 1985.
USENIX, 1987.
USENIX, Proceedings of the C++ Workshop, USENIX Association,
Berkeley, CA, November 1987.
Wilson, 1985.
O. Wilson, "The Business Evolution of the UNIX System," UNIX Review,
vol. 3, no. 1, p. 46, January 1985.
X3J11, 1988.
X3J11, X3.159 Programming Language C Standard, Global Press, Santa
Ana, CA, 1988.
X/OPEN, 1987.
X/OPEN, The X/OPEN Portability Guide (XPG), Issue 2, Elsevier Science,
Amsterdam, Netherlands, 1987.
Design Overview of 4.4BSD
4.4BSD Facilities and the Kernel
The 4.4BSD kernel provides four basic facilities: processes, a filesystem,
communications, and system startup. This section outlines where each of these four basic
services is described in this book.
1. Processes constitute a thread of control in an address space. Mechanisms for
creating, terminating, and otherwise controlling processes are described in
Chapter 4. The system multiplexes separate virtual-address spaces for each
process; this memory management is discussed in Chapter 5.
2. The user interface to the filesystem and devices is similar; common aspects are
discussed in Chapter 6. The filesystem is a set of named files, organized in a
tree-structured hierarchy of directories, and of operations to manipulate them,
as presented in Chapter 7. Files reside on physical media such as disks.
4.4BSD supports several organizations of data on the disk, as set forth in
Chapter 8. Access to files on remote machines is the subject of Chapter 9.
Terminals are used to access the system; their operation is the subject of Chapter 10.
3. Communication mechanisms provided by traditional UNIX systems include
simplex reliable byte streams between related processes (see pipes, Section
11.1), and notification of exceptional events (see signals, Section 4.7). 4.4BSD
also has a general interprocess-communication facility. This facility, described
in Chapter 11, uses access mechanisms distinct from those of the filesystem,
but, once a connection is set up, a process can access it as though it were a
pipe. There is a general networking framework, discussed in Chapter 12, that
is normally used as a layer underlying the IPC facility. Chapter 13 describes a
particular networking implementation in detail.
4. Any real operating system has operational issues, such as how to start it
running. Startup and operational issues are described in Chapter 14.
Sections 2.3 through 2.14 present introductory material related to Chapters 3
through 14. We shall define terms, mention basic system calls, and explore
historical developments. Finally, we shall give the reasons for many major design
decisions.
The Kernel
The kernel is the part of the system that runs in protected mode and mediates
access by all user programs to the underlying hardware (e.g., CPU, disks,
terminals, network links) and software constructs (e.g., filesystem, network protocols).
The kernel provides the basic system facilities; it creates and manages processes,
and provides functions to access the filesystem and communication facilities.
These functions, called system calls, appear to user processes as library
subroutines. These system calls are the only interface that processes have to these
facilities. Details of the system-call mechanism are given in Chapter 3, as are
descriptions of several kernel mechanisms that do not execute as the direct result
of a process doing a system call.
A kernel, in traditional operating-system terminology, is a small nucleus of
software that provides only the minimal facilities necessary for implementing
additional operating-system services. In contemporary research operating
systems (such as Chorus [Rozier et al, 1988], Mach [Accetta et al, 1986], Tunis
[Ewens et al, 1985], and the V Kernel [Cheriton, 1988]), this division of
functionality is more than just a logical one. Services such as filesystems and networking
protocols are implemented as client application processes of the nucleus or kernel.
The 4.4BSD kernel is not partitioned into multiple processes. This basic
design decision was made in the earliest versions of UNIX. The first two
implementations by Ken Thompson had no memory mapping, and thus made no
hardware-enforced distinction between user and kernel space [Ritchie, 1988]. A
message-passing system could have been implemented as readily as the actually
implemented model of kernel and user processes. The monolithic kernel was
chosen for simplicity and performance. And the early kernels were small; the
inclusion of facilities such as networking into the kernel has increased its size.
The current trend in operating-systems research is to reduce the kernel size by
placing such services in user space.
Users ordinarily interact with the system through a command-language
interpreter, called a shell, and perhaps through additional user application programs.
Such programs and the shell are implemented with processes. Details of such
programs are beyond the scope of this book, which instead concentrates almost
exclusively on the kernel.
Sections 2.3 and 2.4 describe the services provided by the 4.4BSD kernel, and
give an overview of the latter's design. Later chapters describe the detailed design
and implementation of these services as they appear in 4.4BSD.
Kernel Organization
In this section, we view the organization of the 4.4BSD kernel in two ways:
1. As a static body of software, categorized by the functionality offered by the
modules that make up the kernel
2. By its dynamic operation, categorized according to the services provided to
users
The largest part of the kernel implements the system services that applications
access through system calls. In 4.4BSD, this software has been organized
according to the following:
• Basic kernel facilities: timer and system-clock handling, descriptor management,
and process management
• Memory-management support: paging and swapping
• Generic system interfaces: the I/O, control, and multiplexing operations
performed on descriptors
• The filesystem: files, directories, pathname translation, file locking, and I/O
buffer management
• Terminal-handling support: the terminal-interface driver and terminal line
disciplines
• Interprocess-communication facilities: sockets
• Support for network communication: communication protocols and generic
network facilities, such as routing
Most of the software in these categories is machine independent and is portable
across different hardware architectures.
The machine-dependent aspects of the kernel are isolated from the
mainstream code. In particular, none of the machine-independent code contains
conditional code for specific architectures. When an architecture-dependent action is
needed, the machine-independent code calls an architecture-dependent function
that is located in the machine-dependent code. The software that is machine
dependent includes
• Low-level system-startup actions
• Trap and fault handling
• Low-level manipulation of the run-time context of a process
• Configuration and initialization of hardware devices
• Run-time support for I/O devices
Table 2.1 Machine-independent software in the 4.4BSD kernel.

Category                         Lines of code   Percentage of kernel
initialization                           1,107
kernel facilities
generic interfaces
interprocess communication
terminal handling                        3,911                    1.9
virtual memory                          11,813
vnode management
filesystem naming
fast filestore
log-structure filestore                                           2.1
memory-based filestore
cd9660 filesystem                        4,177                    2.1
miscellaneous filesystems (10)          12,695
network filesystem                      17,199
network communication
internet protocols                      11,984
ISO protocols                           23,924                   11.8
X.25 protocols                          10,626
XNS protocols                            5,192
total machine independent              162,617

Table 2.1 summarizes the machine-independent software that constitutes the
4.4BSD kernel for the HP300. The numbers in column 2 are for lines of C source
code, header files, and assembly language. Virtually all the software in the kernel
is written in the C programming language; less than 2 percent is written in
assembly language. As the statistics in Table 2.2 show, the machine-dependent
software, excluding HP/UX and device support, accounts for a minuscule 6.9 percent
of the kernel.

Table 2.2 Machine-dependent software for the HP300 in the 4.4BSD kernel.

Category                         Lines of code   Percentage of kernel
machine dependent headers                1,562
device driver headers                                             1.7
device driver source                    17,506
virtual memory                                                    1.5
other machine dependent                                           3.1
routines in assembly language            3,014                    1.5
HP/UX compatibility
total machine dependent                                          19.6

Only a small part of the kernel is devoted to initializing the system. This code
is used when the system is bootstrapped into operation and is responsible for
setting up the kernel hardware and software environment (see Chapter 14). Some
operating systems (especially those with limited physical memory) discard or
overlay the software that performs these functions after that software has been
executed. The 4.4BSD kernel does not reclaim the memory used by the startup
code because that memory space is barely 0.5 percent of the kernel resources used
on a typical machine. Also, the startup code does not appear in one place in the
kernel; it is scattered throughout, and it usually appears in places logically
associated with what is being initialized.
Kernel Services
The boundary between the kernel- and user-level code is enforced by
hardware-protection facilities provided by the underlying hardware. The kernel operates in a
separate address space that is inaccessible to user processes. Privileged
operations, such as starting I/O and halting the central processing unit (CPU), are
available to only the kernel. Applications request services from the kernel with
system calls. System calls are used to cause the kernel to execute complicated
operations, such as writing data to secondary storage, and simple operations, such
as returning the current time of day. All system calls appear synchronous to
applications: The application does not run while the kernel does the actions associated
with a system call. The kernel may finish some operations associated with a
system call after it has returned. For example, a write system call will copy the data to
be written from the user process to a kernel buffer while the process waits, but will
usually return from the system call before the kernel buffer is written to the disk.
A system call usually is implemented as a hardware trap that changes the
CPU's execution mode and the current address-space mapping. Parameters
supplied by users in system calls are validated by the kernel before being used. Such
checking ensures the integrity of the system. All parameters passed into the
kernel are copied into the kernel's address space, to ensure that validated parameters
are not changed as a side effect of the system call. System-call results are
returned by the kernel, either in hardware registers or by their values being copied
to user-specified memory addresses. Like parameters passed into the kernel,
addresses used for the return of results must be validated to ensure that they are
part of an application's address space. If the kernel encounters an error while
processing a system call, it returns an error code to the user. For the C programming
language, this error code is stored in the global variable errno, and the function
that executed the system call returns the value -1.
User applications and the kernel operate independently of each other. 4.4BSD
does not store I/O control blocks or other operating-system-related data structures
in the application's address space. Each user-level application is provided an
independent address space in which it executes. The kernel makes most state changes,
such as suspending a process while another is running, invisible to the processes.
Process Management
4.4BSD supports a multitasking environment. Each task or thread of execution is
termed a process. The context of a 4.4BSD process consists of user-level state,
including the contents of its address space and the run-time environment, and
kernel-level state, which includes scheduling parameters, resource controls, and
identification information. The context includes everything used by the kernel in
providing services for the process. Users can create processes, control the
processes' execution, and receive notification when the processes' execution status
changes. Every process is assigned a unique value, termed a process identifier
(PID). This value is used by the kernel to identify a process when reporting
status changes to a user, and by a user when referencing a process in a system call.
The kernel creates a process by duplicating the context of another process.
The new process is termed a child process of the original parent process. The
context duplicated in process creation includes both the user-level execution state
of the process and the process's system state managed by the kernel. Important
components of the kernel state are described in Chapter 4.
Figure 2.1 Process-management system calls.

The process lifecycle is depicted in Fig. 2.1. A process may create a new
process that is a copy of the original by using the fork system call. The fork call
returns twice: once in the parent process, where the return value is the process
identifier of the child, and once in the child process, where the return value is 0.
The parent-child relationship induces a hierarchical structure on the set of
processes in the system. The new process shares all its parent's resources, such as file
descriptors, signal-handling status, and memory layout.
Although there are occasions when the new process is intended to be a copy
of the parent, the loading and execution of a different program is a more useful
and typical action. A process can overlay itself with the memory image of another
program, passing to the newly created image a set of parameters, using the system
call execve. One parameter is the name of a file whose contents are in a format
recognized by the system: either a binary-executable file or a file that causes the
execution of a specified interpreter program to process its contents.
A process may terminate by executing an exit system call, sending 8 bits of
exit status to its parent. If a process wants to communicate more than a single
byte of information with its parent, it must either set up an
interprocess-communication channel using pipes or sockets, or use an intermediate file. Interprocess
communication is discussed extensively in Chapter 11.
A process can suspend execution until any of its child processes terminate
using the wait system call, which returns the PID and exit status of the terminated
child process. A parent process can arrange to be notified by a signal when a child
process exits or terminates abnormally. Using the wait4 system call, the parent
can retrieve information about the event that caused termination of the child
process and about resources consumed by the process during its lifetime. If a process
is orphaned because its parent exits before it is finished, then the kernel arranges
for the child's exit status to be passed back to a special system process (init; see
Sections 3.1 and 14.6).
The details of how the kernel creates and destroys processes are given in
Chapter 4.
Processes are scheduled for execution according to a process-priority
parameter. This priority is managed by a kernel-based scheduling algorithm. Users can
influence the scheduling of a process by specifying a parameter (nice) that weights
the overall scheduling priority, but are still obligated to share the underlying CPU
resources according to the kernel's scheduling policy.
The system defines a set of signals that may be delivered to a process. Signals in
4.4BSD are modeled after hardware interrupts. A process may specify a user-level
subroutine to be a handler to which a signal should be delivered. When a signal is
generated, it is blocked from further occurrence while it is being caught by the
handler. Catching a signal involves saving the current process context and
building a new one in which to run the handler. The signal is then delivered to the
handler, which can either abort the process or return to the executing process (perhaps
after setting a global variable). If the handler returns, the signal is unblocked and
can be generated (and caught) again.
Alternatively, a process may specify that a signal is to be ignored, or that a
default action, as determined by the kernel, is to be taken. The default action of
certain signals is to terminate the process. This termination may be accompanied
by creation of a core file that contains the current memory image of the process for
use in postmortem debugging.
Some signals cannot be caught or ignored. These signals include SIGKILL,
which kills runaway processes, and the job-control signal SIGSTOP.
A process may choose to have signals delivered on a special stack so that
sophisticated software stack manipulations are possible. For example, a language
supporting coroutines needs to provide a stack for each coroutine. The language
run-time system can allocate these stacks by dividing up the single stack provided
by 4.4BSD. If the kernel does not support a separate signal stack, the space
allocated for each coroutine must be expanded by the amount of space required to
catch a signal.
A l l signals have the same priority . If m u l tiple signals are pending simulta­
neously. the order in which signals are de l ivered to a process is impl ementation
specific. Signal handl ers execute with the signal that caused their invocati on to
be bl ocked. but other signals may yet occur. Mechanisms are provided so that
processes can protect critical secti ons of code against the occ urrence of spec i fied
signal s .
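One way a process can protect a critical section, assuming the POSIX sigprocmask interface, is sketched below. The names on_usr1, delivered, and critical_section_demo are hypothetical. The sketch blocks SIGUSR1, raises it inside the protected region, and checks that delivery is deferred until the original mask is restored.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t delivered;

static void on_usr1(int signo)
{
    (void)signo;
    delivered = 1;
}

/*
 * Returns 0 if SIGUSR1 stayed pending while it was blocked and was
 * delivered once the saved mask was restored; -1 otherwise.
 */
int critical_section_demo(void)
{
    struct sigaction sa;
    sigset_t block, saved, pending;
    int ran_inside, was_pending;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    delivered = 0;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &saved) == -1)
        return -1;

    /* Critical section: SIGUSR1 cannot interrupt this code. */
    raise(SIGUSR1);                     /* generated, but held pending */
    ran_inside = delivered;
    if (sigpending(&pending) == -1)
        return -1;
    was_pending = sigismember(&pending, SIGUSR1);

    /* Leaving the critical section delivers the pending signal. */
    if (sigprocmask(SIG_SETMASK, &saved, NULL) == -1)
        return -1;
    return (!ran_inside && was_pending == 1 && delivered == 1) ? 0 : -1;
}
```

Restoring the saved mask, rather than unconditionally unblocking the signal, keeps the sketch correct when the caller had already blocked other signals.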
The detailed design and implementation of signals is described in Section 4.7.
Process Groups and Sessions
Processes are organized into process groups. Process groups are used to control
access to terminals and to provide a means of distributing signals to collections of
related processes. A process inherits its process group from its parent process.
Mechanisms are provided by the kernel to allow a process to alter its process
group or the process group of its descendents. Creating a new process group is
easy; the value of a new process group is ordinarily the process identifier of the
creating process.
The group of processes in a process group is sometimes referred to as a job
and is manipulated by high-level system software, such as the shell. A common
kind of job created by a shell is a pipeline of several processes connected by pipes,
such that the output of the first process is the input of the second, the output of the
second is the input of the third, and so forth. The shell creates such a job by forking
a process for each stage of the pipeline, then putting all those processes into a
separate process group.
A user process can send a signal to each process in a process group, as well as
to a single process. A process in a specific process group may receive software
interrupts affecting the group, causing the group to suspend or resume execution,
or to be interrupted or terminated.
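The following sketch shows how a shell-like parent might place a child in a new process group and then signal the group as a whole. It assumes the POSIX setpgid, getpgid, and kill interfaces; the function name signal_new_group is hypothetical, and a real job has one process per pipeline stage rather than the single child used here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, place it in a new process group whose identifier is
 * the child's process identifier, then signal the whole group.
 * Returns the signal that terminated the child, or -1 on error.
 */
int signal_new_group(void)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        signal(SIGTERM, SIG_DFL);   /* make sure the default action applies */
        setpgid(0, 0);              /* child: become leader of a new group */
        for (;;)
            pause();                /* wait to be signaled */
    }

    /* The parent makes the same call, so the group exists whichever runs first. */
    if (setpgid(pid, pid) == -1 || getpgid(pid) != pid) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }

    /* A negative process identifier addresses every process in the group. */
    if (kill(-pid, SIGTERM) == -1)
        kill(pid, SIGKILL);
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling setpgid in both parent and child is a common way to avoid a race: the group is guaranteed to exist before the parent signals it, regardless of scheduling order.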
A terminal has a process-group identifier assigned to it. This identifier is
normally set to the identifier of a process group associated with the terminal. A
job-control shell may create a number of process groups associated with the same
terminal; the terminal is the controlling terminal for each process in these groups.
A process may read from a descriptor for its controlling terminal only if the
terminal's process-group identifier matches that of the process. If the identifiers do
not match, the process will be blocked if it attempts to read from the terminal.
By changing the process-group identifier of the terminal, a shell can arbitrate a
terminal among several different jobs. This arbitration is called job control and is
described, with process groups, in Section 4.8.
Just as a set of related processes can be collected into a process group, a set of
process groups can be collected into a session. The main uses for sessions are to
create an isolated environment for a daemon process and its children, and to
collect together a user's login shell and the jobs that that shell spawns.
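As a sketch of the daemon case, a forked child can start a new session with the POSIX setsid call. The function name new_session_demo is hypothetical; a real daemon would go on to change its working directory, close its descriptors, and so on.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that starts a new session, as a daemon would.
 * Returns 0 if the child became the leader of a new session and of a
 * new process group, -1 otherwise.
 */
int new_session_demo(void)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* A freshly forked child is not a group leader, so setsid() is allowed. */
        pid_t sid = setsid();
        int ok = sid == getpid()        /* session ID is the caller's PID */
              && getsid(0) == getpid()
              && getpgrp() == getpid(); /* also leader of a new process group */
        _exit(ok ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```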
Memory Management
Each process has its own private address space. The address space is initially
divided into three logical segments: text, data, and stack. The text segment is
read-only and contains the machine instructions of a program. The data and stack
segments are both readable and writable. The data segment contains the
initialized and uninitialized data portions of a program, whereas the stack segment holds
the application's run-time stack. On most machines, the stack segment is
extended automatically by the kernel as the process executes. A process can
expand or contract its data segment by making a system call, whereas a process
can change the size of its text segment only when the segment's contents are
overlaid with data from the filesystem, or when debugging takes place. The initial
contents of the segments of a child process are duplicates of the segments of a
parent process.
The entire contents of a process address space do not need to be resident for a
process to execute. If a process references a part of its address space that is not
resident in main memory, the system pages the necessary information into
memory. When system resources are scarce, the system uses a two-level approach to
maintain available resources. If a modest amount of memory is available, the
system will take memory resources away from processes if these resources have not
been used recently. Should there be a severe resource shortage, the system will
resort to swapping the entire context of a process to secondary storage. The
demand paging and swapping done by the system are effectively transparent to
processes. A process may, however, advise the system about expected future
memory utilization as a performance aid.
BSD Memory-Management Design Decisions
The support of large sparse address spaces, mapped files, and shared memory was
a requirement for 4.2BSD. An interface was specified, called mmap(), that
allowed unrelated processes to request a shared mapping of a file into their address
spaces. If multiple processes mapped the same file into their address spaces,
changes to the file's portion of an address space by one process would be reflected
in the area mapped by the other processes, as well as in the file itself. Ultimately,
4.2BSD was shipped without the mmap() interface, because of pressure to make
other features, such as networking, available.
Further development of the mmap() interface continued during the work on
4.3BSD. Over 40 companies and research groups participated in the discussions
leading to the revised architecture that was described in the Berkeley Software
Architecture Manual [McKusick, Karels et al, 1994]. Several of the companies
have implemented the revised interface [Gingell et al, 1987].
Once again, time pressure prevented 4.3BSD from providing an implementation
of the interface. Although the latter could have been built into the existing
4.3BSD virtual-memory system, the developers decided not to put it in because
that implementation was nearly 10 years old. Furthermore, the original
virtual-memory design was based on the assumption that computer memories were small
and expensive, whereas disks were locally connected, fast, large, and inexpensive.
Thus, the virtual-memory system was designed to be frugal with its use of memory
at the expense of generating extra disk traffic. In addition, the 4.3BSD
implementation was riddled with VAX memory-management hardware dependencies
that impeded its portability to other computer architectures. Finally, the
virtual-memory system was not designed to support the tightly coupled multiprocessors
that are becoming increasingly common and important today.
Attempts to improve the old implementation incrementally seemed doomed
to failure. A completely new design, on the other hand, could take advantage of
large memories, conserve disk transfers, and have the potential to run on
multiprocessors. Consequently, the virtual-memory system was completely replaced in
4.4BSD. The 4.4BSD virtual-memory system is based on the Mach 2.0 VM system
[Tevanian, 1987], with updates from Mach 2.5 and Mach 3.0. It features
efficient support for sharing, a clean separation of machine-independent and
machine-dependent features, as well as (currently unused) multiprocessor support.
Processes can map files anywhere in their address space. They can share parts of
their address space by doing a shared mapping of the same file. Changes made
by one process are visible in the address space of the other process, and also are
written back to the file itself. Processes can also request private mappings of a
file, which prevents any changes that they make from being visible to other
processes mapping the file or being written back to the file itself.
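The difference between shared and private mappings can be sketched with the POSIX mmap interface. The example below is an illustration under stated assumptions: the scratch-file path and the function name mapping_demo are hypothetical, and both mappings are made by one process for brevity, whereas the text describes separate processes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a scratch file twice: once shared, once private.
 * Returns 0 if the store through the shared mapping reached the file
 * and the store through the private mapping did not; -1 otherwise.
 */
int mapping_demo(void)
{
    char path[] = "/tmp/mmapdemoXXXXXX";    /* hypothetical scratch file */
    char buf[6] = "";
    char *shared, *priv;
    int fd, result = -1;

    fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                   /* the file lives until fd is closed */
    if (write(fd, "hello", 5) != 5) {
        close(fd);
        return -1;
    }

    shared = mmap(NULL, 5, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    priv = mmap(NULL, 5, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (shared != MAP_FAILED && priv != MAP_FAILED) {
        priv[0] = 'J';              /* private: the change stays in this mapping */
        shared[4] = 'p';            /* shared: the change goes through to the file */
        if (msync(shared, 5, MS_SYNC) == 0
            && pread(fd, buf, 5, 0) == 5
            && strcmp(buf, "hellp") == 0          /* file saw only the shared store */
            && memcmp(priv, "Jello", 5) == 0)     /* private copy is unaffected */
            result = 0;
    }
    if (shared != MAP_FAILED)
        munmap(shared, 5);
    if (priv != MAP_FAILED)
        munmap(priv, 5);
    close(fd);
    return result;
}
```

The private mapping is written first so that it holds its own copy of the page before the shared store happens; whether a private mapping sees later changes to the file before its first write is not something a program should rely on.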
Another issue with the virtual-memory system is the way that information is
passed into the kernel when a system call is made. 4.4BSD always copies data
from the process address space into a buffer in the kernel. For read or write
operations that are transferring large quantities of data, doing the copy can be time
consuming. An alternative to doing the copying is to remap the process memory into
the kernel. The 4.4BSD kernel always copies the data for several reasons:
• Often, the user data are not page aligned and are not a multiple of the hardware
page length.
• If the page is taken away from the process, it will no longer be able to reference
that page. Some programs depend on the data remaining in the buffer even after
those data have been written.
• If the process is allowed to keep a copy of the page (as it is in current 4.4BSD
semantics), the page must be made copy-on-write. A copy-on-write page is one
that is protected against being written by being made read-only. If the process
attempts to modify the page, the kernel gets a write fault. The kernel then makes
a copy of the page that the process can modify. Unfortunately, the typical
process will immediately try to write new data to its output buffer, forcing the data
to be copied anyway.
• When pages are remapped to new virtual-memory addresses, most
memory-management hardware requires that the hardware address-translation cache be
purged selectively. The cache purges are often slow. The net effect is that
remapping is slower than copying for blocks of data less than 4 to 8 Kbyte.
The biggest incentives for memory mapping are the needs for accessing big files
and for passing large quantities of data between processes. The mmap() interface
provides a way for both of these tasks to be done without copying.
Memory Management Inside the Kernel
The kernel often does allocations of memory that are needed for only the duration
of a single system call. In a user process, such short-term memory would be
allocated on the run-time stack. Because the kernel has a limited run-time stack, it is
not feasible to allocate even moderate-sized blocks of memory on it.
Consequently, such memory must be allocated through a more dynamic mechanism. For
example, when the system must translate a pathname, it must allocate a 1-Kbyte
buffer to hold the name. Other blocks of memory must be more persistent than a
single system call, and thus could not be allocated on the stack even if there was
space. An example is protocol-control blocks that remain throughout the duration
of a network connection.
Demands for dynamic memory allocation in the kernel have increased as
more services have been added. A generalized memory allocator reduces the
complexity of writing code inside the kernel. Thus, the 4.4BSD kernel has a single
memory allocator that can be used by any part of the system. It has an interface
similar to the C library routines malloc() and free() that provide memory
allocation to application programs [McKusick & Karels, 1988]. Like the C library
interface, the allocation routine takes a parameter specifying the size of memory that is
needed. The range of sizes for memory requests is not constrained; however,
physical memory is allocated and is not paged. The free routine takes a pointer to
the storage being freed, but does not require the size of the piece of memory being
freed.
I/O System
The basic model of the UNIX I/O system is a sequence of bytes that can be
accessed either randomly or sequentially. There are no access methods and no
control blocks in a typical UNIX user process.
Different programs expect various levels of structure, but the kernel does not
impose structure on I/O. For instance, the convention for text files is lines of
ASCII characters separated by a single newline character (the ASCII line-feed
character), but the kernel knows nothing about this convention. For the purposes of
most programs, the model is further simplified to being a stream of data bytes, or
an I/O stream. It is this single common data form that makes the characteristic
UNIX tool-based approach work [Kernighan & Pike, 1984]. An I/O stream from
one program can be fed as input to almost any other program. (This kind of
traditional UNIX I/O stream should not be confused with the Eighth Edition stream I/O
system or with the System V, Release 3 STREAMS, both of which can be accessed
as traditional I/O streams.)
Descriptors and I/O
UNIX processes use descriptors to reference I/O streams. Descriptors are small
unsigned integers obtained from the open and socket system calls. The open
system call takes as arguments the name of a file and a permission mode to specify
whether the file should be open for reading or for writing, or for both. This
system call also can be used to create a new, empty file. A read or write system call
can be applied to a descriptor to transfer data. The close system call can be used
to deallocate any descriptor.
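The open, write, read, and close sequence described above might look as follows. This is a sketch: the scratch directory, the file name data, and the function name write_then_read are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a new file, write msg to it, reopen it read-only, and read
 * the bytes back into buf.
 * Returns the number of bytes read back, or -1 on error.
 */
ssize_t write_then_read(const char *msg, char *buf, size_t buflen)
{
    char dir[] = "/tmp/fddemoXXXXXX";   /* hypothetical scratch directory */
    char path[64];
    size_t len = strlen(msg);
    ssize_t n = -1;
    int fd;

    assert(buf != NULL);
    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof path, "%s/data", dir);

    /* open() can create a new, empty file and open it for writing. */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd != -1) {
        int written = write(fd, msg, len) == (ssize_t)len;

        close(fd);                      /* deallocate the descriptor */
        if (written) {
            fd = open(path, O_RDONLY);  /* a second descriptor, read-only */
            if (fd != -1) {
                n = read(fd, buf, buflen);
                close(fd);
            }
        }
        unlink(path);
    }
    rmdir(dir);
    return n;
}
```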
Descriptors represent underlying objects supported by the kernel, and are
created by system calls specific to the type of object. In 4.4BSD, three kinds of
objects can be represented by descriptors: files, pipes, and sockets.
A file is a linear array of bytes with at least one name. A file exists until all its
names are deleted explicitly and no process holds a descriptor for it. A process
acquires a descriptor for a file by opening that file's name with the open system
call. I/O devices are accessed as files.
A pipe is a linear array of bytes, as is a file, but it is used solely as an I/O stream,
and it is unidirectional. It also has no name, and thus cannot be opened with
open. Instead, it is created by the pipe system call, which returns two
descriptors, one of which accepts input that is sent to the other descriptor reliably,
without duplication, and in order. The system also supports a named pipe or FIFO. A
FIFO has properties identical to a pipe, except that it appears in the filesystem;
thus, it can be opened using the open system call. Two processes that wish to
communicate each open the FIFO: One opens it for reading, the other for writing.
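A minimal sketch of the pipe system call follows. The function name pipe_roundtrip is hypothetical, and both ends are used by a single process only to keep the example short; the message must be small enough to fit in the pipe's buffer, because nothing is reading while the write is in progress.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <unistd.h>

/*
 * Send msg through a pipe and read it back from the other end.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t pipe_roundtrip(const char *msg, char *buf, size_t buflen)
{
    int fd[2];
    size_t len = strlen(msg);
    ssize_t n = -1;

    assert(buf != NULL);
    if (pipe(fd) == -1)
        return -1;

    /* fd[1] accepts input; the same bytes appear, in order, on fd[0]. */
    if (write(fd[1], msg, len) == (ssize_t)len) {
        close(fd[1]);       /* no writers left: reader sees end-of-file after the data */
        n = read(fd[0], buf, buflen);
    } else {
        close(fd[1]);
    }
    close(fd[0]);
    return n;
}
```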
A socket is a transient object that is used for interprocess communication; it
exists only as long as some process holds a descriptor referring to it. A socket is
created by the socket system call, which returns a descriptor for it. There are
different kinds of sockets that support various communication semantics, such as
reliable delivery of data, preservation of message ordering, and preservation of
message boundaries.
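Preservation of message boundaries can be seen with a pair of connected datagram sockets. The sketch below uses socketpair rather than socket purely for brevity, which is an assumption of the example and not something the text prescribes; the function name message_boundaries is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send two messages on a local datagram socket pair, then receive
 * them using a buffer large enough to hold both.
 * Stores the size of each received message; returns 0 on success.
 */
int message_boundaries(ssize_t sizes[2])
{
    int sv[2];
    char buf[64];
    int rc = -1;

    assert(sizes != NULL);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    if (send(sv[0], "ab", 2, 0) == 2 && send(sv[0], "cde", 3, 0) == 3) {
        /* Each recv() returns exactly one message, never a merged one. */
        sizes[0] = recv(sv[1], buf, sizeof buf, 0);
        sizes[1] = recv(sv[1], buf, sizeof buf, 0);
        rc = 0;
    }
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

With a stream socket in place of the datagram socket, the first receive could return all five bytes at once, because a byte stream keeps ordering but not boundaries.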
In systems before 4.2BSD, pipes were implemented using the filesystem; when
sockets were introduced in 4.2BSD, pipes were reimplemented as sockets.
The kernel keeps for each process a descriptor table, which is a table that
the kernel uses to translate the external representation of a descriptor into an
internal representation. (The descriptor is merely an index into this table.) The
descriptor table of a process is inherited from that process's parent, and thus
access to the objects to which the descriptors refer also is inherited. The main
ways that a process can obtain a descriptor are by opening or creation of an
object, and by inheritance from the parent process. In addition, socket IPC
allows passing of descriptors in messages between unrelated processes on the
same machine.
Every valid descriptor has an associated file offset in bytes from the beginning
of the object. Read and write operations start at this offset, which is updated after
each data transfer. For objects that permit random access, the file offset also may
be set with the lseek system call. Ordinary files permit random access, and some
devices do, as well. Pipes and sockets do not.
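The behavior of the file offset can be sketched as follows. The scratch-file path and the function name offset_demo are hypothetical; the sketch also checks that lseek is refused on a pipe.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Exercise the file offset of an ordinary file, then confirm that a
 * pipe does not permit random access.
 * Returns 0 if every step behaved as described, -1 otherwise.
 */
int offset_demo(void)
{
    char path[] = "/tmp/seekdemoXXXXXX";    /* hypothetical scratch file */
    char buf[2];
    int pfd[2], ok = 0;
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);
    if (write(fd, "abcdef", 6) == 6             /* offset advances to 6 */
        && lseek(fd, 2, SEEK_SET) == 2          /* reposition to byte 2 */
        && read(fd, buf, 2) == 2                /* transfers "cd" */
        && lseek(fd, 0, SEEK_CUR) == 4)         /* the read moved the offset to 4 */
        ok = (buf[0] == 'c' && buf[1] == 'd');
    close(fd);

    if (pipe(pfd) == -1)
        return -1;
    if (!(lseek(pfd[0], 0, SEEK_SET) == -1 && errno == ESPIPE))
        ok = 0;                                 /* pipes must reject lseek */
    close(pfd[0]);
    close(pfd[1]);
    return ok ? 0 : -1;
}
```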
When a process terminates, the kernel reclaims all the descriptors that were in
use by that process. If the process was holding the final reference to an object, the
object's manager is notified so that it can do any necessary cleanup actions, such
as final deletion of a file or deallocation of a socket.
Descriptor Management
Most processes expect three descriptors to be open already when they start
running. These descriptors are 0, 1, 2, more commonly known as standard input,
standard output, and standard error, respectively. Usually, all three are associated
with the user's terminal by the login process (see Section 14.6) and are inherited
through fork and exec by processes run by the user. Thus, a program can read
what the user types by reading standard input, and the program can send output to
the user's screen by writing to standard output. The standard error descriptor also
is open for writing and is used for error output, whereas standard output is used
for ordinary output.
These (and other) descriptors can be mapped to objects other than the
terminal; such mapping is called I/O redirection, and all the standard shells permit users
to do it. The shell can direct the output of a program to a file by closing descriptor
1 (standard output) and opening the desired output file to produce a new descriptor
1. It can similarly redirect standard input to come from a file by closing descriptor
0 and opening the file.
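The close-then-open technique for redirecting standard output can be sketched as below. The redirection is done in a forked child so that the caller's own standard output is untouched. The scratch-file path and the function name redirect_stdout are hypothetical, and the sketch assumes that descriptor 0 is open in the child, so that 1 is the lowest unused descriptor after the close.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run a child whose standard output is redirected to a scratch file,
 * in the manner of a shell handling "command > file".
 * Copies what the child wrote into buf; returns the byte count or -1.
 */
ssize_t redirect_stdout(char *buf, size_t buflen)
{
    char path[] = "/tmp/redirXXXXXX";   /* hypothetical scratch file */
    int status, fd = mkstemp(path);
    ssize_t n = -1;
    pid_t pid;

    assert(buf != NULL);
    if (fd == -1)
        return -1;

    pid = fork();
    if (pid == 0) {
        close(1);                           /* free descriptor 1 */
        if (open(path, O_WRONLY) != 1)      /* lowest unused number: becomes 1 */
            _exit(1);
        /* The program itself just writes to standard output as usual. */
        _exit(write(1, "redirected\n", 11) == 11 ? 0 : 1);
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid
        && WIFEXITED(status) && WEXITSTATUS(status) == 0)
        n = read(fd, buf, buflen);          /* parent's offset is still 0 */
    close(fd);
    unlink(path);
    return n;
}
```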
Pipes allow the output of one program to be input to another program without
rewriting or even relinking of either program. Instead of descriptor 1 (standard
output) of the source program being set up to write to the terminal, it is set up to be
the input descriptor of a pipe. Similarly, descriptor 0 (standard input) of the sink
program is set up to reference the output of the pipe, instead of the terminal
keyboard. The resulting set of two processes and the connecting pipe is known as
a pipeline. Pipelines can be arbitrarily long series of processes connected by pipes.
The open, pipe, and socket system calls produce new descriptors with the
lowest unused number usable for a descriptor. For pipelines to work, some
mechanism must be provided to map such descriptors into 0 and 1. The dup system call
creates a copy of a descriptor that points to the same file-table entry. The new
descriptor is also the lowest unused one, but if the desired descriptor is closed first,
dup can be used to do the desired mapping. Care is required, however: If
descriptor 1 is desired, and descriptor 0 happens also to have been closed, descriptor 0
will be the result. To avoid this problem, the system provides the dup2 system
call; it is like dup, but it takes an additional argument specifying the number of the
desired descriptor (if the desired descriptor was already open, dup2 closes it
before reusing it).
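A sketch of how dup2 maps the write end of a pipe onto descriptor 1 follows. The function name read_child_stdout is hypothetical; a shell would normally follow the dup2 with exec, which is omitted here so that the example does not depend on any external program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child whose standard output is the write end of a pipe, and
 * read what the child writes there.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t read_child_stdout(char *buf, size_t buflen)
{
    int fd[2], status;
    ssize_t n;
    pid_t pid;

    assert(buf != NULL);
    if (pipe(fd) == -1)
        return -1;
    pid = fork();
    if (pid == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (pid == 0) {
        /* dup2 names the target explicitly, whatever else is closed. */
        if (dup2(fd[1], 1) == -1)
            _exit(1);
        close(fd[0]);
        if (fd[1] != 1)
            close(fd[1]);       /* descriptor 1 still refers to the pipe */
        _exit(write(1, "via dup2\n", 9) == 9 ? 0 : 1);
    }
    close(fd[1]);               /* parent keeps only the read end */
    n = read(fd[0], buf, buflen);
    close(fd[0]);
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return n;
}
```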
Hardware devices have filenames, and may be accessed by the user via the same
system calls used for regular files. The kernel can distinguish a device special file
or special file, and can determine to what device it refers, but most processes do
not need to make this determination. Terminals, printers, and tape drives are all
accessed as though they were streams of bytes, like 4.4BSD disk files. Thus,
device dependencies and peculiarities are kept in the kernel as much as possible, and
even in the kernel most of them are segregated in the device drivers.
Hardware devices can be categorized as either structured or unstructured;
they are known as block or character devices, respectively. Processes typically
access devices through special files in the filesystem. I/O operations to these files
are handled by kernel-resident software modules termed device drivers. Most
network-communication hardware devices are accessible through only the
interprocess-communication facilities, and do not have special files in the filesystem name
space, because the raw-socket interface provides a more natural interface than
does a special file.
Structured or block devices are typified by disks and magnetic tapes, and
include most random-access devices. The kernel supports read-modify-write-type
buffering actions on block-oriented structured devices to allow the latter to be read
and written in a totally random byte-addressed fashion, like regular files.
Filesystems are created on block devices.
Unstructured devices are those devices that do not support a block structure.
Familiar unstructured devices are communication lines, raster plotters, and
unbuffered magnetic tapes and disks. Unstructured devices typically support large
block I/O transfers.
Unstructured files are called character devices because the first of these to be
implemented were terminal device drivers. The kernel interface to the driver for
these devices proved convenient for other devices that were not block structured.
Device special files are created by the mknod system call. There is an
additional system call, ioctl, for manipulating the underlying device parameters of
special files. The operations that can be done differ for each device. This system call
allows the special characteristics of devices to be accessed, rather than
overloading the semantics of other system calls. For example, there is an ioctl on a tape
drive to write an end-of-tape mark, instead of there being a special or modified
version of write.
Socket IPC
The 4.2BSD kernel introduced an IPC mechanism more flexible than pipes, based
on sockets. A socket is an endpoint of communication referred to by a descriptor,
just like a file or a pipe. Two processes can each create a socket, and then connect
those two endpoints to produce a reliable byte stream. Once connected, the
descriptors for the sockets can be read or written by processes, just as the latter
would do with a pipe. The transparency of sockets allows the kernel to redirect
the output of one process to the input of another process residing on another
machine. A major difference between pipes and sockets is that pipes require a
common parent process to set up the communications channel. A connection
between sockets can be set up by two unrelated processes, possibly residing on
different machines.
System V provides local interprocess communication through FIFOs (also
known as named pipes). FIFOs appear as an object in the filesystem that unrelated
processes can open and send data through in the same way as they would
communicate through a pipe. Thus, FIFOs do not require a common parent to set them
up; they can be connected after a pair of processes are up and running. Unlike
sockets, FIFOs can be used on only a local machine; they cannot be used to
communicate between processes on different machines. FIFOs are implemented in
4.4BSD only because they are required by the standard. Their functionality is a
subset of the socket interface.
The socket mechanism requires extensions to the traditional UNIX I/O system
calls to provide the associated naming and connection semantics. Rather than
overloading the existing interface, the developers used the existing interfaces to
the extent that the latter worked without being changed, and designed new
interfaces to handle the added semantics. The read and write system calls were used
for byte-stream type connections, but six new system calls were added to allow
sending and receiving addressed messages such as network datagrams. The
system calls for writing messages include send, sendto, and sendmsg. The system
calls for reading messages include recv, recvfrom, and recvmsg. In retrospect, the
first two in each class are special cases of the others; recvfrom and sendto
probably should have been added as library interfaces to recvmsg and sendmsg,
respectively.
Scatter/Gather I/O
In addition to the traditional read and write system calls, 4.2BSD introduced the
ability to do scatter/gather I/O. Scatter input uses the readv system call to allow a
single read to be placed in several different buffers. Conversely, the writev system
call allows several different buffers to be written in a single atomic write. Instead
of passing a single buffer and length parameter, as is done with read and write, the
process passes in a pointer to an array of buffers and lengths, along with a count
describing the size of the array.
This facility allows buffers in different parts of a process address space to be
written atomically, without the need to copy them to a single contiguous buffer.
Atomic writes are necessary in the case where the underlying abstraction is record
based, such as tape drives that output a tape block on each write request. It is also
convenient to be able to read a single request into several different buffers (such as
a record header into one place and the data into another). Although an application
can simulate the ability to scatter data by reading the data into a large buffer and
then copying the pieces to their intended destinations, the cost of
memory-to-memory copying in such cases often would more than double the running time of
the affected application.
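The record-header case mentioned above can be sketched with writev and readv. The 4-byte header, the payload contents, and the function name scatter_gather are hypothetical, and a pipe stands in for whatever record-based object a real program would use.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Gather a 4-byte header and a body into one write, then scatter the
 * same bytes into separate header and body buffers with one read.
 * Returns the total number of bytes read, or -1 on error.
 */
ssize_t scatter_gather(char hdr_out[4], char *body_out, size_t body_len)
{
    char hdr[4] = { 'H', 'D', 'R', '1' };
    char body[] = "payload";
    struct iovec out[2], in[2];
    size_t total = sizeof hdr + strlen(body);
    ssize_t n = -1;
    int fd[2];

    assert(hdr_out != NULL && body_out != NULL);
    if (pipe(fd) == -1)
        return -1;

    /* Two separate buffers, one write request. */
    out[0].iov_base = hdr;
    out[0].iov_len = sizeof hdr;
    out[1].iov_base = body;
    out[1].iov_len = strlen(body);

    if (writev(fd[1], out, 2) == (ssize_t)total) {
        /* One read request fills the header buffer first, then the body. */
        in[0].iov_base = hdr_out;
        in[0].iov_len = 4;
        in[1].iov_base = body_out;
        in[1].iov_len = body_len;
        n = readv(fd[0], in, 2);
    }
    close(fd[0]);
    close(fd[1]);
    return n;
}
```

The third argument to writev and readv is the count describing the size of the array, as the text notes; no intermediate contiguous buffer is needed on either side.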
Just as send and recv could have been implemented as library interfaces to
sendto and recvfrom, it also would have been possible to simulate read with readv
and write with writev. However, read and write are used so much more frequently
that the added cost of simulating them would not have been worthwhile.
Multiple Filesystem Support
With the expansion of network computing, it became desirable to support both
local and remote filesystems. To simplify the support of multiple filesystems, the
developers added a new virtual node or vnode interface to the kernel. The set of
operations exported from the vnode interface appear much like the filesystem
operations previously supported by the local filesystem. However, they may be
supported by a wide range of filesystem types:
• Local disk-based filesystems
• Files imported using a variety of remote filesystem protocols
• Read-only CD-ROM filesystems
• Filesystems providing special-purpose interfaces, for example, the /proc
filesystem
A few variants of 4.4BSD, such as FreeBSD, allow filesystems to be loaded
dynamically when the filesystems are first referenced by the mount system call.
The vnode interface is described in Section 6.5; its ancillary support routines are
described in Section 6.6; several of the special-purpose filesystems are described
in Section 6.7.
2.7 Filesystems
A regular file is a linear array of bytes, and can be read and written starting at any
byte in the file. The kernel distinguishes no record boundaries in regular files,
although many programs recognize line-feed characters as distinguishing the ends
of lines, and other programs may impose other structure. No system-related
information about a file is kept in the file itself, but the filesystem stores a small amount
of ownership, protection, and usage information with each file.
A filename component is a string of up to 255 characters . These filenames are
stored in a type of file called a directory. The information in a directory about a
file is called a directory entry and includes, in addition to the filename, a pointer to
the file itself. Directory entries may refer to other directories, as well as to plain
files. A hierarchy of directories and files is thus formed, and is called a filesystem;
a small one is shown in Fig. 2.2. Directories may contain subdirectories, and there
is no inherent limitation to the depth with which directory nesting may occur. To
protect the consistency of the filesystem, the kernel does not permit processes to
write directly into directories. A filesystem may include not only plain files and
directories, but also references to other objects, such as devices and sockets.
The filesystem forms a tree, the beginning of which is the root directory,
sometimes referred to by the name slash, spelled with a single solidus character
(/). The root directory contains files; in our example in Fig. 2.2, it contains
vmunix, a copy of the kernel-executable object file. It also contains directories; in this
example, it contains the usr directory. Within the usr directory is the bin
directory, which mostly contains executable object code of programs, such as the files
ls and vi.
A process identifies a file by specifying that file's pathname, which is a string
composed of zero or more filenames separated by slash (/) characters. The kernel
associates two directories with each process for use in interpreting pathnames. A
process's root directory is the topmost point in the filesystem that the process can
access; it is ordinarily set to the root directory of the entire filesystem. A
pathname beginning with a slash is called an absolute pathname, and is interpreted by
the kernel starting with the process's root directory.
Figure 2.2 A small filesystem tree.
A pathname that does not begin with a slash is called a relative pathname, and
is interpreted relative to the current working directory of the process. (This
directory also is known by the shorter names current directory or working directory.)
The current directory itself may be referred to directly by the name dot, spelled
with a single period (.). The filename dot-dot (..) refers to a directory's parent
directory. The root directory is its own parent.
A process may set its root directory with the chroot system call, and its
current directory with the chdir system call. Any process may do chdir at any time,
but chroot is permitted only to a process with superuser privileges. Chroot is
normally used to set up restricted access to the system.
U s i n g t h e fi lesystem shown in F i g . 2 . 2 , i f a process h a s t h e root o f t h e fi l esys­
tem as its root di rectory. and has /usr as i ts current directory, it can refer to the ti l e
v i either from the root w i th the absol ute pathname /usr/bin/vi, or from i t s current
di rectory w i th the relative pathname bin/vi.
Sy stem util ities and databases are kept i n certain wel l-known directorie s . Part
of the wel l-defined hierarchy includes a directory that contains the home directory
for each user-for example, /usr/staff/mckusick and /usr/staff/karels in Fig. 2 . 2 .
When users l o g i n , the c u rrent working directory o f the i r she l l i s set t o the home
directory. With i n their home directories, users can create directories as easily as
they can regular fi l e s . Thus, a user can bui l d arbitrari l y complex subhierarchies.
The user usually knows of only one filesystem, but the system may know that
this one virtual filesystem is really composed of several physical filesystems,
each on a different device. A physical filesystem may not span multiple
hardware devices. Since most physical disk devices are divided into several
logical devices, there may be more than one filesystem per physical device, but
there will be no more than one per logical device. One filesystem, the
filesystem that anchors all absolute pathnames, is called the root filesystem,
and is always available. Others may be mounted; that is, they may be integrated
into the directory hierarchy of the root filesystem. References to a directory
that has a filesystem mounted on it are converted transparently by the kernel
into references to the root directory of the mounted filesystem.
The link system call takes the name of an existing file and another name to
create for that file. After a successful link, the file can be accessed by
either filename. A filename can be removed with the unlink system call. When
the final name for a file is removed (and the final process that has the file
open closes it), the file is deleted.
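The life of a file with two names can be sketched as follows. This is an illustrative fragment only; the function name link_demo and its return convention are assumptions made for the example.

```c
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create a file called name1, give it a second name name2, then remove
 * the first name.  Returns the link count observed while both names
 * existed (expected to be 2), or -1 on error.  On success, the file
 * remains reachable through name2 only.
 */
int
link_demo(const char *name1, const char *name2)
{
    struct stat st;
    int fd, nlinks;

    fd = open(name1, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd == -1)
        return (-1);
    close(fd);
    if (link(name1, name2) == -1 || stat(name2, &st) == -1) {
        unlink(name1);
        return (-1);
    }
    nlinks = (int)st.st_nlink;          /* two names, one file */
    if (unlink(name1) == -1)            /* the file survives as name2 */
        return (-1);
    return (nlinks);
}
```

The file itself is deleted only when the caller later does unlink on name2 and no process still has the file open.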
Files are organized hierarchically in directories. A directory is a type of
file, but, in contrast to regular files, a directory has a structure imposed on
it by the system. A process can read a directory as it would an ordinary file,
but only the kernel is permitted to modify a directory. Directories are created
by the mkdir system call and are removed by the rmdir system call. Before
4.2BSD, the mkdir and rmdir system calls were implemented by a series of link
and unlink system calls. There were three reasons for adding system calls
explicitly to create and delete directories:
1. The operation could be made atomic. If the system crashed, the directory
would not be left half-constructed, as could happen when a series of link
operations were used.
2. When a networked filesystem is being run, the creation and deletion of files
and directories need to be specified atomically so that they can be serialized.
3. When supporting non-UNIX filesystems, such as an MS-DOS filesystem, on
another partition of the disk, the other filesystem may not support link opera­
tions. Although other filesystems might support the concept of directories,
they probably would not create and delete the directories with links, as the
UNIX filesystem does. Consequently, they could create and delete directories
only if explicit directory create and delete requests were presented.
The chown system call sets the owner and group of a file, and chmod changes
protection attributes. Stat applied to a filename can be used to read back such
properties of a file. The fchown, fchmod, and fstat system calls are applied to a
descriptor, instead of to a filename, to do the same set of operations. The rename
system call can be used to give a file a new name in the filesystem, replacing one
of the file's old names. Like the directory-creation and directory-deletion opera­
tions, the rename system call was added to 4.2BSD to provide atomicity to name
changes in the local filesystem. Later, it proved useful explicitly to export renam­
ing operations to foreign filesystems and over the network.
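A common use of the atomicity of rename is to replace a file such that other processes see either the complete old contents or the complete new contents. The sketch below assumes that the temporary name tmp is on the same filesystem as path; the function name replace_file is hypothetical.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Replace the contents of path with the string data.  The new contents
 * are written under the temporary name tmp, then given the name path
 * with a single rename.  Returns 0 on success, -1 on error.
 */
int
replace_file(const char *path, const char *tmp, const char *data)
{
    size_t len = strlen(data);
    int fd;

    fd = open(tmp, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd == -1)
        return (-1);
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        unlink(tmp);
        return (-1);
    }
    if (close(fd) == -1) {
        unlink(tmp);
        return (-1);
    }
    /* The name change is done as one operation by the kernel. */
    return (rename(tmp, path));
}
```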
The truncate system call was added to 4.2BSD to allow files to be shortened
to an arbitrary offset. The call was added primarily in support of the Fortran run­
time library, which has the semantics such that the end of a random-access file is
set to be wherever the program most recently accessed that file. Without the trun­
cate system call, the only way to shorten a file was to copy the part that was
desired to a new file, to delete the old file, then to rename the copy to the original
name. As well as this algorithm being slow, the library could potentially fail on a
full filesystem.
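With truncate, shortening a file is a single call. The fragment below is a sketch for illustration; the wrapper name shorten is an assumption, and the size is read back with stat only to show the effect.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Set the length of the file named path to length bytes, and return
 * the size that the system then reports for it, or -1 on error.
 */
off_t
shorten(const char *path, off_t length)
{
    struct stat st;

    if (truncate(path, length) == -1)
        return (-1);
    if (stat(path, &st) == -1)
        return (-1);
    return (st.st_size);
}
```

No copy of the file is made, so the operation does not need free space on the filesystem.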
Once the filesystem had the ability to shorten files, the kernel took advantage
of that ability to shorten large empty directories. The advantage of shortening
empty directories is that it reduces the time spent in the kernel searching them
when names are being created or deleted.
Newly created files are assigned the user identifier of the process that created
them and the group identifier of the directory in which they were created. A three­
level access-control mechanism is provided for the protection of files. These three
levels specify the accessibility of a file to
1. The user who owns the file
2. The group that owns the file
3. Everyone else
Chapter 2
Design Overview of 4.4BSD
Each level of access has separate indicators for read permission, write
permission, and execute permission.
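The nine indicators are visible to programs as bits in the file mode returned by stat. As a small sketch (the function name mode_string is hypothetical), they can be rendered in the familiar form printed by directory-listing programs:

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Format the nine permission bits of mode as a string such as
 * "rwxr-xr--": three characters each for the owning user, the owning
 * group, and everyone else.  buf must hold at least 10 bytes.
 */
void
mode_string(mode_t mode, char buf[10])
{
    static const mode_t bits[9] = {
        S_IRUSR, S_IWUSR, S_IXUSR,      /* the user who owns the file */
        S_IRGRP, S_IWGRP, S_IXGRP,      /* the group that owns the file */
        S_IROTH, S_IWOTH, S_IXOTH       /* everyone else */
    };
    int i;

    for (i = 0; i < 9; i++)
        buf[i] = (mode & bits[i]) ? "rwx"[i % 3] : '-';
    buf[9] = '\0';
}
```

For example, a mode of octal 0754 yields the string "rwxr-xr--".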
Files are created with zero length, and may grow when they are written. While a
file is open, the system maintains a pointer into the file indicating the
current location in the file associated with the descriptor. This pointer can
be moved about in the file in a random-access fashion. Processes sharing a file
descriptor through a fork or dup system call share the current location
pointer. Descriptors created by separate open system calls have separate
current location pointers. Files may have holes in them. Holes are void areas
in the linear extent of the file where data have never been written. A process
can create these holes by positioning the pointer past the current end-of-file
and writing. When read, holes are treated by the system as zero-valued bytes.
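Both properties, holes and shared location pointers, can be observed with lseek. The two functions below are an illustrative sketch; their names are hypothetical, and read_hole assumes that the descriptor refers to an empty regular file opened for reading and writing.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Write one byte at offset gap of the empty file open on fd, leaving a
 * hole in front of it, then read back the first byte of the hole.
 * Returns the value of that byte, or -1 on error.
 */
int
read_hole(int fd, off_t gap)
{
    char c = 'x';

    if (lseek(fd, gap, SEEK_SET) == -1 || write(fd, &c, 1) != 1)
        return (-1);
    if (lseek(fd, 0, SEEK_SET) == -1 || read(fd, &c, 1) != 1)
        return (-1);
    return ((unsigned char)c);
}

/*
 * Move the location pointer through a duplicate of fd, and return the
 * location then seen through fd itself, or -1 on error.
 */
off_t
shared_offset(int fd, off_t where)
{
    off_t seen;
    int fd2;

    if ((fd2 = dup(fd)) == -1)
        return (-1);
    if (lseek(fd2, where, SEEK_SET) == -1) {
        close(fd2);
        return (-1);
    }
    seen = lseek(fd, 0, SEEK_CUR);      /* query without moving */
    close(fd2);
    return (seen);
}
```

Reading from the hole should yield a zero byte, and the location set through the duplicate should be the one reported through the original descriptor.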
Earlier UNIX systems had a limit of 14 characters per filename component. This
limitation was often a problem. For example, in addition to the natural desire
of users to give files long descriptive names, a common way of forming
filenames is as basename.extension, where the extension (indicating the kind of
file, such as .c for C source or .o for intermediate binary object) is one to
three characters, leaving 10 to 12 characters for the basename.
Source-code-control systems and editors usually take up another two
characters, either as a prefix or a suffix, for their purposes, leaving eight
to 10 characters. It is easy to use 10 or 12 characters in a single English
word as a basename (e.g., "multiplexer").
It is possible to keep within these limits, but it is inconvenient or even
dangerous, because other UNIX systems accept strings longer than the limit when
creating files, but then truncate to the limit. A C language source file named
multiplexer.c (already 13 characters) might have a source-code-control file
with s. prepended, producing a filename s.multiplexer that is indistinguishable
from the source-code-control file for a file containing troff source for
documentation for the C program. The contents of the two original files could
easily get confused with no warning from the source-code-control system.
Careful coding can detect this problem, but the long filenames first introduced
in 4.2BSD practically eliminate it.
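The hazard can be reproduced by modeling the silent truncation. This is a sketch under stated assumptions: the helper names are hypothetical, and multiplexer.ms is used here only as an example name for the troff source file.

```c
#include <string.h>

#define OLD_NAME_MAX 14     /* per-component limit of earlier systems */

/* Copy name into out, silently truncated to OLD_NAME_MAX characters. */
void
old_truncate(const char *name, char out[OLD_NAME_MAX + 1])
{
    strncpy(out, name, OLD_NAME_MAX);
    out[OLD_NAME_MAX] = '\0';
}

/* Return 1 if two filenames become identical after truncation. */
int
names_collide(const char *a, const char *b)
{
    char ta[OLD_NAME_MAX + 1], tb[OLD_NAME_MAX + 1];

    old_truncate(a, ta);
    old_truncate(b, tb);
    return (strcmp(ta, tb) == 0);
}
```

The names multiplexer.c and multiplexer.ms fit within the limit and stay distinct, but with the two-character prefix, names_collide("s.multiplexer.c", "s.multiplexer.ms") returns 1.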
The operations defined for local filesystems are divided into two parts. Common
to all local filesystems are hierarchical naming, locking, quotas, attribute
management, and protection. These features are independent of how the data will
be stored. 4.4BSD has a single implementation to provide these semantics.
The other part of the local filesystem is the organization and management of
the data on the storage media. Laying out the contents of files on the storage
media is the responsibility of the filestore. 4.4BSD supports three different
filestore layouts:
• The traditional Berkeley Fast Filesystem
• The log-structured filesystem, based on the Sprite operating-system design
[Rosenblum & Ousterhout, 1992]
• A memory-based filesystem
Although the organizations of these filestores are completely different, these
differences are indistinguishable to the processes using the filestores.
The Fast Filesystem organizes data into cylinder groups. Files that are likely
to be accessed together, based on their locations in the filesystem hierarchy,
are stored in the same cylinder group. Files that are not expected to be
accessed together are moved into different cylinder groups. Thus, files written
at the same time may be placed far apart on the disk.
The log-structured filesystem organizes data as a log. All data being written
at any point in time are gathered together, and are written at the same disk
location. Data are never overwritten; instead, a new copy of the file is
written that replaces the old one. The old files are reclaimed by a
garbage-collection process that runs when the filesystem becomes full and
additional free space is needed.
The memory-based filesystem is designed to store data in virtual memory. It is
used for filesystems that need to support fast but temporary data, such as
/tmp. The goal of the memory-based filesystem is to keep the storage packed as
compactly as possible to minimize the usage of virtual-memory resources.
Network Filesystem
Initially, networking was used to transfer data from one machine to another.
Later, it evolved to allowing users to log in remotely to another machine. The
next logical step was to bring the data to the user, instead of having the user
go to the data, and network filesystems were born. Users working locally do not
experience the network delays on each keystroke, so they have a more responsive
environment.
Bringing the filesystem to a local machine was among the first of the major
client-server applications. The server is the remote machine that exports one
or more of its filesystems. The client is the local machine that imports those
filesystems. From the local client's point of view, a remotely mounted
filesystem appears in the file-tree name space just like any other locally
mounted filesystem. Local clients can change into directories on the remote
filesystem, and can read, write, and execute binaries within that remote
filesystem identically to the way that they can do these operations on a local
filesystem.
When the local client does an operation on a remote filesystem, the request is
packaged and is sent to the server. The server does the requested operation and
returns either the requested information or an error indicating why the request
was denied. To get reasonable performance, the client must cache frequently
accessed data. The complexity of remote filesystems lies in maintaining cache
consistency between the server and its many clients.
Although many remote-filesystem protocols have been developed over the years,
the most pervasive one in use among UNIX systems is the Network Filesystem
(NFS), whose protocol and most widely used implementation were done by Sun
Microsystems. The 4.4BSD kernel supports the NFS protocol, although the
implementation was done independently from the protocol specification
[Macklem, 1994]. The NFS protocol is described in Chapter 9.
Terminals support the standard system I/O operations, as well as a collection
of terminal-specific operations to control input-character editing and output
delays. At the lowest level are the terminal device drivers that control the
hardware terminal ports. Terminal input is handled according to the underlying
communication characteristics, such as baud rate, and according to a set of
software-controllable parameters, such as parity checking.
Layered above the terminal device drivers are line disciplines that provide
various degrees of character processing. The default line discipline is
selected when a port is being used for an interactive login. The line
discipline is run in canonical mode; input is processed to provide standard
line-oriented editing functions, and input is presented to a process on a
line-by-line basis.
Screen editors and programs that communicate with other computers generally run
in noncanonical mode (also commonly referred to as raw mode or
character-at-a-time mode). In this mode, input is passed through to the reading
process immediately and without interpretation. All special-character input
processing is disabled, no erase or other line-editing processing is done, and
all characters are passed to the program that is reading from the terminal.
It is possible to configure the terminal in thousands of combinations between
these two extremes. For example, a screen editor that wanted to receive user
interrupts asynchronously might enable the special characters that generate
signals and enable output flow control, but otherwise run in noncanonical mode;
all other characters would be passed through to the process uninterpreted.
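The intermediate configuration described for the screen editor can be sketched with the termios interface. The function name editor_mode is hypothetical, and the choices to disable echoing and to return each read after a single character are assumptions typical of screen editors, not requirements stated above.

```c
#include <termios.h>

/*
 * Adjust a set of terminal parameters for a screen editor:
 * noncanonical input, but with the signal-generating special
 * characters and output flow control still enabled.
 */
void
editor_mode(struct termios *t)
{
    t->c_lflag &= ~(ICANON | ECHO);     /* no line editing; no echo (assumed) */
    t->c_lflag |= ISIG;                 /* special characters generate signals */
    t->c_iflag |= IXON;                 /* start/stop output flow control */
    t->c_cc[VMIN] = 1;                  /* read returns after one character */
    t->c_cc[VTIME] = 0;                 /* no interbyte timer */
}
```

A program would normally fetch the current parameters with tcgetattr, save a copy, apply editor_mode, install the result with tcsetattr, and restore the saved copy before exiting.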
On output, the terminal handler provides simple formatting services, including
• Converting the line-feed character to the two-character
carriage-return-line-feed sequence
• Inserting delays after certain standard control characters
• Expanding tabs
• Displaying echoed nongraphic ASCII characters as a two-character sequence of
the form ^C (i.e., the ASCII caret character followed by the ASCII character
that is the character's value offset from the ASCII "@" character)
Each of these formatting services can be disabled individually by a process
through control requests.
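The last of these services is simple arithmetic on character codes. The following sketch (the function name caret_form is hypothetical) shows the offset from "@"; the treatment of the delete character as ^? is the usual convention, included here as an addition to the rule stated above.

```c
/*
 * Store in out the form in which the ASCII character c is echoed:
 * control characters become a caret followed by the character whose
 * code is c offset from '@'; other characters are unchanged.
 * out must hold at least 3 bytes.
 */
void
caret_form(unsigned char c, char out[3])
{
    if (c < 0x20) {                     /* nongraphic control character */
        out[0] = '^';
        out[1] = (char)('@' + c);       /* e.g., 3 (control-C) gives 'C' */
        out[2] = '\0';
    } else if (c == 0x7f) {             /* delete, by convention "^?" */
        out[0] = '^';
        out[1] = '?';
        out[2] = '\0';
    } else {
        out[0] = (char)c;
        out[1] = '\0';
    }
}
```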
2.11
Interprocess Communication
Interprocess communication in 4.4BSD is organized in communication domains.
Domains currently supported include the local domain, for communication between
processes executing on the same machine; the internet domain, for communication
between processes using the TCP/IP protocol suite (perhaps within the
Internet); the ISO/OSI protocol family for communication between sites required
to run them; and the XNS domain, for communication between processes using the
XEROX Network Systems (XNS) protocols.
Within a domain, communication takes place between communication endpoints
known as sockets. As mentioned in Section 2.6, the socket system call creates a
socket and returns a descriptor; other IPC system calls are described in
Chapter 11. Each socket has a type that defines its communications semantics;
these semantics include properties such as reliability, ordering, and
prevention of duplication of messages.
Each socket has associated with it a communication protocol. This protocol
provides the semantics required by the socket according to the latter's type.
Applications may request a specific protocol when creating a socket, or may
allow the system to select a protocol that is appropriate for the type of
socket being created.
Sockets may have addresses bound to them. The form and meaning of socket
addresses are dependent on the communication domain in which the socket is
created. Binding a name to a socket in the local domain causes a file to be
created in the filesystem.
Normal data transmitted and received through sockets are untyped.
Data-representation issues are the responsibility of libraries built on top of
the interprocess-communication facilities. In addition to transporting normal
data, communication domains may support the transmission and reception of
specially typed data, termed access rights. The local domain, for example, uses
this facility to pass descriptors between processes.
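Passing a descriptor can be sketched with sendmsg and recvmsg and a control message of type SCM_RIGHTS. This is one possible formulation using the POSIX form of the interface; the names send_fd, recv_fd, and union fd_ctl are hypothetical.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

/* Control buffer sized and aligned to carry one descriptor. */
union fd_ctl {
    struct cmsghdr hdr;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor fd over the local-domain socket sock; 0 or -1. */
int
send_fd(int sock, int fd)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cm;
    union fd_ctl ctl;
    char byte = 0;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    iov.iov_base = &byte;               /* one byte of normal data */
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;         /* the access rights */
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));
    return (sendmsg(sock, &msg, 0) == 1 ? 0 : -1);
}

/* Receive a descriptor from sock; returns the new descriptor or -1. */
int
recv_fd(int sock)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cm;
    union fd_ctl ctl;
    char byte;
    int fd;

    memset(&msg, 0, sizeof(msg));
    iov.iov_base = &byte;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    if (recvmsg(sock, &msg, 0) != 1)
        return (-1);
    cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL || cm->cmsg_level != SOL_SOCKET ||
        cm->cmsg_type != SCM_RIGHTS ||
        cm->cmsg_len != CMSG_LEN(sizeof(int)))
        return (-1);
    memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return (fd);
}
```

The receiver obtains a new descriptor, usually with a different number, that refers to the same open file as the descriptor that was sent.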
Networking implementations on UNIX before 4.2BSD usually worked by overloading
the character-device interfaces. One goal of the socket interface was for naive
programs to be able to work without change on stream-style connections. Such
programs can work only if the read and write system calls are unchanged.
Consequently, the original interfaces were left intact, and were made to work
on stream-type sockets. A new interface was added for more complicated sockets,
such as those used to send datagrams, with which a destination address must be
presented with each send call.
Another benefit is that the new interface is highly portable. Shortly after a
test release was available from Berkeley, the socket interface had been ported
to System III by a UNIX vendor (although AT&T did not support the socket
interface until the release of System V Release 4, deciding instead to use the
Eighth Edition stream mechanism). The socket interface was also ported to run
in many Ethernet boards by vendors, such as Excelan and Interlan, that were
selling into the PC market, where the machines were too small to run networking
in the main processor. More recently, the socket interface was used as the
basis for Microsoft's Winsock networking interface for Windows.
Network Communication
Some of the communication domains supported by the socket IPC mechanism provide
access to network protocols. These protocols are implemented as a separate
software layer logically below the socket software in the kernel. The kernel
provides many ancillary services, such as buffer management, message routing,
standardized interfaces to the protocols, and interfaces to the network
interface drivers for the use of the various network protocols.
At the time that 4.2BSD was being implemented, there were many networking
protocols in use or under development, each with its own strengths and
weaknesses. There was no clearly superior protocol or protocol suite. By
supporting multiple protocols, 4.2BSD could provide interoperability and
resource sharing among the diverse set of machines that was available in the
Berkeley environment.
Multiple-protocol support also provides for future changes. Today's protocols
designed for 10- to 100-Mbit-per-second Ethernets are likely to be inadequate
for tomorrow's 1- to 10-Gbit-per-second fiber-optic networks. Consequently, the
network-communication layer is designed to support multiple protocols. New
protocols are added to the kernel without the support for older protocols being
affected. Older applications can continue to operate using the old protocol
over the same physical network as is used by newer applications running with a
newer network protocol.
Network Implementation
The first protocol suite implemented in 4.2BSD was DARPA's Transmission Control
Protocol/Internet Protocol (TCP/IP). The CSRG chose TCP/IP as the first network
to incorporate into the socket IPC framework, because a 4.1BSD-based
implementation was publicly available from a DARPA-sponsored project at Bolt,
Beranek, and Newman (BBN). That was an influential choice: The 4.2BSD
implementation is the main reason for the extremely widespread use of this
protocol suite. Later performance and capability improvements to the TCP/IP
implementation have also been widely adopted. The TCP/IP implementation is
described in detail in Chapter 1 3 .
The release of 4.3BSD added the Xerox Network Systems (XNS) protocol
suite, partly building on work done at the University of Maryland and at Cornell
University. This suite was needed to connect isolated machines that could not
communicate using TCP/IP.
The release of 4.4BSD added the ISO protocol suite because of the latter's
increasing visibility both within and outside the United States. Because of the
somewhat different semantics defined for the ISO protocols, some minor changes
were required in the socket interface to accommodate these semantics. The
changes were made such that they were invisible to clients of other existing proto­
cols. The ISO protocols also required extensive addition to the two-level routing
tables provided by the kernel in 4.3BSD. The greatly expanded routing capabilities
of 4.4BSD include arbitrary levels of routing with variable-length addresses
and network masks.
System Operation
Bootstrapping mechanisms are used to start the system running. First, the 4.4BSD
kernel must be loaded into the main memory of the processor. Once loaded, it
must go through an initialization phase to set the hardware into a known state.
Next, the kernel must do autoconfiguration, a process that finds and configures the
peripherals that are attached to the processor. The system begins running in sin­
gle-user mode while a start-up script does disk checks and starts the accounting
and quota checking. Finally, the start-up script starts the general system services
and brings up the system to full multiuser operation.
During multiuser operation, processes wait for login requests on the terminal
lines and network ports that have been configured for user access. When a login
request is detected, a login process is spawned and user validation is done. When
the login validation is successful, a login shell is created from which the user can
run additional processes.
Exercises
2.1 How does a user process request a service from the kernel?
2.2 How are data transferred between a process and the kernel? What
alternatives are available?
2.3 How does a process access an I/O stream? List three types of I/O streams.
2.4 What are the four steps in the lifecycle of a process?
2.5 Why are process groups provided in 4.3BSD?
2.6 Describe four machine-dependent functions of the kernel.
2.7 Describe the difference between an absolute and a relative pathname.
2.8 Give three reasons why the mkdir system call was added to 4.2BSD.
2.9 Define scatter-gather I/O. Why is it useful?
2.10 What is the difference between a block and a character device?
2.11 List five functions provided by a terminal driver.
2.12 What is the difference between a pipe and a socket?
2.13 Describe how to create a group of processes in a pipeline.
*2.14 List the three system calls that were required to create a new directory
foo in the current directory before the addition of the mkdir system call.
*2.15 Explain the difference between interprocess communication and
networking.
References
Accetta et al, 1986.
M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, & M.
Young, "Mach: A New Kernel Foundation for UNIX Development," USENIX
Association Conference Proceedings, pp. 93-113, June 1986.
Cheriton, 1988.
D. R. Cheriton, "The V Distributed System," Comm ACM, vol. 31, no. 3,
pp. 314-333, March 1988.
Ewens et al, 1985.
P. Ewens, D. R. Blythe, M. Funkenhauser, & R. C. Holt, "Tunis: A Distributed
Multiprocessor Operating System," USENIX Association Conference Proceedings,
pp. 247-254, June 1985.
Gingell et al, 1987.
R. Gingell, J. Moran, & W. Shannon, "Virtual Memory Architecture in SunOS,"
USENIX Association Conference Proceedings, pp. 81-94, June 1987.
Kernighan & Pike, 1984.
B. W. Kernighan & R. Pike, The UNIX Programming Environment, Prentice-Hall,
Englewood Cliffs, NJ, 1984.
Macklem, 1994.
R. Macklem, "The 4.4BSD NFS Implementation," in 4.4BSD System Manager's
Manual, pp. 6:1-14, O'Reilly & Associates, Inc., Sebastopol, CA, 1994.
McKusick & Karels, 1988.
M. K. McKusick & M. J. Karels, "Design of a General Purpose Memory Allocator
for the 4.3BSD UNIX Kernel," USENIX Association Conference Proceedings,
pp. 295-304, June 1988.
McKusick, Karels et al, 1994.
M. K. McKusick, M. J. Karels, S. J. Leffler, W. N. Joy, & R. S. Fabry,
"Berkeley Software Architecture Manual, 4.4BSD Edition," in 4.4BSD
Programmer's Supplementary Documents, pp. 5:1-42, O'Reilly & Associates,
Inc., Sebastopol, CA, 1994.
Ritchie, 1988.
D. M. Ritchie, "Early Kernel Design," private communication, March 1988.
Rosenblum & Ousterhout, 1992.
M. Rosenblum & J. Ousterhout, "The Design and Implementation of a
Log-Structured File System," ACM Transactions on Computer Systems, vol. 10,
no. 1, pp. 26-52, Association for Computing Machinery, February 1992.
Rozier et al, 1988.
M. Rozier, V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont, F.
Herrmann, C. Kaiser, S. Langlois, P. Leonard, & W. Neuhauser, "Chorus
Distributed Operating Systems," USENIX Computing Systems, vol. 1, no. 4,
pp. 305-370, Fall 1988.
Tevanian, 1987.
A. Tevanian, "Architecture-Independent Virtual Memory Management for Parallel
and Distributed Environments: The Mach Approach," Technical Report
CMU-CS-88-106, Department of Computer Science, Carnegie-Mellon University,
Pittsburgh, PA, December 1987.
Kernel Services
Kernel Organization
The 4.4BSD kernel can be viewed as a service provider to user processes.
Processes usually access these services through system calls. Some services,
such as process scheduling and memory management, are implemented as processes
that execute in kernel mode or as routines that execute periodically within the
kernel. In this chapter, we describe how kernel services are provided to user
processes, and what some of the ancillary processing performed by the kernel
is. Then, we describe the basic kernel services provided by 4.4BSD, and provide
details of their implementation.
System Processes
All 4.4BSD processes originate from a single process that is crafted by the
kernel at startup. Three processes are created immediately and exist always.
Two of them are kernel processes, and function wholly within the kernel.
(Kernel processes execute code that is compiled into the kernel's load image
and operate with the kernel's privileged execution mode.) The third is the
first process to execute a program in user mode; it serves as the parent
process for all subsequent processes.
The two kernel processes are the swapper and the pagedaemon. The swapper
(historically, process 0) is responsible for scheduling the transfer of whole
processes between main memory and secondary storage when system resources are
low. The pagedaemon (historically, process 2) is responsible for writing parts
of the address space of a process to secondary storage in support of the paging
facilities of the virtual-memory system. The third process is the init process
(historically, process 1). This process performs administrative tasks, such as
spawning getty processes for each terminal on a machine and handling the
orderly shutdown of a system from multiuser to single-user operation. The init
process is a user-mode process, running outside the kernel (see Section 14.6).
Chapter 3
Kernel Services
System Entry
Entrances into the kernel can be categorized according to the event or action
that initiates them:
Hardware interrupt
Hardware trap
Software-initiated trap
Hardware interrupts arise from external events, such as an I/O device needing
attention or a clock reporting the passage of time. (For example, the kernel
depends on the presence of a real-time clock or interval timer to maintain the
current time of day, to drive process scheduling, and to initiate the execution
of system timeout functions.) Hardware interrupts occur asynchronously and may
not relate to the context of the currently executing process.
Hardware traps may be either synchronous or asynchronous, but are related to
the current executing process. Examples of hardware traps are those generated
as a result of an illegal arithmetic operation, such as divide by zero.
Software-initiated traps are used by the system to force the scheduling of an
event, such as process rescheduling or network processing, as soon as is
possible. For most uses of software-initiated traps, it is an implementation
detail whether they are implemented as a hardware-generated interrupt, or as a
flag that is checked whenever the priority level drops (e.g., on every exit
from the kernel). An example of hardware support for software-initiated traps
is the asynchronous system trap (AST) provided by the VAX architecture. An AST
is posted by the kernel. Then, when a return-from-interrupt instruction drops
the interrupt-priority level below a threshold, an AST interrupt will be
delivered. Most architectures today do not have hardware support for ASTs, so
they must implement ASTs in software.
System calls are a special case of a software-initiated trap: the machine
instruction used to initiate a system call typically causes a hardware trap
that is handled specially by the kernel.
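From a C program, the trap instruction is normally hidden inside a small library stub for each system call. As a hedged illustration, many UNIX systems also provide a generic syscall library routine that takes the system-call number as its first argument; the sketch below assumes that this routine and the constant SYS_getpid are available on the system at hand. The function name raw_getpid is hypothetical.

```c
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Request the process identifier by system-call number, rather than
 * through the getpid() library stub.  Either route ends in the same
 * trap into the kernel.
 */
long
raw_getpid(void)
{
    return ((long)syscall(SYS_getpid));
}
```

The value returned should equal that returned by getpid.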
Run-Time Organization
The kernel can be logically divided into a top half and a bottom half, as shown
in Fig. 3.1. The top half of the kernel provides services to processes in
response to system calls or traps. This software can be thought of as a library
of routines shared by all processes. The top half of the kernel executes in a
privileged execution mode, in which it has access both to kernel data
structures and to the context of user-level processes. The context of each
process is contained in two areas of memory reserved for process-specific
information. The first of these areas is the process structure, which has
historically contained the information that is necessary even if the process
has been swapped out. In 4.4BSD, this information includes the identifiers
associated with the process, the process's rights and privileges, its
descriptors, its memory map, pending external events and associated actions,
maximum and current resource utilization, and many other things. The second is
the user structure, which has historically contained the information that is
not necessary when the process is swapped out. In 4.4BSD, the user-structure
information of each process includes the hardware process control block (PCB),
process accounting and statistics, and minor additional information for
debugging and creating a core dump. Deciding what was to be stored in the
process structure and the user structure was far more important in previous
systems than it was in 4.4BSD. As memory became a less limited resource, most
of the user structure was merged into the process structure for convenience;
see Section 4.2.
user process           Preemptive scheduling; cannot block; runs on user
                       stack in user address space
top half of kernel     Runs until blocked or done. Can block to wait for a
                       resource; runs on per-process kernel stack
bottom half of kernel  Never scheduled, cannot block. Runs on kernel
                       stack in kernel address space.
Figure 3.1 Run-time structure of the kernel.
The bottom half of the kernel comprises routines that are invoked to handle
hardware interrupts. The kernel requires that hardware facilities be available
to block the delivery of interrupts. Improved performance is available if the
hardware facilities allow interrupts to be defined in order of priority.
Whereas the HP300 provides distinct hardware priority levels for different
kinds of interrupts, UNIX also runs on architectures such as the Perkin Elmer,
where interrupts are all at the same priority, or the ELXSI, where there are no
interrupts in the traditional sense.
Activities in the bottom half of the kernel are asynchronous, with respect to
the top half, and the software cannot depend on having a specific (or any)
process running when an interrupt occurs. Thus, the state information for the
process that initiated the activity is not available. (Activities in the bottom
half of the kernel are synchronous with respect to the interrupt source.) The
top and bottom halves of the kernel communicate through data structures,
generally organized around work queues.
The 4.4BSD kernel is never preempted to run another process while executing
in the top half of the kernel (for example, while executing a system call),
although it will explicitly give up the processor if it must wait for an event or for a
shared resource. Its execution may be interrupted, however, by interrupts for the
bottom half of the kernel. The bottom half always begins running at a specific
priority level. Therefore, the top half can block these interrupts by setting the
processor priority level to an appropriate value. The value is chosen based on the
priority level of the device that shares the data structures that the top half is about to
modify. This mechanism ensures the consistency of the work queues and other
data structures shared between the top and bottom halves.
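The priority-level mechanism described above can be illustrated with a small user-space model. This is only a sketch: the names spl_raise(), spl_restore(), interrupt_deliverable(), top_half_enqueue(), and the value of IPL_NET are hypothetical, and the processor priority level is modeled as a plain integer rather than as a hardware register.

```c
#include <assert.h>

/* Hypothetical model: the processor priority level is a plain integer. */
static int cur_ipl = 0;

#define IPL_NET 3   /* illustrative level for a network device; not a real value */

/* Raise the priority level to at least `level`; return the previous level. */
static int spl_raise(int level)
{
    int old = cur_ipl;

    if (level > cur_ipl)
        cur_ipl = level;
    return old;
}

/* Restore a priority level previously returned by spl_raise(). */
static void spl_restore(int old)
{
    cur_ipl = old;
}

/* An interrupt is delivered only if its level exceeds the current level. */
static int interrupt_deliverable(int level)
{
    return level > cur_ipl;
}

static int queue_len = 0;   /* stands in for a work queue shared with a handler */

/* Top-half code that modifies a queue shared with the network handler. */
static void top_half_enqueue(void)
{
    int s = spl_raise(IPL_NET);          /* block the handler sharing the queue */

    assert(!interrupt_deliverable(IPL_NET));
    queue_len++;                         /* critical section */
    spl_restore(s);                      /* return to the previous level */
}
```

In this model, the top half saves the old level, raises the level for the duration of the update, and then restores the saved level, so nested critical sections unwind correctly.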
Processes cooperate in the sharing of system resources, such as the CPU. The
top and bottom halves of the kernel also work together in implementing certain
system operations, such as I/O. Typically, the top half will start an I/O operation,
then relinquish the processor; then the requesting process will sleep, awaiting
notification from the bottom half that the I/O request has completed.
Entry to the Kernel
When a process enters the kernel through a trap or an interrupt, the kernel must
save the current machine state before it begins to service the event. For the HP300,
the machine state that must be saved includes the program counter, the user stack
pointer, the general-purpose registers, and the processor status longword. The
HP300 trap instruction saves the program counter and the processor status
longword as part of the exception stack frame; the user stack pointer and registers must
be saved by the software trap handler. If the machine state were not fully saved,
the kernel could change values in the currently executing program in improper
ways. Since interrupts may occur between any two user-level instructions (and,
on some architectures, between parts of a single instruction), and because they
may be completely unrelated to the currently executing process, an incompletely
saved state could cause correct programs to fail in mysterious and not easily
reproducible ways.
The exact sequence of events required to save the process state is completely
machine dependent, although the HP300 provides a good example of the general
procedure. A trap or system call will trigger the following events:
• The hardware switches into kernel (supervisor) mode, so that memory-access
checks are made with kernel privileges, references to the stack pointer use the
kernel's stack pointer, and privileged instructions can be executed.
• The hardware pushes onto the per-process kernel stack the program counter,
processor status longword, and information describing the type of trap. (On
architectures other than the HP300, this information can include the system-call
number and general-purpose registers as well.)
• An assembly-language routine saves all state information not saved by the
hardware. On the HP300, this information includes the general-purpose registers and
the user stack pointer, also saved onto the per-process kernel stack.
After this preliminary state saving, the kernel calls a C routine that can freely use
the general-purpose registers as any other C routine would, without concern about
changing the unsuspecting process's state.
There are three major kinds of handlers, corresponding to particular kernel
entries:
1. syscall() for a system call
2. trap() for hardware traps and for software-initiated traps other than system calls
3. The appropriate device-driver interrupt handler for a hardware interrupt
Each type of handler takes its own specific set of parameters. For a system call,
they are the system-call number and an exception frame. For a trap, they are the
type of trap, the relevant floating-point and virtual-address information related to
the trap, and an exception frame. (The exception-frame arguments for the trap and
system call are not the same. The HP300 hardware saves different information
based on different types of traps.) For a hardware interrupt, the only parameter is
a unit (or board) number.
Return from the Kernel
When the handling of the system entry is completed, the user-process state is
restored, and the kernel returns to the user process. Returning to the user process
reverses the process of entering the kernel.
1. An assembly-language routine restores the general-purpose registers and
user-stack pointer previously pushed onto the stack.
2. The hardware restores the program counter and program status longword, and
switches to user mode, so that future references to the stack pointer use the
user's stack pointer, privileged instructions cannot be executed, and
memory-access checks are done with user-level privileges.
3. Execution then resumes at the next instruction in the user's process.
System Calls
The most frequent trap into the kernel (after clock processing) is a request to do a
system call. System performance requires that the kernel minimize the overhead
in fielding and dispatching a system call. The system-call handler must do the
following work:
• Verify that the parameters to the system call are located at a valid user address,
and copy them from the user's address space into the kernel
• Call a kernel routine that implements the system call
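The two steps listed above can be sketched as a small dispatcher. This is a simplified model under stated assumptions, not the 4.4BSD code: the user address space is modeled as one flat array (user_mem), a user address is an index into it, and copyin_args(), sys_add(), dispatch(), and the one-entry table are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAXARGS 8

/* Hypothetical model of a user address space: one flat array of ints. */
static int user_mem[64];

/* Validate the user address range, then copy the arguments into the kernel. */
static int copyin_args(size_t uaddr, int *kargs, size_t nargs)
{
    size_t limit = sizeof(user_mem) / sizeof(user_mem[0]);

    if (nargs > MAXARGS || uaddr > limit || nargs > limit - uaddr)
        return EFAULT;          /* arguments are not at a valid user address */
    memcpy(kargs, &user_mem[uaddr], nargs * sizeof(int));
    return 0;
}

/* A routine implementing a system call returns 0 or an error number. */
typedef int (*sy_call_t)(const int *args, int *retval);

struct sysent {
    size_t    narg;             /* number of arguments to copy in */
    sy_call_t call;             /* routine that implements the call */
};

/* Illustrative system call: add two integers. */
static int sys_add(const int *args, int *retval)
{
    *retval = args[0] + args[1];
    return 0;
}

static const struct sysent table[] = { { 2, sys_add } };

/* Step 1: verify and copy the parameters. Step 2: call the routine. */
static int dispatch(size_t code, size_t uaddr, int *retval)
{
    int kargs[MAXARGS];
    int error;

    assert(retval != NULL);
    if (code >= sizeof(table) / sizeof(table[0]))
        return ENOSYS;
    error = copyin_args(uaddr, kargs, table[code].narg);
    if (error != 0)
        return error;
    return table[code].call(kargs, retval);
}
```

Because the routine works only on the kernel copy of its arguments, a bad user address is detected once, at copy time, rather than inside every system-call implementation.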
Result Handling
Eventually, the system call returns to the calling process, either successfully or
unsuccessfully. On the HP300 architecture, success or failure is returned as the
carry bit in the user process's program status longword: If it is zero, the return was
successful; otherwise, it was unsuccessful. On the HP300 and many other
machines, return values of C functions are passed back through a general-purpose
register (for the HP300, data register 0). The routines in the kernel that implement
system calls return the values that are normally associated with the global variable
errno. After a system call, the kernel system-call handler leaves this value in the
register. If the system call failed, a C library routine moves that value into errno,
and sets the return register to -1. The calling process is expected to notice the
value of the return register, and then to examine errno. The mechanism involving
the carry bit and the global variable errno exists for historical reasons derived
from the PDP-11.
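The convention described above can be modeled in a few lines of C. This is a sketch only: struct trap_result, kernel_return(), and libc_stub() are hypothetical names, and the carry bit and data register 0 are represented as ordinary struct fields rather than as machine state.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical snapshot of the state handed back to the user process. */
struct trap_result {
    int carry;   /* 0 = success, nonzero = failure */
    int d0;      /* return value on success, error number on failure */
};

/* Kernel side: turn a routine's error code and value into carry + register. */
static struct trap_result kernel_return(int error, int value)
{
    struct trap_result r;

    assert(error >= 0);
    r.carry = (error != 0);
    r.d0 = error ? error : value;
    return r;
}

/* User side: what a C library stub does after the trap returns. */
static int libc_stub(struct trap_result r)
{
    if (r.carry) {
        errno = r.d0;    /* move the error number into errno */
        return -1;       /* and report failure to the caller */
    }
    return r.d0;
}
```

A caller therefore checks the return value first and consults errno only when that value is -1, because errno is left unchanged on success.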
There are two kinds of unsuccessful returns from a system call: those where
kernel routines discover an error, and those where a system call is interrupted.
The most common case is a system call that is interrupted when it has relinquished
the processor to wait for an event that may not occur for a long time (such as
terminal input), and a signal arrives in the interim. When signal handlers are
initialized by a process, they specify whether system calls that they interrupt should be
restarted, or whether the system call should return with an interrupted system call
(EINTR) error.
When a system call is interrupted, the signal is delivered to the process. If the
process has requested that the signal abort the system call, the handler then returns
an error, as described previously. If the system call is to be restarted, however, the
handler resets the process's program counter to the machine instruction that
caused the system-call trap into the kernel. (This calculation is necessary because
the program-counter value that was saved when the system-call trap was done is
for the instruction after the trap-causing instruction.) The handler replaces the
saved program-counter value with this address. When the process returns from
the signal handler, it resumes at the program-counter value that the handler
provided, and reexecutes the same system call.
Restarting a system call by resetting the program counter has certain
implications. First, the kernel must not modify any of the input parameters in the process
address space (it can modify the kernel copy of the parameters that it makes).
Second, it must ensure that the system call has not performed any actions that
cannot be repeated. For example, in the current system, if any characters have been
read from the terminal, the read must return with a short count. Otherwise, if the
call were to be restarted, the already-read bytes would be lost.
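The three outcomes discussed above (short count, restart, or EINTR) can be sketched as a decision over a saved exception frame. The sketch makes several assumptions that are not from the text: struct frame and interrupted_syscall() are hypothetical, the trap instruction is assumed to occupy TRAP_INSN_SIZE bytes, and partial work is summarized as a single count.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical saved user state for an interrupted system call. */
struct frame {
    unsigned long pc;   /* saved program counter (instruction after the trap) */
    int carry;          /* failure indication returned to the process */
    int d0;             /* return register */
};

#define TRAP_INSN_SIZE 2UL   /* assumed size of the trap instruction, in bytes */

/*
 * Decide how an interrupted system call completes.
 * restart: nonzero if the signal handler asked for restart.
 * partial: amount of work already done that cannot be repeated.
 */
static void interrupted_syscall(struct frame *f, int restart, long partial)
{
    assert(f->pc >= TRAP_INSN_SIZE);
    if (partial > 0) {
        /* Work was done: return a short count rather than lose it. */
        f->carry = 0;
        f->d0 = (int)partial;
    } else if (restart) {
        /* Back the program counter up so the trap is reexecuted. */
        f->pc -= TRAP_INSN_SIZE;
    } else {
        /* Abort the call with an interrupted-system-call error. */
        f->carry = 1;
        f->d0 = EINTR;
    }
}
```

The ordering of the tests reflects the second implication in the text: a call that has already consumed input must not be restarted, whatever the signal handler requested.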
Returning from a System Call
While the system call is running, a signal may be posted to the process, or another
process may attain a higher scheduling priority. After the system call completes,
the handler checks to see whether either event has occurred.
The handler first checks for a posted signal. Such signals include signals that
interrupted the system call, as well as signals that arrived while a system call was
in progress, but were held pending until the system call completed. Signals that
are ignored, by default or by explicit programmatic request, are never posted to
the process. Signals with a default action have that action taken before the process
runs again (i.e., the process may be stopped or terminated as appropriate). If a
signal is to be caught (and is not currently blocked), the handler arranges to have
the appropriate signal handler called, rather than to have the process return
directly from the system call. After the handler returns, the process will resume
execution at system-call return (or system-call execution, if the system call is
being restarted).
After checking for posted signals, the handler checks to see whether any
process has a priority higher than that of the currently running one. If such a
process exists, the handler calls the context-switch routine to cause the
higher-priority process to run. At a later time, the current process will again have the
highest priority, and will resume execution by returning from the system call to
the user process.
If a process has requested that the system do profiling, the handler also
calculates the amount of time that has been spent in the system call, i.e., the system
time accounted to the process between the latter's entry into and exit from the
handler. This time is charged to the routine in the user's process that made the
system call.
Traps and Interrupts
Traps, like system calls, occur synchronously for a process. Traps normally occur
because of unintentional errors, such as division by zero or indirection through an
invalid pointer. The process becomes aware of the problem either by catching a
signal or by being terminated. Traps can also occur because of a page fault, in
which case the system makes the page available and restarts the process without
the process being aware that the fault occurred.
The trap handler is invoked like the system-call handler. First, the process
state is saved. Next, the trap handler determines the trap type, then arranges to post
a signal or to cause a pagein as appropriate. Finally, it checks for pending signals
and higher-priority processes, and exits identically to the system-call handler.
I/O Device Interrupts
Interrupts from I/O and other devices are handled by interrupt routines that are
loaded as part of the kernel's address space. These routines handle the console
terminal interface, one or more clocks, and several software-initiated interrupts
used by the system for low-priority clock processing and for networking facilities.
Unlike traps and system calls, device interrupts occur asynchronously. The
process that requested the service is unlikely to be the currently running process,
and may no longer exist! The process that started the operation will be notified
that the operation has finished when that process runs again. As occurs with traps
and system calls, the entire machine state must be saved, since any changes could
cause errors in the currently running process.
Device-interrupt handlers run only on demand, and are never scheduled by the
kernel. Unlike system calls, interrupt handlers do not have a per-process context.
Interrupt handlers cannot use any of the context of the currently running process
(e.g., the process's user structure). The stack normally used by the kernel is part
of a process context. On some systems (e.g., the HP300), the interrupts are caught
on the per-process kernel stack of whichever process happens to be running. This
approach requires that all the per-process kernel stacks be large enough to handle
the deepest possible nesting caused by a system call and one or more interrupts,
and that a per-process kernel stack always be available, even when a process is not
running. Other architectures (e.g., the VAX) provide a systemwide interrupt stack
that is used solely for device interrupts. This architecture allows the per-process
kernel stacks to be sized based on only the requirements for handling a
synchronous trap or system call. Regardless of the implementation, when an interrupt
occurs, the system must switch to the correct stack (either explicitly, or as part of
the hardware exception handling) before it begins to handle the interrupt.
The interrupt handler can never use the stack to save state between
invocations. An interrupt handler must get all the information that it needs from the data
structures that it shares with the top half of the kernel (generally, its global work
queue). Similarly, all information provided to the top half of the kernel by the
interrupt handler must be communicated the same way. In addition, because
4.4BSD requires a per-process context for a thread of control to sleep, an interrupt
handler cannot relinquish the processor to wait for resources, but rather must
always run to completion.
Software Interrupts
Many events in the kernel are driven by hardware interrupts. For high-speed
devices such as network controllers, these interrupts occur at a high priority. A
network controller must quickly acknowledge receipt of a packet and reenable the
controller to accept more packets to avoid losing closely spaced packets.
However, the further processing of passing the packet to the receiving process,
although time consuming, does not need to be done quickly. Thus, a lower
priority is possible for the further processing, so critical operations will not be blocked
from executing longer than necessary.
The mechanism for doing lower-priority processing is called a software
interrupt. Typically, a high-priority interrupt creates a queue of work to be done at a
lower-priority level. After queueing of the work request, the high-priority interrupt
arranges for the processing of the request to be run at a lower-priority level. When
the machine priority drops below that lower priority, an interrupt is generated that
calls the requested function. If a higher-priority interrupt comes in during request
processing, that processing will be preempted like any other low-priority task. On
some architectures, the interrupts are true hardware traps caused by software
instructions. Other architectures implement the same functionality by monitoring
flags set by the interrupt handler at appropriate times and calling the
request-processing functions directly.
The delivery of network packets to destination processes is handled by a
packet-processing function that runs at low priority. As packets come in, they are
put onto a work queue, and the controller is immediately reenabled. Between
packet arrivals, the packet-processing function works to deliver the packets. Thus,
the controller can accept new packets without having to wait for the previous
packet to be delivered. In addition to network processing, software interrupts are
used to handle time-related events and process rescheduling.
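The split between a high-priority handler and a low-priority software interrupt can be sketched with the flag-based variant mentioned above. Every name here (net_hard_intr(), net_soft_intr(), priority_dropped(), the queue size) is hypothetical, and the model runs in a single thread, so it illustrates only the division of work, not real interrupt preemption.

```c
#include <assert.h>

#define QSIZE 8                   /* illustrative queue capacity (power of two) */

static int queue[QSIZE];          /* work queue shared by both handlers */
static unsigned head, tail;       /* head: next to remove; tail: next free slot */
static int softint_pending;       /* flag requesting the software interrupt */
static int delivered, dropped;

/* High-priority handler: queue the packet, request follow-up work, return. */
static void net_hard_intr(int pkt)
{
    assert(tail - head <= QSIZE);
    if (tail - head == QSIZE) {   /* queue full: the packet is lost */
        dropped++;
        return;
    }
    queue[tail++ % QSIZE] = pkt;
    softint_pending = 1;
}

/* Low-priority handler: deliver every queued packet. */
static void net_soft_intr(void)
{
    softint_pending = 0;
    while (head != tail) {
        (void)queue[head++ % QSIZE];   /* deliver to the receiving process */
        delivered++;
    }
}

/* Called when the machine priority drops below the software-interrupt level. */
static void priority_dropped(void)
{
    if (softint_pending)
        net_soft_intr();
}
```

The hard handler does a bounded, small amount of work per packet, and the time-consuming delivery runs only after the priority has dropped.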
Clock Interrupts
The system is driven by a clock that interrupts at regular intervals. Each interrupt
is referred to as a tick. On the HP300, the clock ticks 100 times per second. At
each tick, the system updates the current time of day as well as user-process and
system timers.
Interrupts for clock ticks are posted at a high hardware-interrupt priority.
After the process state has been saved, the hardclock() routine is called. It is
important that the hardclock() routine finish its job quickly:
• If hardclock() runs for more than one tick, it will miss the next clock interrupt.
Since hardclock() maintains the time of day for the system, a missed interrupt
will cause the system to lose time.
• Because of hardclock()'s high interrupt priority, nearly all other activity in the
system is blocked while hardclock() is running. This blocking can cause
network controllers to miss packets, or a disk controller to miss the transfer of a
sector coming under a disk drive's head.
So that the time spent in hardclock() is minimized, less critical time-related
processing is handled by a lower-priority software-interrupt handler called
softclock(). In addition, if multiple clocks are available, some time-related
processing can be handled by other routines supported by alternate clocks.
The work done by hardclock() is as follows:
• Increment the current time of day.
• If the currently running process has a virtual or profiling interval timer (see
Section 3.6), decrement the timer and deliver a signal if the timer has expired.
• If the system does not have a separate clock for statistics gathering, the
hardclock() routine does the operations normally done by statclock(), as
described in the next section.
• If softclock() needs to be called, and the current interrupt-priority level is low,
call softclock() directly.
Statistics and Process Scheduling
On historic 4BSD systems, the hardclock() routine collected resource-utilization
statistics about what was happening when the clock interrupted. These statistics
were used to do accounting, to monitor what the system was doing, and to
determine future scheduling priorities. In addition, hardclock() forced context
switches so that all processes would get a share of the CPU.
This approach has weaknesses because the clock supporting hardclock()
interrupts on a regular basis. Processes can become synchronized with the system
clock, resulting in inaccurate measurements of resource utilization (especially
CPU) and inaccurate profiling [McCanne & Torek, 1993]. It is also possible to
write programs that deliberately synchronize with the system clock to outwit the
scheduler.
On architectures with multiple high-precision, programmable clocks, such as
the HP300, randomizing the interrupt period of a clock can improve the system
resource-usage measurements significantly. One clock is set to interrupt at a fixed
rate; the other interrupts at a random interval chosen from times distributed
uniformly over a bounded range.
To allow the collection of more accurate profiling information, 4.4BSD
supports profiling clocks. When a profiling clock is available, it is set to run at a tick
rate that is relatively prime to the main system clock (five times as often as the
system clock, on the HP300).
The statclock() routine is supported by a separate clock if one is available,
and is responsible for accumulating resource usage to processes. The work done
by statclock() includes
• Charge the currently running process with a tick; if the process has accumulated
four ticks, recalculate its priority. If the new priority is less than the current
priority, arrange for the process to be rescheduled.
• Collect statistics on what the system was doing at the time of the tick (sitting
idle, executing in user mode, or executing in system mode). Include basic
information on system I/O, such as which disk drives are currently active.
The remaining time-related processing involves processing timeout requests and
periodically reprioritizing processes that are ready to run. These functions are
handled by the softclock() routine.
When hardclock() completes, if there were any softclock() functions to be
done, hardclock() schedules a softclock interrupt, or sets a flag that will cause
softclock() to be called. As an optimization, if the state of the processor is such
that the softclock() execution will occur as soon as the hardclock interrupt returns,
hardclock() simply lowers the processor priority and calls softclock() directly,
avoiding the cost of returning from one interrupt only to reenter another. The
savings can be substantial over time, because interrupts are expensive and these
interrupts occur so frequently.
The primary task of the softclock() routine is to arrange for the execution of
periodic events, such as
• Process real-time timer (see Section 3.6)
• Retransmission of dropped network packets
• Watchdog timers on peripherals that require monitoring
• System process-rescheduling events
An important event is the scheduling that periodically raises or lowers the
CPU priority for each process in the system based on that process's recent CPU
usage (see Section 4.4). The rescheduling calculation is done once per second.
The scheduler is started at boot time, and each time that it runs, it requests that it
be invoked again 1 second in the future.
On a heavily loaded system with many processes, the scheduler may take a
long time to complete its job. Posting its next invocation 1 second after each
completion may cause scheduling to occur less frequently than once per second.
However, as the scheduler is not responsible for any time-critical functions, such as
maintaining the time of day, scheduling less frequently than once a second is
normally not a problem.
The data structure that describes waiting events is called the callout queue.
Figure 3.2 shows an example of the callout queue. When a process schedules an
event, it specifies a function to be called, a pointer to be passed as an argument to
the function, and the number of clock ticks until the event should occur.
The queue is sorted in time order, with the events that are to occur soonest at
the front, and the most distant events at the end. The time for each event is kept as
a difference from the time of the previous event on the queue. Thus, the
hardclock() routine needs only to check the time to expire of the first element to
determine whether softclock() needs to run. In addition, decrementing the time to
expire of the first element decrements the time for all events. The softclock()
routine executes events from the front of the queue whose time has decremented to
zero until it finds an event with a still-future (positive) time. New events are
added to the queue much less frequently than the queue is checked to see whether
any events are to occur. So, it is more efficient to identify the proper location to
place an event when that event is added to the queue than to scan the entire queue
to determine which events should occur at any single time.
Figure 3.2 Timer events in the callout queue.
[Figure: four queue entries with times of 1 tick, 3 ticks, 0 ticks, and 81 ticks,
which fire at 10 ms, 40 ms, 40 ms, and 850 ms; the function-and-argument labels
that survive extraction include g(y) and h(a).]
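The differential time queue described above can be sketched as a singly linked list. This is a minimal user-space model, not the 4.4BSD source: struct callout here carries only the fields the text mentions, and callout_insert(), callout_tick(), callout_run(), and the record() callback are hypothetical names. callout_tick() stands for the check that hardclock() makes, and callout_run() stands for the work done by softclock().

```c
#include <assert.h>
#include <stddef.h>

struct callout {
    struct callout *next;
    int delta;                 /* ticks after the previous entry on the queue */
    void (*func)(void *);      /* function to be called */
    void *arg;                 /* argument passed to the function */
};

static struct callout *calltodo;   /* head of the queue: soonest event */

/* Insert c so that it fires `ticks` ticks from now. */
static void callout_insert(struct callout *c, int ticks,
                           void (*func)(void *), void *arg)
{
    struct callout **pp = &calltodo;

    assert(ticks >= 0);
    c->func = func;
    c->arg = arg;
    /* Skip entries that fire no later than the new one, consuming deltas. */
    while (*pp != NULL && (*pp)->delta <= ticks) {
        ticks -= (*pp)->delta;
        pp = &(*pp)->next;
    }
    c->delta = ticks;
    c->next = *pp;
    if (*pp != NULL)
        (*pp)->delta -= ticks;     /* successor is now relative to c */
    *pp = c;
}

/* One clock tick: only the first entry is touched. Nonzero means "run". */
static int callout_tick(void)
{
    if (calltodo == NULL)
        return 0;
    if (calltodo->delta > 0)
        calltodo->delta--;
    return calltodo->delta == 0;
}

/* Run every entry at the front of the queue whose time has reached zero. */
static void callout_run(void)
{
    while (calltodo != NULL && calltodo->delta == 0) {
        struct callout *c = calltodo;

        calltodo = c->next;
        c->func(c->arg);
    }
}

/* Demonstration callback: record the integer that arg points to. */
static int fired[8];
static int nfired;

static void record(void *arg)
{
    fired[nfired++] = *(int *)arg;
}
```

Inserting events due in 4, 1, and 4 ticks yields deltas of 1, 3, and 0, the same pattern as the first three entries in Figure 3.2. The cost of keeping the queue sorted is paid once at insertion, so each tick needs only a decrement and a comparison against zero.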
The single argument is provided for the callout-queue function that is called,
so that one function can be used by multiple processes. For example, there is a
single real-time timer function that sends a signal to a process when a timer
expires. Every process that has a real-time timer running posts a timeout request
for this function; the argument that is passed to the function is a pointer to the
process structure for the process. This argument enables the timeout function to
deliver the signal to the correct process.
Timeout processing is more efficient when the timeouts are specified in ticks.
Time updates require only an integer decrement, and checks for timer expiration
require only a comparison against zero. If the timers contained time values,
decrementing and comparisons would be more complex. If the number of events to be
managed were large, the cost of the linear search to insert new events correctly
could dominate the simple linear queue used in 4.4BSD. Other possible
approaches include maintaining a heap with the next-occurring event at the top
[Barkley & Lee, 1988], or maintaining separate queues of short-, medium-, and
long-term events [Varghese & Lauck, 1987].
Memory-Management Services
The memory organization and layout associated with a 4.4BSD process is shown
in Fig. 3.3. Each process begins execution with three memory segments, called
text, data, and stack. The data segment is divided into initialized data and
uninitialized data (also known as bss). The text is read-only and is normally shared by
all processes executing the file, whereas the data and stack areas can be written by,
and are private to, each process. The text and initialized data for the process are
read from the executable file.
An executable file is distinguished by its being a plain file (rather than a
directory, special file, or symbolic link) and by its having 1 or more of its execute bits
set. In the traditional a.out executable format, the first few bytes of the file contain
a magic number that specifies what type of executable file that file is. Executable
files fall into two major classes:
1. Files that must be read by an interpreter
2. Files that are directly executable
In the first class, the first 2 bytes of the file are the two-character sequence #!
followed by the pathname of the interpreter to be used. (This pathname is currently
limited by a compile-time constant to 30 characters.) For example, #!/bin/sh refers
to the Bourne shell. The kernel executes the named interpreter, passing the name
of the file that is to be interpreted as an argument. To prevent loops, 4.4BSD allows
only one level of interpretation, and a file's interpreter may not itself be interpreted.
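The check for an interpreted file can be sketched as a small parser over the first bytes of the file. This is an illustration under assumptions, not the kernel's code: shell_interp() and MAXINTERP are hypothetical names, the 30-character limit is taken from the text, and skipping blanks after the #! sequence is an assumption about how the header is laid out.

```c
#include <assert.h>
#include <stddef.h>

#define MAXINTERP 30   /* pathname limit stated in the text */

/*
 * Examine the start of a file.
 * Returns 1 and copies the interpreter pathname to `out` if the file
 * begins with "#!", 0 if it does not, and -1 if the pathname is empty
 * or longer than MAXINTERP characters.
 */
static int shell_interp(const char *hdr, size_t len, char out[MAXINTERP + 1])
{
    size_t i = 2, n = 0;

    assert(hdr != NULL && out != NULL);
    if (len < 2 || hdr[0] != '#' || hdr[1] != '!')
        return 0;                       /* not an interpreted file */
    while (i < len && (hdr[i] == ' ' || hdr[i] == '\t'))
        i++;                            /* assumed: blanks may follow "#!" */
    while (i < len && hdr[i] != ' ' && hdr[i] != '\t' && hdr[i] != '\n') {
        if (n == MAXINTERP)
            return -1;                  /* pathname too long */
        out[n++] = hdr[i++];
    }
    out[n] = '\0';
    return n > 0 ? 1 : -1;
}
```

A caller enforcing the one-level rule would apply the same check to the interpreter file itself and refuse to proceed if it, too, begins with #!.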
Figure 3.3 Layout of a UNIX process in memory and on disk.
[Figure: labels surviving extraction for the process memory-resident image are
kernel stack, red zone, user area, ps_strings struct, signal code, env strings,
argv strings, env pointers, argv pointers, and user stack; labels for the disk
image are symbol table, initialized data, a.out header, and a.out magic number.]
For performance reasons, most files are directly executable. Each directly
executable file has a magic number that specifies whether that file can be paged
and whether the text part of the file can be shared among multiple processes.
Following the magic number is an exec header that specifies the sizes of text,
initialized data, uninitialized data, and additional information for debugging. (The
debugging information is not used by the kernel or by the executing program.)
Following the header is an image of the text, followed by an image of the
initialized data. Uninitialized data are not contained in the executable file because they
can be created on demand using zero-filled memory.
To begin execution, the kernel arranges to have the text portion of the file
mapped into the low part of the process address space. The initialized data portion
of the file is mapped into the address space following the text. An area equal to
the uninitialized data region is created with zero-filled memory after the initialized
data region. The stack is also created from zero-filled memory. Although the
stack should not need to be zero filled, early UNIX systems made it so. In an
attempt to save some startup time, the developers modified the kernel to not zero
fill the stack, leaving the random previous contents of the page instead. Numerous
programs stopped working because they depended on the local variables in their
main procedure being initialized to zero. Consequently, the zero filling of the
stack was restored.
Copying into memory the entire text and initialized data portion of a large
program causes a long startup latency. 4.4BSD avoids this startup time by demand
paging the program into memory, rather than preloading the program. In demand
paging, the program is loaded in small pieces (pages) as it is needed, rather than
all at once before it begins execution. The system does demand paging by
dividing up the address space into equal-sized areas called pages. For each page, the
kernel records the offset into the executable file of the corresponding data. The
first access to an address on each page causes a page-fault trap in the kernel. The
page-fault handler reads the correct page of the executable file into the process
memory. Thus, the kernel loads only those parts of the executable file that are
needed. Chapter 5 explains paging details.
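The demand-paging idea above can be modeled in user space with a table that records, for each page, the file offset of its data and whether it has been loaded. This is a toy sketch with hypothetical names (struct page, map_file(), page_fault(), load_byte()) and a deliberately tiny page size; a real kernel relies on the memory-management hardware to raise the fault rather than on an explicit check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 16    /* tiny page size, for illustration only */
#define NPAGES    4

struct page {
    int    resident;            /* has the page been read in yet? */
    size_t file_off;            /* offset of the page's data in the file */
    char   data[PAGE_SIZE];     /* the page frame itself */
};

static char file_image[NPAGES * PAGE_SIZE];   /* stands in for the executable */
static struct page pages[NPAGES];
static int faults;                            /* number of page faults taken */

/* Set up the mapping: record file offsets, but load nothing. */
static void map_file(void)
{
    for (size_t i = 0; i < NPAGES; i++) {
        memset(&file_image[i * PAGE_SIZE], 'a' + (int)i, PAGE_SIZE);
        pages[i].resident = 0;
        pages[i].file_off = i * PAGE_SIZE;
    }
}

/* Page-fault handler: read the correct page of the file into memory. */
static void page_fault(struct page *pg)
{
    memcpy(pg->data, file_image + pg->file_off, PAGE_SIZE);
    pg->resident = 1;
    faults++;
}

/* Access one byte of the address space, faulting the page in if needed. */
static char load_byte(size_t addr)
{
    struct page *pg;

    assert(addr < (size_t)NPAGES * PAGE_SIZE);
    pg = &pages[addr / PAGE_SIZE];
    if (!pg->resident)
        page_fault(pg);
    return pg->data[addr % PAGE_SIZE];
}
```

Only the first access to each page pays for a read of the file; pages that are never touched are never loaded.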
The uninitialized data area can be extended with zero-filled pages using the
system call sbrk, although most user processes use the library routine malloc(), a
more programmer-friendly interface to sbrk. This allocated memory, which grows
from the top of the original data segment, is called the heap. On the HP300, the
stack grows down from the top of memory, whereas the heap grows up from the
bottom of memory.
Above the user stack are areas of memory that are created by the system when
the process is started. Directly above the user stack is the number of arguments
(argc), the argument vector (argv), and the process environment vector (envp) set
up when the program was executed. Above them are the argument and
environment strings themselves. Above them is the signal code, used when the system
delivers signals to the process; above that is the struct ps_strings structure, used
by ps to locate the argv of the process. At the top of user memory is the user area
(u.), the red zone, and the per-process kernel stack. The red zone may or may not
be present in a port to an architecture. If present, it is implemented as a page of
read-only memory immediately below the per-process kernel stack. Any attempt
to allocate below the fixed-size kernel stack will result in a memory fault,
protecting the user area from being overwritten. On some architectures, it is not possible
to mark these pages as read-only, or having the kernel stack attempt to write a
write-protected page would result in unrecoverable system failure. In these cases,
other approaches can be taken; for example, checking during each clock interrupt
to see whether the current kernel stack has grown too large.
In addition to the information maintained in the user area, a process usually
requires the use of some global system resources. The kernel maintains a linked
list of processes, called the process table, which has an entry for each process in
the system. Among other data, the process entries record information on
scheduling and on virtual-memory allocation. Because the entire process address space,
including the user area, may be swapped out of main memory, the process entry
must record enough information to be able to locate the process and to bring that
process back into memory. In addition, information needed while the process is
swapped out (e.g., scheduling information) must be maintained in the process
entry, rather than in the user area, to avoid the kernel swapping in the process only
to decide that it is not at a high-enough priority to be run.
Other global resources associated with a process include space to record
information about descriptors and page tables that record information about
physical-memory utilization.
3.6 Timing Services
The kernel provides several different timing services to processes. These services
include timers that run in real time and timers that run only while a process is
executing.
Real Time
The system's time offset since January 1, 1970, Coordinated Universal Time
(UTC), also known as the Epoch, is returned by the system call gettimeofday.
Most modern processors (including the HP300 processors) maintain a battery-backup
time-of-day register. This clock continues to run even if the processor is
turned off. When the system boots, it consults the processor's time-of-day register
to find out the current time. The system's time is then maintained by the clock
interrupts. At each interrupt, the system increments its global time variable by an
amount equal to the number of microseconds per tick. For the HP300, running at
100 ticks per second, each tick represents 10,000 microseconds.
Adjustment of the Time
Often, it is desirable to maintain the same time on all the machines on a network.
It is also possible to keep more accurate time than that available from the basic
processor clock. For example, hardware is readily available that listens to the set
of radio stations that broadcast UTC synchronization signals in the United States.
When processes on different machines agree on a common time, they will wish to
change the clock on their host processor to agree with the networkwide time
value. One possibility is to change the system time to the network time using the
settimeofday system call. Unfortunately, the settimeofday system call will result
in time running backward on machines whose clocks were fast. Time running
backward can confuse user programs (such as make) that expect time to invariably
increase. To avoid this problem, the system provides the adjtime system call
[Gusella et al, 1994]. The adjtime system call takes a time delta (either positive or
negative) and changes the rate at which time advances by 10 percent, faster or
slower, until the time has been corrected. The operating system does the speedup
by incrementing the global time by 11,000 microseconds for each tick, and does
the slowdown by incrementing the global time by 9,000 microseconds for each
tick. Regardless, time increases monotonically, and user processes depending on
the ordering of file-modification times are not affected. However, time changes
that take tens of seconds to adjust will affect programs that are measuring time
intervals by using repeated calls to gettimeofday.
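The slewing arithmetic described above can be sketched in a few lines. The following Python model is an illustration only, not kernel code: the names run_clock, TICK_US, and SKEW_US are hypothetical, and the 100-ticks-per-second rate and 10 percent skew are simply the figures quoted in the text.

```python
HZ = 100                      # clock ticks per second (the HP300 figure from the text)
TICK_US = 1_000_000 // HZ     # 10,000 microseconds per tick
SKEW_US = TICK_US // 10       # 10 percent speedup or slowdown per tick


def run_clock(time_us, delta_us, nticks):
    """Advance a modeled clock by nticks while slewing toward delta_us.

    Returns the new time and the correction that is still outstanding.
    """
    for _ in range(nticks):
        if delta_us > 0:
            step = min(SKEW_US, delta_us)     # run fast: up to 11,000 us per tick
        elif delta_us < 0:
            step = max(-SKEW_US, delta_us)    # run slow: down to 9,000 us per tick
        else:
            step = 0                          # no correction pending
        time_us += TICK_US + step             # time never moves backward
        delta_us -= step
    return time_us, delta_us
```

Under these assumptions, a clock that is 5,000 microseconds fast is corrected over five ticks of 9,000 microseconds each, after which normal 10,000-microsecond ticks resume; at no point does the modeled time decrease.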
External Representation
Time is always exported from the system as microseconds, rather than as clock
ticks, to provide a resolution-independent format. Internally, the kernel is free to
select whatever tick rate best trades off clock-interrupt-handling overhead with
timer resolution. As the tick rate per second increases, the resolution of the system
timers improves, but the time spent dealing with hardclock interrupts
increases. As processors become faster, the tick rate can be increased to provide
finer resolution without adversely affecting user applications.
All filesystem (and other) timestamps are maintained in UTC offsets from the
Epoch. Conversion to local time, including adjustment for daylight-savings time,
is handled externally to the system in the C library.
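To illustrate why exporting microseconds hides the tick rate, here is a minimal Python sketch. The function name ticks_to_timeval is hypothetical; it only shows that the same elapsed time yields the same exported value whatever rate the kernel happens to use internally.

```python
def ticks_to_timeval(ticks, hz):
    """Convert a tick count at rate hz into (seconds, microseconds)."""
    seconds, leftover = divmod(ticks, hz)
    microseconds = leftover * 1_000_000 // hz
    return seconds, microseconds
```

For example, 250 ticks at 100 ticks per second and 2,560 ticks at 1,024 ticks per second both describe 2.5 seconds, and both convert to the same pair.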
Interval Time
The system provides each process with three interval timers. The real timer
decrements in real time. An example of use for this timer is a library routine
maintaining a wakeup-service queue. A SIGALRM signal is delivered to the process
when this timer expires. The real-time timer is run from the timeout queue
maintained by the softclock() routine (see Section 3.4).
The profiling timer decrements both in process virtual time (when running in
user mode) and when the system is running on behalf of the process. It is
designed to be used by processes to profile their execution statistically. A SIGPROF
signal is delivered to the process when this timer expires. The profiling
timer is implemented by the hardclock() routine. Each time that hardclock() runs,
it checks to see whether the currently running process has requested a profiling
timer; if it has, hardclock() decrements the timer, and sends the process a signal
when zero is reached.
The virtual timer decrements in process virtual time. It runs only when the
process is executing in user mode. A SIGVTALRM signal is delivered to the process
when this timer expires. The virtual timer is also implemented in hardclock()
as the profiling timer is, except that it decrements the timer for the current process
only if it is executing in user mode, and not if it is running in the kernel.
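The difference between the virtual and profiling timers can be modeled compactly. The sketch below is a Python illustration under stated assumptions: TimerState, hardclock, and the 10,000-microsecond tick are names and values chosen for the example, timer reloading is omitted, and the real timer is left out because the text says it runs from the timeout queue instead.

```python
from dataclasses import dataclass, field

TICK_US = 10_000  # assumed tick length (100 ticks per second)


@dataclass
class TimerState:
    virtual_us: int = 0   # zero means the timer is not armed
    prof_us: int = 0
    signals: list = field(default_factory=list)


def _decrement(remaining_us):
    """Return (new_remaining, expired) for an armed timer after one tick."""
    remaining_us = max(0, remaining_us - TICK_US)
    return remaining_us, remaining_us == 0


def hardclock(proc, user_mode):
    """Charge one clock tick to proc, the currently running process."""
    if user_mode and proc.virtual_us > 0:      # virtual timer: user mode only
        proc.virtual_us, expired = _decrement(proc.virtual_us)
        if expired:
            proc.signals.append("SIGVTALRM")
    if proc.prof_us > 0:                       # profiling timer: user or kernel
        proc.prof_us, expired = _decrement(proc.prof_us)
        if expired:
            proc.signals.append("SIGPROF")
```

In this model a tick taken while the process runs in the kernel advances only the profiling timer, which is the single distinction the text draws between the two.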
3.7 User, Group, and Other Identifiers
One important responsibility of an operating system is to implement access-control
mechanisms. Most of these access-control mechanisms are based on the
notions of individual users and of groups of users. Users are named by a 32-bit
number called a user identifier (UID). UIDs are not assigned by the kernel; they
are assigned by an outside administrative authority. UIDs are the basis for
accounting, for restricting access to privileged kernel operations (such as the
request used to reboot a running system), for deciding to what processes a signal
may be sent, and as a basis for filesystem access and disk-space allocation. A single
user, termed the superuser (also known by the user name root), is trusted by
the system and is permitted to do any supported kernel operation. The superuser
is identified not by any specific name, such as root, but instead by a UID of zero.
Users are organized into groups. Groups are named by a 32-bit number called
a group identifier (GID). GIDs, like UIDs, are used in the filesystem access-control
facilities and in disk-space allocation.
The state of every 4.4BSD process includes a UID and a set of GIDs. A process's
filesystem-access privileges are defined by the UID and GIDs of the process
(for the filesystem hierarchy beginning at the process's root directory). Normally,
these identifiers are inherited automatically from the parent process when a new
process is created. Only the superuser is permitted to alter the UID or GID of a
process. This scheme enforces a strict compartmentalization of privileges, and
ensures that no user other than the superuser can gain privileges.
Each file has three sets of permission bits, for read, write, or execute permission
for each of owner, group, and other. These permission bits are checked in the
following order:
1. If the UID of the file is the same as the UID of the process, only the owner
   permissions apply; the group and other permissions are not checked.
2. If the UIDs do not match, but the GID of the file matches one of the GIDs of the
   process, only the group permissions apply; the owner and other permissions
   are not checked.
3. Only if the UID and GIDs of the process fail to match those of the file are the
   permissions for all others checked. If these permissions do not allow the
   requested operation, it will fail.
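The checking order above can be expressed as a short function. This Python sketch is illustrative only: access_allowed and its parameters are hypothetical names, the mode is assumed to use the conventional octal layout (owner, group, other), and the superuser's ability to bypass the check is deliberately not modeled.

```python
READ, WRITE, EXECUTE = 4, 2, 1


def access_allowed(mode, file_uid, file_gid, proc_uid, proc_gids, want):
    """Apply exactly one permission class, chosen in owner/group/other order."""
    if proc_uid == file_uid:
        bits = (mode >> 6) & 7      # owner bits only; group and other ignored
    elif file_gid in proc_gids:
        bits = (mode >> 3) & 7      # group bits only; owner and other ignored
    else:
        bits = mode & 7             # everyone else
    return (bits & want) == want
```

One consequence of selecting a single class is that an owner can be denied access that the file's group is granted: with mode 0o047, the owner cannot read the file even if the owner also belongs to the file's group.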
The UID and GIDs for a process are inherited from its parent. When a user logs in,
the login program (see Section 14.6) sets the UID and GIDs before doing the exec
system call to run the user's login shell; thus, all subsequent processes will inherit
the appropriate identifiers.
Often, it is desirable to grant a user limited additional privileges. For
example, a user who wants to send mail must be able to append the mail to
another user's mailbox. Making the target mailbox writable by all users would
permit a user other than its owner to modify messages in it (whether maliciously
or unintentionally). To solve this problem, the kernel allows the creation of programs
that are granted additional privileges while they are running. Programs that
run with a different UID are called set-user-identifier (setuid) programs; programs
that run with an additional group privilege are called set-group-identifier (setgid)
programs [Ritchie, 1979]. When a setuid program is executed, the permissions of
the process are augmented to include those of the UID associated with the program.
The UID of the program is termed the effective UID of the process, whereas
the original UID of the process is termed the real UID. Similarly, executing a setgid
program augments a process's permissions with those of the program's GID,
and the effective GID and real GID are defined accordingly.
Systems can use setuid and setgid programs to provide controlled access to
files or services. For example, the program that adds mail to the users' mailbox
runs with the privileges of the superuser, which allow it to write to any file in the
system. Thus, users do not need permission to write other users' mailboxes, but
can still do so by running this program. Naturally, such programs must be written
carefully to have only a limited set of functionality!
The UID and GIDs are maintained in the per-process area. Historically, GIDs
were implemented as one distinguished GID (the effective GID) and a supplementary
array of GIDs, which was logically treated as one set of GIDs. In 4.4BSD, the
distinguished GID has been made the first entry in the array of GIDs. The supplementary
array is of a fixed size (16 in 4.4BSD), but may be changed by recompiling
the kernel.
4.4BSD implements the setgid capability by setting the zeroth element of the
supplementary groups array of the process that executed the setgid program to the
group of the file. Permissions can then be checked as they are for a normal process.
Because of the additional group, the setgid program may be able to access more
files than can a user process that runs a program without the special privilege. The
login program duplicates the zeroth array element into the first array element
when initializing the user's supplementary group array, so that, when a setgid program
is run and modifies the zeroth element, the user does not lose any privileges.
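The effect of duplicating the zeroth entry can be seen in a small model. The Python below is a sketch, with login_groups and exec_setgid as hypothetical helper names; it represents the group array as a plain list and ignores the fixed array size.

```python
def login_groups(primary_gid, other_gids):
    """Group array as login is described to build it: slot 0 is copied to slot 1."""
    return [primary_gid, primary_gid] + list(other_gids)


def exec_setgid(groups, file_gid):
    """Group array after executing a setgid program whose file has group file_gid."""
    new_groups = list(groups)
    new_groups[0] = file_gid    # slot 0 plays the role of the effective GID
    return new_groups
```

With a primary group of 100 and one extra group 200, running a setgid program for group 5 yields [5, 100, 200] in this model: the process gains group 5 and still holds group 100 through the duplicate entry.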
The setuid capability is implemented by the effective UID of the process being
changed from that of the user to that of the program being executed. As it will
with setgid, the protection mechanism will now permit access without any change
or special knowledge that the program is running setuid. Since a process can have
only a single UID at a time, it is possible to lose some privileges while running
setuid. The previous real UID is still maintained as the real UID when the new
effective UID is installed. The real UID, however, is not used for any validation
of privilege.
A setuid process may wish to revoke its special privilege temporarily while it
is running. For example, it may need its special privilege to access a restricted file
at only the start and end of its execution. During the rest of its execution, it should
have only the real user's privileges. In 4.3BSD, revocation of privilege was done
by switching of the real and effective UIDs. Since only the effective UID is used
for access control, this approach provided the desired semantics and provided a
place to hide the special privilege. The drawback to this approach was that the
real and effective UIDs could easily become confused.
In 4.4BSD, an additional identifier, the saved UID, was introduced to record
the identity of setuid programs. When a program is exec'ed, its effective UID is
copied to its saved UID. The first line of Table 3.1 shows an unprivileged program
for which the real, effective, and saved UIDs are all those of the real user. The second
line of Table 3.1 shows a setuid program being run that causes the effective
UID to be set to its associated special-privilege UID. The special-privilege UID has
also been copied to the saved UID.
Also added to 4.4BSD was the new seteuid system call that sets only the
effective UID; it does not affect the real or saved UIDs. The seteuid system call is
permitted to set the effective UID to the value of either the real or the saved UID.
Lines 3 and 4 of Table 3.1 show how a setuid program can give up and then
reclaim its special privilege while continuously retaining its correct real UID.
Lines 5 and 6 show how a setuid program can run a subprocess without granting
the latter the special privilege. First, it sets its effective UID to the real UID. Then,
when it exec's the subprocess, the effective UID is copied to the saved UID, and all
access to the special-privilege UID is lost.
A similar saved GID mechanism permits processes to switch between the real
GID and the initial effective GID.
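The six steps that the text attributes to Table 3.1 can be replayed with a small state machine. This Python sketch is a model of the rules as stated, not of the kernel's code: Creds, do_exec, and seteuid are illustrative names, and the superuser's extra freedom to set arbitrary UIDs is not modeled.

```python
class Creds:
    """Real, effective, and saved UIDs of one modeled process."""

    def __init__(self, uid):
        self.real = self.effective = self.saved = uid

    def do_exec(self, setuid_owner=None):
        """exec: a setuid file installs its UID; effective is copied to saved."""
        if setuid_owner is not None:
            self.effective = setuid_owner
        self.saved = self.effective

    def seteuid(self, uid):
        """Only the real or the saved UID may become the effective UID."""
        if uid not in (self.real, self.saved):
            raise PermissionError("seteuid: not the real or saved UID")
        self.effective = uid

    def ids(self):
        return (self.real, self.effective, self.saved)
```

Starting from a real UID R and executing a setuid program owned by S, the model moves through (R, S, S), then (R, R, S) after seteuid(R), back to (R, S, S) after seteuid(S), and finally to (R, R, R) once the process sets its effective UID to R and execs an ordinary program; after that, seteuid(S) is refused.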
Host Identifiers
An additional identifier is defined by the kernel for use on machines operating in a
networked environment. A string (of up to 256 characters) specifying the host's
name is maintained by the kernel. This value is intended to be defined uniquely for
each machine in a network. In addition, in the Internet domain-name system, each
machine is given a unique 32-bit number. Use of these identifiers permits applications
to use networkwide unique identifiers for objects such as processes, files, and
users, which is useful in the construction of distributed applications [Gifford,
1981]. The host identifiers for a machine are administered outside the kernel.
Table 3.1 Actions affecting the real, effective, and saved UIDs. R: real user
identifier; S: special-privilege user identifier.
Action            Real   Effective   Saved
1. exec-normal    R      R           R
2. exec-setuid    R      S           S
3. seteuid(R)     R      R           S
4. seteuid(S)     R      S           S
5. seteuid(R)     R      R           S
6. exec-normal    R      R           R
The 32-bit host identifier found in 4.3BSD has been deprecated in 4.4BSD,
and is supported only if the system is compiled for 4.3BSD compatibility.
Process Groups and Sessions
Each process in the system is associated with a process group. The group of processes
in a process group is sometimes referred to as a job, and manipulated as a
single entity by processes such as the shell. Some signals (e.g., SIGINT) are delivered
to all members of a process group, causing the group as a whole to suspend
or resume execution, or to be interrupted or terminated.
Sessions were designed by the IEEE POSIX.1003.1 Working Group with the
intent of fixing a long-standing security problem in UNIX: namely, that processes
could modify the state of terminals that were trusted by another user's processes.
A session is a collection of process groups, and all members of a process group
are members of the same session. In 4.4BSD, when a user first logs onto the system,
they are entered into a new session. Each session has a controlling process,
which is normally the user's login shell. All subsequent processes created by the
user are part of process groups within this session, unless they explicitly create a
new session. Each session also has an associated login name, which is usually the
user's login name. This name can be changed by only the superuser.
Each session is associated with a terminal, known as its controlling terminal.
Each controlling terminal has a process group associated with it. Normally, only
processes that are in the terminal's current process group read from or write to the
terminal, allowing arbitration of a terminal between several different jobs. When
the controlling process exits, access to the terminal is taken away from any
remaining processes within the session.
Newly created processes are assigned process IDs distinct from all already-existing
processes and process groups, and are placed in the same process group
and session as their parent. Any process may set its process group equal to its process
ID (thus creating a new process group) or to the value of any process group
within its session. In addition, any process may create a new session, as long as it
is not already a process-group leader. Sessions, process groups, and associated
topics are discussed further in Section 4.8 and in Section 10.5.
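The membership rules in the last paragraph can be captured in a toy model. The Python below is a sketch that encodes only the rules stated here; Proc, fork, setpgid, and setsid are illustrative names that borrow the system-call names for readability, and any further restrictions that a real kernel applies are not represented.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    pgid: int   # process-group ID
    sid: int    # session ID


def fork(parent, new_pid):
    """A child starts in its parent's process group and session."""
    return Proc(new_pid, parent.pgid, parent.sid)


def setpgid(proc, pgid, procs):
    """Start a new group (pgid == pid) or join a group in the same session."""
    joins_existing = any(p.pgid == pgid and p.sid == proc.sid for p in procs)
    if pgid != proc.pid and not joins_existing:
        raise PermissionError("no such process group in this session")
    proc.pgid = pgid


def setsid(proc):
    """Create a new session; refused for a process-group leader."""
    if proc.pgid == proc.pid:
        raise PermissionError("a process-group leader cannot create a session")
    proc.sid = proc.pgid = proc.pid
```

A shell-like process could use these operations to place each pipeline in its own process group, while a daemon-like child would call setsid to detach itself from the login session.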
3.8 Resource Services
All systems have limits imposed by their hardware architecture and configuration
to ensure reasonable operation and to keep users from accidentally (or maliciously)
creating resource shortages. At a minimum, the hardware limits must be
imposed on processes that run on the system. It is usually desirable to limit processes
further, below these hardware-imposed limits. The system measures
resource utilization, and allows limits to be imposed on consumption either at or
below the hardware-imposed limits.
Process Priorities
The 4.4BSD system gives CPU scheduling priority to processes that have not used
CPU time recently. This priority scheme tends to favor processes that execute for
only short periods of time (for example, interactive processes). The priority
selected for each process is maintained internally by the kernel. The calculation
of the priority is affected by the per-process nice variable. Positive nice values
mean that the process is willing to receive less than its share of the processor.
Negative values of nice mean that the process wants more than its share of the processor.
Most processes run with the default nice value of zero, asking neither
higher nor lower access to the processor. It is possible to determine or change the
nice currently assigned to a process, to a process group, or to the processes of a
specified user. Many factors other than nice affect scheduling, including the
amount of CPU time that the process has used recently, the amount of memory that
the process has used recently, and the current load on the system. The exact algorithms
that are used are described in Section 4.4.
Resource Utilization
As a process executes, it uses system resources, such as the CPU and memory.
The kernel tracks the resources used by each process and compiles statistics
describing this usage. The statistics managed by the kernel are available to a process
while the latter is executing. When a process terminates, the statistics are
made available to its parent via the wait family of system calls.
The resources used by a process are returned by the system call getrusage.
The resources used by the current process, or by all the terminated children of the
current process, may be requested. This information includes
• The amount of user and system time used by the process
• The memory utilization of the process
• The paging and disk I/O activity of the process
• The number of voluntary and involuntary context switches taken by the process
• The amount of interprocess communication done by the process
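The categories listed above map onto fields that a process can read back for itself. As a hedged illustration, the following Python sketch uses the standard resource module, which wraps getrusage on UNIX-like systems; the helper name usage_summary and the dictionary keys are choices made for this example, and which fields a given system actually fills in varies.

```python
import resource


def usage_summary(who=resource.RUSAGE_SELF):
    """Collect a few getrusage fields under descriptive (hypothetical) names."""
    ru = resource.getrusage(who)
    return {
        "user_time_s": ru.ru_utime,            # CPU time in user mode
        "system_time_s": ru.ru_stime,          # CPU time in the kernel
        "max_resident_set": ru.ru_maxrss,      # memory utilization
        "page_faults_with_io": ru.ru_majflt,   # paging activity
        "block_inputs": ru.ru_inblock,         # disk I/O activity
        "block_outputs": ru.ru_oublock,
        "voluntary_switches": ru.ru_nvcsw,     # context switches
        "involuntary_switches": ru.ru_nivcsw,
        "messages_sent": ru.ru_msgsnd,         # interprocess communication
        "messages_received": ru.ru_msgrcv,
    }
```

Passing resource.RUSAGE_CHILDREN instead of the default requests the totals for terminated children, matching the two choices described in the text.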
The resource-usage information is collected at locations throughout the kernel.
The CPU time is collected by the statclock() function, which is called either by the
system clock in hardclock(), or, if an alternate clock is available, by the alternate-clock
interrupt routine. The kernel scheduler calculates memory utilization by
sampling the amount of memory that an active process is using at the same time
that it is recomputing process priorities. The vm_fault() routine recalculates the
paging activity each time that it starts a disk transfer to fulfill a paging request (see
Section 5.11). The I/O activity statistics are collected each time that the process
has to start a transfer to fulfill a file or device I/O request, as well as when the
general system statistics are calculated. The IPC communication activity is
updated each time that information is sent or received.
Resource Limits
The kernel also supports limiting of certain per-process resources. These
resources include
• The maximum amount of CPU time that can be accumulated
• The maximum bytes that a process can request be locked into memory
• The maximum size of a file that can be created by a process
• The maximum size of a process's data segment
• The maximum size of a process's stack segment
• The maximum size of a core file that can be created by a process
• The maximum number of simultaneous processes allowed to a user
• The maximum number of simultaneous open files for a process
• The maximum amount of physical memory that a process may use at any given
  time
For each resource controlled by the kernel, two limits are maintained: a soft limit
and a hard limit. All users can alter the soft limit within the range of 0 to the corresponding
hard limit. All users can (irreversibly) lower the hard limit, but only
the superuser can raise the hard limit. If a process exceeds certain soft limits, a
signal is delivered to the process to notify it that a resource limit has been
exceeded. Normally, this signal causes the process to terminate, but the process
may either catch or ignore the signal. If the process ignores the signal and fails to
release resources that it already holds, further attempts to obtain more resources
will result in errors.
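The rules for changing the two limits can be summarized in code. The Python below is a minimal model of the policy as stated, not the kernel's implementation; the class Limit, its update method, and the superuser flag are hypothetical, and unlimited values are not represented.

```python
class Limit:
    """Soft and hard limit for one resource of one modeled process."""

    def __init__(self, soft, hard):
        self.soft, self.hard = soft, hard

    def update(self, soft, hard, superuser=False):
        """Apply a requested (soft, hard) pair if the stated rules permit it."""
        if hard > self.hard and not superuser:
            raise PermissionError("only the superuser may raise a hard limit")
        if not 0 <= soft <= hard:
            raise ValueError("soft limit must lie between 0 and the hard limit")
        self.soft, self.hard = soft, hard
```

Because an ordinary user can lower the hard limit but cannot raise it again, a call such as update(50, 80) on a limit of (10, 100) is a one-way step in this model.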
Resource limits are generally enforced at or near the locations that the
resource statistics are collected. The CPU time limit is enforced in the process
context-switching function. The stack and data-segment limits are enforced by a
return of allocation failure once those limits have been reached. The file-size limit
is enforced by the filesystem.
Filesystem Quotas
In addition to limits on the size of individual files, the kernel optionally enforces
limits on the total amount of space that a user or group can use on a filesystem.
Our discussion of the implementation of these limits is deferred to Section 7.4.
3.9 System-Operation Services
There are several operational functions having to do with system startup and shutdown.
The bootstrapping operations are described in Section 14.2. System shutdown
is described in Section 14.7.
The system supports a simple form of resource accounting. As each process terminates,
an accounting record describing the resources used by that process is
written to a systemwide accounting file. The information supplied by the system
comprises
• The name of the command that ran
• The amount of user and system CPU time that was used
• The elapsed time the command ran
• The average amount of memory used
• The number of disk I/O operations done
• The UID and GID of the process
• The terminal from which the process was started
The information in the accounting record is drawn from the run-time statistics that
were described in Section 3.8. The granularity of the time fields is in sixty-fourths
of a second. To conserve space in the accounting file, the times are stored in a
16-bit word as a floating-point number using 3 bits as a base-8 exponent, and the
other 13 bits as the fractional part. For historic reasons, the same floating-point-conversion
routine processes the count of disk operations, so the number of
disk operations must be multiplied by 64 before it is converted to the floating-point
representation.
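The 16-bit format can be made concrete with a small encoder and decoder. This Python sketch follows the layout given above (3-bit base-8 exponent in the high bits, 13-bit fraction in the low bits); the names encode and decode are illustrative, the placement of the exponent in the high bits is an assumption, and the sketch truncates instead of rounding.

```python
AHZ = 64                            # accounting units per second (sixty-fourths)
MANT_BITS = 13
EXP_BITS = 3
MAX_MANT = (1 << MANT_BITS) - 1     # largest 13-bit fraction: 8191


def encode(units):
    """Pack a count of 1/64-second units into a 16-bit word (truncating)."""
    exponent = 0
    while units > MAX_MANT:
        units >>= EXP_BITS          # one base-8 digit is a 3-bit shift
        exponent += 1
    if exponent >= (1 << EXP_BITS):
        raise OverflowError("value too large for a 3-bit base-8 exponent")
    return (exponent << MANT_BITS) | units


def decode(word):
    """Recover the (possibly truncated) count of 1/64-second units."""
    return (word & MAX_MANT) << (EXP_BITS * (word >> MANT_BITS))
```

Two seconds is 128 units and fits in the fraction unchanged, whereas 10,000 units needs one base-8 shift: it is stored as exponent 1 and fraction 1250, and decodes back to 10,000. A disk-operation count would be multiplied by AHZ before being passed to encode, as the text explains.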
There are also flags that describe how the process terminated, whether it ever
had superuser privileges, and whether it did an exec after a fork.
The superuser requests accounting by passing the name of the file to be used
for accounting to the kernel. As part of a process exiting, the kernel appends an
accounting record to the accounting file. The kernel makes no use of the accounting
records; the records' summaries and use are entirely the domain of user-level
accounting programs. As a guard against a filesystem running out of space
because of unchecked growth of the accounting file, the system suspends accounting
when the filesystem is reduced to only 2 percent remaining free space.
Accounting resumes when the filesystem has at least 4 percent free space.
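The two thresholds form a simple hysteresis, which the following Python sketch makes explicit. The function accounting_active and the constant names are hypothetical, and the exact boundary comparisons (at or below 2 percent, at or above 4 percent) are an interpretation of the wording above.

```python
SUSPEND_PCT = 2   # suspend when free space falls to this level
RESUME_PCT = 4    # resume once free space is back to at least this level


def accounting_active(active, free_pct):
    """Return the new accounting state given the current state and free space."""
    if active and free_pct <= SUSPEND_PCT:
        return False
    if not active and free_pct >= RESUME_PCT:
        return True
    return active     # between the thresholds, keep the current state
```

The gap between the thresholds means that at 3 percent free space the answer depends on history: accounting that is running keeps running, and accounting that was suspended stays suspended.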
The accounting information has certain limitations. The information on run
time and memory usage is only approximate because it is gathered statistically.
Accounting information is written only when a process exits, so processes that are
still running when a system is shut down unexpectedly do not show up in the
accounting file. (Obviously, long-lived system daemons are among such processes.)
Finally, the accounting records fail to include much information needed
to do accurate billing, including usage of other resources, such as tape drives and
printers.
Exercises
3.1 Describe three types of system activity.
3.2 When can a routine executing in the top half of the kernel be preempted?
    When can it be interrupted?
3.3 Why are routines executing in the bottom half of the kernel precluded from
    using information located in the user area?
3.4 Why does the system defer as much work as possible from high-priority
    interrupts to lower-priority software-interrupt processes?
3.5 What determines the shortest (nonzero) time period that a user process can
    request when setting a timer?
3.6 How does the kernel determine the system call for which it has been
    invoked?
3.7 How are initialized data represented in an executable file? How are uninitialized
    data represented in an executable file? Why are the representations
    different?
3.8 Describe how the "#!" mechanism can be used to make programs that
    require emulation appear as though they were normal executables.
3.9 Is it possible for a file to have permissions set such that its owner cannot
    read it, even though a group can? Is this situation possible if the owner is a
    member of the group that can read the file? Explain your answers.
*3.10 Describe the security implications of not zero filling the stack region at
    program startup.
*3.11 Why is the conversion from UTC to local time done by user processes,
    rather than in the kernel?
*3.12 What is the advantage of having the kernel, rather than an application,
    restart an interrupted system call?
*3.13 Describe a scenario in which the sorted-difference algorithm used for the
    callout queue does not work well. Suggest an alternative data structure that
    runs more quickly than does the sorted-difference algorithm for your scenario.
*3.14 The SIGPROF profiling timer was originally intended to replace the profil
    system call to collect a statistical sampling of a program's program counter.
    Give two reasons why the profil facility had to be retained.
**3.15 What weakness in the process-accounting mechanism makes the latter
    unsuitable for use in a commercial environment?
References
Barkley & Lee, 1988.
R. E. Barkley & T. P. Lee, "A Heap-Based Callout Implementation to Meet
Real-Time Needs," USENIX Association Conference Proceedings, pp.
213-222, June 1988.
Gifford, 1981.
D. Gifford, "Information Storage in a Decentralized Computer System,"
PhD Thesis, Electrical Engineering Department, Stanford University, Stanford,
CA, 1981.
Gusella et al, 1994.
R. Gusella, S. Zatti, & J. M. Bloom, "The Berkeley UNIX Time Synchronization
Protocol," in 4.4BSD System Manager's Manual, pp. 12:1-10,
O'Reilly & Associates, Inc., Sebastopol, CA, 1994.
McCanne & Torek, 1993.
S. McCanne & C. Torek, "A Randomized Sampling Clock for CPU Utilization
Estimation and Code Profiling," USENIX Association Conference Proceedings,
pp. 387-394, January 1993.
Ritchie, 1979.
D. M. Ritchie, "Protection of Data File Contents," United States Patent, no.
4,135,240, United States Patent Office, Washington, D.C., January 16, 1979.
Assignee: Bell Telephone Laboratories, Inc., Murray Hill, NJ, Appl. No.:
377,591, Filed: Jul. 9, 1973.
Varghese & Lauck, 1987.
G. Varghese & T. Lauck, "Hashed and Hierarchical Timing Wheels: Data
Structures for the Efficient Implementation of a Timer Facility," Proceedings
of the Eleventh Symposium on Operating Systems Principles, pp.
25-38, November 1987.
Chapter 4
Process Management
4.1 Introduction to Process Management
A process is a program in execution. A process must have system resources, such
as memory and the underlying CPU. The kernel supports the illusion of concurrent
execution of multiple processes by scheduling system resources among the set of
processes that are ready to execute. This chapter describes the composition of a
process, the method that the system uses to switch between processes, and the
scheduling policy that it uses to promote sharing of the CPU. Later chapters study
process creation and termination, signal facilities, and process-debugging facilities.
Two months after the developers began the first implementation of the UNIX
operating system, there were two processes: one for each of the terminals of the
PDP-7. At age 10 months, and still on the PDP-7, UNIX had many processes, the
fork operation, and something like the wait system call. A process executed a new
program by reading in a new program on top of itself. The first PDP-11 system
(First Edition UNIX) saw the introduction of exec. All these systems allowed only
one process in memory at a time. When a PDP-11 with memory management (a
KS-11) was obtained, the system was changed to permit several processes to
remain in memory simultaneously, to reduce swapping. But this change did not
apply to multiprogramming because disk I/O was synchronous. This state of
affairs persisted into 1972 and the first PDP-11/45 system. True multiprogramming
was finally introduced when the system was rewritten in C. Disk I/O for one
process could then proceed while another process ran. The basic structure of process
management in UNIX has not changed since that time [Ritchie, 1988].
A process operates in either user mode or kernel mode. In user mode, a process
executes application code with the machine in a nonprivileged protection
mode. When a process requests services from the operating system with a system
call, it switches into the machine's privileged protection mode via a protected
mechanism, and then operates in kernel mode.
The resources used by a process are similarly split into two parts. The
resources needed for execution in user mode are defined by the CPU architecture
and typically include the CPU's general-purpose registers, the program counter,
the processor-status register, and the stack-related registers, as well as the contents
of the memory segments that constitute the 4.4BSD notion of a program (the text,
data, and stack segments).
Kernel-mode resources include those required by the underlying hardware
(such as registers, program counter, and stack pointer) and also by the state
required for the 4.4BSD kernel to provide system services for a process. This kernel
state includes parameters to the current system call, the current process's user
identity, scheduling information, and so on. As described in Section 3.1, the kernel
state for each process is divided into several separate data structures, with two
primary structures: the process structure and the user structure.
The process structure contains information that must always remain resident
in main memory, along with references to a number of other structures that remain
resident; whereas the user structure contains information that needs to be resident
only when the process is executing (although user structures of other processes
also may be resident). User structures are allocated dynamically through the
memory-management facilities. Historically, more than one-half of the process
state was stored in the user structure. In 4.4BSD, the user structure is used for
only the per-process kernel stack and a couple of structures that are referenced
from the process structure. Process structures are allocated dynamically as part of
process creation, and are freed as part of process exit.
The 4.4BSD system supports transparent multiprogramming: the illusion of
concurrent execution of multiple processes or programs. It does so by context
switching (that is, by switching between the execution context of processes). A
mechanism is also provided for scheduling the execution of processes (that is,
for deciding which one to execute next). Facilities are provided for ensuring
consistent access to data structures that are shared among processes.
Context switching is a hardware-dependent operation whose implementation
is influenced by the underlying hardware facilities. Some architectures provide
machine instructions that save and restore the hardware-execution context of the
process, including the virtual-address space. On the others, the software must
collect the hardware state from various registers and save it, then load those registers
with the new hardware state. All architectures must save and restore the software
state used by the kernel.
Context switching is done frequently, so increasing the speed of a context
switch noticeably decreases time spent in the kernel and provides more time for
execution of user applications. Since most of the work of a context switch is
expended in saving and restoring the operating context of a process, reducing the
amount of the information required for that context is an effective way to produce
faster context switches.
Section 4.1
Introduction to Process Management
Fair scheduling of processes is an involved task that is dependent on the types of
executable programs and on the goals of the scheduling policy. Programs are
characterized according to the amount of computation and the amount of I/O that
they do. Scheduling policies typically attempt to balance resource utilization
against the time that it takes for a program to complete. A process's priority is
periodically recalculated based on various parameters, such as the amount of CPU
time it has used, the amount of memory resources it holds or requires for execu­
tion, and so on. An exception to this rule is real-time scheduling, which must
ensure that processes finish by a specified deadline or in a particular order; the
4.4BSD kernel does not implement real-time scheduling.
4.4BSD uses a priority-based scheduling policy that is biased to favor interactive
programs, such as text editors, over long-running batch-type jobs. Interactive
programs tend to exhibit short bursts of computation followed by periods of
inactivity or I/O. The scheduling policy initially assigns to each process a high
execution priority and allows that process to execute for a fixed time slice. Processes
that execute for the duration of their slice have their priority lowered, whereas
processes that give up the CPU (usually because they do I/O) are allowed to remain at
their priority. Processes that are inactive have their priority raised. Thus, jobs that
use large amounts of CPU time sink rapidly to a low priority, whereas interactive
jobs that are mostly inactive remain at a high priority so that, when they are ready
to run, they will preempt the long-running lower-priority jobs. An interactive job,
such as a text editor searching for a string, may become compute bound briefly,
and thus get a lower priority, but it will return to a high priority when it is inactive
again while the user thinks about the result.
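The feedback rule described above can be sketched as a toy model in C. This is not the 4.4BSD priority computation; the names (toy_adjust_priority and the toy_event values) and the fixed step of 4 are assumptions made only for illustration. The sketch uses the convention, explained later in this chapter, that a numerically lower value means a higher priority.

```c
#include <assert.h>

/* Hypothetical bounds and step size; a lower number is a higher priority. */
enum { TOY_PRI_BEST = 50, TOY_PRI_WORST = 127, TOY_PRI_STEP = 4 };

/* What the process did since its priority was last examined. */
enum toy_event {
    TOY_USED_FULL_SLICE,    /* ran for its whole time slice */
    TOY_GAVE_UP_CPU,        /* blocked early, usually to do I/O */
    TOY_WAS_INACTIVE        /* did not run at all */
};

/* Keep a priority inside the allowed range. */
static int toy_clamp(int pri)
{
    if (pri < TOY_PRI_BEST)
        return TOY_PRI_BEST;
    if (pri > TOY_PRI_WORST)
        return TOY_PRI_WORST;
    return pri;
}

/* Return the new priority after one observation period. */
int toy_adjust_priority(int pri, enum toy_event ev)
{
    assert(pri >= TOY_PRI_BEST && pri <= TOY_PRI_WORST);
    switch (ev) {
    case TOY_USED_FULL_SLICE:
        return toy_clamp(pri + TOY_PRI_STEP);   /* priority lowered */
    case TOY_WAS_INACTIVE:
        return toy_clamp(pri - TOY_PRI_STEP);   /* priority raised */
    case TOY_GAVE_UP_CPU:
    default:
        return pri;                             /* priority unchanged */
    }
}
```

A compute-bound job that repeatedly reports TOY_USED_FULL_SLICE drifts toward TOY_PRI_WORST, whereas a mostly idle job drifts back toward TOY_PRI_BEST.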
The system also needs a scheduling policy to deal with problems that arise
from not having enough main memory to hold the execution contexts of all
processes that want to execute. The major goal of this scheduling policy is to
minimize thrashing, a phenomenon that occurs when memory is in such short supply
that more time is spent in the system handling page faults and scheduling
processes than in user mode executing application code.
The system must both detect and eliminate thrashing. It detects thrashing by
observing the amount of free memory. When the system has few free memory
pages and a high rate of new memory requests, it considers itself to be thrashing.
The system reduces thrashing by marking the least-recently run process as not
being allowed to run. This marking allows the pageout daemon to push all the
pages associated with the process to backing store. On most architectures, the
kernel also can push to backing store the user area of the marked process. The effect
of these actions is to cause the process to be swapped out (see Section 5.12). The
memory freed by blocking the process can then be distributed to the remaining
processes, which usually can then proceed. If the thrashing continues, additional
processes are selected for being blocked from running until enough memory
becomes available for the remaining processes to run effectively. Eventually,
enough processes complete and free their memory that blocked processes can
resume execution. However, even if there is not enough memory, the blocked
processes are allowed to resume execution after about 20 seconds. Usually, the
thrashing condition will return, requiring that some other process be selected for
being blocked (or that an administrative action be taken to reduce the load).
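The detection and victim-selection policy just described can be sketched in C as follows. This is a simplified illustration, not kernel code: the structure toy_proc, its last_run field, and the threshold parameters are all hypothetical, and the 20-second release of blocked processes is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process record used only for this sketch. */
struct toy_proc {
    int pid;
    unsigned long last_run;  /* assumed tick at which the process last ran */
    int blocked;             /* nonzero once marked as not allowed to run */
};

/*
 * Thrashing test from the text: few free pages together with a high
 * rate of new memory requests. Both thresholds are caller-supplied
 * assumptions.
 */
int toy_is_thrashing(unsigned free_pages, unsigned min_free,
                     unsigned request_rate, unsigned max_rate)
{
    return free_pages < min_free && request_rate > max_rate;
}

/*
 * Mark the least-recently run process that is not already blocked.
 * Returns the chosen process, or NULL if every process is blocked.
 */
struct toy_proc *toy_block_victim(struct toy_proc *procs, size_t n)
{
    struct toy_proc *victim = NULL;
    size_t i;

    assert(procs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (procs[i].blocked)
            continue;
        if (victim == NULL || procs[i].last_run < victim->last_run)
            victim = &procs[i];
    }
    if (victim != NULL)
        victim->blocked = 1;
    return victim;
}
```

Calling toy_block_victim() repeatedly while toy_is_thrashing() remains true corresponds to selecting additional processes for blocking until enough memory becomes available.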
The orientation of the scheduling policy toward an interactive job mix reflects
the original design of 4.4BSD for use in a time-sharing environment. Numerous
papers have been written about alternative scheduling policies, such as those used
in batch-processing environments or real-time systems. Usually, these policies
require changes to the system in addition to alteration of the scheduling policy
[Khanna et al., 1992].
Process State
The layout of process state was completely reorganized in 4.4BSD. The goal was
to support multiple threads that share an address space and other resources.
Threads have also been called lightweight processes in other systems. A thread is
the unit of execution of a process; it requires an address space and other resources,
but it can share many of those resources with other threads. Threads sharing an
address space and other resources are scheduled independently, and can all do sys­
tem calls simultaneously. The reorganization of process state in 4.4BSD was
designed to support threads that can select the set of resources to be shared, known
as variable-weight processes [Aral et al., 1989]. Unlike some other implementations
of threads, the BSD model associates a process ID with each thread, rather
than with a collection of threads sharing an address space.
Figure 4.1 Process state. [Diagram not reproduced. Labeled elements: process
group, process credential, user credential, region list, file descriptors, file
entries, signal actions, process information, and the user structure containing
the process control block and the process kernel stack.]
Section 4.2
Process State
The developers did the reorganization by moving many components of process
state from the process and user structures into separate substructures for each
type of state information, as shown in Fig. 4.1. The process structure references
all the substructures directly or indirectly. The use of global variables in the user
structure was completely eliminated. Variables moved out of the user structure
include the open file descriptors that may need to be shared among different
threads, as well as system-call parameters and error returns. The process structure
itself was also shrunk to about one-quarter of its former size. The idea is to
minimize the amount of storage that must be allocated to support a thread. The
4.4BSD distribution did not have kernel-thread support enabled, primarily because
the C library had not been rewritten to be able to handle multiple threads.
All the information in the substructures shown in Fig. 4.1 can be shared
among threads running within the same address space, except the per-thread
statistics, the signal actions, and the per-thread kernel stack. These unshared structures
need to be accessible only when the thread may be scheduled, so they are
allocated in the user structure so that they can be moved to secondary storage when
memory resources are low. The following sections describe the portions of these
structures that are relevant to process management. The VM space and its related
structures are described more fully in Chapter 5.
The Process Structure
In addition to the references to the substructures, the process entry shown in Fig.
4.1 contains the following categories of information:
• Process identification. The process identifier and the parent-process identifier
• Scheduling. The process priority, user-mode scheduling priority, recent CPU
utilization, and amount of time spent sleeping
• Process state. The run state of a process (runnable, sleeping, stopped);
additional status flags; if the process is sleeping, the wait channel, the identity of the
event for which the process is waiting (see Section 4.3), and a pointer to a string
describing the event
• Signal state. Signals pending delivery, signal mask, and summary of signal actions
• Tracing. Process tracing information
• Machine state. The machine-dependent process information
• Timers. Real-time timer and CPU-utilization counters
The process substructures shown in Fig. 4.1 have the following categories of
information:
• Process-group identification. The process group and the session to which the
process belongs
• User credentials. The real, effective, and saved user and group identifiers
• Memory management. The structure that describes the allocation of virtual
address space used by the process
• File descriptors. An array of pointers to file entries indexed by the process open
file descriptors; also, the open file flags and current directory
• Resource accounting. The rusage structure that describes the utilization of the
many resources provided by the system (see Section 3.8)
• Statistics. Statistics collected while the process is running that are reported
when it exits and are written to the accounting file; also, includes process timers
and profiling information if the latter is being collected
• Signal actions. The action to take when a signal is posted to a process
• User structure. The contents of the user structure (described later in this section)
A process's state has a value, as shown in Table 4.1. When a process is first
created with a fork system call, it is initially marked as SIDL. The state is changed to
SRUN when enough resources are allocated to the process for the latter to begin
execution. From that point onward, a process's state will fluctuate among SRUN
(runnable; e.g., ready to execute), SSLEEP (waiting for an event), and SSTOP
(stopped by a signal or the parent process), until the process terminates. A
deceased process is marked as SZOMB until its termination status is
communicated to its parent process.
The system organizes process structures into two lists. Process entries are on
the zombproc list if the process is in the SZOMB state; otherwise, they are on the
allproc list. The two queues share the same linkage pointers in the process
structure, since the lists are mutually exclusive. Segregating the dead processes from
the live ones reduces the time spent both by the wait system call, which must scan
the zombies for potential candidates to return, and by the scheduler and other
functions that must scan all the potentially runnable processes.
Table 4.1 Process states.

State     Description
SIDL      intermediate state in process creation
SRUN      runnable
SSLEEP    awaiting an event
SSTOP     process stopped or being traced
SZOMB     intermediate state in process termination
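The state transitions and the choice of list described above can be summarized in a short C sketch. The TOY_ names, the numeric values, and both helper functions are assumptions made for illustration; they are not the kernel's definitions, and the sketch permits only the transitions that the text mentions.

```c
#include <assert.h>

/* State names follow the text; the numeric values are illustrative. */
enum toy_pstate {
    TOY_SIDL = 1,   /* intermediate state in process creation */
    TOY_SRUN,       /* runnable */
    TOY_SSLEEP,     /* awaiting an event */
    TOY_SSTOP,      /* stopped by a signal or the parent */
    TOY_SZOMB       /* intermediate state in process termination */
};

/* True for the three states among which a live process fluctuates. */
static int toy_is_live_state(enum toy_pstate s)
{
    return s == TOY_SRUN || s == TOY_SSLEEP || s == TOY_SSTOP;
}

/* Return nonzero if the text describes a move from one state to another. */
int toy_can_transition(enum toy_pstate from, enum toy_pstate to)
{
    assert(from >= TOY_SIDL && from <= TOY_SZOMB);
    if (from == TOY_SIDL)
        return to == TOY_SRUN;          /* resources allocated */
    if (toy_is_live_state(from))
        return to == TOY_SZOMB || (toy_is_live_state(to) && to != from);
    return 0;                           /* SZOMB is the final state */
}

/* Name of the list that holds a process in the given state. */
const char *toy_list_for(enum toy_pstate s)
{
    return s == TOY_SZOMB ? "zombproc" : "allproc";
}
```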
Most processes, except the currently executing process, are also in one of two
queues: a run queue or a sleep queue. Processes that are in a runnable state are
placed on a run queue, whereas processes that are blocked awaiting an event are
located on a sleep queue. Stopped processes not also awaiting an event are on
neither type of queue. The two queues share the same linkage pointers in the process
structure, since the lists are mutually exclusive. The run queues are organized
according to process-scheduling priority, and are described in Section 4.4. The
sleep queues are organized in a hashed data structure that optimizes finding of a
sleeping process by the event number (wait channel) for which the process is wait­
ing. The sleep queues are described in Section 4.3.
Every process in the system is assigned a unique identifier termed the process
identifier (PID). PIDs are the common mechanism used by applications and by
the kernel to reference processes. PIDs are used by applications when the latter
are sending a signal to a process and when receiving the exit status from a
deceased process. Two PIDs are of special importance to each process: the PID of
the process itself and the PID of the process's parent process.
The p_pglist list and related lists (p_pptr, p_children, and p_sibling) are used
in locating related processes, as shown in Fig. 4.2. When a process spawns a child
process, the child process is added to its parent's p_children list. The child
process also keeps a backward link to its parent in its p_pptr field. If a process has
more than one child process active at a time, the children are linked together
through their p_sibling list entries. In Fig. 4.2, process B is a direct descendent of
process A, whereas processes C, D, and E are descendents of process B and are
siblings of one another. Process B typically would be a shell that started a
pipeline (see Sections 2.4 and 2.6) including processes C, D, and E. Process A
probably would be the system-initialization process init (see Section 3.1 and
Section 14.6).
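The parent, child, and sibling links described above can be sketched in C. This is a simplified model: the field names come from the text, but the structure toy_proc, the use of plain singly linked pointers, and the helper functions are assumptions made for illustration rather than the kernel's actual declarations.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified process entry holding only the hierarchy links. */
struct toy_proc {
    int p_pid;
    struct toy_proc *p_pptr;      /* backward link to the parent */
    struct toy_proc *p_children;  /* first child, or NULL */
    struct toy_proc *p_sibling;   /* next child of the same parent, or NULL */
};

/* Link a newly created child into its parent's list of children. */
void toy_add_child(struct toy_proc *parent, struct toy_proc *child)
{
    assert(parent != NULL && child != NULL && child->p_pptr == NULL);
    child->p_pptr = parent;
    child->p_sibling = parent->p_children;  /* push on the front */
    parent->p_children = child;
}

/* Count the children of a process by walking the sibling links. */
size_t toy_count_children(const struct toy_proc *parent)
{
    const struct toy_proc *p;
    size_t n = 0;

    for (p = parent->p_children; p != NULL; p = p->p_sibling)
        n++;
    return n;
}
```

Building the hierarchy of Fig. 4.2 amounts to adding B as a child of A, and then adding C, D, and E as children of B.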
Figure 4.2 Process-group hierarchy. [Diagram not reproduced; it shows processes
A through E linked as described in the text, including the p_pptr back links.]

CPU time is made available to processes according to their scheduling priority.
A process has two scheduling priorities, one for scheduling user-mode execution
and one for scheduling kernel-mode execution. The p_usrpri field in the process
structure contains the user-mode scheduling priority, whereas the p_priority field
holds the current kernel-mode scheduling priority. The current priority may be
different from the user-mode priority when the process is executing in kernel
mode. Priorities range between 0 and 127, with a lower value interpreted as a
higher priority (see Table 4.2). User-mode priorities range from PUSER (50) to
127; priorities less than PUSER are used only when a process is asleep (that is,
awaiting an event in the kernel) and immediately after such a process is
awakened. Processes in the kernel are given a higher priority because they typically
hold shared kernel resources when they awaken. The system wants to run them as
quickly as possible once they get a resource, so that they can use the resource and
return it before another process requests it and gets blocked waiting for it.

Table 4.2 Process-scheduling priorities.

Priority  Value  Description
PSWP        0    priority while swapping process
PVM         4    priority while waiting for memory
PINOD       8    priority while waiting for file control information
PRIBIO     16    priority while waiting on disk I/O completion
PVFS       20    priority while waiting for a kernel-level filesystem lock
PZERO      22    baseline priority
PSOCK      24    priority while waiting on a socket
PWAIT      32    priority while waiting for a child to exit
PLOCK      36    priority while waiting for user-level filesystem lock
PPAUSE     40    priority while waiting for a signal to arrive
PUSER      50    base priority for user-mode execution
Historically, a kernel process that is asleep with a priority in the range PZERO
to PUSER would be awakened by a signal; that is, it might be awakened and
marked runnable if a signal is posted to it. A process asleep at a priority below
PZERO would never be awakened by a signal. In 4.4BSD, a kernel process will be
awakened by a signal only if it sets the PCATCH flag when it sleeps. The PCATCH
flag was added so that a change to a sleep priority does not inadvertently cause a
change to the process's interruptibility.
For efficiency, the sleep interface has been divided into two separate entry
points: sleep() for brief, noninterruptible sleep requests, and tsleep() for longer,
possibly interrupted sleep requests. The sleep() interface is short and fast, to
handle the common case of a short sleep. The tsleep() interface handles all the special
cases including interruptible sleeps, sleeps limited to a maximum time duration,
and the processing of restartable system calls. The tsleep() interface also includes
a reference to a string describing the event that the process awaits; this string is
externally visible. The decision of whether to use an interruptible sleep is
dependent on how long the process may be blocked. Because it is complex to be
prepared to handle signals in the midst of doing some other operation, many sleep
requests are not interruptible; that is, a process will not be scheduled to run until
the event for which it is waiting occurs. For example, a process waiting for disk
I/O will sleep at an uninterruptible priority.
For quickly occurring events, delaying to handle a signal until after they
complete is imperceptible. However, requests that may cause a process to sleep for a
long period, such as while a process is waiting for terminal or network input, must
be prepared to have their sleep interrupted so that the posting of signals is not
delayed indefinitely. Processes that sleep at interruptible priorities may abort their
system call because of a signal arriving before the event for which they are
waiting has occurred. To avoid holding a kernel resource permanently, these processes
must check why they have been awakened. If they were awakened because of a
signal, they must release any resources that they hold. They must then return the
error passed back to them by tsleep(), which will be EINTR if the system call is to
be aborted after the signal, or ERESTART if it is to be restarted. Occasionally, an
event that is supposed to occur quickly, such as a tape I/O, will get held up
because of a hardware failure. Because the process is sleeping in the kernel at an
uninterruptible priority, it will be impervious to any attempts to send it a signal,
even a signal that should cause it to exit unconditionally. The only solution to this
problem is to change sleep()s on hardware events that may hang to be
interruptible. In the remainder of this book, we shall always use sleep() when referencing
the routine that puts a process to sleep, even when the tsleep() interface may be
the one that is being used.
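The rule that an interrupted sleeper must release its resources and pass the error back can be illustrated with a user-space sketch. Everything here is hypothetical scaffolding: toy_tsleep() is a stub loosely modeled on the tsleep() interface described above, and the values chosen for TOY_PCATCH, TOY_ERESTART, and the sleep priority are assumptions rather than kernel definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_PCATCH    0x100   /* assumed flag value: sleep is interruptible */
#define TOY_ERESTART  (-1)    /* assumed value: restart the system call */
#define TOY_PRI_INPUT 24      /* assumed sleep priority for this example */

/* Hypothetical kernel resource held across the sleep. */
struct toy_resource {
    int held;
};

/* Result that the stub reports: 0, EINTR, or TOY_ERESTART. */
static int toy_next_wakeup_result;

/* Stub standing in for an interruptible sleep; it never blocks. */
static int toy_tsleep(void *chan, int pri, const char *wmesg, int timo)
{
    (void)chan; (void)pri; (void)wmesg; (void)timo;
    return toy_next_wakeup_result;
}

/* Wait for input on chan, releasing the resource if a signal arrives. */
int toy_wait_for_input(struct toy_resource *res, void *chan)
{
    int error;

    assert(res != NULL);
    res->held = 1;                        /* acquire before sleeping */
    error = toy_tsleep(chan, TOY_PRI_INPUT | TOY_PCATCH, "ttyin", 0);
    if (error != 0) {
        res->held = 0;                    /* awakened by a signal: let go */
        return error;                     /* EINTR or TOY_ERESTART */
    }
    /* The awaited event occurred; use the resource, then release it. */
    res->held = 0;
    return 0;
}
```

The essential point is the check of the return value: the caller cannot assume that the awaited event has occurred merely because the sleep returned.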
The User Structure
The user structure contains the process state that may be swapped to secondary
storage. The structure was an important part of the early UNIX kernels; it stored
much of the state for each process. As the system has evolved, this state has
migrated to the process entry or one of its substructures, so that it can be shared.
In 4.4BSD, nearly all references to the user structure have been removed. The
only place where user-structure references still exist is the fork system call,
where the new process entry has pointers set up to reference the two remaining
structures that are still allocated in the user structure. Other parts of the kernel
that reference these structures are unaware that the latter are located in the user
structure; the structures are always referenced from the pointers in the process
table. Changing them to dynamically allocated structures would require code
changes in only fork to allocate them, and exit to free them. The user-structure
state includes
• The user- and kernel-mode execution states
• The accounting information
• The signal-disposition and signal-handling state
• Selected process information needed by the debuggers and in core dumps
• The per-process execution stack for the kernel
The current execution state of a process is encapsulated in a process control block
(PCB). This structure is allocated in the user structure and is defined by the
machine architecture; it includes the general-purpose registers, stack pointers,
program counter, processor-status longword, and memory-management registers.
Historically, the user structure was mapped to a fixed location in the virtual
address space. There were three reasons for using a fixed mapping:
1. On many architectures, the user structure could be mapped into the top of the
user-process address space. Because the user structure was part of the user
address space, its context would be saved as part of saving of the user-process
state, with no additional effort.
2. The data structures contained in the user structure (also called the u-dot (u.)
structure, because all references in C were of the form u.) could always be
addressed at a fixed address.
3. When a parent forks, its run-time stack is copied for its child. Because the
kernel stack is part of the u. area, the child's kernel stack is mapped to the
same addresses as its parent kernel stack. Thus, all its internal references,
such as frame pointers and stack-variable references, work as expected.
On modern architectures with virtual address caches, mapping the user structure to
a fixed address is slow and inconvenient. Thus, reason 1 no longer holds. Since
the user structure is never referenced by most of the kernel code, reason 2 no
longer holds. Only reason 3 remains as a requirement for use of a fixed mapping.
Some architectures in 4.4BSD remove this final constraint, so that they no longer
need to provide a fixed mapping. They do so by copying the parent stack to the
child-stack location. The machine-dependent code then traverses the stack,
relocating the embedded stack and frame pointers. On return to the
machine-independent fork code, no further references are made to local variables; everything just
returns all the way back out of the kernel.
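The idea of copying a stack and then relocating its embedded frame pointers can be shown with a toy model in C. Real implementations are machine dependent and must follow the frame layout of the particular architecture; here a stack is assumed to be an array of words in which each frame's first word holds the address of the caller's frame, with 0 ending the chain. The function name and this layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy nwords of the parent stack to the child location, then walk the
 * chain of saved frame pointers in the copy, adjusting each one so that
 * it refers to the child stack instead of the parent stack.
 * Returns the child's equivalent of parent_fp.
 */
uintptr_t *toy_relocate_frames(uintptr_t *child, const uintptr_t *parent,
                               size_t nwords, const uintptr_t *parent_fp)
{
    /* Distance between the stacks; unsigned arithmetic wraps safely. */
    uintptr_t delta = (uintptr_t)child - (uintptr_t)parent;
    uintptr_t *top, *fp;

    assert(child != NULL && parent != NULL);
    assert(parent_fp >= parent && parent_fp < parent + nwords);

    memcpy(child, parent, nwords * sizeof *child);

    top = (uintptr_t *)((uintptr_t)parent_fp + delta);
    for (fp = top; *fp != 0; fp = (uintptr_t *)*fp)
        *fp += delta;           /* saved pointer now refers to the copy */
    return top;
}
```

After the call, following the frame chain from the returned pointer never leaves the child's copy, which is the property that makes a fixed mapping unnecessary.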
The location of the kernel stack in the user structure simplifies context
switching by localizing all a process's kernel-mode state in a single structure. The kernel
stack grows down from the top of the user structure toward the data structures
allocated at the other end. This design restricts the stack to a fixed size. Because
the stack traps page faults, it must be allocated and memory resident before the
process can run. Thus, it is not only a fixed size, but also small; usually it is
allocated only one or two pages of physical memory. Implementors must be careful
when writing code that executes in the kernel to avoid using large local variables
and deeply nested subroutine calls, to avoid overflowing the run-time stack. As a
safety precaution, some architectures leave an invalid page between the area for
the run-time stack and the page holding the other user-structure contents. Thus,
overflowing the kernel stack will cause a kernel-access fault, instead of
disastrously overwriting the fixed-sized portion of the user structure. On some
architectures, interrupt processing takes place on a separate interrupt stack, and the size
of the kernel stack in the user structure restricts only that code executed as a result
of traps and system calls.
Section 4.3
Context Switching
Context Switching
The kernel switches among processes in an effort to share the CPU effectively; this
activity is called context switching. When a process executes for the duration of
its time slice or when it blocks because it requires a resource that is currently
unavailable, the kernel finds another process to run and context switches to it. The
system can also interrupt the currently executing process to service an asyn­
chronous event, such as a device interrupt. Although both scenarios involve
switching the execution context of the CPU, switching between processes occurs
synchronously with respect to the currently executing process, whereas servicing
interrupts occurs asynchronously with respect to the current process. In addition,
interprocess context switches are classified as voluntary or involuntary. A volun­
tary context switch occurs when a process blocks because it requires a resource
that is unavailable. An involuntary context switch takes place when a process
executes for the duration of its time slice or when the system identifies a higher­
priority process to run.
Each type of context switching is done through a different interface.
Voluntary context switching is initiated with a call to the sleep() routine, whereas an
involuntary context switch is forced by direct invocation of the low-level
context-switching mechanism embodied in the mi_switch() and setrunnable() routines.
Asynchronous event handling is managed by the underlying hardware and is
effectively transparent to the system. Our discussion will focus on how asynchronous
event handling relates to synchronizing access to kernel data structures.
Process State
Context switching between processes requires that both the kernel- and user-mode
context be changed; to simplify this change, the system ensures that all a process's
user-mode state is located in one data structure: the user structure (most kernel
state is kept elsewhere). The following conventions apply to this localization:
• Kernel-mode hardware-execution state. Context switching can take place in
only kernel mode. Thus, the kernel's hardware-execution state is defined by the
contents of the PCB that is located at the beginning of the user structure.
• User-mode hardware-execution state. When execution is in kernel mode, the
user-mode state of a process (such as copies of the program counter, stack pointer,
and general registers) always resides on the kernel's execution stack that is located
in the user structure. The kernel ensures this location of user-mode state by
requiring that the system-call and trap handlers save the contents of the user-mode
execution context each time that the kernel is entered (see Section 3.1).
• The process structure. The process structure always remains resident in
main memory.
• Memory resources. Memory resources of a process are effectively described by
the contents of the memory-management registers located in the PCB and by the
values present in the process structure. As long as the process remains in
memory, these values will remain valid, and context switches can be done
without the associated page tables being saved and restored. However, these
values need to be recalculated when the process returns to main memory after being
swapped to secondary storage.
Low-Level Context Switching
The localization of the context of a process in the latter's user structure permits the
kernel to do context switching simply by changing the notion of the current user
structure and process structure, and restoring the context described by the PCB
within the user structure (including the mapping of the virtual address space).
Whenever a context switch is required, a call to the mi_switch() routine causes the
highest-priority process to run. The mi_switch() routine first selects the
appropriate process from the scheduling queues, then resumes the selected process by
loading that process's context from its PCB. Once mi_switch() has loaded the
execution state of the new process, it must also check the state of the new process
for a nonlocal return request (such as when a process first starts execution after a
fork; see Section 4.5).
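The selection step, choosing the highest-priority process that is ready to run, can be sketched in C. This is only an illustration of the selection rule: the structure toy_proc and the linear scan over an array are assumptions, the kernel's actual scheduling queues are not modeled, and the loading of the PCB is omitted because it is machine dependent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process record; a lower p_priority is a higher priority. */
struct toy_proc {
    int p_pid;
    int p_priority;
    int p_runnable;   /* nonzero if the process is ready to run */
};

/*
 * Return the runnable process with the numerically lowest priority
 * value, or NULL if nothing is runnable. Ties go to the earliest entry.
 */
struct toy_proc *toy_choose_next(struct toy_proc *procs, size_t n)
{
    struct toy_proc *best = NULL;
    size_t i;

    assert(procs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!procs[i].p_runnable)
            continue;
        if (best == NULL || procs[i].p_priority < best->p_priority)
            best = &procs[i];
    }
    return best;
}
```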
Voluntary Context Switching
A voluntary context switch occurs whenever a process must await the availability
of a resource or the arrival of an event. Voluntary context switches happen
frequently in normal system operation. For example, a process typically blocks each
time that it requests data from an input device, such as a terminal or a disk. In
4.4BSD, voluntary context switches are initiated through the sleep() or tsleep()
routines. When a process no longer needs the CPU, it invokes sleep() with a
scheduling priority and a wait channel. The priority specified in a sleep() call is
the priority that should be assigned to the process when that process is awakened.
This priority does not affect the user-level scheduling priority.
The wait channel is typically the address of some data structure that identifies
the resource or event for which the process is waiting. For example, the address of
a disk buffer is used while the process is waiting for the buffer to be filled. When
the buffer is filled, processes sleeping on that wait channel will be awakened. In
addition to the resource addresses that are used as wait channels, there are some
addresses that are used for special purposes:
• The global variable lbolt is awakened by the scheduler once per second.
Processes that want to wait for up to 1 second can sleep on this global variable. For
example, the terminal-output routines sleep on lbolt while waiting for
output-queue space to become available. Because queue space rarely runs out, it is
easier simply to check for queue space once per second during the brief periods of
shortages than it is to set up a notification mechanism such as that used for
managing disk buffers. Programmers can also use the lbolt wait channel as a crude
watchdog timer when doing debugging.
• When a parent process does a wait system call to collect the termination status of
its children, it must wait for one of those children to exit. Since it cannot know
which of its children will exit first, and since it can sleep on only a single wait
channel, there is a quandary as to how to wait for the next of multiple events.
The solution is to have the parent sleep on its own process structure. When a
child exits, it awakens its parent's process-structure address, rather than its own.
Thus, the parent doing the wait will awaken independent of which child process
is the first to exit.
• When a process does a sigpause system call, it does not want to run until it
receives a signal. Thus, it needs to do an interruptible sleep on a wait channel
that will never be awakened. By convention, the address of the user structure is
given as the wait channel.
Sleeping processes are organized in an array of queues (see Fig. 4.3). The
sleep() and wakeup() routines hash wait channels to calculate an index into the
sleep queues. The sleep() routine takes the following steps in its operation:
1. Prevent interrupts that might cause process-state transitions by raising the
hardware-processor priority level to splhigh (hardware-processor priority
levels are explained in the next section).
2. Record the wait channel in the process structure, and hash the wait-channel
value to locate a sleep queue for the process.
3. Set the process's priority to the priority that the process will have when the
process is awakened, and set the SSLEEP flag.
4. Place the process at the end of the sleep queue selected in step 2.
5. Call mi_switch() to request that a new process be scheduled; the hardware
priority level is implicitly reset as part of switching to the other process.

Figure 4.3 Queueing structure for sleeping processes. [Diagram not reproduced:
a hash-table header whose entries point to the sleep queues.]
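Steps 2 through 4 of the procedure above can be sketched in user-space C. The table size, the hash function, and the structure and field names are assumptions made for illustration; steps 1 and 5 (raising the processor priority and calling mi_switch()) have no user-space equivalent and appear only as comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TABLESIZE 128   /* assumed number of sleep queues (power of 2) */
/* Assumed hash: discard low-order address bits, then mask to the table. */
#define TOY_HASH(chan) ((((uintptr_t)(chan)) >> 8) & (TOY_TABLESIZE - 1))

enum { TOY_SRUN = 2, TOY_SSLEEP = 3 };   /* illustrative state values */

struct toy_proc {
    struct toy_proc *p_forw;   /* next process on the same sleep queue */
    const void *p_wchan;       /* wait channel, or NULL if not sleeping */
    int p_priority;
    int p_stat;
};

/* One sleep queue: head pointer plus the address of the last link. */
struct toy_slpque {
    struct toy_proc *sq_head;
    struct toy_proc **sq_tailp;
};

static struct toy_slpque toy_slpque[TOY_TABLESIZE];

/* Put process p to sleep on chan at priority pri (steps 2 to 4 only). */
void toy_sleep_enqueue(struct toy_proc *p, const void *chan, int pri)
{
    struct toy_slpque *qp;

    assert(p != NULL && chan != NULL);
    /* Step 1 (block interrupts) is omitted in this sketch. */
    p->p_wchan = chan;                       /* step 2: record channel */
    qp = &toy_slpque[TOY_HASH(chan)];        /* step 2: pick a queue */
    p->p_priority = pri;                     /* step 3: wakeup priority */
    p->p_stat = TOY_SSLEEP;                  /* step 3: mark sleeping */
    p->p_forw = NULL;                        /* step 4: append at the end */
    if (qp->sq_head == NULL)
        qp->sq_head = p;
    else
        *qp->sq_tailp = p;
    qp->sq_tailp = &p->p_forw;
    /* Step 5 (switch to another process) is omitted in this sketch. */
}
```

Keeping a pointer to the last link makes the append in step 4 a constant-time operation regardless of how many processes are already queued.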
A sleeping process is not selected to execute until it is removed from a sleep
queue and is marked runnable. This operation is done by the wakeup() routine,
which is called to signal that an event has occurred or that a resource is available.
Wakeup() is invoked with a wait channel, and it awakens all processes sleeping on
that wait channel. All processes waiting for the resource are awakened to ensure
that none are inadvertently left sleeping. If only one process were awakened, it
might not request the resource on which it was sleeping, and so any other
processes waiting for that resource would be left sleeping forever. A process that
needs an empty disk buffer in which to write data is an example of a process that
may not request the resource on which it was sleeping. Such a process can use
any available buffer. If none is available, it will try to create one by requesting
that a dirty buffer be written to disk and then waiting for the I/O to complete.
When the I/O finishes, the process will awaken and will check for an empty buffer.
If several are available, it may not use the one that it cleaned, leaving any other
processes waiting for the buffer that it cleaned sleeping forever.
To avoid having excessive numbers of processes awakened, kernel programmers
try to use wait channels with fine enough granularity that unrelated uses will
not collide on the same resource. Thus, they put locks on each buffer in the buffer
cache, rather than putting a single lock on the buffer cache as a whole. The
problem of many processes awakening for a single resource is further mitigated on a
uniprocessor by the latter's inherently single-threaded operation. Although many
processes will be put into the run queue at once, only one at a time can execute.
Since the kernel is nonpreemptive, each process will run its system call to
completion before the next one will get a chance to execute. Unless the previous user of
the resource blocked in the kernel while trying to use the resource, each process
waiting for the resource will be able to get and use the resource when it is next run.
A wakeup() operation processes entries on a sleep queue from front to back.
For each process that needs to be awakened, wakeup()
1. Removes the process from the sleep queue
2. Recomputes the user-mode scheduling priority if the process has been sleeping
   longer than 1 second
3. Makes the process runnable if it is in a SSLEEP state, and places the process on
   the run queue if it is not swapped out of main memory; if the process has been
   swapped out, the swapin process will be awakened to load it back into memory
   (see Section 5.12); if the process is in a SSTOP state, it is left on the queue
   until it is explicitly restarted by a user-level process, either by a ptrace system
   call or by a continue signal (see Section 4.7)
If wakeup() moved any processes to the run queue and one of them had a
scheduling priority higher than that of the currently executing process, it will also request
that the CPU be rescheduled as soon as possible.
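The hashed sleep queues and the wake-all behavior of wakeup() can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: the table size of 128, the LOOKUP() hash, the sq_head/sq_tailp field names, and the sim_sleep()/sim_wakeup() function names are illustrative choices, and the simulation leaves out interrupt blocking, priorities, and the actual context switch.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLESIZE 128   /* assumed number of sleep queues (a power of 2) */
/* Assumed hash: drop the low address bits, then mask into the table. */
#define LOOKUP(chan) ((((uintptr_t)(chan)) >> 8) & (TABLESIZE - 1))

enum pstate { SRUN, SSLEEP };

struct proc {
    struct proc *p_forw;    /* next process on the same sleep queue */
    const void  *p_wchan;   /* wait channel; NULL when not sleeping */
    enum pstate  p_stat;
};

struct slpque {
    struct proc  *sq_head;
    struct proc **sq_tailp; /* address of the last p_forw link */
};
static struct slpque slpque[TABLESIZE];

/* Steps 2 through 4 of sleep(): record the channel, hash it, mark the
 * process asleep, and append the process to the end of the queue. */
void sim_sleep(struct proc *p, const void *chan)
{
    struct slpque *qp = &slpque[LOOKUP(chan)];

    p->p_wchan = chan;
    p->p_stat = SSLEEP;
    p->p_forw = NULL;
    if (qp->sq_head == NULL)
        qp->sq_head = p;
    else
        *qp->sq_tailp = p;
    qp->sq_tailp = &p->p_forw;
}

/* wakeup(): scan the hashed queue front to back and make every process
 * sleeping on chan runnable. Returns the number of processes awakened. */
int sim_wakeup(const void *chan)
{
    struct slpque *qp = &slpque[LOOKUP(chan)];
    struct proc **pp = &qp->sq_head, *p;
    int n = 0;

    while ((p = *pp) != NULL) {
        if (p->p_wchan == chan) {
            *pp = p->p_forw;                /* unlink from the queue */
            if (qp->sq_tailp == &p->p_forw)
                qp->sq_tailp = pp;          /* removed the last entry */
            p->p_forw = NULL;
            p->p_wchan = NULL;
            p->p_stat = SRUN;
            n++;
        } else {
            pp = &p->p_forw;                /* different channel, same bucket */
        }
    }
    return n;
}
```

Because several wait channels can hash to the same queue, sim_wakeup() must compare p_wchan for every entry rather than awaken the whole bucket.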
The most common use of sleep() and wakeup() is in scheduling access to
shared data structures; this use is described in the next section on synchronization.
Interprocess synchronization to a resource typically is implemented by
associating two flags with the resource: a locked flag and a wanted flag. When a
process wants to access a resource, it first checks the locked flag. If the resource is
not currently in use by another process, this flag should not be set, and the process
can simply set the locked flag and use the resource. If the resource is in use,
however, the process should set the wanted flag and call sleep() with a wait channel
associated with the resource (typically the address of the data structure used to
describe the resource). When a process no longer needs the resource, it clears the
locked flag and, if the wanted flag is set, invokes wakeup() to awaken all the
processes that called sleep() to await access to the resource.
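The locked/wanted protocol can be sketched as two small helpers. This is an illustrative model, not kernel code: struct resource and both function names are hypothetical, and instead of actually blocking, res_try_acquire() reports that the caller would have to sleep() on the resource's address and retry after being awakened.

```c
#include <stdbool.h>

/* Hypothetical descriptor for a shared resource. */
struct resource {
    bool locked;    /* set while some process is using the resource */
    bool wanted;    /* set when at least one process is waiting for it */
};

/* Returns true if the caller now holds the resource. On false, the wanted
 * flag has been set; in the kernel the caller would sleep() on the address
 * of the resource and recheck the locked flag after waking up. */
bool res_try_acquire(struct resource *r)
{
    if (r->locked) {
        r->wanted = true;
        return false;
    }
    r->locked = true;
    return true;
}

/* Clears the locked flag. Returns true if the caller must invoke wakeup()
 * on the address of the resource because some process was waiting. */
bool res_release(struct resource *r)
{
    bool need_wakeup = r->wanted;

    r->locked = false;
    r->wanted = false;
    return need_wakeup;
}
```

The check-then-set in res_try_acquire() is safe without atomic operations only under the assumption the text relies on: a process in kernel mode is not preempted by another process.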
Routines that run in the bottom half of the kernel do not have a context and
consequently cannot wait for a resource to become available by calling sleep().
When the top half of the kernel accesses resources that are shared with the bottom
half of the kernel, it cannot use the locked flag to ensure exclusive use. Instead, it
must prevent the bottom half from running while it is using the resource.
Synchronizing access with routines that execute in the bottom half of the kernel
requires knowledge of when these routines may run. Although interrupt priorities
are machine dependent, most implementations of 4.4BSD order them according to
Table 4.3. To block interrupt routines at and below a certain priority level, a
critical section must make an appropriate set-priority-level call. All the
set-priority-level calls return the previous priority level. When the critical section
is done, the priority is returned to its previous level using splx(). For example,
when a process needs to manipulate a terminal's data queue, the code that accesses
the queue is written in the following style:

    s = spltty();       /* raise priority to block tty processing */
                        /* manipulate tty */
    splx(s);            /* reset priority level to previous value */

Table 4.3 Interrupt-priority assignments, ordered from lowest to highest.

    Name              Blocks
    spl0()            nothing (normal operating mode)
    splsoftclock()    low-priority clock processing
    splnet()          network protocol processing
    spltty()          terminal multiplexers and low-priority devices
    splbio()          disk and tape controllers and high-priority devices
    splimp()          network device controllers
    splclock()        high-priority clock processing
    splhigh()         all interrupt activity
Processes must take care to avoid deadlocks when locking multiple resources.
Suppose that two processes, A and B, require exclusive access to two resources,
R1 and R2, to do some operation. If process A acquires R1 and process B acquires
R2, then a deadlock occurs when process A tries to acquire R2 and process B tries
to acquire R1. Since a 4.4BSD process executing in kernel mode is never
preempted by another process, locking of multiple resources is simple, although it
must be done carefully. If a process knows that multiple resources are required to
do an operation, then it can safely lock one or more of those resources in any
order, as long as it never relinquishes control of the CPU. If, however, a process
cannot acquire all the resources that it needs, then it must release any resources
that it holds before calling sleep() to wait for the currently inaccessible resource
to become available.
Alternatively, if resources can be partially ordered, it is necessary only that
they be allocated in an increasing order. For example, as the namei ( ) routine tra­
verses the filesystem name space, it must lock the next component of a pathname
before it relinquishes the current component. A partial ordering of pathname
components exists from the root of the name space to the leaves. Thus, transla­
tions down the name tree can request a lock on the next component without con­
cern for deadlock. However, when it is traversing up the name tree (i.e., following
a pathname component of dot-dot ( .. )), the kernel must take care to avoid sleeping
while holding any locks.
Raising the processor priority level to guard against interrupt activity works
for a uniprocessor architecture, but not for a shared-memory multiprocessor
machine. Similarly, much of the 4.4BSD kernel implicitly assumes that kernel
processing will never be done concurrently. Numerous vendors, such as Sequent,
OSF/1, AT&T, and Sun Microsystems, have redesigned the synchronization
schemes and have eliminated the uniprocessor assumptions implicit in the
standard UNIX kernel, so that UNIX will run on tightly coupled multiprocessor
architectures [Schimmel, 1994].
Process Scheduling
4.4BSD uses a process-scheduling algorithm based on multilevel feedback queues.
All processes that are runnable are assigned a scheduling priority that determines
in which run queue they are placed. In selecting a new process to run, the system
scans the run queues from highest to lowest priority and chooses the first process
on the first nonempty queue. If multiple processes reside on a queue, the system
runs them round robin; that is, it runs them in the order that they are found on the
queue, with equal amounts of time allowed. If a process blocks, it is not put back
onto any run queue. If a process uses up the time quantum (or time slice) allowed
it, it is placed at the end of the queue from which it came, and the process at the
front of the queue is selected to run.
The shorter the time quantum, the better the interactive response. However,
longer time quanta provide higher system throughput, because the system will
have less overhead from doing context switches, and processor caches will be
flushed less often. The time quantum used by 4.4BSD is 0.1 second. This value
was empirically found to be the longest quantum that could be used without loss
of the desired response for interactive jobs such as editors. Perhaps surprisingly,
the time quantum has remained unchanged over the past 15 years. Although the
time quantum was originally selected on centralized timesharing systems with
many users, it is still correct for decentralized workstations today. Although
workstation users expect a response time faster than that anticipated by the
timesharing users of 10 years ago, the shorter run queues on the typical workstation
make a shorter quantum unnecessary.
The system adjusts the priority of a process dynamically to reflect resource
requirements (e.g., being blocked awaiting an event) and the amount of resources
consumed by the process (e.g., CPU time). Processes are moved between run
queues based on changes in their scheduling priority (hence the word feedback in
the name multilevel feedback queue). When a process other than the currently
running process attains a higher priority (by having that priority either assigned or
given when it is awakened), the system switches to that process immediately if the
current process is in user mode. Otherwise, the system switches to the
higher-priority process as soon as the current process exits the kernel. The system tailors
this short-term scheduling algorithm to favor interactive jobs by raising the
scheduling priority of processes that are blocked waiting for I/O for 1 or more
seconds, and by lowering the priority of processes that accumulate significant
amounts of CPU time.
Short-term process scheduling is broken up into two parts. The next section
describes when and how a process's scheduling priority is altered; the section after
describes the management of the run queues and the interaction between process
scheduling and context switching.
Calculations of Process Priority
A process's scheduling priority is determined directly by two values contained in
the process structure: p_estcpu and p_nice. The value of p_estcpu provides an
estimate of the recent CPU utilization of the process. The value of p_nice is a
user-settable weighting factor that ranges numerically between -20 and 20. The
normal value for p_nice is 0. Negative values increase a process's priority,
whereas positive values decrease its priority.
A process's user-mode scheduling priority is calculated every four clock ticks
(typically 40 milliseconds) by this equation:
\[
p\_usrpri = \mathrm{PUSER} + \left[ \frac{p\_estcpu}{4} \right] + 2 \times p\_nice. \qquad \text{(Eq. 4.1)}
\]
Values less than PUSER are set to PUSER (see Table 4.2); values greater than 127
are set to 127. This calculation causes the priority to decrease linearly based on
recent CPU utilization. The user-controllable p_nice parameter acts as a limited
weighting factor. Negative values retard the effect of heavy CPU utilization by
offsetting the additive term containing p_estcpu. Otherwise, if we ignore the
second term, p_nice simply shifts the priority by a constant factor.
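The priority calculation of Eq. 4.1 and its clamping can be written out as a short C function. This is a sketch, not the kernel's setpriority() routine: the function name is made up, the equation is taken to be PUSER plus p_estcpu divided by 4 plus twice p_nice, and PUSER is assumed to be 50, the bottom of the user-mode priority range.

```c
#define PUSER  50   /* assumed base user-mode priority */
#define MAXPRI 127  /* largest (least favorable) priority value */

/* Computes the user-mode scheduling priority from the CPU-usage estimate
 * and the nice value, then clamps the result to [PUSER, MAXPRI].
 * Numerically larger values mean lower priority. */
int user_priority(int p_estcpu, int p_nice)
{
    int pri = PUSER + p_estcpu / 4 + 2 * p_nice;

    if (pri < PUSER)
        pri = PUSER;
    if (pri > MAXPRI)
        pri = MAXPRI;
    return pri;
}
```

For example, a process with 40 accumulated ticks and a nice value of 0 gets priority 60, whereas an idle process with a nice value of -20 is clamped to PUSER rather than being pushed into the kernel-mode range.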
The CPU utilization, p_estcpu, is incremented each time that the system clock
ticks and the process is found to be executing. In addition, p_estcpu is adjusted
once per second via a digital decay filter. The decay causes about 90 percent of
the CPU usage accumulated in a 1-second interval to be forgotten over a period of
time that is dependent on the system load average. To be exact, p_estcpu is
adjusted according to
\[
p\_estcpu = \frac{2 \times load}{2 \times load + 1} \times p\_estcpu + p\_nice, \qquad \text{(Eq. 4.2)}
\]
where the load is a sampled average of the sum of the lengths of the run queue
and of the short-term sleep queue over the previous 1-minute interval of system
operation.
To understand the effect of the decay filter, we can consider the case where a
single compute-bound process monopolizes the CPU. The process's CPU
utilization will accumulate clock ticks at a rate dependent on the clock frequency. The
load average will be effectively 1, resulting in a decay of
\[
p\_estcpu = 0.66 \times p\_estcpu + p\_nice.
\]
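If we assume that the process accumulates $T_i$ clock ticks over time interval $i$, and
that p_nice is zero, then the CPU utilization for each time interval will count into
the current value of p_estcpu according to
\begin{align*}
p\_estcpu &= 0.66 \times T_0 \\
p\_estcpu &= 0.66 \times (T_1 + 0.66 \times T_0) = 0.66 \times T_1 + 0.44 \times T_0 \\
p\_estcpu &= 0.66 \times T_2 + 0.44 \times T_1 + 0.30 \times T_0 \\
p\_estcpu &= 0.66 \times T_3 + \cdots + 0.20 \times T_0 \\
p\_estcpu &= 0.66 \times T_4 + \cdots + 0.13 \times T_0.
\end{align*}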
Thus, after five decay calculations, only 13 percent of $T_0$ remains present in the
current CPU utilization value for the process. Since the decay filter is applied once
per second, we can also say that about 90 percent of the CPU utilization is
forgotten after 5 seconds.
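The arithmetic above can be checked numerically with a few lines of C. The sketch below uses floating point for clarity (an assumption made for illustration; it says nothing about how the kernel represents these values), and the function names are hypothetical.

```c
/* One application of the once-per-second decay filter of Eq. 4.2. */
double decay_once(double p_estcpu, double load, int p_nice)
{
    return (2.0 * load) / (2.0 * load + 1.0) * p_estcpu + p_nice;
}

/* Fraction of the ticks accumulated in one interval that is still counted
 * after the filter has been applied `seconds` times, with p_nice taken
 * to be zero. */
double fraction_remaining(double load, int seconds)
{
    double frac = 1.0;

    for (int i = 0; i < seconds; i++)
        frac = decay_once(frac, load, 0);
    return frac;
}
```

With a load of 1, fraction_remaining(1.0, 5) comes out at about 0.13, matching the last line of the expansion. With a load of 4 the per-second factor is 8/9, so a bit more than half of the usage is still remembered after 5 seconds, which shows how the forgetting period stretches as the load average rises.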
Processes that are runnable have their priority adjusted periodically as just
described. However, the system ignores processes blocked awaiting an event:
These processes cannot accumulate CPU usage, so an estimate of their filtered
CPU usage can be calculated in one step. This optimization can significantly
reduce a system's scheduling overhead when many blocked processes are present.
The system recomputes a process's priority when that process is awakened and
has been sleeping for longer than 1 second. The system maintains a value,
p_slptime, that is an estimate of the time a process has spent blocked waiting for
an event. The value of p_slptime is set to 0 when a process calls sleep(), and is
incremented once per second while the process remains in an SSLEEP or SSTOP
state. When the process is awakened, the system computes the value of p_estcpu
according to
\[
p\_estcpu = \left[ \frac{2 \times load}{2 \times load + 1} \right]^{p\_slptime} \times p\_estcpu, \qquad \text{(Eq. 4.3)}
\]
and then recalculates the scheduling priority using Eq. 4.1. This analysis ignores
the influence of p_nice; also, the load used is the current load average, rather than
the load average at the time that the process blocked.
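The one-step catch-up for a process that has been asleep can be sketched as follows, assuming that the decay factor of Eq. 4.2 is simply applied once for each second recorded in p_slptime. The function name is hypothetical and floating point is used only for readability.

```c
/* Applies the per-second decay factor p_slptime times in a single call,
 * using the load average at the time of the wakeup. p_nice is ignored,
 * as in the analysis in the text. */
double estcpu_after_sleep(double p_estcpu, double load, unsigned p_slptime)
{
    double decay = (2.0 * load) / (2.0 * load + 1.0);

    while (p_slptime-- > 0)
        p_estcpu *= decay;
    return p_estcpu;
}
```

A process that blocked with an estimate of 90 ticks and slept for 2 seconds at a load of 1 wakes up with an estimate of about 40, so its recomputed priority is noticeably better than when it went to sleep.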
Process-Priority Routines
The priority calculations used in the short-term scheduling algorithm are spread
out in several areas of the system. Two routines, schedcpu() and roundrobin(),
run periodically. Schedcpu() recomputes process priorities once per second, using
Eq. 4.2, and updates the value of p_slptime for processes blocked by a call to
sleep(). The roundrobin() routine runs 10 times per second and causes the system
to reschedule the processes in the highest-priority (nonempty) queue in a
round-robin fashion, which allows each process a 100-millisecond time quantum.
The CPU usage estimates are updated in the system clock-processing module,
hardclock(), which executes 100 times per second. Each time that a process
accumulates four ticks in its CPU usage estimate, p_estcpu, the system recalculates the
priority of the process. This recalculation uses Eq. 4.1 and is done by the
setpriority() routine. The decision to recalculate after four ticks is related to the
management of the run queues described in the next section. In addition to issuing
the call from hardclock(), each time setrunnable() places a process on a run
queue, it also calls setpriority() to recompute the process's scheduling priority.
This call from wakeup() to setrunnable() operates on a process other than the
currently running process. So, wakeup() invokes updatepri() to recalculate the CPU
usage estimate according to Eq. 4.3 before calling setpriority(). The relationship
of these functions is shown in Fig. 4.4.
Figure 4.4 Procedural interface to priority calculation.
Process Run Queues and Context Switching
The scheduling-priority calculations are used to order the set of runnable
processes. The scheduling priority ranges between 0 and 127, with 0 to 49 reserved
for processes executing in kernel mode, and 50 to 127 reserved for processes
executing in user mode. The number of queues used to hold the collection of
runnable processes affects the cost of managing the queues. If only a single
(ordered) queue is maintained, then selecting the next runnable process becomes
simple, but other operations become expensive. Using 128 different queues can
significantly increase the cost of identifying the next process to run. The system
uses 32 run queues, selecting a run queue for a process by dividing the process's
priority by 4. The processes on each queue are not further sorted by their
priorities. The selection of 32 different queues was originally a compromise based
mainly on the availability of certain VAX machine instructions that permitted the
system to implement the lowest-level scheduling algorithm efficiently, using a
32-bit mask of the queues containing runnable processes. The compromise works
well enough today that 32 queues are still used.
The run queues contain all the runnable processes in main memory except the
currently running process. Figure 4.5 shows how each queue is organized as a
doubly linked list of process structures. The head of each run queue is kept in an
array; associated with this array is a bit vector, whichqs, that is used in identifying
the nonempty run queues. Two routines, setrunqueue() and remrq(), are used to
place a process at the tail of a run queue, and to take a process off the head of a
run queue. The heart of the scheduling algorithm is the cpu_switch() routine.
The cpu_switch() routine is responsible for selecting a new process to run; it
operates as follows:
1. Block interrupts, then look for a nonempty run queue. Locate a nonempty
   queue by finding the location of the first nonzero bit in the whichqs bit vector.
   If whichqs is zero, there are no processes to run, so unblock interrupts and
   loop; this loop is the idle loop.
2. Given a nonempty run queue, remove the first process on the queue.
3. If this run queue is now empty as a result of removing the process, reset the
   appropriate bit in whichqs.
4. Clear the curproc pointer and the want_resched flag. The curproc pointer
   references the currently running process. Clear it to show that no process is
   currently running. The want_resched flag shows that a context switch should take
   place; it is described later in this section.
5. Set the new process running and unblock interrupts.
Figure 4.5 Queueing structure for runnable processes.
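The queue-selection part of cpu_switch() (steps 1 through 3 above) can be modeled in portable C. This is a simplified sketch: the sim_ function names and the separate head and tail arrays are illustrative, the lists are singly linked rather than doubly linked, bit 0 of whichqs is assumed to stand for the queue holding the numerically lowest (most favorable) priorities, and interrupt blocking and the actual context switch are omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define NQS 32                      /* number of run queues */

struct proc {
    struct proc *p_forw;            /* next process on the same run queue */
    int          p_priority;        /* 0..127; smaller is more favorable */
};

static struct proc *qs_head[NQS], *qs_tail[NQS];
static uint32_t whichqs;            /* bit q is set when queue q is nonempty */

/* Place a process at the tail of the queue selected by priority / 4. */
void sim_setrunqueue(struct proc *p)
{
    int q = p->p_priority / 4;

    p->p_forw = NULL;
    if (qs_tail[q] != NULL)
        qs_tail[q]->p_forw = p;
    else
        qs_head[q] = p;
    qs_tail[q] = p;
    whichqs |= (uint32_t)1 << q;
}

/* Pick the first process on the first nonempty queue, or return NULL when
 * whichqs is zero (the kernel would spin in its idle loop instead). */
struct proc *sim_pick_next(void)
{
    struct proc *p;
    int q = 0;

    if (whichqs == 0)
        return NULL;
    while ((whichqs & ((uint32_t)1 << q)) == 0)
        q++;                        /* find the first nonzero bit */
    p = qs_head[q];
    qs_head[q] = p->p_forw;
    if (qs_head[q] == NULL) {       /* queue became empty: reset its bit */
        qs_tail[q] = NULL;
        whichqs &= ~((uint32_t)1 << q);
    }
    p->p_forw = NULL;
    return p;
}
```

Processes with priorities 60 and 61 share queue 15 and are served in arrival order, which is the sense in which processes on one queue are not further sorted by priority. The bit scan in sim_pick_next() is the step that a find-first-set machine instruction would do in a single operation.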
The context-switch code is broken into two parts. The machine-independent code
resides in mi_switch(); the machine-dependent part resides in cpu_switch(). On
most architectures, cpu_switch() is coded in assembly language for efficiency.
Given the mi_switch() routine and the process-priority calculations, the only
missing piece in the scheduling facility is how the system forces an involuntary
context switch. Remember that voluntary context switches occur when a process
calls the sleep() routine. Sleep() can be invoked by only a runnable process, so
sleep() needs only to place the process on a sleep queue and to invoke
mi_switch() to schedule the next process to run. The mi_switch() routine,
however, cannot be called from code that executes at interrupt level, because it must be
called within the context of the running process.
An alternative mechanism must exist. This mechanism is handled by the
machine-dependent need_resched() routine, which generally sets a global
reschedule request flag, named want_resched, and then posts an asynchronous system trap
(AST) for the current process. An AST is a trap that is delivered to a process the
next time that that process returns to user mode. Some architectures support ASTs
directly in hardware; other systems emulate ASTs by checking the want_resched
flag at the end of every system call, trap, and interrupt of user-mode execution.
When the hardware AST trap occurs or the want_resched flag is set, the
mi_switch() routine is called, instead of the current process resuming execution.
Rescheduling requests are made by the wakeup(), setpriority(), roundrobin(),
schedcpu(), and setrunnable() routines.
Because 4.4BSD does not preempt processes executing in kernel mode, the
worst-case real-time response to events is defined by the longest path through the
top half of the kernel. Since the system guarantees no upper bounds on the
duration of a system call, 4.4BSD is decidedly not a real-time system. Attempts to
retrofit BSD with real-time process scheduling have addressed this problem in
different ways [Ferrin & Langridge, 1980; Sanderson et al, 1986].
Process Creation
In 4.4BSD, new processes are created with the fork system call. There is also a
vfork system call that differs from fork in how the virtual-memory resources are
treated; vfork also ensures that the parent will not run until the child does either an
exec or exit system call. The vfork system call is described in Section 5.6.
The process created by a fork is termed a child process of the original parent
process. From a user's point of view, the child process is an exact duplicate of the
parent process, except for two values: the child PID, and the parent PID. A call to
fork returns the child PID to the parent and zero to the child process. Thus, a
program can identify whether it is the parent or child process after a fork by checking
this return value.
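From user level, telling parent from child by the return value of fork looks like the following minimal sketch. It uses only the standard fork, _exit, and waitpid calls; the helper name fork_and_reap() and its return convention are made up for this illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that exits immediately with the given code. The parent
 * waits for that child and returns the child's exit status, or -1 if the
 * fork or the wait failed. */
int fork_and_reap(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;          /* fork failed; no child was created */
    if (pid == 0)
        _exit(code);        /* child: fork returned zero */

    /* parent: fork returned the PID of the child */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child calls _exit() rather than returning, so that it does not continue executing the parent's code path.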
A fork involves three main steps:
1. Allocating and initializing a new process structure for the child process
2. Duplicating the context of the parent (including the user structure and
   virtual-memory resources) for the child process
3. Scheduling the child process to run
The second step is intimately related to the operation of the memory-management
facilities described in Chapter 5. Consequently, only those actions related to
process management will be described here.
The kernel begins by allocating memory for the new process entry (see
Fig. 4.1). The process entry is initialized in three steps: part is copied from the
parent's process structure, part is zeroed, and the rest is explicitly initialized. The
zeroed fields include recent CPU utilization, wait channel, swap and sleep time,
timers, tracing, and pending-signal information. The copied portions include all
the privileges and limitations inherited from the parent, including
• The process group and session
• The signal state (ignored, caught and blocked signal masks)
• The p_nice scheduling parameter
• A reference to the parent's credential
• A reference to the parent's set of open files
• A reference to the parent's limits
The explicitly set information includes
• Entry onto the list of all processes
• Entry onto the child list of the parent and the back pointer to the parent
• Entry onto the parent's process-group list
• Entry onto the hash structure that allows the process to be looked up by its PID
• A pointer to the process's statistics structure, allocated in its user structure
• A pointer to the process's signal-actions structure, allocated in its user structure
• A new PID for the process
The new PID must be unique among all processes. Early versions of BSD verified
the uniqueness of a PID by performing a linear search of the process table. This
search became infeasible on large systems with many processes. 4.4BSD
maintains a range of unallocated PIDs between nextpid and pidchecked. It allocates a
new PID by using the value of nextpid, and nextpid is then incremented. When
nextpid reaches pidchecked, the system calculates a new range of unused PIDs by
making a single scan of all existing processes (not just the active ones are
scanned; zombie and swapped processes also are checked).
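The range-based allocation can be sketched as follows. This is an illustrative model only: the array of in-use PIDs stands in for the kernel's lists of processes, PID_MAX and the wraparound to 1 are assumed values, alloc_pid() is a hypothetical name, and the sketch restarts its scan whenever nextpid itself turns out to be in use.

```c
#include <stddef.h>

#define PID_MAX 30000               /* assumed upper bound on PIDs */

static int nextpid = 1;             /* next candidate PID */
static int pidchecked = 0;          /* PIDs in [nextpid, pidchecked) are free */

/* Returns an unused PID. inuse[] lists the PIDs of all existing processes
 * at the time of the last scan; at least one PID is assumed to be free. */
int alloc_pid(const int *inuse, size_t n)
{
    for (;;) {
        int restart;

        if (nextpid >= PID_MAX) {   /* wrap around and force a rescan */
            nextpid = 1;
            pidchecked = 0;
        }
        if (nextpid < pidchecked)
            return nextpid++;       /* fast path: inside the known-free range */

        /* Slow path: scan all processes to find the next free range. */
        do {
            restart = 0;
            pidchecked = PID_MAX;
            for (size_t i = 0; i < n; i++) {
                if (inuse[i] == nextpid) {
                    nextpid++;      /* candidate taken: try the next one */
                    restart = 1;
                    break;
                }
                if (inuse[i] > nextpid && inuse[i] < pidchecked)
                    pidchecked = inuse[i];  /* nearest in-use PID above */
            }
        } while (restart && nextpid < PID_MAX);
    }
}
```

With PIDs 1, 2, and 5 in use, the first call scans and returns 3, the second returns 4 without any scan, and the third has to rescan because nextpid has reached pidchecked; it skips 5 and returns 6.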
The final step is to copy the parent's address space. To duplicate a process's
image, the kernel invokes the memory-management facilities through a call to
vm_fork(). The vm_fork() routine is passed a pointer to the initialized process
structure for the child process and is expected to allocate all the resources that the
child will need to execute. The call to vm_fork() returns a value of 1 in the child
process and of 0 in the parent process.
Now that the child process is fully built, it is made known to the scheduler by
being placed on the run queue. The return value from vm_fork() is passed back to
indicate whether the process is returning in the parent or child process, and
determines the return value of the fork system call.
Process Termination
Processes terminate either voluntarily through an exit system call, or involuntarily
as the result of a signal. In either case, process termination causes a status code to
be returned to the parent of the terminating process (if the parent still exists). This
termination status is returned through the wait4 system call. The wait4 call
permits an application to request the status of both stopped and terminated processes.
The wait4 request can wait for any direct child of the parent, or it can wait
selectively for a single child process, or for only its children in a particular process
group. Wait4 can also request statistics describing the resource utilization of a
terminated child process. Finally, the wait4 interface allows a process to request
status codes without blocking.
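A user-level sketch of collecting both the termination status and the resource usage of one specific child with wait4 follows. The helper name and the fixed exit code of 3 are arbitrary choices for the illustration; wait4 is a BSD interface, and the _DEFAULT_SOURCE definition is an assumption about what some C libraries need in order to declare it.

```c
#define _DEFAULT_SOURCE             /* assumed to expose wait4() on some systems */
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that exits with status 3, then waits for that particular
 * child, filling *ru with the child's resource usage. Returns the exit
 * status of the child, or -1 on failure. */
int reap_with_usage(struct rusage *ru)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(3);                   /* child terminates voluntarily */

    /* Wait selectively for this child; an options value of 0 blocks. */
    if (wait4(pid, &status, 0, ru) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Passing WNOHANG in the options argument instead of 0 gives the nonblocking form, in which wait4 returns 0 when no matching child has changed state yet.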
Within the kernel, a process terminates by calling the exit() routine. Exit()
first cleans up the process's kernel-mode execution state by
• Canceling any pending timers
• Releasing virtual-memory resources
• Closing open descriptors
• Handling stopped or traced child processes
With the kernel-mode state reset, the process is then removed from the list of
active processes (the allproc list) and is placed on the list of zombie processes
pointed to by zombproc. The process state is changed, and the global flag curproc
is marked to show that no process is currently running. The exit() routine then
• Records the termination status in the p_xstat field of the process structure
• Bundles up a copy of the process's accumulated resource usage (for accounting
  purposes) and hangs this structure from the p_ru field of the process structure
• Notifies the deceased process's parent
Finally, after the parent has been notified, the cpu_exit() routine frees any
machine-dependent process resources, and arranges for a final context switch from
the process.
The wait4 call works by searching a process's descendant processes for
processes that have terminated. If a process in SZOMB state is found that matches the
wait criterion, the system will copy the termination status from the deceased
process. The process entry then is taken off the zombie list and is freed. Note that
resources used by children of a process are accumulated only as a result of a wait4
system call. When users are trying to analyze the behavior of a long-running
program, they would find it useful to be able to obtain this resource usage information
before the termination of a process. Although the information is available inside
the kernel and within the context of that program, there is no interface to request it
outside of that context until process termination.
Signals
UNIX defines a set of signals for software and hardware conditions that may arise
during the normal execution of a program; these signals are listed in Table 4.4.
Signals may be delivered to a process through application-specified signal
handlers, or may result in default actions, such as process termination, carried out by
the system. 4.4BSD signals are designed to be software equivalents of hardware
interrupts or traps.

Table 4.4 Signals defined in 4.4BSD.

    Default action       Description
    terminate process    terminal line hangup
    terminate process    interrupt program
    create core image    quit program
    create core image    illegal instruction
    create core image    trace trap
    create core image    I/O trap instruction executed
    create core image    emulate instruction executed
    create core image    floating-point exception
    terminate process    kill program
    create core image    bus error
    create core image    segmentation violation
    create core image    bad argument to system call
    terminate process    write on a pipe with no one to read it
    terminate process    real-time timer expired
    terminate process    software termination signal
    discard signal       urgent condition on I/O channel
    stop process         stop signal not from terminal
    stop process         stop signal from terminal
    discard signal       a stopped process is being continued
    discard signal       notification to parent on child stop or exit
    stop process         read on terminal by background process
    stop process         write to terminal by background process
    discard signal       I/O possible on a descriptor
    terminate process    CPU time limit exceeded
    terminate process    file-size limit exceeded
    terminate process    virtual timer expired
    terminate process    profiling timer expired
    discard signal       window size changed
    discard signal       information request
    terminate process    user-defined signal 1
    terminate process    user-defined signal 2
Each signal has an associated action that defines how it should be handled
when it is delivered to a process. If a process has not specified an action for a
signal, it is given a default action that may be any one of
• Ignoring the signal
• Terminating the process
• Terminating the process after generating a core file that contains the process's
execution state at the time the signal was delivered
• Stopping the process
• Resuming the execution of the process
An application program can use the sigaction system call to specify an action for a
signal, including
• Taking the default action
• Ignoring the signal
• Catching the signal with a handler
A signal handler is a user-mode routine that the system will invoke when the sig­
nal is received by the process. The handler is said to catch the signal. The two
signals SIGSTOP and SIGKILL cannot be ignored or caught; this restriction ensures
that a software mechanism exists for stopping and killing runaway processes. It is
not possible for a user process to decide which signals would cause the creation of
a core file by default, but it is possible for a process to prevent the creation of such
a file by ignoring, blocking, or catching the signal.
Signals are posted to a process by the system when it detects a hardware
event, such as an illegal instruction, or a software event, such as a stop request
from the terminal. A signal may also be posted by another process through the kill
system call. A sending process may post signals to only those receiving processes
that have the same effective user identifier (unless the sender is the superuser). A
single exception to this rule is the continue signal, SIGCONT, which always can be
sent to any descendent of the sending process. The reason for this exception is to
allow users to restart a setuid program that they have stopped from their keyboard.
Like hardware interrupts, the delivery of signals may be masked by a process.
The execution state of each process contains a set of signals currently masked
from delivery. If a signal posted to a process is being masked, the signal is
recorded in the process's set of pending signals, but no action is taken until the
signal is unmasked. The sigprocmask system call modifies a set of masked signals
for a process. It can add to the set of masked signals, delete from the set of
masked signals, or replace the set of masked signals.
The system does not allow the SIGKILL or SIGSTOP signals to be masked.
Although the delivery of the SIGCONT signal to the signal handler of a process
may be masked, the action of resuming that stopped process is not masked.
Two other signal-related system calls are sigsuspend and sigaltstack. The
sigsuspend call permits a process to relinquish the processor until that process
receives a signal. This facility is similar to the system's sleep( ) routine. The
sigaltstack call allows a process to specify a run-time stack to use in signal
delivery. By default, the system will deliver signals to a process on the latter's
normal run-time stack. In some applications, however, this default is unacceptable.
For example, if an application is running on a stack that the system does not
expand automatically, and the stack overflows, then the signal handler must
execute on an alternate stack. This facility is similar to the interrupt-stack
mechanism used by the kernel.
The final signal-related facility is the sigreturn system call. Sigreturn is the
equivalent of a user-level load-processor-context operation. A pointer to a
(machine-dependent) context block that describes the user-level execution state of
a process is passed to the kernel. The sigreturn system call is used to restore state
and to resume execution after a normal return from a user's signal handler.
Comparison with POSIX Signals
Signals were originally designed to model exceptional events, such as an attempt
by a user to kill a runaway program. They were not intended to be used as a
general interprocess-communication mechanism, and thus no attempt was made to
make them reliable. In earlier systems, whenever a signal was caught, its action
was reset to the default action. The introduction of job control brought much
more frequent use of signals, and made more visible a problem that faster
processors also exacerbated: If two signals were sent rapidly, the second could cause the
process to die, even though a signal handler had been set up to catch the first
signal. Thus, reliability became desirable, so the developers designed a new
framework that contained the old capabilities as a subset while accommodating new
mechanisms.
The signal facilities found in 4.4BSD are designed around a virtual-machine
model, in which system calls are considered to be the parallel of a machine's
hardware instruction set. Signals are the software equivalent of traps or interrupts, and
signal-handling routines perform the equivalent function of interrupt or trap service
routines. Just as machines provide a mechanism for blocking hardware interrupts
so that consistent access to data structures can be ensured, the signal facilities allow
software signals to be masked. Finally, because complex run-time stack
environments may be required, signals, like interrupts, may be handled on an alternate
run-time stack. These machine models are summarized in Table 4.5.
The 4.4BSD signal model was adopted by POSIX, although several significant
changes were made.
• In POSIX, system calls interrupted by a signal cause the call to be terminated
prematurely and an "interrupted system call" error to be returned. In 4.4BSD, the
sigaction system call can be passed a flag that requests that system calls
interrupted by a signal be restarted automatically whenever possible and reasonable.
Automatic restarting of system calls permits programs to service signals without
having to check the return code from each system call to determine whether the
call should be restarted. If this flag is not given, the POSIX semantics apply.
Most applications use the C-library routine signal( ) to set up their signal
handlers. In 4.4BSD, the signal( ) routine calls sigaction with the flag that
requests that system calls be restarted. Thus, applications running on 4.4BSD
and setting up signal handlers with signal( ) continue to work as expected, even
though the sigaction interface conforms to the POSIX specification.
• In POSIX, signals are always delivered on the normal run-time stack of a process.
In 4.4BSD, an alternate stack may be specified for delivering signals with the
sigaltstack system call. Signal stacks permit programs that manage fixed-sized
run-time stacks to handle signals reliably.
• POSIX added a new system call sigpending; this routine determines what signals
have been posted but have not yet been delivered. Although it appears in
4.4BSD, it had no equivalent in earlier BSD systems because there were no
applications that wanted to make use of pending-signal information.

Table 4.5  Comparison of hardware-machine operations and the corresponding software
virtual-machine operations.

Hardware machine            Software virtual machine
instruction set             set of system calls
restartable instructions    restartable system calls
interrupt/trap handlers     signal handlers
blocking interrupts         masking signals
interrupt stack             signal stack
Posting of a Signal
The implementation of signals is broken up into two parts: posting a signal to a
process, and recognizing the signal and delivering it to the target process. Signals
may be posted by any process or by code that executes at interrupt level. Signal
delivery normally takes place within the context of the receiving process. But
when a signal forces a process to be stopped, the action can be carried out when
the signal is posted.
A signal is posted to a single process with the psignal( ) routine or to a group
of processes with the gsignal( ) routine. The gsignal( ) routine invokes psignal( )
for each process in the specified process group. The actions associated with
posting a signal are straightforward, but the details are messy. In theory, posting a
signal to a process simply causes the appropriate signal to be added to the set of
pending signals for the process, and the process is then set to run (or is awakened
if it was sleeping at an interruptible priority level). The CURSIG macro calculates
the next signal, if any, that should be delivered to a process. It determines the next
signal by inspecting the p_siglist field that contains the set of signals pending
delivery to a process. Each time that a process returns from a call to sleep( ) (with
the PCATCH flag set) or prepares to exit the system after processing a system call
or trap, it checks to see whether a signal is pending delivery. If a signal is pending
and must be delivered in the process's context, it is removed from the pending set,
and the process invokes the postsig( ) routine to take the appropriate action.
The work of psignal( ) is a patchwork of special cases required by the
process-debugging and job-control facilities, and by intrinsic properties associated
with signals. The steps involved in posting a signal are as follows:
1. Determine the action that the receiving process will take when the signal is
delivered. This information is kept in the p_sigignore, p_sigmask, and
p_sigcatch fields of the process's process structure. If a process is not ignoring,
masking, or catching a signal, the default action is presumed to apply. If a
process is being traced by its parent (that is, by a debugger), the parent
process is always permitted to intercede before the signal is delivered. If the
process is ignoring the signal, psignal( )'s work is done and the routine can return.
2. Given an action, psignal( ) adds the signal to the set of pending signals,
p_siglist, and then does any implicit actions specific to that signal. For
example, if the signal is a continue signal, SIGCONT, any pending signals that
would normally cause the process to stop, such as SIGTTOU, are removed.
3. Next, psignal( ) checks whether the signal is being masked. If the process is
currently masking delivery of the signal, psignal( )'s work is complete and it
may return.
4. If, however, the signal is not being masked, psignal( ) must either do the action
directly, or arrange for the process to execute so that the process will take the
action associated with the signal. To get the process running, psignal( ) must
interrogate the state of the process, which is one of the following:
The process is blocked awaiting an event. If the process is sleeping at a
negative priority, then nothing further can be done. Otherwise, the
kernel can apply the action, either directly, or indirectly by waking up the
process. There are two actions that can be applied directly. For signals
that cause a process to stop, the process is placed in an SSTOP state,
and the parent process is notified of the state change by a SIGCHLD
signal being posted to it. For signals that are ignored by default, the signal
is removed from p_siglist and the work is complete. Otherwise, the
action associated with the signal must be done in the context of the
receiving process, and the process is placed onto the run queue with a
call to setrunnable( ).
The process is stopped by a signal or because it is being debugged. If
the process is being debugged, then there is nothing to do until the
controlling process permits it to run again. If the process is stopped by a
signal and the posted signal would cause the process to stop again, then
there is nothing to do, and the posted signal is discarded. Otherwise,
the signal is either a continue signal or a signal that would normally
cause the process to terminate (unless the signal is caught). If the
signal is SIGCONT, then the process is set running again, unless it is
blocked waiting on an event; if the process is blocked, it is returned to
the SSLEEP state. If the signal is SIGKILL, then the process is set
running again no matter what, so that it can terminate the next time that it
is scheduled to run. Otherwise, the signal causes the process to be
made runnable, but the process is not placed on the run queue because
it must wait for a continue signal.
If the process is not the currently executing process, need_resched( ) is
called, so that the signal will be noticed by the receiving process as
soon as possible.
The implementation of psignal ( ) is complicated, mostly because psignal ( ) con­
trols the process-state transitions that are part of the job-control facilities and
because it interacts strongly with process-debugging facilities.
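The main flow of the posting steps can be sketched as a toy user-level model. This is not the kernel's psignal( ): it omits tracing, default actions, and every process-state case described above, and the structure, the signal numbers, and the runnable flag are assumptions made for the example. Only the field names p_siglist, p_sigignore, p_sigmask, and p_sigcatch come from the text.

```c
#include <stdint.h>

/* Signal numbers chosen for the model; not taken from <signal.h>. */
enum { M_SIGINT = 2, M_SIGTSTP = 18, M_SIGCONT = 19,
       M_SIGTTIN = 21, M_SIGTTOU = 22 };

#define SIGBIT(s)  ((uint32_t)1 << ((s) - 1))
#define STOP_SIGS  (SIGBIT(M_SIGTSTP) | SIGBIT(M_SIGTTIN) | SIGBIT(M_SIGTTOU))

/* Toy stand-in for the process structure. */
struct model_proc {
    uint32_t p_siglist;     /* signals pending delivery */
    uint32_t p_sigignore;   /* signals being ignored */
    uint32_t p_sigmask;     /* signals masked from delivery */
    uint32_t p_sigcatch;    /* signals being caught (unused by this model) */
    int      runnable;      /* hypothetical: set when the process should run */
};

/* Post sig to p, following the order of the steps in the text. */
void model_psignal(struct model_proc *p, int sig)
{
    uint32_t bit = SIGBIT(sig);

    /* An ignored signal needs no further work. */
    if (p->p_sigignore & bit)
        return;

    /* Record the signal, then apply its implicit actions. */
    p->p_siglist |= bit;
    if (sig == M_SIGCONT)
        p->p_siglist &= ~STOP_SIGS;     /* a continue cancels pending stops */

    /* A masked signal stays pending; nothing else is done. */
    if (p->p_sigmask & bit)
        return;

    /* Greatly simplified: arrange for the process to run. */
    p->runnable = 1;
}
```

Even in this reduced form, the order of the checks matters: a masked signal must still be recorded in p_siglist, whereas an ignored one is dropped before it is ever recorded.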
Delivering a Signal
Most actions associated with delivering a signal to a process are carried out within
the context of that process. A process checks its process structure for pending sig­
nals at least once each time that it enters the system, by calling the CURSIG macro.
If CURSIG determines that there are any unmasked signals in p_siglist, it calls
issignal ( ) to find the first unmasked signal in the list. If delivering the signal
causes a signal handler to be invoked or a core dump to be made, the caller is noti­
fied that a signal is pending, and actual delivery is done by a call to postsig ( ) .
That is,
	if (sig = CURSIG(p))
		postsig(sig);
Otherwise, the action associated with the signal is done within issignal( ) (these
actions mimic the actions carried out by psignal ( ) ) .
The postsig( ) routine has two cases to handle:
1. Producing a core dump
2. Invoking a signal handler
The former task is done by the coredump( ) routine and is always followed by a
call to exit( ) to force process termination. To invoke a signal handler, postsig( )
first calculates a set of masked signals and installs that set in p_sigmask. This set
normally includes the signal being delivered, so that the signal handler will not be
invoked recursively by the same signal. Any signals specified in the sigaction
system call at the time the handler was installed also will be included. Postsig( )
then calls the sendsig( ) routine to arrange for the signal handler to execute
immediately after the process returns to user mode. Finally, the signal in p_cursig is
cleared and postsig( ) returns, presumably to be followed by a return to user mode.

Figure 4.6  Delivery of a signal to a process. [Diagram labels: step 1, sendsig( );
step 2, sigtramp( ) called; step 3, sigtramp( ) returns; step 4, sigreturn( ).]

The implementation of the sendsig( ) routine is machine dependent. Figure
4.6 shows the flow of control associated with signal delivery. If an alternate stack
has been requested, the user's stack pointer is switched to point at that stack. An
argument list and the process's current user-mode execution context are stored on
the (possibly new) stack. The state of the process is manipulated so that, on return
to user mode, a call will be made immediately to a body of code termed the
signal-trampoline code. This code invokes the signal handler with the appropriate
argument list, and, if the handler returns, makes a sigreturn system call to reset the
process's signal state to the state that existed before the signal.
Process Groups and Sessions
A process group is a collection of related processes, such as a shell pipeline, all of
which have been assigned the same process-group identifier. The process-group
identifier is the same as the PID of the process group's initial member; thus
process-group identifiers share the name space of process identifiers. When a new
process group is created, the kernel allocates a process-group structure to be
associated with it. This process-group structure is entered into a process-group
hash table so that it can be found quickly.
A process is always a member of a single process group. When it is created,
each process is placed into the process group of its parent process. Programs such
as shells create new process groups, usually placing related child processes into a
group. A process can change its own process group or that of a child process by
creating a new process group or by moving a process into an existing process
group using the setpgid system call. For example, when a shell wants to set up a
new pipeline, it wants to put the processes in the pipeline into a process group
different from its own, so that the pipeline can be controlled independently of the
shell. The shell starts by creating the first process in the pipeline, which initially
has the same process-group identifier as the shell. Before executing the target
program, the first process does a setpgid to set its process-group identifier to the same
value as its PID. This system call creates a new process group, with the child
process as the process-group leader of the process group. As the shell starts each
additional process for the pipeline, each child process uses setpgid to join the
existing process group.
In our example of a shell creating a new pipeline, there is a race. As the
additional processes in the pipeline are spawned by the shell, each is placed in the
process group created by the first process in the pipeline. These conventions are
enforced by the setpgid system call. It restricts the set of process-group identifiers
to which a process may be set to either a value equal to its own PID or a value of
another process-group identifier in its session. Unfortunately, if a pipeline process
other than the process-group leader is created before the process-group leader has
completed its setpgid call, the setpgid call to join the process group will fail. As
the setpgid call permits parents to set the process group of their children (within
some limits imposed by security concerns), the shell can avoid this race by
making the setpgid call to change the child's process group both in the newly created
child and in the parent shell. This algorithm guarantees that, no matter which
process runs first, the process group will exist with the correct process-group leader.
The shell can also avoid the race by using the vfork variant of the fork system call
that forces the parent process to wait until the child process either has done an
exec system call or has exited. In addition, if the initial members of the process
group exit before all the pipeline members have joined the group (for example, if
the process-group leader exits before the second process joins the group), the
setpgid call could fail. The shell can avoid this race by ensuring that all child
processes are placed into the process group without calling the wait system call,
usually by blocking the SIGCHLD signal so that the shell will not be notified yet if a
child exits. As long as a process-group member exists, even as a zombie process,
additional processes can join the process group.
There are additional restrictions on the setpgid system call. A process may
join process groups only within its current session (discussed in the next section),
and it cannot have done an exec system call. The latter restriction is intended to
avoid unexpected behavior if a process is moved into a different process group
after it has begun execution. Therefore, when a shell calls setpgid in both parent
and child processes after a fork, the call made by the parent will fail if the child
has already made an exec call. However, the child will already have joined the
process group successfully, and the failure is innocuous.
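The technique of calling setpgid in both the parent and the child can be sketched as follows. This is a minimal illustration, not a shell: the child does not exec anything, and the pipe is only an assumption of the example, used to keep the child alive until the parent has looked at its process group. The function name is hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that becomes the leader of a new process group.
 * Returns 1 if the parent then sees the child in that group, 0 if it
 * does not, and -1 on error.
 */
int spawn_group_leader(void)
{
    int hold[2];
    char byte;
    pid_t child, pgid;

    if (pipe(hold) == -1)
        return -1;

    child = fork();
    if (child == -1) {
        close(hold[0]);
        close(hold[1]);
        return -1;
    }
    if (child == 0) {
        /* Child: create the group; harmless if the parent got there first. */
        setpgid(0, 0);
        close(hold[1]);
        /* A real shell would exec the target program here. The sketch
         * instead waits for the parent to close the pipe, then exits. */
        while (read(hold[0], &byte, 1) > 0)
            ;
        _exit(0);
    }

    /* Parent: make the same request on the child's behalf. The result is
     * deliberately not checked, because a failure after the child has
     * already acted is innocuous. */
    (void)setpgid(child, child);
    pgid = getpgid(child);

    close(hold[0]);
    close(hold[1]);                 /* end-of-file lets the child exit */
    waitpid(child, NULL, 0);
    return pgid == child;
}
```

Whichever process runs first, the group exists with the child as its leader by the time the parent calls getpgid.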
Just as a set of related processes are collected into a process group, a set of
process groups are collected into a session. A session is a set of one or more process
groups and may be associated with a terminal device. The main uses for sessions
are to collect together a user's login shell and the jobs that it spawns, and to create
an isolated environment for a daemon process and its children. Any process that
is not already a process-group leader may create a session using the setsid system
call, becoming the session leader and the only member of the session. Creating a
session also creates a new process group, where the process-group ID is the PID of
the process creating the session, and the process is the process-group leader. By
definition, all members of a process group are members of the same session.
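These rules can be checked with a short sketch. A freshly forked child is not a process-group leader, so it may call setsid; once it has done so it is a process-group leader, and a second setsid call is expected to fail. The function name is hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * In a forked child, create a session and verify the resulting
 * identifiers. Returns 1 if every check passes, 0 if one fails,
 * and -1 on error.
 */
int child_creates_session(void)
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        pid_t me = getpid();
        int ok = setsid() == me         /* new session ID is our PID */
            && getsid(0) == me          /* we are the session leader */
            && getpgrp() == me          /* and the process-group leader */
            && setsid() == -1;          /* a group leader cannot do it again */
        _exit(ok ? 0 : 1);
    }
    if (waitpid(child, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Daemons commonly use the same fork-then-setsid sequence to obtain the isolated environment mentioned above.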
A session may have an associated controlling terminal that is used by default
for communicating with the user. Only the session leader may allocate a
controlling terminal for the session, becoming a controlling process when it does so. A
device can be the controlling terminal for only one session at a time. The terminal
I/O system (described in Chapter 10) synchronizes access to a terminal by
permitting only a single process group to be the foreground process group for a
controlling terminal at any time. Some terminal operations are allowed by only members
of the session. A session can have at most one controlling terminal. When a
session is created, the session leader is dissociated from its controlling terminal if it
had one.
A login session is created by a program that prepares a terminal for a user to
log into the system. That process normally executes a shell for the user, and thus
the shell is created as the controlling process. An example of a typical login
session is shown in Fig. 4.7.
The data structures used to support sessions and process groups in 4.4BSD are
shown in Fig. 4.8. This figure parallels the process layout shown in Fig. 4.7. The
pg_members field of a process-group structure heads the list of member processes;
these processes are linked together through the p_pglist list entry in the process
structure. In addition, each process has a reference to its process-group structure
in the p_pgrp field of the process structure. Each process-group structure has a
pointer to its enclosing session. The session structure tracks per-login
information, including the process that created and controls the session, the controlling
terminal for the session, and the login name associated with the session. Two
processes wanting to determine whether they are in the same session can traverse
their p_pgrp pointers to find their process-group structures, and then compare the
pg_session pointers to see whether the latter are the same.
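The pointer relationships just described can be sketched with simplified structures. These are not the kernel's declarations: the sk_ structure names are hypothetical, p_pglist is reduced to a plain next pointer, and only the fields needed for the comparison are shown.

```c
#include <stddef.h>

struct sk_proc;

struct sk_session {
    struct sk_proc *s_leader;       /* process that created the session */
};

struct sk_pgrp {
    struct sk_proc *pg_members;     /* head of the list of member processes */
    struct sk_session *pg_session;  /* enclosing session */
};

struct sk_proc {
    struct sk_pgrp *p_pgrp;         /* process group of this process */
    struct sk_proc *p_pglist;       /* next member of the same group */
};

/* Two processes are in the same session if their process groups
 * point at the same session structure. */
int same_session(const struct sk_proc *a, const struct sk_proc *b)
{
    return a->p_pgrp->pg_session == b->p_pgrp->pg_session;
}

/* Count the members of a process group by walking its member list. */
int pgrp_size(const struct sk_pgrp *pg)
{
    int n = 0;
    const struct sk_proc *p;

    for (p = pg->pg_members; p != NULL; p = p->p_pglist)
        n++;
    return n;
}
```

The comparison needs no searching: each process reaches its session in two pointer dereferences.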

Figure 4.7  A session and its processes. In this example, process 3 is the initial member
of the session (the session leader) and is referred to as the controlling process if it has a
controlling terminal. It is contained in its own process group, 3. Process 3 has spawned
two jobs: one is a pipeline composed of processes 4 and 5, grouped together in process
group 4, and the other one is process 8, which is in its own process group, 8. No
process-group leader can create a new session; thus, processes 3, 4, or 8 could not start their own
session, but process 5 would be allowed to do so.

Job Control
Job control is a facility first provided by the C shell [Joy, 1994], and today
provided by most shells. It permits a user to control the operation of groups of
processes termed jobs. The most important facilities provided by job control are the
abilities to suspend and restart jobs and to do the multiplexing of access to the
user's terminal. Only one job at a time is given control of the terminal and is able
to read from and write to the terminal. This facility provides some of the
advantages of window systems, although job control is sufficiently different that it is
often used in combination with window systems on those systems that have the
latter. Job control is implemented on top of the process group, session, and signal
facilities.
Each job is a process group. Outside the kernel, a shell manipulates a job by
sending signals to the job's process group with the killpg system call, which
delivers a signal to all the processes in a process group. Within the system, the two
main users of process groups are the terminal handler (Chapter 10) and the
interprocess-communication facilities (Chapter 11). Both facilities record
process-group identifiers in private data structures and use them in delivering signals. The
terminal handler, in addition, uses process groups to multiplex access to the
controlling terminal.
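As a minimal sketch of a shell signaling a job, the function below starts a one-process job in its own process group and signals the whole group with killpg. The function name is hypothetical, and a real shell would exec a program in the child instead of pausing.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Start a one-process job in its own process group, send signo to the
 * group, and return the signal that terminated the child (or -1 if it
 * was not terminated by a signal or an error occurred).
 */
int signal_job(int signo)
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        setpgid(0, 0);              /* become leader of a new group */
        for (;;)
            pause();                /* wait to be signaled */
    }

    /* Make the same call in the parent so that the group is certain
     * to exist before it is signaled. */
    (void)setpgid(child, child);

    if (killpg(child, signo) == -1) {
        kill(child, SIGKILL);       /* clean up the child on failure */
        waitpid(child, NULL, 0);
        return -1;
    }
    if (waitpid(child, &status, 0) == -1)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

With more processes in the group, the single killpg call would reach every member, which is what lets a shell treat a pipeline as one job.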
For example, special characters typed at the keyboard of the terminal (e.g.,
control-C or control-\) result in a signal being sent to all processes in one job in
the session; that job is in the foreground, whereas all other jobs in the session are
in the background. A shell may change the foreground job by using the
tcsetpgrp( ) function, implemented by the TIOCSPGRP ioctl on the controlling
terminal. Background jobs will be sent the SIGTTIN signal if they attempt to read
from the terminal, normally stopping the job. The SIGTTOU signal is sent to
background jobs that attempt an ioctl system call that would alter the state of the
terminal, and, if the TOSTOP option is set for the terminal, if they attempt to write
to the terminal.

Figure 4.8  Process-group organization. [Diagram of the pgrphashtbl hash list and of the
struct pgrp, struct session, and struct tty structures, showing the pg_hash, pg_members,
pg_jobc, p_pglist, s_count, t_session, and t_pgrp (foreground process group) fields.]

The foreground process group for a session is stored in the t_pgrp field of the
session's controlling terminal tty structure (see Chapter 10). All other process
groups within the session are in the background. In Fig. 4.8, the session leader has
set the foreground process group for its controlling terminal to be its own process
group. Thus, its two jobs are in background, and the terminal input and output will
be controlled by the session-leader shell. Job control is limited to processes
contained within the same session and to the terminal associated with the session.
Only the members of the session are permitted to reassign the controlling terminal
among the process groups within the session.
If a controlling process exits, the system revokes further access to the
controlling terminal and sends a SIGHUP signal to the foreground process group. If a
process such as a job-control shell exits, each process group that it created will
become an orphaned process group: a process group in which no member has a
parent that is a member of the same session but of a different process group. Such
a parent would normally be a job-control shell capable of resuming stopped child
processes. The pg_jobc field in Fig. 4.8 counts the number of processes within the
process group that have the controlling process as a parent: when that count goes
to zero, the process group is orphaned. If no action were taken by the system, any
orphaned process groups that were stopped at the time that they became orphaned
would be unlikely ever to resume. Historically, the system dealt harshly with such
stopped processes: They were killed. In POSIX and 4.4BSD, an orphaned process
group is sent a hangup and a continue signal if any of its members are stopped
when it becomes orphaned by the exit of a parent process. If processes choose to
catch or ignore the hangup signal, they can continue running after becoming
orphaned. The system keeps a count of processes in each process group that have
a parent process in another process group of the same session. When a process
exits, this count is adjusted for the process groups of all child processes. If the
count reaches zero, the process group has become orphaned. Note that a process
can be a member of an orphaned process group even if its original parent process
is still alive. For example, if a shell starts a job as a single process A, that process
then forks to create process B, and the parent shell exits, then process B is a
member of an orphaned process group but is not an orphaned process.
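The orphaning rule can be sketched as a toy model that recomputes the count from scratch; the kernel instead maintains it incrementally as processes exit. The structure and function names are hypothetical and exist only for this example.

```c
#include <stddef.h>

/* Toy process record; every name here is an assumption of the example. */
struct job_proc {
    int pgid;                       /* process-group identifier */
    int sid;                        /* session identifier */
    int alive;                      /* 0 once the process has exited */
    const struct job_proc *parent;  /* NULL if there is no parent record */
};

/* Count the live members of group pgid whose live parent is in the
 * same session but in a different process group. */
int qualifying_parents(const struct job_proc *procs, size_t n, int pgid)
{
    int count = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        const struct job_proc *p = &procs[i];
        const struct job_proc *pp = p->parent;

        if (!p->alive || p->pgid != pgid)
            continue;
        if (pp != NULL && pp->alive && pp->sid == p->sid && pp->pgid != p->pgid)
            count++;
    }
    return count;
}

/* A process group is orphaned when the count reaches zero. */
int is_orphaned(const struct job_proc *procs, size_t n, int pgid)
{
    return qualifying_parents(procs, n, pgid) == 0;
}
```

In the shell, A, and B example above, B's parent A never counts, because A is in the same process group as B; only the shell keeps the group from being orphaned.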
To avoid stopping members of orphaned process groups if they try to read or
write to their controlling terminal, the kernel does not send them SIGTTIN and
SIGTTOU signals, and prevents them from stopping in response to those signals.
Instead, attempts to read or write to the terminal produce an error.
Process Debugging
4.4BSD provides a simplistic facility for controlling and debugging the execution
of a process. This facility, accessed through the ptrace system call, permits a
parent process to control a child process's execution by manipulating user- and
kernel-mode execution state. In particular, with ptrace, a parent process can do the
following operations on a child process:
• Read and write address space and registers
• Intercept signals posted to the process
• Single step and continue the execution of the process
• Terminate the execution of the process
The ptrace call is used almost exclusively by program debuggers, such as gdb.
When a process is being traced, any signals posted to that process cause it to
enter the SSTOP state. The parent process is notified with a SIGCHLD signal and
may interrogate the status of the child with the wait4 system call. On most
machines, trace traps, generated when a process is single stepped, and breakpoint
faults, caused by a process executing a breakpoint instruction, are translated by
4.4BSD into SIGTRAP signals. Because signals posted to a traced process cause it
to stop and result in the parent being notified, a program's execution can be
controlled easily.
To start a program that is to be debugged, the debugger first creates a child
process with a fork system call. After the fork, the child process uses a ptrace call
that causes the process to be flagged as traced by setting the P_TRACED bit in the
p_flag field of the process structure. The child process then sets the trace trap bit
in the process's processor status word and calls execve to load the image of the
program that is to be debugged. Setting this bit ensures that the first instruction
executed by the child process after the new image is loaded will result in a
hardware trace trap, which is translated by the system into a SIGTRAP signal. Because
the parent process is notified about all signals to the child, it can intercept the
signal and gain control over the program before it executes a single instruction.
All the operations provided by ptrace are carried out in the context of the
process being traced. When a parent process wants to do an operation, it places the
parameters associated with the operation into a data structure named ipc and
sleeps on the address of ipc. The next time that the child process encounters a
signal (immediately if it is currently stopped by a signal), it retrieves the parameters
from the ipc structure and does the requested operation. The child process then
places a return result in the ipc structure and does a wakeup( ) call with the address
of ipc as the wait channel. This approach minimizes the amount of extra code
needed in the kernel to support debugging. Because the child makes the changes
to its own address space, any pages that it tries to access that are not resident in
memory are brought into memory by the existing page-fault mechanisms. If the
parent tried to manipulate the child's address space, it would need special code to
find and load any pages that it wanted to access that were not resident in memory.
The ptrace facility is inefficient for three reasons. First, ptrace uses a single
global data structure for passing information back and forth between all the parent
and child processes in the system. Because there is only one structure, it must be
interlocked to ensure that only one parent-child process pair will use it at a time.
Second, because the data structure has a small, fixed size, the parent process is
limited to reading or writing 32 bits at a time. Finally, since each request by a
parent process must be done in the context of the child process, two context switches
need to be done for each request: one from the parent to the child to send the
request, and one from the child to the parent to return the result of the operation.
To address these problems, 4.4BSD added a /proc filesystem, similar to the
one found in UNIX Eighth Edition [Killian, 1984]. In the /proc system, the
address space of another process can be accessed with read and write system calls,
which allows a debugger to access a process being debugged with much greater
efficiency. The page (or pages) of interest in the child process is mapped into the
kernel address space. The requested data can then be copied directly from the
kernel to the parent address space. This technique avoids the need to have a data
structure to pass messages back and forth between processes, and avoids the
context switches between the parent and child processes. Because the ipc mechanism
was derived from the original UNIX code, it was not included in the freely
Chapter 4
1 14
Process M anagement
redistributabl e 4.4B S D-Lite release. Most rei mplementations simply converted the
ptrace requests into calls on /proc, or map the process pages directly into the ker­
nel memory. The result is a much simpler and faster implementation of ptrace.
Exercises
4.1 What are three implications of not having the user structure mapped at a
fixed virtual address in the kernel's address space?
4.2 Why is the performance of the context-switching mechanism critical to the
performance of a highly multiprogrammed system?
4.3 What effect would increasing the time quantum have on the system's
interactive response and total throughput?
4.4 What effect would reducing the number of run queues from 32 to 16 have
on the scheduling overhead and on system performance?
4.5 Give three reasons for the system to select a new process to run.
4.6 What type of scheduling policy does 4.4BSD use? What type of jobs does
the policy favor? Propose an algorithm for identifying these favored jobs.
4.7 Is job control still a useful facility, now that window systems are widely
available? Explain your answer.
4.8 When and how does process scheduling interact with the memory-management
facilities?
4.9 After a process has exited, it may enter the state of being a zombie,
SZOMB, before disappearing from the system entirely. What is the purpose
of the SZOMB state? What event causes a process to exit from SZOMB?
4.10 Suppose that the data structures shown in Fig. 4.2 do not exist. Instead
assume that each process entry has only its own PID and the PID of its parent.
Compare the costs in space and time to support each of the following
operations:
a. Creation of a new process
b. Lookup of the process's parent
c. Lookup of all a process's siblings
d. Lookup of all a process's descendants
e. Destruction of a process
4.11 The system raises the hardware priority to splhigh in the sleep() routine
before altering the contents of a process's process structure. Why does it do so?
4.12 A process blocked with a priority less than PZERO may never be awakened
by a signal. Describe two problems a noninterruptible sleep may cause if a
disk becomes unavailable while the system is running.
4.13 For each state listed in Table 4.1, list the system queues on which a process
in that state might be found.
*4.14 Define three properties of a real-time system. Give two reasons why
4.4BSD is not a real-time system.
*4.15 In 4.4BSD, the signal SIGTSTP is delivered to a process when a user types
a "suspend character." Why would a process want to catch this signal
before it is stopped?
*4.16 Before the 4.4BSD signal mechanism was added, signal handlers to catch
the SIGTSTP signal were written as

    catchstop()
    {
        prepare to stop;
        signal(SIGTSTP, SIG_DFL);
        kill(getpid(), SIGTSTP);
        signal(SIGTSTP, catchstop);
    }

This code causes an infinite loop in 4.4BSD. Why does it do so? How
should the code be rewritten?
*4.17 The process-priority calculations and accounting statistics are all based on
sampled data. Describe hardware support that would permit more accurate
statistics and priority calculations.
*4.18 What are the implications of adding a fixed-priority scheduling algorithm
to 4.4BSD?
*4.19 Why are signals a poor interprocess-communication facility?
**4.20 A kernel-stack-invalid trap occurs when an invalid value for the
kernel-mode stack pointer is detected by the hardware. Assume that this trap is
received on an interrupt stack in kernel mode. How might the system
terminate gracefully a process that receives such a trap while executing on the
kernel's run-time stack contained in the user structure?
**4.21 Describe a synchronization scheme that would work in a tightly coupled
multiprocessor hardware environment. Assume that the hardware supports
a test-and-set instruction.
**4.22 Describe alternatives to the test-and-set instruction that would allow you to
build a synchronization mechanism for a multiprocessor 4.4BSD system.
**4.23 A lightweight process is a thread of execution that operates within the
context of a normal 4.4BSD process. Multiple lightweight processes may exist
in a single 4.4BSD process and share memory, but each is able to do blocking
operations, such as system calls. Describe how lightweight processes
might be implemented entirely in user mode.
References
Aral et al, 1989.
Z. Aral, J. Bloom, T. Doeppner, I. Gertner, A. Langerman, & G. Schaffer,
"Variable Weight Processes with Flexible Shared Resources," USENIX
Association Conference Proceedings, pp. 405-412, January 1989.
Ferrin & Langridge, 1980.
T. E. Ferrin & R. Langridge, "Interactive Computer Graphics with the
UNIX Time-Sharing System," Computer Graphics, vol. 13, pp. 320-331,
1980.
Joy, 1994.
W. N. Joy, "An Introduction to the C Shell," in 4.4BSD User's Supplementary
Documents, pp. 4:1-46, O'Reilly & Associates, Inc., Sebastopol, CA,
1994.
Khanna et al, 1992.
S. Khanna, M. Sebree, & J. Zolnowsky, "Realtime Scheduling in SunOS
5.0," USENIX Association Conference Proceedings, pp. 375-390, January
1992.
Killian, 1984.
T. J. Killian, "Processes as Files," USENIX Association Conference
Proceedings, pp. 203-207, June 1984.
Ritchie, 1988.
D. M. Ritchie, "Multi-Processor UNIX," private communication, April 25,
1988.
Sanderson et al, 1986.
T. Sanderson, S. Ho, N. Heijden, E. Jabs, & J. L. Green, "Near-Realtime
Data Transmission During the ICE-Comet Giacobini-Zinner Encounter,"
ESA Bulletin, vol. 45, no. 21, 1986.
Schimmel, 1994.
C. Schimmel, UNIX Systems for Modern Architectures, Symmetric
Multiprocessing, and Caching for Kernel Programmers, Addison-Wesley,
Reading, MA, 1994.
Memory Management
5.1 Terminology
A central component of any operating system is the memory-management system.
As the name implies, memory-management facilities are responsible for the
management of memory resources available on a machine. These resources are
typically layered in a hierarchical fashion, with memory-access times inversely related
to their proximity to the CPU (see Fig. 5.1). The primary memory system is main
memory; the next level of storage is secondary storage or backing storage.
Main-memory systems usually are constructed from random-access memories, whereas
secondary stores are placed on moving-head disk drives. In certain workstation
environments, the common two-level hierarchy is becoming a three-level
hierarchy, with the addition of file-server machines connected to a workstation via
a local-area network [Gingell, Moran, & Shannon, 1987].
Figure 5.1 Hierarchical layering of memory.
In a multiprogrammed environment, it is critical for the operating system to
share available memory resources effectively among the processes. The operation
of any memory-management policy is directly related to the memory required for
a process to execute. That is, if a process must reside entirely in main memory for
it to execute, then a memory-management system must be oriented toward
allocating large units of memory. On the other hand, if a process can execute when it is
only partially resident in main memory, then memory-management policies are
likely to be substantially different. Memory-management facilities usually try to
optimize the number of runnable processes that are resident in main memory.
This goal must be considered with the goals of the process scheduler (Chapter 4),
so that conflicts that can adversely affect overall system performance are avoided.
Although the availability of secondary storage permits more processes to exist
than can be resident in main memory, it also requires additional algorithms that
can be complicated. Space management typically requires algorithms and policies
different from those used for main memory, and a policy must be devised for
deciding when to move processes between main memory and secondary storage.
Processes and Memory
Each process operates on a virtual machine that is defined by the architecture of
the underlying hardware on which it executes. We are interested in only those
machines that include the notion of a virtual address space. A virtual address
space is a range of memory locations that a process references independently of
the physical memory present in the system. In other words, the virtual address
space of a process is independent of the physical address space of the CPU. For a
machine to support virtual memory, we also require that the whole of a process's
virtual address space does not need to be resident in main memory for that process
to execute.
References to the virtual address space (virtual addresses) are translated by
hardware into references to physical memory. This operation, termed address
translation, permits programs to be loaded into memory at any location without
requiring position-dependent addresses in the program to be changed. Address
translation and virtual addressing are also important in efficient sharing of a CPU,
because position independence usually permits context switching to be done
quickly.
Most machines provide a contiguous virtual address space for processes.
Some machines, however, choose to partition visibly a process's virtual address
space into regions termed segments [Intel, 1984]; such segments usually must be
physically contiguous in main memory and must begin at fixed addresses. We
shall be concerned with only those systems that do not visibly segment their
virtual address space. This use of the word segment is not the same as its earlier use
in Section 3.5, when we were describing 4.4BSD process segments, such as text
and data segments.
When multiple processes are coresident in main memory, we must protect the
physical memory associated with each process's virtual address space to ensure
that one process cannot alter the contents of another process's virtual address
space. This protection is implemented in hardware and is usually tightly coupled
with the implementation of address translation. Consequently, the two operations
usually are defined and implemented together as hardware termed the
memory-management unit.
Virtual memory can be implemented in many ways, some of which are
software based, such as overlays. Most effective virtual-memory schemes are,
however, hardware based. In these schemes, the virtual address space is divided into
fixed-sized units, termed pages, as shown in Fig. 5.2. Virtual-memory references
are resolved by the address-translation unit to a page in main memory and an
offset within that page. Hardware protection is applied by the memory-management
unit on a page-by-page basis.
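The split of a virtual address into a page and an offset within that page can be sketched in a few lines of C. This is only an illustration under stated assumptions: the 4-Kbyte page size, the flat frame_of[] table, and the names page_number(), page_offset(), and translate() are all hypothetical, and a real memory-management unit would also check residency and protection bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration: 4 Kbyte (2^12 bytes). */
#define PAGE_SHIFT 12u
#define PAGE_SIZE  ((uint32_t)1 << PAGE_SHIFT)
#define PAGE_MASK  (PAGE_SIZE - 1)

/* Virtual page number: the high-order bits of the address. */
static uint32_t page_number(uint32_t vaddr)
{
    return vaddr >> PAGE_SHIFT;
}

/* Offset within the page: the low-order bits of the address. */
static uint32_t page_offset(uint32_t vaddr)
{
    return vaddr & PAGE_MASK;
}

/*
 * Translate vaddr using a toy table that maps each virtual page number
 * to a physical frame number. Returns 0 on success, or -1 if the page
 * lies outside the table.
 */
static int translate(const uint32_t *frame_of, uint32_t npages,
                     uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = page_number(vaddr);

    assert(frame_of != NULL && paddr != NULL);
    if (vpn >= npages)
        return -1;
    /* The offset is carried over unchanged; only the page part is mapped. */
    *paddr = (frame_of[vpn] << PAGE_SHIFT) | page_offset(vaddr);
    return 0;
}
```

With a table in which virtual page 1 maps to frame 3, address 0x1234 translates to 0x3234: the page part changes and the offset 0x234 is preserved.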
Some systems provide a two-tiered virtual-memory system in which pages are
grouped into segments [Organick, 1975]. In these systems, protection is usually at
the segment level. In the remainder of this chapter, we shall be concerned with
only those virtual-memory systems that are page based.
Address translation provides the implementation of virtual memory by decoupling
the virtual address space of a process from the physical address space of the CPU.
Each page of virtual memory is marked as resident or nonresident in main
memory. If a process references a location in virtual memory that is not resident, a
hardware trap termed a page fault is generated. The servicing of page faults, or
paging, permits processes to execute even if they are only partially resident in
main memory.
Figure 5.2 Paged virtual-memory scheme. Key: MMU, memory-management unit.
Coffman and Denning [1973] characterize paging systems by three important
policies:
1. When the system loads pages into memory (the fetch policy)
2. Where the system places pages in memory (the placement policy)
3. How the system selects pages to be removed from main memory when pages
are unavailable for a placement request (the replacement policy)
In normal circumstances, all pages of main memory are equally good, and the
placement policy has no effect on the performance of a paging system. Thus, a
paging system's behavior is dependent on only the fetch policy and the
replacement policy. Under a pure demand-paging system, a demand-fetch policy is used,
in which only the missing page is fetched, and replacements occur only when
main memory is full. Consequently, the performance of a pure demand-paging
system depends on only the system's replacement policy. In practice, paging
systems do not implement a pure demand-paging algorithm. Instead, the fetch policy
often is altered to do prepaging (fetching pages of memory other than the one
that caused the page fault), and the replacement policy is invoked before main
memory is full.
Replacement Algorithms
The replacement policy is the most critical aspect of any paging system. There is
a wide range of algorithms from which we can select in designing a replacement
strategy for a paging system. Much research has been carried out in evaluating the
performance of different page-replacement algorithms [Belady, 1966; King, 1971;
Marshall, 1979].
A process's paging behavior for a given input is described in terms of the
pages referenced over the time of the process's execution. This sequence of
pages, termed a reference string, represents the behavior of the process at discrete
times during the process's lifetime. Corresponding to the sampled references that
constitute a process's reference string are real-time values that reflect whether or
not the associated references resulted in a page fault. A useful measure of a
process's behavior is the fault rate, which is the number of page faults encountered
during processing of a reference string, normalized by the length of the reference
string.
Page-replacement algorithms typically are evaluated in terms of their
effectiveness on reference strings that have been collected from execution of real
programs. Formal analysis can also be used, although it is difficult to perform unless
many restrictions are applied to the execution environment. The most common
metric used in measuring the effectiveness of a page-replacement algorithm is the
fault rate.
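As a minimal sketch of how a fault rate might be measured, the following C code replays a reference string against a fixed number of page frames and divides the fault count by the string length. First-in, first-out replacement is used here only because it is short to write; it is an assumption of the example, not the 4.4BSD policy, and the names fifo_faults(), fault_rate(), and MAX_FRAMES are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FRAMES 16   /* arbitrary upper bound for this sketch */

/*
 * Count the page faults incurred by a reference string when nframes
 * page frames are managed with FIFO replacement.
 */
static size_t fifo_faults(const int *refs, size_t nrefs, size_t nframes)
{
    int frames[MAX_FRAMES];
    size_t used = 0;    /* frames currently holding a page */
    size_t next = 0;    /* frame holding the oldest page */
    size_t faults = 0;

    assert(nframes > 0 && nframes <= MAX_FRAMES);
    for (size_t i = 0; i < nrefs; i++) {
        int hit = 0;

        for (size_t j = 0; j < used; j++) {
            if (frames[j] == refs[i]) {
                hit = 1;
                break;
            }
        }
        if (hit)
            continue;
        faults++;
        if (used < nframes) {
            frames[used++] = refs[i];       /* a free frame is available */
        } else {
            frames[next] = refs[i];         /* replace the oldest page */
            next = (next + 1) % nframes;
        }
    }
    return faults;
}

/* Fault rate: faults normalized by the length of the reference string. */
static double fault_rate(size_t faults, size_t nrefs)
{
    assert(nrefs > 0);
    return (double)faults / (double)nrefs;
}
```

For the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 with three frames, this sketch counts 9 faults, a fault rate of 0.75. With four frames the same string incurs 10 faults, which illustrates why algorithms are compared on measured reference strings rather than by intuition.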
Page-replacement algorithms are defined in terms of the criteria that they use
for selecting pages to be reclaimed. For example, the optimal replacement policy
[Denning, 1970] states that the "best" choice of a page to replace is the one with
the longest expected time until its next reference. Clearly, this policy is not
applicable to dynamic systems, as it requires a priori knowledge of the paging
characteristics of a process. The policy is useful for evaluation purposes, however, as it
provides a yardstick for comparing the performance of other page-replacement
algorithms.
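When a complete reference string has been recorded in advance, the optimal choice can be computed offline. The sketch below picks the resident page whose next reference is farthest in the future; the function name optimal_victim() and its argument layout are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index in frames[] of the page whose next reference, looking
 * at refs[pos] through refs[nrefs - 1], is farthest away. A page that is
 * never referenced again is an ideal victim and is returned at once.
 */
static size_t optimal_victim(const int *frames, size_t nframes,
                             const int *refs, size_t nrefs, size_t pos)
{
    size_t victim = 0;
    size_t farthest = 0;

    assert(nframes > 0 && pos <= nrefs);
    for (size_t f = 0; f < nframes; f++) {
        size_t next = pos;

        /* Find the next position at which this page is referenced. */
        while (next < nrefs && refs[next] != frames[f])
            next++;
        if (next == nrefs)
            return f;               /* never used again */
        if (next > farthest) {
            farthest = next;
            victim = f;
        }
    }
    return victim;
}
```

The need to scan future references is exactly the a priori knowledge that a running system lacks, so a routine of this kind is useful only for producing a baseline against which practical algorithms are compared.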
Practical page-replacement algorithms require a certain amount of state
information that the system uses in selecting replacement pages. This state typically
includes the reference pattern of a process, sampled at discrete time intervals. On
some systems, this information can be expensive to collect [Babaoglu & Joy,
1981]. As a result, the "best" page-replacement algorithm may not be the most
efficient.
Working-Set Model
The working-set model assumes that processes exhibit a slowly changing locality
of reference. For a period of time, a process operates in a set of subroutines or
loops, causing all its memory references to refer to a fixed subset of its address
space, termed the working set. The process periodically changes its working set,
abandoning certain areas of memory and beginning to access new ones. After a
period of transition, the process defines a new set of pages as its working set. In
general, if the system can provide the process with enough pages to hold that
process's working set, the process will experience a low page-fault rate. If the system
cannot provide the process with enough pages for the working set, the process will
run slowly and will have a high page-fault rate.
Precise calculation of the working set of a process is impossible without a
priori knowledge of that process's memory-reference pattern. However, the working
set can be approximated by various means. One method of approximation is to
track the number of pages held by a process and that process's page-fault rate. If
the page-fault rate increases above a high watermark, the working set is assumed
to have increased, and the number of pages held by the process is allowed to grow.
Conversely, if the page-fault rate drops below a low watermark, the working set is
assumed to have decreased, and the number of pages held by the process is
decreased.
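The watermark approximation can be expressed as a small adjustment routine. This is a sketch under stated assumptions: the structure ws_policy, its fields, the fixed step size, and the function ws_adjust() are hypothetical and do not come from 4.4BSD.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning parameters for the watermark approximation. */
struct ws_policy {
    double high_watermark;  /* fault rate above which the allotment grows */
    double low_watermark;   /* fault rate below which the allotment shrinks */
    size_t step;            /* pages added or removed per adjustment */
    size_t min_pages;       /* lower bound on the allotment */
    size_t max_pages;       /* upper bound on the allotment */
};

/*
 * Return the new number of pages a process may hold, given its current
 * allotment and its recently measured page-fault rate.
 */
static size_t ws_adjust(const struct ws_policy *p, size_t pages,
                        double fault_rate)
{
    assert(p->low_watermark <= p->high_watermark);
    assert(p->min_pages <= p->max_pages);

    if (fault_rate > p->high_watermark) {
        /* Working set assumed to have grown. */
        pages += p->step;
        if (pages > p->max_pages)
            pages = p->max_pages;
    } else if (fault_rate < p->low_watermark) {
        /* Working set assumed to have shrunk; avoid unsigned underflow. */
        pages = (pages > p->step) ? pages - p->step : 0;
        if (pages < p->min_pages)
            pages = p->min_pages;
    }
    return pages;
}
```

Between the two watermarks the allotment is left alone, which keeps the estimate from oscillating when the fault rate hovers near a single threshold.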
Swapping is the term used to describe a memory-management policy in which
entire processes are moved to and from secondary storage when main memory is
in short supply. Swap-based memory-management systems usually are less
complicated than are demand-paged systems, since there is less bookkeeping to do.
However, pure swapping systems are typically less effective than are paging
systems, since the degree of multiprogramming is lowered by the requirement that
processes be fully resident to execute. Swapping is sometimes combined with
paging in a two-tiered scheme, whereby paging satisfies memory demands until a
severe memory shortfall requires drastic action, in which case swapping is used.
In this chapter, a portion of secondary storage that is used for paging or
swapping is termed a swap area or swap space. The hardware devices on which these
areas reside are termed swap devices.
Advantages of Virtual Memory
There are several advantages to the use of virtual memory on computers capable
of supporting this facility properly. Virtual memory allows large programs to be
run on machines with main-memory configurations that are smaller than the
program size. On machines with a moderate amount of memory, it allows more
programs to be resident in main memory to compete for CPU time, as the programs
do not need to be completely resident. When programs use sections of their
program or data space for some time, leaving other sections unused, the unused
sections do not need to be present. Also, the use of virtual memory allows programs
to start up faster, as they generally require only a small section to be loaded before
they begin processing arguments and determining what actions to take. Other
parts of a program may not be needed at all during individual runs. As a program
runs, additional sections of its program and data spaces are paged in on demand
(demand paging). Finally, there are many algorithms that are more easily
programmed by sparse use of a large address space than by careful packing of data
structures into a small area. Such techniques are too expensive for use without
virtual memory, but may run much faster when that facility is available, without
using an inordinate amount of physical memory.
On the other hand, the use of virtual memory can degrade performance. It is
more efficient to load a program all at one time than to load it entirely in small
sections on demand. There is a finite cost for each operation, including saving and
restoring state and determining which page must be loaded. So, some systems use
demand paging for only those programs that are larger than some minimum size.
Hardware Requirements for Virtual Memory
Nearly all versions of UNIX have required some form of memory-management
hardware to support transparent multiprogramming. To protect processes from
modification by other processes, the memory-management hardware must prevent
programs from changing their own address mapping. The 4.4BSD kernel runs in a
privileged mode (kernel mode or system mode) in which memory mapping can be
controlled, whereas processes run in an unprivileged mode (user mode). There are
several additional architectural requirements for support of virtual memory. The
CPU must distinguish between resident and nonresident portions of the address
space, must suspend programs when they refer to nonresident addresses, and must
resume programs' operation once the operating system has placed the required
section in memory. Because the CPU may discover missing data at various times
during the execution of an instruction, it must provide a mechanism to save the
machine state, so that the instruction can be continued or restarted later. The CPU
may implement restarting by saving enough state when an instruction begins that
the state can be restored when a fault is discovered. Alternatively, instructions
could delay any modifications or side effects until after any faults would be
discovered, so that the instruction execution does not need to back up before
restarting. On some computers, instruction backup requires the assistance of the
operating system.
Most machines designed to support demand-paged virtual memory include
hardware support for the collection of information on program references to
memory. When the system selects a page for replacement, it must save the contents of
that page if they have been modified since the page was brought into memory.
The hardware usually maintains a per-page flag showing whether the page has
been modified. Many machines also include a flag recording any access to a page
for use by the replacement algorithm.
5.2 Overview of the 4.4BSD Virtual-Memory System
The 4.4BSD virtual-memory system differs completely from the system that was
used in 4.3BSD and predecessors. The implementation is based on the Mach 2.0
virtual-memory system [Tevanian, 1987], with updates from Mach 2.5 and Mach
3.0. The Mach virtual-memory system was adopted because it features efficient
support for sharing and a clean separation of machine-independent and
machine-dependent features, as well as (currently unused) multiprocessor support. None of
the original Mach system-call interface remains. It has been replaced with the
interface first proposed for 4.2BSD that has been widely adopted by the UNIX
industry; the 4.4BSD interface is described in Section 5.5.
The virtual-memory system implements protected address spaces into which
can be mapped data sources (objects) such as files or private, anonymous pieces of
swap space. Physical memory is used as a cache of recently used pages from
these objects, and is managed by a global page-replacement algorithm much like
that of 4.3BSD.
The virtual address space of most architectures is divided into two parts.
Typically, the top 30 to 100 Mbyte of the address space is reserved for use by the
kernel. The remaining address space is available for use by processes. A traditional
UNIX layout is shown in Fig. 5.3. Here, the kernel and its associated
data structures reside at the top of the address space. The initial text and data areas
start at or near the beginning of memory. Typically, the first 4 or 8 Kbyte of
memory are kept off limits to the process. The reason for this restriction is to ease
program debugging; indirecting through a null pointer will cause an invalid address
fault, instead of reading or writing the program text. Memory allocations made by
the running process using the malloc() library routine (or the sbrk system call) are
done on the heap that starts immediately following the data area and grows to
higher addresses. The argument vector and environment vectors are at the top of
the user portion of the address space. The user's stack starts just below these
vectors and grows to lower addresses. Subject to only administrative limits, the stack
and heap can each grow until they meet. At that point, a process running on a
32-bit machine will be using nearly 4 Gbyte of address space.
Figure 5.3 Layout of virtual address space.
In 4.4BSD and other modern UNIX systems that support the mmap system
call, address-space usage is less structured. Shared library implementations may
place text or data arbitrarily, rendering the notion of predefined regions obsolete.
For compatibility, 4.4BSD still supports the sbrk call that malloc() uses to provide
a contiguous heap region, and the kernel has a designated stack region where
adjacent allocations are performed automatically.
At any time, the currently executing process is mapped into the virtual
address space. When the system decides to context switch to another process, it
must save the information about the current-process address mapping, then load
the address mapping for the new process to be run. The details of this
address-map switching are architecture dependent. Some architectures need to change
only a few memory-mapping registers that point to the base, and give the length,
of memory-resident page tables. Other architectures store the page-table
descriptors in special high-speed static RAM. Switching these maps may require
dumping and reloading hundreds of map entries.
Both the kernel and user processes use the same basic data structures for the
management of their virtual memory. The data structures used to manage virtual
memory are as follows:
vmspace        Structure that encompasses both the machine-dependent and
               machine-independent structures describing a process's address
               space
vm_map         Highest-level data structure that describes the
               machine-independent virtual address space
vm_map_entry   Structure that describes a virtually contiguous range of address
               space that shares protection and inheritance attributes
object         Structure that describes a source of data for a range of addresses
shadow object  Special object that represents a modified copy of original data
vm_page        The lowest-level data structure that represents the physical
               memory being used by the virtual-memory system
In the remainder of this section, we shall describe briefly how all these data
structures fit together. The remainder of this chapter will describe what the details of
the structures are and how the structures are used.
Figure 5.4 shows a typical process address space and associated data
structures. The vmspace structure encapsulates the virtual-memory state of a particular
process, including the machine-dependent and machine-independent data
structures, as well as statistics. The machine-dependent vm_pmap structure is opaque
to all but the lowest level of the system, and contains all information necessary to
manage the memory-management hardware. This pmap layer is the subject of
Section 5.13 and is ignored for the remainder of the current discussion. The
machine-independent data structures include the address space that is represented
by a vm_map structure. The vm_map contains a linked list of vm_map_entry
structures, hints for speeding up lookups during memory allocation and page-fault
handling, and a pointer to the associated machine-dependent vm_pmap structure
contained in the vmspace. A vm_map_entry structure describes a virtually
contiguous range of address space that has the same protection and inheritance
attributes. Every vm_map_entry points to a chain of vm_object structures that
describes sources of data (objects) that are mapped at the indicated address range.
At the tail of the chain is the original mapped data object, usually representing a
persistent data source, such as a file. Interposed between that object and the map
entry are one or more transient shadow objects that represent modified copies of
the original data. These shadow objects are discussed in detail in Section 5.5.
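The relationships just described can be sketched with simplified C declarations. These are stand-ins written for illustration only: the structure and field names (map, map_entry, object, shadowed, hint) are assumptions and are much smaller than the real vm_map, vm_map_entry, and vm_object structures, which also carry protection, inheritance, locking, and pager information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an object in a shadow chain. */
struct object {
    struct object *shadowed;    /* next object in the chain; NULL at the tail */
    const char    *name;        /* label used only by this example */
};

/* Simplified stand-in for a map entry covering [start, end). */
struct map_entry {
    struct map_entry *next;     /* entries linked in ascending address order */
    uintptr_t         start;
    uintptr_t         end;
    struct object    *object;   /* head of the object chain */
};

/* Simplified stand-in for an address map with a lookup hint. */
struct map {
    struct map_entry *head;
    struct map_entry *hint;     /* entry found by the previous lookup */
};

/* Find the entry that covers addr, trying the hint before the list. */
static struct map_entry *map_lookup(struct map *m, uintptr_t addr)
{
    struct map_entry *e;

    assert(m != NULL);
    e = m->hint;
    if (e != NULL && addr >= e->start && addr < e->end)
        return e;
    /* The list is sorted, so stop once entries start above addr. */
    for (e = m->head; e != NULL && e->start <= addr; e = e->next) {
        if (addr < e->end) {
            m->hint = e;
            return e;
        }
    }
    return NULL;
}

/* Follow the chain of shadow objects to the original object at its tail. */
static struct object *backing_object(struct object *o)
{
    while (o != NULL && o->shadowed != NULL)
        o = o->shadowed;
    return o;
}
```

The hint pays off because consecutive faults and allocations tend to fall in the same range, so many lookups are resolved without walking the list.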
Figure 5.4 Data structures that describe a process address space.
Each vm_object structure contains a linked list of vm_page structures
representing the physical-memory cache of the object, as well as a pointer to the
pager_struct structure that contains information on how to page in or page out
data from its backing store. There is a vm_page structure allocated for every page
of physical memory managed by the virtual-memory system, where a page here
may be a collection of multiple, contiguous hardware pages that will be treated by
the machine-dependent layer as though they were a single unit. The structure also
contains the status of the page (e.g., modified or referenced) and links for various
paging queues.
All structures contain the necessary interlocks for multithreading in a
multiprocessor environment. The locking is fine grained, with at least one lock per
instance of a data structure. Many of the structures contain multiple locks to
protect individual fields.
5.3 Kernel Memory Management
There are two ways in which the kernel's memory can be organized. The most
common is for the kernel to be permanently mapped into the high part of every
process address space. In this model, switching from one process to another does
not affect the kernel portion of the address space. The alternative organization is
to switch between having the kernel occupy the whole address space and mapping
the currently running process into the address space. Having the kernel
permanently mapped does reduce the amount of address space available to a large
process (and the kernel), but it also reduces the cost of data copying. Many system
calls require data to be transferred between the currently running user process and
the kernel. With the kernel permanently mapped, the data can be copied via the
efficient block-copy instructions. If the kernel is alternately mapped with the
process, data copying requires the use of special instructions that copy to and from
the previously mapped address space. These instructions are usually a factor of 2
slower than the standard block-copy instructions. Since up to one-third of the
kernel time is spent in copying between the kernel and user processes, slowing this
operation by a factor of 2 significantly slows system throughput.
Although the kernel is able freely to read and write the address space of the
user process, the converse is not true. The kernel's range of virtual address space
is marked inaccessible to all user processes. The reason for restricting writing is
so that user processes cannot tamper with the kernel's data structures. The reason
for restricting reading is so that user processes cannot watch sensitive kernel data
structures, such as the terminal input queues, that include such things as users
typing their passwords.
Usually, the hardware dictates which organization can be used. All the
architectures supported by 4.4BSD map the kernel into the top of the address space.
Kernel Maps and Submaps
When the system boots, the first task that the kernel must do is to set up data
structures to describe and manage its address space. Like any process, the kernel has a
vm_map with a corresponding set of vm_map_entry structures that describe the
use of a range of addresses. Submaps are a special kernel-only construct used to
isolate and constrain address-space allocation for kernel subsystems. One use is
in subsystems that require contiguous pieces of the kernel address space. So that
intermixing of unrelated allocations within an address range is avoided, that range
is covered by a submap, and only the appropriate subsystem can allocate from that
map. For example, several network buffer (mbuf) manipulation macros use
address arithmetic to generate unique indices, thus requiring the network buffer
region to be contiguous. Parts of the kernel may also require addresses with
particular alignments or even specific addresses. Both can be ensured by use of
submaps. Finally, submaps can be used to limit statically the amount of address
space and hence the physical memory consumed by a subsystem.
A typical layout of the kernel map is shown in Fig. 5.5. The kernel's address
space is described by the vm_map structure shown in the upper-left corner of the
figure. Pieces of the address space are described by the vm_map_entry structures
that are linked in ascending address order from K0 to K8 on the vm_map
structure. Here, the kernel text, initialized data, uninitialized data, and initially
allocated data structures reside in the range K0 to K1 and are represented by the first
vm_map_entry. The next vm_map_entry is associated with the address range from
K2 to K6; this piece of the kernel address space is being managed via a submap
headed by the referenced vm_map structure. This submap currently has two parts
of its address space used: the address range K2 to K3, and the address range K4 to
K5. These two submaps represent the kernel malloc arena and the network buffer
arena, respectively. The final part of the kernel address space is being managed in
the kernel's main map; the address range K7 to K8 representing the kernel I/O
staging area.

Figure 5.5 Kernel address-space maps. [Diagram: the kernel's vm_map heads a list
of vm_map_entry structures with start_addr and end_addr pairs K0 to K1, K2 to K6,
and K7 to K8; the K2-to-K6 entry references a submap vm_map whose entries cover
K2 to K3 and K4 to K5.]
Kernel Address-Space Allocation
The virtual-memory system implements a set of primitive functions for allocating
and freeing the page-aligned, page-rounded virtual-memory ranges that the kernel
uses. These ranges may be allocated either from the main kernel-address map or
from a submap. The allocation routines take a map and size as parameters, but do
not take an address. Thus, specific addresses within a map cannot be selected.
There are different allocation routines for obtaining nonpageable and pageable
memory ranges.
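The page rounding and page alignment mentioned above can be sketched with two small helpers. This is a minimal illustration, not the kernel's own code: the 4096-byte page size and the names SKETCH_PAGE_SIZE, round_to_pages(), and trunc_to_page() are assumptions made for the example.

```c
#include <stddef.h>

/* Assumed page size for illustration; real systems obtain it from the hardware. */
#define SKETCH_PAGE_SIZE 4096u

/* Round a byte count up to the next multiple of the page size (page-rounded). */
size_t round_to_pages(size_t size)
{
    return (size + SKETCH_PAGE_SIZE - 1) & ~((size_t)SKETCH_PAGE_SIZE - 1);
}

/* Truncate an address down to the start of the page that contains it (page-aligned). */
size_t trunc_to_page(size_t addr)
{
    return addr & ~((size_t)SKETCH_PAGE_SIZE - 1);
}
```

Because the page size is a power of 2, both operations reduce to a mask, which is why a request for even one byte consumes a whole page of address space.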
A nonpageable, or wired, range has physical memory assigned at the time of
the call, and this memory is not subject to replacement by the pageout daemon.
Wired pages must never cause a page fault that might result in a blocking
operation. Wired memory is allocated with kmem_alloc() and kmem_malloc().
Kmem_alloc() returns zero-filled memory and may block if insufficient physical
memory is available to honor the request. It will return a failure only if no address
space is available in the indicated map. Kmem_malloc() is a variant of
kmem_alloc() used by only the general allocator, malloc(), described in the next
subsection. This routine has a nonblocking option that protects callers against
inadvertently blocking on kernel data structures; it will fail if insufficient physical
memory is available to fill the requested range. This nonblocking option allocates
memory at interrupt time and during other critical sections of code. In general,
wired memory should be allocated via the general-purpose kernel allocator.
Kmem_alloc() should be used only to allocate memory from specific kernel
submaps.
Pageable kernel virtual memory can be allocated with kmem_alloc_pageable()
and kmem_alloc_wait(). A pageable range has physical memory allocated on
demand, and this memory can be written out to backing store by the pageout
daemon as part of the latter's normal replacement policy. Kmem_alloc_pageable()
will return an error if insufficient address space is available for the desired
allocation; kmem_alloc_wait() will block until space is available. Currently, pageable
kernel memory is used only for temporary storage of exec arguments and for the
kernel stacks of processes that have been swapped out.
Kmem_free() deallocates kernel wired memory and pageable memory
allocated with kmem_alloc_pageable(). Kmem_free_wakeup() should be used with
kmem_alloc_wait() because it wakes up any processes waiting for address space
in the specified map.
Kernel Malloc
The kernel also provides a generalized nonpageable memory-allocation and
freeing mechanism that can handle requests with arbitrary alignment or size, as well
as allocate memory at interrupt time. Hence, it is the preferred way to allocate
kernel memory. This mechanism has an interface similar to that of the
well-known memory allocator provided for applications programmers through the C
library routines malloc() and free(). Like the C library interface, the allocation
routine takes a parameter specifying the size of memory that is needed. The range
of sizes for memory requests is not constrained. The free routine takes a pointer
to the storage being freed, but does not require the size of the piece of memory
being freed.
Often, the kernel needs a memory allocation for the duration of a single
system call. In a user process, such short-term memory would be allocated on the
run-time stack. Because the kernel has a limited run-time stack, it is not feasible
to allocate even moderate blocks of memory on it. Consequently, such memory
must be allocated dynamically. For example, when the system must translate a
pathname, it must allocate a 1-Kbyte buffer to hold the name. Other blocks of
memory must be more persistent than a single system call, and have to be
allocated from dynamic memory. Examples include protocol control blocks that
remain throughout the duration of a network connection.
The design specification for a kernel memory allocator is similar to, but not
identical to, the design criteria for a user-level memory allocator. One criterion
for a memory allocator is that the latter make good use of the physical memory.
Use of memory is measured by the amount of memory needed to hold a set of
allocations at any point in time. Percentage utilization is expressed as

$$\text{utilization} = \frac{\text{requested}}{\text{required}}$$

Here, requested is the sum of the memory that has been requested and not yet
freed; required is the amount of memory that has been allocated for the pool from
which the requests are filled. An allocator requires more memory than requested
because of fragmentation and a need to have a ready supply of free memory for
future requests. A perfect memory allocator would have a utilization of 100
percent. In practice, a 50-percent utilization is considered good [Korn & Vo, 1985].
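The ratio of requested to required memory defined above can be computed directly. The following is a minimal sketch; the function name utilization_percent() and the choice to report an empty pool as 0 percent are assumptions made for the example.

```c
/* Percentage utilization: 100 * requested / required.
 * requested = bytes handed out and not yet freed;
 * required  = bytes the allocator holds in its pool to satisfy them. */
double utilization_percent(unsigned long requested, unsigned long required)
{
    if (required == 0)
        return 0.0;  /* assumption: an empty pool is reported as 0 percent */
    return 100.0 * (double)requested / (double)required;
}
```

For instance, a pool of 1024 bytes holding 512 bytes of live allocations sits at the 50-percent level that the text describes as good in practice.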
Good memory utilization in the kernel is more important than in user
processes. Because user processes run in virtual memory, unused parts of their
address space can be paged out. Thus, pages in the process address space that are
part of the required pool that are not being requested do not need to tie up physical
memory. Since the kernel malloc arena is not paged, all pages in the required pool
are held by the kernel and cannot be used for other purposes. To keep the
kernel-utilization percentage as high as possible, the kernel should release unused
memory in the required pool, rather than hold it, as is typically done with user
processes. Because the kernel can manipulate its own page maps directly, freeing
unused memory is fast; a user process must do a system call to free memory.
The most important criterion for a kernel memory allocator is that the latter
be fast. A slow memory allocator will degrade the system performance because
memory allocation is done frequently. Speed of allocation is more critical when
executing in the kernel than it is in user code because the kernel must allocate
many data structures that user processes can allocate cheaply on their run-time
stack. In addition, the kernel represents the platform on which all user processes
run, and, if it is slow, it will degrade the performance of every process that is
running.
Another problem with a slow memory allocator is that programmers of
frequently used kernel interfaces will think that they cannot afford to use the memory
allocator as their primary one. Instead, they will build their own memory allocator
on top of the original by maintaining their own pool of memory blocks. Multiple
allocators reduce the efficiency with which memory is used. The kernel ends up
with many different free lists of memory, instead of a single free list from which
all allocations can be drawn. For example, consider the case of two subsystems
that need memory. If they have their own free lists, the amount of memory tied up
in the two lists will be the sum of the greatest amount of memory that each of the
two subsystems has ever used. If they share a free list, the amount of memory tied
up in the free list may be as low as the greatest amount of memory that either
subsystem used. As the number of subsystems grows, the savings from having a
single free list grow.
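The two-subsystem comparison above can be checked with a small calculation. The sketch below is illustrative only: it assumes an idealized allocator in which freed memory is immediately reusable and fragmentation is ignored, and the names peak_private() and peak_shared() are hypothetical. Each array element is one subsystem's memory use at a given moment.

```c
#include <stddef.h>

/* Memory tied up when each subsystem keeps a private free list:
 * the sum of the two subsystems' individual peak usages. */
unsigned long peak_private(const unsigned long *a, const unsigned long *b, size_t n)
{
    unsigned long max_a = 0, max_b = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > max_a) max_a = a[i];
        if (b[i] > max_b) max_b = b[i];
    }
    return max_a + max_b;
}

/* Memory tied up with one shared free list:
 * the largest combined usage observed at any single moment. */
unsigned long peak_shared(const unsigned long *a, const unsigned long *b, size_t n)
{
    unsigned long peak = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] + b[i] > peak)
            peak = a[i] + b[i];
    }
    return peak;
}
```

With usage traces of {8, 2, 0, 1} and {0, 1, 6, 2}, private lists tie up 8 + 6 = 14 units, whereas a shared list ties up only 8, the greater of the two individual peaks, because the subsystems never peak at the same time.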
The kernel memory allocator uses a hybrid strategy. Small allocations are
done using a power-of-2 list strategy; the typical allocation requires only a
computation of the list to use and the removal of an element if that element is available,
so it is fast. Only if the request cannot be fulfilled from a list is a call made to the
allocator itself. To ensure that the allocator is always called for large requests, the
lists corresponding to large allocations are always empty.
Freeing a small block also is fast. The kernel computes the list on which to
place the block, and puts the block there. The free routine is called only if the
block of memory is considered to be a large allocation.
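The "computation of the list to use" for a small request can be sketched as follows. This is not the kernel's actual macro; the 16-byte minimum block size and the names MIN_BUCKET_SHIFT, bucket_index(), and bucket_size() are assumptions for illustration, and the loop is intended only for small requests.

```c
#include <stddef.h>

/* Assumed smallest block size: 1 << 4 = 16 bytes. */
#define MIN_BUCKET_SHIFT 4

/* Return the index of the power-of-2 list that serves a request of `size` bytes:
 * index 0 holds 16-byte blocks, index 1 holds 32-byte blocks, and so on. */
unsigned bucket_index(size_t size)
{
    unsigned idx = 0;
    size_t block = (size_t)1 << MIN_BUCKET_SHIFT;

    while (block < size) {  /* double the block size until the request fits */
        block <<= 1;
        idx++;
    }
    return idx;
}

/* Size in bytes of the blocks kept on list `idx`. */
size_t bucket_size(unsigned idx)
{
    return (size_t)1 << (MIN_BUCKET_SHIFT + idx);
}
```

A 100-byte request, for example, maps to the list of 128-byte blocks. The same computation identifies the list on which a freed small block is placed.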
Because of the inefficiency of power-of-2 allocation strategies for large
allocations, the allocation method for large blocks is based on allocating pieces of
memory in multiples of pages. The algorithm switches to the slower but more
memory-efficient strategy for allocation sizes larger than 2 × pagesize. This value
is chosen because the power-of-2 algorithm yields sizes of 1, 2, 4, 8, ..., n pages,
whereas the large block algorithm that allocates in multiples of pages yields sizes
of 1, 2, 3, 4, ..., n pages. Thus, for allocations of sizes between one and two
pages, both algorithms use two pages; a difference emerges beginning with
allocations of sizes between two and three pages, where the power-of-2 algorithm will
use four pages, whereas the large block algorithm will use three pages. Thus, the
threshold between the large and small allocators is set to two pages.
Large allocations are first rounded up to be a multiple of the page size. The
allocator then uses a "first-fit" algorithm to find space in the kernel address arena
set aside for dynamic allocations. On a machine with a 4-Kbyte page size, a
request for a 20-Kbyte piece of memory will use exactly five pages of memory,
rather than the eight pages used with the power-of-2 allocation strategy. When a
large piece of memory is freed, the memory pages are returned to the free-memory
pool and the vm_map_entry structure is deleted from the submap, effectively
coalescing the freed piece with any adjacent free space.
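The page counts quoted above for the two strategies can be reproduced with a short sketch. The 4-Kbyte page size matches the example in the text; the names SKETCH_PAGE, pages_by_multiple(), and pages_by_power_of_2() are hypothetical.

```c
#include <stddef.h>

/* 4-Kbyte page, as in the example in the text. */
#define SKETCH_PAGE 4096u

/* Pages consumed when a request is rounded up to a multiple of the page size. */
size_t pages_by_multiple(size_t size)
{
    return (size + SKETCH_PAGE - 1) / SKETCH_PAGE;
}

/* Pages consumed when a request is rounded up to a power-of-2 number of pages. */
size_t pages_by_power_of_2(size_t size)
{
    size_t pages = 1;

    while (pages * SKETCH_PAGE < size)
        pages <<= 1;
    return pages;
}
```

For a 20-Kbyte request, the first function yields five pages and the second yields eight. For any request of at most two pages, the two functions agree, which is consistent with placing the threshold at two pages.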
Another technique to improve both the efficiency of memory utilization and
the speed of allocation is to cluster same-sized small allocations on a page. When
a list for a power-of-2 allocation is empty, a new page is allocated and is divided
into pieces of the needed size. This strategy speeds future allocations because
several pieces of memory become available as a result of the call into the allocator.
Because the size is not specified when a block of memory is freed, the
allocator must keep track of the sizes of the pieces that it has handed out. Many
allocators increase the allocation request by a few bytes to create space to store the size
of the block in a header just before the allocation. However, this strategy doubles
the memory requirement for allocations that request a power-of-2-sized block.
Therefore, instead of storing the size of each piece of memory with the piece
itself, the kernel associates the size information with the memory page. Figure 5.6
shows how the kernel determines the size of a piece of memory that is being freed,
by calculating the page in which it resides and looking up the size associated with
that page. Locating the allocation size outside of the allocated block improved
utilization far more than expected. The reason is that many allocations in the
kernel are for blocks of memory whose size is exactly a power of 2. These requests
would be nearly doubled in size if the more typical strategy were used. Now they
can be accommodated with no wasted memory.
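The per-page size lookup described above can be sketched in self-contained C. This is an illustration under stated assumptions, not the kernel's code: the arena is a small static array standing in for the kernel malloc arena, and the names SK_PAGESIZE, SK_NPAGES, arena_page(), set_page_block_size(), and block_size() are hypothetical.

```c
#include <stddef.h>

#define SK_PAGESIZE 4096u  /* assumed page size */
#define SK_NPAGES   8u     /* assumed arena size, in pages */

static char   arena[SK_NPAGES * SK_PAGESIZE];  /* stands in for the malloc arena */
static size_t page_sizes[SK_NPAGES];           /* one size entry per arena page */

/* Address of the first byte of arena page `n`. */
char *arena_page(size_t n)
{
    return arena + n * SK_PAGESIZE;
}

/* Record that the page containing `addr` is divided into blocks of `size` bytes. */
void set_page_block_size(const char *addr, size_t size)
{
    page_sizes[(size_t)(addr - arena) / SK_PAGESIZE] = size;
}

/* Recover a block's size from its address alone, as the free routine must do. */
size_t block_size(const char *addr)
{
    return page_sizes[(size_t)(addr - arena) / SK_PAGESIZE];
}
```

Because every block on a page has the same size, one table entry per page suffices, and a 1024-byte request consumes exactly 1024 bytes rather than spilling into the next power of 2 to make room for a header.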
The allocator can be called both from the top half of the kernel that is willing
to wait for memory to become available, and from the interrupt routines in the
bottom half of the kernel that cannot wait for memory to become available. Clients
show their willingness (and ability) to wait with a flag to the allocation routine.
For clients that are willing to wait, the allocator guarantees that their request will
succeed. Thus, these clients do not need to check the return value from the
allocator. If memory is unavailable and the client cannot wait, the allocator returns
a null pointer. These clients must be prepared to cope with this (hopefully
infrequent) condition (usually by giving up and hoping to succeed later). The details of
the kernel memory allocator are further described in [McKusick & Karels, 1988].

Figure 5.6 Calculation of allocation size. Key: free, unused page; cont, continuation of
previous page.

    char *kmembase
    kmemsizes[] = {4096, 1024, 2048, 12288, cont, ..., 512, ..., free, cont, ...}
    usage: memsize(char *addr)
        return(kmemsizes[(addr - kmembase) / PAGESIZE]);
Per-Process Resources
As we have already seen, a process requires a process entry and a kernel stack.
The next major resource that must be allocated is its virtual memory. The initial
virtual-memory requirements are defined by the header in the process's
executable. These requirements include the space needed for the program text, the
initialized data, the uninitialized data, and the run-time stack. During the initial
startup of the program, the kernel will build the data structures necessary to
describe these four areas. Most programs need to allocate additional memory.
The kernel typically provides this additional memory by expanding the
uninitialized data area.
Most 4.4BSD systems also provide shared libraries. The header for the
executable will describe the libraries that it needs (usually the C library, and
possibly others). The kernel is not responsible for locating and mapping these libraries
during the initial execution of the program. Finding, mapping, and creating the
dynamic linkages to these libraries is handled by the user-level startup code
prepended to the file being executed. This startup code usually runs before control
is passed to the main entry point of the program [Gingell et al, 1987].
4.4BSD Process Virtual-Address Space
The initial layout of the address space for a process is shown in Fig. 5.7. As
discussed in Section 5.2, the address space for a process is described by that
process's vmspace structure. The contents of the address space are defined by a list of
vm_map_entry structures, each structure describing a region of virtual address
space that resides between a start and an end address. A region describes a range
of memory that is being treated in the same way. For example, the text of a
program is a region that is read-only and is demand paged from the file on disk that
contains it. Thus, the vm_map_entry also contains the protection mode to be
applied to the region that it describes. Each vm_map_entry structure also has a
pointer to the object that provides the initial data for the region. It also stores the
modified contents either transiently when memory is being reclaimed or more
permanently when the region is no longer needed. Finally, each vm_map_entry
structure has an offset that describes where within the object the mapping begins.
The example shown in Fig. 5.7 represents a process just after it has started
execution. The first two map entries both point to the same object; here, that
object is the executable. The executable consists of two parts: the text of the
program that resides at the beginning of the file and the initialized data area that
follows at the end of the text. Thus, the first vm_map_entry describes a read-only
region that maps the text of the program. The second vm_map_entry describes the
copy-on-write region that maps the initialized data of the program that follows the
program text in the file (copy-on-write is described in Section 5.6). The offset
field in the entry reflects this different starting location. The third and fourth
vm_map_entry structures describe the uninitialized data and stack areas,
respectively. Both of these areas are represented by anonymous objects. An anonymous
object provides a zero-filled page on first use, and arranges to store modified pages
in the swap area if memory becomes tight. Anonymous objects are described in
more detail later in this section.

Figure 5.7 Layout of an address space. [Diagram: a vmspace structure heads a list
of four vm_map_entry structures, each holding a start addr, an end addr, and an
obj offset; the entries reference vnode/object structures.]
Page-Fault Dispatch
When a process attempts to access a piece of its address space that is not currently
resident, a page fault occurs. The page-fault handler in the kernel is presented
with the virtual address that caused the fault. The fault is handled with the follow­
ing four steps:
1. Find the vmspace structure for the faulting process; from that structure, find
the head of the vm_map_entry list.
2. Traverse the vm_map_entry list starting at the entry indicated by the map hint;
for each entry, check whether the faulting address falls within its start and end
address range. If the kernel reaches the end of the list without finding any
valid region, the faulting address is not within any valid part of the address
space for the process, so send the process a segment fault signal.
3. Having found a vm_map_entry that contains the faulting address, convert that
address to an offset within the underlying object. Calculate the offset within
the object as

    object_offset = fault_address
                    - vm_map_entry->start_address
                    + vm_map_entry->object_offset

Subtract off the start address to give the offset into the region mapped by the
vm_map_entry. Add in the object offset to give the absolute offset of the page
within the object.
4. Present the absolute object offset to the underlying object, which allocates a
vm_page structure and uses its pager to fill the page. The obj ect then returns a
pointer to the vm_page structure, which is mapped into the faulting location in
the process address space.
Once the appropriate page has been mapped into the faulting location, the page­
fault handler returns and reexecutes the faulting instruction.
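Steps 2 and 3 of the fault dispatch can be sketched with a simplified map entry. This is a minimal illustration, not the kernel's structure: struct map_entry and its field names are hypothetical, the end address is assumed to be exclusive, and the search starts at the head of the list rather than at the map hint.

```c
#include <stddef.h>

/* Simplified stand-in for a map entry; the real structure carries more fields. */
struct map_entry {
    unsigned long start;          /* first address of the region */
    unsigned long end;            /* first address past the region (assumed exclusive) */
    unsigned long object_offset;  /* where within the object the mapping begins */
    struct map_entry *next;       /* next entry, in ascending address order */
};

/* Step 2: find the entry whose range covers the faulting address.
 * Returns NULL if no region is valid there; the caller would then signal the process. */
const struct map_entry *find_entry(const struct map_entry *head,
                                   unsigned long fault_addr)
{
    for (const struct map_entry *e = head; e != NULL; e = e->next)
        if (fault_addr >= e->start && fault_addr < e->end)
            return e;
    return NULL;
}

/* Step 3: convert the faulting address to an absolute offset within the object. */
unsigned long object_offset_of(const struct map_entry *e, unsigned long fault_addr)
{
    return fault_addr - e->start + e->object_offset;
}
```

For a data region mapped at 0x3000 that begins 0x2000 bytes into its object, a fault at 0x3800 resolves to object offset 0x2800; that offset is what step 4 presents to the object.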
Mapping to Objects
Objects are used to hold information about either a file or about an area of anony­
mous memory. Whether a file is mapped by a single process in the system or by
many processes in the system, it will always be represented by a single object.
Thus, the object is responsible for maintaining all the state about those pages of a
file that are resident. All references to that file will be described by
vm_map_entry structures that reference the same object. An object never stores
the same page of a file in more than one memory page, so that all mappings will
get a consistent view of the file.
An object stores the following information:
• A list of the pages for that object that are currently resident in main memory; a
page may be mapped into multiple address spaces, but it is always claimed by
exactly one object
• A count of the number of vm_map_entry structures or other objects that reference the object
• The size of the file or anonymous area described by the object
• The number of memory-resident pages held by the object
• Pointers to copy or shadow objects (described in Section 5.5)
• A pointer to the pager for the object; the pager is responsible for providing the
data to fill a page, and for providing a place to store the page when it has been
modified (pagers are covered in Section 5.10)
There are four types of objects in the system:
• Named objects represent files; they may also represent hardware devices that are
able to provide mapped memory such as frame buffers.
• Anonymous objects represent areas of memory that are zero filled on first use;
they are abandoned when they are no longer needed.
• Shadow objects hold private copies of pages that have been modified; they are
abandoned when they are no longer referenced.
• Copy objects hold old pages from files that have been modified after they were
privately mapped; they are abandoned when the private mapping is abandoned.
These objects are often referred to as "internal" objects in the source code. The
type of an object is defined by the pager that that object uses to fulfill page-fault
requests.
A named object uses either (an instance of) the device pager, if it maps a
hardware device, or the vnode pager, if it is backed by a file in the filesystem. A
device pager services a page fault by returning the appropriate address for the device
being mapped. Since the device memory is separate from the main memory on
the machine, it will never be selected by the pageout daemon. Thus, the device
pager never has to handle a pageout request.
The vnode pager provides an interface to objects that represent files in the
filesystem. A vnode-pager instance keeps a reference to a vnode that represents
the file being mapped by the object. A vnode pager services a pagein request by
doing a read on the vnode; it services a pageout request by doing a write to the
vnode. Thus, the file itself stores the modified pages. In cases where it is not
appropriate to modify the file directly, such as an executable that does not want to
modify its initialized data pages, the kernel must interpose an anonymous shadow
object between the vm_map_entry and the object representing the file.
Anonymous objects use the swap pager. An anonymous object services
pagein requests by getting a page of memory from the free list, and zeroing that
page. When a pageout request is made for a page for the first time, the swap pager
is responsible for finding an unused page in the swap area, writing the contents of
the page to that space, and recording where that page is stored. If a pagein request
comes for a page that had been previously paged out, the swap pager is
responsible for finding where it stored that page and reading back the contents into a free
page in memory. A later pageout request for that page will cause the page to be
written out to the previously allocated location.
Shadow objects and copy objects also use the swap pager. They work just
like anonymous objects, except that the swap pager provides their initial pages by
copying existing pages in response to copy-on-write faults, instead of by
zero-filling pages.
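The swap pager's bookkeeping for anonymous memory (allocate a swap location on the first pageout, reuse it afterwards, and zero fill on a pagein that has no saved copy) can be modeled with a toy table. Everything here is an assumption for illustration: struct swap_object, the fixed 16-page object, and the simple counter used as a swap-space allocator are not the kernel's data structures, and callers must pass a page index below SW_NPAGES.

```c
#include <stddef.h>

#define SW_NPAGES 16    /* assumed number of pages in the toy object */
#define SW_NOSLOT (-1)  /* marks a page that has never been paged out */

struct swap_object {
    int slot[SW_NPAGES];  /* swap location recorded for each page, or SW_NOSLOT */
    int next_free;        /* next unused location in the toy swap area */
};

/* Start with no page having a swap location. */
void swap_object_init(struct swap_object *o)
{
    for (size_t i = 0; i < SW_NPAGES; i++)
        o->slot[i] = SW_NOSLOT;
    o->next_free = 0;
}

/* Pageout: find an unused swap location on the first pageout of a page and
 * record it; later pageouts of the same page reuse that location.
 * Returns the location written. */
int swap_pageout(struct swap_object *o, size_t page)
{
    if (o->slot[page] == SW_NOSLOT)
        o->slot[page] = o->next_free++;
    return o->slot[page];
}

/* Pagein: return the swap location to read back, or SW_NOSLOT when the page
 * was never paged out and should simply be zero filled. */
int swap_pagein(const struct swap_object *o, size_t page)
{
    return o->slot[page];
}
```

A shadow or copy object would differ only in how a page without a saved copy is initialized: by copying an existing page rather than by zero filling.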
Further details on the pagers are given in Section 5.10.
Each virtual-memory object has a pager associated with it; objects that map files
have a vnode pager associated with them. Each instance of a vnode pager is
associated with a particular vnode. Objects are stored on a hash chain and are
identified by their associated pager. When a fault occurs for a file that is mapped into
memory, the kernel checks its vnode pager cache to see whether a pager already
exists for that file. If a pager exists, the kernel then looks to see whether there is
an object still associated with that pager. If the object exists, it can be checked to
see whether the faulted page is resident. If the page is resident, it can be used. If
the page is not resident, a new page is allocated, and the pager is requested to fill
the new page.
Caching in the virtual-memory system is identified by an object that is
associated with a file or region that it represents. Each object contains pages that are the
cached contents of its associated file or region. Objects that represent anonymous
memory are reclaimed as soon as the reference count drops to zero. However,
objects that refer to files are persistent. When their reference count drops to zero,
the object is stored on a least-recently used (LRU) list known as the object cache.
The object remains on its hash chain, so that future uses of the associated file will
cause the existing object to be found. The pages associated with the object are
moved to the inactive list, which is described in Section 5.12. However, their
identity is retained, so that, if the object is reactivated and a page fault occurs
before the associated page is freed, that page can be reattached, rather than being
reread from disk.
This cache is similar to the text cache found in earlier versions of BSD in that
it provides performance improvements for short-running but frequently executed
programs. Frequently executed programs include those to list the contents of
directories, to show system status, or to do the intermediate steps involved in
compiling a program. For example, consider a typical application that is made up of
multiple source files. Each of several compiler steps must be run on each file in
turn. The first time that the compiler is run, the objects associated with its various
components are read in from the disk. For each file compiled thereafter, the
previously created objects are found, alleviating the need to reload them from disk each
time.
Objects to Pages
When the system is first booted, the kernel looks through the physical memory on
the machine to find out how many pages are available. After the physical memory
that will be dedicated to the kernel itself has been deducted, all the remaining
pages of physical memory are described by vm_page structures. These vm_page
structures are all initially placed on the memory free list. As the system starts
running and processes begin to execute, they generate page faults. Each page fault is
matched to the object that covers the faulting piece of address space. The first
time that a piece of an object is faulted, it must allocate a page from the free list,
and must initialize that page either by zero filling it or by reading its contents from
the filesystem. That page then becomes associated with the object. Thus, each
object has its current set of vm_page structures linked to it. A page can be
associated with at most one object at a time. Although a file may be mapped into
several processes at once, all those mappings reference the same object. Having a
single object for each file ensures that all processes will reference the same
physical pages. One anomaly is that the object offset in a vm_map_entry structure may
not be page aligned (the result of an mmap call with a non-page-aligned offset
parameter). Consequently, a vm_page may be filled and associated with the object
with a non-page-aligned tag that will not match another access to the same object
at the page-aligned boundary. Hence, if two processes map the same object with
offsets of 0 and 32, two vm_pages will be filled with largely the same data, and
that can lead to inconsistent views of the file.
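The anomaly with offsets of 0 and 32 can be made concrete with a simplified model of how a page's tag within the object is derived. The model assumes that the fault is resolved at the start of the faulting page within the mapping and that the mapping's object offset is then added; the names AL_PAGESIZE and page_tag() are hypothetical.

```c
/* Assumed page size for the model. */
#define AL_PAGESIZE 4096ul

/* Tag (offset within the object) given to the page that satisfies a fault.
 * map_off:  object offset at which the mapping begins
 * fault_at: byte position of the fault, relative to the start of the mapping */
unsigned long page_tag(unsigned long map_off, unsigned long fault_at)
{
    /* Start of the faulting page within the mapping. */
    unsigned long page_start = fault_at - (fault_at % AL_PAGESIZE);

    /* Adding an unaligned mapping offset produces an unaligned tag. */
    return page_start + map_off;
}
```

A fault on the first page of a mapping at offset 0 yields tag 0, whereas the same fault in a mapping at offset 32 yields tag 32. Since the tags differ, the second fault does not find the first page, and a second page covering nearly the same bytes of the file is filled.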
If memory becomes scarce, the paging daemon will search for pages that have
not been used recently. Before these pages can be used by a new object, they must
be removed from all the processes that currently have them mapped, and any
modified contents must be saved by the object that owns them. Once cleaned, the
pages can be removed from the object that owns them and can be placed on the
free list for reuse. The details of the paging system are described in Section 5.12.
Shared Memory
In Section 5.4, we explained how the address space of a process is organized.
This section shows the additional data structures needed to support shared address
space between processes. Traditionally, the address space of each process was
completely isolated from the address space of all other processes running on the
system. The only exception was read-only sharing of program text. All
interprocess communication was done through well-defined channels that passed
through the kernel: pipes, sockets, files, and special devices. The benefit of this
isolated approach is that, no matter how badly a process destroys its own address
space, it cannot affect the address space of any other process running on the
system. Each process can precisely control when data are sent or received; it can also
precisely identify the locations within its address space that are read or written.
The drawback of this approach is that all interprocess communication requires at
least two system calls: one from the sending process and one from the receiving
process. For high volumes of interprocess communication, especially when small
packets of data are being exchanged, the overhead of the system calls dominates
the communications cost.
Shared memory provides a way to reduce interprocess-communication costs
dramatically. Two or more processes that wish to communicate map the same
piece of read-write memory into their address space. Once all the processes have
mapped the memory into their address space, any changes to that piece of memory
are visible to all the other processes, without any intervention by the kernel. Thus,
interprocess communication can be achieved without any system-call overhead,
other than the cost of the initial mapping. The drawback to this approach is that, if
a process that has the memory mapped corrupts the data structures in that memory,
all the other processes mapping that memory also are corrupted. In addition, there
is the complexity faced by the application developer who must develop data
structures to control access to the shared memory, and must cope with the race
conditions inherent in manipulating and controlling such data structures that are being
accessed concurrently.
Some variants of UNIX have a kernel-based semaphore mechanism to provide
the needed serialization of access to the shared memory. However, both getting
and setting such semaphores require system calls. The overhead of using such
semaphores is comparable to that of using the traditional
interprocess-communication methods. Unfortunately, these semaphores have all the
complexity of shared memory, yet confer little of its speed advantage. The primary
reason to introduce the complexity of shared memory is for the commensurate speed
gain. If this gain is to be obtained, most of the data-structure locking needs to be
done in the shared memory segment itself. The kernel-based semaphores should be
used for only those rare cases where there is contention for a lock and one process
must wait. Consequently, modern interfaces, such as POSIX Pthreads, are designed
such that the semaphores can be located in the shared memory region. The common
case of setting or clearing an uncontested semaphore can be done by the user
process, without calling the kernel. There are two cases where a process must do a
system call. If a process tries to set an already-locked semaphore, it must call the
kernel to block until the semaphore is available. This system call has little effect
on performance because the lock is contested, so it is impossible to proceed and the
kernel has to be invoked to do a context switch anyway. If a process clears a
semaphore that is wanted by another process, it must call the kernel to awaken that
process. Since most locks are uncontested, the applications can run at full speed
without kernel intervention.
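
The split between an uncontested fast path and a contested slow path can be
sketched in C. This is a minimal illustration under stated assumptions, not the
Pthreads implementation: the type shm_lock and its functions are hypothetical
names, C11 atomics stand in for the hardware test-and-set, and sched_yield() is
only a placeholder for the system call that would block the caller in the kernel.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical lock word; it would live inside the shared memory region. */
typedef struct {
    atomic_int locked;      /* 0 = free, 1 = held */
    atomic_int slow_paths;  /* times the fast path failed (illustration only) */
} shm_lock;

static void shm_lock_init(shm_lock *l)
{
    atomic_init(&l->locked, 0);
    atomic_init(&l->slow_paths, 0);
}

/* Fast path: one atomic exchange, no system call. Returns 1 on success. */
static int shm_lock_try(shm_lock *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

static void shm_lock_acquire(shm_lock *l)
{
    while (!shm_lock_try(l)) {
        /* Contended: a real implementation would block in the kernel here.
           sched_yield() is an assumed stand-in for that call. */
        atomic_fetch_add(&l->slow_paths, 1);
        sched_yield();
    }
}

static void shm_lock_release(shm_lock *l)
{
    /* A real implementation would also ask the kernel to wake a waiter. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

When the lock is free, shm_lock_acquire() returns after a single atomic
operation and never enters the kernel, which is the common case the text
describes.
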
Mmap Model
When two processes wish to create an area of shared memory, they must have
some way to name the piece of memory that they wish to share, and they must be
able to describe its size and initial contents. The system interface describing an
area of shared memory accomplishes all these goals by using files as the basis for
describing a shared memory segment. A process creates a shared memory segment
by using
    caddr_t addr = mmap(
        caddr_t addr,   /* base address */
        size_t len,     /* length of region */
        int prot,       /* protection of region */
        int flags,      /* mapping flags */
        int fd,         /* file to map */
        off_t offset);  /* offset to begin mapping */
to map the file referenced by descriptor fd starting at file offset offset into its
address space starting at addr and continuing for len bytes with access permission
prot. The flags parameter allows a process to specify whether it wants to make a
shared or private mapping. Changes made to a shared mapping are written back
to the file and are visible to other processes. Changes made to a private mapping
are not written back to the file and are not visible to other processes. Two
processes that wish to share a piece of memory request a shared mapping of the same
file into their address space. Thus, the existing and well-understood filesystem
name space is used to identify shared objects. The contents of the file are used as
the initial value of the memory segment. All changes made to the mapping are
reflected back into the contents of the file, so long-term state can be maintained in
the shared memory region, even across invocations of the sharing processes.
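
As a sketch of the rendezvous described above, the following C fragment uses
the modern POSIX form of mmap (void * rather than caddr_t). The function names,
the scratch-file template in /tmp, and the region size are assumptions made for
illustration. A parent and a child each open the same file by name and request
a shared mapping; the parent then observes what the child wrote.

```c
#define _DEFAULT_SOURCE  /* glibc: expose POSIX/BSD names under strict -std modes */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define REGION_LEN 4096

/* Open the named file and map its first REGION_LEN bytes shared. */
static char *map_shared(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, REGION_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    return p == MAP_FAILED ? NULL : (char *)p;
}

/* Returns 1 if the parent sees the child's write, 0 if not, -1 on setup error. */
int shared_mapping_demo(void)
{
    char path[] = "/tmp/shmdemoXXXXXX";   /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, REGION_LEN) < 0) {  /* file contents initialize the region */
        close(fd);
        unlink(path);
        return -1;
    }
    close(fd);

    char *mine = map_shared(path);
    if (mine == NULL) {
        unlink(path);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {                       /* child: map the same file by name */
        char *theirs = map_shared(path);
        if (theirs == NULL)
            _exit(1);
        strcpy(theirs, "hello from child");
        _exit(0);
    }

    int status = 0;
    int seen = pid > 0 && waitpid(pid, &status, 0) == pid &&
               WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
               strcmp(mine, "hello from child") == 0;
    munmap(mine, REGION_LEN);
    unlink(path);
    return seen ? 1 : 0;
}
```

No data pass through read or write here: the only system calls are those that
set up and tear down the mappings.
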
Some applications want to use shared memory purely as a short-term
interprocess-communication mechanism. They need an area of memory that is initially
zeroed and whose contents are abandoned when they are done using it. Such
processes want neither to pay the relatively high start-up cost associated with
paging in the contents of a file to initialize a shared memory segment, nor to pay
the shutdown costs of writing modified pages back to the file when they are done
with the memory. Although an alternative naming scheme was considered to provide
a rendezvous mechanism for such short-term shared memory, the designers
ultimately decided that all naming of memory objects should use the filesystem name
space. To provide an efficient mechanism for short-term shared memory, they
created a virtual-memory-resident filesystem for transient objects. The details of
the virtual-memory-resident filesystem are described in Section 8.4. Unless memory
is in high demand, files created in the virtual-memory-resident filesystem reside
entirely in memory. Thus, both the initial paging and later write-back costs are
eliminated. Typically, a virtual-memory-resident filesystem is mounted on /tmp.
Two processes wishing to create a transient area of shared memory create a file in
/tmp that they can then both map into their address space.
When a mapping is no longer needed, it can be removed using
    munmap(caddr_t addr, size_t len);
The munmap system call removes any mappings that exist in the address space,
starting at addr and continuing for len bytes. There are no constraints between
previous mappings and a later munmap. The specified range may be a subset of a
previous mmap or it may encompass an area that contains many mmap'ed files.
When a process exits, the system does an implied munmap over its entire address
space.
During its initial mapping, a process can set the protections on a page to allow
reading, writing, and/or execution. The process can change these protections later
by using
    mprotect(caddr_t addr, int len, int prot);
This feature can be used by debuggers when they are trying to track down a
memory-corruption bug. By disabling writing on the page containing the data
structure that is being corrupted, the debugger can trap all writes to the page and
verify that they are correct before allowing them to occur.
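
The write-trapping idea can be sketched inside a single process. The names
below are hypothetical, MAP_ANON is assumed to be available, and the handler
calls mprotect, which works on common systems although POSIX does not list it
as async-signal-safe. A real debugger would inspect the faulting write before
permitting it; this sketch only counts the fault and then lets the write go
ahead.

```c
#define _DEFAULT_SOURCE  /* glibc: expose MAP_ANON under strict -std modes */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t trapped_writes;  /* faults seen so far */
static uintptr_t watched_page;                /* base of the guarded page */
static size_t page_size;

/* Fault handler: count the write, then re-enable writing so it can proceed. */
static void on_write_fault(int sig, siginfo_t *info, void *context)
{
    (void)sig;
    (void)context;
    uintptr_t page = (uintptr_t)info->si_addr & ~((uintptr_t)page_size - 1);
    if (page != watched_page)
        _exit(2);                   /* an unrelated fault: give up */
    trapped_writes++;
    mprotect((void *)page, page_size, PROT_READ | PROT_WRITE);
}

/* Returns the number of writes that were trapped, or -1 on setup error. */
int count_trapped_writes(void)
{
    struct sigaction sa, old_segv, old_bus;
    volatile char *p;
    int n;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    watched_page = (uintptr_t)p;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);   /* some systems deliver SIGBUS instead */
    sigaction(SIGBUS, &sa, &old_bus);

    trapped_writes = 0;
    mprotect((void *)p, page_size, PROT_READ);  /* disable writing */
    p[0] = 'x';     /* faults; the handler re-enables writing and the store retries */
    p[1] = 'y';     /* page is writable again, so this store does not fault */

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    n = (int)trapped_writes;
    munmap((void *)p, page_size);
    return n;
}
```
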
Traditionally, programming for real-time systems has been done with specially
written operating systems. In the interests of reducing the costs of real-time
applications and of using the skills of the large body of UNIX programmers,
companies developing real-time applications have expressed increased interest in
using UNIX-based systems for writing these applications. Two fundamental
requirements of a real-time system are maximum guaranteed latencies and predictable
execution times. Predictable execution time is difficult to provide in a
virtual-memory-based system, since a page fault may occur at any point in the
execution of a program, resulting in a potentially large delay while the faulting
page is retrieved from the disk or network. To avoid paging delays, the system
allows a process to force its pages to be resident, and not paged out, by using
    mlock(caddr_t addr, size_t len);
As long as the process limits its accesses to the locked area of its address space,
it can be sure that it will not be delayed by page faults. To prevent a single
process from acquiring all the physical memory on the machine to the detriment of
all other processes, the system imposes a resource limit to control the amount of
memory that may be locked. Typically, this limit is set to no more than one-third
of the physical memory, and it may be set to zero by a system administrator who
does not want random processes to be able to monopolize system resources.
When a process has finished with its time-critical use of an mlock'ed region, it
can release the pages using
    munlock(caddr_t addr, size_t len);
After the munlock call, the pages in the specified address range are still accessible,
but they may be paged out if memory is needed and they are not accessed.
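
A hedged sketch of the locking sequence follows, using the modern POSIX
prototypes. The function name is hypothetical and MAP_ANON is assumed to be
available. Because the locked-memory resource limit may be small or zero, the
sketch treats a failed mlock as a normal outcome and reports it instead of
aborting.

```c
#define _DEFAULT_SOURCE  /* glibc: expose MAP_ANON under strict -std modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Allocate len bytes, try to lock them in memory, use them, and unlock.
 * Returns 1 if the region was locked, 0 if the lock was refused (for
 * example, because of the resource limit), and -1 on any other error.
 */
int touch_locked_buffer(size_t len)
{
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    int locked = (mlock(buf, len) == 0);

    memset(buf, 0xA5, len);         /* time-critical work would go here */

    if (locked)
        munlock(buf, len);          /* pages may now be paged out again */

    /* The region is still accessible after munlock. */
    int ok = ((unsigned char)buf[len - 1] == 0xA5);
    munmap(buf, len);
    return ok ? locked : -1;
}
```
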
The architecture of some multiprocessing machines does not provide consistency
between a high-speed cache local to a CPU and the machine's main memory.
For these machines, it may be necessary to flush the cache to main memory before
the changes made in that memory are visible to processes running on other CPUs.
A process does this synchronization using
    msync(caddr_t addr, int len);
For a region containing a mapped file, msync also writes back any modified pages
to the filesystem.
Shared Mapping
When multiple processes map the same file into their address space, the system
must ensure that all the processes view the same set of memory pages. As shown
in Section 5.4, each file that is being used actively by a client of the
virtual-memory system is represented by an object. Each mapping that a process has
to a piece of a file is described by a vm_map_entry structure. An example of two
processes mapping the same file into their address space is shown in Fig. 5.8. When
a page fault occurs in one of these processes, the process's vm_map_entry
references the object to find the appropriate page. Since all mappings reference the
same object, the processes will all get references to the same set of physical
memory, thus ensuring that changes made by one process will be visible in the address
spaces of the other processes as well.
Figure 5.8 Multiple mappings to a file. (Diagram labels: proc A, proc B, file object.)
A second organization arises when a process with a shared mapping does a
fork. Here, the kernel interposes a sharing map between the two processes and the
shared object, so that both processes' map entries reference this map, instead of
the object. A sharing map is identical in structure to an address map: It is a linked
list of map entries. The intent is that a sharing map, referenced by all processes
inheriting a shared memory region, will be the focus of map-related operations
that should affect all the processes. Sharing maps are useful in the creation of
shadow objects for copy-on-write operations because they affect part or all of the
shared region. Here, all sharing processes should use the same shadow object, so
that all will see modifications made to the region. Sharing maps are an artifact of
the virtual-memory code's early Mach origin; they do not work well in the 4.4BSD
environment because they work for only that memory shared by inheritance.
Shared mappings established with mmap do not use them. Hence, even if a sharing
map exists for a shared region, it does not necessarily reflect all processes
involved. The only effect that sharing maps have in 4.4BSD is to extend across
forks the delayed creation of shadow and copy objects. This delay does not offer a
significant advantage, and the small advantage is outweighed by the added amount
and complexity of code necessary to handle sharing maps. For this reason, sharing
maps probably will be eliminated from systems derived from 4.4BSD, as they
were from later versions of Mach.
Private Mapping
A process may request a private mapping of a file. A private mapping has two
main effects:
1. Changes made to the memory mapping the file are not reflected back into the
   mapped file.
2. Changes made to the memory mapping the file are not visible to other
   processes mapping the file.
An example of the use of a private mapping would be during program debugging.
The debugger will request a private mapping of the program text so that, when it
sets a breakpoint, the modification is not written back into the executable stored
on the disk and is not visible to the other (presumably nondebugging) processes
executing the program.
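
The first effect can be checked with a short C sketch written against the
modern POSIX interface. The function name and the scratch-file template are
hypothetical. The fragment modifies a private mapping and then rereads the file
through its descriptor to confirm that the stored contents did not change.

```c
#define _DEFAULT_SOURCE  /* glibc: expose POSIX/BSD names under strict -std modes */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the file is unchanged after a private write, 0 if it changed,
   and -1 on setup error. */
int private_mapping_demo(void)
{
    char path[] = "/tmp/privdemoXXXXXX";  /* hypothetical scratch file */
    const char original[] = "original";
    char check[sizeof original] = {0};

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                         /* file lives until fd is closed */
    if (write(fd, original, sizeof original) != (ssize_t)sizeof original) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, sizeof original, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    p[0] = 'O';     /* write fault: the kernel gives this process its own copy */

    /* Reread the file itself; the private change must not appear there. */
    ssize_t n = pread(fd, check, sizeof check, 0);
    int unchanged = n == (ssize_t)sizeof check &&
                    memcmp(check, original, sizeof original) == 0 &&
                    p[0] == 'O';

    munmap(p, sizeof original);
    close(fd);
    return unchanged ? 1 : 0;
}
```
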
The kernel uses shadow objects to prevent changes made by a process from
being reflected back to the underlying object. The use of a shadow object is
shown in Fig. 5.9. When the initial private mapping is requested, the file object is
mapped into the requesting-process address space, with copy-on-write semantics.
If the process attempts to write a page of the object, a page fault occurs and traps
into the kernel. The kernel makes a copy of the page to be modified and hangs it
from the shadow object. In this example, process A has modified page 0 of the
file object. The kernel has copied page 0 to the shadow object that is being used
to provide the private mapping for process A.
Figure 5.9 Use of a shadow object for a private mapping. (Diagram labels: proc A,
shadow object, file object, mod by A.)
If free memory is limited, it would be better simply to move the modified
page from the file object to the shadow object. The move would reduce the
immediate demand on the free memory, because a new page would not have to be
allocated. The drawback to this optimization is that, if there is a later access to
the file object by some other process, the kernel will have to allocate a new page.
The kernel will also have to pay the cost of doing an I/O operation to reload the
page contents. In 4.4BSD, the virtual-memory system always copies the page; it
never moves it.
When a page fault for the private mapping occurs, the kernel traverses the list
of objects headed by the vm_map_entry, looking for the faulted page. The first
object in the chain that has the desired page is the one that is used. If the search
gets to the final object on the chain without finding the desired page, then the page
is requested from that final object. Thus, pages on a shadow object will be used in
preference to the same pages in the file object itself. The details of page-fault
handling are given in Section 5.11.
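
The chain traversal can be modeled in a few lines of C. This is a simplified
user-level model, not the kernel's data structures: struct vm_obj, its fields,
and lookup_page are hypothetical names, and each page is represented by a
string pointer instead of a physical page.

```c
#include <stddef.h>

#define NPAGES 4

/* Hypothetical, simplified object: a few page slots and a backing object. */
struct vm_obj {
    const char *pages[NPAGES];  /* NULL = page not held by this object */
    struct vm_obj *backing;     /* next object in the chain; NULL at the end */
};

/*
 * Walk the chain starting at the object referenced by the map entry.
 * The first object holding the page wins; if none does, the page is
 * requested from the final object (here: whatever its slot contains).
 */
const char *lookup_page(const struct vm_obj *first, size_t idx)
{
    const struct vm_obj *o;

    if (first == NULL || idx >= NPAGES)
        return NULL;
    for (o = first; o->backing != NULL; o = o->backing)
        if (o->pages[idx] != NULL)
            return o->pages[idx];
    return o->pages[idx];       /* final object on the chain */
}
```

With a shadow object that holds only page 0 in front of a file object, a lookup
of page 0 returns the shadow copy, while a lookup of page 1 falls through to
the file object.
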
When a process removes a mapping from its address space (either explicitly
from an munmap request or implicitly when the address space is freed on process
exit), pages held by its shadow object are not written back to the file object. The
shadow-object pages are simply placed back on the memory free list for immediate
reuse.
When a process forks, it does not want changes to its private mappings to be
visible in its child; similarly, the child does not want its changes to be visible in
its parent. The result is that each process needs to create a shadow object if it
continues to make changes in a private mapping. When process A in Fig. 5.9 forks,
a set of shadow object chains is created, as shown in Fig. 5.10. In this
example, process A modified page 0 before it forked, then later modified page 1.
Its modified version of page 1 hangs off its new shadow object, so that those
modifications will not be visible to its child. Similarly, its child has modified
page 0. If the child were to modify page 0 in the original shadow object, that
change would be visible in its parent. Thus, the child process must make a new copy
of page 0 in its own shadow object.
Figure 5.10 Shadow-object chains. (Diagram labels: proc A, vm_map_entry, object 3
(mod by parent); proc A child, object 2 (mod by child); object 1 (mod by A before
fork); file object.)
If the system runs short of memory, the kernel may need to reclaim inactive
memory held in a shadow object. The kernel assigns to the swap pager the task of
backing the shadow object. The swap pager creates a swap map that is large
enough to describe the entire contents of the shadow object. It then allocates
enough swap space to hold the requested shadow pages and writes them to that
area. These pages can then be freed for other uses. If a later page fault requests a
swapped-out page, then a new page of memory is allocated and its contents are
reloaded with an I/O from the swap area.
Collapsing of Shadow Chains
When a process with a private mapping removes that mapping either explicitly
with an munmap system call or implicitly by exiting, its parent or child process
may be left with a chain of shadow objects. Usually, these chains of shadow
objects can be collapsed into a single shadow object, often freeing up memory as
part of the collapse. Consider what happens when process A exits in Fig. 5.10.
First, shadow object 3 can be freed, along with its associated page of memory.
This deallocation leaves shadow objects 1 and 2 in a chain with no intervening
references. Thus, these two objects can be collapsed into a single shadow object.
Since they both contain a copy of page 0, and since only the page 0 in shadow
object 2 can be accessed by the remaining child process, the page 0 in shadow
object 1 can be freed, along with shadow object 1 itself.
If the child of process A were to exit, then shadow object 2 and the associated
page of memory could be freed. Shadow objects 1 and 3 would then be in a chain
that would be eligible for collapse. Here, there are no common pages, so the
remaining collapsed shadow object would contain page 0 from shadow object 1,
as well as page 1 from shadow object 3. A limitation of the implementation is that
it cannot collapse two objects if either of them has allocated a pager. This
limitation is serious, since pagers are allocated when the system begins running
short of memory, precisely the time when reclaiming of memory from collapsed objects
is most necessary.
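
The two collapse cases above can be modeled with a small C sketch. It is a toy
model under stated assumptions, not the 4.4BSD code: struct vm_obj, its fields,
and collapse are hypothetical names, pages are represented by integer tags, and
reference counting is reduced to a single counter.

```c
#include <stddef.h>

#define NPAGES  4
#define NO_PAGE (-1)

/* Hypothetical, simplified shadow object. */
struct vm_obj {
    int page[NPAGES];       /* tag of the page held in each slot, or NO_PAGE */
    int refs;               /* number of references to this object */
    int has_pager;          /* nonzero once a pager has been allocated */
    struct vm_obj *backing; /* object behind this one; NULL for the file object */
};

/*
 * Merge the shadow object directly behind `front` into `front`.
 * Returns the number of pages freed, or -1 if the collapse is not allowed.
 */
int collapse(struct vm_obj *front)
{
    struct vm_obj *back = front->backing;
    int freed = 0;
    int i;

    if (back == NULL || back->backing == NULL)
        return -1;          /* nothing behind, or back is the file object */
    if (back->refs != 1)
        return -1;          /* some other chain still references back */
    if (front->has_pager || back->has_pager)
        return -1;          /* the limitation described in the text */

    for (i = 0; i < NPAGES; i++) {
        if (back->page[i] == NO_PAGE)
            continue;
        if (front->page[i] != NO_PAGE)
            freed++;                        /* hidden by front's copy: free it */
        else
            front->page[i] = back->page[i]; /* still needed: move it forward */
        back->page[i] = NO_PAGE;
    }
    front->backing = back->backing;         /* bypass the emptied object */
    back->refs = 0;
    return freed;
}
```

Collapsing object 2 over object 1 frees the duplicate page 0, whereas collapsing
object 3 over object 1 frees nothing and leaves a single object holding both
page 0 and page 1.
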
Private Snapshots
When a process makes read accesses to a private mapping of an object, it
continues to see changes made to that object by other processes that are writing to
the object through the filesystem or that have a shared mapping to the object. When
a process makes a write access to a private mapping of an object, a snapshot of the
corresponding page of the object is made and is stored in the shadow object, and
the modification is made to that snapshot. Thus, further changes to that page
made by other processes that are writing to the page through the filesystem or that
have a shared mapping to the object are no longer visible for that page. However,
changes to unmodified pages of the object continue to be visible. This mix of
changing and unchanging parts of the file can be confusing.
To provide a more consistent view of a file, a process may want to take a
snapshot of the file at the time that it is initially privately mapped. A process
takes such a snapshot by using a copy object, as shown in Fig. 5.11. In
this example, process B has a shared mapping to the file object, whereas process A
has a private mapping. Modifications made by process B will be reflected in the
file, and hence will be visible to any other process (such as process A) that is
mapping that file. To avoid seeing the modifications made by process B after
process B has done its mapping, process A interposes a copy object between itself
and the file object. At the same time, it changes the protections on the file object
to be copy-on-write. Thereafter, when process B tries to modify the file object, it
will generate a page fault. The page-fault handler will save a copy of the unmodified
page in the copy object, then will allow process B to write the original page. If
process A later tries to access one of the pages that process B has modified, it will
get the page that was saved in the copy object, instead of getting the version that
process B changed.
In 4.4BSD, private snapshots work correctly only if all processes modifying
the file do so through the virtual-memory interface. For example, in Fig. 5.11,
assume that a third process C writes page 2 of the file using write before A or B
reference page 2. Now, even though A has made a snapshot of the file, it will see
the modified version of page 2, since the virtual-memory system has no knowledge
that page 2 was written. This behavior is an unwelcome side effect of the
separate virtual memory and filesystem caches; it would be eliminated if the two
caches were integrated.
Figure 5.11 Use of a copy object. (Diagram labels: proc A, vm_map_entry, shadow
object, copy object, mod by A; proc B, vm_map_entry, file object, mod by B.)
Most non-BSD systems that provide the mmap interface do not provide
copy-object semantics. Thus, 4.4BSD does not provide copy semantics by default; such
semantics are provided only when they are requested explicitly. It is debatable
whether the copy semantics are worth providing at all, because a process can
obtain them trivially by reading the file in a single request into a buffer in the
process address space. The added complexity and overhead of copy objects may well
exceed the value of providing copy semantics in the mmap interface.
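
The alternative mentioned above, copying the file into a buffer, might look
like the following C sketch. The function name snapshot_file is hypothetical.
Because read may return fewer bytes than requested, the sketch loops until the
whole file has been copied, so the result is a consistent snapshot only if no
other process writes the file while the loop runs.

```c
#define _DEFAULT_SOURCE  /* glibc: expose POSIX/BSD names under strict -std modes */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Read the whole file into a malloc'ed buffer that later writers cannot
 * change. Stores the length in *len_out. Returns NULL on error; the caller
 * frees the buffer.
 */
char *snapshot_file(const char *path, size_t *len_out)
{
    struct stat st;
    size_t len, got = 0;
    char *buf;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    len = (size_t)st.st_size;
    buf = malloc(len ? len : 1);
    if (buf == NULL) {
        close(fd);
        return NULL;
    }
    while (got < len) {                   /* read may return a short count */
        ssize_t n = read(fd, buf + got, len - got);
        if (n <= 0)
            break;
        got += (size_t)n;
    }
    close(fd);
    if (got != len) {
        free(buf);
        return NULL;
    }
    *len_out = len;
    return buf;
}
```
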
5.6 Creation of a New Process
Processes are created with a fork system call. The fork is usually followed shortly
thereafter by an exec system call that overlays the virtual address space of the
child process with the contents of an executable image that resides in the
filesystem. The process then executes until it terminates by exiting, either
voluntarily or involuntarily, by receiving a signal. In Sections 5.6 to 5.9, we trace
the management of the memory resources used at each step in this cycle.
A fork system call duplicates the address space of an existing process, creating
an identical child process. Fork is the only way that new processes are created
in 4.4BSD (except for its variant, vfork, which is described in the last subsection of
this section). Fork duplicates all the resources of the original process, and copies
that process's address space.
The virtual-memory resources of the process that must be allocated for the
child include the process structure and its associated substructures, and the user
area that includes both the user structure and the kernel stack. In addition, the
kernel must reserve storage (either memory, filesystem space, or swap space) used to
back the process. The general outline of the implementation of a fork is as
follows:
• Reserve virtual address space for the child process.
• Allocate a process entry for the child process, and fill it in.
• Copy to the child the parent's process group, credentials, file descriptors, limits,
  and signal actions.
• Allocate a new user area, copying the current one to initialize it.
• Allocate a vmspace structure.
• Duplicate the address space, by creating copies of the parent vm_map_entry
  structures marked copy-on-write.
• Arrange for the child process to return 0, to distinguish its return value from the
  new PID that is returned by the parent process.
The allocation and initialization of the process structure, and the arrangement
of the return value, were covered in Chapter 4. The remainder of this section
discusses the other steps involved in duplicating a process.
Reserving Kernel Resources
The first resource to be reserved when an address space is duplicated is the
required virtual address space. To avoid running out of memory resources, the
kernel must ensure that it does not promise to provide more virtual memory than it
is able to deliver. The total virtual memory that can be provided by the system is
limited to the amount of physical memory available for paging plus the amount of
swap space that is provided. A few pages are held in reserve to stage I/O between
the swap area and main memory.
The reason for this restriction is to ensure that processes get synchronous
notification of memory limitations. Specifically, a process should get an error
back from a system call (such as sbrk, fork, or mmap) if there are insufficient
resources to allocate the needed virtual memory. If the kernel promises more
virtual memory than it can support, it can deadlock trying to service a page fault.
Trouble arises when it has no free pages to service the fault and no available swap
space to save an active page. Here, the kernel has no choice but to send a
segmentation-fault signal to the process unfortunate enough to be page faulting. Such
asynchronous notification of insufficient memory resources is unacceptable.
Excluded from this limit are those parts of the address space that are mapped
read-only, such as the program text. Any pages that are being used for a read-only
part of the address space can be reclaimed for another use without being saved
because their contents can be refilled from the original source. Also excluded
from this limit are parts of the address space that map shared files. The kernel can
reclaim any pages that are being used for a shared mapping after writing their
contents back to the filesystem from which they are mapped. Here, the filesystem is
being used as an extension of the swap area. Finally, any piece of memory that is
used by more than one process (such as an area of anonymous memory being
shared by several processes) needs to be counted only once toward the
virtual-memory limit.
The limit on the amount of virtual address space that can be allocated causes
problems for applications that want to allocate a large piece of address space, but
want to use the piece only sparsely. For example, a process may wish to make a
private mapping of a large database from which it will access only a small part.
Because the kernel has no way to guarantee that the access will be sparse, it takes
the pessimistic view that the entire file will be modified and denies the request.
One extension that many BSD-derived systems have made to the mmap system call
is to add a flag that tells the kernel that the process is prepared to accept
asynchronous faults in the mapping. Such a mapping would be permitted to use up to
the amount of virtual memory that had not been promised to other processes. If
the process then modifies more of the file than this available memory, or if the
limit is reduced by other processes allocating promised memory, the kernel can
then send a segmentation-fault signal to the process. On receiving the signal, the
process must munmap an unneeded part of the file to release resources back to the
system. The process must ensure that the code, stack, and data structures needed
to handle the segmentation-fault signal do not reside in the part of the address
space that is subject to such faults.
Tracking the outstanding virtual memory accurately is a complex task. The
4.4BSD system makes no effort to calculate the outstanding-memory load and can
be made to promise more than it can deliver. When memory resources run out, it
either picks a process to kill or simply hangs. An important future enhancement is
to track the amount of virtual memory being used by the processes in the system.
Duplication of the User Address Space
The next step in fork is to allocate and initialize a new process structure. This
operation must be done before the address space of the current process is
duplicated because it records state in the process structure. From the time that the
process structure is allocated until all the needed resources are allocated, the
parent process is locked against swapping to avoid deadlock. The child is in an
inconsistent state and cannot yet run or be swapped, so the parent is needed to
complete the copy of its address space. To ensure that the child process is ignored
by the scheduler, the kernel sets the process's state to SIDL during the entire fork
procedure.
Historically, the fork system call operated by copying the entire address space
of the parent process. When large processes fork, copying the entire user address
space is expensive. All the pages that are on secondary storage must be read back
into memory to be copied. If there is not enough free memory for both complete
copies of the process, this memory shortage will cause the system to begin paging
to create enough memory to do the copy (see Section 5.12). The copy operation
may result in parts of the parent and child processes being paged out, as well as
the paging out of parts of unrelated processes.
The technique used by 4.4BSD to create processes without this overhead is
called copy-on-write. Rather than copy each page of a parent process, both the
child and parent processes resulting from a fork are given references to the same
physical pages. The page tables are changed to prevent either process from
modifying a shared page. Instead, when a process attempts to modify a page, the
kernel is entered with a protection fault. On discovering that the fault was caused by
an attempt to modify a shared page, the kernel simply copies the page and changes
the protection field for the page to allow modification once again. Only pages
modified by one of the processes need to be copied. Because processes that fork
typically overlay the child process with a new image with exec shortly thereafter,
this technique significantly improves the performance of fork.
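
The effect of copy-on-write that a program can observe is that the parent and
the child each see only their own modifications. The C sketch below checks that
behavior from user level; it cannot show the page sharing itself, which is
internal to the kernel. The function name is hypothetical and MAP_ANON is
assumed to be available.

```c
#define _DEFAULT_SOURCE  /* glibc: expose MAP_ANON under strict -std modes */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the child's write stayed private to the child, 0 if the
   parent saw it, and -1 on setup error. */
int fork_isolation_demo(void)
{
    int status = 0;
    int isolated;
    pid_t pid;
    int *value = mmap(NULL, sizeof *value, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANON, -1, 0);

    if (value == MAP_FAILED)
        return -1;
    *value = 1;                     /* page is now resident in the parent */

    pid = fork();
    if (pid < 0) {
        munmap(value, sizeof *value);
        return -1;
    }
    if (pid == 0) {
        *value = 2;                 /* write fault: the child gets its own copy */
        _exit(*value == 2 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) != pid) {
        munmap(value, sizeof *value);
        return -1;
    }
    /* The parent's page must still hold the value written before the fork. */
    isolated = WIFEXITED(status) && WEXITSTATUS(status) == 0 && *value == 1;
    munmap(value, sizeof *value);
    return isolated ? 1 : 0;
}
```
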
The next step in fork is to traverse the list of vm_map_entry structures in the
parent and to create a corresponding entry in the child. Each entry must be
analyzed and the appropriate action taken:
• If the entry maps a read-only region, the child can take a reference to it.
• If the entry maps a privately mapped region (such as the data area or stack), the
  child must create a copy-on-write mapping of the region. The parent must be
  converted to a copy-on-write mapping of the region. If either process later tries
  to write the region, it will create a shadow map to hold the modified pages.
• If the entry maps a shared region, a sharing map is created referencing the shared
  object, and both map entries are set to reference this map.
Map entries for a process are never merged (simplified). Only entries for the
kernel map itself can be merged. The kernel-map entries need to be simplified so that
excess growth is avoided. It might be worthwhile to do such a merge of the map
entries for a process when it forks, especially for large or long-running processes.
With the virtual-memory resources allocated, the system sets up the kernel-
and user-mode state of the new process, including the hardware memory-management
registers and the user area. It then clears the SIDL flag and places the
process on the run queue; the new process can then begin execution.
Creation of a New Process Without Copying
When a process (such as a shell) wishes to start another program, it will generally
fork, do a few simple operations such as redirecting I/O descriptors and changing
signal actions, and then start the new program with an exec. In the meantime, the
parent shell suspends itself with wait until the new program completes. For such
operations, it is not necessary for both parent and child to run simultaneously, and
therefore only one copy of the address space is required. This frequently
occurring set of system calls led to the implementation of the vfork system call. In
4.4BSD, the vfork system call still exists, but it is implemented using the same
copy-on-write algorithm described in this section. Its only difference is that it
ensures that the parent does not run until the child has done either an exec or an
exit.
The historic implementation of vfork will always be more efficient than the
copy-on-write implementation because the kernel avoids copying the address
space for the child. Instead, the kernel simply passes the parent's address space to
the child and suspends the parent. The child process needs to allocate only new
process and user structures, receiving everything else from the parent. The child
process returns from the vfork system call with the parent still suspended. The
child does the usual activities in preparation for starting a new program, then calls
exec. Now the address space is passed back to the parent process, rather than
being abandoned, as in a normal exec. Alternatively, if the child process
encounters an error and is unable to execute the new program, it will exit. Again, the
address space is passed back to the parent, instead of being abandoned.
With vfork, the entries describing the address space do not need to be copied,
and the page-table entries do not need to be marked and then cleared of
copy-on-write. Vfork is likely to remain more efficient than copy-on-write or other
schemes that must duplicate the process's virtual address space. The architectural
quirk of the vfork call is that the child process may modify the contents and even
the size of the parent's address space while the child has control. Modification of
the parent's address space is bad programming practice. Some programs that took
advantage of this quirk broke when they were ported to 4.4BSD, which
implemented vfork using copy-on-write.
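The fork, exec, and wait sequence that motivated vfork can be sketched in a few lines. This is only an illustration in Python, whose os module exposes fork rather than vfork; the function name spawn_and_wait and the exit code 127 for a failed exec are assumptions of the sketch, not part of 4.4BSD.

```python
import os
import sys

def spawn_and_wait(argv):
    """Fork, exec argv in the child, and return the child's exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: a shell would redirect descriptors or change signal
        # actions here, then replace its image with the new program.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Reached only if the exec failed; leave without running
            # any cleanup inherited from the parent.
            os._exit(127)
    # Parent: suspend until the child completes.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1

if __name__ == "__main__":
    print(spawn_and_wait([sys.executable, "-c", "raise SystemExit(3)"]))
```

Because the parent does nothing between the fork and the wait, a second copy of the address space is never needed, which is the observation the text builds on.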
5.7
Execution of a File
The exec system call was described in Sections 2.4 and 3.1; it replaces the address
space of a process with the contents of a new program obtained from an
executable file. During an exec, the target executable image is validated, then the
arguments and environment are copied from the current process image into a
temporary area of pageable kernel virtual memory.
To do an exec, the system must allocate resources to hold the new contents of
the virtual address space, set up the mapping for this address space to reference the
new image, and release the resources being used for the existing virtual memory.
The first step is to reserve memory resources for the new executable image.
The algorithm for the calculation of the amount of virtual address space that must
be reserved was described in Section 5.6. For an executable that is not being
debugged (and hence will not have its text space modified), a space reservation
needs to be made for only the data and stack space of the new executable. Exec
does this reservation without first releasing the currently assigned space, because
the system must be able to continue running the old executable until it is sure that
it will be able to run the new one. If the system released the current space and the
memory reservation failed, the exec would be unable to return to the original
process. Once the reservation is made, the address space and virtual-memory
resources of the current process are then freed as though the process were exiting;
this mechanism is described in Section 5.9.
Now, the process has only a user structure and kernel stack. The kernel now
allocates a new vmspace structure and creates the list of four vm_map_entry
structures:
1. A copy-on-write, fill-from-file entry maps the text segment. A copy-on-write
mapping is used, rather than a read-only one, to allow active text segments to
have debugging breakpoints set without affecting other users of the binary. In
4.4BSD, some legacy code in the kernel debugging interface disallows the
setting of breakpoints in binaries being used by more than one process. This
legacy code prevents the use of the copy-on-write feature.
2. A private (copy-on-write), fill-from-file entry maps the initialized data
segment.
3. An anonymous zero-fill-on-demand entry maps the uninitialized data segment.
4. An anonymous zero-fill-on-demand entry maps the stack segment.
No further operations are needed to create a new address space during an exec
system call; the remainder of the work comprises copying the arguments and
environment out to the top of the new stack. Initial values are set for the registers: The
program counter is set to the entry point, and the stack pointer is set to point to the
argument vector. The new process image is then ready to run.
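The four entries created by exec can be modeled with a small data structure. The following is a rough sketch, not the kernel's layout code: the MapEntry fields, the 4096-byte page size, and the placement of the text segment at address 0 are all assumptions made for illustration.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for the illustration

@dataclass
class MapEntry:
    name: str
    start: int
    end: int
    copy_on_write: bool
    fill: str  # "file" (fill from file) or "zero" (zero fill on demand)

def round_page(n):
    """Round n up to a multiple of the page size."""
    return (n + PAGE_SIZE - 1) // PAGE_SIZE * PAGE_SIZE

def build_exec_map(text_size, data_size, bss_size, stack_top, max_stack):
    """Return the four entries for a new image, in address order."""
    text_end = round_page(text_size)
    data_end = text_end + round_page(data_size)
    bss_end = data_end + round_page(bss_size)
    return [
        MapEntry("text", 0, text_end, True, "file"),
        MapEntry("data", text_end, data_end, True, "file"),
        MapEntry("bss", data_end, bss_end, False, "zero"),
        MapEntry("stack", stack_top - max_stack, stack_top, False, "zero"),
    ]
```

Only the entries are built here; no pages are touched, which mirrors the lazy allocation the text describes.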
5.8
Process Manipulation of Its Address Space
Once a process begins execution, it has several ways to manipulate its address
space. The system has always allowed processes to expand their uninitialized
data area (usually done with the malloc() library routine). The stack is grown on
an as-needed basis. The 4.4BSD system also allows a process to map files and
devices into arbitrary parts of its address space, and to change the protection of
various parts of its address space, as described in Section 5.5. This section
describes how these address-space manipulations are done.
Change of Process Size
A process can change its size during execution by explicitly requesting more data
space with the sbrk system call. Also, the stack segment will be expanded
automatically if a protection fault is encountered because of an attempt to grow the
stack below the end of the stack region. In either case, the size of the process
address space must be changed. The size of the request is always rounded up to a
multiple of page size. New pages are marked fill-with-zeros, as there are no
contents initially associated with new sections of the address space.
The first step of enlarging a process's size is to check whether the new size
would violate the size limit for the process segment involved. If the new size is in
range, the following steps are taken to enlarge the data area:
1. Verify that the virtual-memory resources are available.
2. Verify that the address space of the requested size immediately following the
current end of the data area is not already mapped.
3. If the existing vm_map_entry is not constrained to be a fixed size because of
the allocation of swap space, increment its ending address by the requested
size. If the entry has had one or more of its pages written to swap space, then
the current implementation of the swap pager will not permit it to grow.
Consequently, a new vm_map_entry must be created with a starting address
immediately following the end of the previous fixed-sized entry. Its ending address
is calculated to give it the size of the request. Until a pageout forces the
allocation of a fixed-sized swap partition of this new entry, the latter will be able
to continue growing.
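The growth steps above can be sketched as follows. This is a simplified model under stated assumptions: the Entry class, the swap_fixed flag, and the mapped_elsewhere argument are hypothetical names, the page size is assumed to be 4096 bytes, and step 1 (checking that virtual-memory resources are available) is omitted.

```python
PAGE_SIZE = 4096  # assumed page size

class Entry:
    def __init__(self, start, end, swap_fixed=False):
        self.start, self.end = start, end
        self.swap_fixed = swap_fixed  # True once pages were written to swap

def grow_data(entries, request, limit, mapped_elsewhere=()):
    """Grow the data area (entries, in address order) by request bytes."""
    size = -(-request // PAGE_SIZE) * PAGE_SIZE    # round up to whole pages
    last = entries[-1]
    new_end = last.end + size
    if new_end - entries[0].start > limit:
        raise MemoryError("size limit exceeded")
    # Step 2: the range following the data area must not be mapped.
    for s, e in mapped_elsewhere:
        if s < new_end and last.end < e:
            raise MemoryError("range already mapped")
    if last.swap_fixed:
        # Step 3, fixed-size entry: start a new entry right after it.
        entries.append(Entry(last.end, new_end))
    else:
        # Step 3, growable entry: extend its ending address in place.
        last.end = new_end
    return new_end
```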
If the change is to reduce the size of the data segment, the operation is easy: Any
memory allocated to the pages that will no longer be part of the address space is
freed. The ending address of the vm_map_entry is reduced by the size. If the
requested size reduction is bigger than the range defined by the vm_map_entry,
the entire entry is freed, and the remaining reduction is applied to the
vm_map_entry that precedes it. This algorithm is applied until the entire
reduction has been made. Future references to these addresses will result in protection
faults, as access is disallowed when the address range has been deallocated.
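The shrink algorithm walks backward over the entries of the data area. A minimal sketch, assuming entries are represented as [start, end] pairs and that a reduction exactly equal to an entry's length also frees that entry:

```python
def shrink_data(entries, reduction):
    """Shrink the data area by reduction bytes (a page multiple).

    entries is a list of [start, end] pairs in address order; the list is
    modified in place and the freed address ranges are returned.
    """
    freed = []
    while reduction > 0 and entries:
        start, end = entries[-1]
        length = end - start
        if reduction >= length:
            # The whole entry goes; carry the rest to the preceding entry.
            freed.append((start, end))
            entries.pop()
            reduction -= length
        else:
            # Only the tail of this entry is released.
            freed.append((end - reduction, end))
            entries[-1][1] = end - reduction
            reduction = 0
    return freed
```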
The allocation of the stack segment is considerably different. At exec time,
the stack is allocated at its maximum possible size. Due to the lazy allocation of
virtual-memory resources, this operation involves allocating only sufficient
address space. Physical memory and swap space are allocated on demand as the
stack grows. Hence, only step 3 of the data-growth algorithm applies to
stack-growth-related page faults. An additional step is required to check that the
desired growth does not exceed the dynamically changeable stack-size limit.
File Mapping
The mmap system call requests that a file be mapped into an address space. The
system call may request either that the mapping be done at a particular address or
that the kernel pick an unused area. If the request is for a particular address
range, the kernel first checks to see whether that part of the address space is
already in use. If it is in use, the kernel first does an munmap of the existing
mapping, then proceeds with the new mapping.
The kernel implements the munmap system call by traversing the list of
vm_map_entry structures for the process. The various overlap conditions to
consider are shown in Fig. 5.12. The five cases are as follows:
Figure 5.12 Five types of overlap that the kernel must consider when adding a new
address mapping.
1. The new mapping exactly overlaps an existing mapping. The old mapping is
deallocated as described in Section 5.9. The new mapping is created in its
place as described in the paragraph following this list.
2. The new mapping is a subset of the existing mapping. The existing mapping
is split into three pieces (two pieces if the new mapping begins at the
beginning or ends at the end of the existing mapping). The existing vm_map_entry
structure is augmented with one or two additional vm_map_entry structures:
one mapping the remaining part of the existing mapping before the new
mapping, and one mapping the remaining part of the existing mapping following
the new mapping. Its overlapped piece is replaced by the new mapping, as
described in the paragraph following this list.
3. The new mapping is a superset of an existing mapping. The old mapping is
deallocated as described in Section 5.9, and a new mapping is created as
described in the paragraph following this list.
4. The new mapping starts part way into and extends past the end of an existing
mapping. The existing mapping has its length reduced by the size of the
unmapped area. Its overlapped piece is replaced by the new mapping, as
described in the paragraph following this list.
5. The new mapping extends into the beginning of an existing mapping. The
existing mapping has its starting address incremented and its length reduced
by the size of the covered area. Its overlapped piece is replaced by the new
mapping, as described in the paragraph following this list.
In addition to the five basic types of overlap listed, a new mapping request may
span several existing mappings. Specifically, a new request may be composed of
zero or one of type 4, zero to many of type 3, and zero or one of type 5. When a
mapping is shortened, any shadow or copy pages associated with it are released, as
they are no longer needed.
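The five overlap cases reduce to a handful of comparisons on range endpoints. The sketch below classifies a new half-open range [start, end) against existing ones; the function names and the use of 0 for "no overlap" are assumptions of the illustration, not kernel interfaces.

```python
def overlap_type(new, old):
    """Classify how range new overlaps range old (cases 1 to 5).

    Ranges are half-open (start, end) pairs; 0 means no overlap.
    """
    ns, ne = new
    os_, oe = old
    if ne <= os_ or oe <= ns:
        return 0
    if ns == os_ and ne == oe:
        return 1          # exact overlap
    if os_ <= ns and ne <= oe:
        return 2          # new is a subset of old
    if ns <= os_ and oe <= ne:
        return 3          # new is a superset of old
    if os_ < ns:
        return 4          # new starts inside old and extends past its end
    return 5              # new extends into the beginning of old

def classify(new, existing):
    """Return the overlap types met while walking existing in order."""
    types = (overlap_type(new, old) for old in existing)
    return [t for t in types if t]
```

For a request that spans several mappings, classify yields at most one 4, then any number of 3s, then at most one 5, matching the composition described above.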
Once the address space is zero filled, the kernel creates a new vm_map_entry
to describe the new address range. If the object being mapped is already being
mapped by another process, the new entry gets a reference to the existing object.
This reference is obtained in the same way, as described in Section 5.6, when a
new process is being created and needs to map each of the regions in its parent. If
this request is the first mapping of an object, then the kernel checks the object
cache to see whether a previous instance of the object still exists. If one does, then
that object is activated and referenced by the new vm_map_entry.
If the object is not found, then a new object must be created. First, a new
object is allocated. Next, the kernel must determine what is being mapped, so that
it can associate the correct pager with the object (pagers are described in Section
5.10). Once the object and its pager have been set up, the new vm_map_entry can
be set to reference the object.
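The lookup order described in the last two paragraphs (objects already mapped, then the object cache, then a fresh object with a suitable pager) can be sketched as below. The dictionaries active and cache, the VMObject class, and the choose_pager helper are hypothetical stand-ins for kernel state.

```python
class VMObject:
    def __init__(self, key, pager):
        self.key, self.pager, self.refs = key, pager, 0

def choose_pager(kind):
    # Hypothetical mapping from the kind of thing mapped to a pager name.
    return {"file": "vnode", "device": "device"}.get(kind, "swap")

def map_object(key, kind, active, cache):
    """Return the object backing a new map entry, taking a reference."""
    obj = active.get(key)              # already mapped by some process?
    if obj is None:
        obj = cache.pop(key, None)     # previous instance still cached?
        if obj is None:
            obj = VMObject(key, choose_pager(kind))
        active[key] = obj              # object is now active
    obj.refs += 1
    return obj
```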
Change of Protection
A process may change the protections associated with a region of its virtual
memory by using the mprotect system call. The size of the region to be protected may
be as small as a single page. Because the kernel depends on the hardware to
enforce the access permissions, the granularity of the protection is limited by the
underlying hardware. A region may be set for any combination of read, write, and
execute permissions. Many architectures do not distinguish between read and
execute permissions; on such architectures, the execute permission is treated as
read permission.
The kernel implements the mprotect system call by finding the existing
vm_map_entry structure or structures that cover the region specified by the call. If
the existing permissions are the same as the request, then no further action is
required. Otherwise, the new permissions are compared to the maximum
protection value associated with the vm_map_entry. The maximum value is set at mmap
time and reflects the maximum value allowed by the underlying file. If the new
permissions are valid, one or more new vm_map_entry structures have to be set up
to describe the new protections. The set of overlap conditions that must be
handled is similar to that described in the previous subsection. Instead of replacing
the object underlying the new vm_map_entry structures, these vm_map_entry
structures still reference the same object; the difference is that they grant different
access permissions to it.
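A protection change can be modeled as splitting entries at the region boundaries while keeping the object reference. This is a sketch under assumptions: entries are plain dictionaries, the permission bits READ, WRITE, and EXEC are illustrative values, and a new list is returned rather than editing the map in place.

```python
READ, WRITE, EXEC = 1, 2, 4  # illustrative permission bits

def mprotect(entries, start, end, prot):
    """Return a new entry list with [start, end) set to prot.

    Each entry is a dict with start, end, prot, max_prot, and obj.
    """
    out = []
    for e in entries:
        if e["end"] <= start or end <= e["start"] or e["prot"] == prot:
            out.append(e)                      # not covered, or no change
            continue
        if prot & ~e["max_prot"]:
            raise PermissionError("exceeds maximum protection")
        lo, hi = max(start, e["start"]), min(end, e["end"])
        if e["start"] < lo:
            out.append({**e, "end": lo})       # piece before the region
        out.append({**e, "start": lo, "end": hi, "prot": prot})
        if hi < e["end"]:
            out.append({**e, "start": hi})     # piece after the region
    return out
```

Every piece produced by a split carries the same obj value, reflecting that only the permissions differ.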
5.9
Termination of a Process
The final change in process state that relates to the operation of the
virtual-memory system is exit; this system call terminates a process, as described in Chapter 4.
The part of exit that is discussed here is the release of the virtual-memory
resources of the process. The release is done in two steps:
1. The user portions of the address space are freed, both in memory and on swap
space.
2. The user area is freed.
These two operations are complicated because the kernel stack in the user area
must be used until the process relinquishes the processor for the final time.
The first step, freeing the user address space, is identical to the one that
occurs during exec to free the old address space. The free operation proceeds
entry by entry through the list of vm_map_entry structures associated with the
address space. The first step in freeing an entry is to traverse the latter's list of
shadow and copy objects. If the entry is the last reference to a shadow or copy
object, then any memory or swap space that is associated with the object can be
freed. In addition, the machine-dependent routines are called to unmap and free
up any page table or data structures that are associated with the object. If the
shadow or copy object is still referenced by other vm_map_entry structures, its
resources cannot be freed, but the kernel still needs to call the machine-dependent
routines to unmap and free the resources associated with the current process
mapping. Finally, if the underlying object referenced by the vm_map_entry is losing
its last reference, then that object is a candidate for deallocation. If it is an object
that will never have any chance of a future reuse (such as an anonymous object
associated with a stack or uninitialized data area), then its resources are freed as
though it were a shadow or copy object. However, if the object maps a file (such
as an executable) that might be used again soon, the object is saved in the object
cache, where it can be found by newly executing processes or by processes
mapping in a file. The number of unreferenced cached objects is limited to a threshold
set by the system (typically 100). If adding this new object would cause the cache
to grow beyond its limit, the least recently used object in the cache is removed and
deallocated.
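The object cache behaves like a bounded least-recently-used table. A minimal sketch, assuming an OrderedDict keyed by some object identity; the class and method names are illustrative, and the default limit of 100 is the typical threshold the text mentions.

```python
from collections import OrderedDict

class ObjectCache:
    def __init__(self, limit=100):
        self.limit = limit
        self.objects = OrderedDict()   # least recently used entry first

    def insert(self, key, obj):
        """Cache an unreferenced object; return any evicted (key, obj)."""
        evicted = None
        if key not in self.objects and len(self.objects) >= self.limit:
            evicted = self.objects.popitem(last=False)
        self.objects[key] = obj
        self.objects.move_to_end(key)
        return evicted

    def lookup(self, key):
        """Remove and return a cached object so that it can be reactivated."""
        return self.objects.pop(key, None)
```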
Next, the memory used by the user area must be freed. This operation begins
the problematic time when the process must free resources that it has not yet
finished using. It would be disastrous if a page from the user structure or kernel
stack were reallocated and reused before the process had finished the exit().
Memory is allocated either synchronously by the page-fault handler or
asynchronously from interrupt handlers that use malloc(), such as the network when
packets arrive (see Chapter 12). To block any allocation of memory, it is necessary to
delay interrupts by raising the processor interrupt-priority level. The process may
then free the pages of its user area, safe from having them reused until it has
relinquished the processor. The next context switch will lower the priority so that
interrupts may resume.
With all its resources free, the exiting process finishes detaching itself from its
process group, and notifies its parent that it is done. The process has now become
a zombie process: one with no resources, not even a kernel stack. Its parent will
collect its exit status with a wait call, and will free its process structure.
There is nothing for the virtual-memory system to do when wait is called: All
virtual-memory resources of a process are removed when exit is done. On wait,
the system just returns the process status to the caller, and deallocates the
process-table entry and the small amount of space in which the resource-usage information
was kept.
5.10
The Pager Interface
The pager interface provides the mechanism by which data are moved between
backing store and physical memory. The 4.4BSD pager interface is a modification
of the interface present in Mach 2.0. The interface is page based, with all data
requests made in multiples of the software page size. Vm_page structures are
passed around as descriptors providing the backing-store offset and physical
cache-page address of the desired data. This interface should not be confused
with the current Mach 3.0 external paging interface [Young, 1989], where pagers
are typically user applications outside the kernel and are invoked via asynchronous
remote procedure calls using the Mach interprocess-communication mechanism.
The 4.4BSD interface is internal in the sense that the pagers are compiled into the
kernel and pager routines are invoked via simple function calls.
Associated with each object is a pager_struct structure representing an
instance of the pager type responsible for supplying the contents of pages within
the object. This structure contains pointers to type-specific routines for reading
and writing data, as well as a pointer to instance-specific storage. Conceptually,
the pager_struct structure describes a logically contiguous piece of backing store,
such as a chunk of swap space or a disk file. A pager_struct and any associated
instance-specific data are collectively known as a pager instance in the following
discussion.
A pager instance is typically created at the same time as the object when a
file, device, or piece of anonymous memory is mapped into a process address
space. The pager instance continues to exist until the object is deallocated. When
a page fault occurs for a virtual address mapping a particular object, the
fault-handling code allocates a vm_page structure and converts the faulting address to an
offset within the object. This offset is recorded in the vm_page structure, and the
page is added to the list of pages cached by the object. The page frame and the
object's pager instance are then passed to the underlying pager routine. The pager
routine is responsible for filling the vm_page structure with the appropriate initial
value for that offset of the object that it represents.
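The conversion from a faulting address to an object offset, followed by the call into the pager, can be sketched as below. The dictionary layouts, the per-entry offset field, the getpage callback, and the 4096-byte page size are assumptions of this illustration rather than the kernel's actual structures.

```python
PAGE_SIZE = 4096  # assumed page size

def handle_fault(addr, entry, obj, getpage):
    """Resolve a fault at addr and return the page that now backs it.

    entry has start (address) and offset (object offset of the entry's
    first byte); obj has a pages dict; getpage(offset) returns contents.
    """
    page_addr = addr - addr % PAGE_SIZE           # truncate to a page
    offset = entry["offset"] + (page_addr - entry["start"])
    if offset not in obj["pages"]:
        page = {"offset": offset, "data": None}   # stand-in for a vm_page
        obj["pages"][offset] = page               # cached by the object
        page["data"] = getpage(offset)            # pager fills the page
    return obj["pages"][offset]
```

A second fault on the same page finds it in the object's page list and never reaches the pager.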
The pager instance is also responsible for saving the contents of a dirty page
if the system decides to push out the latter to backing store. When the pageout
daemon decides that a particular page is no longer needed, it requests the object
that owns the page to free the page. The object first passes the page with the
associated logical offset to the underlying pager instance, to be saved for future use.
The pager instance is responsible for finding an appropriate place to save the page,
doing any I/O necessary for the save, and then notifying the object that the page
can be freed. When it is done, the pager instance notifies the pageout daemon to
move the vm_page structure from the object to the free list for future use.
There are seven routines associated with each pager type. The pgo_init
routine is called at boot time to do any one-time type-specific initializations, such as
allocating a pool of private pager structures. The pgo_alloc and pgo_dealloc
routines are called when an instance of a pager should be created or destroyed. The
allocation routine is called whenever the corresponding object is mapped into an
address space via mmap. Hence, only the first call should create the structure;
successive calls just increment the reference count for the associated object and
return a pointer to the existing pager instance. The deallocation routine is called
only when the object reference count drops to zero.
Pgo_getpages is called to return one or more pages of data from a pager
instance either synchronously or asynchronously. Currently, this routine is called
from only the page-fault handler to synchronously fill single pages. Pgo_putpages
writes back one or more pages of data. This routine is called by the pageout
daemon to write back one or more pages asynchronously, and by msync to write back
single pages synchronously or asynchronously. Both the get and put routines are
called with an array of vm_page structures indicating the affected pages.
Pgo_cluster takes an offset and returns an enclosing offset range representing
an optimal I/O transfer unit for the backing store. This range can be used with
pgo_getpages and pgo_putpages to help do informed prefetching or clustered
cleaning. Currently, it is used by only the pageout daemon for the latter task. The
pgo_haspage routine queries a pager instance to see whether that instance has data
at a particular backing-store offset. This routine is used in only the page-fault
handler, to determine whether an internal copy object already has received a copy
of a particular page.
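The seven routines can be pictured as a table of operations that each pager type fills in. The sketch below uses the routine names from the text, but the argument lists are assumptions, and MemPager is a toy pager whose backing store is a dictionary. It also keeps the reference count on the pager instance, a simplification of the object reference count described above.

```python
class Pager:
    """The seven per-type routines; argument lists are assumed."""
    def pgo_init(self): pass                        # one-time, at boot
    def pgo_alloc(self, handle, size): raise NotImplementedError
    def pgo_dealloc(self, inst): raise NotImplementedError
    def pgo_getpages(self, inst, pages, sync=True): raise NotImplementedError
    def pgo_putpages(self, inst, pages, sync=True): raise NotImplementedError
    def pgo_cluster(self, inst, offset): raise NotImplementedError
    def pgo_haspage(self, inst, offset): raise NotImplementedError

class MemPager(Pager):
    CLUSTER = 4 * 4096                  # assumed optimal transfer size

    def __init__(self):
        self.instances = {}

    def pgo_alloc(self, handle, size):
        inst = self.instances.setdefault(handle, {"refs": 0, "store": {}})
        inst["refs"] += 1               # later calls only add a reference
        return inst

    def pgo_putpages(self, inst, pages, sync=True):
        for offset, data in pages:      # pages: (offset, contents) pairs
            inst["store"][offset] = data

    def pgo_getpages(self, inst, pages, sync=True):
        return [inst["store"][offset] for offset in pages]

    def pgo_haspage(self, inst, offset):
        return offset in inst["store"]

    def pgo_cluster(self, inst, offset):
        lo = offset - offset % self.CLUSTER
        return lo, lo + self.CLUSTER    # enclosing transfer range
```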
The three types of pagers supported by the system are described in the next
three subsections.
Vnode Pager
The vnode pager handles objects that map files in a filesystem. Whenever a file is
mapped either explicitly by mmap or implicitly by exec, the vnode-pager
allocation routine is called. If the call represents the first mapping of the vnode, the
necessary vnode-pager-specific structure is created, and an object of the appropriate
size is allocated and is associated with the pager instance. The vnode-pager
structure contains a pointer to the vnode and a copy of the latter's current size. The
vnode reference count is incremented to reflect the pager reference. If this
initialization call is not the first for a vnode, the existing pager structure is located. In
either case, the associated object's reference count is incremented, and a pointer to
the pager instance is returned.
When a pagein request is received by the vnode-pager read routine, the
provided physical page is mapped into the kernel address space long enough for the
pager instance to call the filesystem VOP_READ vnode operation to load the page
with the file contents. Once the page is filled, the kernel mapping can be dropped,
and the page can be returned.
When the vnode pager is asked to save a page to be freed, it simply arranges
to write back the page to the part of the file from which the page came. The page
is mapped into the kernel address space long enough for the pager routine to call
the filesystem VOP_WRITE vnode operation to store the page back into the file.
Once the page is stored, the kernel mapping can be dropped, and the object can be
notified that the page can be freed.
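From user level, the vnode pager's read and write paths resemble positioned file I/O on page-sized units. The following sketch uses os.pread and os.pwrite in place of VOP_READ and VOP_WRITE; the function names, the 4096-byte page size, the zero padding of a short final page, and the clipping of writes at end of file are assumptions made for illustration.

```python
import os

PAGE_SIZE = 4096  # assumed page size

def vnode_getpage(fd, offset):
    """Fill one page from the file; a short read is padded with zeros."""
    data = os.pread(fd, PAGE_SIZE, offset)
    return data + bytes(PAGE_SIZE - len(data))

def vnode_putpage(fd, offset, page, file_size):
    """Write a page back to where it came from, not past end of file."""
    length = max(0, min(PAGE_SIZE, file_size - offset))
    return os.pwrite(fd, page[:length], offset)
```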
If a file is being privately mapped, then modified pages cannot be written
back to the filesystem. Such private mapping must use a shadow object with a
swap pager for all pages that are modified. Thus, a privately mapped object will
never be asked to save any dirty pages to the underlying file.
When the last address-space mapping of a vnode is removed by munmap or
exit, the vnode-pager deallocation routine is called. This routine releases the
vnode reference and frees the vnode-pager structure.
The vnode-pager I/O routines use the VOP_READ and VOP_WRITE vnode
operations that pass data through any caches maintained by filesystems (e.g., the
buffer cache used by UFS and NFS). The problem with this approach is that the
virtual-memory system maintains a cache of file pages that is independent of the
filesystem caches, resulting in potential double caching of file data. This condition
leads to inefficient cache use, and worse, to the potential for inconsistencies
between the two caches. Modifications to files that are mapped into memory are
not seen by processes that read those files until the mapped file is written back to
the filesystem and reread into the filesystem cache. Similarly, changes to files
written to the filesystem are not visible to processes that map those files until the file is
written back to disk and then page faulted into the process. The writeback and
rereading may take seconds to hours, depending on the level of memory activity.
In 4.4BSD, this problem is addressed in an ad hoc and incomplete fashion.
Two vnode-pager-specific routines are called from various points in the VFS code.
Vnode_pager_setsize() is invoked when a file changes size. If the file has shrunk,
any excess cached pages are removed from the object. This page removal
guarantees that future mapped references to those pages will cause page faults, and in
turn, will generate a signal to the mapping process. Vnode_pager_uncache()
removes the object representing a vnode from the object cache. Recall that the
object cache contains only objects that are not currently referenced; thus, this
routine will not help to maintain consistency for an object that is currently mapped.
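The effect of a file shrinking on the object's cached pages can be sketched as follows. The dictionary representation of the object and the decision to keep the partial page that contains the new end of file are assumptions of this illustration; only the routine name comes from the text.

```python
PAGE_SIZE = 4096  # assumed page size

def vnode_pager_setsize(obj, new_size):
    """Record a new file size; return offsets of cached pages dropped.

    obj is a dict with size and pages (a dict mapping offset to page).
    """
    dropped = []
    if new_size < obj["size"]:
        # First page boundary at or beyond the new end of file.
        first_gone = -(-new_size // PAGE_SIZE) * PAGE_SIZE
        dropped = sorted(off for off in obj["pages"] if off >= first_gone)
        for off in dropped:
            del obj["pages"][off]
    obj["size"] = new_size
    return dropped
```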
A more consistent interface can be obtained by using a common cache for
both the virtual-memory system and the filesystem. Three approaches to merging
the two caches are being undertaken. One approach is to have the filesystem use
objects in the virtual-memory system as its cache; a second approach is to have
the virtual-memory objects that map files use the existing filesystem cache; the
third approach is to create a new cache that is a merger of the two existing caches,
and to convert both the virtual memory and the filesystems to use this new cache.
Each of these approaches has its merits and drawbacks; it is not yet clear which
approach will work best.
Device Pager
The device pager handles objects representing memory-mapped hardware devices.
Memory-mapped devices provide an interface that looks like a piece of memory.
An example of a memory-mapped device is a frame buffer, which presents a range
of memory addresses with one word per pixel on the screen. The kernel provides
access to memory-mapped devices by mapping the device memory into a
process's address space. The process can then access that memory without further
operating-system intervention. Writing to a word of the frame-buffer memory
causes the corresponding pixel to take on the appropriate color and brightness.
The device pager is fundamentally different from the other two pagers in that
it does not fill provided physical-memory pages with data. Instead, it creates and
manages its own vm_page structures, each of which describes a page of the device
space. This approach makes device memory look like wired physical memory.
Thus, no special code should be needed in the remainder of the virtual-memory
system to handle device memory.
When a device is first mapped, the device-pager allocation routine will
validate the desired range by calling the device d_mmap() routine. If the device
allows the requested access for all pages in the range, the device pager creates a
device-pager structure and associated object. It does not create vm_page
structures at this time; they are created individually by the page-get routine as they are
referenced. The reason for this late allocation is that some devices export a large
memory range in which either not all pages are valid or the pages may not be
accessed for common operations. Complete allocation of vm_page structures for
these sparsely accessed devices would be wasteful.
The first access to a device page will cause a page fault and will invoke the
device-pager page-get routine. The pager instance creates a vm_page structure,
initializes the latter with the appropriate object offset and a physical address
returned by the device d_mmap() routine, and flags the page as fictitious. This
vm_page structure is added to the list of all such allocated pages in the
device-pager structure. Since the fault code has no special knowledge of the device
pager, it has preallocated a physical-memory page to fill and has associated that
vm_page structure with the object. The device-pager routine removes that
vm_page structure from the object, returns the structure to the free list, and inserts
its own vm_page structure in the same place.
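The validate-at-map-time, create-on-first-fault behavior of the device pager can be modeled as below. The DevicePager class, the dictionary used for a page, and the convention that the d_mmap callback returns -1 for an invalid offset are assumptions of this sketch.

```python
PAGE_SIZE = 4096  # assumed page size

class DevicePager:
    def __init__(self, d_mmap, size):
        # d_mmap(offset) returns a physical address, or -1 if the offset
        # is not valid; this return convention is assumed here.
        if any(d_mmap(off) == -1 for off in range(0, size, PAGE_SIZE)):
            raise ValueError("device rejects part of the range")
        self.d_mmap = d_mmap
        self.pages = []                 # fictitious pages this pager made

    def getpage(self, obj_pages, offset, free_list):
        """Swap the preallocated page at offset for a fictitious one."""
        prealloc = obj_pages.pop(offset)     # page the fault code supplied
        free_list.append(prealloc)           # hand it back to the free list
        page = {"offset": offset, "phys": self.d_mmap(offset),
                "fictitious": True}
        self.pages.append(page)
        obj_pages[offset] = page             # ours takes the same place
        return page
```

No page descriptors exist until getpage runs, so a sparsely used device costs only the descriptors for the pages actually touched.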
The device-pager page-put routine expects never to be called and will panic if
it is. This behavior is based on the assumption that device-pager pages are never
entered into any of the paging queues and hence will never be seen by the pageout
daemon. However, it is possible to msync a range of device memory. This
operation brings up an exception to the higher-level virtual-memory system's ignorance
of device memory: The object page-cleaning routine will skip pages that are
flagged as fictitious.
Finally, when a device is unmapped, the device-pager deallocation routine is
invoked. This routine deallocates the vm_page structures that it allocated, as well
as the device-pager structure itself.
Swap Pager
The term swap pager refers to two functionally different pagers. In the most
common use, swap pager refers to the pager that is used by objects that map
anonymous memory. This pager has sometimes been referred to as the default pager
because it is the pager that is used if no other pager has been requested. It
provides what is commonly known as swap space: nonpersistent backing store that is
zero filled on first reference. The zero filling is really done by the fault-handling
code, without ever invoking the swap pager. Because of the zero-filling
optimization and the transient nature of the backing store, allocation of swap-pager
resources for a particular object may be delayed until the first pageout operation.
Until that time, the pager-structure pointer in the object is NULL. While the object
is in this state, page faults (getpage) are handled by zero filling, and page queries
(haspage) are not necessary. The expectation is that free memory will be plentiful
enough that it will not be necessary to swap out any pages. The object will simply
create zero-filled pages during the process lifetime that can all be returned to the
free list when the process exits.
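The delayed allocation of swap-pager resources can be illustrated with a small model of an anonymous object. The AnonObject class and its use of a dictionary as the "pager" are assumptions; the point is only that the pager reference stays empty until the first pageout.

```python
class AnonObject:
    def __init__(self):
        self.pager = None               # no swap resources until a pageout
        self.resident = {}              # offset -> page contents in memory

    def fault(self, offset, page_size=4096):
        if offset in self.resident:
            return self.resident[offset]
        if self.pager is not None and offset in self.pager:
            data = self.pager[offset]   # page was pushed out earlier
        else:
            data = bytes(page_size)     # zero fill; pager is not invoked
        self.resident[offset] = data
        return data

    def pageout(self, offset):
        if self.pager is None:
            self.pager = {}             # first pageout allocates the pager
        self.pager[offset] = self.resident.pop(offset)
```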
The role of the swap pager is swap-space management: figuring out where to
store dirty pages and how to find dirty pages when they are needed again. Shadow
objects require that these operations be efficient. A typical shadow object is
sparsely populated: It may cover a large range of pages, but only those pages that
have been modified will be in the shadow object's backing store. In addition, long
chains of shadow objects may require numerous pager queries to locate the correct
copy of an object page to satisfy a page fault. Hence, determining whether a pager
instance contains a particular page needs to be fast, preferably requiring no I/O
operations. A final requirement of the swap pager is that it can do asynchronous
writeback of dirty pages. This requirement is necessitated by the pageout daemon,
which is a single-threaded process. If the pageout daemon blocked waiting for a
page-clean operation to complete before starting the next operation, it is unlikely
that it could keep enough memory free in times of heavy memory demand.
In theory, any pager that meets these criteria can be used as the swap pager.
In Mach 2.0, the vnode pager was used as the swap pager. Special paging files
could be created in any filesystem and registered with the kernel. The swap
pager would then suballocate pieces of the files to back particular anonymous
objects. Asynchronous writes were a side effect of the filesystem's use of the
buffer cache. One obvious advantage of using the vnode pager is that swap
space can be expanded by the addition of more swap files or the extension of
existing ones dynamically (i.e., without rebooting or reconfiguring of the
kernel). The main disadvantage is that, in the past, the filesystem has not
been able to deliver a respectable fraction of the disk bandwidth.
The desire to provide the highest possible disk bandwidth led to the creation
of a special raw-partition pager to use as the swap pager for 4.4BSD. Previous
versions of BSD also used dedicated disk partitions, commonly known as swap
partitions; hence, this partition pager became the swap pager. The remainder
of this section describes how the partition pager is implemented, and how it
provides the necessary capabilities for backing anonymous objects.
As mentioned, a swap-pager instance will not be created until the first time
that a page from the object is replaced in memory. At that time, a structure
is allocated to describe the swap space that can hold the object. This swap
space is described by an array of fixed-sized swap blocks. The size of each
swap block is selected based on the size of the object that the swap pager is
managing. For a small object, a minimal-sized (32-Kbyte) swap block will be
used; for a large object, each swap block may be as large as 32 x pagesize.
For a machine such as the HP300 with a pagesize of 4 Kbyte, the maximum
swap-block size will be 128 Kbyte. A swap block is always an integral number
of virtual-memory pages, because those are the units used in the pager
interface.
The two structures created by the swap pager are shown in Fig. 5.13. The
swpager structure describes the swap area being managed by the pager. It
records the total size of the object (object size), the size of each swap
block being managed (block size), and the number of swap blocks that make up
the swap area for the object (block count). It also has a pointer to an array
of block-count swblock structures, each containing a device block number and a
bit mask. The block number gives the address of the first device block in a
contiguous chunk of block-size DEV_BSIZE-sized blocks that form the swap
block, or is zero if a swap block has never been allocated. A mask of 1 bit
per page-sized piece within this swap block records which pages in the block
contain valid data. A bit is set when the corresponding page is first written
to the swap area. Together, the swblock array and associated bit masks provide
a two-level page table describing the backing store of an object. This
structure provides efficient swap-space allocation for sparsely populated
objects, since a given swap block does not need to be allocated until the
first time that a page in its block-size range is written back. The structure
also allows efficient page lookup: at most an array-indexing operation and a
bit-mask operation.
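The two-level lookup can be sketched in C as follows. This is a minimal
illustration of the array-index-plus-bit-mask idea, not the kernel's code: the
field names, the 32-bit mask width (chosen to match the 32-page maximum block
size mentioned above), and the helper names swap_has_page() and
swap_mark_page() are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One swap block: its start on the swap device plus a valid-page mask. */
struct swblock {
    uint32_t swb_block;         /* first device block, 0 if never allocated */
    uint32_t swb_mask;          /* bit i set => page i of this block is on swap */
};

/* Per-object swap-pager state (field names are illustrative). */
struct swpager {
    size_t          sw_osize;   /* object size in bytes */
    size_t          sw_bsize;   /* swap-block size in bytes (nonzero) */
    size_t          sw_nblocks; /* number of entries in sw_blocks */
    struct swblock *sw_blocks;  /* array of sw_nblocks swap blocks */
};

/* Return 1 if the page at byte offset `off` has valid data in swap. */
static int
swap_has_page(const struct swpager *sw, size_t off, size_t pagesize)
{
    size_t ix = off / sw->sw_bsize;                 /* first level: array index */
    size_t bit = (off % sw->sw_bsize) / pagesize;   /* second level: mask bit */

    if (off >= sw->sw_osize || ix >= sw->sw_nblocks)
        return 0;
    if (sw->sw_blocks[ix].swb_block == 0)
        return 0;                                   /* block never allocated */
    return (int)((sw->sw_blocks[ix].swb_mask >> bit) & 1u);
}

/* Record that the page at `off` has been written to an allocated block. */
static void
swap_mark_page(struct swpager *sw, size_t off, size_t pagesize)
{
    size_t ix = off / sw->sw_bsize;
    size_t bit = (off % sw->sw_bsize) / pagesize;

    assert(ix < sw->sw_nblocks && sw->sw_blocks[ix].swb_block != 0);
    sw->sw_blocks[ix].swb_mask |= (uint32_t)1 << bit;
}
```

Neither routine touches the disk, which is the property that makes pager
queries along a long shadow chain inexpensive.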
The size of the object is frozen at the time of allocation. Thus, if the
anonymous area continues to grow (such as the stack or heap of a process), a
new object must be created to describe the expanded area. On a system that is
short of memory, the result is that a large process may acquire many anonymous
objects. Changing the swap pager to handle growing objects would cut down on
this object proliferation dramatically.
Figure 5.13 Structures used to manage swap space.
[Figure: a struct swpager holding the object size, block size, block count,
and a pointer to an array of struct swblock entries; each struct swblock holds
a block number and a mask.]
Figure 5.14 A kernel resource map.
[Figure: a resource, starting at base 0, in which some ranges are in use; the
free segments are described by the resource map <0,8>, <16,13>, <36,6>,
<48,15>.]
Allocation of swap blocks from the system's pool of swap space is managed
with a resource map called swapmap. Resource maps are ordered arrays of <base,
size> pairs describing the free segments of a resource (see Fig. 5.14). A
segment of a given size is allocated from a resource map by rmalloc(), using a
first-fit algorithm, and is subsequently freed with rmfree(). The swap map is
initialized at boot time to contain all the available swap space. An index
into the swapmap at which space has been allocated is used as the index of the
disk address within the swap area.
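A first-fit resource map of this kind can be sketched in a few lines of C.
The sketch below is an illustration only: rm_alloc() and rm_free() are
hypothetical stand-ins rather than the kernel's rmalloc() and rmfree(), the
fixed RM_MAXSEG capacity is an assumption, and failure is reported as -1
instead of following the kernel's conventions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RM_MAXSEG 16            /* assumed fixed capacity of the map */

/* One free segment of the resource: [base, base + size). */
struct rmseg {
    long base;
    long size;
};

/* A resource map: free segments kept in ascending order of base. */
struct rmap {
    struct rmseg seg[RM_MAXSEG];
    size_t       nseg;
};

/* First fit: carve `size` units from the first segment large enough. */
static long
rm_alloc(struct rmap *m, long size)
{
    for (size_t i = 0; i < m->nseg; i++) {
        struct rmseg *s = &m->seg[i];
        if (s->size < size)
            continue;
        long base = s->base;
        s->base += size;
        s->size -= size;
        if (s->size == 0) {     /* segment used up: close the gap */
            memmove(s, s + 1, (m->nseg - i - 1) * sizeof(*s));
            m->nseg--;
        }
        return base;
    }
    return -1;                  /* no segment is large enough */
}

/* Return [base, base + size) to the map, merging with free neighbors. */
static int
rm_free(struct rmap *m, long base, long size)
{
    size_t i = 0;

    while (i < m->nseg && m->seg[i].base < base)
        i++;                    /* i = first segment above the freed range */

    int join_prev = i > 0 && m->seg[i - 1].base + m->seg[i - 1].size == base;
    int join_next = i < m->nseg && base + size == m->seg[i].base;

    if (join_prev && join_next) {       /* freed range bridges two segments */
        m->seg[i - 1].size += size + m->seg[i].size;
        memmove(&m->seg[i], &m->seg[i + 1],
            (m->nseg - i - 1) * sizeof(m->seg[0]));
        m->nseg--;
    } else if (join_prev) {
        m->seg[i - 1].size += size;
    } else if (join_next) {
        m->seg[i].base = base;
        m->seg[i].size += size;
    } else {
        if (m->nseg == RM_MAXSEG)
            return -1;          /* map is full; the space cannot be recorded */
        memmove(&m->seg[i + 1], &m->seg[i],
            (m->nseg - i) * sizeof(m->seg[0]));
        m->seg[i].base = base;
        m->seg[i].size = size;
        m->nseg++;
    }
    return 0;
}
```

Starting from the map of Fig. 5.14, a request for 10 units skips the 8-unit
segment at base 0 and is satisfied from the segment at base 16, leaving
<26,3> in its place.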
The swap pager is complicated considerably by the requirement that it handle
asynchronous writes from the pageout daemon. The protocol for managing these
writes is described in Section 5.12.
5.11
When the memory-management hardware detects an invalid virtual address, it
generates a trap to the system. This page-fault trap can occur for several
reasons. Most BSD programs are created in a format that permits the executable
image to be paged into main memory directly from the filesystem. When a
program in a demand-paged format is first run, the kernel marks as invalid the
pages for the text and initialized data regions of the executing process. The
text and initialized data regions share an object that provides fill-on-demand
from the filesystem. As each page of the text or initialized data region is
first referenced, a page fault occurs.
Page faults can also occur when a process first references a page in the
uninitialized data region of a program. Here, the anonymous object managing
the region automatically allocates memory to the process and initializes the
newly assigned page to zero. Other types of page faults arise when previously
resident pages have been reclaimed by the system in response to a memory
shortage.
The handling of page faults is done with the vm_fault() routine; this routine
services all page faults. Each time vm_fault() is invoked, it is provided the
virtual address that caused the fault. The first action of vm_fault() is to
traverse the vm_map_entry list of the faulting process to find the entry
associated with the fault. The routine then computes the logical page within
the underlying object and traverses the list of objects to find or create the
needed page. Once the page has been found, vm_fault() must call the
machine-dependent layer to validate the faulted page, and return to restart
the process.
The details of calculating the address within the object were described in
Section 5.4. Having computed the offset within the object and determined the
object's protection and object list from the vm_map_entry, the kernel is ready
to find or create the associated page. The page-fault-handling algorithm is
shown in Fig. 5.15. In the following overview, the lettered points are
references to the tags down the left side of the code.
A. The loop traverses the list of shadow, copy, anonymous, and file objects
until it either finds an object that holds the sought-after page, or reaches
the final object in the list. If no page is found, the final object will be
requested to produce it.
B. An object with the desired page has been found. If the page is busy,
another process may be in the middle of faulting it in, so this process is
blocked until the page is no longer busy. Since many things could have
happened to the affected object while the process was blocked, it must restart
the entire fault-handling algorithm. If the page was not busy, the algorithm
exits the loop with the page.
C. Anonymous objects (such as those used to represent shadow and copy objects)
do not allocate a pager until the first time that they need to push a page to
backing store. Thus, if an object has a pager, then there is a chance that the
page previously existed but was paged out. If the object does have a pager,
then the kernel needs to allocate a page to give to the pager to be filled
(see D). The special case for the object being the first object is to avoid a
race condition with two processes trying to get the same page. The first
process through will create the sought-after page in the first object, but
keep it marked as busy. When the second process tries to fault the same page,
it will find the page created by the first process and block on it (see B).
When the first process completes the pagein processing, it will unlock the
first page, causing the second process to awaken, retry the fault, and find
the page created by the first process.
D. If the page is present in the file or swap area, the pager will bring it
back into the newly allocated page. If the pagein succeeds, then the
sought-after page has been found. If the page never existed, then the pagein
request will fail. Unless this object is the first, the page is freed and the
search continues. If this object is the first, the page is not freed, so that
it will act as a block to further searches by other processes (as described in
C).
E. If the kernel created a page in the first object but did not use that page,
it will have to remember that page so that it can free the page when the
pagein is done (see M).
F. If the search has reached the end of the object list and has not found the
page, then the fault is on an anonymous object chain, and the first object in
the list will handle the page fault using the page allocated in C. The
first_page entry is set to NULL to show that it does not need to be freed, the
page is zero filled, and the loop is exited.
    /*
     * Handle a page fault occurring at the given address,
     * requiring the given permissions, in the map specified.
     * If successful, insert the page into the associated
     * physical map.
     */
    vm_fault(map, addr, type)
    {
    RetryFault:
        lookup address in map returning object/offset/prot;
        first_object = object;
[A]     for (;;) {
            page = lookup page at object/offset;
[B]         if (page found) {
                if (page busy)
                    block and goto RetryFault;
                remove from paging queues;
                mark page as busy;
                break;
            }
[C]         if (object has pager or object == first_object) {
                page = allocate a page for object/offset;
                if (no pages available)
                    block and goto RetryFault;
            }
[D]         if (object has pager) {
                call pager to fill page;
                if (IO error)
                    return an error;
                if (pager has page)
                    break;
                if (object != first_object)
                    free page;
            }
            /* no pager, or pager does not have page */
[E]         if (object == first_object)
                first_page = page;
            next_object = next object;
[F]         if (no next object) {
                if (object != first_object) {
                    object = first_object;
                    page = first_page;
                }
                first_page = NULL;
                zero fill page;
                break;
            }
            object = next_object;
        }
[G]     /* appropriate page has been found or allocated */
        orig_page = page;
[H]     if (object != first_object) {
            if (fault type == WRITE) {
                copy page to first_page;
                deactivate page;
                page = first_page;
                object = first_object;
            } else {
                prot &= ~WRITE;
                mark page copy-on-write;
            }
        }
        if (first_object has copy object) {
[I]         if (fault type != WRITE) {
                prot &= ~WRITE;
                mark page copy-on-write;
            } else {
                copy_object = first_object copy object;
                lookup page in copy_object;
[J]             if (page exists) {
                    if (page busy)
                        block and goto RetryFault;
                } else {
[K]                 allocate a blank page;
                    if (no pages available)
                        block and goto RetryFault;
                    if (copy_object has pager) {
                        call pager to see if page exists;
                        if (page exists)
                            free blank page;
                    }
                    if (page doesn't exist) {
                        copy page to copy_object page;
                        remove orig_page from pmaps;
                        activate copy page;
                    }
                }
                mark page not copy-on-write;
            }
        }
[L]     if (prot & WRITE)
            mark page not copy-on-write;
        enter mapping for page;
[M]     activate and unbusy page;
        if (first_page != NULL)
            unbusy and free first_page;
    }

Figure 5.15 Page-fault handling.
G. The search exits the loop with page as the page that has been found or
allocated and initialized, and object as the owner of that page. The page has
been filled with the correct data at this point.
H. If the object providing the page is not the first object, then this mapping
must be private, with the first object being a shadow object of the object
providing the page. If pagein is handling a write fault, then the contents of
the page that it has found have to be copied to the page that it allocated for
the first object. Having made the copy, it can release the object and page
from which the copy came, as the first object and first page will be used to
finish the page-fault service. If pagein is handling a read fault, it can use
the page that it found, but it has to mark the page copy-on-write to avoid the
page being modified in the future.
I. Pagein is handling a read fault. It can use the page that it found, but has
to mark the page copy-on-write to avoid the page being modified before pagein
has had a chance to copy the page for the copy object.
J. If the copy object already has a copy of the page in memory, then pagein
does not have to worry about saving the one that it just created.
K. If there is no page in memory, the copy object may still have a copy in its
backing store. If the copy object has a pager, the vm_pager_has_page() routine
is called to find out if the copy object still has a copy of the page in its
backing store. This routine does not return any data; the blank page is
allocated to avoid a race with other faults. Otherwise, the page does not
exist, so pagein must copy the page that it found to the page owned by the
copy object. After doing this copying, pagein has to remove all existing
mappings for the page from which it copied, so that future attempts to access
that page will fault and find the page that pagein left in the copy object.
L. If pagein is handling a write fault, then it has made any copies that were
necessary, so it can safely make the page writable.
M. As the page and possibly the first_page are released, any processes waiting
for that page of the object will get a chance to run to get their own
references.
Note that the page and object locking has been elided in Fig. 5.15 to simplify
the explanation. In 4.4BSD, no clustering is done on pagein; only the
requested page is brought in from the backing store.
5.12 Page Replacement
The service of page faults and other demands for memory may be satisfied from
the free list for some time, but eventually memory must be reclaimed for reuse.
Some pages are reclaimed when processes exit. On systems with a large amount
of memory and low memory demand, exiting processes may provide enough free
memory to fill demand. This case arises when there is enough memory for the
kernel and for all pages that have ever been used by any current process.
Obviously, many computers do not have enough main memory to retain all pages
in memory. Thus, it eventually becomes necessary to move some pages to
secondary storage, namely to the swap space. Bringing in a page is demand
driven. For paging it out, however, there is no immediate indication when a
page is no longer needed by a process. The kernel must implement some strategy
for deciding which pages to move out of memory so that it can replace these
pages with the ones that are currently needed in memory. Ideally, the strategy
will choose pages for replacement that will not be needed soon. An
approximation to this strategy is to find pages that have not been used
recently.
The system implements demand paging with a page-replacement algorithm
that approximates global least-recently used [Corbato, 1968; Easton &
Franaszek, 1979]. It is an example of a global replacement algorithm: one in
which the choice of a page for replacement is made according to systemwide
criteria. A local replacement algorithm would choose a process for which to
replace a page, and then choose a page based on per-process criteria. Although
the algorithm in 4.4BSD is similar in nature to that in 4.3BSD, its
implementation is considerably different.
The kernel scans physical memory on a regular basis, considering pages for
replacement. The use of a systemwide list of pages forces all processes to
compete for memory on an equal basis. Note that it is also consistent with the
way that 4.4BSD treats other resources provided by the system. A common
alternative to allowing all processes to compete equally for memory is to
partition memory into multiple independent areas, each localized to a
collection of processes that compete with one another for memory. This scheme
is used, for example, by the VMS operating system [Kenah & Bate, 1984]. With
this scheme, system administrators can guarantee that a process, or collection
of processes, will always have a minimal percentage of memory. Unfortunately,
this scheme can be difficult to administer. Allocating too small a number of
pages to a partition can result in underutilization of memory and excessive
I/O activity to secondary-storage devices, whereas setting the number too high
can result in excessive swapping [Lazowska & Kelsey, 1978].
The kernel divides the main memory into four lists:
1. Wired: Wired pages are locked in memory and cannot be paged out. Typically,
these pages are being used by the kernel or have been locked down with mlock.
In addition, all the pages being used to hold the user areas of loaded (i.e.,
not swapped-out) processes are also wired. Wired pages cannot be paged out.
2. Active: Active pages are being used by one or more regions of virtual
memory. Although the kernel can page them out, doing so is likely to cause an
active process to fault them back again.
3. Inactive: Inactive pages have contents that are still known, but they are
not usually part of any active region. If the system becomes short of memory,
the
pageout daemon may try to move active pages to the inactive list in the hopes
of finding pages that are not really in use. The selection criteria that are
used by the pageout daemon to select pages to move from the active list to the
inactive list are described later in this section. When the free-memory list
drops too low, the pageout daemon traverses the inactive list to create more
free pages.
4. Free: Free pages have no useful contents, and will be used to fulfill new
page-fault requests.
The pages of main memory that can be used by user processes are those on the
active, inactive, and free lists.
Ideally, the kernel would maintain a working set for each process in the
system. It would then know how much memory to provide to each process to
minimize the latter's page-fault behavior. The 4.4BSD virtual-memory system
does not use the working-set model because it lacks accurate information about
the reference pattern of a process. It does track the number of pages held by
a process via the resident-set size, but it does not know which of the
resident pages constitute the working set. In 4.3BSD, the count of resident
pages was used in making decisions on whether there was enough memory for a
process to be swapped in when that process wanted to run. This feature was not
carried over to the 4.4BSD virtual-memory system. Because it worked well
during periods of high memory demand, this feature should be incorporated in
future 4.4BSD systems.
Paging Parameters
The memory-allocation needs of processes compete constantly, through the
page-fault handler, with the overall system goal of maintaining a minimum
threshold of pages in the free list. As the system operates, it monitors
main-memory utilization, and attempts to run the pageout daemon frequently
enough to keep the amount of free memory at or above the minimum threshold.
When the page-allocation routine, vm_page_alloc(), determines that more memory
is needed, it awakens the pageout daemon.
The work of the pageout daemon is controlled by several parameters that are
calculated during system startup. These parameters are fine tuned by the
pageout daemon as it runs based on the memory available for processes to use.
In general, the goal of this policy is to maintain free memory at, or above, a
minimum threshold. The pageout daemon implements this policy by reclaiming
pages for the free list. The number of pages to be reclaimed by the pageout
daemon is a function of the memory needs of the system. As more memory is
needed by the system, more pages are scanned. This scanning causes the number
of pages freed to increase.
The pageout daemon determines the memory needed by comparing the number of
free memory pages against several parameters. The first parameter,
free_target, specifies a threshold (in pages) for stopping the pageout daemon.
When available memory is above this threshold, no pages will be paged out by
the pageout daemon. Free_target is normally 7 percent of user memory. The other
interesting limit specifies the minimum free memory considered tolerable,
free_min; this limit is normally 5 percent of user memory. If the amount of
free memory goes below free_min, the pageout daemon is started. The desired
size of the list of inactive pages is kept in inactive_target; this limit is
normally 33 percent of available user memory. The size of this threshold
changes over time as more or less of the system memory is wired down by the
kernel. If the number of inactive pages goes below inactive_target, the
pageout daemon begins scanning the active pages to find candidates to move to
the inactive list.
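The relationships among these thresholds can be sketched in C. The sketch
assumes only the percentages quoted above; the structure and the helper
functions are hypothetical and are not taken from the kernel, which also
adjusts the values as it runs.

```c
#include <assert.h>

/* Paging thresholds, all in pages (names follow the text). */
struct paging_params {
    long free_min;          /* below this, the pageout daemon is started */
    long free_target;       /* once free memory reaches this, paging stops */
    long inactive_target;   /* desired length of the inactive list */
};

/* Derive the thresholds from the number of nonwired (user) pages. */
static struct paging_params
paging_params_init(long user_pages)
{
    struct paging_params p;

    p.free_min = user_pages * 5 / 100;          /* 5 percent */
    p.free_target = user_pages * 7 / 100;       /* 7 percent */
    p.inactive_target = user_pages * 33 / 100;  /* 33 percent */
    return p;
}

/* Nonzero if free memory has fallen low enough to start the daemon. */
static int
need_pageout(const struct paging_params *p, long free_pages)
{
    return free_pages < p->free_min;
}

/* Number of pages that must still be freed to reach free_target. */
static long
pages_to_free(const struct paging_params *p, long free_pages)
{
    long shortfall = p->free_target - free_pages;

    return shortfall > 0 ? shortfall : 0;
}

/* Number of active pages to move to the inactive list. */
static long
pages_to_deactivate(const struct paging_params *p, long inactive_pages)
{
    long shortfall = p->inactive_target - inactive_pages;

    return shortfall > 0 ? shortfall : 0;
}
```

Because wiring changes the amount of user memory, a caller of such a routine
would recompute the thresholds whenever the wired-page count changes.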
The desired values for the paging parameters are communicated to the pageout
daemon through global variables. Likewise, the pageout daemon records its
progress in a global variable. Progress is measured by the number of pages
scanned over each interval that it runs.
The Pageout Daemon
Page replacement is done by the pageout daemon. When the pageout daemon
reclaims pages that have been modified, it is responsible for writing them to
the swap area. Thus, the pageout daemon must be able to use normal
kernel-synchronization mechanisms, such as sleep(). It therefore runs as a
separate process, with its own process structure, user structure, and kernel
stack. Like init, the pageout daemon is created by an internal fork operation
during system startup (see Section 14.5); unlike init, however, it remains in
kernel mode after the fork. The pageout daemon simply enters vm_pageout() and
never returns. Unlike other users of the disk I/O routines, the pageout
process needs to do its disk operations asynchronously so that it can continue
scanning in parallel with disk writes.
The goal of the pageout daemon is to keep at least 5 percent of the memory
on the free list. Whenever an operation that uses pages causes the amount of free
memory to fall below this threshold, the pageout daemon is awakened. It starts by
checking to see whether any processes are eligible to be swapped out (see the next
subsection). If the pageout daemon finds and swaps out enough eligible processes
to meet the free-page target, then the pageout daemon goes to sleep to await
another memory shortage.
If there is still not enough free memory, the pageout daemon scans the queue
of inactive pages, starting with the oldest page and working toward the
youngest. It frees those pages that it can until the free-page target is met
or it reaches the end of the inactive list. The following list enumerates the
possible actions that can be taken with each page:
• If the page is clean and unreferenced, move it to the free list and
increment the free-list count.
• If the page has been referenced by an active process, move it from the
inactive list back to the active list.
• If the page is dirty and is being written to the swap area or the
filesystem, skip it for now. The expectation is that the I/O will have
completed by the next time that the pageout daemon runs, so the page will be
clean and can be freed.
• If the page is dirty but is not actively being written to the swap space or
the filesystem, then start an I/O operation to get it written. As long as a
pageout is needed to save the current page, adjacent pages of the region that
are resident, inactive, and dirty are clustered together so that the whole
group can be written to the swap area or filesystem in a single I/O operation.
If they are freed before they are next modified, the free operation will not
require the page to be written.
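The four cases in the list above amount to a small decision procedure, which
can be sketched in C. The page_state structure, the enum, and the function
names are hypothetical, clustering is omitted, and the order in which the
conditions are tested (referenced first) is an assumption, since the text
does not say which case wins when several apply.

```c
#include <assert.h>

/* The per-page facts that the scan consults (illustrative only). */
struct page_state {
    int referenced;         /* used since it was placed on the inactive list */
    int dirty;              /* modified relative to backing store */
    int write_in_progress;  /* a pageout of this page is already under way */
};

enum scan_action {
    SCAN_FREE,              /* clean and unreferenced: move to free list */
    SCAN_REACTIVATE,        /* referenced: move back to the active list */
    SCAN_SKIP,              /* dirty, write pending: look again next pass */
    SCAN_START_WRITE        /* dirty, idle: start an asynchronous write */
};

/* Choose the action for one inactive page. */
static enum scan_action
inactive_scan_action(const struct page_state *pg)
{
    if (pg->referenced)
        return SCAN_REACTIVATE;
    if (!pg->dirty)
        return SCAN_FREE;
    if (pg->write_in_progress)
        return SCAN_SKIP;
    return SCAN_START_WRITE;
}

/*
 * Walk the inactive list from oldest (index 0) to youngest, stopping
 * when the free-page target is met. Returns the resulting free count.
 */
static long
scan_inactive(const struct page_state *pages, int npages,
    long free_count, long free_target)
{
    for (int i = 0; i < npages && free_count < free_target; i++) {
        if (inactive_scan_action(&pages[i]) == SCAN_FREE)
            free_count++;
    }
    return free_count;
}
```

Only clean pages raise the free count during a pass; pages whose writes are
started here become candidates for freeing on a later pass.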
When the scan of the inactive list completes, the pageout daemon checks the
size of the inactive list. Its target is to keep one-third of the available
(nonwired) pages on the inactive list. If the inactive queue has gotten too
small, the pageout daemon moves pages from the active list over to the
inactive list until it reaches its target. Like the inactive list, the active
list is sorted into a least recently activated order: The pages selected to be
moved to the inactive list are those that were activated least recently.
Vm_pageout() then goes to sleep until free memory drops below the target.
The procedure for writing the pages of a process to the swap device, a page
push, is somewhat complicated. The mechanism used by the pageout daemon to
write pages to the swap area differs from normal I/O in two important ways:
1. The dirty pages are mapped into the virtual address space of the kernel,
rather than being part of the virtual address space of the process.
2. The write operation is done asynchronously.
Both these operations are done by the swap_pager_putpage() routine. Because
the pageout daemon does not synchronously wait while the I/O is done, it does
not regain control after the I/O operation completes. Therefore,
swap_pager_putpage() marks the buffer with a callback flag and sets the
routine for the callback to be swap_pager_iodone(). When the push completes,
swap_pager_iodone() is called; it places the buffer on the list of completed
pageouts. If the pageout daemon has finished initiating paging I/O and has
gone to sleep, swap_pager_iodone() awakens it so that it can process the
completed pageout list. If the pageout daemon is still running, it will find
the buffer the next time that it processes the completed pageout list.
Doing the write asynchronously allows the pageout daemon to continue
examining pages, possibly starting additional pushes. Because the number of
swap buffers is constant, the kernel must take care to ensure that a buffer is
available before a commitment to a new page push is made. If the pageout
daemon has used all the swap buffers, swap_pager_putpage() waits for at least
one write operation to complete before it continues. When pageout operations
complete, the buffers are added to the list of completed pageouts and, if a
swap_pager_putpage() was blocked awaiting a buffer, swap_pager_putpage() is
awakened.
The list of completed pageouts is processed by swap_pager_clean() each
time a swap-pager instance is deallocated, before a new swap operation is
started, and before the pageout daemon sleeps. For each pageout operation on
the list, each page (including each in a page cluster) is marked as clean, has
its busy bit
cleared, and has any processes waiting for it awakened. The page is not moved
from its active or inactive list to the free list. If a page remains on the
inactive list, it will eventually be moved to the free list during a future
pass of the pageout daemon. A count of pageouts in progress is kept for the
pager associated with each object; this count is decremented when the pageout
completes, and, if the count goes to zero, a wakeup() is issued. This
operation is done so that an object that is deallocating a swap pager can wait
for the completion of all pageout operations before freeing the pager's
references to the associated swap space.
Although swapping is generally avoided, there are several times when it is
used in 4.4BSD to address a serious resource shortage. Swapping is done in
4.4BSD when any of the following occurs:
• The system becomes so short of memory that the paging process cannot free
memory fast enough to satisfy the demand. For example, a memory shortfall
may happen when multiple large processes are run on a machine lacking enough
memory for the minimum working sets of the processes.
• Processes are completely inactive for more than 20 seconds. Otherwise, such
processes would retain a few pages of memory associated with the user
structure and kernel stack.
Swap operations completely remove a process from main memory, including the
process page tables, the pages of the data and the stack segments that are not
already in swap space, and the user area.
Process swapping is invoked only when paging is unable to keep up with
memory needs or when short-term resource needs warrant swapping a process. In
general, the swap-scheduling mechanism does not do well under heavy load; sys­
tem performance is much better when memory scheduling can be done by the
page-replacement algorithm than when the swap algorithm is used.
Swapout is driven by the pageout daemon. If the pageout daemon can find
any processes that have been sleeping for more than 20 seconds (maxslp, the
cutoff for considering the time sleeping to be "a long time"), it will swap
out the one sleeping for the longest time. Such processes have the least
likelihood of making good use of the memory that they occupy; thus, they are
swapped out even if they are small. If none of these processes are available,
the pageout daemon will swap out a process that has been sleeping for a
shorter time. If memory is still desperately low, it will select to swap out
the runnable process that has been resident the longest. These criteria
attempt to avoid swapping entirely until the pageout daemon is clearly unable
to keep enough memory free. Once swapping of runnable processes has begun, the
processes eligible for swapping should take turns in memory so that no process
is frozen out entirely.
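The selection order just described can be sketched in C. Everything here is
illustrative: the proc_info structure, the mem_pressure levels, and
choose_swapout() are hypothetical, and the sketch assumes that a sleeper
under the maxslp cutoff is taken only when memory is short and that, among
such sleepers, the one asleep longest is preferred.

```c
#include <assert.h>
#include <stddef.h>

#define MAXSLP 20               /* seconds asleep that count as a long time */

enum mem_pressure { MEM_OK, MEM_SHORT, MEM_DESPERATE };

/* The per-process facts that the choice depends on (illustrative). */
struct proc_info {
    int  pid;
    int  resident;              /* nonzero if loaded in memory */
    int  sleeping;              /* nonzero if asleep, zero if runnable */
    long slptime;               /* seconds asleep, if sleeping */
    long restime;               /* seconds resident in memory */
};

/* Pick the process to swap out, or NULL if none should be swapped. */
static const struct proc_info *
choose_swapout(const struct proc_info *procs, size_t n,
    enum mem_pressure pressure)
{
    const struct proc_info *sleeper = NULL;     /* longest-sleeping process */
    const struct proc_info *runner = NULL;      /* longest-resident runnable */

    for (size_t i = 0; i < n; i++) {
        const struct proc_info *p = &procs[i];

        if (!p->resident)
            continue;                           /* already swapped out */
        if (p->sleeping) {
            if (sleeper == NULL || p->slptime > sleeper->slptime)
                sleeper = p;
        } else if (runner == NULL || p->restime > runner->restime) {
            runner = p;
        }
    }
    /* A long sleeper always goes; a short sleeper only under pressure. */
    if (sleeper != NULL &&
        (sleeper->slptime > MAXSLP || pressure >= MEM_SHORT))
        return sleeper;
    /* Runnable processes are swapped only as a last resort. */
    if (pressure == MEM_DESPERATE)
        return runner;
    return NULL;
}
```

Choosing the longest-resident runnable process is what lets runnable
processes take turns in memory once swapping of them has begun.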
The mechanics of doing a swap out are simple. The swapped-in process flag
P_INMEM is cleared to show that the process is not resident in memory, and, if
necessary, the process is removed from the runnable process queue. Its user
area is then marked as pageable, which allows the user area pages, along with
any other remaining pages for the process, to be paged out via the standard
pageout mechanism. The swapped-out process cannot be run until after it is
swapped back into memory.
The Swap-In Process
Swap-in operations are done by the swapping process, process 0. This process is
the first one created by the system when the latter is started. The swap-in policy
of the swapper is embodied in the scheduler() routine. This routine swaps
processes back in when memory is available and they are ready to run. At any time,
the swapper is in one of three states:
1. Idle: No swapped-out processes are ready to be run. Idle is the normal state.
2. Swapping in: At least one runnable process is swapped out, and scheduler()
attempts to find memory for it.
3. Swapping out: The system is short of memory or there is not enough memory
to swap in a process. Under these circumstances, scheduler() awakens the
pageout daemon to free pages and to swap out other processes until the
memory shortage abates.
If more than one swapped-out process is runnable, the first task of the swapper is
to decide which process to swap in. This decision may affect the decision whether
to swap out another process. Each swapped-out process is assigned a priority
based on
• The length of time it has been swapped out
• Its nice value
• The amount of time it was asleep since it last ran
In general, the process that has been swapped out longest or was swapped out
because it was not runnable will be brought in first. Once a process is selected,
the swapper checks to see whether there is enough memory free to swap in the
process. Historically, the 4.3BSD system required as much memory to be
available as was occupied by the process before that process was swapped. Under
4.4BSD, this requirement was reduced to a requirement that only enough memory
be available to hold the swapped-process user structure and kernel stack. If there
is enough memory available, the process is brought back into memory. The user
area is swapped in immediately, but the process loads the rest of its working set by
demand paging from the swap device. Thus, not all the memory that is committed
to the process is used immediately.
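One way to combine the three factors into a single swap-in priority is sketched below. The structure, field names, and the weight of 8 applied to the nice value are assumptions made for illustration; the text states only which factors contribute, not how they are weighted.

```c
/* Hypothetical record for a swapped-out process that is ready to run. */
struct swproc {
    int swtime;   /* seconds since the process was swapped out */
    int slptime;  /* seconds it was asleep since it last ran */
    int nice;     /* nice value; larger means lower priority */
};

/* Larger result means a better swap-in candidate. The weight is assumed. */
int
swapin_priority(const struct swproc *p)
{
    return p->swtime + p->slptime - p->nice * 8;
}

/* Return the index of the best candidate, or -1 if there are none. */
int
pick_swapin(const struct swproc *p, int n)
{
    int best = -1;

    for (int i = 0; i < n; i++)
        if (best < 0 || swapin_priority(&p[i]) > swapin_priority(&p[best]))
            best = i;
    return best;
}
```

With this shape, a process that has waited longest on the swap device, or that slept before being swapped, rises to the front, while a large nice value pushes a process back.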
The procedure for swap in of a process is the reverse of that for swapout:
1. Memory is allocated for the user structure and kernel stack, and they are read
back from swap space.
2. The process is marked as resident and is returned to the run queue if it is
runnable (i.e., is not stopped or sleeping).
After the swapin completes, the process is ready to run like any other, except
that it has no resident pages. It will bring in the pages that it needs by faulting
them in.
5.13 Portability
Everything discussed in this chapter up to this section has been part of the
machine-independent data structures and algorithms. These parts of the
virtual-memory system require little change when 4.4BSD is ported to a new architecture.
This section will describe the machine-dependent parts of the virtual-memory
system: the parts that must be written as part of a port
of 4.4BSD to a new architecture. The machine-dependent parts of the
virtual-memory system control the hardware memory-management unit (MMU). The
MMU implements address translation and access control when virtual memory is
mapped onto physical memory.
One common MMU design uses memory-resident forward-mapped page
tables. These page tables are large contiguous arrays indexed by the virtual
address. There is one element, or page-table entry, in the array for each virtual
page in the address space. This element contains the physical page to which the
virtual page is mapped, as well as access permissions, status bits telling whether
the page has been referenced or modified, and a bit indicating whether the entry
contains valid information. For a 4-Gbyte address space with 4-Kbyte virtual
pages and a 32-bit page-table entry, 1 million entries, or 4 Mbyte, would be
needed to describe an entire address space. Since most processes use little of their
address space, most of the entries would be invalid, and allocating 4 Mbyte of
physical memory per process would be wasteful. Thus, most page-table structures
are hierarchical, using two or more levels of mapping. With a hierarchical
structure, different portions of the virtual address are used to index the various levels of
the page tables. The intermediate levels of the table contain the addresses of the
next lower level of the page table. The kernel can mark as unused large
contiguous regions of an address space by inserting invalid entries at the higher levels of
the page table, eliminating the need for invalid page descriptors for each
individual unused virtual page.
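The arithmetic behind the 4-Mbyte figure, and the saving from a hierarchical table, can be checked with a few lines of C. The function names are hypothetical, and the two-level estimate assumes that the top level fits in a single table page.

```c
#include <stdint.h>

/* Number of page-table entries needed to map an address space. */
uint64_t
pte_count(uint64_t space_bytes, uint64_t page_bytes)
{
    return space_bytes / page_bytes;
}

/* Size in bytes of a flat (single-level) page table. */
uint64_t
flat_table_bytes(uint64_t space_bytes, uint64_t page_bytes, uint64_t pte_bytes)
{
    return pte_count(space_bytes, page_bytes) * pte_bytes;
}

/*
 * Size in bytes of a two-level table when only used_table_pages
 * second-level pages are populated: one top-level page plus the
 * populated second-level pages.
 */
uint64_t
two_level_bytes(uint64_t table_page_bytes, uint64_t used_table_pages)
{
    return (1 + used_table_pages) * table_page_bytes;
}
```

For a 4-Gbyte space with 4-Kbyte pages and 4-byte entries, the flat table needs 1,048,576 entries, or 4 Mbyte. A process that touches only three second-level table pages (for example, one each for text, data, and stack) needs 16 Kbyte in the two-level form.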
This hierarchical page-table structure requires the hardware to make frequent
memory references to translate a virtual address. To speed the translation process,
most page-table-based MMUs also have a small, fast, fully associative hardware
cache of recent address translations, a structure known commonly as a translation
lookaside buffer (TLB). When a memory reference is translated, the TLB is first
consulted and, only if a valid entry is not found there, the page-table structure for
the current process is traversed. Because most programs exhibit spatial locality in
their memory-access patterns, the TLB does not need to be large; many are as
small as 64 entries.
As address spaces grew beyond 32 to 48 and, more recently, 64 bits, simple
indexed data structures became unwieldy, with three or more levels of tables
required to handle address translation. A response to this page-table growth is the
inverted page table, also known as the reverse-mapped page table. In an inverted
page table, the hardware still maintains a memory-resident table, but that table
contains one entry per physical page and is indexed by physical address, instead of
by virtual address. An entry contains the virtual address to which the physical
page is currently mapped, as well as protection and status attributes. The
hardware does virtual-to-physical address translation by computing a hash function on
the virtual address to select an entry in the table. The system handles collisions by
linking together table entries and making a linear search of this chain until it finds
the matching virtual address.
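The hashed lookup with collision chains can be sketched as follows. This is a minimal model, not any particular machine's format: the table size, the modulo hash, the separate hash-anchor array, and all names are assumptions, and a single address space is assumed so that no process identifier is stored.

```c
#include <stdint.h>

#define NFRAMES    8     /* physical pages; tiny for illustration */
#define PAGE_SHIFT 12    /* 4-Kbyte pages */
#define NOFRAME    (-1)

struct ipte {
    uint32_t vpn;    /* virtual page number mapped to this frame */
    int      valid;
    int      next;   /* next frame on the same hash chain, or NOFRAME */
};

static struct ipte ipt[NFRAMES];    /* one entry per physical page */
static int hash_head[NFRAMES];      /* first frame on each hash chain */

static unsigned
hash_vpn(uint32_t vpn)
{
    return vpn % NFRAMES;           /* placeholder hash function */
}

void
ipt_init(void)
{
    for (int i = 0; i < NFRAMES; i++) {
        hash_head[i] = NOFRAME;
        ipt[i].valid = 0;
        ipt[i].next = NOFRAME;
    }
}

/* Record that the page containing va is held in the given frame. */
void
ipt_map(uint32_t va, int frame)
{
    uint32_t vpn = va >> PAGE_SHIFT;
    unsigned h = hash_vpn(vpn);

    ipt[frame].vpn = vpn;
    ipt[frame].valid = 1;
    ipt[frame].next = hash_head[h];
    hash_head[h] = frame;
}

/* Translate va; returns 1 and stores the result, or 0 on a miss. */
int
ipt_translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_SHIFT;

    for (int f = hash_head[hash_vpn(vpn)]; f != NOFRAME; f = ipt[f].next) {
        if (ipt[f].valid && ipt[f].vpn == vpn) {
            *pa = ((uint32_t)f << PAGE_SHIFT) |
                (va & ((1u << PAGE_SHIFT) - 1));
            return 1;
        }
    }
    return 0;
}
```

Because the index of a matching entry is itself the physical page number, the table's size depends only on the amount of physical memory, which is the property the inverted design is meant to provide.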
The advantages of an inverted page table are that the size of the table is
proportional to the amount of physical memory and that only one global table is
needed, rather than one table per process. A disadvantage to this approach is that
there can be only one virtual address mapped to any given physical page at any
one time. This limitation makes virtual-address aliasing (having multiple virtual
addresses for the same physical page) difficult to handle. As it is with the
forward-mapped page table, a hardware TLB is typically used to speed the translation
process.
A final common MMU organization consists of just a TLB. This architecture
is the simplest hardware design. It gives the software maximum flexibility by
allowing the latter to manage translation information in whatever structure it
desires.
The machine-dependent part of the virtual-memory system also may need to
interact with the memory cache. Because the speed of CPUs has increased far
more rapidly than the speed of main memory, most machines today require the use
of a memory cache to allow the CPU to operate near its full potential. There are
several cache-design choices that require cooperation with the virtual-memory
system.
The design option with the biggest effect is whether the cache uses virtual or
physical addressing. A physically addressed cache takes the address from the
CPU, runs it through the MMU to get the address of the physical page, then uses
this physical address to find out whether the requested memory location is
available in the cache. Although the TLB significantly reduces the average latency of
the translation, there is still a delay in going through the MMU. A virtually
addressed cache uses the virtual address as that address comes from the CPU to
find out whether the requested memory location is available in the cache. The
virtual-address cache is faster than the physical-address cache because it avoids the
time to run the address through the MMU. However, the virtual-address cache
must be flushed completely after each context switch, because virtual addresses
from one process are indistinguishable from the virtual addresses of another
process. By contrast, a physical-address cache needs to flush only a few individual
entries when their associated physical page is reassigned. In a system with many
short-running processes, a virtual-address cache gets flushed so frequently that it
is seldom useful.
A further refinement to the virtual-address cache is to add a process tag to
each cache entry. At each context switch, the kernel loads a hardware context
register with the tag assigned to the process. Each time an entry is recorded in the
cache, both the virtual address and the process tag that faulted it are recorded. The
cache looks up the virtual address as before, but, when it finds an entry, it
compares the tag associated with that entry to the hardware context register. If they
match, the cached value is returned. If they do not match, the correct value and
current process tag replace the old cached value. When this technique is used, the
cache does not need to be flushed completely at each context switch, since
multiple processes can have entries in the cache. The drawback is that the kernel must
manage the process tags. Usually, there are fewer tags (eight to 16) than there are
processes. The kernel must assign the tags to the active set of processes. When an
old process drops out of the active set to allow a new one to enter, the kernel must
flush the cache entries associated with the tag that it is about to reuse.
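The tagged lookup can be modeled in software as shown below. Everything here is a hypothetical simulation: the direct-mapped organization, the line count, and the memory_read() stand-in for main memory are assumptions chosen to keep the example small.

```c
#include <stdint.h>

#define NLINES 16

struct vline {
    int      valid;
    uint32_t vaddr;   /* virtual address of the cached word */
    uint8_t  tag;     /* process tag recorded when the entry was loaded */
    uint32_t data;
};

static struct vline cache[NLINES];
static uint8_t context_reg;     /* loaded by the kernel at context switch */

/* Stand-in for main memory as seen by the process that owns the tag. */
static uint32_t
memory_read(uint8_t tag, uint32_t va)
{
    return ((uint32_t)tag << 24) ^ va;
}

/* A context switch loads the tag; the cache is not flushed. */
void
context_switch(uint8_t tag)
{
    context_reg = tag;
}

/* Read a word; *hit reports whether the cache supplied it. */
uint32_t
cache_read(uint32_t va, int *hit)
{
    struct vline *l = &cache[(va >> 2) % NLINES];

    if (l->valid && l->vaddr == va && l->tag == context_reg) {
        *hit = 1;
        return l->data;
    }
    /* Miss or tag mismatch: the correct value and current tag replace it. */
    *hit = 0;
    l->valid = 1;
    l->vaddr = va;
    l->tag = context_reg;
    l->data = memory_read(context_reg, va);
    return l->data;
}

/* Invalidate every line owned by a tag that is about to be reused. */
void
flush_tag(uint8_t tag)
{
    for (int i = 0; i < NLINES; i++)
        if (cache[i].valid && cache[i].tag == tag)
            cache[i].valid = 0;
}
```

The only flush left is flush_tag(), which the kernel would call when it recycles a tag for a process entering the active set.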
A final consideration is a write-through versus a write-back cache. A
write-through cache writes the data back to main memory at the same time as it is
writing to the cache, forcing the CPU to wait for the memory access to conclude. A
write-back cache writes the data to only the cache, delaying the memory write
until an explicit request or until the cache entry is reused. The write-back cache
allows the CPU to resume execution more quickly and permits multiple writes to
the same cache block to be consolidated into a single memory write.
Often, a port to another architecture with a similar memory-management
organization can be used as a starting point for a new port. Most models of the
HP300 line of workstations, built around the Motorola 68000 family of processors,
use the typical two-level page-table organization shown in Fig. 5.16 (on page 176).
An address space is broken into 4-Kbyte virtual pages, with each page identified by
a 32-bit entry in the page table. Each page-table entry contains the physical page
number assigned to the virtual page, the access permissions allowed, modify and
reference information, and a bit indicating that the entry contains valid
information. The 4 Mbyte of page-table entries are likewise divided into 4-Kbyte
page-table pages, each of which is described by a single 32-bit entry in the segment
table. Segment-table entries are nearly identical to page-table entries: They
contain access bits, modify and reference bits, a valid bit, and the physical page
number of the page-table page described. One 4-Kbyte page (1024 segment-table
entries) covers the maximum-sized 4-Gbyte address space. A hardware register
contains the physical address of the segment table for the currently active process.
In Fig. 5.16, translation of a virtual address to a physical address during a
CPU access proceeds as follows:
1. The 10 most significant bits of the virtual address are used to index into the
active segment table.
2. If the selected segment-table entry is valid and the access permissions grant the
access being made, the next 10 bits of the virtual address are used to index into
the page-table page referenced by the segment-table entry.
3. If the selected page-table entry is valid and the access permissions match, the
final 12 bits of the virtual address are combined with the physical page
referenced by the page-table entry to form the physical address of the access.
Figure 5.16 Two-level page-table organization (a hardware register points to the segment table, whose entries point to page tables). Key: V, page-valid bit; M, page-modified bit; R, page-referenced bit; ACC, page-access permissions.
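The three translation steps can be traced in a small software model. The bit positions chosen for the valid and write-permission flags, the simulated physical memory, and the function name are assumptions for illustration; they are not the actual 68000-family entry layout. The model does no bounds checking on the simulated memory.

```c
#include <stdint.h>

#define PTE_V    0x1u          /* valid bit (position assumed) */
#define PTE_W    0x2u          /* write-permitted bit (position assumed) */
#define PG_FRAME 0xFFFFF000u   /* physical page address field */

/* Simulated physical memory as 32-bit words: four 4-Kbyte pages. */
static uint32_t physmem[4 * 1024];

/*
 * Translate va using the segment table at physical address segtab_pa.
 * Returns 1 and stores the physical address, or 0 to signal a fault.
 */
int
translate(uint32_t segtab_pa, uint32_t va, int write, uint32_t *pa)
{
    /* Step 1: the 10 most significant bits index the segment table. */
    uint32_t ste = physmem[segtab_pa / 4 + (va >> 22)];
    if (!(ste & PTE_V) || (write && !(ste & PTE_W)))
        return 0;

    /* Step 2: the next 10 bits index the page-table page. */
    uint32_t pte = physmem[(ste & PG_FRAME) / 4 + ((va >> 12) & 0x3FF)];
    if (!(pte & PTE_V) || (write && !(pte & PTE_W)))
        return 0;

    /* Step 3: the final 12 bits are the offset within the page. */
    *pa = (pte & PG_FRAME) | (va & 0xFFF);
    return 1;
}
```

With the segment table in page 0 and one page-table page in page 1, a read of virtual address 0x00402034 selects segment entry 1 and page entry 2; if that page entry names physical page 0x3000, the result is 0x3034.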
The Role of the pmap Module
The machine-dependent code describes how the physical mapping is done between
the user-processes and kernel virtual addresses and the physical addresses of the
main memory. This mapping function includes management of access rights in
addition to address translation. In 4.4BSD, the physical mapping (pmap) module
manages machine-dependent translation and access tables that are used either
directly or indirectly by the memory-management hardware. For example, on the
HP300, the pmap maintains the memory-resident segment and page tables for each
process, as well as for the kernel. The machine-dependent state required to
describe the translation and access rights of a single page is often referred to as a
mapping or mapping structure.
The 4.4BSD pmap interface is nearly identical to that in Mach 3.0; it shares
many design characteristics. The pmap module is intended to be logically
independent of the higher levels of the virtual-memory system. The interface
deals strictly in machine-independent page-aligned virtual and physical addresses
and in machine-independent protections. The machine-independent page size may
be a multiple of the architecture-supported page size. Thus, pmap operations must
be able to affect more than one physical page per logical page. The
machine-independent protection is a simple encoding of read, write, and execute permission
bits. The pmap must map all possible combinations into valid
architecture-specific values.
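Mapping every combination of the three permission bits onto what the hardware supports is commonly done with an eight-entry table. The sketch below assumes a hypothetical MMU that can express only no access, read-only, and read-write; the hw_prot names are invented for the example, and the bit values follow the usual read, write, execute encoding.

```c
/* Machine-independent protection bits. */
#define VM_PROT_READ    0x1
#define VM_PROT_WRITE   0x2
#define VM_PROT_EXECUTE 0x4

/* Protections of a hypothetical MMU with no separate execute control. */
enum hw_prot { HW_NONE, HW_RO, HW_RW };

/* One entry for each of the eight read/write/execute combinations. */
static const enum hw_prot prot_table[8] = {
    HW_NONE,  /* ---                                     */
    HW_RO,    /* r--                                     */
    HW_RW,    /* -w-  hardware cannot grant write alone  */
    HW_RW,    /* rw-                                     */
    HW_RO,    /* --x  execute is treated as read         */
    HW_RO,    /* r-x                                     */
    HW_RW,    /* -wx                                     */
    HW_RW,    /* rwx                                     */
};

enum hw_prot
hw_protection(int prot)
{
    return prot_table[prot & 7];
}
```

A table makes it explicit that several machine-independent combinations collapse onto one hardware value, which is acceptable as long as no combination is left unmapped.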
A process's pmap is considered to be a cache of mapping information kept in
a machine-dependent format. As such, it does not need to contain complete state
for all valid mappings. Mapping state is the responsibility of the
machine-independent layer. With one exception, the pmap module may throw away mapping
state at its discretion to reclaim resources. The exception is wired mappings,
which should never cause a fault that reaches the machine-independent vm_fault()
routine. Thus, state for wired mappings must be retained in the pmap until it is
removed explicitly.
In theory, the pmap module may also delay most interface operations, such as
removing mappings or changing their protection attributes. It can then do many of
them batched together, before doing expensive operations such as flushing the
TLB. In practice, however, this delayed operation has never been used, and it is
unclear whether it works completely. This feature was dropped from later releases
of the Mach 3.0 pmap interface.
In general, pmap routines may act either on a set of mappings defined by a
virtual address range or on all mappings for a particular physical address. Being able
to act on individual or all virtual mappings for a physical page requires that the
mapping information maintained by the pmap module be indexed by both virtual
and physical address. For architectures such as the HP300 that support
memory-resident page tables, the virtual-to-physical, or forward, lookup may be a simple
emulation of the hardware page-table traversal. Physical-to-virtual, or reverse, lookup
requires an inverted page table: an array with one entry per physical page indexed
by the physical page number. Entries in this table may be either a single mapping
structure, if only one virtual translation is allowed per physical page, or a pointer to
a list of mapping structures, if virtual-address aliasing is allowed. The kernel
typically handles forward lookups in a system without page tables by using a hash table
to map virtual addresses into mapping structures in the inverted page table.
There are two strategies that can be used for management of pmap memory
resources, such as user-segment or page-table memory. The traditional and easiest
approach is for the pmap module to manage its own memory. Under this strategy,
the pmap module can grab a fixed amount of wired physical memory at system
boot time, map that memory into the kernel's address space, and allocate pieces of
the memory as needed for its own data structures. The primary benefit is that this
approach isolates the pmap module's memory needs from those of the rest of the
system and limits the pmap module's dependencies on other parts of the system.
This design is consistent with a layered model of the virtual-memory system in
which the pmap is the lowest, and hence self-sufficient, layer.
The disadvantage is that this approach requires the duplication of many of the
memory-management functions. The pmap module has its own memory allocator
and deallocator for its private heap, a heap that is statically sized and cannot be
adjusted for varying systemwide memory demands. For an architecture with
memory-resident page tables, it must keep track of noncontiguous chunks of
processes' page tables, because a process may populate its address space sparsely.
Handling this requirement entails duplicating much of the standard
list-management code, such as that used by the vm_map code.
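The self-managed strategy, and its static-size limitation, can be illustrated with a minimal fixed-pool allocator. The pool size, the 16-byte rounding, and the name pmap_pool_alloc() are hypothetical; a real pmap module would also need a deallocator, which is part of the duplication noted above. The alignment specifier requires C11.

```c
#include <stddef.h>

#define POOL_BYTES 4096     /* fixed when the system boots; illustrative */

/* Private heap, sized once and never grown. */
static _Alignas(16) unsigned char pool[POOL_BYTES];
static size_t pool_used;

/*
 * Hand out size bytes, rounded up to a multiple of 16. Returns NULL
 * once the fixed pool is exhausted, regardless of how much memory is
 * free elsewhere in the system.
 */
void *
pmap_pool_alloc(size_t size)
{
    if (size > POOL_BYTES)
        return NULL;
    size = (size + 15) & ~(size_t)15;
    if (size > POOL_BYTES - pool_used)
        return NULL;
    void *p = &pool[pool_used];
    pool_used += size;
    return p;
}
```

The failure path shows the cost of isolation: the allocator cannot borrow from the systemwide free-memory pool when its own reserve runs out.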
An alternative approach, used by the HP300, is to use the higher-level
virtual-memory code recursively to manage some pmap resources. Here, the page table
for a user process appears as a virtually contiguous 4-Mbyte array of page-table
entries in the kernel's address space. Using higher-level allocation routines, such
as kmem_alloc_wait(), ensures that physical memory is allocated only when
needed and from the systemwide free-memory pool. Page tables and other pmap
resources also can be allocated from pageable kernel memory. This approach
easily and efficiently supports large sparse address spaces, including the kernel's own
address space.
The primary drawback is that this approach violates the independent nature of
the interface. In particular, the recursive structure leads to deadlock problems
with global multiprocessor spin locks that can be held while the kernel is calling a
pmap routine. Another problem for page-table allocation is that page tables are
typically hierarchically arranged; they are not flat, as this technique represents
them. With a two-level organization present on some HP300 machines, the pmap
module must be aware that a new page has been allocated within the 4-Mbyte
range, so that the page's physical address can be inserted into the segment table.
Thus, the advantage of transparent allocation of physical memory is partially lost.
Although the problem is not severe in the two-level case, the technique becomes
unwieldy for three or more levels.
The pmap data structures are contained in the machine-dependent include
directory in the file pmap.h. Most of the code for these routines is in the
machine-dependent source directory in the file pmap.c. The main tasks of the
pmap module are these:
• System initialization and startup (pmap_bootstrap_alloc(), pmap_bootstrap(),
pmap_init())
• Allocation and deallocation of mappings of physical to virtual pages
(pmap_enter(), pmap_remove())
• Change of access protections and other attributes of mappings
(pmap_change_wiring(), pmap_page_protect(), pmap_protect())
• Maintenance of physical page-usage information (pmap_clear_modify(),
pmap_clear_reference(), pmap_is_modified(), pmap_is_referenced())
• Initialization of physical pages (pmap_copy_page(), pmap_zero_page())
• Management of internal data structures (pmap_create(), pmap_reference(),
pmap_destroy(), pmap_pinit(), pmap_release(), pmap_copy(),
pmap_pageable(), pmap_collect(), pmap_update())
Each of these tasks will be described in the following subsections.
Initialization and Startup
The first step in starting up the system is for the loader to bring the kernel image
from a disk or the network into the physical memory of the machine. The kernel
load image looks much like that of any other process; it contains a text segment,
an initialized data segment, and an uninitialized data segment. The loader places
the kernel contiguously into the beginning of physical memory. Unlike a user
process that is demand paged into memory, the text and data for the kernel are read
into memory in their entirety. Following these two segments, the loader zeros an
area of memory equal to the size of the kernel's uninitialized memory segment.
After loading the kernel, the loader passes control to the starting address given in
the kernel executable image. When the kernel begins executing, it is executing
with the MMU turned off. Consequently, all addressing is done using the direct
physical addresses.
The first task undertaken by the kernel is to set up the kernel pmap, and any
other data structures that are necessary to describe the kernel's virtual address
space and to make it possible to enable the MMU. This task is done in
pmap_bootstrap(). On the HP300, bootstrap tasks include allocating and
initializing the segment and page tables that map the statically loaded kernel image and
memory-mapped I/O address space, allocating a fixed amount of memory for
kernel page-table pages, allocating and initializing the user structure and kernel stack
for the initial process, allocating the empty segment table initially shared by all
processes, reserving special areas of the kernel's address space, and initializing
assorted critical pmap-internal data structures. After this call, the MMU is
enabled, and the kernel begins running in the context of process zero.
Once the kernel is running in its virtual address space, it proceeds to initialize
the rest of the system. This initialization starts with a call to set up the
machine-independent portion of the virtual-memory system and concludes with a
call to pmap_init(). Any subsystem that requires dynamic memory allocation
between enabling of the MMU and the call to pmap_init() must use
pmap_bootstrap_alloc(). Memory allocated by this routine will not be managed
by the virtual-memory system and is effectively wired down. Pmap_init()
allocates all resources necessary to manage multiple user address spaces and
synchronizes the higher level kernel virtual-memory data structures with the kernel pmap.
On the HP300, it first marks as in use the areas of the kernel's vm_map that
were allocated during the bootstrap. These marks prevent future high-level
allocations from trying to use those areas. Next, it allocates a range of kernel virtual
address space, via a kernel submap, to use for user-process page tables. Pieces of
this address range are allocated when processes are created and are deallocated
when the processes exit. These areas are not populated with memory on
allocation. Page-table pages are allocated on demand when a process first accesses
memory that is mapped by an entry in that page of the page table. This allocation
is discussed later, in the mapping-allocation subsection. Page tables are allocated
from their own submap to limit the amount of kernel virtual address space that
they consume. At 4 Mbyte per process page table, 1024 active processes would
occupy the entire kernel address space. The available page-table address-space
limit is approximately one-half of the entire address space.
Pmap_init() allocates a fixed amount of wired memory to use for kernel
page-table pages. In theory, these pages could be allocated on demand from the general
free-memory pool, as user page-table pages are; in practice, however, this
approach leads to deadlocks, so a fixed pool of memory is used.
After determining the number of pages of physical memory remaining, the
startup code allocates the inverted page table, pv_table. This table is an array of
pv_entry structures. Each pv_entry describes a single address translation and
includes the virtual address, a pointer to the associated pmap structure for that
virtual address, a link for chaining together multiple entries mapping this physical
address, and additional information specific to entries mapping page-table pages.
Figure 5.17 shows the pv_entry references for a set of pages that have a single
mapping. The pv_table contains actual instances of pv_entry structures, rather
than pointers; this strategy optimizes the common case where physical pages have
only one mapping. The purpose of the pv_entry structures is to identify the
address space that has the page mapped. Rather than having a pointer from the
vm_page structure to its corresponding pv_entry, the relationship is based on the
array index of the two entries. In Fig. 5.17, the object is using pages 5, 18, and
79; thus, the corresponding pv_entry structures 5, 18, and 79 point to the physical
map for the address space that has page tables referencing those pages.
Each pv_entry can reference only one physical map. When an object
becomes shared between two or more processes, each physical page of memory
becomes mapped into two or more sets of page tables. To track these multiple
references, the pmap module must create chains of pv_entry structures, as shown in
Fig. 5.18. These additional structures are allocated dynamically and are linked
from a list headed by the pv_entry that was allocated in the initial table. For
example, implementation of copy-on-write requires that the page tables be set to
read-only in all the processes sharing the object. The pmap module can
implement this request by walking the list of pages associated with the object to be
made copy-on-write. For each page, it finds that page's corresponding pv_entry
structure. It then makes the appropriate change to the page table associated with
that pv_entry structure. If that pv_entry structure has any additional pv_entry
structures linked off it, the pmap module traverses them, making the same
modification to their referenced page-table entry.
Figure 5.17 Physical pages with a single mapping. (Figure labels: vm_map_entry with start addr, end addr, and obj offset; vnode/object.)
Figure 5.18 Physical pages with multiple mappings. (Figure labels: two vm_map_entry structures, each with start addr, end addr, and obj offset.)
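The chain traversal used for write-protecting a shared page can be sketched as follows. The structure is a simplified, hypothetical version of a pv_entry: the pv_pte shortcut and the PG_RO bit position are assumptions made so that the example is self-contained, whereas a real pmap would locate the page-table entry from the pmap and virtual address.

```c
#include <stddef.h>
#include <stdint.h>

#define PG_RO 0x4u   /* read-only bit in a page-table entry (assumed) */

struct pmap;         /* opaque here: identifies one address space */

struct pv_entry {
    struct pv_entry *pv_next;  /* further mappings of this physical page */
    struct pmap     *pv_pmap;  /* owning address space; NULL if unmapped */
    uint32_t         pv_va;    /* virtual address of the mapping */
    uint32_t        *pv_pte;   /* hypothetical shortcut to the entry */
};

/*
 * Write-protect every mapping of physical page number pfn and return
 * the number of page-table entries changed. The head of each chain is
 * the array element itself, found by index rather than by pointer.
 */
int
pv_write_protect(struct pv_entry *pv_table, size_t pfn)
{
    int changed = 0;

    for (struct pv_entry *pv = &pv_table[pfn]; pv != NULL; pv = pv->pv_next) {
        if (pv->pv_pmap == NULL)
            continue;           /* head entry with no mapping */
        *pv->pv_pte |= PG_RO;
        changed++;
    }
    return changed;
}
```

Because the first entry of each chain lives in the array, a page with a single mapping needs no dynamic allocation and no pointer chasing, which is the common case the design optimizes.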
Finally, a page-attribute array is allocated with 1 byte per physical page. This
array contains reference and dirty information and is described later in the
subsection on the management of page usage information. The first and last physical
addresses of the area covered by both the pv_entry and attribute arrays are
recorded, and are used by other routines for bounds checking. This area is
referred to as the pmap-managed memory.
Mapping Allocation and Deallocation
The primary responsibility of the pmap module is validating (allocating) and
invalidating (deallocating) mappings of physical pages to virtual addresses. The
physical pages represent cached portions of an object that is providing data from a
file or an anonymous memory region. A physical page is bound to a virtual
address because that object is being mapped into a process's address space either
explicitly by mmap or implicitly by fork or exec. Physical-to-virtual address
mappings are not created at the time that the object is mapped; rather, their
creation is delayed until the first reference to a particular page is made. At that point,
an access fault will occur, and pmap_enter() will be called. Pmap_enter() is
responsible for any required side effects associated with creation of a new
mapping. Such side effects are largely the result of entering a second translation for an
already mapped physical page (for example, as the result of a copy-on-write
operation). Typically, this operation requires flushing uniprocessor or
multiprocessor TLB or cache entries to maintain consistency.
In addition to its use to create new mappings, pmap_enter() may also be
called to modify the wiring or protection attributes of an existing mapping or to
rebind an existing mapping for a virtual address to a new physical address. The
kernel can handle changing attributes by calling the appropriate interface routine,
described in the next subsection. Changing the target physical address of a
mapping is simply a matter of first removing the old mapping and then handling it like
any other new mapping request.
Pmap_enter() is the only routine that cannot lose state or delay its action.
When called, it must create a mapping as requested, and it must validate that
mapping before returning to the caller. On the HP300, pmap_enter() takes the
following actions:
1. If no page table exists for the process, a 4-Mbyte range is allocated in the
kernel's address space to map the process's address space.
2. If the process has no segment table of its own (i.e., it still references the initial
shared segment table), a private one is allocated.
3. If a physical page has not yet been allocated to the process page table at the
location required for the new mapping, that is done now. Kernel page-table
pages are acquired from the reserved pool allocated at bootstrap time. For
user processes, the kernel does the allocation by simulating a fault on the
appropriate location in the 4-Mbyte page-table range. This fault forces
allocation of a zero-filled page and makes a recursive call to pmap_enter() to enter
the mapping of that page in the kernel's pmap. For either kernel or user
page-table pages, the kernel mapping for the new page is flagged as being a
page-table page, and the physical address of the page is recorded in the segment
table. Recording this address is more complicated on the 68040, which has the
top two levels of the page-table hierarchy squeezed into the single
segment-table page.
After ensuring that all page-table resources exist for the mapping being
entered, pmap_enter() validates or modifies the requested mapping as follows:
1. Check to see whether a mapping structure already exists for this
virtual-to-physical address translation. If one does, the call must be one to change the
protection or wiring attributes of the mapping; it is handled as described in the
next subsection.
2. Otherwise, if a mapping exists for this virtual address but it references a
different physical address, that mapping is removed.
3. If the indicated mapping is for a user process, the kernel page-table page
containing that page-table entry is marked as nonpageable. Making this marking
is an obscure way of keeping page-table pages wired as long as they contain
any valid mappings. The vm_map_pageable() routine keeps a wired count for
every virtual page, wiring the page when the count is incremented from zero
and unwiring the page when the count is decremented to zero. The wiring and
unwiring calls trigger a call to pmap_pageable(), whose function is described
in the last subsection on the management of internal data structures. Wiring a
page-table page avoids having it involuntarily paged out, effectively
invalidating all pages that it currently maps. A beneficial side effect is that, when a
page-table page is finally unwired, it contains no useful information and does
not need to be paged out. Hence, no backing store is required for page-table
pages.
4. If the physical address is outside the range managed by the pmap module (e.g.,
a frame-buffer page), no pv_table entry is allocated; only a page-table entry is
created. Otherwise, for the common case of a new mapping for a managed
physical page, a pv_table entry is created.
5. For HP300 machines with a virtually-indexed cache, a check is made to see
whether this physical page already has other mappings. If it does, all
mappings may need to be marked cache inhibited, to avoid cache inconsistencies.
6. A page-table entry is created and validated, with cache and TLB entries flushed
as necessary.
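The wired-count bookkeeping described in step 3 can be sketched in a few lines of C. This is a minimal illustration and not the 4.4BSD source: the ptpage structure and the ptpage_wire() and ptpage_unwire() helpers are hypothetical names, and the real wire and unwire actions are modeled as a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one page-table page (illustrative only). */
struct ptpage {
    int  wire_count;   /* number of valid user mappings held in this page */
    bool wired;        /* true while the page must not be paged out */
};

/* Called when a valid mapping is entered into the page-table page. */
static void ptpage_wire(struct ptpage *pt)
{
    if (pt->wire_count++ == 0)
        pt->wired = true;      /* count went from zero: wire the page */
}

/* Called when a mapping is removed from the page-table page. */
static void ptpage_unwire(struct ptpage *pt)
{
    assert(pt->wire_count > 0);
    if (--pt->wire_count == 0)
        pt->wired = false;     /* count reached zero: nothing useful remains */
}
```

Because the page is unwired only when its last valid mapping is gone, an unwired page-table page never holds information worth writing to backing store.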
When an object is unmapped from an address space, either explicitly by
munmap() or implicitly on process exit, the pmap module is invoked to invalidate
and remove the mappings for all physical pages caching data for the object.
Unlike pmap_enter(), pmap_remove() can be called with a virtual-address range
encompassing more than one mapping. Hence, the kernel does the unmapping by
looping over all virtual pages in the range, ignoring those for which there is no
mapping and removing those for which there is one. Also unlike pmap_enter(),
the implied action can be delayed until pmap_update(), described in the next
subsection, is called. This delay may enable the pmap to optimize the invalidation
process by aggregating individual operations.
Pmap_remove() on the HP300 is simple. It loops over the specified address
range, invalidating individual page mappings. Since pmap_remove() can be
called with large sparsely allocated regions, such as an entire process virtual
address range, it needs to skip invalid entries within the range efficiently. It skips
invalid entries by first checking the segment-table entry for a particular address
and, if an entry is invalid, skipping to the next 4-Mbyte boundary. This check also
prevents unnecessary allocation of a page-table page for the empty area. When all
page mappings have been invalidated, any necessary global cache flushing is done.
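The segment-skipping loop can be sketched as follows. This is a simplified model under stated assumptions, not the HP300 code: the toy_pmap structure, the 4-Kbyte page size, the PTE_VALID bit, and the eight-segment address space are all chosen for illustration. Only the 4-Mbyte span of a segment-table entry comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE    4096u                 /* assumed page size */
#define SEG_SIZE     (4u * 1024 * 1024)    /* one segment-table entry maps 4 Mbyte */
#define PTES_PER_SEG (SEG_SIZE / PAGE_SIZE)
#define NSEG         8u                    /* small address space for the sketch */
#define PTE_VALID    1u                    /* hypothetical valid bit */

/* Hypothetical pmap: a segment table whose entries point at page tables. */
struct toy_pmap {
    uint32_t *segtab[NSEG];                /* NULL means the segment is invalid */
};

/* Invalidate every valid mapping in [sva, eva); return how many were removed. */
static int toy_pmap_remove(struct toy_pmap *pm, uint32_t sva, uint32_t eva)
{
    int removed = 0;
    uint32_t va = sva;

    assert(eva <= NSEG * SEG_SIZE);
    while (va < eva) {
        uint32_t *ptab = pm->segtab[va / SEG_SIZE];
        if (ptab == NULL) {
            /* Invalid segment: skip to the next 4-Mbyte boundary. */
            va = (va / SEG_SIZE + 1) * SEG_SIZE;
            continue;
        }
        uint32_t *pte = &ptab[(va % SEG_SIZE) / PAGE_SIZE];
        if (*pte & PTE_VALID) {
            *pte = 0;                      /* invalidate this mapping */
            removed++;
        }
        va += PAGE_SIZE;
    }
    return removed;
}
```

With a sparse address space, the loop touches one segment-table entry per empty 4-Mbyte region rather than 1024 page-table entries, and it never needs a page-table page for a region that has none.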
To invalidate a single mapping, the kernel locates and marks as invalid the
appropriate page-table entry. The reference and modify bits for the page are saved
in the separate attribute array for future retrieval. If this mapping was a user
mapping, vm_map_pageable() is called to decrement the wired count on the
page-table page. When the count reaches zero, the page-table page can be reclaimed
because it contains no more valid mappings. If the physical address from the
mapping is outside the managed range, nothing more is done. Otherwise, the
pv_table entry is found and is deallocated. When a user page-table page is
removed from the kernel's address space (i.e., as a result of removal of the final
valid user mapping from that page), the process's segment table must be updated.
The kernel does this update by invalidating the appropriate segment-table entry.
Change of Access and Wiring Attributes for Mappings
An important role of the pmap module is to manipulate the hardware access
protections for pages. These manipulations may be applied to all mappings covered
by a virtual-address range within a pmap via pmap_protect(), or they may be
applied to all mappings of a particular physical page across pmaps via
pmap_page_protect(). There are two features common to both calls. First, either
form may be called with a protection value of VM_PROT_NONE to remove a range
of virtual addresses or to remove all mappings for a particular physical page.
Second, these routines should never add write permission to the affected mappings;
thus, calls including VM_PROT_WRITE should make no changes. This restriction
is necessary for the copy-on-write mechanism to function properly. Write
permission is added only via calls to pmap_enter().
Pmap_protect() is used primarily by the mprotect system call to change the
protection for a region of process address space. The strategy is similar to that of
pmap_remove(): Loop over all virtual pages in the range and apply the change to
all valid mappings that are found. Invalid mappings are left alone. As occurs with
pmap_remove(), the action may be delayed until pmap_update() is called.
For the HP300, pmap_protect() first checks for the special cases. If the
requested permission is VM_PROT_NONE, it calls pmap_remove() to handle the
revocation of all access permission. If VM_PROT_WRITE is included, it just
returns immediately. For a normal protection value, pmap_protect() loops over
the given address range, skipping invalid mappings. For valid mappings, the
page-table entry is looked up and, if the new protection value differs from the current
value, the entry is modified and any TLB and cache flushing done. As occurs with
pmap_remove(), any global cache actions are delayed until the entire range has
been modified.
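The order of these checks can be sketched as follows. This is an illustrative model only: the flat toy_ptes array, the PTE_VALID and PTE_RO bits, and the flush counter are hypothetical, and the VM_PROT values shown are the conventional Mach ones, assumed here rather than taken from the text.

```c
#include <assert.h>
#include <stdint.h>

#define VM_PROT_NONE   0x0    /* conventional Mach values, assumed here */
#define VM_PROT_READ   0x1
#define VM_PROT_WRITE  0x2

#define NPAGES      16
#define PTE_VALID   0x1u      /* hypothetical valid bit */
#define PTE_RO      0x2u      /* hypothetical read-only bit */

static uint32_t toy_ptes[NPAGES];  /* one flat page table for the sketch */
static int      toy_flushes;       /* counts simulated TLB and cache flushes */

static void toy_remove(int first, int last)
{
    for (int i = first; i < last; i++)
        toy_ptes[i] = 0;
}

static void toy_protect(int first, int last, int prot)
{
    if (prot == VM_PROT_NONE) {    /* revoke all access: same as a remove */
        toy_remove(first, last);
        return;
    }
    if (prot & VM_PROT_WRITE)      /* write permission is never added here */
        return;
    for (int i = first; i < last; i++) {
        if (!(toy_ptes[i] & PTE_VALID))
            continue;              /* invalid mappings are left alone */
        if (toy_ptes[i] & PTE_RO)
            continue;              /* protection unchanged: no flush needed */
        toy_ptes[i] |= PTE_RO;
        toy_flushes++;             /* flush only when the entry changed */
    }
}
```

Comparing the old and new protection before touching the entry is what keeps the expensive flushes proportional to the number of mappings that actually change.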
Pmap_page_protect() is used internally by the virtual-memory system for
two purposes. It is called to set read-only permission when a copy-on-write
operation is set up (e.g., during fork). It also removes all access permissions before
doing page replacement to force all references to a page to block pending the
completion of its operation. In Mach, this routine used to be two separate
routines, pmap_copy_on_write() and pmap_remove_all(), and many pmap
modules implement pmap_page_protect() as a call to one or the other of these
functions, depending on the protection argument.
In the HP300 implementation of pmap_page_protect(), a check is made to
ensure that this page is a managed physical page and that VM_PROT_WRITE was
not specified. If either of these conditions is not met, pmap_page_protect()
returns without doing anything. Otherwise, it locates the pv_table entry for the
specified physical page. If the request requires the removal of mappings,
pmap_page_protect() loops over all pv_entry structures that are chained together
for this page, invalidating the individual mappings as described in the previous
subsection. Note that TLB and cache flushing differ from those for
pmap_remove(), since they must invalidate entries from multiple process contexts,
rather than invalidating multiple entries from a single process context.
If pmap_page_protect() is called to make mappings read-only, then it loops
over all pv_entry structures for the physical address, modifying the appropriate
page-table entry for each. As occurs with pmap_protect(), the entry is checked to
ensure that it is changing before expensive TLB and cache flushes are done.
Pmap_change_wiring() is called to wire or unwire a single
machine-independent virtual page within a pmap. As described in the previous subsection, wiring
informs the pmap module that a mapping should not cause a hardware fault that
reaches the machine-independent vm_fault() code. Wiring is typically a software
attribute that has no effect on the hardware MMU state: It simply tells the pmap
not to throw away state about the mapping. As such, if a pmap module never
discards state, then it is not strictly necessary for the module even to track the wired
status of pages. The only side effect of not tracking wiring information in the
pmap is that the mlock system call cannot be completely implemented without a
wired page-count statistic.
The HP300 pmap implementation maintains wiring information. An
unused bit in the page-table-entry structure records a page's wired status.
Pmap_change_wiring() sets or clears this bit when it is invoked with a valid
virtual address. Since the wired bit is ignored by the hardware, there is no need to
modify the TLB or cache when the bit is changed.
Management of Page-Usage Information
The machine-independent page-management code needs to be able to get basic
information about the usage and modification of pages from the underlying
hardware. The pmap module facilitates the collection of this information without
requiring the machine-independent code to understand the details of the mapping
tables by providing a set of interfaces to query and clear the reference and modify
bits. The pageout daemon can call pmap_is_modified() to determine whether a
page is dirty. If the page is dirty, the pageout daemon can write it to backing store,
then call pmap_clear_modify() to clear the modify bit. Similarly, when the
pageout daemon pages out or inactivates a page, it uses pmap_clear_reference() to
clear the reference bit for the page. Later, when it considers moving the page
from the inactive list, it uses pmap_is_referenced() to check whether the page has
been used since the page was inactivated. If the page has been used, it is moved
back to the active list; otherwise, it is moved to the free list.
One important feature of the query routines is that they should return valid
information even if there are currently no mappings for the page in question.
Thus, referenced and modified information cannot just be gathered from the
hardware-maintained bits of the various page-table or TLB entries; rather, there
must be an auxiliary array where the information is retained when a mapping is
removed.
The HP300 implementation of these routines is simple. As mentioned in the
subsection on initialization and startup, a page-attribute array with one entry per
managed physical page is allocated at boot time. Initially zeroed, the entries are
updated whenever a mapping for a page is removed. The query routines return
FALSE if they are not passed a managed physical page. Otherwise, they test the
referenced or modified bit of the appropriate attribute-array entry and, if the bit is
set, return TRUE immediately. Since this attribute array contains only past
information, they still need to check status bits in the page-table entries for currently
valid mappings of the page. Thus, they loop over all pv_entry structures
associated with the physical page and examine the appropriate page-table entry for each.
They can return TRUE as soon as they encounter a set bit or FALSE if the bit is not
set in any page-table entry.
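The two-stage lookup done by the query routines can be sketched as follows. This is a minimal model, not the HP300 source: the toy_pv_entry layout, the PG_MOD bit, and the use of a small page index in place of a physical address are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NMANAGED   4        /* managed physical pages in the sketch */
#define PG_MOD     0x1u     /* hypothetical "modified" bit */

/* One entry per current mapping of a physical page, chained together. */
struct toy_pv_entry {
    uint32_t            *pte;   /* page-table entry for this mapping */
    struct toy_pv_entry *next;
};

static uint8_t              toy_attrs[NMANAGED];     /* bits saved from removed mappings */
static struct toy_pv_entry *toy_pv_table[NMANAGED];  /* current mappings of each page */

static bool toy_is_modified(int page)
{
    if (page < 0 || page >= NMANAGED)
        return false;                  /* not a managed physical page */
    if (toy_attrs[page] & PG_MOD)
        return true;                   /* recorded when a mapping was removed */
    for (struct toy_pv_entry *pv = toy_pv_table[page]; pv != NULL; pv = pv->next)
        if (*pv->pte & PG_MOD)
            return true;               /* a currently valid mapping has the bit set */
    return false;                      /* bit not set anywhere */
}
```

The attribute array is consulted first because it is a single array reference; the walk over the mapping chain is needed only when no removed mapping has already recorded the bit.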
The clear routines also return immediately if they are not passed a managed
physical page. Otherwise, the referenced or modified bit is cleared in the attribute
array, and they loop over all pv_entry structures associated with the physical page,
clearing the hardware-maintained page-table-entry bits. This final step may
involve TLB or cache flushes along the way or afterward.
Initialization of Physical Pages
Two interfaces are provided to allow the higher-level virtual-memory routines to
initialize physical memory. Pmap_zero_page() takes a physical address and fills
the page with zeros. Pmap_copy_page() takes two physical addresses and copies
the contents of the first page to the second page. Since both take physical
addresses, the pmap module will most likely have first to map those pages into the
kernel's address space before it can access them. Since mapping and unmapping
single pages dynamically may be expensive, an alternative is to have all physical
memory permanently mapped into the kernel's address space at boot time. With
this technique, addition of an offset to the physical address is all that is needed to
create a usable kernel virtual address.
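The permanently mapped alternative can be sketched as follows. This is an illustration only: an ordinary array stands in for physical memory, so that the base of the array plays the part of the fixed offset, and the toy_ names and the 4-Kbyte page size are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u     /* assumed page size */
#define TOY_NPAGES    4u

/* Stand-in for physical memory, "mapped" at a fixed place in the address space. */
static unsigned char toy_ram[TOY_NPAGES * TOY_PAGE_SIZE];

/* Kernel virtual address = fixed base + physical address. */
static unsigned char *toy_phys_to_virt(uint32_t pa)
{
    assert(pa < TOY_NPAGES * TOY_PAGE_SIZE);
    return toy_ram + pa;
}

static void toy_zero_page(uint32_t pa)
{
    memset(toy_phys_to_virt(pa), 0, TOY_PAGE_SIZE);
}

/* Copy the first page to the second; the two pages must be distinct. */
static void toy_copy_page(uint32_t src, uint32_t dst)
{
    memcpy(toy_phys_to_virt(dst), toy_phys_to_virt(src), TOY_PAGE_SIZE);
}
```

No mapping is created or torn down on either path, which is the saving that the permanent mapping buys at the cost of kernel address space.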
The HP300 implementation has a pair of global kernel virtual addresses
reserved for zeroing and copying pages, and thus is not as efficient as it could be.
Pmap_zero_page() calls pmap_enter() with the reserved virtual address and the
specified physical address, calls bzero() to clear the page, and then removes the
temporary mapping with the single translation-invalidation primitive used by
pmap_remove(). Similarly, pmap_copy_page() creates mappings for both
physical addresses, uses bcopy() to make the copy, and then removes both mappings.
Management of Internal Data Structures
The remaining pmap interface routines are used for management and
synchronization of internal data structures. Pmap_create() creates an instance of the
machine-dependent pmap structure. The value returned is the handle used for all
other pmap routines. Pmap_reference() increments the reference count for a
particular pmap. In theory this reference count allows a pmap to be shared by
multiple processes; in practice, only the kernel submaps that use the kernel's pmap
share references. Since kernel submaps as well as the kernel map are permanent,
there is currently no real need to maintain a reference count. Pmap_destroy()
decrements the reference count of the given pmap and deallocates the pmap's
resources when the count drops to zero.
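The reference-count discipline can be sketched as follows. This is a minimal illustration, not the kernel code: the toy_ names are hypothetical, the C library malloc() and free() stand in for the kernel allocator, and the sketch assumes that a newly created pmap starts with a count of one.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_pmap_rc {
    int ref_count;
};

static int toy_live_pmaps;   /* for illustration: pmaps currently allocated */

/* Create a pmap; the creator holds the first reference (an assumption). */
static struct toy_pmap_rc *toy_pmap_create(void)
{
    struct toy_pmap_rc *pm = malloc(sizeof(*pm));
    if (pm == NULL)
        return NULL;
    pm->ref_count = 1;
    toy_live_pmaps++;
    return pm;
}

/* Add a reference on behalf of another sharer. */
static void toy_pmap_reference(struct toy_pmap_rc *pm)
{
    pm->ref_count++;
}

/* Drop a reference; free the pmap when the last one is gone. */
static void toy_pmap_destroy(struct toy_pmap_rc *pm)
{
    assert(pm->ref_count > 0);
    if (--pm->ref_count == 0) {
        toy_live_pmaps--;
        free(pm);
    }
}
```

Each call to the destroy routine must be matched by an earlier create or reference call; the structure is released only on the final one.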
Because of an incomplete transition in the virtual-memory code, there is also
another set of routines to create and destroy pmaps effectively. Pmap_pinit()
initializes an already-existing pmap structure, and pmap_release() frees any
resources associated with a pmap without freeing the pmap structure itself. These
routines were added in support of the vmspace structure that encapsulates all
storage associated with a process's virtual-memory state.
On the HP300, the create and destroy routines use the kernel malloc() and
free() routines to manage space for the pmap structure, and then use pmap_pinit()
and pmap_release() to initialize and release the pmap. Pmap_pinit() sets the
process segment-table pointer to the common empty segment table. As noted earlier
in the subsection on mapping allocation and deallocation, page-table allocation is
delayed until the first access to the process's address space. Pmap_release()
simply frees the process segment and page tables.
Pmap_copy() and pmap_pageable() are optional interface routines that are
used to provide hints to the pmap module about the use of virtual-memory
regions. Pmap_copy() is called when a copy-on-write operation has been done.
Its parameters include the source and destination pmap, and the virtual address
and the length of the region copied. On the HP300, this routine does nothing.
Pmap_pageable() indicates that the specified address range has been either wired
or unwired. The HP300 pmap module uses this interface to detect when a
page-table page is empty and can be released. The current implementation does not
free the page-table page; it just clears the modified state of the page and allows the
page to be reclaimed by the pageout daemon as needed. Clearing the modify bit is
necessary to prevent the empty page from being wastefully written out to backing
store.
Pmap_update() is called to notify the pmap module that all delayed actions
for all pmaps should be done now. On the HP300, this routine does nothing.
Pmap_collect() is called to inform the pmap module that the given pmap is not
expected to be used for some time, allowing the pmap module to reclaim
resources that could be used more effectively elsewhere. Currently, it is called
whenever a process is about to be swapped out. The HP300 pmap module does not
use this information for user processes, but it does use the information to attempt
to reclaim unused kernel page-table pages when none are available on the free list.
Exercises
5.1 What does it mean for a machine to support virtual memory? What four
hardware facilities are typically required for a machine to support virtual
memory?
5.2 What is the relationship between paging and swapping on a demand-paged
virtual-memory system? Explain whether it is desirable to provide both
mechanisms in the same system. Can you suggest an alternative to
providing both mechanisms?
5.3 What three policies characterize paging systems? Which of these policies
usually has no effect on the performance of a paging system?
5.4 Describe a disadvantage of the scheme used for the management of swap
space that holds the dynamic per-process segments. Hint: Consider what
happens when a process on a heavily paging system expands in many small
increments.
5.5 What is copy-on-write? In most UNIX applications, the fork system call is
followed almost immediately by an exec system call. Why does this
behavior make it particularly attractive to use copy-on-write in implementing
fork?
5.6 Explain why the vfork system call will always be more efficient than a
clever implementation of the fork system call.
5.7 When a process exits, all its pages may not be placed immediately on the
memory free list. Explain this behavior.
5.8 What is clustering? Where is it used in the virtual-memory system?
5.9 What purpose does the pageout-daemon process serve in the
virtual-memory system? What facility is used by the pageout daemon that is not
available to a normal user process?
5.10 Why is the sticky bit no longer useful in 4.4BSD?
5.11 Give two reasons for swapping to be initiated.
*5.12 The 4.3BSD virtual-memory system had a text cache that retained the
identity of text pages from one execution of a program to the next. How does
the object cache in 4.4BSD improve on the performance of the 4.3BSD text
cache?
*5.13 Postulate a scenario under which the HP300 kernel would deadlock if it
were to allocate kernel page-table pages dynamically.
I/O System
I/O System Overview
I/O Mapping from User to Device
Computers store and retrieve data through supporting peripheral I/O devices.
These devices typically include mass-storage devices, such as moving-head disk
drives and magnetic-tape drives, and network interfaces. Storage devices such as
disks and tapes are accessed through I/O controllers that manage the operation of
their slave devices according to I/O requests from the CPU.
Many hardware device peculiarities are hidden from the user by high-level
kernel facilities, such as the filesystem and socket interfaces. Other such
peculiarities are hidden from the bulk of the kernel itself by the I/O system. The I/O
system consists of buffer-caching systems, general device-driver code, and drivers for
specific hardware devices that must finally address peculiarities of the specific
devices. The various I/O systems are summarized in Fig. 6.1.
There are four main kinds of I/O in 4.4BSD: the filesystem, the
character-device interface, the block-device interface, and the socket interface with its related
network devices. The character and block interfaces appear in the filesystem
name space. The character interface provides unstructured access to the
underlying hardware, whereas the block device provides structured access to the
underlying hardware. The network devices do not appear in the filesystem; they are
accessible through only the socket interface. Block and character devices are
described in Sections 6.2 and 6.3, respectively. The filesystem is described in
Chapters 7 and 8. Sockets are described in Chapter 11.
A block-device interface, as the name indicates, supports only block-oriented
I/O operations. The block-device interface uses the buffer cache to minimize the
number of I/O requests that require an I/O operation, and to synchronize with
filesystem operations on the same device. All I/O is done to or from I/O buffers
that reside in the kernel's address space. This approach requires at least one
memory-to-memory copy operation to satisfy a user request, but also allows 4.4BSD to
support I/O requests of nearly arbitrary size and alignment.
[Figure 6.1: layered diagram. Labels recovered from the figure, top to bottom: system-call interface to the kernel; active file entries; VNODE layer; local naming (UFS); special devices; buffer cache; block-device driver; character-device driver; the hardware.]
Figure 6.1 Kernel I/O structure.
A character-device interface comes in two styles that depend on the
characteristics of the underlying hardware device. For some character-oriented hardware
devices, such as terminal multiplexers, the interface is truly character oriented,
although higher-level software, such as the terminal driver, may provide a
line-oriented interface to applications. However, for block-oriented devices such as disks
and tapes, a character-device interface is an unstructured or raw interface. For this
interface, I/O operations do not go through the buffer cache; instead, they are
made directly between the device and buffers in the application's virtual address
space. Consequently, the size of the operations must be a multiple of the
underlying block size required by the device, and, on some machines, the application's I/O
buffer must be aligned on a suitable boundary.
Internal to the system, I/O devices are accessed through a fixed set of entry
points provided by each device's device driver. The set of entry points varies
according to whether the I/O device supports a block- or character-device
interface. For a block-device interface, a device driver is described by a bdevsw
structure, whereas for a character-device interface, it is described by a cdevsw structure.
All the bdevsw structures are collected in the block-device table, whereas cdevsw
structures are similarly organized in a character-device table.
Devices are identified by a device number that is constructed from a major
and a minor device number. The major device number uniquely identifies the type
of device (really of the device driver) and is the index of the device's entry in the
block- or character-device table. Devices that support both block- and
character-device interfaces have two major device numbers, one for each table. The minor
device number is interpreted solely by the device driver and is used by the driver
to identify to which, of potentially many, hardware devices an I/O request refers.
For magnetic tapes, for example, minor device numbers identify a specific
controller and tape transport. The minor device number may also specify a section of
a device (for example, a channel of a multiplexed device) or optional handling
parameters.
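The use of the major number as a table index can be sketched as follows. This is an illustration, not the kernel's definitions: the toy_ names are hypothetical, the switch entry is cut down to a single open routine, and the packing of the two numbers into 8-bit fields is an assumption made for the sketch.

```c
#include <assert.h>

typedef unsigned int toy_dev_t;

/* Assumed layout for the sketch: major number in bits 8-15, minor in bits 0-7. */
#define toy_major(d)        (((d) >> 8) & 0xffu)
#define toy_minor(d)        ((d) & 0xffu)
#define toy_makedev(ma, mi) ((toy_dev_t)((((ma) & 0xffu) << 8) | ((mi) & 0xffu)))

/* A cut-down device-switch entry holding only an open routine. */
struct toy_bdevsw {
    int (*d_open)(toy_dev_t dev);
};

/* Each driver interprets the minor number in its own way. */
static int toy_disk_open(toy_dev_t dev) { return 100 + (int)toy_minor(dev); }
static int toy_tape_open(toy_dev_t dev) { return 200 + (int)toy_minor(dev); }

/* The major number is the index of the driver's entry in the table. */
static struct toy_bdevsw toy_bdev_table[] = {
    { toy_disk_open },   /* major 0 */
    { toy_tape_open },   /* major 1 */
};

static int toy_bdev_open(toy_dev_t dev)
{
    unsigned int maj = toy_major(dev);

    if (maj >= sizeof(toy_bdev_table) / sizeof(toy_bdev_table[0]))
        return -1;                        /* no driver for this major number */
    return toy_bdev_table[maj].d_open(dev);
}
```

The generic code looks only at the major number; the full device number is handed to the driver, which extracts the minor number and decides what it means.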
Device Drivers
A device driver is divided into three main sections:
1. Autoconfiguration and initialization routines
2. Routines for servicing I/O requests (the top half)
3. Interrupt service routines (the bottom half)
The autoconfiguration portion of a driver is responsible for probing for a hardware
device to see whether the latter is present, and for initializing the device and any
associated software state that is required by the device driver. This portion of the
driver is typically called only once, when the system is initialized.
Autoconfiguration is described in Section 14.4.
The section of a driver that services I/O requests by the system is invoked
because of system calls or for the virtual-memory system. This portion of the
device driver executes synchronously in the top half of the kernel and is permitted to
block by calling the sleep() routine. We commonly refer to this body of code as
the top half of a device driver.
Interrupt service routines are invoked when the system fields an interrupt
from a device. Consequently, these routines cannot depend on any per-process
state and cannot block. We commonly refer to a device driver's interrupt service
routines as the bottom half of a device driver.
In addition to these three sections of a device driver, an optional crash-dump
routine may be provided. This routine, if present, is invoked when the system
recognizes an unrecoverable error and wishes to record the contents of physical
memory for use in postmortem analysis. Most device drivers for disk controllers,
and some for tape controllers, provide a crash-dump routine. The use of the
crash-dump routine is described in Section 14.7.
I/O Queueing
Device drivers typically manage one or more queues of I/O requests in their nor­
mal operation. When an input or output request is received by the top half of the
driver, it is recorded in a data structure that is placed on a per-device queue for
processing. When an input or output operation completes, the device driver
receives an interrupt from the controller. The interrupt service routine removes the
appropriate request from the device's queue, notifies the requester that the com­
mand has completed, and then starts the next request from the queue. The I/O
queues are the primary means of communication between the top and bottom
halves of a device driver.
Because I/O queues are shared among asynchronous routines, access to the
queues must be synchronized. Routines that make up the top half of a device
driver must raise the processor priority level (using splbio(), spltty(), etc.) to
prevent the bottom half from being entered as a result of an interrupt while a top-half
routine is manipulating an I/O queue. Synchronization among multiple processes
starting I/O requests also must be done. This synchronization is done using the
mechanisms described in Section 4.3.
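The way a raised priority level protects a queue shared by the two halves can be sketched in ordinary C. This is a simulation, not driver code: the toy_splbio() and toy_splx() routines are hypothetical stand-ins that model the processor priority as a flag and defer a simulated interrupt until the priority is lowered again.

```c
#include <assert.h>
#include <stdbool.h>

#define QSIZE 8

/* Simulated processor state: true while block-I/O interrupts are held off. */
static bool toy_bio_masked;
static bool toy_intr_pending;

/* Per-device request queue shared by the top and bottom halves. */
static int toy_queue[QSIZE];
static int toy_head, toy_tail;      /* toy_head == toy_tail means empty */
static int toy_completed;

/* Bottom half: remove the finished request from the queue. */
static void toy_intr(void)
{
    if (toy_head != toy_tail) {
        toy_head = (toy_head + 1) % QSIZE;
        toy_completed++;
    }
}

/* Raise the priority; return the previous state so that it can be restored. */
static bool toy_splbio(void)
{
    bool old = toy_bio_masked;
    toy_bio_masked = true;
    return old;
}

/* Restore the priority and deliver any interrupt that was held off. */
static void toy_splx(bool old)
{
    toy_bio_masked = old;
    if (!toy_bio_masked && toy_intr_pending) {
        toy_intr_pending = false;
        toy_intr();
    }
}

/* The device signals completion; delivery waits while the priority is raised. */
static void toy_device_interrupt(void)
{
    if (toy_bio_masked)
        toy_intr_pending = true;
    else
        toy_intr();
}

/* Top half: enqueue a request with the bottom half locked out. */
static bool toy_strategy(int request)
{
    bool s = toy_splbio();
    bool ok = (toy_tail + 1) % QSIZE != toy_head;   /* room in the queue? */

    if (ok) {
        toy_queue[toy_tail] = request;
        toy_tail = (toy_tail + 1) % QSIZE;
    }
    toy_splx(s);
    return ok;
}
```

Saving the old level and restoring it, rather than unconditionally lowering the priority, lets the top-half routine be called safely from code that has already raised the priority.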
Interrupt Handling
Interrupts are generated by devices to signal that an operation has completed or
that a change in status has occurred. On receiving a device interrupt, the system
invokes the appropriate device-driver interrupt service routine with one or more
parameters that identify uniquely the device that requires service. These parame­
ters are needed because device drivers typically support multiple devices of the
same type. If the interrupting device's identity were not supplied with each inter­
rupt, the driver would be forced to poll all the potential devices to identify the de­
vice that interrupted.
The system arranges for the unit-number parameter to be passed to the inter­
rupt service routine for each device by installing the address of an auxiliary glue
routine in the interrupt-vector table. This glue routine, rather than the actual inter­
rupt service routine, is invoked to service the interrupt; it takes the following
actions :
1. Save all volatile registers.
2. Update statistics on device interrupts.
3. Call the interrupt service routine with the appropriate unit number parameter.
4. Restore the volatile registers saved in step 1.
5. Return from the interrupt.
Because a glue routine is interposed between the interrupt-vector table and the
interrupt service routine, device drivers do not need to be concerned with saving
and restoring machine state. In addition, special-purpose instructions that cannot
be generated from C, which are needed by the hardware to support interrupts, can
be kept out of the device driver; this interposition of a glue routine permits device
drivers to be written without assembly language.
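The division of labor between the glue routine and the driver can be sketched as follows. This is a C model only: a real glue routine is a few assembly instructions installed in the interrupt-vector table, so the register save, restore, and return-from-interrupt steps appear here as comments. The toy_ names and the choice of unit 2 are hypothetical.

```c
#include <assert.h>

#define NUNITS 4

static unsigned long toy_intr_count;        /* statistics on device interrupts */
static int           toy_serviced[NUNITS];  /* per-unit work done by the driver */

/* The driver's interrupt service routine: plain C, told which unit interrupted. */
static void toy_dkintr(int unit)
{
    assert(unit >= 0 && unit < NUNITS);
    toy_serviced[unit]++;
}

/* Glue for unit 2; one such routine would exist for each interrupting unit. */
static void toy_glue_unit2(void)
{
    /* 1. save all volatile registers (assembly in a real kernel) */
    toy_intr_count++;                       /* 2. update interrupt statistics */
    toy_dkintr(2);                          /* 3. call the ISR with its unit number */
    /* 4. restore the registers saved in step 1 */
    /* 5. return from the interrupt */
}
```

The driver routine never sees the vector table or the saved registers; all that it receives is the unit number, which is why it can be written entirely in C.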
Block Devices
Block devices include disks and tapes. The task of the block-device interface is to
convert from the user abstraction of a disk as an array of bytes to the structure
imposed by the underlying physical medium. Although the user may wish to
write a single byte to a disk, the hardware can read and write only in multiples of
sectors. Hence, the system must arrange to read in the sector containing the byte
to be modified, to replace the affected byte, and to write back the sector to the
disk. This operation of converting random access to an array of bytes to reads and
writes of disk sectors is known as block I/O. Block devices are accessible directly
through appropriate device special files, but are more commonly accessed
indirectly through the filesystem (see Section 8.2).
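The read, modify, and write-back sequence for a single byte can be sketched as follows. This is an illustration under stated assumptions: an in-memory array stands in for the disk, the 512-byte sector size is assumed, and the toy_ routines are hypothetical rather than part of any driver interface.

```c
#include <assert.h>
#include <string.h>

#define SECTOR_SIZE 512      /* assumed sector size */
#define NSECTORS    4

static unsigned char toy_disk[NSECTORS][SECTOR_SIZE];  /* stands in for the medium */
static int toy_sector_reads, toy_sector_writes;

/* The "hardware" transfers whole sectors only. */
static void toy_read_sector(int n, unsigned char *buf)
{
    memcpy(buf, toy_disk[n], SECTOR_SIZE);
    toy_sector_reads++;
}

static void toy_write_sector(int n, const unsigned char *buf)
{
    memcpy(toy_disk[n], buf, SECTOR_SIZE);
    toy_sector_writes++;
}

/* Write one byte at a byte offset; return 0 on success, -1 if out of range. */
static int toy_write_byte(long offset, unsigned char value)
{
    unsigned char buf[SECTOR_SIZE];

    if (offset < 0 || offset >= (long)NSECTORS * SECTOR_SIZE)
        return -1;
    int sector = (int)(offset / SECTOR_SIZE);
    toy_read_sector(sector, buf);          /* read the containing sector */
    buf[offset % SECTOR_SIZE] = value;     /* replace the affected byte */
    toy_write_sector(sector, buf);         /* write the sector back */
    return 0;
}
```

Every one-byte write in this sketch costs two sector transfers; holding the sector in a kernel buffer between operations is what lets later accesses to the same block avoid that cost.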
Processes may read data in sizes smaller than a disk block. The first time that
a small read is required from a particular disk block, the block will be transferred
from the disk into a kernel buffer. Later reads of parts of the same block then
require only copying from the kernel buffer to the memory of the user process.
Multiple small writes are treated similarly. A buffer is allocated from the cache
when the first write to a disk block is made, and later writes to part of the same
block are then likely to require only copying into the kernel buffer, and no disk I/O.
In addition to providing the abstraction of arbitrary alignment of reads and
writes, the block buffer cache reduces the number of disk I/O transfers required by
filesystem accesses. Because system-parameter files, commands, and directories
are read repeatedly, their data blocks are usually in the buffer cache when they are
needed. Thus, the kernel does not need to read them from the disk every time that
they are requested.
If the system crashes while data for a particular block are in the cache but
have not yet been written to disk, the filesystem on the disk will be incorrect and
those data will be lost. (Critical system data, such as the contents of directories,
however, are written synchronously to disk, to ensure filesystem consistency;
operations requiring synchronous I/O are described in the last subsection of Section
8.2.) So that lost data are minimized, writes are forced periodically for dirty
buffer blocks. These forced writes are done (usually every 30 seconds) by a user
process, update, which uses the sync system call. There is also a system call,
fsync, that a process can use to force all dirty blocks of a single file to be written to
disk immediately; this synchronization is useful for ensuring database consistency
or before removing an editor backup file.
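From a user process, forcing one file's dirty blocks to disk looks roughly like the sketch below. The write, fsync, mkstemp, close, and unlink calls are standard POSIX; the helper names write_durably and demo_fsync, the record text, and the scratch-file path are illustrative assumptions.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a record and force it to disk before returning.
 * Returns 0 on success, -1 on any failure.
 */
int write_durably(int fd, const char *rec)
{
    size_t len = strlen(rec);

    if (write(fd, rec, len) != (ssize_t)len)
        return -1;
    return fsync(fd);   /* returns once the file's dirty blocks are written */
}

/* Create a scratch file, write one record durably, and clean up. */
int demo_fsync(void)
{
    char path[] = "/tmp/fsync_demo_XXXXXX";   /* illustrative location */
    int fd = mkstemp(path);
    int rc;

    if (fd < 0)
        return -1;
    rc = write_durably(fd, "commit 42\n");
    close(fd);
    unlink(path);
    return rc;
}
```

A database would typically make such a call after writing a commit record, so that the record does not sit in the cache until the next periodic sync.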
Most magnetic-tape accesses are done through the appropriate raw tape de­
vice, bypassing the block buffer cache. When the cache is used, tape blocks must
still be written in order, so the tape driver forces synchronous writes for them.
Entry Points for Block-Device Drivers
Device drivers for block devices are described by an entry in the bdevsw table.
Each bdevsw structure contains the following entry points:
open      Open the device in preparation for I/O operations. A device's open
          entry point will be called for each open system call on a block special
          device file, or, internally, when a device is prepared for mounting a
          filesystem with the mount system call. The open() routine will commonly
          verify the integrity of the associated medium. For example, it
          will verify that the device was identified during the autoconfiguration
          phase and, for tape and disk drives, that a medium is present and on-line.
strategy  Start a read or write operation, and return immediately. I/O requests to
          or from filesystems located on a device are translated by the system
          into calls to the block I/O routines bread() and bwrite(). These block
          I/O routines in turn call the device's strategy routine to read or write
          data not in the cache. Each call to the strategy routine specifies a
          pointer to a buf structure containing the parameters for an I/O request.
          If the request is synchronous, the caller must sleep (on the address of
          the buf structure) until I/O completes.
close     Close a device. The close() routine is called after the final client
          interested in using the device terminates. These semantics are defined by
          the higher-level I/O facilities. Disk devices have nothing to do when a
          device is closed, and thus use a null close() routine. Devices that
          support access to only a single client must mark the device as available
          once again. Closing a tape drive that was open for writing typically
          causes end-of-file marks to be written on the tape and the tape to be
          rewound.
dump      Write all physical memory to the device. The dump entry point saves
          the contents of memory on secondary storage. The system automatically
          takes a dump when it detects an unrecoverable error and is about
          to crash. The dump is used in a postmortem analysis of the problem
          that caused the system to crash. The dump routine is invoked with the
          processor priority at its highest level; thus, the device driver must poll
          for device status, rather than wait for interrupts. All disk devices are
          expected to support this entry point; some tape devices do as well.
psize     Return the size of a disk-drive partition. The driver is supplied a logical
          unit and is expected to return the size of that unit, typically a disk-drive
          partition, in DEV_BSIZE blocks. This entry point is used during
          the bootstrap procedure to calculate the location at which a crash dump
          should be placed and to determine the sizes of the swap devices.
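The entry points described above can be pictured as a table of function pointers indexed by major device number. The sketch below is a simplified illustration, not the 4.4BSD declaration: the structure name bdevsw_sketch, the toy_ driver routines, the table name, and all signatures are assumptions made for the example; only the d_ prefix on the field names follows BSD convention.

```c
#include <assert.h>
#include <stddef.h>

struct buf;   /* I/O request descriptor; left opaque in this sketch */

struct bdevsw_sketch {
    int  (*d_open)(int unit, int flags);
    int  (*d_close)(int unit, int flags);
    void (*d_strategy)(struct buf *bp);
    int  (*d_dump)(int unit);
    int  (*d_psize)(int unit);
};

/* A null close for devices, such as disks, with nothing to do on close. */
int nullclose(int unit, int flags) { (void)unit; (void)flags; return 0; }

/* Toy disk driver: only unit 0 was "found" during autoconfiguration. */
static int started;
int  toy_open(int unit, int flags) { (void)flags; return unit == 0 ? 0 : -1; }
void toy_strategy(struct buf *bp) { (void)bp; started++; }  /* queue, return */
int  toy_dump(int unit) { (void)unit; return 0; }
int  toy_psize(int unit) { (void)unit; return 2048; }       /* size in blocks */

struct bdevsw_sketch bdevsw_tbl[] = {
    { toy_open, nullclose, toy_strategy, toy_dump, toy_psize },  /* major 0 */
};

/* Kernel-side dispatch: the major number selects the driver. */
int bdev_open(int major, int unit, int flags)
{
    assert(major >= 0 &&
           major < (int)(sizeof bdevsw_tbl / sizeof bdevsw_tbl[0]));
    return bdevsw_tbl[major].d_open(unit, flags);
}
```

Keeping the table as plain data lets the rest of the kernel call a driver without knowing which driver it is; the strategy entry returns as soon as the request is started, as the text describes.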
Sorting of Disk I/O Requests
The kernel provides a generic disksort() routine that can be used by all the disk
device drivers to sort I/O requests into a drive's request queue using an elevator
sorting algorithm. This algorithm sorts requests in a cyclic, ascending, cylinder
order, so that requests can be serviced with a minimal number of one-way scans
over the drive. This ordering was originally designed to support the normal
read-ahead requested by the filesystem as well as to counteract the filesystem's random
placement of data on a drive. With the improved placement algorithms in the current
filesystem, the effect of the disksort() routine is less noticeable; disksort()
produces the largest effect when there are multiple simultaneous users of a drive.
The disksort() algorithm is shown in Fig. 6.2. A drive's request queue is
made up of one or two lists of requests ordered by cylinder number. The request
at the front of the first list indicates the current position of the drive. If a second
list is present, it is made up of requests that lie before the current position. Each
new request is sorted into either the first or the second list, according to the
request's location. When the heads reach the end of the first list, the drive begins
servicing the other list.
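The two-list queue just described can be sketched in C as a single linked list in which a drop in cylinder number marks the start of the second list. This is a minimal illustration rather than the kernel's code: the request and drive_queue structures are hypothetical, and requests are compared by cylinder number only.

```c
#include <assert.h>
#include <stddef.h>

struct request {
    int cylinder;
    struct request *next;
};

struct drive_queue {
    struct request *head;   /* front of the first list: current position */
};

void disksort(struct drive_queue *dq, struct request *bp)
{
    struct request *ap = dq->head;

    assert(bp != NULL);
    if (ap == NULL) {                    /* drive queue is empty */
        bp->next = NULL;
        dq->head = bp;
        return;
    }
    if (bp->cylinder < ap->cylinder) {
        /* Request lies before the current position: use the second list. */
        while (ap->next != NULL && ap->next->cylinder >= ap->cylinder)
            ap = ap->next;               /* advance to the end of the first list */
        while (ap->next != NULL && ap->next->cylinder <= bp->cylinder)
            ap = ap->next;               /* find the slot in the second list */
    } else {
        /* Sort into the first list, stopping where the second list begins. */
        while (ap->next != NULL &&
               ap->next->cylinder >= ap->cylinder &&
               ap->next->cylinder <= bp->cylinder)
            ap = ap->next;
    }
    bp->next = ap->next;                 /* insert after ap */
    ap->next = bp;
}
```

With an active request at cylinder 50, later requests for cylinders 70, 30, 60, 10, and 40 end up queued as 50, 60, 70, 10, 30, 40: one ascending sweep from the current position, followed by a second ascending sweep over the requests that lie behind it.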
Disk sorting can also be important on machines that have a fast processor, but
that do not sort requests within the device driver. In this situation, if a write of
several Kbyte is honored in order of queueing, it can block other processes from
accessing the disk while it completes. Sorting requests provides some scheduling,
which more fairly distributes accesses to the disk controller.

    disksort(dq, bp)
        drive queue *dq;
        buffer *bp;
    {
        if (drive queue is empty) {
            place the buffer at the front of the drive queue;
            return;
        }
        if (request lies before the first active request) {
            locate the beginning of the second request list;
            sort bp into the second request list;
        } else
            sort bp into the current request list;
    }

Figure 6.2  Algorithm for disksort().
Disk Labels
Many disk controllers require the device driver to identify the location of disk sec­
tors that are to be transferred by their cylinder, track, and rotational offset. For
maximum throughput efficiency, this information is also needed by the filesystem
when deciding how to lay out files. Finally, a disk may be broken up into several
partitions, each of which may be used for a separate filesystem or swap area.
Historically, the information about the geometry of the disk and about the lay­
out of the partitions was compiled into the kernel device drivers. This approach
had several flaws. First, it was cumbersome to have all the possible disk types and
partitions compiled into the kernel. Any time that a disk with a new geometry was
added, the driver tables had to be updated and the kernel recompiled. It was also
restrictive in that there was only one choice of partition table for each drive type.
Choosing a different set of tables required modifying the disk driver and rebuild­
ing the kernel. Installing new tables also required dumping all the disks of that
type on the system, then booting the new kernel and restoring them onto the new
partitions. Disks with different partition layouts could not be moved from one
system to another. An additional problem arose when nonstandard partition tables
were used; new releases from the vendor had to have the partition tables modified
before they could be used on an existing system.
For all these reasons, 4.4BSD and most commercial UNIX vendors added disk
labels. A disk label contains detailed geometry information, including cylinder,
track, and sector layout, along with any other driver-specific information. It also
contains information about the partition layout and usage, the latter describing
partition usage: type of filesystem, swap partition, or unused. For the fast
filesystem, the partition usage contains enough additional information to enable
the filesystem check program (fsck) to locate the alternate superblocks for the
filesystem.
Having labels on each disk means that partition information can be different
for each disk, and that it carries over when the disk is moved from one system to
another. It also means that, when previously unknown types of disks are con­
nected to the system, the system administrator can use them without changing the
disk driver, recompiling, and rebooting the system.
The label is located near the beginning of each drive-usually, in block zero.
It must be located in the first track, because the device driver does not know the
geometry of the disk until the driver has read the label. Thus, it must assume that
the label is in cylinder zero, track zero, at some valid offset within that track.
Most architectures have hardware (or first-level) bootstrap code stored in read­
only memory (ROM). When the machine is powered up or the reset button is
pressed, the CPU executes the hardware bootstrap code from the ROM. The hard­
ware bootstrap code typically reads the first few sectors on the disk into the main
memory, then branches to the address of the first location that it read. The pro­
gram stored in these first few sectors is the second-level bootstrap. Having the
disk label stored in the part of the disk read as part of the hardware bootstrap
allows the second-level bootstrap to have the disk-label information. This information
gives it the ability to find the root filesystem and hence the files, such as
the kernel, needed to bring up 4.4BSD. The size and location of the second-level
bootstrap are dependent on the requirements of the hardware bootstrap code.
Since there is no standard for disk-label formats and the hardware bootstrap code
usually understands only the vendor label, it is often necessary to support both the
vendor and the 4.4BSD disk labels. Here, the vendor label must be placed where
the hardware bootstrap ROM code expects it; the 4.4BSD label must be placed out
of the way of the vendor label but within the area that is read in by the hardware
bootstrap code, so that it will be available to the second-level bootstrap.
Character Devices
Almost all peripherals on the system, except network interfaces, have a
character-device interface. A character device usually maps the hardware interface into a
byte stream, similar to that of the filesystem. Character devices of this type
include terminals (e.g., /dev/tty00), line printers (e.g., /dev/lp0), an interface to
physical main memory (/dev/mem), and a bottomless sink for data and an endless
source of end-of-file markers (/dev/null). Some of these character devices, such
as terminal devices, may display special behavior on line boundaries, but in general
are still treated as byte streams.
Devices emulating terminals use buffers that are smaller than those used for
disks and tapes. This buffering system involves small (usually 64-byte) blocks of
characters kept in linked lists. Although all free character buffers are kept in a
single free list, most device drivers that use them limit the number of characters
that can be queued at one time for a single terminal port.
Devices such as high-speed graphics interfaces may have their own buffers or
may always do I/O directly into the address space of the user; they too are classed
as character devices. Some of these drivers may recognize special types of
records, and thus be further from the plain byte-stream model.
The character interface for disks and tapes is also called the raw device interface;
it provides an unstructured interface to the device. Its primary task is to
arrange for direct I/O to and from the device. The disk driver isolates the details
of tracks, cylinders, and the like from the rest of the kernel. It also handles the
asynchronous nature of I/O by maintaining and ordering an active queue of pending
transfers. Each entry in the queue specifies whether it is for reading or writing,
the main-memory address for the transfer, the device address for the transfer
(usually a disk sector number), and the transfer size (in bytes).
All other restrictions of the underlying hardware are passed through the character
interface to its clients, making character-device interfaces the furthest from
the byte-stream model. Thus, the user process must abide by the sectoring restrictions
imposed by the underlying hardware. For magnetic disks, the file offset and
transfer size must be a multiple of the sector size. The character interface does not
copy the user data into a kernel buffer before putting them on an I/O queue.
Rather, it arranges to have the I/O done directly to or from the address space of the
process. The size and alignment of the transfer is limited by the physical device.
However, the transfer size is not restricted by the maximum size of the internal
buffers of the system, because these buffers are not used.
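The sectoring restriction on raw transfers amounts to a simple check on the file offset and transfer size. The sketch below assumes a 512-byte sector, and the function name raw_transfer_ok is hypothetical; the real limits come from the physical device.

```c
#define SECTOR_SIZE 512L   /* assumed sector size, for illustration only */

/*
 * Return 1 if a raw-disk transfer obeys the sectoring restriction,
 * that is, if both the offset and the size are multiples of the sector size.
 */
int raw_transfer_ok(long offset, long nbytes)
{
    return offset >= 0 && nbytes > 0 &&
           offset % SECTOR_SIZE == 0 &&
           nbytes % SECTOR_SIZE == 0;
}
```

A program using the raw interface would round its requests to sector boundaries itself, because no kernel buffer is available to absorb a partial sector.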
The character interface is typically used by only those system utility programs
that have an intimate knowledge of the data structures on the disk or tape. The
character interface also allows user-level prototyping; for example, the 4.2BSD
filesystem implementation was written and largely tested as a user process that
used a raw disk interface, before the code was moved into the kernel.
Character devices are described by entries in the cdevsw table. The entry
points in this table (see Table 6.1) are used to support raw access to
block-oriented devices, as well as normal access to character-oriented devices
through the terminal driver. Because of the diverse requirements of these two
types of devices, the set of entry points is the union of two disjoint sets. Raw
devices support a subset of the entry points that correspond to those entry points
found in a block-device driver, whereas character devices support the full set of
entry points. Each is described in the following sections.
Raw Devices and Physical I/O
Most raw devices differ from block devices only in the way that they do I/O.
Whereas block devices read and write data to and from the system buffer cache,
raw devices transfer data to and from user data buffers. Bypassing the buffer
cache eliminates the memory-to-memory copy that must be done by block
devices, but also denies applications the benefits of data caching. In addition, for
devices that support both raw- and block-device access, applications must take
care to preserve consistency between data in the buffer cache and data written
directly to the device; the raw device should be used only when the block device is
idle. Raw-device access is used by many filesystem utilities, such as the filesystem
check program, fsck, and by programs that read and write magnetic tapes (for
example, tar, dump, and restore).

Table 6.1  Entry points for character and raw device drivers.

Entry point   Function
open()        open the device
close()       close the device
ioctl()       do an I/O control operation
mmap()        map device offset to memory location
read()        do an input operation
reset()       reinitialize device after a bus reset
select()      poll device for I/O readiness
stop()        stop output on the device
write()       do an output operation

Because raw devices bypass the buffer cache, they are responsible for manag­
ing their own buffer structures. Most devices borrow swap buffers to describe
their I/O. The read and write routines use the physio ( ) routine to start a raw I/O
operation (see Fig. 6.3). The strategy parameter identifies a block-device strategy
routine that starts I/O operations on the device. The buffer indicated by bp is used
by physio() in constructing the request(s) made to the strategy routine. The device,
read-write flag, and uio parameters completely specify the I/O operation that
should be done. The minphys() routine is called by physio() to adjust the size of
each I/O transfer before the latter is passed to the strategy routine; this call to
minphys() allows the transfer to be done in sections, according to the maximum
transfer size supported by the device.
Raw-device I/O operations request the hardware device to transfer data
directly to or from the data buffer in the user program's address space described
by the uio parameter. Thus, unlike I/O operations that do direct memory access
(DMA) from buffers in the kernel address space, raw I/O operations must check
that the user's buffer is accessible by the device, and must lock it into memory for
the duration of the transfer.
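The way that minphys() lets one large raw request be carried out in sections can be sketched with a user-level simulation. Everything named here is an assumption made for illustration: the xfer structure stands in for the buf structure, MAXPHYS_SKETCH is a made-up device limit, and the locking and mapping of user pages is reduced to comments.

```c
#include <assert.h>
#include <stddef.h>

#define MAXPHYS_SKETCH 4096   /* hypothetical per-transfer device limit, bytes */

struct xfer {                 /* stand-in for the buf handed to strategy */
    size_t offset;            /* position within the user's buffer */
    size_t count;             /* bytes in this section of the transfer */
};

typedef void (*strategy_fn)(struct xfer *);
typedef void (*minphys_fn)(struct xfer *);

/* Bound one transfer to what the device can do in a single operation. */
void minphys_sketch(struct xfer *x)
{
    if (x->count > MAXPHYS_SKETCH)
        x->count = MAXPHYS_SKETCH;
}

/* Carry out a request of resid bytes; return the number of strategy calls. */
int physio_sketch(strategy_fn strategy, minphys_fn minphys, size_t resid)
{
    size_t done = 0;
    int calls = 0;

    while (resid > 0) {                    /* request is not exhausted */
        struct xfer x = { done, resid };   /* set up a maximum-sized transfer */

        minphys(&x);                       /* bound the transfer size */
        assert(x.count > 0 && x.count <= resid);
        /* A kernel would lock and map the user pages here. */
        strategy(&x);                      /* start the transfer */
        /* A kernel would sleep until completion, then unmap and unlock. */
        done += x.count;
        resid -= x.count;                  /* deduct from the total */
        calls++;
    }
    return calls;
}

/* Toy strategy routine: just totals the bytes it was asked to move. */
static size_t bytes_moved;
void count_strategy(struct xfer *x) { bytes_moved += x->count; }
```

Under these assumptions, a 10000-byte request is passed to the strategy routine in three sections of 4096, 4096, and 1808 bytes.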
    physio(strategy, bp, dev, flags, minphys, uio)
        int strategy();
        buffer *bp;
        device dev;
        int flags;
        int minphys();
        struct uio *uio;
    {
        if no buffer passed in, allocate a swap buffer;
        while (uio is not exhausted) {
            check user read/write access at uio location;
            if buffer passed in, wait until not busy;
            mark the buffer busy for physical I/O;
            set up the buffer for a maximum sized transfer;
            call minphys to bound the transfer size;
            lock the part of the user address space
                involved in the transfer;
            map the user pages into the buffer;
            call strategy to start the transfer;
            raise the priority level to splbio;
            wait for the transfer to complete;
            unmap the user pages from the buffer;
            unlock the part of the address space previously
                locked;
            wake up anybody waiting on the buffer;
            lower the priority level;
            deduct the transfer size from the total number
                of data to transfer;
        }
        if using swap buffer, free it;
    }

Figure 6.3  Algorithm for physical I/O.

Character-Oriented Devices
Character-oriented I/O devices are typified by terminal multiplexers, although they
also include printers and other character- or line-oriented devices. These devices
are usually accessed through the terminal driver, described in Chapter 10. The
close tie to the terminal driver has heavily influenced the structure of
character-device drivers. For example, several entry points in the cdevsw structure exist for
communication between the generic terminal handler and the terminal multiplexer
hardware drivers.
Entry Points for Character-Device Drivers
A device driver for a character device is defined by an entry in the cdevsw table.
This structure contains many of the same entry points found in an entry in the
bdevsw table.
open
close     Open or close a character device. The open() and close() entry points
          provide functions similar to those of a block device driver. For character
          devices that simply provide raw access to a block device, these entry
          points are usually the same. But some block devices do not have these
          entry points, whereas most character devices do have them.
read      Read data from a device. For raw devices, this entry point normally just
          calls the physio() routine with device-specific parameters. For
          terminal-oriented devices, a read request is passed immediately to the terminal
          driver. For other devices, a read request requires that the specified data be
          copied into the kernel's address space, typically with the uiomove()
          routine, and then be passed to the device.
write     Write data to a device. This entry point is a direct parallel of the read
          entry point: Raw devices use physio(), terminal-oriented devices call the
          terminal driver to do this operation, and other devices handle the request
          internally.
ioctl     Do an operation other than a read or write. This entry point originally
          provided a mechanism to get and set device parameters for terminal
          devices; its use has expanded to other types of devices as well.
          Historically, ioctl() operations have varied widely from device to device.
          4.4BSD, however, defines a set of operations that is supported by all tape
          devices. These operations position tapes, return unit status, write
          end-of-file marks, and place a tape drive off-line.
select    Check the device to see whether data are available for reading, or space is
          available for writing. The select entry point is used by the select
          system call in checking file descriptors associated with device special files.
          For raw devices, a select operation is meaningless, since data are not
          buffered. Here, the entry point is set to seltrue(), a routine that returns
          true for any select request. For devices used with the terminal driver, this
          entry point is set to ttselect(), a routine described in Chapter 10.
stop      Stop output on a device. The stop routine is defined for only those
          devices used with the terminal driver. For these devices, the stop routine
          halts transmission on a line when the terminal driver receives a stop
          character (for example, ^S) or when it prepares to flush its output queues.
mmap      Map a device offset into a memory address. This entry point is called by
          the virtual-memory system to convert a logical mapping to a physical
          address. For example, it converts an offset in /dev/mem to a kernel
          address.
reset     Reset device state after a bus reset. The reset routine is called from the
          bus-adapter support routines after a bus reset is made. The device driver
          is expected to reinitialize the hardware to put it into a known state,
          typically the state that it has when the system is initially booted.
Descriptor Management and Services
For user processes, all I/O is done through descriptors. The user interface to
descriptors was described in Section 2.6. This section describes how the kernel
manages descriptors, and how it provides descriptor services, such as locking and
select.
System calls that refer to open files take a file descriptor as an argument to
specify the file. The file descriptor is used by the kernel to index into the descrip­
tor table for the current process (kept in the filedesc structure, a substructure of the
process structure for the process) to locate a file entry, or file structure. The
relations of these data structures are shown in Fig. 6.4.
The file entry provides a file type and a pointer to an underlying object for the
descriptor. For data files, the file entry points to a vnode structure that references a
substructure containing the filesystem-specific information described in Chapters
7, 8, and 9. The vnode layer is described in Section 6.5. Special files do not have
data blocks allocated on the disk; they are handled by the special-device filesystem
that calls appropriate drivers to handle I/O for them. The 4.4BSD file entry
may also reference a socket, instead of a file. Sockets have a different file type,
and the file entry points to a system block that is used in doing interprocess
communication. The virtual-memory system supports the mapping of files into a
process's address space. Here, the file descriptor must reference a vnode that will be
partially or completely mapped into the user's address space.
Open File Entries
The set of file entries is the focus of activity for file descriptors. They contain the
information necessary to access the underlying objects and to maintain common
information.
The file entry is an object-oriented data structure. Each entry contains a type
and an array of function pointers that translate the generic operations on file
descriptors into the specific actions associated with their type. In 4.4BSD, there
are two descriptor types: files and sockets. The operations that must be implemented
for each type are as follows:
• Read from the descriptor
• Write to the descriptor
• Select on the descriptor
• Do ioctl operations on the descriptor
• Close and possibly deallocate the object associated with the descriptor

Figure 6.4  File-descriptor reference to a file entry.

Note that there is no open routine defined in the object table. 4.4BSD treats
descriptors in an object-oriented fashion only after they are created. This
approach was taken because sockets and files have different characteristics.
Generalizing the interface to handle both types of descriptors at open time would have
complicated an otherwise simple interface.
Each file entry has a pointer to a data structure that contains information spe­
cific to the instance of the underlying object. The data structure is opaque to the
routines that manipulate the file entry. A reference to the data structure is passed
on each call to a function that implements a file operation. All state associated
with an instance of an object must be stored in that instance's data structure; the
underlying objects are not permitted to manipulate the file entry themselves.
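The arrangement described above, a type, a vector of operations, and an opaque pointer to per-object data, can be sketched in C as follows. The structure and field names are modeled loosely on BSD conventions, but the _sketch suffixes, the simplified signatures, and the toy object type are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

struct file_sketch;

/* One operations vector per descriptor type; note that there is no open. */
struct fileops_sketch {
    int (*fo_read)(struct file_sketch *fp, char *buf, int n);
    int (*fo_write)(struct file_sketch *fp, const char *buf, int n);
    int (*fo_ioctl)(struct file_sketch *fp, int cmd, void *arg);
    int (*fo_select)(struct file_sketch *fp, int which);
    int (*fo_close)(struct file_sketch *fp);
};

struct file_sketch {
    int f_type;                          /* descriptor type */
    const struct fileops_sketch *f_ops;  /* type-specific operations */
    void *f_data;                        /* opaque per-object state */
};

/* Generic layer: it never looks at what f_data points to. */
int file_read(struct file_sketch *fp, char *buf, int n)
{
    assert(fp != NULL && fp->f_ops != NULL);
    return fp->f_ops->fo_read(fp, buf, n);
}

int file_close(struct file_sketch *fp)
{
    assert(fp != NULL && fp->f_ops != NULL);
    return fp->f_ops->fo_close(fp);
}

/* A toy object type that keeps all of its state behind f_data. */
struct toy_state { int reads; int closed; };

static int toy_read(struct file_sketch *fp, char *buf, int n)
{
    struct toy_state *st = fp->f_data;

    for (int i = 0; i < n; i++)
        buf[i] = 'x';
    st->reads++;
    return n;
}
static int toy_write(struct file_sketch *fp, const char *buf, int n)
{ (void)fp; (void)buf; return n; }
static int toy_ioctl(struct file_sketch *fp, int cmd, void *arg)
{ (void)fp; (void)cmd; (void)arg; return 0; }
static int toy_select(struct file_sketch *fp, int which)
{ (void)fp; (void)which; return 1; }
static int toy_close(struct file_sketch *fp)
{
    struct toy_state *st = fp->f_data;

    st->closed = 1;
    return 0;
}

const struct fileops_sketch toy_ops = {
    toy_read, toy_write, toy_ioctl, toy_select, toy_close
};
```

The toy routines touch only the state reached through f_data and leave the file entry itself alone, mirroring the rule stated above.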
The read and write system calls do not take an offset in the file as an argu­
ment. Instead, each read or write updates the current file offset in the file accord­
ing to the number of bytes transferred. The offset determines the position in the
file for the next read or write. The offset can be set directly by the lseek system
call. Since more than one process may open the same file, and each such process
needs its own offset for the file, the offset cannot be stored in the per-object data
structure. Thus, each open system call allocates a new file entry, and the open file
entry contains the offset.
Some semantics associated with all file descriptors are enforced at the
descriptor level, before the underlying system call is invoked. These semantics are
maintained in a set of flags associated with the descriptor. For example, the flags
record whether the descriptor is open for reading, writing, or both reading and
writing. If a descriptor is marked as open for reading only, an attempt to write it
will be caught by the descriptor code. Thus, the functions defined for doing read­
ing and writing do not need to check the validity of the request; we can implement
them knowing that they will never receive an invalid request.
Other information maintained in the flags includes
• The no-delay (NDELAY) flag: If a read or a write would cause the process to
  block, the system call returns an error (EWOULDBLOCK) instead.
• The asynchronous (ASYNC) flag: The kernel watches for a change in the status of
  the descriptor, and arranges to send a signal (SIGIO) when a read or write
  becomes possible.
Other information that is specific to regular files also is maintained in the flags
field:
• Information on whether the descriptor holds a shared or exclusive lock on the
  underlying file: The locking primitives could be extended to work on sockets, as
  well as on files. However, the descriptors for a socket rarely refer to the same
  file entry. The only way for two processes to share the same socket descriptor is
  for a parent to share the descriptor with its child by forking, or for one process to
  pass the descriptor to another in a message.
• The append flag: Each time that a write is made to the file, the offset pointer is
  first set to the end of the file. This feature is useful when, for example, multiple
  processes are writing to the same log file.
Each file entry has a reference count. A single process may have multiple refer­
ences to the entry because of calls to the dup or fcntl system calls. Also, file struc­
tures are inherited by the child process after a fork, so several different processes
may reference the same file entry. Thus, a read or write by either process on the
twin descriptors will advance the file offset. This semantic allows two processes
to read the same file or to interleave output to the same file. Another process that
has independently opened the file will refer to that file through a different file
structure with a different file offset. This functionality was the original reason for
the existence of the file structure; the file structure provides a place for the file off­
set intermediate between the descriptor and the underlying object.
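The difference between descriptors that share a file entry and descriptors obtained by independent opens can be observed from a user process. The sketch below uses only standard POSIX calls (mkstemp, write, lseek, dup, open, read, close, unlink); the function name demo_offsets and the scratch-file path are illustrative assumptions.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 0 if the offsets behave as the text describes, -1 otherwise. */
int demo_offsets(void)
{
    char path[] = "/tmp/offset_demo_XXXXXX";   /* illustrative location */
    char c;
    int ok = -1;
    int fd1 = mkstemp(path);

    if (fd1 < 0)
        return -1;
    if (write(fd1, "abcdef", 6) == 6 && lseek(fd1, 0, SEEK_SET) == 0) {
        int fd2 = dup(fd1);               /* same file entry, shared offset */
        int fd3 = open(path, O_RDONLY);   /* new file entry, its own offset */

        if (fd2 >= 0 && fd3 >= 0 &&
            read(fd1, &c, 1) == 1 && c == 'a' &&
            read(fd2, &c, 1) == 1 && c == 'b' &&   /* continues after fd1 */
            read(fd3, &c, 1) == 1 && c == 'a')     /* starts at the beginning */
            ok = 0;
        if (fd2 >= 0)
            close(fd2);
        if (fd3 >= 0)
            close(fd3);
    }
    close(fd1);
    unlink(path);
    return ok;
}
```

The duplicated descriptor picks up where the original left off because both refer to one file entry, whereas the independently opened descriptor has its own file entry and therefore its own offset.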
Each time that a new reference is created, the reference count is incremented.
When a descriptor is closed (any one of ( 1 ) explicitly with a close, (2) implicitly
after an exec because the descriptor has been marked as close-on-exec, or (3) on
process exit), the reference count is decremented. When the reference count drops
to zero, the file entry is freed.
The AF_LOCAL domain interprocess-communication facility allows descrip­
tors to be sent between processes. While a descriptor is in transit between pro­
cesses, it may not have any explicit references. It must not be deallocated, as it
will be needed when the message is received by the destination process. However,
the message might never be received; thus, the file entry also holds a message
count for each entry. The message count is incremented for each descriptor that is
in transit, and is decremented when the descriptor is received. The file entry
might need to be reclaimed when all the remaining references are in messages.
For more details on message passing in the AF_LOCAL domain, see Section 11.6.
The close-on-exec flag is kept in the descriptor table, rather than in the file
entry. This flag is not shared among all the references to the file entry because it is
an attribute of the file descriptor itself. The close-on-exec flag is the only piece of
information that is kept in the descriptor table, rather than being shared in the file
entry.
Management of Descriptors
The fcntl system call manipulates the file structure. It can be used to make the
following changes to a descriptor:
• Duplicate a descriptor as though by a dup system call.
• Get or set the close-on-exec flag. When a process forks, all the parent's descriptors
  are duplicated in the child. The child process then execs a new process.
  Any of the child's descriptors that were marked close-on-exec are closed. The
  remaining descriptors are available to the newly executed process.
• Set the descriptor into nonblocking mode. If any data are available for a read
  operation, or if any space is available for a write operation, an immediate partial
  read or write is done. If no data are available for a read operation, or if a write
  operation would block, the system call returns an error showing that the operation
  would block, instead of putting the process to sleep. This facility was not
  implemented for regular files in 4.4BSD, because filesystem I/O is always
  expected to complete within a few milliseconds.
• Force all writes to append data to the end of the file, instead of at the descriptor's
  current location in the file.
• Send a signal to the process when it is possible to do I/O.
• Send a signal to a process when an exception condition arises, such as when
  urgent data arrive on an interprocess-communication channel.
• Set or get the process identifier or process-group identifier to which the two
  I/O-related signals in the previous steps should be sent.
• Test or change the status of a lock on a range of bytes within an underlying file.
  Locking operations are described in the next subsection.
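Two of the changes listed above, setting the close-on-exec flag and setting nonblocking mode, look roughly like this from a user process. The fcntl commands and flags used (F_GETFD, F_SETFD, FD_CLOEXEC, F_GETFL, F_SETFL, O_NONBLOCK) are standard POSIX; the function name demo_fcntl and the use of a pipe as the test object are assumptions made for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 if both flags take effect as described, -1 otherwise. */
int demo_fcntl(void)
{
    int p[2];
    int ok = -1;
    int fdflags, flflags;

    if (pipe(p) < 0)
        return -1;
    fdflags = fcntl(p[0], F_GETFD);   /* per-descriptor flags */
    flflags = fcntl(p[0], F_GETFL);   /* flags kept in the file entry */
    if (fdflags >= 0 && flflags >= 0 &&
        fcntl(p[0], F_SETFD, fdflags | FD_CLOEXEC) == 0 &&
        fcntl(p[0], F_SETFL, flflags | O_NONBLOCK) == 0) {
        char c;
        /* The pipe is empty: a nonblocking read fails instead of sleeping. */
        ssize_t n = read(p[0], &c, 1);

        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK) &&
            (fcntl(p[0], F_GETFD) & FD_CLOEXEC) != 0)
            ok = 0;
    }
    close(p[0]);
    close(p[1]);
    return ok;
}
```

The two kinds of flag are read and written with different commands, which reflects where they live: close-on-exec belongs to the descriptor-table entry, whereas nonblocking mode is kept with the file entry.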
The implementation of the dup system call is easy. If the process has reached
its limit on open files, the kernel returns an error. Otherwise, the kernel scans the
current process's descriptor table, starting at descriptor zero, until it finds an
unused entry. The kernel allocates the entry to point to the same file entry as does
the descriptor being duplicated. The kernel then increments the reference count
on the file entry, and returns the index of the allocated descriptor-table entry. The
fcntl system call provides a similar function, except that it specifies a descriptor
from which to start the scan .
Sometimes, a process wants to allocate a specific descriptor-table entry. Such
a request is made with the dup2 system call. The process specifies the descriptor-table
index into which the duplicated reference should be placed. The kernel
implementation is the same as for dup, except that the scan to find a free entry is
changed to close the requested entry if that entry is open, and then to allocate the
entry as before. No action is taken if the new and old descriptors are the same.
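The scan described above can be modeled outside the kernel. The sketch below is a toy, not the 4.4BSD source: the toy_ names, the fixed table size TOY_NOFILE, and the bare reference-counted toy_file are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NOFILE 8                    /* hypothetical per-process limit */

struct toy_file { int refcount; };      /* stand-in for a file entry */
struct toy_fdtable { struct toy_file *slot[TOY_NOFILE]; };

/* Drop the reference held by one descriptor-table entry, if it is open. */
static void toy_close(struct toy_fdtable *t, int fd)
{
    if (fd >= 0 && fd < TOY_NOFILE && t->slot[fd] != NULL) {
        t->slot[fd]->refcount--;
        t->slot[fd] = NULL;
    }
}

/* fcntl-style duplicate: scan upward from `start` for an unused entry. */
int toy_dup_from(struct toy_fdtable *t, int oldfd, int start)
{
    assert(t != NULL);
    if (oldfd < 0 || oldfd >= TOY_NOFILE || t->slot[oldfd] == NULL || start < 0)
        return -1;
    for (int fd = start; fd < TOY_NOFILE; fd++) {
        if (t->slot[fd] == NULL) {
            t->slot[fd] = t->slot[oldfd];   /* share the same file entry */
            t->slot[fd]->refcount++;
            return fd;
        }
    }
    return -1;                              /* limit on open files reached */
}

/* dup: the same scan, starting at descriptor zero. */
int toy_dup(struct toy_fdtable *t, int oldfd)
{
    return toy_dup_from(t, oldfd, 0);
}

/* dup2: close the requested entry if it is open, then allocate it. */
int toy_dup2(struct toy_fdtable *t, int oldfd, int newfd)
{
    assert(t != NULL);
    if (oldfd < 0 || oldfd >= TOY_NOFILE || t->slot[oldfd] == NULL ||
        newfd < 0 || newfd >= TOY_NOFILE)
        return -1;
    if (newfd == oldfd)
        return newfd;                       /* no action is taken */
    toy_close(t, newfd);
    t->slot[newfd] = t->slot[oldfd];
    t->slot[newfd]->refcount++;
    return newfd;
}
```

The three entry points differ only in how the target entry is chosen; the reference-count bookkeeping is the same in each.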
The system implements getting or setting the close-on-exec flag via the fcntl
system call by making the appropriate change to the flags field of the associated
descriptor-table entry. Other attributes that fcntl can get or set manipulate the flags
in the file entry. However, the implementation of the various flags cannot be handled
by the generic code that manages the file entry. Instead, the file flags must be
passed through the object interface to the type-specific routines to do the
appropriate operation on the underlying object. For example, manipulation of the
nonblocking flag for a socket must be done by the socket layer, since only that
layer knows whether an operation can block.
The implementation of the ioctl system call is broken into two major levels.
The upper level handles the system call itself. The ioctl call includes a descriptor,
a command, and a pointer to a data area. The command argument encodes the
size of the data area for the parameters, and whether the parameters are input,
output, or both input and output. The upper level is responsible for decoding the
command argument, allocating a buffer, and copying in any input data. If a return
value is to be generated and there is no input, the buffer is zeroed. Finally, the
ioctl is dispatched through the file-entry ioctl function, along with the I/O buffer,
to the lower-level routine that implements the requested operation.
The lower level does the requested operation. Along with the command argument,
it receives a pointer to the I/O buffer. The upper level has already checked
for valid memory references, but the lower level may do more precise argument
validation because it knows more about the expected nature of the arguments.
However, it does not need to copy the arguments in or out of the user process. If
the command is successful and produces output, the lower level places the results
in the buffer provided by the upper level. When the lower level returns, the upper
level copies the results to the process.
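The division of labor between the two levels can be sketched in ordinary C. Everything here is hypothetical: the bit layout of the command word, the TOY_ macros, and the pretend device with a single speed setting are invented for the example and do not reproduce the real 4.4BSD encoding. The calls to memcpy stand in for copying between user and kernel space.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Invented layout: bit 31 = parameters are input, bit 30 = parameters are
 * output, bits 8..23 = parameter size in bytes, bits 0..7 = command number. */
#define TOY_IN   0x80000000u
#define TOY_OUT  0x40000000u
#define TOY_CMD(dir, size, num) ((dir) | ((unsigned)(size) << 8) | (num))
#define TOY_SIZE(cmd) (((cmd) >> 8) & 0xffffu)

#define TOY_GETSPEED TOY_CMD(TOY_OUT, sizeof(int), 1)
#define TOY_SETSPEED TOY_CMD(TOY_IN, sizeof(int), 2)

typedef int (*toy_ioctl_fn)(unsigned cmd, void *buf);

static int toy_speed = 9600;            /* state owned by the pretend device */

/* Lower level: sees only the kernel-side buffer, never user memory. */
int toy_dev_ioctl(unsigned cmd, void *buf)
{
    if (cmd == TOY_GETSPEED) {
        *(int *)buf = toy_speed;
        return 0;
    }
    if (cmd == TOY_SETSPEED) {
        if (*(int *)buf <= 0)
            return -1;                  /* device-specific validation */
        toy_speed = *(int *)buf;
        return 0;
    }
    return -1;                          /* unknown command */
}

/* Upper level: decode, allocate, copy in, dispatch, copy out. */
int toy_ioctl(toy_ioctl_fn lower, unsigned cmd, void *user_data)
{
    size_t size = TOY_SIZE(cmd);
    void *buf;
    int error;

    assert(lower != NULL);
    buf = calloc(1, size ? size : 1);   /* zeroed, as for output-only calls */
    if (buf == NULL)
        return -1;
    if (cmd & TOY_IN)
        memcpy(buf, user_data, size);   /* copy in the input parameters */
    error = lower(cmd, buf);
    if (error == 0 && (cmd & TOY_OUT))
        memcpy(user_data, buf, size);   /* copy the results back out */
    free(buf);
    return error;
}
```

Because the size and direction travel in the command word, the upper level can move the parameters without knowing what any particular command means.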
File-Descriptor Locking
Early UNIX systems had no provision for locking files. Processes that needed to
synchronize access to a file had to use a separate lock file. A process would try to
create a lock file. If the creation succeeded, then the process could proceed with
its update; if the creation failed, the process would wait, and then try again. This
mechanism had three drawbacks:
1. Processes consumed CPU time by looping over attempts to create locks.
2. Locks left lying around because of system crashes had to be removed (normally
in a system-startup command script).
3. Processes running as the special system-administrator user, the superuser, are
always permitted to create files, and so were forced to use a different mechanism.
Although it is possible to work around all these problems, the solutions are not
straightforward, so a mechanism for locking files was added in 4.2BSD.
The most general locking schemes allow multiple processes to update a file
concurrently. Several of these techniques are discussed in [Peterson, 1983]. A
simpler technique is to serialize access to a file with locks. For standard system
applications, a mechanism that locks at the granularity of a file is sufficient. So,
4.2BSD and 4.3BSD provided only a fast whole-file locking mechanism. The
semantics of these locks include allowing locks to be inherited by child processes
and releasing locks only on the last close of a file.
Certain applications require the ability to lock pieces of a file. Locking facilities
that support a byte-level granularity are well understood [Bass, 1981]. Unfortunately,
they are not powerful enough to be used by database systems that require
nested hierarchical locks, but are complex enough to require a large and cumber­
some implementation compared to the simpler whole-file locks. Because byte­
range locks are mandated by the POSIX standard, the developers added them to
4.4BSD reluctantly. The semantics of byte-range locks come from the lock's ini­
tial implementation in System V, which included releasing all locks held by a pro­
cess on a file every time a close system call was done on a descriptor referencing
that file. The 4.2BSD whole-file locks are removed only on the last close. A prob­
lem with the POSIX semantics is that an application can lock a file, then call a
library routine that opens, reads, and closes the locked file. Calling the library
routine will have the unexpected effect of releasing the locks held by the applica­
tion. Another problem is that a file must be open for writing to be allowed to get
an exclusive lock. A process that does not have permission to open a file for writ­
ing cannot get an exclusive lock on that file. To avoid these problems, yet remain
POSIX compliant, 4.4BSD provides separate interfaces for byte-range locks and
whole-file locks. The byte-range locks follow the POSIX semantics; the whole-file
locks follow the traditional 4.2BSD semantics. The two types of locks can be used
concurrently; they will serialize against each other properly.
Both whole-file locks and byte-range locks use the same implementation; the
whole-file locks are implemented as a range lock over an entire file. The kernel
handles the other differing semantics between the two implementations by having
the byte-range locks be applied to processes whereas the whole-file locks are
applied to descriptors. Because descriptors are shared with child processes, the
whole-file locks are inherited. Because the child process gets its own process
structure, the byte-range locks are not inherited. The last-close versus every-close
semantics are a small bit of special-case code in the close routine that checks
whether the underlying object is a process or a descriptor. It releases locks on
every call if the lock is associated with a process, and only when the reference
count drops to zero if the lock is associated with a descriptor.
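From a program's point of view, the process-associated byte-range locks are reached through the POSIX fcntl interface. The following is a minimal sketch using the standard F_SETLK command and struct flock; the wrapper name range_lock is an assumption for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Apply or remove a byte-range lock without blocking.
 * type is F_RDLCK (shared), F_WRLCK (exclusive), or F_UNLCK (release).
 * A len of zero means "from start to the end of the file".
 * Returns 0 on success, or -1 with errno set if the lock is unavailable.
 */
int range_lock(int fd, short type, off_t start, off_t len)
{
    struct flock fl = {
        .l_type = type,
        .l_whence = SEEK_SET,
        .l_start = start,
        .l_len = len,
    };

    assert(type == F_RDLCK || type == F_WRLCK || type == F_UNLCK);
    return fcntl(fd, F_SETLK, &fl);
}
```

A lock taken this way belongs to the calling process, so under the POSIX semantics described above it is released as soon as the process closes any descriptor that references the file.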
Locking schemes can be classified according to the extent that they are
enforced. A scheme in which locks are enforced for every process without choice
is said to use mandatory locks, whereas a scheme in which locks are enforced for
only those processes that request them is said to use advisory locks. Clearly, advisory
locks are effective only when all programs accessing a file use the locking
scheme. With mandatory locks, there must be some override policy implemented
in the kernel. With advisory locks, the policy is left to the user programs. In the
4.4BSD system, programs with superuser privilege are allowed to override any
protection scheme. Because many of the programs that need to use locks must
also run as the superuser, 4.2BSD implemented advisory locks, rather than creating
an additional protection scheme that was inconsistent with the UNIX philosophy or
that could not be used by privileged programs. The use of advisory locks carried
over to the POSIX specification of byte-range locks and is retained in 4.4BSD.
The 4.4BSD file-locking facilities allow cooperating programs to apply advisory
shared or exclusive locks on ranges of bytes within a file. Only one process
may have an exclusive lock on a byte range, whereas multiple shared locks may be
present. Shared and exclusive locks cannot both be present on a byte range at the
same time. If any lock is requested when another process holds an exclusive lock,
or an exclusive lock is requested when another process holds any lock, the lock
request will block until the lock can be obtained. Because shared and exclusive
locks are only advisory, even if a process has obtained a lock on a file, another
process may access the file if it ignores the locking mechanism.
So that there are no races between creating and locking a file, a lock can be
requested as part of opening a file. Once a process has opened a file, it can manip­
ulate locks without needing to close and reopen the file. This feature is useful, for
example, when a process wishes to apply a shared lock, to read information, to
determine whether an update is required, then to apply an exclusive lock and to
update the file.
A request for a lock will cause a process to block if the lock cannot be
obtained immediately. In certain instances, this blocking is unsatisfactory. For
example, a process that wants only to check whether a lock is present would
require a separate mechanism to find out this information. Consequently, a process
can specify that its locking request should return with an error if a lock cannot
be obtained immediately. Being able to request a lock conditionally is useful
to daemon processes that wish to service a spooling area. If the first instance of
the daemon locks the directory where spooling takes place, later daemon pro­
cesses can easily check to see whether an active daemon exists. Since locks exist
only while the locking processes exist, locks can never be left active after the
processes exit or if the system crashes.
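A conditional request of this kind can be made with the BSD flock call and its LOCK_NB option. The sketch below locks an ordinary file rather than a spooling directory, and the function name try_exclusive_lock is invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Try to take an exclusive whole-file lock without blocking.
 * Returns 1 if the lock is now held through fd, 0 if another open of
 * the file already holds a conflicting lock, and -1 on any other error.
 */
int try_exclusive_lock(int fd)
{
    assert(fd >= 0);
    if (flock(fd, LOCK_EX | LOCK_NB) == 0)
        return 1;
    return (errno == EWOULDBLOCK) ? 0 : -1;
}
```

A daemon could call this once at startup on a well-known file: a return of 0 suggests that an active instance already exists, and the lock disappears on its own when the holder closes the file or exits.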
The implementation of locks is done on a per-filesystem basis. The implementation
for the local filesystems is described in Section 7.5. A network-based
filesystem has to coordinate locks with a central lock manager that is usually
located on the server exporting the filesystem. Client lock requests must be sent to
the lock manager. The lock manager arbitrates among lock requests from processes
running on its server and from the various clients to which it is exporting
the filesystem. The most complex operation for the lock manager is recovering
lock state when a client or server is rebooted or becomes partitioned from the rest
of the network. The 4.4BSD system does not have a network-based lock manager.
Multiplexing I/O on Descriptors
A process sometimes wants to handle I/O on more than one descriptor. For example,
consider a remote login program that wants to read data from the keyboard
and to send them through a socket to a remote machine. This program also wants
to read data from the socket connected to the remote end and to write them to the
screen. If a process makes a read request when there are no data available, it is
normally blocked in the kernel until the data become available. In our example,
blocking is unacceptable. If the process reads from the keyboard and blocks, it
will be unable to read data from the remote end that are destined for the screen.
The user does not know what to type until more data arrive from the remote end;
hence, the session deadlocks. Conversely, if the process reads from the remote
end when there are no data for the screen, it will block and will be unable to read
from the terminal. Again, deadlock would occur if the remote end were waiting
for output before sending any data. There is an analogous set of problems to
blocking on the writes to the screen or to the remote end. If a user has stopped
output to their screen by typing the stop character, the write will block until they
type the start character. In the meantime, the process cannot read from the
keyboard to find out that the user wants to flush the output.
Historic UNIX systems have handled the multiplexing problem by using mul­
tiple processes that communicate through pipes or some other interprocess-com­
munication facility, such as shared memory. This approach, however, can result in
significant overhead as a result of context switching among the processes if the
cost of processing input is small compared to the cost of a context switch. Fur­
thermore, it is often more straightforward to implement applications of this sort in
a single process. For these reasons, 4.4BSD provides three mechanisms that permit
multiplexing I/O on descriptors: polling I/O, nonblocking I/O, and signal-driven
I/O. Polling is done with the select system call, described in the next subsection.
Operations on nonblocking descriptors complete immediately, partially
complete an input or output operation and return a partial count, or return an error
that shows that the operation could not be completed at all. Descriptors that have
signaling enabled cause the associated process or process group to be notified
when the I/O state of the descriptor changes.
There are four possible alternatives that avoid the blocking problem:
1. Set all the descriptors into nonblocking mode. The process can then try operations
on each descriptor in turn, to find out which descriptors are ready to do
I/O. The problem with this approach is that the process must run continuously
to discover whether there is any I/O to be done.
2. Enable all descriptors of interest to signal when I/O can be done. The process
can then wait for a signal to discover when it is possible to do I/O. The drawback
to this approach is that signals are expensive to catch. Hence, signal-driven
I/O is impractical for applications that do moderate to large amounts of
I/O.
3. Have the system provide a method for asking which descriptors are capable of
doing I/O. If none of the requested descriptors are ready, the system can put
the process to sleep until a descriptor becomes ready. This approach avoids
the problem of deadlock, because the process will be awakened whenever it is
possible to do I/O, and will be told which descriptor is ready. The drawback is
that the process must do two system calls per operation: one to poll for the
descriptor that is ready to do I/O and another to do the operation itself.
4. Have the process notify the system of all the descriptors that it is interested in
reading, then do a blocking read on that set of descriptors. When the read
returns, the process is notified on which descriptor the read completed. The
benefit of this approach is that the process does a single system call to specify
the set of descriptors, then loops doing only reads [Accetta et al., 1986].
The first approach is available in 4.4BSD as nonblocking I/O. It typically is
used for output descriptors, because the operation typically will not block. Rather
than doing a select, which nearly always succeeds, followed immediately by a
write, it is more efficient to try the write and revert to using select only during
periods when the write returns a blocking error. The second approach is available
in 4.4BSD as signal-driven I/O. It typically is used for rare events, such as for the
arrival of out-of-band data on a socket. For such rare events, the cost of handling
an occasional signal is lower than that of checking constantly with select to find
out whether there are any pending data.
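The write-first strategy for output descriptors might look like the following sketch. It assumes that the descriptor has already been placed in nonblocking mode, and the function name write_fully is invented for the example; the system calls themselves (write, select, fcntl) are the standard ones.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Write all of buf to a nonblocking descriptor. The write is attempted
 * first; select is used only while the write reports that it would block.
 * Returns 0 on success and -1 on error.
 */
int write_fully(int fd, const char *buf, size_t len)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    assert(fcntl(fd, F_GETFL) & O_NONBLOCK);

    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n > 0) {                    /* full or partial write */
            buf += n;
            len -= (size_t)n;
        } else if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            fd_set wfds;                /* wait until space is available */

            FD_ZERO(&wfds);
            FD_SET(fd, &wfds);
            if (select(fd + 1, NULL, &wfds, NULL, NULL) == -1 &&
                errno != EINTR)
                return -1;
        } else if (n == -1 && errno != EINTR) {
            return -1;
        }
    }
    return 0;
}
```

In the common case the loop makes one system call per write; the extra select call is paid for only during the periods when the descriptor cannot accept data.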
The third approach is available in 4.4BSD via the select system call.
Although less efficient than the fourth approach, it is a more general interface. In
addition to handling reading from multiple descriptors, it handles writes to multiple
descriptors, notification of exceptional conditions, and timeout when no I/O is
possible.
The select interface takes three masks of descriptors to be monitored, corresponding
to interest in reading, writing, and exceptional conditions. In addition, it
takes a timeout value for returning from select if none of the requested descriptors
becomes ready before a specified amount of time has elapsed. The select call
returns the same three masks of descriptors after modifying them to show the
descriptors that are able to do reading, to do writing, or to provide an exceptional
condition. If none of the descriptors has become ready in the timeout interval,
select returns showing that no descriptors are ready for I/O.
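A minimal use of this interface for a single descriptor is sketched below. The wrapper name wait_readable and its millisecond timeout argument are assumptions made for the example; the fd_set macros and the select call are the standard ones.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait until fd can be read without blocking, or until timeout_ms
 * milliseconds have elapsed. Returns 1 if readable, 0 on timeout,
 * and -1 on error.
 */
int wait_readable(int fd, long timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int n;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&rfds);                     /* read mask: just this descriptor */
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* The write and exceptional-condition masks are not used here. */
    n = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (n <= 0)
        return n;                       /* 0: timed out; -1: error */
    return FD_ISSET(fd, &rfds) ? 1 : 0; /* mask was rewritten by select */
}
```

Because select overwrites the masks with its results, a caller that loops must rebuild them before every call.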
Implementation of Select
The implementation of select, like that of much other kernel functionality, is
divided into a generic top layer and many device- or socket-specific bottom pieces.
At the top level, select decodes the user's request and then calls the appropri­
ate lower-level select functions. The top level takes the following steps:
1. Copy and validate the descriptor masks for read, write, and exceptional
conditions. Doing validation requires checking that each requested descriptor is
currently open by the process.
2. Set the selecting flag for the process.
3. For each descriptor in each mask, poll the device by calling its select routine.
If the descriptor is not able to do the requested I/O operation, the device select
routine is responsible for recording that the process wants to do I/O. When
I/O becomes possible for the descriptor (usually as a result of an interrupt
from the underlying device), a notification must be issued for the selecting
process.
4. Because the selection process may take a long time, the kernel does not want
to block out I/O during the time it takes to poll all the requested descriptors.
Instead, the kernel arranges to detect the occurrence of I/O that may affect the
status of the descriptors being polled. When such I/O occurs, the select-notification
routine, selwakeup(), clears the selecting flag. If the top-level select
code finds that the selecting flag for the process has been cleared while it has
been doing the polling, and it has not found any descriptors that are ready to
do an operation, then the top level knows that the polling results are incomplete
and must be repeated starting at step 2. The other condition that requires
the polling to be repeated is a collision. Collisions arise when multiple processes
attempt to select on the same descriptor at the same time. Because the
select routines have only enough space to record a single process identifier,
they cannot track multiple processes that need to be awakened when I/O is
possible. In such rare instances, all processes that are selecting must be
awakened.
5. If no descriptors are ready and the select specified a timeout, the kernel posts a
timeout for the requested amount of time. The process goes to sleep, giving
the address of the kernel global variable selwait. Normally, a descriptor will
become ready and the process will be notified by selwakeup(). When the process
is awakened, it repeats the polling process and returns the available
descriptors. If none of the descriptors become ready before the timer expires,
the process returns with a timed-out error and an empty list of available
descriptors.
Each of the low-level polling routines in the terminal drivers and the network pro­
tocols follows roughly the same set of steps. A piece of the select routine for a
terminal driver is shown in Fig. 6.5. The steps involved in a device select routine
are as follows:
1. The socket or device select entry is called with a flag of FREAD, FWRITE, or 0
(exceptional condition). The example in Fig. 6.5 shows the FREAD case; the
other cases are similar.
2. The poll returns success if the requested operation is possible. In Fig. 6.5, it is
possible to read a character if the number of unread characters is greater than
zero. In addition, if the carrier has dropped, it is possible to get a read error.
A return from select does not necessarily mean that there are data to read;
rather, it means that a read will not block.
3. If the requested operation is not possible, the process identifier is recorded
with the socket or device for later notification. In Fig. 6.5, the recording is
done by the selrecord() routine. That routine first checks to see whether the
current process was the one that was recorded previously for this record; if it
was, then no further action is needed. The second if statement checks for a
collision. The first part of the conjunction checks to see whether any process
identifier is recorded already. If there is none, then there is no collision. If
there is a process identifier recorded, it may remain from an earlier call on
select by a process that is no longer selecting because one of its other
descriptors became ready. If that process is still selecting, it will be sleeping on
selwait (when it is sleeping, the address of the sleep event is stored in p_wchan).
If it is sleeping on some other event, its p_wchan will have a value different
from that of selwait. If it is running, its p_wchan will be zero. If it is not
sleeping on selwait, there is no collision, and the process identifier is saved in
si_pid.
4. If multiple processes are selecting on the same socket or device, a collision is
recorded for the socket or device, because the structure has only enough space
for a single process identifier. In Fig. 6.5, a collision occurs when the second
if statement in the selrecord() function is true. There is a tty structure for each
terminal line (or pseudoterminal) on the machine. Normally, only one process
at a time is selecting to read from the terminal, so collisions are rare.

    struct selinfo {
            pid_t   si_pid;     /* process to be notified */
            short   si_flags;   /* SI_COLL - collision occurred */
    };

    struct tty *tp;

    case FREAD:
            if (nread > 0 || (tp->t_state & TS_CARR_ON) == 0)
                    return (1);
            selrecord(curproc, &tp->t_rsel);
            return (0);

    selrecord(selector, sip)
            struct proc *selector;
            struct selinfo *sip;
    {
            struct proc *p;
            pid_t mypid;

            mypid = selector->p_pid;
            if (sip->si_pid == mypid)
                    return;
            if (sip->si_pid && (p = pfind(sip->si_pid)) &&
                p->p_wchan == (caddr_t)&selwait)
                    sip->si_flags |= SI_COLL;
            else
                    sip->si_pid = mypid;
    }

Figure 6.5 Select code to check for data to read in a terminal driver.
Selecting processes must be notified when 1/0 becomes possible. The steps
involved in a status change awakening a process are as follows:
1. The device or socket detects a change in status. Status changes normally
occur because of an interrupt (e.g., a character coming in from a keyboard or a
packet arriving from the network).
2. Selwakeup() is called with a pointer to the selinfo structure used by
selrecord() to record the process identifier, and with a flag showing whether a
collision occurred.
3. If the process is sleeping on selwait, it is made runnable (or is marked ready, if
it is stopped). If the process is sleeping on some event other than selwait, it is
not made runnable. A spurious call to selwakeup() can occur when the process
returns from select to begin processing one descriptor and then another
descriptor on which it had been selecting also becomes ready.
4. If the process has its selecting flag set, the flag is cleared so that the kernel will
know that its polling results are invalid and must be recomputed.
5. If a collision has occurred, all sleepers on selwait are awakened to rescan to
see whether one of their descriptors became ready. Awakening all selecting
processes is necessary because the selrecord() routine could not record all the
processes that needed to be awakened. Hence, it has to wake up all processes
that could possibly have been interested. Empirically, collisions occur infrequently.
If they were a frequent occurrence, it would be worthwhile to store
multiple process identifiers in the selinfo structure.
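The wakeup steps can be modeled with a small user-level simulation. This is a toy and not the kernel routine: the toy_ structures, the four-entry process table, and the choice to return a count of awakened processes are all assumptions made so that the logic can be run and inspected. The model also assumes that the selecting flag is cleared only when the recorded process is not sleeping on selwait.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SI_COLL 0x1
#define TOY_NPROC   4

struct toy_proc {
    int pid;                /* 0 marks an unused table slot */
    const void *wchan;      /* event being slept on; NULL if running */
    int selecting;          /* the selecting flag */
};

struct toy_selinfo {
    int si_pid;             /* process to be notified */
    int si_flags;           /* TOY_SI_COLL if a collision occurred */
};

static int toy_selwait;     /* only its address matters: the sleep event */
static struct toy_proc toy_procs[TOY_NPROC];

static struct toy_proc *toy_pfind(int pid)
{
    for (int i = 0; i < TOY_NPROC; i++)
        if (toy_procs[i].pid == pid)
            return &toy_procs[i];
    return NULL;
}

/* Returns the number of processes made runnable. */
int toy_selwakeup(struct toy_selinfo *sip)
{
    struct toy_proc *p;
    int woken = 0;

    assert(sip != NULL);
    if (sip->si_pid == 0)
        return 0;                           /* nobody is waiting */
    if (sip->si_flags & TOY_SI_COLL) {      /* collision: wake every selector */
        sip->si_flags &= ~TOY_SI_COLL;
        for (int i = 0; i < TOY_NPROC; i++)
            if (toy_procs[i].wchan == &toy_selwait) {
                toy_procs[i].wchan = NULL;
                woken++;
            }
    }
    p = toy_pfind(sip->si_pid);
    sip->si_pid = 0;
    if (p != NULL) {
        if (p->wchan == &toy_selwait) {     /* sleeping in select */
            p->wchan = NULL;
            woken++;
        } else if (p->selecting) {
            p->selecting = 0;               /* polling results are now stale */
        }
    }
    return woken;
}
```

The collision branch shows why collisions are costly: every process sleeping on selwait is disturbed, whether or not it was interested in this particular descriptor.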
Movement of Data Inside the Kernel
Within the kernel, I/O data are described by an array of vectors. Each I/O vector
or iovec has a base address and a length. The I/O vectors are identical to the I/O
vectors used by the readv and writev system calls.
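At user level, the same base-and-length vectors appear in calls such as writev. The short sketch below gathers two separate buffers into one write; the function name write_two is invented for the example, whereas struct iovec and writev are the standard interface.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write a header and a body with a single gathering system call. */
ssize_t write_two(int fd, const char *hdr, const char *body)
{
    struct iovec iov[2];

    assert(hdr != NULL && body != NULL);
    iov[0].iov_base = (void *)hdr;      /* base address of first piece */
    iov[0].iov_len = strlen(hdr);       /* and its length */
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = strlen(body);
    return writev(fd, iov, 2);          /* returns total bytes written */
}
```

The two pieces need not be adjacent in memory, which spares the caller from copying them into one contiguous buffer first.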
The kernel maintains another structure, called a uio structure, that holds additional
information about the I/O operation. A sample uio structure is shown in
Fig. 6.6; it contains
• A pointer to the iovec array
• The number of elements in the iovec array
• The file offset at which the operation should start
• The sum of the lengths of the I/O vectors
• A flag showing whether the source and destination are both within the kernel, or
whether the source and destination are split between the user and the kernel
• A flag showing whether the data are being copied from the uio structure to the
kernel (UIO_WRITE) or from the kernel to the uio structure (UIO_READ)
• A pointer to the process whose data area is described by the uio structure (the
pointer is NULL if the uio structure describes an area within the kernel)
Figure 6.6 A uio structure, showing struct uio and the struct iov[] array that it references.
All I/O within the kernel is described with iovec and uio structures. System calls
such as read and write that are not passed an iovec create a uio to describe their
arguments; this uio structure is passed to the lower levels of the kernel to specify
the parameters of an I/O operation. Eventually, the uio structure reaches the part
of the kernel responsible for moving the data to or from the process address space:
the filesystem, the network, or a device driver. In general, these parts of the kernel
do not interpret uio structures directly. Instead, they arrange a kernel buffer to
hold the data, then use uiomove() to copy the data to or from the buffer or buffers
described by the uio structure. The uiomove() routine is called with a pointer to a
kernel data area, a data count, and a uio structure. As it moves data, it updates the
counters and pointers of the iovec and uio structures by a corresponding amount.
If the kernel buffer is not as large as the areas described by the uio structure, the
uio structure will point to the part of the process address space just beyond the
location completed most recently. Thus, while servicing a request, the kernel may
call uiomove() multiple times, each time giving a pointer to a new kernel buffer
for the next block of data.
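The bookkeeping done by such a routine can be sketched at user level. The code below is a simplified model rather than the kernel's uiomove(): struct toy_uio carries only a subset of the fields listed earlier, the toy_ names are invented, and memcpy stands in for the copy between kernel and process address spaces.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

enum toy_rw { TOY_READ, TOY_WRITE };    /* kernel-to-uio, uio-to-kernel */

struct toy_uio {
    struct iovec *uio_iov;      /* vector currently being filled or drained */
    int uio_iovcnt;             /* vectors remaining, including this one */
    off_t uio_offset;           /* file offset of the next byte */
    size_t uio_resid;           /* sum of the remaining vector lengths */
    enum toy_rw uio_rw;
};

/* Move up to n bytes between kbuf and the areas that uio describes. */
int toy_uiomove(char *kbuf, size_t n, struct toy_uio *uio)
{
    while (n > 0 && uio->uio_resid > 0) {
        struct iovec *iov = uio->uio_iov;
        size_t cnt = iov->iov_len < n ? iov->iov_len : n;

        if (cnt == 0) {                 /* vector used up: step to the next */
            assert(uio->uio_iovcnt > 1);
            uio->uio_iov++;
            uio->uio_iovcnt--;
            continue;
        }
        if (uio->uio_rw == TOY_READ)
            memcpy(iov->iov_base, kbuf, cnt);
        else
            memcpy(kbuf, iov->iov_base, cnt);

        /* Advance every counter and pointer by the amount moved. */
        iov->iov_base = (char *)iov->iov_base + cnt;
        iov->iov_len -= cnt;
        uio->uio_resid -= cnt;
        uio->uio_offset += (off_t)cnt;
        kbuf += cnt;
        n -= cnt;
    }
    return 0;
}
```

Because the structures remember how far the transfer has progressed, a caller can hand over one kernel buffer after another and each call resumes where the previous one stopped.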
Character device drivers that do not copy data from the process generally do
not interpret the uio structure. Instead, there is one low-level kernel routine that
arranges a direct transfer to or from the address space of the process. Here, a
separate I/O operation is done for each iovec element, calling back to the driver with
one piece at a time.
Historic UNIX systems used global variables in the user area to describe I/O.
This approach has several problems. The lower levels of the kernel are not
reentrant, because there is exactly one context to describe an I/O operation. The
system cannot do scatter-gather I/O, since there is only a single base and size variable
per process. Finally, the bottom half of the kernel cannot do I/O, because it does
not have a user area.
The one part of the 4.4BSD kernel that does not use uio structures is the
block-device drivers. The decision not to change these interfaces to use uio
structures was largely pragmatic. The developers would have had to change many
drivers. The existing buffer interface was already decoupled from the user
structure; hence, the interface was already reentrant and could be used by the bottom
half of the kernel. The only gain was to allow scatter-gather I/O. The kernel does
not need scatter-gather operations on block devices, however, and user operations
on block devices are done through the buffer cache.
The Virtual-Filesystem Interface
In 4.3BSD, the file entries directly referenced the local filesystem inode. An inode
is a data structure that describes the contents of a file; it is more fully described in
Section 7.2. This approach worked fine when there was a single filesystem
implementation. However, with the advent of multiple filesystem types, the architecture
had to be generalized. The new architecture had to support importing of filesystems
from other machines, including other machines that were running different
operating systems.
One alternative would have been to connect the multiple filesystems into the
system as different file types. However, this approach would have required massive
restructuring of the internal workings of the system, because current directories,
references to executables, and several other interfaces used inodes instead of
file entries as their point of reference. Thus, it was easier and more logical to add
a new object-oriented layer to the system below the file entry and above the inode.
This new layer was first implemented by Sun Microsystems, which called it the
virtual-node, or vnode, layer. Interfaces in the system that had referred previously
to inodes were changed to reference generic vnodes. A vnode used by a local
filesystem would refer to an inode. A vnode used by a remote filesystem would
refer to a protocol control block that described the location and naming information
necessary to access the remote file.
Contents of a Vnode
The vnode is an extensible object-oriented interface. It contains information that
is generically useful independent of the underlying filesystem object that it
represents. The information stored in a vnode includes the following:
• Flags are used for locking the vnode and identifying generic attributes. An
example generic attribute is a flag to show that a vnode represents an object that
is the root of a filesystem.
• The various reference counts include the number of file entries that are open for
reading and/or writing that reference the vnode, the number of file entries that
are open for writing that reference the vnode, and the number of pages and
buffers that are associated with the vnode.
• A pointer to the mount structure describes the filesystem that contains the object
represented by the vnode.
• Various information is used to do file read-ahead.
• A reference to an NFS lease is included; see Section 9.3.
• A reference to state about special devices, sockets, and FIFOs is included.
• There is a pointer to the set of vnode operations defined for the object. These
operations are described in the next subsection.
• A pointer to private information needed for the underlying object is included.
For the local filesystem, this pointer will reference an inode; for NFS, it will
reference an nfsnode.
• The type of the underlying object (e.g., regular file, directory, character device,
etc.) is given. The type information is not strictly necessary, since a vnode client
could always call a vnode operation to get the type of the underlying object.
However, because the type often is needed, the type of underlying objects does
not change, and it takes time to call through the vnode interface, the object type
is cached in the vnode.
• There are clean and dirty buffers associated with the vnode. Each valid buffer in
the system is identified by its associated vnode and the starting offset of its data
within the object that the vnode represents. All the buffers that have been modi­
fied, but have not yet been written back, are stored on their vnode dirty-buffer
list. All buffers that have not been modified, or have been written back since
they were last modified, are stored on their vnode clean list. Having all the dirty
buffers for a vnode grouped onto a single list makes the cost of doing an fsync
system call to flush all the dirty blocks associated with a file proportional to the
amount of dirty data. In 4.3BSD, the cost was proportional to the smaller of the
size of the file or the size of the buffer pool. The list of clean buffers is used to
free buffers when a file is deleted. Since the file will never be read again, the
kernel can immediately cancel any pending I/O on its dirty buffers, and reclaim
all its clean and dirty buffers and place them at the head of the buffer free list,
ready for immediate reuse.
• A count is kept of the number of buffer write operations in progress. To speed
the flushing of dirty data, the kernel does this operation by doing asynchronous
writes on all the dirty buffers at once. For local filesystems, this simultaneous
push causes all the buffers to be put into the disk queue, so that they can be
sorted into an optimal order to minimize seeking. For remote filesystems, this
simultaneous push causes all the data to be presented to the network at once, so
that the network can maximize its throughput. System calls that cannot return until the
data are on stable store (such as fsync) can sleep on the count of pending output
operations, waiting for the count to reach zero.
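The fields listed above can be pictured as a C structure. The following is a minimal sketch, not the kernel's actual declaration: the field names, the V_ROOT flag value, and the helper routines are assumptions made for illustration, and moving a buffer to the clean list as soon as its write starts is a simplification. It shows why flushing costs time proportional to the number of dirty buffers, and how a count of pending writes gives fsync something to wait on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct buf {
    struct buf *b_next;    /* next buffer on the same vnode list */
    long        b_offset;  /* starting offset within the object */
};

struct vnode {
    unsigned    v_flags;      /* locking and generic attributes */
    int         v_usecount;   /* file entries referencing the vnode */
    int         v_writecount; /* file entries open for writing */
    int         v_numoutput;  /* buffer writes in progress */
    struct buf *v_cleanlist;  /* unmodified or written-back buffers */
    struct buf *v_dirtylist;  /* modified, not yet written back */
};

#define V_ROOT 0x0001  /* assumed value: object is a filesystem root */

/* Put a buffer at the head of one of a vnode's lists. */
static void buf_insert(struct buf **head, struct buf *bp)
{
    bp->b_next = *head;
    *head = bp;
}

/*
 * Start a write on every dirty buffer and move it to the clean list.
 * Returns the number of buffers visited: the work is proportional to
 * the amount of dirty data, not to the file or buffer-pool size.
 */
static int vnode_flush(struct vnode *vp)
{
    int visited = 0;
    struct buf *bp;

    while ((bp = vp->v_dirtylist) != NULL) {
        vp->v_dirtylist = bp->b_next;
        vp->v_numoutput++;            /* asynchronous write started */
        buf_insert(&vp->v_cleanlist, bp);
        visited++;
    }
    return visited;
}

/* Called as each write completes; fsync would sleep until zero. */
static int vnode_write_done(struct vnode *vp)
{
    assert(vp->v_numoutput > 0);
    return --vp->v_numoutput;
}
```

A caller would invoke vnode_flush() once and then wait until vnode_write_done() has brought the count back to zero.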
The position of vnodes within the system was shown in Fig. 6.1. The vnode
itself is connected into several other structures within the kernel, as shown in
Fig. 6.7. Each mounted filesystem within the kernel is represented by a generic
mount structure that includes a pointer to a filesystem-specific control block. All
the vnodes associated with a specific mount point are linked together on a list
headed by this generic mount structure. Thus, when it is doing a sync system call
for a filesystem, the kernel can traverse this list to visit all the files active within
that filesystem. Also shown in the figure are the lists of clean and dirty buffers
associated with each vnode. Finally, there is a free list that links together all the
vnodes in the system that are not being used actively. The free list is used when a
filesystem needs to allocate a new vnode, so that the latter can open a new file; see
Section 6.4.
Vnode Operations
Vnodes are designed as an object-oriented interface. Thus, the kernel manipulates
them by passing requests to the underlying object through a set of defined operations.
Because of the many varied filesystems that are supported in 4.4BSD, the
set of operations defined for vnodes is both large and extensible. Unlike the original
Sun Microsystems vnode implementation, that in 4.4BSD allows dynamic
addition of vnode operations at system boot time. As part of the booting process,
each filesystem registers the set of vnode operations that it is able to support. The
kernel then builds a table that lists the union of all operations supported by any
filesystem. From that table, it builds an operations vector for each filesystem.
Supported operations are filled in with the entry point registered by the filesystem.
Filesystems may opt to have unsupported operations filled in with either a default
routine (typically a routine to bypass the operation to the next lower layer; see
Section 6.7), or a routine that returns the characteristic error "operation not supported"
[Heidemann & Popek, 1994].
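The construction of a per-filesystem operations vector can be sketched in a few lines of C. This is an illustration under stated assumptions rather than the kernel's code: the operation set is fixed at three entries instead of being computed as a union at boot time, and every identifier (build_vector, vop_notsupp, the rofs_ routines, bypass) is hypothetical. EOPNOTSUPP is the standard errno value for "operation not supported".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the union of all registered operations. */
enum { OP_LOOKUP, OP_READ, OP_WRITE, OP_MAX };

typedef int (*vop_t)(void *arg);

struct opdesc { int op; vop_t fn; };   /* one registered operation */

/* Routine used for operations that a filesystem does not supply. */
static int vop_notsupp(void *arg) { (void)arg; return EOPNOTSUPP; }

/*
 * Build the operations vector for one filesystem: registered entry
 * points first, then "dflt" (or the error routine when dflt is NULL)
 * for every slot that is still empty.
 */
static void build_vector(vop_t vec[OP_MAX], const struct opdesc *reg,
    size_t nreg, vop_t dflt)
{
    int i;
    size_t j;

    for (i = 0; i < OP_MAX; i++)
        vec[i] = NULL;
    for (j = 0; j < nreg; j++)
        vec[reg[j].op] = reg[j].fn;
    for (i = 0; i < OP_MAX; i++)
        if (vec[i] == NULL)
            vec[i] = dflt != NULL ? dflt : vop_notsupp;
}

/* Entry points for a hypothetical read-only filesystem. */
static int rofs_lookup(void *arg) { (void)arg; return 0; }
static int rofs_read(void *arg)   { (void)arg; return 0; }

/* Hypothetical default that would pass the call to a lower layer. */
static int bypass(void *arg)      { (void)arg; return -1; }
```

A filesystem that registers only lookup and read ends up with a write slot that either returns EOPNOTSUPP or points at the bypass routine, depending on the default that it requested.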
Figure 6.7 Vnode linkages. D: dirty buffer; C: clean buffer.
In 4.3BSD, the local filesystem code provided both the semantics of the hier­
archical filesystem naming and the details of the on-disk storage management.
These functions are only loosely related. To enable experimentation with other
disk-storage techniques without having to reproduce the entire naming semantics,
4.4BSD splits the naming and storage code into separate modules. This split is
evident at the vnode layer, where there is a set of operations defined for hierarchical
filesystem operations and a separate set of operations defined for storage of
variable-sized objects using a flat name space. About 60 percent of the traditional
filesystem code became the name-space management, and the remaining 40 per­
cent became the code implementing the on-disk file storage. The naming scheme
and its vnode operations are described in Chapter 7. The disk-storage scheme and
its vnode operations are explained in Chapter 8.
Pathname Translation
The translation of a pathname requires a series of interactions between the vnode
interface and the underlying filesystems. The pathname-translation process pro­
ceeds as follows:
1. The pathname to be translated is copied in from the user process or, for a
remote filesystem request, is extracted from the network buffer.
2. The starting point of the pathname is determined as either the root directory or
the current directory (see Section 2.7). The vnode for the appropriate direc­
tory becomes the lookup directory used in the next step.
3. The vnode layer calls the filesystem-specific lookup() operation, and passes to
that operation the remaining components of the pathname and the current
lookup directory. Typically, the underlying filesystem will search the lookup
directory for the next component of the pathname and will return the resulting
vnode (or an error if the name does not exist).
4. If an error is returned, the top level returns the error. If the pathname has been
exhausted, the pathname lookup is done, and the returned vnode is the result of
the lookup. If the pathname has not been exhausted, and the returned vnode is
not a directory, then the vnode layer returns the "not a directory" error. If
there are no errors, the top layer checks to see whether the returned directory
is a mount point for another filesystem. If it is, then the lookup directory
becomes the mounted filesystem; otherwise, the lookup directory becomes the
vnode returned by the lower layer. The lookup then iterates with step 3.
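Steps 2 through 4 can be sketched as a loop over pathname components. The code below is a toy model, not the kernel's translation routine: directories are small in-memory arrays, and the names translate, fs_lookup, v_mountedhere, and v_entries are assumptions made for the example. Step 1 (copying in the pathname) is omitted, and a mount on the starting directory itself is not handled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum vtype { VREG, VDIR };

struct vnode;
struct dirent_s { const char *name; struct vnode *vp; };

struct vnode {
    enum vtype       v_type;
    struct vnode    *v_mountedhere; /* root of fs mounted here, or NULL */
    struct dirent_s *v_entries;     /* directory contents (toy model) */
    size_t           v_nentries;
};

/* Filesystem-specific lookup(): resolve one component in one directory. */
static int fs_lookup(struct vnode *dir, const char *name, size_t len,
    struct vnode **out)
{
    size_t i;

    for (i = 0; i < dir->v_nentries; i++)
        if (strlen(dir->v_entries[i].name) == len &&
            memcmp(dir->v_entries[i].name, name, len) == 0) {
            *out = dir->v_entries[i].vp;
            return 0;
        }
    return ENOENT;
}

/* Iterate over the components, crossing mount points as they appear. */
static int translate(struct vnode *root, struct vnode *cwd,
    const char *path, struct vnode **out)
{
    struct vnode *dir = (*path == '/') ? root : cwd;   /* step 2 */
    struct vnode *vp = dir;
    int error;

    while (*path == '/')
        path++;
    while (*path != '\0') {
        size_t len = strcspn(path, "/");

        if (dir->v_type != VDIR)        /* more path, but not a directory */
            return ENOTDIR;
        error = fs_lookup(dir, path, len, &vp);         /* step 3 */
        if (error != 0)
            return error;                               /* step 4 */
        path += len;
        while (*path == '/')
            path++;
        /* Redirect the lookup into a filesystem mounted on vp. */
        while (vp->v_mountedhere != NULL)
            vp = vp->v_mountedhere;
        dir = vp;
    }
    *out = vp;
    return 0;
}
```

Because fs_lookup() sees only one component at a time, the loop regains control after every component and can notice a mount point that the underlying filesystem knows nothing about.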
Although it may seem inefficient to call through the vnode interface for each
pathname component, doing so usually is necessary. The reason is that the under­
lying filesystem does not know which directories are being used as mount points.
Since a mount point will redirect the lookup to a new filesystem, it is important
that the current filesystem not proceed past a mounted directory. Although it
might be possible for a local filesystem to be knowledgeable about which directo­
ries are mount points, it is nearly impossible for a server to know which of the
directories within its exported filesystems are being used as mount points by its
clients. Consequently, the conservative approach of traversing only a single pathname
component per lookup() call is used. There are a few instances where a
filesystem will know that there are no further mount points in the remaining path,
and will traverse the rest of the pathname. An example is crossing into a portal,
described in Section 6.7.
Exported Filesystem Services
The vnode interface has a set of services that the kernel exports from all the
filesystems supported under the interface. The first of these is the ability to sup­
port the update of generic mount options. These options include the following:
noexec  Do not execute any files on the filesystem. This option is often used
when a server exports binaries for a different architecture that cannot be
executed on the server itself. The kernel will even refuse to execute
shell scripts; if a shell script is to be run, its interpreter must be invoked
explicitly.
nosuid  Do not honor the set-user-id or set-group-id flags for any executables on
the filesystem. This option is useful when a filesystem of unknown origin
is mounted.
nodev  Do not allow any special devices on the filesystem to be opened. This
option is often used when a server exports device directories for a different
architecture. The values of the major and minor numbers are nonsensical
on the server.
Together, these options allow reasonably secure mounting of untrusted or for­
eign filesystems. It is not necessary to unmount and remount the filesystem to
change these flags; they may be changed while a filesystem is mounted. In addi­
tion, a filesystem that is mounted read-only can be upgraded to allow writing.
Conversely, a filesystem that allows writing may be downgraded to read-only pro­
vided that no files are open for modification. The system administrator can
forcibly downgrade the filesystem to read-only by requesting that any files open
for writing have their access revoked.
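The update rules in this paragraph can be sketched as a single routine. Everything in the sketch is an assumption made for illustration: the flag names and values, the m_writers count, and the M_FORCE request bit are hypothetical, and setting the writer count to zero merely stands in for revoking access to files that are open for writing.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flags for read-only mounting and the three restrictions. */
#define M_RDONLY 0x01
#define M_NOEXEC 0x02
#define M_NOSUID 0x04
#define M_NODEV  0x08
#define M_FORCE  0x10   /* request only: revoke writers if needed */

struct mount {
    unsigned m_flags;     /* options currently in effect */
    int      m_writers;   /* files currently open for modification */
};

/* Change the options of a filesystem without unmounting it. */
static int mount_update(struct mount *mp, unsigned newflags)
{
    unsigned settable = M_RDONLY | M_NOEXEC | M_NOSUID | M_NODEV;

    /* A downgrade to read-only is refused while writers exist. */
    if ((newflags & M_RDONLY) && !(mp->m_flags & M_RDONLY) &&
        mp->m_writers > 0) {
        if (!(newflags & M_FORCE))
            return EBUSY;
        mp->m_writers = 0;   /* stands in for revoking write access */
    }
    mp->m_flags = newflags & settable;
    return 0;
}
```

Upgrading from read-only to read-write needs no check, so it falls through to the final assignment.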
Another service exported from the vnode interface is the ability to get infor­
mation about a mounted filesystem. The statfs system call returns a buffer that
gives the numbers of used and free disk blocks and inodes, along with the filesys­
tem mount point, and the device, location, or program from which the filesystem
is mounted. The getfsstat system call returns information about all the mounted
filesystems. This interface avoids the need to track the set of mounted filesystems
outside the kernel, as is done in many other UNIX variants.
6.6
Filesystem-Independent Services
The vnode interface not only supplies an object-oriented interface to the underlying
filesystems, but also provides a set of management routines that can be used
by the client filesystems. These facilities are described in this section.
When the final file-entry reference to a file is closed, the usage count on the
vnode drops to zero and the vnode interface calls the inactive() vnode operation.
The inactive() call notifies the underlying filesystem that the file is no longer
being used. The filesystem will often use this call to write dirty data back to the
file, but will not typically reclaim the buffers. The filesystem is permitted to cache
the file so that the latter can be reactivated quickly (i.e., without disk or network
I/O) if the file is reopened.
In addition to the inactive() vnode operation being called when the reference
count drops to zero, the vnode is placed on a systemwide free list. Unlike most
vendors' vnode implementations, which have a fixed number of vnodes allocated
to each filesystem type, the 4.4BSD kernel keeps a single systemwide collection of
vnodes. When an application opens a file that does not currently have an in-memory
vnode, the client filesystem calls the getnewvnode() routine to allocate a new
vnode. The getnewvnode() routine removes the least recently used vnode from
the front of the free list and calls the reclaim() operation to notify the filesystem
currently using the vnode that that vnode is about to be reused. The reclaim()
operation writes back any dirty data associated with the underlying object,
removes the underlying object from any lists that it is on (such as hash lists used to
find it), and frees up any auxiliary storage that was being used by the object. The
vnode is then returned for use by the new client filesystem.
The benefit of having a single global vnode table is that the kernel memory
dedicated to vnodes is used more efficiently than when several filesystem-specific
collections of vnodes are used. Consider a system that is willing to dedicate memory
for 1000 vnodes. If the system supports 10 filesystem types, then each filesystem
type will get 100 vnodes. If most of the activity moves to a single filesystem
(e.g., during the compilation of a kernel located in a local filesystem), all the
active files will have to be kept in the 100 vnodes dedicated to that filesystem
while the other 900 vnodes sit idle. In a 4.4BSD system, all 1000 vnodes could be
used for the active filesystem, allowing a much larger set of files to be cached in
memory. If the center of activity moved to another filesystem (e.g., compiling a
program on an NFS mounted filesystem), the vnodes would migrate from the previously
active local filesystem over to the NFS filesystem. Here, too, there would
be a much larger set of cached files than if only 100 vnodes were available using a
partitioned set of vnodes.
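The migration of vnodes between filesystems falls out of a single least-recently-used free list. The sketch below is an illustration only: the structure layout, the v_reclaims counter, and the routine names (vnode_release, fs_reclaim, getnewvnode_sketch) are assumptions, and the real reclaim() work of writing back data and freeing storage is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stddef.h>

struct vnode {
    struct vnode *v_freenext;  /* link on the systemwide free list */
    int           v_fstype;    /* filesystem currently using the vnode */
    int           v_usecount;  /* active references */
    int           v_reclaims;  /* times reclaim ran (demo bookkeeping) */
};

struct freelist { struct vnode *head, **tailp; };

static void freelist_init(struct freelist *fl)
{
    fl->head = NULL;
    fl->tailp = &fl->head;
}

/* Last reference dropped: append, so the head is least recently used. */
static void vnode_release(struct freelist *fl, struct vnode *vp)
{
    vp->v_usecount = 0;
    vp->v_freenext = NULL;
    *fl->tailp = vp;
    fl->tailp = &vp->v_freenext;
}

/* Stand-in for the reclaim() operation of the previous owner. */
static void fs_reclaim(struct vnode *vp) { vp->v_reclaims++; }

/* Take the least recently used vnode and hand it to a new filesystem. */
static struct vnode *getnewvnode_sketch(struct freelist *fl, int fstype)
{
    struct vnode *vp = fl->head;

    if (vp == NULL)
        return NULL;
    fl->head = vp->v_freenext;
    if (fl->head == NULL)
        fl->tailp = &fl->head;
    fs_reclaim(vp);
    vp->v_fstype = fstype;
    vp->v_usecount = 1;
    return vp;
}
```

Nothing in the list is tied to a filesystem type, so whichever filesystem is busiest simply draws more vnodes from the shared pool.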
The reclaim() operation is a disassociation of the underlying filesystem object
from the vnode itself. This ability, combined with the ability to associate new
objects with the vnode, provides functionality with usefulness that goes far
beyond simply allowing vnodes to be moved from one filesystem to another. By
replacing an existing object with an object from the dead filesystem (a filesystem
in which all operations except close fail), the kernel revokes the object. Internally,
this revocation of an object is provided by the vgone() routine.
This revocation service is used for session management, where all references
to the controlling terminal are revoked when the session leader exits. Revocation
works as follows. All open terminal descriptors within the session reference the
vnode for the special device representing the session terminal. When vgone() is
called on this vnode, the underlying special device is detached from the vnode and
is replaced with the dead filesystem. Any further operations on the vnode will
result in errors, because the open descriptors no longer reference the terminal.
Eventually, all the processes will exit and will close their descriptors, causing the
reference count to drop to zero. The inactive() routine for the dead filesystem
returns the vnode to the front of the free list for immediate reuse, because it will
never be possible to get a reference to the vnode again.
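Revocation amounts to swapping the operations vector of a vnode while every descriptor keeps pointing at the same vnode. The following sketch assumes a two-operation vector, and every name in it (vgone_sketch, dead_ops, tty_ops) is hypothetical. The choice of EIO as the error returned by the dead filesystem is also an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct vnode;
struct vnodeops {
    int (*vop_read)(struct vnode *);
    int (*vop_close)(struct vnode *);
};

struct vnode {
    const struct vnodeops *v_op;  /* operations for the current object */
    void *v_data;                 /* private data of that object */
};

/* Dead filesystem: every operation except close fails. */
static int dead_read(struct vnode *vp)  { (void)vp; return EIO; }
static int dead_close(struct vnode *vp) { (void)vp; return 0; }
static const struct vnodeops dead_ops = { dead_read, dead_close };

/* A toy terminal device whose operations always succeed. */
static int tty_read(struct vnode *vp)  { (void)vp; return 0; }
static int tty_close(struct vnode *vp) { (void)vp; return 0; }
static const struct vnodeops tty_ops = { tty_read, tty_close };

/* Detach the underlying object and substitute the dead filesystem. */
static void vgone_sketch(struct vnode *vp)
{
    vp->v_data = NULL;
    vp->v_op = &dead_ops;
}
```

After vgone_sketch() runs, every holder of the vnode sees reads fail, yet each can still close its descriptor so that the reference count eventually reaches zero.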
The revocation service is used to support forcible unmounting of filesystems.
If it finds an active vnode when unmounting a filesystem, the kernel simply calls
the vgone() routine to disassociate the active vnode from the filesystem object.
Processes with open files or current directories within the filesystem find that they
have simply vanished, as though they had been removed. It is also possible to
downgrade a mounted filesystem from read-write to read-only. Instead of access
being revoked on every active file within the filesystem, only those files with a
nonzero number of references for writing have their access revoked.
Finally, the ability to revoke objects is exported to processes through the
revoke system call. This system call can be used to ensure controlled access to a
device such as a pseudo-terminal port. First, the ownership of the device is
changed to the desired user and the mode is set to owner-access only. Then, the
device name is revoked to eliminate any interlopers that already had it open.
Thereafter, only the new owner is able to open the device.
The Name Cache
Name-cache management is another service that is provided by the vnode man­
agement routines. The interface provides a facility to add a name and its corre­
sponding vnode, to look up a name to get the corresponding vnode, and to delete a
specific name from the cache. In addition to providing a facility for deleting spe­
cific names, the interface also provides an efficient way to invalidate all names that
reference a specific vnode. Directory vnodes can have many names that reference
them: notably, the .. entries in all their immediate descendents. The kernel could
revoke all names for a vnode by scanning the entire name table, looking for references
to the vnode in question. This approach would be slow, however, given that
the name table may store thousands of names. Instead, each vnode is given a
capability, a 32-bit number guaranteed to be unique. When all the numbers have
been exhausted, all outstanding capabilities are purged, and numbering restarts
from scratch. Purging is possible, because all capabilities are easily found in kernel
memory; it needs to be done only if the machine remains running for nearly 1
year. When an entry is made in the name table, the current value of the vnode's
capability is copied to the associated name entry. A vnode's capability is invalidated
each time it is reused by getnewvnode() or, when specifically requested by a
client (e.g., when a file is being renamed), by assignment of a new capability to
the vnode. When a name is found during a cached lookup, the capability assigned
to the name is compared with that of the vnode. If they match, the lookup is suc­
cessful; if they do not match, the cache entry is freed and failure is returned.
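The capability check can be sketched with a small fixed-size table. This is a simplified illustration under several assumptions: the table is a linear array rather than a hash table, the purge that follows exhaustion of the 32-bit numbers is omitted, and the identifiers (nctable, cache_enter, cache_lookup, cache_purge, and the field names) are chosen for the example rather than taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct vnode { uint32_t v_id; };   /* the vnode's current capability */

struct ncentry {
    const char   *nc_name;
    struct vnode *nc_dvp;    /* directory containing the name */
    struct vnode *nc_vp;     /* vnode that the name refers to */
    uint32_t      nc_vpid;   /* capability of nc_vp at entry time */
    int           nc_valid;
};

#define NCSIZE 8
static struct ncentry nctable[NCSIZE];
static uint32_t nextid;      /* source of fresh capabilities */

/* Invalidate every cached name for vp without scanning the table. */
static void cache_purge(struct vnode *vp) { vp->v_id = ++nextid; }

static void cache_enter(struct vnode *dvp, const char *name,
    struct vnode *vp)
{
    int i;

    for (i = 0; i < NCSIZE; i++)
        if (!nctable[i].nc_valid) {
            nctable[i] = (struct ncentry){ name, dvp, vp, vp->v_id, 1 };
            return;
        }
    /* Table full: the name is simply not cached. */
}

static struct vnode *cache_lookup(struct vnode *dvp, const char *name)
{
    int i;

    for (i = 0; i < NCSIZE; i++) {
        struct ncentry *ncp = &nctable[i];

        if (!ncp->nc_valid || ncp->nc_dvp != dvp ||
            strcmp(ncp->nc_name, name) != 0)
            continue;
        if (ncp->nc_vpid == ncp->nc_vp->v_id)
            return ncp->nc_vp;       /* capabilities match */
        ncp->nc_valid = 0;           /* stale: free the entry */
        return NULL;
    }
    return NULL;
}
```

The cost of invalidation is a single assignment in cache_purge(); stale entries are discovered and freed lazily, one at a time, as lookups run into them.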
The cache-management routines also allow for negative caching. If a name is
looked up in a directory and is not found, that name can be entered in the cache,
along with a null pointer for its corresponding vnode. If the name is later looked
up, it will be found in the name table, and thus the kernel can avoid scanning the
entire directory to determine that the name is not there. If a directory is modified,
then potentially one or more of the negative entries may be wrong. So, when the
directory is modified, the kernel must invalidate all the negative names for that
directory vnode by assigning the directory a new capability. Negative caching
provides a significant performance improvement because of path searching in
command shells. When executing a command, many shells will look at each path
in turn, looking for the executable. Commonly run executables will be searched
for repeatedly in directories in which they do not exist. Negative caching speeds
these searches.
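Negative caching needs only two additions to a name table: an entry whose vnode pointer is null, and a copy of the directory's capability that is checked on every hit. The sketch below is a toy model with assumed names (CACHE_NEGATIVE, nc_dvpid, dir_modified); in this simplified version, a change to the directory discards its positive entries as well as its negative ones.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CACHE_MISS, CACHE_HIT, CACHE_NEGATIVE };

struct vnode { unsigned v_id; };   /* capability of the vnode */

struct ncentry {
    const char   *nc_name;
    struct vnode *nc_dvp;    /* directory that was searched */
    unsigned      nc_dvpid;  /* directory capability at entry time */
    struct vnode *nc_vp;     /* NULL marks a negative entry */
    int           nc_used;
};

#define NCSIZE 8
static struct ncentry nctable[NCSIZE];
static unsigned nextid;

/* A directory change invalidates all entries recorded against it. */
static void dir_modified(struct vnode *dvp) { dvp->v_id = ++nextid; }

static void cache_enter(struct vnode *dvp, const char *name,
    struct vnode *vp)
{
    for (int i = 0; i < NCSIZE; i++)
        if (!nctable[i].nc_used) {
            nctable[i] = (struct ncentry){ name, dvp, dvp->v_id, vp, 1 };
            return;
        }
}

static int cache_lookup(struct vnode *dvp, const char *name,
    struct vnode **vpp)
{
    for (int i = 0; i < NCSIZE; i++) {
        struct ncentry *ncp = &nctable[i];

        if (!ncp->nc_used || ncp->nc_dvp != dvp ||
            strcmp(ncp->nc_name, name) != 0)
            continue;
        if (ncp->nc_dvpid != dvp->v_id) {   /* directory has changed */
            ncp->nc_used = 0;
            return CACHE_MISS;
        }
        *vpp = ncp->nc_vp;
        return ncp->nc_vp != NULL ? CACHE_HIT : CACHE_NEGATIVE;
    }
    return CACHE_MISS;
}
```

A CACHE_NEGATIVE result lets the caller report that the name does not exist without reading the directory; only CACHE_MISS forces a scan.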
An obscure but tricky issue has to do with detecting and properly handling
special device aliases. Special devices and FIFOs are hybrid objects. Their naming
and attributes (such as owner, timestamps, and permissions) are maintained by
the filesystem in which they reside. However, their operations (such as read and
write) are maintained by the kernel on which they are being used. Since a special
device is identified solely by its major and minor number, it is possible for two or
more instances of the same device to appear within the filesystem name space,
possibly in different filesystems. Each of these different names has its own vnode
and underlying object, yet all these vnodes must be treated as one from the per­
spective of identifying blocks in the buffer cache and in other places where the
vnode and logical block number are used as a key. To ensure that the set of
vnodes is treated as a single vnode, the vnode layer provides a routine
checkalias() that is called each time that a new special device vnode comes into
existence. This routine looks for other instances of the device, and if it finds them,
links them together so that they can act as one.
Buffer Management
Another important service provided by the filesystem-independent layer is the
management of the kernel's buffer space. The task of the buffer cache is two-fold.
One task is to manage the memory that buffers data being transferred to and from
the disk or network. The second, and more important, task is to act as a cache of
recently used blocks. The semantics of the filesystem imply much I/O. If every
implied transfer had to be done, the CPU would spend most of its time waiting for
I/O to complete. On a typical 4.4BSD system, over 85 percent of the implied disk
or network transfers can be skipped, because the requested block already resides
in the buffer cache. Depending on available memory, a system is configured with
from 100 to 1000 buffers. The larger the number of buffers is, the longer a given
block can be retained in memory, and the greater the chance that actual I/O can be
avoided.
Figure 6.8 shows the format of a buffer. The buffer is composed of two parts.
The first part is the buffer header, which contains information used to find the
buffer and to describe the buffer's contents. The content information includes the
vnode (i.e., a pointer to the vnode whose data the buffer holds), the starting offset
within the file, and the number of bytes contained in the buffer. The flags entry
tracks status information about the buffer, such as whether the buffer contains use­
ful data, whether the buffer is in use, and whether the data must be written back to
the file before the buffer can be reused.
The second part is the actual buffer contents. Rather than the header being
prepended to the data area of the buffer, as is done with mbufs (see Section 11.3),
the data areas are maintained separately. Thus, there is a pointer to the buffer con­
tents and a field that shows the size of the data-buffer contents. The buffer size is
always at least as big as the size of the data block that the buffer contains. Data
are maintained separately from the header to allow easy manipulation of buffer
sizes via the page-mapping hardware. If the headers were prepended, either each
header would have to be on a page by itself or the kernel would have to avoid
remapping buffer pages that contained headers.
Figure 6.8 Format of a buffer. The buffer header holds the hash link, free-list link, flags, vnode pointer, file offset, byte count, buffer size, and buffer pointer; the buffer pointer references the separate buffer contents (64 Kbyte).
The sizes of buffer requests from a filesystem range from 512 bytes up to
65,536 bytes. If many small files are being accessed, then many small buffers are
needed. Alternatively, if several large files are being accessed, then fewer large
buffers are needed. To allow the system to adapt efficiently to these changing
needs, the kernel allocates to each buffer MAXBSIZE bytes of virtual memory, but
the address space is not fully populated with physical memory. Initially, each
buffer is assigned 4096 bytes of physical memory. As smaller buffers are allo­
cated, they give up their unused physical memory to buffers that need to hold
more than 4096 bytes. The algorithms for managing the physical memory are
described in the next subsection.
In earlier versions of BSD and in most other versions of UNIX, buffers were
identified by their physical disk block number. 4.4BSD changes this convention to
identify buffers by their logical block number within the file. For filesystems such
as NFS, the local client has no way to compute the physical block address of a log­
ical file block on the server, so only a logical block number can be used. Using
the logical block number also speeds lookup, because it is no longer necessary to
compute the physical block number before checking for the block in the cache.
For a local filesystem where the computation may require traversing up to three
indirect blocks, the savings are considerable. The drawback to using a logical­
address cache is that it is difficult to detect aliases for a block belonging to a local
file and the same block accessed through the block device disk whose logical­
block address is the same as the physical-block address. The kernel handles these
aliases by administratively preventing them from occurring. The kernel does not
allow the block device for a partition to be opened while that partition is mounted.
Conversely, the kernel will not allow a partition on a block device disk to be
mounted if the latter is already open.
The internal kernel interface to the buffer pool is simple. The filesystem allocates
and fills buffers by calling the bread() routine. Bread() takes a vnode, a logical
block number, and a size, and returns a pointer to a locked buffer. Any other
process that tries to obtain the buffer will be put to sleep until the buffer is
released. A buffer can be released in one of four ways. If the buffer has not been
modified, it can simply be released through use of brelse(), which returns it to the
free list and awakens any processes that are waiting for it.
If the buffer has been modified, it is called dirty. Dirty buffers must eventu­
ally be written back to their filesystem. Three routines are available based on the
urgency with which the data must be written. In the typical case, bdwrite() is
used; since the buffer may be modified again soon, it should be marked as dirty,
but should not be written immediately. After the buffer is marked as dirty, it is
returned to the free list and any processes waiting for it are awakened. The heuris­
tic is that, if the buffer will be modified again soon, the I/O would be wasted.
Because the buffer is held for an average of 15 seconds before it is written, a pro­
cess doing many small writes will not repeatedly access the disk or network.
If a buffer has been filled completely, then it is unlikely to be written again
soon, so it should be released with bawrite(). Bawrite() schedules an I/O on the
buffer, but allows the caller to continue running while the output completes.
The final case is bwrite(), which ensures that the write is complete before
proceeding. Because bwrite() can introduce a long latency to the writer, it is used
only when a process explicitly requests the behavior (such as the fsync system
call), when the operation is critical to ensure the consistency of the filesystem after
a system crash, or when a stateless remote filesystem protocol such as NFS is
being served. Buffers that are written using bawrite() or bwrite() are placed on
the appropriate output queue. When the output completes, the brelse() routine is
called to return them to the free list and to awaken any processes that are waiting
for them.
called to return them to the free list and to awaken any processes that are waiting
for them.
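The choice among the four release routines can be summarized as a small decision function. The function, its parameters, and the enumeration are hypothetical; they only restate the policy given above and do not perform any I/O.

```c
#include <assert.h>

enum release { BRELSE, BDWRITE, BAWRITE, BWRITE };

/*
 * modified:  the buffer contents were changed
 * full:      the block has been filled completely
 * must_sync: the caller needs the data on stable storage before
 *            continuing (fsync, crash consistency, NFS service)
 */
static enum release choose_release(int modified, int full, int must_sync)
{
    if (!modified)
        return BRELSE;    /* clean: just return it to the free list */
    if (must_sync)
        return BWRITE;    /* write and wait for completion */
    if (full)
        return BAWRITE;   /* start the write, do not wait */
    return BDWRITE;       /* mark dirty, write later */
}
```

The ordering of the tests matters: a synchronous requirement overrides the heuristics that apply to full and partially filled blocks.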
Figure 6.9 shows a snapshot of the buffer pool. A buffer with valid contents
is contained on exactly one bufhash hash chain. The kernel uses the hash chains
to determine quickly whether a block is in the buffer pool, and if it is, to locate it.
A buffer is removed only when its contents become invalid or it is reused for dif­
ferent data. Thus, even if the buffer is in use by one process, it can still be found
by another process, although the busy flag will be set so that it will not be used
until its contents are consistent.
In addition to appearing on the hash list, each unlocked buffer appears on
exactly one free list. The first free list is the LOCKED list. Buffers on this list cannot
be flushed from the cache. This list was originally intended to hold superblock
data; in 4.4BSD, it is used by only the log-structured filesystem.
The second list is the LRU list. When a buffer is found (typically on the LRU
list), it is removed and used. The buffer is then returned to the end of the LRU list.
When new buffers are needed, they are taken from the front of the LRU list. Thus,
buffers used repeatedly will continue to migrate to the end of the LRU list and are
not likely to be reused for new blocks. As its name suggests, this list implements
a least recently used (LRU) algorithm.
Figure 6.9 Snapshot of the buffer pool. V: vnode; X: file offset.
The third free list is the AGE list. This list holds blocks that have not proved
their usefulness, but are expected to be used soon, or have already been used and
are not likely to be reused. Buffers can be pushed onto either end of this list:
Buffers containing no useful data are pushed on the front (where they will be
reclaimed quickly), and other buffers are pushed on the end (where they might
remain long enough to be used again). When a file is unlinked, its buffers are
placed at the front of the AGE list. In Fig. 6.9, the file associated with vnode 7 has
just been deleted. The AGE list is also used to hold read-ahead blocks. In Fig.
6.9, vnode 8 has just finished using the buffer starting with offset 48 Kbyte
(which, being a full-sized block, contains logical blocks 48 through 55), and will
probably use its read-ahead, contained in the buffer starting with offset 56 Kbyte
at the end of the AGE list. If a requested block is found on the AGE list, it is returned
to the end of the LRU list, because it has proved its usefulness. When a new buffer
is needed, the AGE list is searched first; only when that list is empty is the LRU list
used.
The final list is the list of empty buffers, the EMPTY list. The empty buffers
have had all their physical memory stripped away by other buffers. They are held
on this list waiting for another buffer to be reused for a smaller block and thus to
give up its extra physical memory.
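The interplay of the AGE and LRU lists can be sketched with circular doubly linked queues. This is a toy model: the queue indices, the b_valid flag, and the routine names (release_buf, take_free_buf, push_front, push_back) are assumptions, and the LOCKED and EMPTY queues are declared but not exercised.

```c
#include <assert.h>
#include <stddef.h>

enum { Q_LOCKED, Q_LRU, Q_AGE, Q_EMPTY, Q_COUNT };

struct buf {
    struct buf *b_prev, *b_next;  /* free-list links */
    int         b_valid;          /* contents are still useful */
};

static struct buf freeq[Q_COUNT]; /* circular list heads (sentinels) */

static void queues_init(void)
{
    for (int q = 0; q < Q_COUNT; q++)
        freeq[q].b_prev = freeq[q].b_next = &freeq[q];
}

static void insert_after(struct buf *pos, struct buf *bp)
{
    bp->b_prev = pos;
    bp->b_next = pos->b_next;
    pos->b_next->b_prev = bp;
    pos->b_next = bp;
}

static void push_front(int q, struct buf *bp)
{
    insert_after(&freeq[q], bp);
}

static void push_back(int q, struct buf *bp)
{
    insert_after(freeq[q].b_prev, bp);
}

static void remove_buf(struct buf *bp)
{
    bp->b_prev->b_next = bp->b_next;
    bp->b_next->b_prev = bp->b_prev;
}

/* Useless data goes to the front of AGE; useful data to the LRU end. */
static void release_buf(struct buf *bp)
{
    if (bp->b_valid)
        push_back(Q_LRU, bp);
    else
        push_front(Q_AGE, bp);
}

/* Allocation policy: search AGE first, then the front of LRU. */
static struct buf *take_free_buf(void)
{
    static const int order[] = { Q_AGE, Q_LRU };

    for (size_t i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
        struct buf *bp = freeq[order[i]].b_next;

        if (bp != &freeq[order[i]]) {
            remove_buf(bp);
            return bp;
        }
    }
    return NULL;
}
```

A read-ahead block would be added with push_back(Q_AGE, bp), so that it is reclaimed after the invalid buffers but before anything on the LRU list.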
Implementation of Buffer Management
Having looked at the functions and algorithms used to manage the buffer pool, we
shall now turn our attention to the implementation requirements for ensuring the
consistency of the data in the buffer pool. Figure 6.10 shows the
support routines that implement the interface for getting buffers. The primary
interface to getting a buffer is through bread(), which is called with a request for a
data block of a specified size for a specified vnode. There is also a related interface,
breadn(), that both gets a requested block and starts read-ahead for additional
blocks. Bread() first calls getblk() to find out whether the data block is
available in a buffer that is already in memory. If the block is available in a buffer,
getblk() calls bremfree() to take the buffer off whichever free list it is on and to
mark it busy; bread() can then return the buffer to the caller.
Figure 6.10 Procedural interface to the buffer-allocation system. Recoverable labels: bremfree() takes a buffer off the free list and marks it busy; getnewbuf() finds the next available buffer on the free list; the remaining steps check for the buffer on the free list, adjust the memory in the buffer to the requested size, and do the I/O.
If the block is not already in memory, getblk() calls getnewbuf() to allocate a
new buffer. The new buffer is then passed to allocbuf(), which ensures that the
buffer has the right amount of physical memory. Getblk() then returns the buffer
to bread() marked busy and unfilled. Noticing that the buffer is unfilled, bread()
passes the buffer to the strategy() routine for the underlying filesystem to have the
data read in. When the read completes, the buffer is returned.
The task of allocbuf() is to ensure that the buffer has enough physical memory
allocated to it. Figure 6.11 shows the virtual memory for the data part of a
buffer. The data area for each buffer is allocated MAXBSIZE bytes of virtual
address space. The bufsize field in the buffer header shows how much of the virtual
address space is backed by physical memory. Allocbuf() compares the size of
the intended data block with the amount of physical memory already allocated to
the buffer. If there is excess physical memory and there is a buffer available on
the EMPTY list, a buffer is taken off the EMPTY list, the excess memory is put into
the empty buffer, and that buffer is then inserted onto the front of the AGE list. If
there are no buffers on the EMPTY list, the excess physical memory is retained in
the original buffer.
Figure 6.11 Allocation of buffer memory. Recoverable labels: virtual address space (64 Kbyte) and physical pages (16 Kbyte).
Figure 6.12 Potentially overlapping allocation of buffers. Recoverable labels: old buffer, disk blocks 32 through 47, new buffer.
If the buffer has insufficient memory, allocbuf() takes memory from other
buffers. Allocbuf() does the allocation by calling getnewbuf() to get a second buffer
and then transferring the physical memory in the second buffer to the new buffer
under construction. If there is memory remaining in the second buffer, the second
buffer is released to the front of the AGE list; otherwise, the second buffer is
released to the EMPTY list. If the new buffer still does not have enough physical
memory, the process is repeated. Allocbuf() ensures that each physical-memory
page is mapped into exactly one buffer at all times.
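The bookkeeping done by allocbuf() can be sketched by tracking only the number of bytes of physical memory behind each buffer. The sketch makes several assumptions: page mapping is not modeled, the donors array stands in for successive calls to getnewbuf(), placement of buffers on the AGE and EMPTY lists is left out, and the name allocbuf_sketch and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct buf { long b_bufsize; };   /* physical memory backing the buffer */

/*
 * Grow or shrink bp to "size" bytes. "empty" is a buffer taken from
 * the EMPTY list (or NULL if that list has none); donors[] stand in
 * for buffers returned by repeated getnewbuf() calls.
 * Returns the number of donors consumed.
 */
static size_t allocbuf_sketch(struct buf *bp, long size, struct buf *empty,
    struct buf *donors, size_t ndonors)
{
    size_t used = 0;

    if (bp->b_bufsize > size) {
        if (empty != NULL) {          /* hand the excess to an empty buffer */
            empty->b_bufsize = bp->b_bufsize - size;
            bp->b_bufsize = size;
        }
        return 0;                     /* otherwise the excess is retained */
    }
    while (bp->b_bufsize < size && used < ndonors) {
        struct buf *dp = &donors[used++];
        long need = size - bp->b_bufsize;
        long take = dp->b_bufsize < need ? dp->b_bufsize : need;

        dp->b_bufsize -= take;        /* donor keeps any remainder */
        bp->b_bufsize += take;
    }
    return used;
}
```

Memory is only ever moved between buffers, never created or dropped, which mirrors the requirement that each physical page belong to exactly one buffer.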
To maintain the consistency of the filesystem, the kernel must ensure that a
disk block is mapped into at most one buffer. If the same disk block were present
in two buffers, and both buffers were marked dirty, the system would be unable to
determine which buffer had the most current information. Figure 6.12 shows a
sample allocation. In the middle of the figure are the blocks on the disk. Above
the disk is shown an old buffer containing a 4096-byte fragment for a file that pre­
sumably has been removed or shortened. The new buffer is going to be used to
hold a 3072-byte fragment for a file that is presumably being created and that will
reuse part of the space previously held by the old file. The kernel maintains the
consistency by purging old buffers when files are shortened or removed. Whenever
a file is removed, the kernel traverses its list of dirty buffers. For each buffer,
the kernel cancels its write request and marks the buffer invalid, so that the buffer
cannot be found in the buffer pool again. Each invalid buffer is put at the front of
the AGE list, so that it will be used before any buffers with potentially useful data.
For a file being partially truncated, only the buffers following the truncation point
are invalidated. The system can then allocate the new buffer knowing that the
buffer maps the corresponding disk blocks uniquely.
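The purge on removal or truncation can be illustrated with a small model. The structure, the flag names FB_DIRTY and FB_INVAL, and the function name are hypothetical; they are not the kernel's definitions, and the sketch omits the step that moves each invalid buffer to the front of the AGE list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch. */
#define FB_DIRTY  0x1   /* a write request is pending */
#define FB_INVAL  0x2   /* must not be found in the buffer pool again */

struct fbuf {
    long lblkno;        /* logical block number within the file */
    int  flags;
};

/*
 * Invalidate every buffer at or beyond "first_gone", the first logical
 * block dropped by the truncation.  Passing 0 models removal of the
 * whole file.  Returns the number of buffers invalidated.
 */
size_t
purge_buffers(struct fbuf *bufs, size_t n, long first_gone)
{
    size_t purged = 0;

    assert(bufs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (bufs[i].lblkno < first_gone)
            continue;                 /* still part of the file */
        bufs[i].flags &= ~FB_DIRTY;   /* cancel the write request */
        bufs[i].flags |= FB_INVAL;    /* hide it from later lookups */
        purged++;
    }
    return (purged);
}
```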
6.7
Stackable Filesystems
The early vnode interface was simply an object-oriented interface to an underlying
filesystem. As the demand grew for new filesystem features, it became desirable
to find ways of providing them without having to modify the existing and stable
filesystem code. One approach is to provide a mechanism for stacking several
filesystems on top of one another [Rosenthal, 1990]. The stacking ideas
were refined and implemented in the 4.4BSD system [Heidemann & Popek, 1994].
The bottom of a vnode stack tends to be a disk-based filesystem, whereas the layers
used above it typically transform their arguments and pass on those arguments
to a lower layer.
In all UNIX systems, the mount command takes a special device as a source
and maps that device onto a directory mount point in an existing filesystem.
When a filesystem is mounted on a directory, the previous contents of the direc­
tory are hidden; only the contents of the root of the newly mounted filesystem are
visible. To most users, the effect of the series of mount commands done at system
startup is the creation of a single seamless filesystem tree.
Stacking also uses the mount command to create new layers. The mount command
pushes a new layer onto a vnode stack; an unmount command removes a
layer. Like the mounting of a filesystem, a vnode stack is visible to all processes
running on the system. The mount command identifies the underlying layer in the
stack, creates the new layer, and attaches that layer into the filesystem name space.
The new layer can be attached to the same place as the old layer (covering the old
layer) or to a different place in the tree (allowing both layers to be visible). An
example is shown in the next subsection.
If layers are attached to different places in the name space, then the same file
will be visible in multiple places. Access to the file through a name in the new
layer's name space goes to the new layer, whereas access through a name in the
old layer's name space goes to only the old layer.
When a file access (e.g., an open, read, stat, or close) occurs to a vnode in the
stack, that vnode has several options:
• Do the requested operations and return a result.
• Pass the operation without change to the next-lower vnode on the stack. When
the operation returns from the lower vnode, it may modify the results, or simply
return them.
• Modify the operands provided with the request, then pass it to the next-lower
vnode. When the operation returns from the lower vnode, it may modify the
results, or simply return them.
If an operation is passed to the bottom of the stack without any layer taking action
on it, then the interface will return the error "operation not supported."
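The three options, and the fall-through to "operation not supported," can be modeled as follows. This is an illustrative sketch only: the layer structure, the single integer operand, and the function names are assumptions, and a NULL handler stands in for a layer that passes the operation down unchanged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct layer;
typedef int (*op_fn)(struct layer *self, int *operand);

/* Hypothetical layer: "op" is NULL when the layer takes no action. */
struct layer {
    op_fn         op;
    struct layer *lower;   /* next-lower layer, NULL at the bottom */
};

/* Call the operation on "top", falling through layers that ignore it. */
int
layer_call(struct layer *top, int *operand)
{
    for (struct layer *l = top; l != NULL; l = l->lower)
        if (l->op != NULL)
            return (l->op(l, operand));
    return (EOPNOTSUPP);   /* no layer acted on the operation */
}

/* Option 1: do the requested operation and return a result. */
int
bottom_op(struct layer *self, int *operand)
{
    (void)self;
    *operand *= 2;
    return (0);
}

/* Option 3: modify the operand, pass it down, then adjust the result. */
int
transform_op(struct layer *self, int *operand)
{
    int error;

    *operand += 1;                          /* modify the operand */
    error = layer_call(self->lower, operand);
    if (error == 0)
        *operand -= 1;                      /* modify the result */
    return (error);
}
```

A stack built as transform layer, pass-through layer, bottom layer turns an operand of 4 into 9: the top adds 1, the bottom doubles, and the top subtracts 1 on the way back.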
Vnode interfaces released before 4.4BSD implemented vnode operations as
indirect function calls. The requirements that intermediate stack layers bypass
operations to lower layers and that new operations can be added into the system at
boot time mean that this approach is no longer adequate. Filesystems must be
able to bypass operations that may not have been defined at the time that the
filesystem was implemented. In addition to passing through the function, the
filesystem layer must also pass through the function parameters, which are of
unknown type and number.
    /*
     * Check for read permission on file "vp".
     */
    if (error = VOP_ACCESS(vp, VREAD, cred, p))
            return (error);

    /*
     * Check access permission for a file.
     */
    int
    ufs_access(ap)
            struct vop_access_args {
                    struct vnodeop_desc *a_desc;  /* operation descrip. */
                    struct vnode *a_vp;           /* file to be checked */
                    int a_mode;                   /* access mode sought */
                    struct ucred *a_cred;         /* user seeking access */
                    struct proc *a_p;             /* associated process */
            } *ap;
    {
            if (permission granted)
                    return (1);
            return (0);
    }
Figure 6.13
Call to and function header for access vnode operation.
To resolve these two problems in a clean and portable way, the kernel places
the vnode operation name and its arguments into an argument structure. This argument
structure is then passed as a single parameter to the vnode operation. Thus,
all calls on a vnode operation will always have exactly one parameter, which is the
pointer to the argument structure. If the vnode operation is one that is supported by
the filesystem, then it will know what the arguments are and how to interpret them.
If it is an unknown vnode operation, then the generic bypass routine can call the
same operation in the next-lower layer, passing to the operation the same argument
structure that it received. In addition, the first argument of every operation is a
pointer to the vnode operation description. This description provides to a bypass
routine the information about the operation, including the operation's name and the
location of the operation's parameters. An example access-check call and its
implementation for the UFS filesystem are shown in Fig. 6.13. Note that the
vop_access_args structure is normally declared in a header file, but here is
declared at the function site to simplify the example.
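To make the role of the operation description concrete, the following sketch shows a generic bypass routine that forwards an operation it knows nothing about. The types opdesc, vnode_m, and access_args, the single vp_offset field, and all function names are simplified assumptions for illustration; the kernel's actual description structure carries more information than is modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation description. */
struct opdesc {
    int         index;      /* slot in each layer's operation vector */
    const char *name;       /* operation name */
    size_t      vp_offset;  /* where the vnode pointer sits in the args */
};

typedef int (*vop_t)(void *args);

struct vnode_m {
    vop_t          *ops;    /* operation vector, indexed by opdesc.index */
    struct vnode_m *lower;  /* vnode in the next-lower layer, if any */
    int             mode;   /* permission bits used by the bottom layer */
};

/* Every argument structure starts with the description pointer. */
struct access_args {
    struct opdesc  *a_desc;
    struct vnode_m *a_vp;
    int             a_mode;
};

struct opdesc access_desc = {
    0, "access", offsetof(struct access_args, a_vp)
};

/* Generic bypass: uses only the description to find the vnode argument. */
int
bypass(void *args)
{
    struct opdesc *desc = *(struct opdesc **)args;
    struct vnode_m **vpp =
        (struct vnode_m **)((char *)args + desc->vp_offset);
    struct vnode_m *upper = *vpp;
    int error;

    assert(upper->lower != NULL);
    *vpp = upper->lower;                        /* map to the lower vnode */
    error = upper->lower->ops[desc->index](args);
    *vpp = upper;                               /* restore for the caller */
    return (error);
}

/* Bottom-layer access check: returns 1 if granted, as in Fig. 6.13. */
int
bottom_access(void *args)
{
    struct access_args *ap = args;

    return ((ap->a_vp->mode & ap->a_mode) == ap->a_mode);
}
```

Because bypass() reads only the leading description pointer and the offset stored in it, the same routine would work for an operation added after the layer was written, which is the property the argument-structure convention is meant to provide.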
Simple Filesystem Layers
The simplest filesystem layer is nullfs. It makes no transformations on its arguments,
simply passing through all requests that it receives and returning all results
that it gets back. Although it provides no useful functionality if it is simply
stacked on top of an existing vnode, nullfs can provide a loopback filesystem by
mounting the filesystem rooted at its source vnode at some other location in the
filesystem tree. The code for nullfs is also an excellent starting point for designers
who want to build their own filesystem layers. Examples that could be built
include a compression layer or an encryption layer.
A sample vnode stack is shown in Fig. 6.14. The figure shows a local filesystem
on the bottom of the stack that is being exported from /local via an NFS layer.
Clients within the administrative domain of the server can import the /local
filesystem directly, because they are all presumed to use a common mapping of
UIDs to user names.
The umapfs filesystem works much like the nullfs filesystem in that it pro­
vides a view of the file tree rooted at the /local filesystem on the /export mount
point. In addition to providing a copy of the /local filesystem at the /export
mount point, it transforms the credentials of each system call made to files within
the /export filesystem. The kernel does the transformation using a mapping that
was provided as part of the mount system call that created the umapfs layer.
The /export filesystem can be exported to clients from an outside administrative
domain that uses different UIDs and GIDs. When an NFS request comes in for
the /export filesystem, the umapfs layer modifies the credential from the foreign
client by mapping the UIDs used on the foreign client to the corresponding UIDs
used on the local system. The requested operation with the modified credential is
passed down to the lower layer corresponding to the /local filesystem, where it is
processed identically to a local request. When the result is returned to the mapping
layer, any returned credentials are mapped inversely so that they are converted
from the local UIDs to the outside UIDs, and this result is sent back as the
NFS response.
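The forward and inverse mappings can be pictured as a pair of table lookups. The sketch below is an assumption-laden illustration, not the umapfs code: the table layout, the function names, and the ID_UNMAPPED fallback for identifiers that have no entry are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ID_UNMAPPED 65534u   /* assumed placeholder for IDs with no mapping */

struct idmap_entry {
    unsigned int outside;    /* ID used in the foreign domain */
    unsigned int local;      /* corresponding ID on the server */
};

/* Map an incoming (outside) ID to the local ID space. */
unsigned int
map_in(const struct idmap_entry *map, size_t n, unsigned int outside)
{
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (map[i].outside == outside)
            return (map[i].local);
    return (ID_UNMAPPED);
}

/* Inverse mapping, applied to credentials returned in a result. */
unsigned int
map_out(const struct idmap_entry *map, size_t n, unsigned int local)
{
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (map[i].local == local)
            return (map[i].outside);
    return (ID_UNMAPPED);
}
```

A linear table keeps the example short; a domain with a large mapping would presumably want a faster lookup structure, which is the trade-off noted in the third benefit listed for this approach.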
Figure 6.14
Stackable vnodes. [Figure labels: outside administrative exports; local administrative exports; NFS server; uid/gid mapping; operation not supported.]
There are three benefits to this approach:
1. There is no cost of mapping imposed on the local clients.
2. There are no changes required to the local filesystem code or the NFS code to
support mapping.
3. Each outside domain can have its own mapping. Domains with simple map­
pings consume small amounts of memory and run quickly; domains with large
and complex mappings can be supported without detracting from the perfor­
mance of simpler environments.
Vnode stacking is an effective approach for adding extensions, such as the umapfs layer.
The Union Mount Filesystem
The union filesystem is another example of a middle filesystem layer. Like the
nullfs, it does not store data; it just provides a name-space transformation. It is
loosely modeled on the work on the 3-D filesystem [Korn & Krell, 1989], on the
Translucent filesystem [Hendricks, 1990], and on the Automounter [Pendry &
Williams, 1994]. The union filesystem takes an existing filesystem and transparently
overlays the latter on another filesystem. Unlike most other filesystems, a
union mount does not cover up the directory on which the filesystem is mounted.
Instead, it shows the logical merger of both directories and allows both directory
trees to be accessible simultaneously [Pendry & McKusick, 1995].
A small example of a union-mount stack is shown in Fig. 6.15. Here, the bottom
layer of the stack is the src filesystem that includes the source for the shell
program. Being a simple program, it contains only one source and one header file.
The upper layer that has been union mounted on top of src initially contains just
the src directory. When the user changes directory into shell, a directory of the
same name is created in the top layer. Directories in the top layer corresponding
to directories in the lower layer are created only as they are encountered while the
top layer is traversed. If the user were to run a recursive traversal of the tree
rooted at the top of the union-mount location, the result would be a complete tree
of directories matching the underlying filesystem. In our example, the user now
types make in the shell directory. The sh executable is created in the upper layer
of the union stack. To the user, a directory listing shows the sources and
executable all apparently together, as shown on the right in Fig. 6.15.
Figure 6.15
A union-mounted filesystem. [Figure: the lower and upper layers and the combined listing, including sh.h and sh.c.]
All filesystem layers, except the top one, are treated as though they were read­
only. If a file residing in a lower layer is opened for reading, a descriptor is
returned for that file. If a file residing in a lower layer is opened for writing, the
kernel first copies the file to the top layer, then returns a descriptor referencing the
copy of the file. The result is that there are two copies of the file: the original
unmodified file in the lower layer and the modified copy of the file in the upper
layer. When the user does a directory listing, any duplicate names in the lower
layer are suppressed. When a file is opened, a descriptor for the file in the upper­
most layer in which the name appears is returned. Thus, once a file has been
copied to the top layer, instances of the file in lower layers become inaccessible.
The tricky part of the union filesystem is handling the removal of files that
reside in a lower layer. Since the lower layers cannot be modified, the only way to
remove a file is to hide it by creating a whiteout directory entry in the top layer. A
whiteout is an entry in a directory that has no corresponding file; it is distinguished
by having an inode number of 1. If the kernel finds a whiteout entry
while searching for a name, the lookup is stopped and the "no such file or directory"
error is returned. Thus, the file with the same name in a lower layer appears
to have been removed. If a file is removed from the top layer, it is necessary to
create a whiteout entry for it only if there is a file with the same name in the lower
level that would reappear.
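The lookup rule, including the effect of a whiteout, can be sketched as a top-down search. The dent and ulayer types and the union_lookup() function are hypothetical names for this model; only the use of inode number 1 to mark a whiteout comes from the text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define WHITEOUT_INO 1   /* inode number that marks a whiteout entry */

struct dent {
    const char   *name;
    unsigned long ino;
};

struct ulayer {
    const struct dent *ents;
    size_t             nents;
};

/*
 * Search the layers from top (index 0) to bottom.  On success, store the
 * inode number and the index of the layer that supplied it, and return 0.
 * A whiteout entry stops the search with ENOENT.
 */
int
union_lookup(const struct ulayer *layers, size_t nlayers, const char *name,
    unsigned long *ino, size_t *where)
{
    assert(layers != NULL && name != NULL && ino != NULL && where != NULL);
    for (size_t l = 0; l < nlayers; l++) {
        for (size_t i = 0; i < layers[l].nents; i++) {
            const struct dent *d = &layers[l].ents[i];

            if (strcmp(d->name, name) != 0)
                continue;
            if (d->ino == WHITEOUT_INO)
                return (ENOENT);      /* hidden: appears removed */
            *ino = d->ino;
            *where = l;
            return (0);
        }
    }
    return (ENOENT);
}
```

Because the search returns at the first match, a name present in the top layer always masks the same name below it, and a whiteout in the top layer masks it by reporting that the name does not exist.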
When a process creates a file with the same name as a whiteout entry, the
whiteout entry is replaced with a regular name that references the new file.
Because the new file is being created in the top layer, it will mask out any files
with the same name in a lower layer. When a user does a directory listing, whiteout
entries and the files that they mask usually are not shown. However, there is
an option that causes them to appear.
One feature that has long been missing in UNIX systems is the ability to
recover files after they have been deleted. For the union filesystem, the kernel can
implement file recovery trivially, simply by removing the whiteout entry to expose
the underlying file. The LFS filesystem also has the (currently unimplemented)
ability to recover deleted files, because it never overwrites previously written data.
Deleted versions of files may not be reclaimed until the filesystem becomes nearly
full and the LFS garbage collector runs. For filesystems that provide file recovery,
users can recover files by using a special option to the remove command; processes
can recover files by using the undelete system call.
When a directory whose name appears in a lower layer is removed, a whiteout
entry is created just as it would be for a file. However, if the user later attempts to
create a directory with the same name as the previously deleted directory, the
union filesystem must treat the new directory specially to avoid having the previous
contents from the lower-layer directory reappear. When a directory that
replaces a whiteout entry is created, the union filesystem sets a flag in the directory
metadata to show that this directory should be treated specially. When a
directory scan is done, the kernel returns information about only the top-level
directory; it suppresses the list of files from the directories of the same name in the
lower layers.
The union filesystem can be used for many purposes:
• It allows several different architectures to build from a common source base.
The source pool is NFS mounted onto each of several machines. On each host
machine, a local filesystem is union mounted on top of the imported source tree.
As the build proceeds, the objects and binaries appear in the local filesystem that
is layered above the source tree. This approach not only avoids contaminating
the source pool with binaries, but also speeds the compilation, because most of
the filesystem traffic is on the local filesystem.
• It allows compilation of sources on read-only media such as CD-ROMs. A local
filesystem is union mounted above the CD-ROM sources. It is then possible to
change into directories on the CD-ROM and to give the appearance of being able
to edit and compile in that directory.
• It allows creation of a private source directory. The user creates a source directory
in her own work area, then union mounts the system sources underneath that
directory. This feature is possible because the restrictions on the mount command
have been relaxed. Any user can do a mount if she owns the directory on
which the mount is being done and she has appropriate access permissions on
the device or directory being mounted (read permission is required for a read-only
mount, read-write permission is required for a read-write mount). Only the
user who did the mount or the superuser can unmount a filesystem.
Other Filesystems
There are several other filesystems included as part of 4.4BSD. The portal filesys­
tem mounts a process onto a directory in the file tree. When a pathname that tra­
verses the location of the portal is used, the remainder of the path is passed to the
process mounted at that point. The process interprets the path in whatever way it
sees fit, then returns a descriptor to the calling process. This descriptor may be for
a socket connected to the portal process. If it is, further operations on the descrip­
tor will be passed to the portal process for the latter to interpret. Alternatively, the
descriptor may be for a file elsewhere in the filesystem.
Consider a portal process mounted on /dialout used to manage a bank of
dialout modems. When a process wanted to connect to an outside number, it
would open /dialout/15105551212/9600 to specify that it wanted to dial
1-510-555-1212 at 9600 baud. The portal process would get the final two pathname
components. Using the final component, it would determine that it should
find an unused 9600-baud modem. It would use the other component as the number
to which to place the call. It would then write an accounting record for future
billing, and would return the descriptor for the modem to the process.
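How a portal process might pick apart the two components it receives can be sketched as follows. The function name parse_dialout() and its validation rules (digits only for the number, a positive decimal speed) are assumptions made for this example; the text does not specify how the dialout portal parses its argument.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "number/speed" (the path remainder after the mount point) into
 * its two components.  Returns 0 on success, -1 if the path is malformed.
 */
int
parse_dialout(const char *rest, char *number, size_t numlen, long *baud)
{
    const char *slash;
    char *end;
    size_t len;

    assert(rest != NULL && number != NULL && baud != NULL);
    slash = strchr(rest, '/');
    if (slash == NULL || slash == rest || slash[1] == '\0')
        return (-1);
    len = (size_t)(slash - rest);
    if (len >= numlen || strspn(rest, "0123456789") != len)
        return (-1);              /* too long, or not all digits */
    memcpy(number, rest, len);
    number[len] = '\0';
    *baud = strtol(slash + 1, &end, 10);
    if (*end != '\0' || *baud <= 0)
        return (-1);
    return (0);
}
```

For the path used in the example above, the remainder 15105551212/9600 yields the number 15105551212 and a speed of 9600.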
One of the more interesting uses of the portal filesystem is to provide an Internet
service directory. For example, with an Internet portal process mounted on
/net, an open of /net/tcp/McKusick.COM/smtp returns a TCP socket descriptor
to the calling process that is connected to the SMTP server on McKusick.COM.
Because access is provided through the normal filesystem, the calling process does
not need to be aware of the special functions necessary to create a TCP socket and
to establish a TCP connection [Stevens & Pendry, 1995].
There are several filesystems that are designed to provide a convenient interface
to kernel information. The procfs filesystem is normally mounted at /proc
and provides a view of the running processes in the system. Its primary use is for
debugging, but it also provides a convenient interface for collecting information
about the processes in the system. A directory listing of /proc produces a numeric
list of all the processes in the system. Each process entry is itself a directory that
contains the following:
• A file to control the process, allowing the process to be stopped, continued, and signaled
• The executable for the process
• The virtual memory of the process
• The registers for the process
• A text file containing information about the process
The fdesc filesystem is normally mounted on /dev/fd, and provides a list of all
the active file descriptors for the currently running process. An example where
this is useful is specifying to an application that it should read input from its standard
input. Here, you can use the pathname /dev/fd/0, instead of having to come
up with a special convention, such as using the name - to tell the application to
read from its standard input.
The kernfs filesystem is normally mounted on /kern, and contains files that
have various information about the system. It includes information such as the
host name, time of day, and version of the system.
Finally, there is the cd9660 filesystem. It allows ISO-9660-compliant filesystems,
with or without Rock Ridge extensions, to be mounted. The ISO-9660
filesystem format is most commonly used on CD-ROMs.
Exercises
6.1
Where are the read and write attributes of an open file descriptor stored?
6.2
Why is the close-on-exec bit located in the per-process descriptor table,
instead of in the system file table?
6.3
Why are the file-table entries reference counted?
6.4
What three shortcomings of lock files are addressed by the 4.4BSD descriptor-locking
facilities?
6.5
What two problems are raised by mandatory locks?
6.6
Why is the implementation of select split between the descriptor-management
code and the lower-level routines?
6.7
Describe how the process selecting flag is used in the implementation of
select.
6.8
The update program is usually started shortly after the system is booted.
Once every 30 seconds, it does a sync system call. What problem could
arise if this program were not run?
6.9
The special device /dev/kmem provides access to the kernel's virtual
address space. Would you expect it to be a character or a block device?
Explain your answer.
6.10
Many tape drives provide a block-device interface. Is it possible to support
a filesystem on such a tape drive?
6.11
When is a vnode placed on the free list?
6.12
Why must the lookup routine call through the vnode interface once for each
component in a pathname?
6.13
Give three reasons for revoking access to a vnode.
6.14
Why are the buffer headers allocated separately from the memory that
holds the contents of the buffer?
6.15
How does the maximum filesystem block size affect the buffer cache?
*6.16
Why are there both an AGE list and an LRU list, instead of all buffers being
managed on the LRU list?
*6.17
Filenames can be up to 255 characters long. How could you implement the
systemwide name cache to avoid allocating 255 bytes for each entry?
*6.18
If a process reads a large file, the blocks of the file will fill the buffer cache
completely, flushing out all other contents. All other processes in the system
then will have to go to disk for all their filesystem accesses. Write an
algorithm to control the purging of the buffer cache.
*6.19
Discuss the tradeoff between dedicating memory to the buffer cache and
making the memory available to the virtual-memory system for use in fulfilling
paging requests. Give a policy for moving memory between the
buffer pool and the virtual-memory system.
6.20
Vnode operation parameters are passed between layers in structures. What
alternatives are there to this approach? Explain why your approach is more
or less efficient, compared to the current approach, when there are fewer than
five layers in the stack. Also compare the efficiency of your solution when
there are more than five layers in the stack.
*6.21
True asynchronous I/O is not supported in 4.4BSD. What problems arise
with providing asynchronous I/O in the existing read-write interface?
References
Accetta et al, 1986.
M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, & M.
Young, "Mach: A New Kernel Foundation for UNIX Development," USENIX
Association Conference Proceedings, pp. 93-113, June 1986.
Bass, 1981.
J. Bass, Implementation Description for File Locking, Onyx Systems Inc., 73
E. Trimble Road, San Jose, CA, January 1981.
Heidemann & Popek, 1994.
J. S. Heidemann & G. J. Popek, "File-System Development with Stackable
Layers," ACM Transactions on Computer Systems, vol. 12, no. 1, pp. 58-89,
February 1994.
Hendricks, 1990.
D. Hendricks, "A Filesystem for Software Development," USENIX Association
Conference Proceedings, pp. 333-340, June 1990.
Korn & Krell, 1989.
D. Korn & E. Krell, "The 3-D File System," USENIX Association Conference
Proceedings, pp. 147-156, June 1989.
Pendry & McKusick, 1995.
J. Pendry & M. McKusick, "Union Mounts in 4.4BSD-Lite," USENIX Association
Conference Proceedings, pp. 25-33, January 1995.
Pendry & Williams, 1994.
J. Pendry & N. Williams, "AMD: The 4.4BSD Automounter Reference
Manual," in 4.4BSD System Manager's Manual, pp. 13:1-57, O'Reilly &
Associates, Inc., Sebastopol, CA, 1994.
Peterson, 1983.
G. Peterson, "Concurrent Reading While Writing," ACM Transactions on
Programming Languages and Systems, vol. 5, no. 1, pp. 46-55, January
1983.
Rosenthal, 1990.
D. Rosenthal, "Evolving the Vnode Interface," USENIX Association Conference
Proceedings, pp. 107-118, June 1990.
Stevens & Pendry, 1995.
R. Stevens & J. Pendry, "Portals in 4.4BSD," USENIX Association Conference
Proceedings, pp. 1-10, January 1995.
Local Filesystems
7.1
Hierarchical Filesystem Management
The operations defined for local filesystems are divided into two parts. Common
to all local filesystems are hierarchical naming, locking, quotas, attribute management,
and protection. These features, which are independent of how data are
stored, are provided by the UFS code described in this chapter. The other part of
the local filesystem is concerned with the organization and management of the
data on the storage media. Storage is managed by the datastore filesystem operations
described in Chapter 8.
The vnode operations defined for doing hierarchical filesystem operations are
shown in Table 7.1. The most complex of these operations is that for doing a
lookup. The filesystem-independent part of the lookup was described in Section
6.5. The algorithm used to look up a pathname component in a directory is
described in Section 7.3.
Table 7.1
Hierarchical filesystem operations.
Operation done            Operator names
pathname searching        lookup
name creation             create, mknod, link, symlink, mkdir
name change/deletion      rename, remove, rmdir
attribute manipulation    access, getattr, setattr
object interpretation     open, readdir, readlink, mmap, close
process control           advlock, ioctl, select
object management         lock, unlock, inactive, reclaim, abortop
Chapter 7
Local Filesystems
There are five operators for creating names. The operator used depends on
the type of object being created. The create operator creates regular files and also
is used by the networking code to create AF_LOCAL domain sockets. The link
operator creates additional names for existing objects. The symlink operator creates
a symbolic link (see Section 7.3 for a discussion of symbolic links). The
mknod operator creates block and character special devices; it is also used to create
FIFOs. The mkdir operator creates directories.
There are three operators for changing or deleting existing names. The
rename operator deletes a name for an object in one location and creates a new
name for the object in another location. The implementation of this operator is
complex when the kernel is dealing with the movement of a directory from one
part of the filesystem tree to another. The remove operator removes a name. If the
removed name is the final reference to the object, the space associated with the
underlying object is reclaimed. The remove operator operates on all object types
except directories; they are removed using the rmdir operator.
Three operators are supplied for object attributes. The kernel retrieves
attributes from an object using the getattr operator; it stores them using the setattr
operator. Access checks for a given user are provided by the access operator.
Five operators are provided for interpreting objects. The open and close oper­
ators have only peripheral use for regular files, but, when used on special devices,
are used to notify the appropriate device driver of device activation or shutdown.
The readdir operator converts the filesystem-specific format of a directory to the
standard list of directory entries expected by an application. Note that the interpretation
of the contents of a directory is provided by the hierarchical filesystem-management
layer; the filestore code considers a directory as just another object
holding data. The readlink operator returns the contents of a symbolic link. As it
does directories, the filestore code considers a symbolic link as just another object
holding data. The mmap operator prepares an object to be mapped into the
address space of a process.
Three operators are provided to allow process control over objects. The select
operator allows a process to find out whether an object is ready to be read or writ­
ten. The ioctl operator passes control requests to a special device. The advlock
operator allows a process to acquire or release an advisory lock on an object.
None of these operators modifies the object in the filestore. They are simply using
the object for naming or directing the desired operation.
There are five operations for management of the objects. The inactive and
reclaim operators were described in Section 6.6. The lock and unlock operators
allow the callers of the vnode interface to provide hints to the code that implements
operations on the underlying objects. Stateless filesystems such as NFS ignore
these hints. Stateful filesystems, however, can use hints to avoid doing extra work.
For example, an open system call requesting that a new file be created requires
two steps. First, a lookup call is done to see if the file already exists. Before the
lookup is started, a lock request is made on the directory being searched. While
scanning through the directory checking for the name, the lookup code also identi­
fies a location within the directory that contains enough space to hold the new
name. If the lookup returns successfully (meaning that the name does not already
exist), the open code verifies that the user has permission to create the file. If the
user is not eligible to create the new file, then the abortop operator is called to
release any resources held in reserve. Otherwise, the create operation is called. If
the filesystem is stateful and has been able to lock the directory, then it can simply
create the name in the previously identified space, because it knows that no other
processes will have had access to the directory. Once the name is created, an
unlock request is made on the directory. If the filesystem is stateless, then it can­
not lock the directory, so the create operator must rescan the directory to find
space and to verify that the name has not been created since the lookup.
7.2
Structure of an Inode
Figure 7.1
The structure of an inode. [Figure fields: owners (2), timestamps (3), direct blocks, single indirect, double indirect, triple indirect, block count, reference count, generation number.]
To allow files to be allocated concurrently and to allow random access within files, 4.4BSD
uses the concept of an index node, or inode. The inode contains information about
the contents of the file, as shown in Fig. 7.1. This information includes
• The type and access mode for the file
• The file's owner
• The group-access identifier
• The time that the file was most recently read and written
• The time that the inode was most recently updated by the system
• The size of the file in bytes
• The number of physical blocks used by the file (including blocks used to hold
indirect pointers)
• The number of references to the file
• The flags that describe characteristics of the file
• The generation number of the file (a unique number selected to be the approxi­
mate creation time of the file and assigned to the inode each time that the latter is
allocated to a new file; the generation number is used by NFS to detect references
to deleted files)
Notably missing in the inode is the filename. Filenames are maintained in directo­
ries, rather than in inodes, because a file may have many names, or links, and the
name of a file may be large (up to 255 bytes in length). Directories are described
in Section 7.3.
To create a new name for a file, the system increments the count of the number
of names referring to that inode. Then, the new name is entered in a directory,
along with the number of the inode. Conversely, when a name is deleted, the entry
is deleted from a directory, and the name count for the inode is then decremented.
When the name count reaches zero, the system deallocates the inode by putting all
the inode's blocks back on a list of free blocks and by putting the inode back on a
list of unused inodes.
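The name-count bookkeeping can be reduced to two small routines. This is a toy model under stated assumptions: the inode_m structure and both function names are hypothetical, and the directory update and block freeing that accompany each step are left out.

```c
#include <assert.h>
#include <stdbool.h>

struct inode_m {
    int  nlink;       /* number of names referring to the inode */
    bool allocated;   /* false once returned to the unused-inode list */
};

/* Add a name: the count is raised before the directory entry is made. */
void
link_name(struct inode_m *ip)
{
    assert(ip->allocated);
    ip->nlink++;
}

/*
 * Remove a name: the directory entry goes first, then the count drops.
 * Returns true if this removal deallocated the inode.
 */
bool
unlink_name(struct inode_m *ip)
{
    assert(ip->allocated && ip->nlink > 0);
    if (--ip->nlink > 0)
        return (false);
    ip->allocated = false;   /* blocks and inode go back to free lists */
    return (true);
}
```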
The inode also contains an array of pointers to the blocks in the file. The
system can convert from a logical block number to a physical sector number by
indexing into the array using the logical block number. A null array entry shows
that no block has been allocated and will cause a block of zeros to be returned on
a read. On a write of such an entry, a new block is allocated, the array entry is
updated with the new block number, and the data are written to the disk.
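The handling of null array entries can be sketched in a few lines of C. This is a toy model under stated assumptions, not the kernel's code: the names toy_inode, toy_read, and toy_write, the tiny block size, and the in-memory "disk" are all hypothetical, chosen only to show that a read of an unallocated entry returns zeros and that a write allocates a block on first use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS  8    /* hypothetical: pointers held in the toy inode */
#define BLKSIZE  16   /* hypothetical: tiny block size for illustration */
#define DISKBLKS 32   /* hypothetical: blocks on the toy disk */

struct toy_inode {
    uint32_t blk[NBLOCKS];   /* 0 (null) means that no block is allocated */
};

static uint8_t  disk[DISKBLKS][BLKSIZE];
static uint32_t next_free = 1;   /* block 0 is reserved as the null value */

/* Read logical block lbn; an unallocated block reads back as zeros. */
void toy_read(const struct toy_inode *ip, uint32_t lbn, uint8_t *buf)
{
    assert(ip != NULL && buf != NULL && lbn < NBLOCKS);
    if (ip->blk[lbn] == 0)
        memset(buf, 0, BLKSIZE);
    else
        memcpy(buf, disk[ip->blk[lbn]], BLKSIZE);
}

/* Write logical block lbn, allocating a physical block on first use. */
int toy_write(struct toy_inode *ip, uint32_t lbn, const uint8_t *buf)
{
    assert(ip != NULL && buf != NULL && lbn < NBLOCKS);
    if (ip->blk[lbn] == 0) {
        if (next_free >= DISKBLKS)
            return -1;                 /* the toy disk is full */
        ip->blk[lbn] = next_free++;    /* update the array entry */
    }
    memcpy(disk[ip->blk[lbn]], buf, BLKSIZE);
    return 0;
}
```

A file written only at logical block 3 therefore consumes one physical block; blocks 0 through 2 remain holes that read as zeros.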
Inodes are statically allocated and most files are small, so the array of pointers
must be small for efficient use of space. The first 12 array entries are allocated in
the inode itself. For typical filesystems, with 4-Kbyte or 8-Kbyte blocks, this
allows the first 48 or 96 Kbyte of data to be located directly via a simple indexed
lookup.
For somewhat larger files, Fig. 7.1 shows that the inode contains a single
indirect pointer that points to a single indirect block of pointers to data blocks.
To find the one-hundredth logical block of a file, the system first fetches the block
identified by the indirect pointer, then indexes to the eighty-eighth entry in that
block (100 minus 12 direct pointers), and fetches the data block to which that entry
points.
Section 7.2
Structure of an Inode
For files that are bigger than a few Mbyte, the single indirect block is
eventually exhausted; these files must resort to using a double indirect block,
which is a pointer to a block of pointers to pointers to data blocks. For files of
multiple Gbyte, the system uses a triple indirect block, which contains three levels
of pointer before reaching the data block.
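The choice among direct, single, double, and triple indirect pointers is simple arithmetic on the logical block number. The following C sketch is illustrative only: blk_classify, direct_bytes, and the nindir parameter (the number of block pointers that fit in one indirect block) are hypothetical names, and only the count of 12 direct pointers comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NDADDR 12   /* direct pointers held in the inode, per the text */

enum blk_level { BLK_DIRECT, BLK_SINGLE, BLK_DOUBLE, BLK_TRIPLE, BLK_TOO_BIG };

/*
 * Classify logical block lbn. On return, *off is the block's offset within
 * its level (for BLK_SINGLE, the index into the single indirect block).
 */
enum blk_level blk_classify(uint64_t lbn, uint64_t nindir, uint64_t *off)
{
    assert(nindir > 0 && off != NULL);
    if (lbn < NDADDR) { *off = lbn; return BLK_DIRECT; }
    lbn -= NDADDR;
    if (lbn < nindir) { *off = lbn; return BLK_SINGLE; }
    lbn -= nindir;
    if (lbn < nindir * nindir) { *off = lbn; return BLK_DOUBLE; }
    lbn -= nindir * nindir;
    if (lbn < nindir * nindir * nindir) { *off = lbn; return BLK_TRIPLE; }
    return BLK_TOO_BIG;
}

/* Bytes reachable through the direct pointers alone. */
uint64_t direct_bytes(uint64_t blksize)
{
    return NDADDR * blksize;
}
```

With an assumed 2048 pointers per indirect block, block 100 is classified as BLK_SINGLE at offset 88, matching the example above, and direct_bytes() gives 48 Kbyte for 4-Kbyte blocks and 96 Kbyte for 8-Kbyte blocks.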
Although indirect blocks appear to increase the number of disk accesses
required to get a block of data, the overhead of the transfer is typically much
lower than it appears. In Section 6.6, we discussed the management of the
filesystem cache that holds recently used disk blocks. The first time that a block of
indirect pointers is needed, it is brought into the filesystem cache. Further
accesses to the indirect pointers find the block already resident in memory; thus,
they require only a single disk access to get the data.
Inode Management
Most of the activity in the local filesystem revolves around inodes. As described
in Section 6.6, the kernel keeps a list of active and recently accessed vnodes. The
decisions regarding how many and which files should be cached are made by the
vnode layer based on information about activity across all filesystems. Each local
filesystem will have a subset of the system vnodes to manage. Each uses an inode
supplemented with some additional information to identify and locate the set of
files for which it is responsible. Figure 7.2 shows the location of the inodes within
the system.
Reviewing the material in Section 6.4, each process has a process open-file
table that has slots for up to a system-imposed limit of file descriptors; this table is
maintained as part of the process state. When a user process opens a file (or
socket), an unused slot is located in the process's open-file table; the small integer
file descriptor that is returned on a successful open is an index value into this
table.
The per-process file-table entry points to a system open-file entry, which
contains information about the underlying file or socket represented by the
descriptor. For files, the file table points to the vnode representing the open file.
For the local filesystem, the vnode references an inode. It is the inode that
identifies the file.
Figure 7.2
Layout of kernel tables. (Labels: kernel-resident structures, open file, inode, data.)
Figure 7.3
Structure of the inode table. (Label: hash on <inumber, devnumber>.)
The first step in opening a file is to find the file's associated vnode. The
lookup request is given to the filesystem associated with the directory currently
being searched. When the local filesystem finds the name in the directory, it gets
the inode number of the associated file. First, the filesystem searches its collection
of inodes to see whether the requested inode is already in memory. To avoid
doing a linear scan of all its entries, the system keeps a set of hash chains keyed
on inode number and filesystem identifier, as shown in Fig. 7.3. If the inode is not
in the table, such as the first time a file is opened, the filesystem must request a
new vnode. When a new vnode is allocated to the local filesystem, a new structure
to hold the inode is allocated.
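The hash-chain lookup described above can be sketched as follows. The structure layout, the table size, and the hash function are assumptions made for illustration; the text specifies only that the chains are keyed on inode number and filesystem identifier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IHASHSZ 64   /* hypothetical number of hash chains */

struct toy_inode {
    uint32_t inumber;              /* inode number within its filesystem */
    uint32_t dev;                  /* identifies the filesystem (device) */
    struct toy_inode *hash_next;   /* next in-memory inode on the same chain */
};

static struct toy_inode *ihash[IHASHSZ];

/* Select the chain for <inumber, dev>; the hash itself is an assumption. */
static struct toy_inode **ihash_chain(uint32_t dev, uint32_t inumber)
{
    return &ihash[(dev + inumber) % IHASHSZ];
}

/* Return the in-memory inode, or NULL if it must be read from disk. */
struct toy_inode *ihash_lookup(uint32_t dev, uint32_t inumber)
{
    struct toy_inode *ip;

    for (ip = *ihash_chain(dev, inumber); ip != NULL; ip = ip->hash_next)
        if (ip->dev == dev && ip->inumber == inumber)
            return ip;
    return NULL;
}

/* Enter a newly loaded inode at the head of its chain. */
void ihash_insert(struct toy_inode *ip)
{
    struct toy_inode **head;

    assert(ip != NULL);
    head = ihash_chain(ip->dev, ip->inumber);
    ip->hash_next = *head;
    *head = ip;
}

/* Unlink an inode from its chain, as is done when its vnode is reclaimed. */
void ihash_remove(struct toy_inode *ip)
{
    struct toy_inode **pp;

    assert(ip != NULL);
    for (pp = ihash_chain(ip->dev, ip->inumber); *pp != NULL;
         pp = &(*pp)->hash_next) {
        if (*pp == ip) {
            *pp = ip->hash_next;
            ip->hash_next = NULL;
            return;
        }
    }
}
```

Because the key includes the filesystem identifier, the same inode number on two different filesystems resolves to two distinct entries.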
The next step is to locate the disk block containing the inode and to read that
block into a buffer in system memory. When the disk I/O completes, the inode is
copied from the disk buffer into the newly allocated inode entry. In addition to the
information contained in the disk portion of the inode, the inode table itself
maintains supplemental information while the inode is in memory. This information
includes the hash chains described previously, as well as flags showing the inode's
status, reference counts on its use, and information to manage locks. The
information also contains pointers to other kernel data structures of frequent
interest, such as the superblock for the filesystem containing the inode.
When the last reference to a file is closed, the local filesystem is notified that
the file has become inactive. When it is inactivated, the inode times will be
updated, and the inode may be written to disk. However, it remains on the hash
list so that it can be found if it is reopened. After being inactive for a period
determined by the vnode layer based on demand for vnodes in all the filesystems, the
vnode will be reclaimed. When a vnode for a local file is reclaimed, the inode is
removed from the previous filesystem's hash chain and, if the inode is dirty, its
contents are written back to disk. The space for the inode is then deallocated, so
that the vnode will be ready for use by a new filesystem client.
Section 7.3
Filesystems contain files, most of which contain ordinary data. Certain files are
distinguished as directories and contain pointers to files that may themselves be
directories. This hierarchy of directories and files is organized into a tree
structure; Fig. 7.4 shows a small filesystem tree. Each of the circles in the figure
represents an inode with its corresponding inode number inside. Each of the arrows
represents a name in a directory. For example, inode 4 is the /usr directory with
entry ., which points to itself, and entry .., which points to its parent, inode 2, the
root of the filesystem. It also contains the name bin, which references directory
inode 7, and the name foo, which references file inode 6.
Figure 7.4
A small filesystem tree.
Figure 7.5
Format of directory chunks. (Panels: a directory block with three entries; an
empty directory block.)
Directories are allocated in units called chunks; Fig. 7.5 shows a typical
directory chunk. The size of a chunk is chosen such that each allocation
can be transferred to disk in a single operation; the ability to change a directory in
a single operation makes directory updates atomic. Chunks are broken up into
variable-length directory entries to allow filenames to be of nearly arbitrary length.
No directory entry can span multiple chunks. The first four fields of a directory
entry are of fixed length and contain
1. An index into a table of on-disk inode structures; the selected entry describes
the file (inodes were described in Section 7.2)
2. The size of the entry in bytes
3. The type of the entry
4. The length of the filename contained in the entry in bytes
The remainder of an entry is of variable length and contains a null-terminated
filename, padded to a 4-byte boundary. The maximum length of a filename in a
directory is 255 characters.
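The entry layout just described can be written down as a C structure together with the arithmetic for the minimum entry size. The field names and the field widths (4, 2, 1, and 1 bytes) are assumptions made for this sketch; the text states only that the four fields are of fixed length and that the name is null terminated and padded to a 4-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAXNAMLEN 255   /* maximum filename length, per the text */

/* Fixed-length part of a directory entry (names and widths are illustrative). */
struct toy_dirhdr {
    uint32_t ino;      /* index of the on-disk inode; 0 marks a free entry */
    uint16_t reclen;   /* size of the entry in bytes, including free space */
    uint8_t  type;     /* type of the entry */
    uint8_t  namlen;   /* length of the filename, not counting the null */
};

/* Minimum number of bytes needed for an entry whose name has namlen bytes. */
size_t toy_dirsiz(size_t namlen)
{
    assert(namlen >= 1 && namlen <= TOY_MAXNAMLEN);
    /* header, then name plus terminating null rounded up to a multiple of 4 */
    return sizeof(struct toy_dirhdr) + ((namlen + 1 + 3) & ~(size_t)3);
}
```

Under these assumptions, the entry for usr needs 12 bytes (an 8-byte header plus "usr" and its null), and a 255-character name needs 264 bytes.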
The filesystem records free space in a directory by having entries accumulate
the free space in their size fields. Thus, some directory entries are larger than
required to hold the entry name plus fixed-length fields. Space allocated to a
directory should always be accounted for completely by the total of the sizes of
the directory's entries. When an entry is deleted from a directory, the system
coalesces the entry's space into the previous entry in the same directory chunk by
increasing the size of the previous entry by the size of the deleted entry. If the first
entry of a directory chunk is free, then the pointer to the entry's inode is set to zero
to show that the entry is unallocated.
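Deletion by coalescing can be sketched as a walk over one chunk. The chunk size, the header layout, and the function name toy_dir_remove are hypothetical; the sketch shows only the two rules from the text: a deleted entry's space is absorbed by its predecessor, and a deleted first entry has its inode pointer set to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 512   /* hypothetical chunk size in bytes */

/* Fixed-length entry header; names and widths are illustrative. */
struct toy_dirhdr {
    uint32_t ino;
    uint16_t reclen;
    uint8_t  type;
    uint8_t  namlen;
};

/* Delete the entry that starts at byte offset off within chunk. */
void toy_dir_remove(uint8_t *chunk, size_t off)
{
    struct toy_dirhdr cur, prev;
    size_t pos = 0, prevpos = 0;

    assert(chunk != NULL && off < CHUNK);
    /* Walk the entries to find the one immediately before off. */
    while (pos < off) {
        memcpy(&cur, chunk + pos, sizeof cur);
        assert(cur.reclen != 0);
        prevpos = pos;
        pos += cur.reclen;
    }
    assert(pos == off);               /* off must be an entry boundary */
    memcpy(&cur, chunk + off, sizeof cur);
    if (off == 0) {
        cur.ino = 0;                  /* first entry: mark it unallocated */
        memcpy(chunk, &cur, sizeof cur);
    } else {
        memcpy(&prev, chunk + prevpos, sizeof prev);
        prev.reclen = (uint16_t)(prev.reclen + cur.reclen);  /* absorb space */
        memcpy(chunk + prevpos, &prev, sizeof prev);
    }
}
```

After any sequence of deletions, the sizes of the remaining entries still sum to the chunk size, which is the invariant stated above.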
Applications obtain chunks of directories from the kernel by using the
getdirentries system call. For the local filesystem, the on-disk format of directories
is identical to that expected by the application, so the chunks are returned
uninterpreted. When directories are read over the network or from non-BSD
filesystems such as MS-DOS, the getdirentries system call has to convert the on-disk
representation of the directory to that described.
Normally, programs want to read directories one entry at a time. This
interface is provided by the directory-access routines. The opendir() function
returns a structure pointer that is used by readdir() to get chunks of directories
using getdirentries; readdir() returns the next entry from the chunk on each call. The
closedir() function deallocates space allocated by opendir() and closes the
directory. In addition, there is the rewinddir() function to reset the read position to
the beginning, the telldir() function that returns a structure describing the current
directory position, and the seekdir() function that returns to a position previously
obtained with telldir().
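A minimal use of the directory-access routines looks like the following. The helper name count_dir_entries is hypothetical; opendir(), readdir(), and closedir() are the standard library routines named above.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>

/* Count the entries in a directory; return -1 if it cannot be opened. */
long count_dir_entries(const char *path)
{
    DIR *dp;
    long n = 0;

    assert(path != NULL);
    dp = opendir(path);          /* allocates the structure used by readdir() */
    if (dp == NULL)
        return -1;
    while (readdir(dp) != NULL)  /* one entry per call */
        n++;
    closedir(dp);                /* frees the structure and closes the directory */
    return n;
}
```

The caller never sees chunk boundaries; the library refills its buffer as needed and hands back one entry at a time.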
Finding of Names in Directories
A common request to the filesystem is to look up a specific name in a directory.
The kernel usually does the lookup by starting at the beginning of the directory
and going through, comparing each entry in turn. First, the length of the
sought-after name is compared with the length of the name being checked. If the
lengths are identical, a string comparison of the name being sought and the directory
entry is made. If they match, the search is complete; if they fail, either in the length
or in the string comparison, the search continues with the next entry. Whenever a
name is found, its name and containing directory are entered into the systemwide
name cache described in Section 6.6. Whenever a search is unsuccessful, an entry
is made in the cache showing that the name does not exist in the particular
directory. Before starting a directory scan, the kernel looks for the name in the cache.
If either a positive or negative entry is found, the directory scan can be avoided.
Another common operation is to look up all the entries in a directory. For
example, many programs do a stat system call on each name in a directory in the
order that the names appear in the directory. To improve performance for these
programs, the kernel maintains the directory offset of the last successful lookup
for each directory. Each time that a lookup is done in that directory, the search is
started from the offset at which the previous name was found (instead of from the
beginning of the directory). For programs that step sequentially through a
directory with n files, search time decreases from Order(n^2) to Order(n).
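The effect of remembering the last successful offset can be demonstrated with a small model that counts name comparisons. Everything here is a simplification for illustration: the directory is an array of names rather than chunks on disk, and toy_dir and toy_lookup are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_dir {
    const char **names;   /* entry names in directory order */
    size_t n;             /* number of entries */
    size_t last;          /* index at which the previous lookup succeeded */
    unsigned long cmps;   /* name comparisons done so far (for illustration) */
};

/*
 * Return the index of name, or -1 if it is absent. When use_last is nonzero,
 * the scan starts where the previous name was found and wraps around.
 */
long toy_lookup(struct toy_dir *d, const char *name, int use_last)
{
    size_t start, i, k;

    assert(d != NULL && name != NULL);
    start = (use_last && d->last < d->n) ? d->last : 0;
    for (k = 0; k < d->n; k++) {
        i = (start + k) % d->n;      /* wrap from the end to the beginning */
        d->cmps++;
        if (strcmp(d->names[i], name) == 0) {
            d->last = i;
            return (long)i;
        }
    }
    return -1;
}
```

In this model, looking up n names in directory order costs 2n - 1 comparisons with the saved offset and n(n + 1)/2 without it. A lookup that starts at the saved offset and misses still scans the tail and then the head of the directory, which is the two-pass cost of a miss.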
One quick benchmark that demonstrates the maximum effectiveness of the
cache is running the ls -l command on a directory containing 600 files. On a
system that retains the most recent directory offset, the amount of system time for
this test is reduced by 85 percent. Unfortunately, the maximum effectiveness is much
greater than the average effectiveness. Although the cache is 90-percent effective
when hit, it is applicable to only about 25 percent of the names being looked up.
Although the amount of time spent in the lookup routine itself decreases
substantially, the improvement is diminished because more time is spent in the
routines that the lookup routine calls. Each cache miss causes a directory to be
accessed twice: once to search from the middle to the end, and once to search from
the beginning to the middle.
Pathname Translation
Figure 7.6
Internal structure of a small filesystem. (Panels: inode list, data blocks.)
We are now ready to describe how the filesystem looks up a pathname. The
small filesystem introduced in Fig. 7.4 is expanded to show its internal structure
in Fig. 7.6. Each of the files in Fig. 7.4 is shown expanded into its
constituent inode and data blocks. As an example of how these data structures
work, consider how the system finds the file /usr/bin/vi. It must first search the
root directory of the filesystem to find the directory usr. It first finds the inode
that describes the root directory. By convention, inode 2 is always reserved for
the root directory of a filesystem; therefore, the system finds and brings inode 2
into memory. This inode shows where the data blocks are for the root directory;
these data blocks must also be brought into memory so that they can be searched
for the entry usr. Having found the entry for usr, the system knows that the
contents of usr are described by inode 4. Returning once again to the disk, the
system fetches inode 4 to find where the data blocks for usr are located. Searching
these blocks, it finds the entry for bin. The bin entry points to inode 7.
Next, the system brings in inode 7 and its associated data blocks from the disk,
to search for the entry for vi. Having found that vi is described by inode 9, the
system can fetch this inode and the blocks that contain the vi binary.
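The walk from the root to /usr/bin/vi can be traced with a toy translation routine. The table below encodes only the directory entries named in the text (usr in inode 2, bin and foo in inode 4, vi in inode 7); the names toy_fs, toy_dirlookup, and toy_namei are hypothetical, and a flat array stands in for inodes and data blocks read from disk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ROOTINO 2   /* inode 2 is the root directory, per the text */

struct toy_dirent {
    unsigned dir;       /* inode number of the containing directory */
    const char *name;   /* name of the entry */
    unsigned ino;       /* inode number that the name references */
};

/* Directory contents taken from the example in the text. */
static const struct toy_dirent toy_fs[] = {
    { 2, "usr", 4 },
    { 4, "bin", 7 },
    { 4, "foo", 6 },
    { 7, "vi",  9 },
};

/* Search one directory for a name of length len; return its inode or 0. */
static unsigned toy_dirlookup(unsigned dir, const char *name, size_t len)
{
    size_t i;

    for (i = 0; i < sizeof toy_fs / sizeof toy_fs[0]; i++)
        if (toy_fs[i].dir == dir &&
            strlen(toy_fs[i].name) == len &&        /* compare lengths first */
            memcmp(toy_fs[i].name, name, len) == 0)
            return toy_fs[i].ino;
    return 0;
}

/* Translate an absolute pathname to an inode number; 0 means not found. */
unsigned toy_namei(const char *path)
{
    unsigned ino = ROOTINO;

    assert(path != NULL && path[0] == '/');
    while (*path != '\0') {
        size_t len;

        while (*path == '/')             /* skip separators */
            path++;
        len = strcspn(path, "/");        /* length of the next component */
        if (len == 0)
            break;
        ino = toy_dirlookup(ino, path, len);
        if (ino == 0)
            return 0;
        path += len;
    }
    return ino;
}
```

Calling toy_namei("/usr/bin/vi") visits inodes 2, 4, and 7 in turn and returns 9, one directory search per pathname component.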
Each file has a single inode, but multiple directory entries in the same filesystem
may reference that inode (i.e., the inode may have multiple names). Each
directory entry creates a hard link of a filename to the inode that describes the file's
contents. The link concept is fundamental; inodes do not reside in directories, but
rather exist separately and are referenced by links. When all the links to an inode
are removed, the inode is deallocated. If one link to a file is removed and the
filename is recreated with new contents, the other links will continue to point to the
old inode. Figure 7.7 shows two different directory entries, foo and bar, that
reference the same file; thus, the inode for the file shows a reference count of 2.
The system also supports a symbolic link, or soft link. A symbolic link is
implemented as a file that contains a pathname. When the system encounters a
symbolic link while looking up a component of a pathname, the contents of the
symbolic link are prepended to the rest of the pathname; the lookup continues
with the resulting pathname. If a symbolic link contains an absolute pathname,
that absolute pathname is used; otherwise, the contents of the symbolic link are
evaluated relative to the location of the link in the file hierarchy (not relative to the
current working directory of the calling process).
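The string manipulation involved in following a symbolic link can be sketched as below. The function toy_expand_symlink and its parameters are hypothetical; the sketch builds the pathname with which the lookup continues, and leaves the lookup itself (and any cleanup of . and .. components) to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Build the pathname that the lookup continues with after meeting a symbolic
 * link. linkdir is the directory that holds the link, target is the contents
 * of the link, and rest is the untranslated remainder of the original
 * pathname. Returns 0 on success, or -1 if out is too small.
 */
int toy_expand_symlink(const char *linkdir, const char *target,
                       const char *rest, char *out, size_t outsz)
{
    int n;

    assert(linkdir != NULL && target != NULL && rest != NULL && out != NULL);
    if (target[0] == '/')    /* absolute: the link contents are used as is */
        n = snprintf(out, outsz, "%s%s%s", target, *rest ? "/" : "", rest);
    else                     /* relative: evaluated from the link's location */
        n = snprintf(out, outsz, "%s/%s%s%s", linkdir, target,
                     *rest ? "/" : "", rest);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

For a link in /usr/keith whose contents are /usr/src, the remainder sys yields /usr/src/sys; for relative contents ../src, the result is /usr/keith/../src, independent of the caller's current working directory.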
An example symbolic link is shown in Fig. 7.8. Here, there is a
hard link, foo, that points to the file. The other reference, bar, points to a different
inode whose contents are a pathname of the referenced file. When a process
opens bar, the system interprets the contents of the symbolic link as a pathname
to find the file the link references. Symbolic links are treated like data files by the
system, rather than as part of the filesystem structure; thus, they can point at
directories or files on other filesystems. If a filename is removed and replaced, any
symbolic links that point to it will access the new file. Finally, if the filename is
not replaced, the symbolic link will point at nothing, and any attempt to access it
will be an error.
Figure 7.7
Hard links to a file.
Figure 7.8
Symbolic link to a file.
When open is applied to a symbolic link, it returns a file descriptor for the file
pointed to, not for the link itself. Otherwise, it would be necessary to use
indirection to access the file pointed to, and that file, rather than the link, is what is
usually wanted. For the same reason, most other system calls that take pathname
arguments also follow symbolic links. Sometimes, it is useful to be able to detect
a symbolic link when traversing a filesystem or when making an archive tape. So,
the lstat system call is available to get the status of a symbolic link, instead of the
object at which that link points.
A symbolic link has several advantages over a hard link. Since a symbolic
link is maintained as a pathname, it can refer to a directory or to a file on a
different filesystem. So that loops in the filesystem hierarchy are prevented,
unprivileged users are not permitted to create hard links (other than . and ..) that
refer to a directory. The implementation of hard links prevents hard links from
referring to files on a different filesystem.
There are several interesting implications of symbolic links. Consider a
process that has current working directory /usr/keith and does cd src, where src is a
symbolic link to directory /usr/src. If the process then does a cd .., then the
current working directory for the process will be in /usr instead of in /usr/keith, as it
would have been if src were a normal directory instead of a symbolic link. The
kernel could be changed to keep track of the symbolic links that a process has
traversed, and to interpret .. differently if the directory has been reached through a
symbolic link. There are two problems with this implementation. First, the kernel
would have to maintain a potentially unbounded amount of information. Second,
no program could depend on being able to use .., since it could not be sure how
the name would be interpreted.
Section 7.4
Many shells keep track of symbolic-link traversals. When the user changes
directory through .. from a directory that was entered through a symbolic link, the
shell returns the user to the directory from which they came. Although the shell
might have to maintain an unbounded amount of information, the worst that will
happen is that the shell will run out of memory. Having the shell fail will affect
only the user silly enough to traverse endlessly through symbolic links. Tracking
of symbolic links affects only change-directory commands in the shell; programs
can continue to depend on .. referencing its true parent. Thus, tracking symbolic
links outside of the kernel in a shell is reasonable.
Since symbolic links may cause loops in the filesystem, the kernel prevents
looping by allowing at most eight symbolic link traversals in a single pathname
translation. If the limit is reached, the kernel produces an error (ELOOP).
Resource sharing always has been a design goal for the BSD system. By default,
any single user can allocate all the available space in the filesystem. In certain
environments, uncontrolled use of disk space is unacceptable. Consequently,
4.4BSD includes a quota mechanism to restrict the amount of filesystem resources
that a user or members of a group can obtain. The quota mechanism sets limits on
both the number of files and the number of disk blocks that a user or members of a
group may allocate. Quotas can be set separately for each user and group on each
filesystem.
Quotas support both hard and soft limits. When a process exceeds its soft
limit, a warning is printed on the user's terminal; the offending process is not
prevented from allocating space unless it exceeds its hard limit. The idea is that users
should stay below their soft limit between login sessions, but may use more
resources while they are active. If a user fails to correct the problem for longer
than a grace period, the soft limit starts to be enforced as the hard limit. The grace
period is set by the system administrator and is 7 days by default. These quotas
are derived from a larger resource-limit package that was developed at the
University of Melbourne in Australia by Robert Elz [Elz, 1984].
Quotas connect into the system primarily as an adjunct to the allocation
routines. When a new block is requested from the allocation routines, the request is
first validated by the quota system with the following steps:
1. If there is a user quota associated with the file, the quota system consults the
quota associated with the owner of the file. If the owner has reached or
exceeded their limit, the request is denied.
2. If there is a group quota associated with the file, the quota system consults the
quota associated with the group of the file. If the group has reached or
exceeded its limit, the request is denied.
3. If the quota tests pass, the request is permitted and is added to the usage
statistics for the file.
When either a user or group quota would be exceeded, the allocator returns a
failure as though the filesystem were full. The kernel propagates this error up to the
process doing the write system call.
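The three validation steps can be sketched as a single check. This is not the kernel's chkdq() routine: the structure, the use of 0 to mean "no limit", and the grace_expired flag are assumptions made for illustration, and the warning printed when a soft limit is first exceeded is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_Q_OK = 0, TOY_Q_DENIED = -1 };

struct toy_quota {
    uint32_t soft;       /* soft limit in blocks; 0 means no limit (assumption) */
    uint32_t hard;       /* hard limit in blocks; 0 means no limit (assumption) */
    uint32_t cur;        /* blocks currently allocated */
    int grace_expired;   /* nonzero once the grace period has run out */
};

/* Would allocating want more blocks take this quota past an enforced limit? */
static int toy_over_limit(const struct toy_quota *q, uint32_t want)
{
    uint64_t total;

    assert(q != NULL);
    total = (uint64_t)q->cur + want;
    if (q->hard != 0 && total > q->hard)
        return 1;
    if (q->soft != 0 && total > q->soft && q->grace_expired)
        return 1;        /* the soft limit is enforced as the hard limit */
    return 0;
}

/* uq or gq may be NULL when no user or group quota applies to the file. */
int toy_chkdq(struct toy_quota *uq, struct toy_quota *gq, uint32_t want)
{
    if (uq != NULL && toy_over_limit(uq, want))   /* step 1: the owner */
        return TOY_Q_DENIED;
    if (gq != NULL && toy_over_limit(gq, want))   /* step 2: the group */
        return TOY_Q_DENIED;
    if (uq != NULL)                               /* step 3: record the usage */
        uq->cur += want;
    if (gq != NULL)
        gq->cur += want;
    return TOY_Q_OK;
}
```

Both quotas are tested before either usage count is changed, so a request denied by the group quota leaves the user's count untouched.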
Quotas are assigned to a filesystem after it has been mounted. A system call
associates a file containing the quotas with the mounted filesystem. By
convention, the file with user quotas is named quota.user, and the file with group
quotas is named quota.group. These files typically reside either in the root of the
mounted filesystem or in the /var/quotas directory. For each quota to be imposed,
the system opens the appropriate quota file and holds a reference to it in the
mount-table entry associated with the mounted filesystem. Figure 7.9 shows the
mount-table reference. Here, the root filesystem has a quota on users, but has
none on groups. The /usr filesystem has quotas imposed on both users and
groups. As quotas for different users or groups are needed, they are taken from
the appropriate quota file.
Quota files are maintained as an array of quota records indexed by user or
group identifiers; Fig. 7.10 shows a typical record in a user quota file. To find the
quota for user identifier i, the system seeks to location i × sizeof(quota structure)
in the quota file and reads the quota structure at that location. Each quota
structure contains the limits imposed on the user for the associated filesystem. These
limits include the hard and soft limits on the number of blocks and inodes that the
user may have, the number of blocks and inodes that the user currently has
allocated, and the amount of time that the user has remaining before the soft limit is
enforced as the hard limit. The group quota file works in the same way, except
that it is indexed by group identifier.
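Indexing the quota file by identifier can be sketched with ordinary file I/O. The record layout below uses the eight fields listed for a quota record, but the field names and the 32-bit widths are assumptions, as are the helper names toy_quota_offset and toy_quota_read.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* One quota record; field names and widths are illustrative. */
struct toy_dqblk {
    uint32_t bsoft;   /* block quota (soft limit) */
    uint32_t bhard;   /* block limit (hard limit) */
    uint32_t bcur;    /* current number of blocks */
    uint32_t btime;   /* time to begin enforcing block quota */
    uint32_t isoft;   /* inode quota (soft limit) */
    uint32_t ihard;   /* inode limit (hard limit) */
    uint32_t icur;    /* current number of inodes */
    uint32_t itime;   /* time to begin enforcing inode quota */
};

/* Byte offset of the record for identifier id: id x sizeof(record). */
uint64_t toy_quota_offset(uint32_t id)
{
    return (uint64_t)id * sizeof(struct toy_dqblk);
}

/* Read the record for id from an open quota file; return 0 on success. */
int toy_quota_read(FILE *fp, uint32_t id, struct toy_dqblk *out)
{
    uint64_t off = toy_quota_offset(id);

    assert(fp != NULL && out != NULL);
    if (off > (uint64_t)LONG_MAX || fseek(fp, (long)off, SEEK_SET) != 0)
        return -1;
    return fread(out, sizeof *out, 1, fp) == 1 ? 0 : -1;
}
```

No search is needed: the identifier alone determines where the record lives, at the cost of a sparse file when identifiers are large and widely spaced.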
Figure 7.9
References to quota files. (Labels: mount-table entries for /, for /usr, and for
/arch; vnode for /quota.user, vnode for /usr/quota.user, vnode for /usr/)
Figure 7.10
Contents of a quota record. (The quota.user file holds one quota block per uid,
from uid 0 through uid n. Fields of the quota block for uid i: block quota (soft
limit), block limit (hard limit), current number of blocks, time to begin enforcing
block quota, inode quota (soft limit), inode limit (hard limit), current number of
inodes, time to begin enforcing inode quota.)
Figure 7.11
Dquot entries. (Labels: inode entries with fs = /usr, uid = 8; dquot entries of
type = user (uid = 8) and type = group (gid = 12, reference count = 1); quota block
for uid 8; quota block for gid 12.)
Active quotas are held in system memory in a data structure known as a dquot
entry; Fig. 7.11 shows two typical entries. In addition to the quota limits and
usage extracted from the quota file, the dquot entry maintains information about
the quota while the quota is in use. This information includes fields to allow fast
access and identification. Quotas are checked by the chkdq() routine. Since
quotas may have to be updated on every write to a file, chkdq() must be able to find
and manipulate them quickly. Thus, the task of finding the dquot structure
associated with a file is done when the file is first opened for writing. When an access
check is done for writing, the system checks to see whether there is either
a user or a group quota associated with the file. If one or more quotas exist, the
inode is set up to hold a reference to the appropriate dquot structures for as long as
the inode is resident. The chkdq() routine can determine that a file has a quota
simply by checking whether the dquot pointer is nonnull; if it is, all the necessary
information can be accessed directly. If a user or a group has multiple files open
on the same filesystem, all inodes describing those files point to the same dquot
entry. Thus, the number of blocks allocated to a particular user or a group can
always be known easily and consistently.
The number of dquot entries in the system can grow large. To avoid doing a
linear scan of all the dquot entries, the system keeps a set of hash chains keyed on
the filesystem and on the user or group identifier. Even with hundreds of dquot
entries, the kernel needs to inspect only about five entries to determine whether a
requested dquot entry is memory resident. If the dquot entry is not resident, such
as the first time a file is opened for writing, the system must reallocate a dquot
entry and read in the quota from disk. The dquot entry is reallocated from the
least recently used dquot entry. So that it can find the oldest dquot entry quickly,
the system keeps unused dquot entries linked together in an LRU chain. When the
reference count on a dquot structure drops to zero, the system puts that dquot onto
the end of the LRU chain. The dquot structure is not removed from its hash chain,
so if the structure is needed again soon, it can still be located. Only when a dquot
structure is recycled with a new quota record is it removed and relinked into the
hash chain. The dquot entry on the front of the LRU chain yields the least recently
used dquot entry. Frequently used dquot entries are reclaimed from the middle of
the LRU chain and are relinked at the end after use.
The hashing structure allows dquot structures to be found quickly. However,
it does not solve the problem of how to discover that a user has no quota on a
particular filesystem. If a user has no quota, a lookup for the quota will fail. The cost
of going to disk and reading the quota file to discover that the user has no quota
imposed would be prohibitive. To avoid doing this work each time that a new file
is accessed for writing, the system maintains nonquota dquot entries. When an
inode owned by a user or group that does not already have a dquot entry is first
accessed, a dummy dquot entry is created that has infinite values filled in for the
quota limits. When the chkdq() routine encounters such an entry, it will update
the usage fields, but will not impose any limits. When the user later writes other
files, the same dquot entry will be found, thus avoiding additional access to the
on-disk quota file. Ensuring that a file will always have a dquot entry improves
the performance of writing data, since chkdq() can assume that the dquot
pointer is always valid, rather than having to check the pointer before every use.
Quotas are written back to the disk when they fall out of the cache, whenever
the filesystem does a sync, or when the filesystem is unmounted. If the system
crashes, leaving the quotas in an inconsistent state, the system administrator must
run the quotacheck program to rebuild the usage information in the quota files.
7.5 File Locking
Locks may be placed on any arbitrary range of bytes within a file. These
semantics are supported in 4.4BSD by a list of locks, each of which describes a lock of
a specified byte range. An example of a file containing several range locks is shown
in Fig. 7.12. The list of currently held or active locks appears across the top of the
figure, headed by the i_lockf field in the inode, and linked together through the
lf_next field of the lock structures. Each lock structure identifies the type of the
lock (exclusive or shared), the byte range over which the lock applies, and the
identity of the lock holder. A lock may be identified either by a pointer to a
process entry or by a pointer to a file entry. A process pointer is used for POSIX-style
range locks; a file-entry pointer is used for BSD-style whole file locks. The
examples in this section show the identity as a pointer to a process entry. In this
example, there are three active locks: an exclusive lock held by process 1 on bytes 1
to 3, a shared lock held by process 2 on bytes 7 to 12, and a shared lock held by
process 3 on bytes 7 to 14.
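Whether an active lock blocks a request depends on three things named above: the lock types, the byte ranges, and the identities of the holders. The sketch below is a simplified model, not the kernel's lock structure; toy_lock, toy_overlaps, and toy_blocks are hypothetical names, and the holder is represented by a small integer rather than a pointer to a process entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_locktype { TOY_SHARED, TOY_EXCL };

struct toy_lock {
    enum toy_locktype type;   /* exclusive or shared */
    uint64_t start, end;      /* inclusive byte range */
    int owner;                /* identity of the holder (a process number here) */
};

/* Do the two inclusive byte ranges share at least one byte? */
int toy_overlaps(const struct toy_lock *a, const struct toy_lock *b)
{
    assert(a != NULL && b != NULL);
    assert(a->start <= a->end && b->start <= b->end);
    return a->start <= b->end && b->start <= a->end;
}

/* Does the active lock held block the request req? */
int toy_blocks(const struct toy_lock *held, const struct toy_lock *req)
{
    assert(held != NULL && req != NULL);
    if (held->owner == req->owner)
        return 0;             /* a holder is never blocked by its own lock */
    if (held->type == TOY_SHARED && req->type == TOY_SHARED)
        return 0;             /* shared locks may coexist */
    return toy_overlaps(held, req);
}
```

With the three active locks of the example, the two shared locks on bytes 7 to 12 and 7 to 14 coexist, whereas an exclusive request by another process for any byte in either range would be blocked by both.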
Figure 7.12
A set of range locks on a file.
In addition to the active locks, there are other processes that are sleeping
waiting to get a lock applied. Pending locks are headed by the lf_block field of the
active lock that prevents them from being applied. If there are multiple pending
locks, they are linked through their lf_block fields. New lock requests are placed
at the end of the list; thus, processes tend to be granted locks in the order that they
requested the locks. Each pending lock uses its lf_next field to identify the active
lock that currently blocks it. In the example in Fig. 7.12, the first active lock has
two other locks pending. There is also a pending request for the range 9 to 12 that
is currently linked onto the second active entry. It could equally well have been
linked onto the third active entry, since the third entry also blocks it. When an
active lock is released, all pending entries for that lock are awakened, so that they
can retry their request. If the second active lock were released, the result would be
that its currently pending request would move over to the blocked list for the last
active entry.
A problem that must be handled by the locking implementation is the
detection of potential deadlocks. To see how deadlock is detected, consider the
addition of the lock request by process 2 outlined in the dashed box in Fig. 7.12. Since
the request is blocked by an active lock, process 2 must sleep waiting for the
active lock on range 1 to 3 to clear. We follow the lf_next pointer from the
requesting lock (the one in the dashed box), to identify the active lock for the
1-to-3 range as being held by process 1. The wait channel for process 1 shows
that that process too is sleeping, waiting for a lock to clear, and identifies the
pending lock structure as the pending lock (range 9 to 12) hanging off the lf_block
field of the second active lock (range 7 to 12). We follow the lf_next field of this
pending lock structure (range 9 to 12) to the second active lock (range 7 to 12) that
is held by the lock requester, process 2. Thus, the lock request is denied, as it
would lead to a deadlock between processes 1 and 2. This algorithm works on
cycles of locks and processes of arbitrary size.
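The chain-following test can be reduced to a walk over a "waits on" relation between processes. This is a simplified model with hypothetical names: the kernel follows lf_next pointers and wait channels between lock structures, whereas the sketch records, for each sleeping process, only the process that holds the lock on which it sleeps.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NONE (-1)   /* the process is not sleeping on any lock */

/*
 * waits_on[p] is the process holding the active lock on which process p
 * sleeps, or TOY_NONE. Would blocking requester behind a lock held by holder
 * close a cycle? Follow the chain of sleeping holders; reaching the
 * requester means that the request must be denied.
 */
int toy_would_deadlock(const int *waits_on, int nproc, int requester, int holder)
{
    int p = holder, steps;

    assert(waits_on != NULL);
    assert(requester >= 0 && requester < nproc);
    assert(holder >= 0 && holder < nproc);
    for (steps = 0; p != TOY_NONE && steps < nproc; steps++) {
        if (p == requester)
            return 1;
        p = waits_on[p];
    }
    return 0;
}
```

In the example, process 1 sleeps on a lock held by process 2, so a request by process 2 that would sleep on a lock held by process 1 is reported as a deadlock. If process 1's pending request were instead recorded against the lock held by process 3, the same request would be allowed to sleep.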
As we noted, the pending request for the range 9 to 12 could equally well have
been hung off the third active lock for the range 7 to 14. Had it been, the request
for adding the lock in the dashed box would have succeeded, since the third active
lock is held by process 3, rather than by process 2. If the next lock request on this
file were to release the third active lock, then deadlock detection would occur
when process 1's pending lock got shifted to the second active lock (range 7 to
12). The difference is that process 1, instead of process 2, would get the deadlock
error.
When a new lock request is made, it must first be checked to see whether it is
blocked by existing locks held by other processes. If it is not blocked by other
processes, it must then be checked to see whether it overlaps any existing locks
already held by the process making the request. There are five possible overlap
cases that must be considered; these possibilities are shown in Fig. 7.13. The
assumption in the figure is that the new request is of a type different from that of
the existing lock (i.e., an exclusive request against a shared lock, or vice versa); if
the existing lock and the request are of the same type, the analysis is a bit simpler.
The five cases are as follows:
Figure 7.13
Five types of overlap considered by the kernel when a range lock is added.
1. The new request exactly overlaps the existing lock. The new request replaces
the existing lock. If the new request downgrades from exclusive to shared, all
requests pending on the old lock are awakened.
2. The new request is a subset of the existing lock. The existing lock is broken
into three pieces (two if the new lock begins at the beginning or ends at the
end of the existing lock). If the type of the new request differs from that of the
existing lock, all requests pending on the old lock are awakened, so that they
can be reassigned to the correct new piece, blocked on a lock held by some
other process, or granted.
3. The new request is a superset of an existing lock. The new request replaces
the existing lock. If the new request downgrades from exclusive to shared, all
requests pending on the old lock are awakened.
4. The new request extends past the end of an existing lock. The existing lock is
shortened, and its overlapped piece is replaced by the new request. All
requests pending on the existing lock are awakened, so that they can be
reassigned to the correct new piece, blocked on a lock held by some other process,
or granted.
5. The new request extends into the beginning of an existing lock. The existing
lock is shortened, and its overlapped piece is replaced by the new request. All
requests pending on the existing lock are awakened, so that they can be
reassigned to the correct new piece, blocked on a lock held by some other process,
or granted.
In addition to the five basic types of overlap outlined, a request may span several
existing locks. Specifically, a new request may be composed of zero or one of
type 4, zero or more of type 3, and zero or one of type 5.
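A minimal sketch of the five-way classification, assuming inclusive byte ranges given as (start, end) pairs. The function name and the use of 0 to mean "no overlap" are assumptions made for illustration; they are not taken from the kernel source.

```python
def overlap_type(lock, request):
    """Classify how `request` overlaps `lock`; both are inclusive (start, end) pairs.

    Returns 0 for no overlap, otherwise the case number (1 to 5) used in the list above.
    """
    l_start, l_end = lock
    r_start, r_end = request
    if r_end < l_start or r_start > l_end:
        return 0                        # the ranges do not touch
    if r_start == l_start and r_end == l_end:
        return 1                        # request exactly overlaps the lock
    if r_start >= l_start and r_end <= l_end:
        return 2                        # request is a subset of the lock
    if r_start <= l_start and r_end >= l_end:
        return 3                        # request is a superset of the lock
    if r_start > l_start:
        return 4                        # request starts inside the lock, runs past its end
    return 5                            # request starts before the lock, ends inside it


# A request spanning several locks decomposes into the pattern the text
# describes: at most one type 4, any number of type 3, at most one type 5.
locks = [(1, 3), (5, 10), (12, 19)]
print([overlap_type(lock, (3, 13)) for lock in locks])   # [4, 3, 5]
```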
To understand how the overlap is handled, we can consider the example
shown in Fig. 7.14. This figure shows a file that has all its active range locks held
by process 1, plus a pending lock for process 2.
Now consider a request by process 1 for an exclusive lock on the range 3 to
13. This request does not conflict with any active locks (because all the active
locks are already held by process 1). The request does overlap all three active
locks, so the three active locks represent a type 4, type 3, and type 5 overlap
respectively. The result of processing the lock request is shown in Fig. 7.15. The
first and third active locks are trimmed back to the edge of the new request, and
the second lock is replaced entirely. The request that had been held pending on
the first lock is awakened. It is no longer blocked by the first lock, but is blocked
by the newly installed lock. So, it now hangs off the blocked list for the second
lock. The first and second locks could have been merged, because they are of the
same type and are held by the same process. However, the current implementation
makes no effort to do such merges, because range locks are normally released over
the same range that they were created. If the merger were done, it would probably
have to be split again when the release was requested.
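The trimming that turns Fig. 7.14 into Fig. 7.15 can be modelled in a few lines of Python. The sketch below tracks only the active locks of a single process as hypothetical (type, start, end) tuples with inclusive ranges; it ignores pending lists and wakeups entirely, so it illustrates the range arithmetic rather than the kernel's implementation.

```python
def add_lock(held, new):
    """Install `new` over the locks in `held`, all owned by the same process.

    Each lock is a (type, start, end) tuple with an inclusive range. Any part of
    an existing lock covered by `new` is given up; the uncovered parts survive.
    As in the text, adjacent locks of the same type are not merged.
    """
    _, start, end = new
    result = []
    for kind, s, e in held:
        if e < start or s > end:
            result.append((kind, s, e))             # untouched by the request
            continue
        if s < start:
            result.append((kind, s, start - 1))     # piece left before the request
        if e > end:
            result.append((kind, end + 1, e))       # piece left after the request
    result.append(new)
    return sorted(result, key=lambda lock: lock[1])


before = [("EX", 1, 3), ("SH", 5, 10), ("SH", 12, 19)]      # active locks of Fig. 7.14
after = add_lock(before, ("EX", 3, 13))
print(after)    # [('EX', 1, 2), ('EX', 3, 13), ('SH', 14, 19)], as in Fig. 7.15
```

A request that is a strict subset of one lock leaves two surviving pieces around the new lock, which is the three-piece split of case 2.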
Figure 7.14
Locks before addition of exclusive-lock request by process 1 on range 3..13.
i_lockf -> active:  type = EX, ID = 1, range = 1..3
             pending (lf_next -> the lock above):  type = EX, ID = 2, range = 3..12
           active:  type = SH, ID = 1, range = 5..10
           active:  type = SH, ID = 1, range = 12..19
Figure 7.15
Locks after addition of exclusive-lock request by process 1 on range 3..13.
i_lockf -> active:  type = EX, ID = 1, range = 1..2
           active:  type = EX, ID = 1, range = 3..13
             pending on the lock above:  type = EX, ID = 2, range = 3..12
           active:  type = SH, ID = 1, range = 14..19
Lock-removal requests are simpler than addition requests; they need only to
consider existing locks held by the requesting process. Figure 7.16 shows the
five possible ways that a removal request can overlap the locks of the requesting
process:
1. The unlock request exactly overlaps an existing lock. The existing lock is
deleted, and any lock requests that were pending on that lock are awakened.
2. The unlock request is a subset of an existing lock. The existing lock is broken
into two pieces (one if the unlock request begins at the beginning or ends at
the end of the existing lock). Any locks that were pending on that lock are
awakened, so that they can be reassigned to the correct new piece, blocked on
a lock held by some other process, or granted.
3. The unlock request is a superset of an existing lock. The existing lock is
deleted, and any locks that were pending on that lock are awakened.
4. The unlock request extends past the end of an existing lock. The end of the
existing lock is shortened. Any locks that were pending on that lock are
awakened, so that they can be reassigned to the shorter lock, blocked on a lock held
by some other process, or granted.
5. The unlock request extends into the beginning of an existing lock. The
beginning of the existing lock is shortened. Any locks that were pending on that
lock are awakened, so that they can be reassigned to the shorter lock, blocked
on a lock held by some other process, or granted.
Figure 7.16
Five types of overlap considered by the kernel when a range lock is deleted.
In addition to the five basic types of overlap outlined, an unlock request may span
several existing locks. Specifically, a new request may be composed of zero or
one of type 4, zero or more of type 3, and zero or one of type 5.
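The removal cases reduce to the same range arithmetic without a new lock being installed. The following Python sketch is a simplified model under stated assumptions: locks are hypothetical (type, start, end) tuples with inclusive ranges, all held by the requesting process, and the second return value merely lists the locks whose pending requests the kernel would awaken.

```python
def remove_lock(held, start, end):
    """Unlock the inclusive range start..end from the locks in `held`.

    Returns (remaining, disturbed): the locks that survive, and the original
    locks that were deleted, split, or shortened (their waiters get awakened).
    """
    remaining, disturbed = [], []
    for lock in held:
        kind, s, e = lock
        if e < start or s > end:
            remaining.append(lock)                      # not touched by the unlock
            continue
        disturbed.append(lock)
        if s < start:
            remaining.append((kind, s, start - 1))      # front piece survives
        if e > end:
            remaining.append((kind, end + 1, e))        # back piece survives
    return remaining, disturbed


# An unlock of 2..15 against the locks of Fig. 7.15 is one type 4, one type 3,
# and one type 5 overlap, in that order.
held = [("EX", 1, 2), ("EX", 3, 13), ("SH", 14, 19)]
remaining, disturbed = remove_lock(held, 2, 15)
print(remaining)    # [('EX', 1, 1), ('SH', 16, 19)]
```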
7.6
Other Filesystem Semantics
Two major new filesystem features were introduced in 4.4BSD. The first of these
features was support for much larger file sizes. The second was the introduction
of file metadata.
Large File Sizes
Traditionally, UNIX systems supported a maximum file and filesystem size of 2^31
bytes. When the filesystem was rewritten in 4.2BSD, the inodes were defined to
allow 64-bit file sizes. However, the interface to the filesystem was still limited to
31-bit sizes. With the advent of ever-larger disks, the developers decided to
expand the 4.4BSD interface to allow larger files. Export of 64-bit file sizes from
the filesystem requires that the defined type off_t be a 64-bit integer (referred to as
long long or quad in most compilers).
The number of affected system calls is surprisingly low:
• lseek has to be able to specify 64-bit offsets
• stat, fstat, and lstat have to return 64-bit sizes
• truncate and ftruncate have to set 64-bit sizes
• mmap needs to start a mapping at any 64-bit point in the file
• getrlimit and setrlimit need to get and set 64-bit filesize limits
Changing these interfaces did cause applications to break. No trouble was
encountered with the stat family of system calls returning larger data values;
recompiling with the redefined stat structure caused applications to use the new
larger values. The other system calls are all changing one of their parameters to
be a 64-bit value. Applications that fail to cast the 64-bit argument to off_t may
get an incorrect parameter list. Except for lseek, most applications do not use
these system calls, so they are not affected by their change. However, many
applications use lseek and cast the seek value explicitly to type long. So that there is no
need to make changes to many applications, a prototype for lseek is placed in the
commonly included header file <sys/types.h>. After this change was made, most
applications recompiled and ran without difficulty.
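To see why a seek value forced through a 32-bit signed type cannot name offsets of 2^31 bytes or more, the sketch below uses Python's ctypes fixed-width integers to mimic the truncation. The helper name is an assumption made for the example, and the 32-bit width stands in for type long on a 32-bit machine.

```python
import ctypes


def as_32bit_long(offset):
    """Return `offset` as it would read after being stored in a signed 32-bit integer."""
    return ctypes.c_int32(offset).value


def as_64bit_off_t(offset):
    """Return `offset` as it would read after being stored in a signed 64-bit integer."""
    return ctypes.c_int64(offset).value


largest_31_bit = 2**31 - 1
print(as_32bit_long(largest_31_bit) == largest_31_bit)  # True: still representable
print(as_32bit_long(2**31))                             # wraps to a negative value
print(as_64bit_off_t(2**32 + 5) == 2**32 + 5)           # True: a 64-bit off_t holds it
```

The wrap-around at 2^31 is the reason the old interface is described as limited to 31-bit sizes even though the argument occupied 32 bits.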
For completeness, the type of size_t also should have been changed to be a
64-bit integer. This change was not made because it would have affected too
many system calls. Also, on 32-bit address-space machines, an application cannot
read more than can be stored in a 32-bit integer. Finally, it is important to
minimize the use of 64-bit arithmetic that is slow on 32-bit processors.
File Flags
4.4BSD added two new system calls, chflags and fchflags, that set a 32-bit flags
word in the inode. The flags are included in the stat structure so that they can be
inspected.
The owner of the file or the superuser can set the low 16 bits. Currently, there
are flags defined to mark a file as append-only, immutable, and not needing to be
dumped. An immutable file may not be changed, moved, or deleted. An
append-only file is immutable except that data may be appended to it. The user
append-only and immutable flags may be changed by the owner of the file or the superuser.
Only the superuser can set the high 16 bits. Currently, there are flags defined
to mark a file as append-only and immutable. Once set, the append-only and
immutable flags in the top 16 bits cannot be cleared when the system is secure.
The kernel runs with four different levels of security. Any superuser process
can raise the security level, but only the init process can lower that level (the init
program is described in Section 14.6). Security levels are defined as follows:
-1. Permanently insecure mode: Always run system in level 0 mode (must be
compiled into the kernel).
0. Insecure mode: Immutable and append-only flags may be turned off. All
devices can be read or written, subject to their permissions.
1. Secure mode: The superuser-settable immutable and append-only flags cannot
be cleared; disks for mounted filesystems and kernel memory (/dev/mem and
/dev/kmem) are read-only.
2. Highly secure mode: This mode is the same as secure mode, except that disks
are always read-only whether mounted or not. This level precludes even a
superuser process from tampering with filesystems by unmounting them, but
also inhibits formatting of new filesystems.
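The interaction between the two halves of the flags word and the security level can be summarized in a small decision function. This is a sketch of the rules as stated in the text, not a transcription of the kernel's check: the function name and its parameters are assumptions, and the flag constants are written out with the values conventionally used in BSD's <sys/stat.h>.

```python
# Flag values as conventionally defined in BSD <sys/stat.h>.
UF_NODUMP = 0x00000001       # do not dump file
UF_IMMUTABLE = 0x00000002    # user immutable flag
UF_APPEND = 0x00000004       # user append-only flag
SF_IMMUTABLE = 0x00020000    # superuser immutable flag
SF_APPEND = 0x00040000       # superuser append-only flag

USER_MASK = 0x0000FFFF       # low 16 bits: owner or superuser
SUPER_MASK = 0xFFFF0000      # high 16 bits: superuser only


def may_change_flags(old, new, is_owner, is_superuser, securelevel):
    """Decide whether the flags word may change from `old` to `new`."""
    changed = old ^ new
    if changed & SUPER_MASK:
        if not is_superuser:
            return False                         # only the superuser touches the top half
        cleared = old & ~new & SUPER_MASK
        if cleared & (SF_IMMUTABLE | SF_APPEND) and securelevel > 0:
            return False                         # cannot be cleared while the system is secure
    if changed & USER_MASK and not (is_owner or is_superuser):
        return False                             # low half needs the owner or the superuser
    return True


# In secure mode (level 1) the superuser can still set SF_IMMUTABLE,
# but can no longer clear it.
print(may_change_flags(0, SF_IMMUTABLE, False, True, 1))   # True
print(may_change_flags(SF_IMMUTABLE, 0, False, True, 1))   # False
```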
Normally, the system runs with level 0 security while in single-user mode, and
with level 1 security while in multiuser mode. If level 2 security is desired while
the system is running in multiuser mode, it should be set in the /etc/rc startup
script (the /etc/rc script is described in Section 14.6).
Files marked immutable by the superuser cannot be changed, except by
someone with physical access to either the machine or the system console. Files
marked immutable include those that are frequently the subject of attack by
intruders (e.g., login and su). The append-only flag is typically used for critical
system logs. If an intruder breaks in, he will be unable to cover his tracks.
Although simple in concept, these two features improve the security of a system.
Exercises
7.1
What are the seven classes of operations handled by the hierarchical
filesystem?
7.2
What is the purpose of the inode data structure?
7.3
How does the system select an inode for replacement when a new inode
must be brought in from disk?
7.4
Why are directory entries not allowed to span chunks?
7.5
Describe the steps involved in looking up a pathname component.
7.6
Why are hard links not permitted to span filesystems?
7.7
Describe how the interpretation of a symbolic link containing an absolute
pathname is different from that of a symbolic link containing a relative
pathname.
7.8
Explain why unprivileged users are not permitted to make hard links to
directories, but are permitted to make symbolic links to directories.
7.9
How can hard links be used to gain access to files that could not be
accessed if a symbolic link were used instead?
7.10
How does the system recognize loops caused by symbolic links? Suggest
an alternative scheme for doing loop detection.
7.11
How do quotas differ from the file-size resource limits described in
Section 3.8?
7.12
How does the kernel determine whether a file has an associated quota?
7.13
Draw a picture showing the effect of processing an exclusive-lock request
by process 1 on bytes 7 to 10 to the lock list shown in Fig. 7.14. Which of
the overlap cases of Fig. 7.13 apply to this example?
*7.14
Give an example where the file-locking implementation is unable to detect
a potential deadlock.
**7.15
Design a system that allows the security level of the system to be lowered
while the system is still running in multiuser mode.
Local Filestores
This chapter describes the organization and management of data on storage media.
4.4BSD provides three different filestore managers: the traditional Berkeley Fast
Filesystem (FFS), the recently added Log-Structured Filesystem (LFS), and the
Memory-based Filesystem (MFS) that uses much of the FFS code base. The FFS
filestore was designed on the assumption that buffer caches would be small and
thus that files would need to be read often. It tries to place files likely to be
accessed together in the same general location on the disk. It is described in
Section 8.2. The LFS filestore was designed for fast machines with large buffer
caches. It assumes that writing data to disk is the bottleneck, and it tries to avoid
seeking by writing all data together in the order in which they were created. It
assumes that active files will remain in the buffer cache, so is little concerned with
the time that it takes to retrieve files from the filestore. It is described in Section
8.3. The MFS filestore was designed as a fast-access repository for transient data.
It is used primarily to back the /tmp filesystem. It is described in Section 8.4.
Overview of the Filestore
The vnode operations defined for doing the datastore filesystem operations are
shown in Table 8.1. These operators are fewer and semantically
simpler than are those used for managing the name space.
There are two operators for allocating and freeing objects. The valloc opera­
tor creates a new object. The identity of the object is a number returned by the
operator. The mapping of this number to a name is the responsibility of the name­
space code. An object is freed by the vfree operator. The object to be freed is
identified by only its number.
Table 8.1
Datastore filesystem operations.
Operation done                  Operator names
object creation and deletion    valloc, vfree
attribute update                update
object read and write           vget, blkatoff, read, write, fsync
change in space allocation      truncate
The attributes of an object are changed by the update operator. This layer
does no interpretation of these attributes; they are simply fixed-size auxiliary data
stored outside the main data area of the object. They are typically file attributes,
such as the owner, group, permissions, and so on.
There are five operators for manipulating existing objects. The vget operator
retrieves an existing object from the filestore. The object is identified by its
number and must have been created previously by valloc. The read operator copies
data from an object to a location described by a uio structure. The blkatoff
operator is similar to the read operator, except that the blkatoff operator simply returns a
pointer to a kernel memory buffer with the requested data, instead of copying the
data. This operator is designed to increase the efficiency of operations where the
name-space code interprets the contents of an object (i.e., directories), instead of
just returning the contents to a user process. The write operator copies data to an
object from a location described by a uio structure. The fsync operator requests
that all data associated with the object be moved to stable storage (usually by their
all being written to disk). There is no need for an analog of blkatoff for writing, as
the kernel can simply modify a buffer that it received from blkatoff, mark that
buffer as dirty, and then do an fsync operation to have the buffer written back.
The final datastore operation is truncate. This operation changes the amount
of space associated with an object. Historically, it could be used only to decrease
the size of an object. In 4.4BSD, it can be used both to increase and to decrease
the size of an object.
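The flat, number-based interface described above can be pictured with a toy in-memory model. The method names mirror the operators in Table 8.1, but everything else is an assumption made for illustration: the class name, the Python signatures, and the use of byte arrays in place of disk blocks and uio structures. The update, blkatoff, and fsync operators are omitted because an in-memory model has no attributes area or stable storage to show.

```python
class ToyFilestore:
    """A flat name space of numbered objects; names live in a layer above this one."""

    def __init__(self):
        self._objects = {}
        self._next_number = 1

    def valloc(self):
        """Create a new, empty object and return the number that identifies it."""
        number = self._next_number
        self._next_number += 1
        self._objects[number] = bytearray()
        return number

    def vfree(self, number):
        """Free the object identified by `number`."""
        del self._objects[number]

    def vget(self, number):
        """Retrieve an existing object; it must have been created by valloc."""
        return self._objects[number]

    def read(self, number, offset, length):
        """Copy up to `length` bytes starting at `offset` out of the object."""
        return bytes(self.vget(number)[offset:offset + length])

    def write(self, number, offset, data):
        """Copy `data` into the object at `offset`, growing the object if needed."""
        obj = self.vget(number)
        if offset > len(obj):
            obj.extend(bytes(offset - len(obj)))    # zero-fill the gap
        obj[offset:offset + len(data)] = data

    def truncate(self, number, length):
        """Set the object's size; as in 4.4BSD, this can grow or shrink it."""
        obj = self.vget(number)
        if length < len(obj):
            del obj[length:]
        else:
            obj.extend(bytes(length - len(obj)))


store = ToyFilestore()
number = store.valloc()
store.write(number, 0, b"hello")
store.truncate(number, 8)           # grow: the new space reads back as zeros
print(store.read(number, 0, 8))     # b'hello\x00\x00\x00'
```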
Each disk drive has one or more subdivisions, or partitions. Each such
partition can contain only one filestore, and a filestore never spans multiple partitions.
The filestore is responsible for the management of the space within its disk
partition. Within that space, its responsibility is the creation, storage, retrieval,
and removal of files. It operates in a flat name space. When asked to create a new
file, it allocates an inode for that file and returns the assigned number. The
naming, access control, locking, and attribute manipulation for the file are all handled
by the hierarchical filesystem-management layer above the filestore.
The filestore also handles the allocation of new blocks to files as the latter
grow. Simple filesystem implementations, such as those used by early
microcomputer systems, allocate files contiguously, one after the next, until the files reach
the end of the disk. As files are removed, holes occur. To reuse the freed space,
the system must compact the disk to move all the free space to the end. Files can
be created only one at a time; for the size of a file other than the final one on the
disk to be increased, the file must be copied to the end, then expanded.
As we saw in Section 7.2, each file in a filestore is described by an inode; the
locations of its data blocks are given by the block pointers in its inode. Although
the filestore may cluster the blocks of a file to improve I/O performance, the inode
can reference blocks scattered anywhere throughout the partition. Thus, multiple
files can be written simultaneously, and all the disk space can be used without the
need for compaction.
The filestore implementation converts from the user abstraction of a file as an
array of bytes to the structure imposed by the underlying physical medium.
Consider a typical medium of a magnetic disk with fixed-sized sectoring. Although
the user may wish to write a single byte to a file, the disk supports reading and
writing only in multiples of sectors. Here, the system must read in the sector
containing the byte to be modified, replace the affected byte, and write the sector back
to the disk. This operation, converting random access to an array of bytes to
reads and writes of disk sectors, is called block I/O.
First, the system breaks the user's request into a set of operations to be done
on logical blocks of the file. Logical blocks describe block-sized pieces of a file.
The system calculates the logical blocks by dividing the array of bytes into
filestore-sized pieces. Thus, if a filestore's block size is 8192 bytes, then logical
block 0 would contain bytes 0 to 8191, logical block 1 would contain bytes 8192
to 16,383, and so on.
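The division into logical blocks is plain integer arithmetic. The generator below is a minimal sketch of it; the function name and the shape of the tuples it yields are assumptions for illustration, and the default block size is the 8192-byte figure used in the text.

```python
def logical_blocks(offset, count, block_size=8192):
    """Split a byte range into per-block pieces.

    Yields (logical_block, offset_within_block, nbytes) for each logical block
    touched by the `count` bytes that start at byte `offset` of the file.
    """
    end = offset + count
    while offset < end:
        block = offset // block_size
        within = offset % block_size
        nbytes = min(block_size - within, end - offset)
        yield block, within, nbytes
        offset += nbytes


# Two bytes that straddle the boundary between logical blocks 0 and 1.
print(list(logical_blocks(8191, 2)))    # [(0, 8191, 1), (1, 0, 1)]
```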
The data in each logical block are stored in a physical block on the disk. A
physical block is the location on the disk to which the system maps a logical
block. A physical disk block is constructed from one or more contiguous sectors.
For a disk with 512-byte sectors, an 8192-byte filestore block would be built up
from 16 contiguous sectors. Although the contents of a logical block are
contiguous on disk, the logical blocks of the file do not need to be laid out contiguously.
The data structure used by the system to convert from logical blocks to physical
blocks was described in Section 7.2.
Figure 8.1 shows the flow of information and work required to
access the file on the disk. The abstraction shown to the user is an array of bytes.
These bytes are collectively described by a file descriptor that refers to some
location in the array. The user can request a write operation on the file by presenting
the system with a pointer to a buffer, with a request for some number of bytes to
be written. As shown in Fig. 8.1, the requested data do not need to be aligned
with the beginning or end of a logical block. Further, the size of the request is not
constrained to a single logical block. In the example shown, the user has
requested data to be written to parts of logical blocks 1 and 2. Since the disk can
transfer data only in multiples of sectors, the filestore must first arrange to read in
the data for any part of the block that is to be left unchanged. The system must
arrange an intermediate staging area for the transfer. This staging is done through
one or more system buffers, as described in Section 6.6.
Figure 8.1
The block I/O system. The figure traces a user request, write(fd, buffer, cnt),
from the logical file through the system buffers to the logical file blocks and
their numbered blocks on the disk.
In our example, the user wishes to modify data in logical blocks 1 and 2. The
operation iterates over five steps:
1. Allocate a buffer.
2. Determine the location of the corresponding physical block on the disk.
3. Request the disk controller to read the contents of the physical block into the
system buffer and wait for the transfer to complete.
4. Do a memory-to-memory copy from the beginning of the user's I/O buffer to
the appropriate portion of the system buffer.
5. Write the block to the disk and continue without waiting for the transfer to
complete.
If the user's request is incomplete, the process is repeated with the next logical
block of the file. In our example, the system fetches logical block 2 of the file and
is able to complete the user's request. Had an entire block been written, the
system could have skipped step 3 and have simply written the data to the disk without
first reading in the old contents. This incremental filling of the write request is
transparent to the user's process because that process is blocked from running
during the entire procedure. The filling is transparent to other processes; because the
inode is locked during the process, any attempted access by any other process will
be blocked until the write has completed.
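The five steps, including the shortcut for whole-block writes, can be simulated with dictionaries standing in for the disk and for the inode's block pointers. This is a sketch under stated assumptions, not the kernel's code path: the names write_file, disk, and block_map are hypothetical, every block is assumed to be allocated already, and locking and asynchronous completion are left out.

```python
BLOCK_SIZE = 8192


def write_file(disk, block_map, offset, data):
    """Write `data` at byte `offset` of a file; return the number of disk reads done.

    `block_map` maps logical block numbers to physical block numbers, and
    `disk` maps physical block numbers to BLOCK_SIZE-byte contents.
    """
    reads = 0
    pos = 0
    while pos < len(data):
        logical, within = divmod(offset + pos, BLOCK_SIZE)
        nbytes = min(BLOCK_SIZE - within, len(data) - pos)
        physical = block_map[logical]                    # step 2: find the physical block
        if nbytes == BLOCK_SIZE:
            buf = bytearray(BLOCK_SIZE)                  # step 1 only: whole block, no read
        else:
            buf = bytearray(disk[physical])              # steps 1 and 3: buffer filled by a read
            reads += 1
        buf[within:within + nbytes] = data[pos:pos + nbytes]   # step 4: copy the user data
        disk[physical] = bytes(buf)                      # step 5: write the block back
        pos += nbytes                                    # repeat with the next logical block
    return reads


# A write that covers the tail of logical block 1 and the head of logical block 2.
disk = {10: b"a" * BLOCK_SIZE, 20: b"b" * BLOCK_SIZE, 30: b"c" * BLOCK_SIZE}
block_map = {0: 10, 1: 20, 2: 30}
print(write_file(disk, block_map, BLOCK_SIZE + 100, b"X" * BLOCK_SIZE))   # 2
```

Both blocks are only partly overwritten in this example, so each needs a read first; a write that covered exactly one aligned block would report zero reads.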
The Berkeley Fast Filesystem
A traditional UNIX filesystem is described by its superblock, which contains the
basic parameters of the filesystem. These parameters include the number of data
blocks in the filesystem, a count of the maximum number of files, and a pointer to
the free list, which is a list of all the free blocks in the filesystem.
A 150-Mbyte traditional UNIX filesystem consists of 4 Mbyte of inodes
followed by 146 Mbyte of data. That organization segregates the inode information
from the data; thus, accessing a file normally incurs a long seek from the file's
inode to its data. Files in a single directory typically are not allocated consecutive
slots in the 4 Mbyte of inodes, causing many nonconsecutive disk blocks to be
read when many inodes in a single directory are accessed.
The allocation of data blocks to files also is suboptimal. The traditional
filesystem implementation uses a 512-byte physical block size. But the next
sequential data block often is not on the same cylinder, so seeks between 512-byte
data transfers are required frequently. This combination of small block size and
scattered placement severely limits filesystem throughput.
The first work on the UNIX filesystem at Berkeley attempted to improve both
the reliability and the throughput of the filesystem. The developers improved
reliability by staging modifications to critical filesystem information so that the
modifications could be either completed or repaired cleanly by a program after a crash
[McKusick & Kowalski, 1994]. Doubling the block size of the filesystem
improved the performance of the 4.0BSD filesystem by a factor of more than 2
when compared with the 3BSD filesystem. This doubling caused each disk
transfer to access twice as many data blocks and eliminated the need for indirect blocks
for many files. In the remainder of this section, we shall refer to the filesystem
with these changes as the old filesystem.
The performance improvement in the old filesystem gave a strong indication
that increasing the block size was a good method for improving throughput.
Although the throughput had doubled, the old filesystem was still using only about
4 percent of the maximum disk throughput. The main problem was that the order
of blocks on the free list quickly became scrambled, as files were created and
removed. Eventually, the free-list order became entirely random, causing files to
have their blocks allocated randomly over the disk. This randomness forced a
seek before every block access. Although the old filesystem provided transfer
rates of up to 175 Kbyte per second when it was first created, the scrambling of
the free list caused this rate to deteriorate to an average of 30 Kbyte per second
after a few weeks of moderate use. There was no way of restoring the
performance of an old filesystem except to recreate the system.
Organization of the Berkeley Fast Filesystem
The first version of the current BSD filesystem appeared in 4.2BSD [McKusick et
al, 1984]. In the 4.4BSD filesystem organization (as in the old filesystem
organization), each disk drive contains one or more filesystems. A 4.4BSD filesystem is
described by its superblock, located at the beginning of the filesystem's disk
partition. Because the superblock contains critical data, it is replicated to protect
against catastrophic loss. This replication is done when the filesystem is created;
since the superblock data do not change, the copies do not need to be referenced
unless a disk failure causes the default superblock to be corrupted.
So that files as large as 2^32 bytes can be created with only two levels of
indirection, the minimum size of a filesystem block is 4096 bytes. The block size can
be any power of 2 greater than or equal to 4096. The block size is recorded in the
filesystem's superblock, so it is possible for filesystems with different block sizes
to be accessed simultaneously on the same system. The block size must be
selected at the time that the filesystem is created; it cannot be changed
subsequently without the filesystem being rebuilt.
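The 4096-byte minimum follows from counting how many blocks an inode can reach through two levels of indirection. The calculation below is a sketch that rests on two assumptions not stated in this passage: that an inode holds 12 direct block pointers and that each block pointer occupies 4 bytes. The function name is likewise hypothetical.

```python
def max_file_size(block_size, direct=12, pointer_size=4, levels=2):
    """Largest file, in bytes, reachable with `levels` levels of indirect blocks.

    Assumes `direct` direct pointers in the inode and `pointer_size`-byte
    block pointers, so one indirect block holds block_size // pointer_size pointers.
    """
    pointers_per_block = block_size // pointer_size
    blocks = direct + sum(pointers_per_block ** level for level in range(1, levels + 1))
    return blocks * block_size


print(max_file_size(4096) >= 2**32)   # True: 4096-byte blocks reach 2^32 bytes
print(max_file_size(2048) >= 2**32)   # False: 2048-byte blocks fall short
```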
The BSD filesystem organization divides a disk partition into one or more
areas, each of which is called a cylinder group. Figure 8.2 shows a set of cylinder
groups, each comprising one or more consecutive cylinders on a disk. Each
cylinder group contains bookkeeping information that includes a redundant copy of the
superblock, space for inodes, a bitmap describing available blocks in the cylinder
group, and summary information describing the usage of data blocks within the
cylinder group. The bitmap of available blocks in the cylinder group replaces the
traditional filesystem's free list. For each cylinder group, a static number of
inodes is allocated at filesystem-creation time. The default policy is to allocate
one inode for each 2048 bytes of space in the cylinder group, with the expectation
that this amount will be far more than will ever be needed. The default may be
changed at the time that the filesystem is created.
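The default policy translates into simple arithmetic. The sketch below computes the static inode count for a cylinder group and the share of the group's space that those inodes occupy; the 128-byte on-disk inode size is an assumption that is not stated in this passage, and both function names are hypothetical.

```python
def inodes_for_group(group_bytes, bytes_per_inode=2048):
    """Number of inodes allocated for a cylinder group of `group_bytes` bytes."""
    return group_bytes // bytes_per_inode


def inode_space_fraction(bytes_per_inode=2048, inode_size=128):
    """Fraction of the group's space taken by inodes, assuming `inode_size`-byte inodes."""
    return inode_size / bytes_per_inode


group = 16 * 1024 * 1024                 # a hypothetical 16-Mbyte cylinder group
print(inodes_for_group(group))           # 8192
print(inode_space_fraction())            # 0.0625
```

Raising the bytes-per-inode value at creation time trades inode capacity for data space, which is the adjustment that the text says can be made when the filesystem is created.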
Figure 8.2
Layout of cylinder groups.
The rationale for using cylinder groups is to create clusters of inodes that are
spread over the disk close to the blocks that they reference, instead of them all
being located at the beginning of the disk. The filesystem attempts to allocate file
blocks close to the inodes that describe them to avoid long seeks between getting
the inode and getting its associated data. Also, when the inodes are spread out,
there is less chance of losing all of them in a single disk failure.
All the bookkeeping information could be placed at the beginning of each
cylinder group. If this approach were used, however, all the redundant information
would be on the same platter of a disk. A single hardware failure could then
destroy all copies of the superblock. Thus, the bookkeeping information begins at
a varying offset from the beginning of the cylinder group. The offset for each
successive cylinder group is calculated to be about one track farther from the
beginning than in the preceding cylinder group. In this way, the redundant information
spirals down into the pack, so that any single track, cylinder, or platter can be lost
without all copies of the superblock also being lost. Except for the first cylinder
group, which leaves space for a boot block, the space between the beginning of the
cylinder group and the beginning of the cylinder-group information is used for
data blocks.
Optimization of Storage Utilization
Data are laid out such that large blocks can be transferred in a single disk
operation, greatly increasing filesystem throughput. A file in the new filesystem might
be composed of 8192-byte data blocks, as compared to the 1024-byte blocks of
the old filesystem; disk accesses would thus transfer up to 8 times as much
information per disk transaction. In large files, several blocks can be allocated
consecutively, so that even larger data transfers are possible before a seek is required.
The main problem with larger blocks is that most BSD filesystems contain
primarily small files. A uniformly large block size will waste space. Table 8.2
shows the effect of filesystem block size on the amount of wasted space in the
filesystem. The measurements used to compute this table were collected from a
survey of the Internet conducted in 1993 [Irlam, 1993]. The survey covered 12
million files residing on 1000 filesystems with a total size of 250 Gbyte. The
investigators found that the median file size was under 2048 bytes; the average file
size was 22 Kbyte. The space wasted is calculated to be the percentage of disk
space not containing user data. As the block size increases, the amount of space
reserved for inodes decreases, but the amount of unused data space at the end of
blocks rises quickly to an intolerable 29.4 percent waste with a minimum
allocation of 8192-byte filesystem blocks.
Table 8.2
Amount of space wasted as a function of block size.
Percentage values: 11.7, 15.4, 5.4, 12.3, 61.2, 1.6
Organization:
data only, no separation between files
data only, files start on 512-byte boundary
data + inodes, 512-byte block
data + inodes, 1024-byte block
data + inodes, 2048-byte block
data + inodes, 4096-byte block
data + inodes, 8192-byte block
data + inodes, 16384-byte block
For large blocks to be used without significant waste, small files must be
stored more efficiently. To increase space efficiency, the filesystem allows the
division of a single filesystem block into one or more fragments. The fragment size is
specified at the time that the filesystem is created; each filesystem block optionally
can be broken into two, four, or eight fragments, each of which is addressable. The
lower bound on the fragment size is constrained by the disk-sector size, which is
typically 512 bytes. The block map associated with each cylinder group records
the space available in a cylinder group in fragments; to determine whether a block
is available, the system examines aligned fragments. Figure 8.3 shows a piece of a
block map from a filesystem with 4096-byte blocks and 1024-byte fragments,
hereinafter referred to as a 4096/1024 filesystem.
On a 4096/1024 filesystem, a file is represented by zero or more 4096-byte
blocks of data, possibly plus a single fragmented block. If the system must
fragment a block to obtain space for a small amount of data, it makes the remaining
fragments of the block available for allocation to other files. As an example,
consider an 11000-byte file stored on a 4096/1024 filesystem. This file would use two
full-sized blocks and one three-fragment portion of another block. If no block
with three aligned fragments were available at the time that the file was created, a
full-sized block would be split, yielding the necessary fragments and a single
unused fragment. This remaining fragment could be allocated to another file as
needed.
Figure 8.3  Example of the layout of blocks and fragments in a 4096/1024 filesystem.
Each bit in the map records the status of a fragment; a "-" means that the fragment is in
use, whereas a "1" means that the fragment is available for allocation. In this example,
fragments 0 through 5, 10, and 11 are in use, whereas fragments 6 through 9 and 12
through 15 are free. Fragments of adjacent blocks cannot be used as a full block, even if
they are large enough. In this example, fragments 6 through 9 cannot be allocated as a full
block; only fragments 12 through 15 can be coalesced into a full block.

bits in map         ----    --11    11--    1111
fragment numbers    0-3     4-7     8-11    12-15
block numbers       0       1       2       3
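The arithmetic of the 11000-byte example, and the rule that only aligned fragments form a block, can be sketched in C. This is a minimal illustration, not kernel code: the names `FS_BSIZE`, `FS_FSIZE`, `file_layout`, and `block_is_free` are hypothetical, and the map is modeled as one byte per fragment for readability, whereas the real map uses one bit per fragment.

```c
#include <stddef.h>

/* Assumed 4096/1024 geometry; these names are illustrative only. */
enum { FS_BSIZE = 4096, FS_FSIZE = 1024, FRAGS_PER_BLOCK = FS_BSIZE / FS_FSIZE };

struct layout {
    size_t full_blocks;   /* number of full-sized blocks */
    size_t tail_frags;    /* fragments used in the final partial block */
};

/* Split a file size into full blocks plus a trailing run of fragments. */
static struct layout file_layout(size_t nbytes)
{
    struct layout l;
    size_t tail = nbytes % FS_BSIZE;

    l.full_blocks = nbytes / FS_BSIZE;
    l.tail_frags = (tail + FS_FSIZE - 1) / FS_FSIZE;   /* round up */
    if (l.tail_frags == FRAGS_PER_BLOCK) {
        /* A tail that needs every fragment is simply another full block. */
        l.full_blocks++;
        l.tail_frags = 0;
    }
    return l;
}

/*
 * Return 1 if block blkno is free, that is, if all of its aligned
 * fragments are free.  free_map[i] is nonzero when fragment i is free.
 */
static int block_is_free(const unsigned char *free_map, size_t blkno)
{
    size_t first = blkno * FRAGS_PER_BLOCK;

    for (size_t i = 0; i < FRAGS_PER_BLOCK; i++)
        if (!free_map[first + i])
            return 0;
    return 1;
}
```

With the map of Fig. 8.3, fragments 6 through 9 are all free but belong to two different blocks, so neither of those blocks tests as free; only the block holding fragments 12 through 15 does.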
Section 8.2
The Berkeley Fast Filesystem
Reading and Writing to a File
Having opened a file, a process can do reads or writes on it. The procedural path
through the kernel is shown in Fig. 8.4. If a read is requested, it is channeled
through the ffs_read() routine. Ffs_read() is responsible for converting the read
into one or more reads of logical file blocks. A logical block request is then
handed off to ufs_bmap(). Ufs_bmap() is responsible for converting a logical
block number to a physical block number by interpreting the direct and indirect
block pointers in an inode. Ffs_read() requests the block I/O system to return a
buffer filled with the contents of the disk block. If two or more logically
sequential blocks are read from a file, the process is assumed to be reading the file
sequentially. Here, ufs_bmap() returns two values: first, the disk address of the
requested block; then, the number of contiguous blocks that follow that block on
disk. The requested block and the number of contiguous blocks that follow it are
passed to the cluster() routine. If the file is being accessed sequentially, the
cluster() routine will do a single large I/O on the entire range of sequential blocks.
If the file is not being accessed sequentially (as determined by a seek to a different
part of the file preceding the read), only the requested block or a subset of the
cluster will be read. If the file has had a long series of sequential reads, or if the
number of contiguous blocks is small, the system will issue one or more requests
for read-ahead blocks in anticipation that the process will soon want those blocks.
The details of block clustering are described at the end of this section.
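The first step described above, converting a byte range into reads of logical file blocks, can be sketched as follows. The structure and function names (`struct blkreq`, `split_request`) are hypothetical and the block size is passed in as a parameter; the sketch shows only the offset arithmetic, not the buffer handling that ffs_read() also does.

```c
#include <stddef.h>

struct blkreq {
    long   lbn;   /* logical block number within the file */
    size_t off;   /* starting offset within that block */
    size_t len;   /* number of bytes to transfer from that block */
};

/*
 * Break the byte range [offset, offset + nbytes) into per-block requests.
 * Writes at most max entries to out and returns the number written.
 */
static size_t split_request(long offset, size_t nbytes, size_t bsize,
                            struct blkreq *out, size_t max)
{
    size_t n = 0;

    while (nbytes > 0 && n < max) {
        size_t off = (size_t)(offset % (long)bsize);
        size_t len = bsize - off;        /* room left in this block */

        if (len > nbytes)
            len = nbytes;
        out[n].lbn = offset / (long)bsize;
        out[n].off = off;
        out[n].len = len;
        offset += (long)len;
        nbytes -= len;
        n++;
    }
    return n;
}
```

Each resulting logical block number is what would then be handed to a mapping routine such as ufs_bmap() to obtain a physical block number.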
Figure 8.4  Procedural interface to reading and writing.
[Diagram not reproduced. Routines shown: read(), write(), ffs_balloc(), chkdq(),
and ufs_bmap(). Stage labels: vnode to (label truncated); offset to logical block
number; logical to filesystem block number; allocation of filesystem blocks; quota
check; identification of contiguous blocks and aggregation of single block buffers;
buffer allocation and filesystem to physical block number; physical block number to
disk <cylinder, track, offset>; disk read-write.]
Each time that a process does a write system call, the system checks to see
whether the size of the file has increased. A process may overwrite data in the
middle of an existing file, in which case space would usually have been allocated
already (unless the file contains a hole in that location). If the file needs to be
extended, the request is rounded up to the next fragment size, and only that much
space is allocated (see "Allocation Mechanisms" later in this section for the
details of space allocation). The write system call is channeled through the
ffs_write() routine. Ffs_write() is responsible for converting the write into one or
more writes of logical file blocks. A logical block request is then handed off to
ffs_balloc(). Ffs_balloc() is responsible for interpreting the direct and indirect
block pointers in an inode to find the location for the associated physical block
pointer. If a disk block does not already exist, the ffs_alloc() routine is called to
request a new block of the appropriate size. After calling chkdq() to ensure that
the user has not exceeded their quota, the block is allocated, and the address of the
new block is stored in the inode or indirect block. The address of the new or
already-existing block is returned. Ffs_write() allocates a buffer to hold the
contents of the block. The user's data are copied into the returned buffer, and the
buffer is marked as dirty. If the buffer has been filled completely, it is passed to
the cluster() routine. When a maximally sized cluster has been accumulated, a
noncontiguous block is allocated, or a seek is done to another part of the file,
the accumulated blocks are grouped together into a single I/O operation that is
queued to be written to the disk. If the buffer has not been filled completely, it is
not considered immediately for writing. Rather, the buffer is held in the
expectation that the process will soon want to add more data to it. It is not released until
it is needed for some other block; that is, until it has reached the head of the free
list, or until a user process does a sync system call. There is normally a user
process called update that does a sync every 30 seconds.
Repeated small write requests may expand the file one fragment at a time.
The problem with expanding a file one fragment at a time is that data may be
copied many times as a fragmented block expands to a full block. Fragment
reallocation can be minimized if the user process writes a full block at a time, except
for a partial block at the end of the file. Since filesystems with different block
sizes may reside on the same system, the filesystem interface provides application
programs with the optimal size for a read or write. This facility is used by the
standard I/O library that many application programs use, and by certain system
utilities, such as archivers and loaders, that do their own I/O management.
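On POSIX systems, a preferred I/O block size is reported in the `st_blksize` field of `struct stat`. The sketch below assumes that this field is the facility meant here; the helper names `preferred_io_size` and `round_to_io_size` are hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Return the preferred I/O size for the file at path, or -1 on error. */
static long preferred_io_size(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return -1;
    return (long)sb.st_blksize;
}

/* Round a desired buffer size up to a multiple of the preferred I/O size. */
static size_t round_to_io_size(size_t want, size_t iosize)
{
    return ((want + iosize - 1) / iosize) * iosize;
}
```

A program that sizes its buffers this way writes whole blocks except at the end of the file, which is the pattern that minimizes fragment reallocation.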
If the layout policies (described at the end of this section) are to be effective, a
filesystem cannot be kept completely full. A parameter, termed the free-space
reserve, gives the minimum percentage of filesystem blocks that should be kept
free. If the number of free blocks drops below this level, only the superuser is
allowed to allocate blocks. This parameter can be changed any time that the
filesystem is unmounted. When the number of free blocks approaches zero, the
filesystem throughput tends to be cut in half because the filesystem is unable to
localize blocks in a file. If a filesystem's throughput drops because of overfilling,
it can be restored by removal of files until the amount of free space once again
reaches the minimum acceptable level. Users can restore locality to get faster
access rates for files created during periods of little free space by copying the file
to a new one and removing the original one when enough space is available.
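The free-space reserve can be pictured as a simple admission check. The following sketch uses a hypothetical `struct fs_space` and hypothetical function names; it assumes that an ordinary user may allocate only as long as the reserve stays intact, whereas the superuser may use the reserve as well.

```c
/* Hypothetical summary of a filesystem's space accounting. */
struct fs_space {
    long total_blocks;   /* data blocks in the filesystem */
    long free_blocks;    /* blocks currently free */
    int  minfree;        /* free-space reserve, in percent */
};

/* Blocks still available to an ordinary user (zero or negative if none). */
static long user_available(const struct fs_space *fs)
{
    long reserve = fs->total_blocks * fs->minfree / 100;

    return fs->free_blocks - reserve;
}

/* Return 1 if the caller may allocate want more blocks, else 0. */
static int may_allocate(const struct fs_space *fs, long want, int is_superuser)
{
    if (is_superuser)
        return fs->free_blocks >= want;      /* may dip into the reserve */
    return user_available(fs) >= want;       /* must leave the reserve intact */
}
```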
Filesystem Parameterization
Except for the initial creation of the free list, the old filesystem ignores the
parameters of the underlying hardware. It has no information about either the physical
characteristics of the mass-storage device or the hardware that interacts with the
filesystem. A goal of the new filesystem is to parameterize the processor
capabilities and mass-storage characteristics so that blocks can be allocated in an optimum
configuration-dependent way. Important parameters include the speed of the
processor, the hardware support for mass-storage transfers, and the characteristics of
the mass-storage devices. These parameters are summarized in Table 8.3. Disk
technology is constantly improving, and a given installation can have several
different disk technologies running on a single processor. Each filesystem is
parameterized so that it can be adapted to the characteristics of the disk on which it is
placed.
For mass-storage devices such as disks, the new filesystem tries to allocate a
file's new blocks on the same cylinder and rotationally well positioned. The
distance between rotationally optimal blocks varies greatly; optimal blocks can be
consecutive or rotationally delayed, depending on system characteristics. For
disks attached to a dedicated I/O processor or accessed by a track-caching
controller, two consecutive disk blocks often can be accessed without time lost
because of an intervening disk revolution. Otherwise, the main processor must
field an interrupt and prepare for a new disk transfer. The expected time to service
this interrupt and to schedule a new disk transfer depends on the speed of the main
processor.
The physical characteristics of each disk include the number of blocks per
track and the rate at which the disk spins. The allocation routines use this
information to calculate the number of milliseconds required to skip over a block. The
characteristics of the processor include the expected time to service an interrupt
and to schedule a new disk transfer. Given a block allocated to a file, the
allocation routines calculate the number of blocks to skip over such that the next block
in the file will come into position under the disk head in the expected amount of
time that it takes to start a new disk-transfer operation. For sequential access to
large numbers of data, this strategy minimizes the amount of time spent waiting
for the disk to position itself.

Table 8.3  Important parameters maintained by the filesystem.

maximum blocks per file in a cylinder group
maximum contiguous blocks before a rotdelay gap
minimum percentage of free space
sectors per track
rotational delay between contiguous blocks
revolutions per second
tracks per cylinder
track skew in sectors
The parameter that defines the minimum number of milliseconds between the
completion of a data transfer and the initiation of another data transfer on the same
cylinder can be changed at any time. If a filesystem is parameterized to lay out
blocks with a rotational separation of 2 milliseconds, and the disk is then moved to
a system that has a processor requiring 4 milliseconds to schedule a disk opera­
tion, the throughput will drop precipitously because of lost disk revolutions on
nearly every block. If the target machine is known, the filesystem can be parame­
terized for that machine, even though it is initially created on a different processor.
Even if the move is not known in advance, the rotational-layout delay can be
reconfigured after the disk is moved, so that all further allocation is done based on
the characteristics of the new machine.
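The rotational calculation, and the cost of a mismatched delay, can be worked through in a short sketch. The function names are hypothetical, times are kept in integer microseconds, and the figures used (60 revolutions per second, 8 blocks per track) are assumptions chosen only for illustration.

```c
/* Time for one block to pass under the head, in microseconds. */
static long block_time_us(int rps, int blocks_per_track)
{
    return 1000000L / rps / blocks_per_track;
}

/* Number of blocks to skip so that at least delay_us elapses. */
static long blocks_to_skip(long delay_us, int rps, int blocks_per_track)
{
    long bt = block_time_us(rps, blocks_per_track);

    return (delay_us + bt - 1) / bt;     /* round up to whole blocks */
}

/*
 * Actual wait before the next block can be transferred, given the gap
 * laid out on disk (gap_us) and the time the processor needs (need_us).
 * If the gap is too short, the block is missed and the disk must come
 * around again.
 */
static long actual_wait_us(long gap_us, long need_us, int rps)
{
    long rev = 1000000L / rps;
    long wait = gap_us;

    while (wait < need_us)
        wait += rev;                     /* one lost revolution */
    return wait;
}
```

Under these assumptions, a layout built for a 2-millisecond delay leaves a gap of one block; a processor that needs 4 milliseconds misses that block every time and waits roughly a full extra revolution, which is the precipitous drop described above.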
Layout Policies
The filesystem layout policies are divided into two distinct parts. At the top level
are global policies that use summary information to make decisions regarding the
placement of new inodes and data blocks. These routines are responsible for
deciding the placement of new directories and files. They also calculate rotation­
ally optimal block layouts and decide when to force a long seek to a new cylinder
group because there is insufficient space left in the current cylinder group to do
reasonable layouts. Below the global-policy routines are the local-allocation
routines. These routines use a locally optimal scheme to lay out data blocks. The
original intention was to bring out these decisions to user level so that they could
be ignored or replaced by user processes. Thus, they are definitely policies, rather
than simple mechanisms.
Two methods for improving filesystem performance are to increase the
locality of reference to minimize seek latency [Trivedi, 1980], and to improve the
layout of data to make larger transfers possible [Nevalainen & Vesterinen, 1977].
The global layout policies try to improve performance by clustering related
information. They cannot attempt to localize all data references, but must instead try to
spread unrelated data among different cylinder groups. If too much localization is
attempted, the local cylinder group may run out of space, forcing further related
data to be scattered to nonlocal cylinder groups. Taken to an extreme, total
localization can result in a single huge cluster of data resembling the old filesystem.
The global policies try to balance the two conflicting goals of localizing data that
are concurrently accessed while spreading out unrelated data.
One allocatable resource is inodes. Inodes of files in the same directory
frequently are accessed together. For example, the list-directory command, ls, may
access the inode for each file in a directory. The inode layout policy tries to place
all the inodes of files in a directory in the same cylinder group. To ensure that
files are distributed throughout the filesystem, the system uses a different policy to
allocate directory inodes. New directories are placed in cylinder groups with a
greater-than-average number of free inodes and with the smallest number of direc­
tories. The intent of this policy is to allow inode clustering to succeed most of the
time. The filesystem allocates inodes within a cylinder group using a next-free
strategy. Although this method allocates the inodes randomly within a cylinder
group, all the inodes for a particular cylinder group can be accessed with 10 to 20
disk transfers. This allocation strategy puts a small and constant upper bound on
the number of disk transfers required to access the inodes for all the files in a
directory. In contrast, the old filesystem typically requires one disk transfer to
fetch the inode for each file in a directory.
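The directory-placement rule can be sketched as a scan over per-group summary counts. The structure and function names below are hypothetical, and the sketch treats "greater than average" as "at least the integer average" so that some group always qualifies; that choice is an assumption of this example.

```c
/* Hypothetical per-cylinder-group summary counts. */
struct cg_summary {
    long free_inodes;   /* free inodes in this cylinder group */
    long ndirs;         /* directories already placed here */
};

/*
 * Choose a cylinder group for a new directory: among the groups whose
 * free-inode count is at least the average, pick the one holding the
 * fewest directories.  Returns -1 if ncg is not positive.
 */
static int pick_dir_cg(const struct cg_summary *cg, int ncg)
{
    long total = 0, avg;
    int best = -1;

    if (ncg <= 0)
        return -1;
    for (int i = 0; i < ncg; i++)
        total += cg[i].free_inodes;
    avg = total / ncg;

    for (int i = 0; i < ncg; i++) {
        if (cg[i].free_inodes < avg)
            continue;                        /* too few free inodes */
        if (best < 0 || cg[i].ndirs < cg[best].ndirs)
            best = i;                        /* fewest directories so far */
    }
    return best;
}
```

Inodes for ordinary files would then be requested in the cylinder group of their parent directory, so that the clustering policy has room to succeed.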
The other major resource is data blocks. Data blocks for a file typically are
accessed together. The policy routines try to place data blocks for a file in the
same cylinder group, preferably at rotationally optimal positions in the same
cylinder. The problem with allocating all the data blocks in the same cylinder
group is that large files quickly use up the available space, forcing a spillover to
other areas. Further, using all the space causes future allocations for any file in the
cylinder group also to spill to other areas. Ideally, none of the cylinder groups
should ever become completely full. The heuristic chosen is to redirect block
allocation to a different cylinder group after every few Mbyte of allocation. The
spillover points are intended to force block allocation to be redirected when any
file has used about 25 percent of the data blocks in a cylinder group. In day-to­
day use, the heuristics appear to work well in minimizing the number of com­
pletely filled cylinder groups. Although this heuristic appears to benefit small files
at the expense of the larger files, it really aids both file sizes. The small files are
helped because there are nearly always blocks available in the cylinder group for
them to use. The large files benefit because they are able to use rotationally well
laid out space and then to move on, leaving behind the blocks scattered around the
cylinder group. Although these scattered blocks are fine for small files that need
only a block or two, they slow down big files that are best stored on a single large
group of blocks that can be read in a few disk revolutions.
The newly chosen cylinder group for block allocation is the next cylinder
group that has a greater-than-average number of free blocks left. Although big files
tend to be spread out over the disk, several Mbyte of data typically are accessible
before a seek to a new cylinder group is necessary. Thus, the time to do one long
seek is small compared to the time spent in the new cylinder group doing the I/O.
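The spillover heuristic has two parts: deciding when a file has used enough of a cylinder group, and choosing where to go next. A minimal sketch follows, with hypothetical function names; the 25 percent threshold comes from the text, and "greater than average" is again approximated as "at least the integer average".

```c
/* Return 1 once a file has used about a quarter of a group's data blocks. */
static int should_spill(long file_blocks_in_cg, long cg_data_blocks)
{
    return file_blocks_in_cg >= cg_data_blocks / 4;
}

/*
 * Starting after cylinder group cur, return the next group whose
 * free-block count is at least the average; -1 if no other group
 * qualifies or ncg is not positive.
 */
static int next_cg_for_spill(const long *free_blocks, int ncg, int cur)
{
    long total = 0, avg;

    if (ncg <= 0)
        return -1;
    for (int i = 0; i < ncg; i++)
        total += free_blocks[i];
    avg = total / ncg;

    for (int step = 1; step < ncg; step++) {
        int cg = (cur + step) % ncg;         /* wrap around the disk */

        if (free_blocks[cg] >= avg)
            return cg;
    }
    return -1;
}
```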
Allocation Mechanisms
The global-policy routines call local-allocation routines with requests for specific
blocks. The local-allocation routines will always allocate the requested block if it
is free; otherwise, they will allocate a free block of the requested size that is rota­
tionally closest to the requested block. If the global layout policies had complete
information, they could always request unused blocks, and the allocation routines
would be reduced to simple bookkeeping. However, maintaining complete
information is costly; thus, the global layout policy uses heuristics based on the
partial information that is available.
If a requested block is not available, the local allocator uses a four-level allo­
cation strategy:
1. Use the next available block rotationally closest to the requested block on the
same cylinder. It is assumed that head-switching time is zero. On disk
controllers where this assumption is not valid, the time required to switch between
disk platters is incorporated into the rotational layout tables when they are
constructed.
2. If no blocks are available on the same cylinder, choose a block within the same
cylinder group.
3. If the cylinder group is full, quadratically hash the cylinder-group number to
choose another cylinder group in which to look for a free block. Quadratic
hash is used because of its speed in finding unused slots in nearly full hash
tables [Knuth, 1975]. Filesystems that are parameterized to maintain at least
10 percent free space rarely need to use this strategy. Filesystems used
without free space typically have so few free blocks available that almost any
allocation is random; the most important characteristic of the strategy used under
such conditions is that it be fast.
4. Apply an exhaustive search to all cylinder groups. This search is necessary
because the quadratic rehash may not check all cylinder groups.
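The cylinder-group search in steps 3 and 4 can be sketched as a probe sequence. The code below is an illustration under stated assumptions, not the kernel routine: `hash_alloc`, `try_cg_fn`, and `has_free_block` are hypothetical names, the probe callback stands in for the per-group allocator of steps 1 and 2, and the doubling step is one common way to realize a quadratic rehash.

```c
/* Probe callback: returns 1 if an allocation in group cg succeeds. */
typedef int (*try_cg_fn)(int cg, void *arg);

/*
 * Look for a cylinder group that can satisfy an allocation.
 * Returns the group number, or -1 if every group fails.
 */
static int hash_alloc(int start_cg, int ncg, try_cg_fn try_cg, void *arg)
{
    int cg = start_cg;

    if (ncg <= 0)
        return -1;

    /* Steps 1 and 2: the preferred cylinder group. */
    if (try_cg(cg, arg))
        return cg;

    /* Step 3: quadratic rehash, probing at offsets 1, 3, 7, 15, ... */
    for (int i = 1; i < ncg; i *= 2) {
        cg += i;
        if (cg >= ncg)
            cg -= ncg;
        if (try_cg(cg, arg))
            return cg;
    }

    /* Step 4: exhaustive search, since the rehash may skip groups. */
    cg = (start_cg + 1) % ncg;
    for (int i = 1; i < ncg; i++) {
        if (try_cg(cg, arg))
            return cg;
        if (++cg == ncg)
            cg = 0;
    }
    return -1;
}

/* Example probe: arg points to an array of per-group free-block counts. */
static int has_free_block(int cg, void *arg)
{
    const long *free_blocks = arg;

    return free_blocks[cg] > 0;
}
```

With eight groups and a start at group 0, the rehash probes groups 1, 3, and 7; a free block that exists only in group 6 is found by the exhaustive pass, which is why that final step is needed.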
The task of managing block and fragment allocation is done by ffs_balloc().
If the file is being written and a block pointer is zero or points to a fragment that is
too small to hold the additional data, ffs_balloc() calls the allocation routines to
obtain a new block. If the file needs to be extended, one of two conditions exists:
1. The file contains no fragmented blocks (and the final block in the file contains
insufficient space to hold the new data). If space exists in a block already
allocated, the space is filled with new data. If the remainder of the new data
consists of more than a full block, a full block is allocated and the first full block
of new data is written there. This process is repeated until less than a full
block of new data remains. If the remaining new data to be written will fit in
less than a full block, a block with the necessary number of fragments is
located; otherwise, a full block is located. The remaining new data are written
into the located space. However, to avoid excessive copying for slowly grow­
ing files, the filesystem allows only direct blocks of files to refer to fragments.
2. The file contains one or more fragments (and the fragments contain
insufficient space to hold the new data). If the size of the new data plus the size of
the data already in the fragments exceeds the size of a full block, a new block
is allocated. The contents of the fragments are copied to the beginning of the
block, and the remainder of the block is filled with new data. The process then
continues as in step 1. Otherwise, a set of fragments big enough to hold the
data is located; if enough of the rest of the current block is free, the filesystem
can avoid a copy by using that block. The contents of the existing fragments,
appended with the new data, are written into the allocated space.

Figure 8.5  Procedural interface to block allocation.

ffs_balloc()
ffs_blkpref()       layout policy
ffs_realloccg()
ffs_fragextend()    extend a fragment
ffs_alloc()         allocate a new block or fragment
ffs_hashalloc()     find a cylinder group
ffs_alloccg()       allocate a fragment
ffs_alloccgblk()    allocate a block
Ffs_balloc() is also responsible for allocating blocks to hold indirect pointers. It
must also deal with the special case in which a process seeks past the end of a file
and begins writing. Because of the constraint that only the final block of a file
may be a fragment, ffs_balloc() must first ensure that any previous fragment has
been upgraded to a full-sized block.
On completing a successful allocation, the allocation routines return the block
or fragment number to be used; ffs_balloc() then updates the appropriate block
pointer in the inode. Having allocated a block, the system is ready to allocate a
buffer to hold the block's contents so that the block can be written to disk.
The procedural description of the allocation process is shown in Fig. 8.5.
Ffs_balloc() is the routine responsible for determining when a new block must be
allocated. It first calls the layout-policy routine ffs_blkpref() to select the most
desirable block based on the preference from the global-policy routines that were
described earlier in this section. If a fragment has already been allocated and
needs to be extended, ffs_balloc() calls ffs_realloccg(). If nothing has been
allocated yet, ffs_balloc() calls ffs_alloc().
Ffs_realloccg() first tries to extend the current fragment in place. Consider
the sample block of an allocation map with two fragments allocated from it, shown
in Fig. 8.6. The first fragment can be extended from a size 2 fragment to a size 3 or
a size 4 fragment, since the two adjacent fragments are unused. The second
fragment cannot be extended, as it occupies the end of the block, and fragments are
not allowed to span blocks. If ffs_realloccg() is able to expand the current
fragment in place, the map is updated appropriately and it returns. If the fragment
cannot be extended, ffs_realloccg() calls the ffs_alloc() routine to get a new fragment.
The old fragment is copied to the beginning of the new fragment, and the old
fragment is freed.

Figure 8.6  Sample block with two allocated fragments.
[Diagram not reproduced. It shows an entry in the table with two allocated
fragments, one of size 2 and one of size 3.]
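The in-place extension test described above can be sketched as a check on the fragments that follow the current one within the same block. The function name `frag_extend` is hypothetical, and the map is again modeled as one byte per fragment (nonzero meaning free) rather than one bit.

```c
/*
 * Try to grow the fragment that starts at fragment number start from
 * oldn to newn fragments without crossing a block boundary.  On success
 * the extra fragments are marked in use and 1 is returned; otherwise the
 * map is left unchanged and 0 is returned.
 */
static int frag_extend(unsigned char *free_map, int frags_per_block,
                       int start, int oldn, int newn)
{
    int block_end = (start / frags_per_block + 1) * frags_per_block;

    if (start + newn > block_end)
        return 0;                   /* fragments may not span blocks */
    for (int i = start + oldn; i < start + newn; i++)
        if (!free_map[i])
            return 0;               /* a needed neighbor is in use */
    for (int i = start + oldn; i < start + newn; i++)
        free_map[i] = 0;            /* claim the extra fragments */
    return 1;
}
```

When this check fails, the caller would fall back to allocating a new fragment elsewhere and copying the old contents, which is the expensive path that the in-place attempt tries to avoid.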
The bookkeeping tasks of allocation are handled by ffs_alloc(). It first
verifies that a block is available in the desired cylinder group by checking the
filesystem summary information. If the summary information shows that the cylinder
group is full, ffs_alloc() quadratically rehashes through the summary information
looking for a cylinder group with free space. Having found a cylinder group with
space, ffs_alloc() calls either the fragment-allocation routine or the
block-allocation routine to acquire a fragment or block.
The block-allocation routine is given a preferred block. If that block is
available, it is returned. If the block is unavailable, the allocation routine tries to find
another block on the same cylinder that is rotationally close to the requested
block. So that the task of locating rotationally optimal blocks is simplified, the
summary information for each cylinder group includes a count of the available
blocks at different rotational positions. By default, eight rotational positions are
distinguished; that is, the resolution of the summary information is 2 milliseconds
for a 3600 revolution-per-minute drive. The superblock contains an array of lists
called the rotational-layout table. The array is indexed by rotational position.
Each entry in the array lists the index into the block map for every data block con­
tained in its rotational position. When searching for a block to allocate, the sys­
tem first looks through the summary information for a rotational position with a
nonzero block count. It then uses the index of the rotational position to find the
appropriate list of rotationally optimal blocks. This list enables the system to limit
its scan of the free-block map to only those parts that contain free, rotationally
well-placed blocks.
The fragment-allocation routine is given a preferred fragment. If that frag­
ment is available, it is returned. If the requested fragment is not available, and the
filesystem is configured to optimize for space utilization, the filesystem uses a
best-fit strategy for fragment allocation. The fragment-allocation routine checks
the cylinder-group summary information, starting with the entry for the desired
size, and scanning larger sizes until an available fragment is found. If there are no
fragments of the appropriate size or larger, then a full-sized block is allocated and
is broken up.
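The best-fit scan of the fragment summary can be sketched in a few lines. The array name `frsum` and the function name `best_fit_size` are hypothetical; the sketch assumes that `frsum[n]` counts the free fragments of exactly n contiguous pieces, for n from 1 up to one less than the number of fragments per block.

```c
/*
 * Return the smallest available fragment size that is at least want,
 * or 0 if none exists and a full block must be allocated and broken up.
 */
static int best_fit_size(const int *frsum, int frags_per_block, int want)
{
    for (int size = want; size < frags_per_block; size++)
        if (frsum[size] > 0)
            return size;
    return 0;
}
```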
Figure 8.7  Map entry for an 8192/1024 filesystem.

bits in map      -111--11
decimal value    115

If an appropriate-sized fragment is listed in the fragment summary, then the
allocation routine expects to find it in the allocation map. To speed up the process
of scanning the potentially large allocation map, the filesystem uses a table-driven
algorithm. Each byte in the map is treated as an index into a fragment-descriptor
table. Each entry in the fragment-descriptor table describes the fragments that are
free for that corresponding map entry. Thus, by doing a logical AND with the bit
corresponding to the desired fragment size, the allocator can determine quickly
whether the desired fragment is contained within a given allocation-map entry. As
an example, consider the entry from an allocation map for the 8192/1024
filesystem shown in Fig. 8.7. The map entry shown has already been fragmented, with a
single fragment allocated at the beginning and a size 2 fragment allocated in the
middle. Remaining unused is another size 2 fragment, and a size 3 fragment.
Thus, if we look up entry 115 in the fragment table, we find the entry shown in
Fig. 8.8. If we were looking for a size 3 fragment, we would inspect the third bit
and find that we had been successful; if we were looking for a size 4 fragment, we
would inspect the fourth bit and find that we needed to continue. The C code that
implements this algorithm is as follows:
    for (i = 0; i < MAPSIZE; i++)
        if (fragtbl[allocmap[i]] & (1 << (size - 1)))
            break;
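To make the table lookup concrete, the sketch below computes a descriptor for a single map byte instead of reading it from a precomputed table. The function names are hypothetical, and the encoding (bit n-1 set when the byte contains a free run of exactly n fragments) is an assumption chosen to agree with the worked example for entry 115.

```c
/*
 * Compute a fragment descriptor for one map byte.  A 1 bit in the map
 * means that the fragment is free.  Bit (n - 1) of the result is set
 * when the byte contains a maximal run of exactly n free fragments.
 */
static unsigned frag_descriptor(unsigned char map_byte)
{
    unsigned desc = 0;
    int run = 0;

    for (int bit = 0; bit < 8; bit++) {
        if (map_byte & (1u << bit)) {
            run++;                        /* extend the current free run */
        } else if (run > 0) {
            desc |= 1u << (run - 1);      /* record the run that just ended */
            run = 0;
        }
    }
    if (run > 0)
        desc |= 1u << (run - 1);          /* run reaching the last bit */
    return desc;
}

/* The same test as in the loop above, applied to a single map byte. */
static int has_fragment(unsigned char map_byte, int size)
{
    return (frag_descriptor(map_byte) & (1u << (size - 1))) != 0;
}
```

For map entry 115 the descriptor has the bits for sizes 2 and 3 set, so a search for a size 3 fragment succeeds on this byte and a search for a size 4 fragment moves on to the next one. A real implementation would precompute all 256 descriptors once, which is what makes the scan fast.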
Using a best-fit policy has the benefit of minimizing disk fragmentation; how­
ever, it has the undesirable property that it maximizes the number of fragment-to­
fragment copies that must be made when a process writes a file in many small
pieces. To avoid this behavior, the system can configure filesystems to optimize
for time, rather than for space. The first time that a process does a small write on
a filesystem configured for time optimization, it is allocated a best-fit fragment.
On the second small write, however, a full-sized block is allocated, with the
unused portion being freed. Later small writes are able to extend the fragment in
place, rather than requiring additional copy operations. Under certain circum­
stances, this policy can cause the disk to become heavily fragmented. The system
tracks this condition, and automatically reverts to optimizing for space if the per­
centage of fragmentation reaches one-half of the minimum free-space limit.
Figure 8.8  Fragment-table entry for entry 115.

entry in table             0 0 0 0 0 1 1 0
available fragment size    8 7 6 5 4 3 2 1

Block Clustering
Most machines running 4.4BSD do not have separate I/O processors. The main
CPU must take an interrupt after each disk I/O operation; if there is more disk I/O
to be done, it must select the next buffer to be transferred and must start the
operation on that buffer. Before the advent of track-caching controllers, the filesystem
obtained its highest throughput by leaving a gap after each block to allow time for
the next I/O operation to be scheduled. If the blocks were laid out without a gap,
the throughput would suffer because the disk would have to rotate nearly an entire
revolution to pick up the start of the next block.
Track-caching controllers have a large buffer in the controller that continues
to accumulate the data coming in from the disk even after the requested data have
been received. If the next request is for the immediately following block, the con­
troller will already have most of the block in its buffer, so it will not have to wait a
revolution to pick up the block. Thus, for the purposes of reading, it is possible to
nearly double the throughput of the filesystem by laying out the files contiguously,
rather than leaving gaps after each block.
Unfortunately, the track cache is less useful for writing. Because the kernel
does not provide the next data block until the previous one completes, there is still
a delay during which the controller does not have the data to write, and it ends up
waiting a revolution to get back to the beginning of the next block. One solution
to this problem is to have the controller give its completion interrupt after it has
copied the data into its cache, but before it has finished writing them. This early
interrupt gives the CPU time to request the next I/O before the previous one
completes, thus providing a continuous stream of data to write to the disk.
This approach has one seriously negative side effect. When the I/O
completion interrupt is delivered, the kernel expects the data to be on stable store.
Filesystem integrity and user applications using the fsync system call depend on
these semantics. These semantics will be violated if the power fails after the I/O
completion interrupt but before the data are written to disk. Some vendors
eliminate this problem by using nonvolatile memory for the controller cache and
providing microcode restart after power fail to determine which operations need to be
completed. Because this option is expensive, few controllers provide this
functionality.
The 4.4BSD system uses I/O clustering to avoid this dilemma. Clustering was
first done by Santa Cruz Operations [Peacock, 1988] and Sun Microsystems
[McVoy & Kleiman, 1991]; the idea was later adapted to 4.4BSD [Seltzer et al.,
1993]. As a file is being written, the allocation routines try to allocate up to 64
Kbyte of data in contiguous disk blocks. Instead of the buffers holding these
blocks being written as they are filled, their output is delayed. The cluster is
completed when the limit of 64 Kbyte of data is reached, the file is closed, or the
cluster cannot grow because the next sequential block on the disk is already in use by
another file. If the cluster size is limited by a previous allocation to another file,
the filesystem is notified and is given the opportunity to find a larger set of
contiguous blocks into which the cluster may be placed. If the reallocation is
successful, the cluster continues to grow. When the cluster is complete, the buffers
making up the cluster of blocks are aggregated and passed to the disk controller as a
single I/O request. The data can then be streamed out to the disk in a single
uninterrupted transfer.
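The effect of clustering on the number of I/O requests can be sketched by grouping the physical block numbers of a file's successive logical blocks. The names below are hypothetical, and the limit of 16 blocks per cluster is an assumption that corresponds to 64 Kbyte with 4096-byte blocks.

```c
#include <stddef.h>

enum { MAX_CLUSTER_BLOCKS = 16 };   /* assumed: 64 Kbyte of 4096-byte blocks */

/*
 * Given the physical block numbers of n successive logical blocks, count
 * the I/O requests issued when physically contiguous runs are merged
 * into clusters of at most MAX_CLUSTER_BLOCKS blocks.
 */
static size_t count_cluster_ios(const long *pblk, size_t n)
{
    size_t ios = 0, run = 0;

    for (size_t i = 0; i < n; i++) {
        int contiguous = (i > 0 && pblk[i] == pblk[i - 1] + 1);

        if (!contiguous || run == MAX_CLUSTER_BLOCKS) {
            ios++;                  /* close the old cluster, start a new one */
            run = 0;
        }
        run++;
    }
    return ios;
}
```

Without clustering, each block would be a separate request and a separate interrupt; with it, a contiguously allocated file of 20 blocks needs only two requests under these assumptions.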
A similar scheme is used for reading. If ffs_read() discovers that a file is
being read sequentially, it inspects the number of contiguous blocks returned by
ufs_bmap() to look for clusters of contiguously allocated blocks. It then allocates
a set of buffers big enough to hold the contiguous set of blocks and passes them to
the disk controller as a single I/O request. The I/O can then be done in one
operation. Although read clustering is not needed when track-caching controllers are
available, it reduces the interrupt load on systems that have them, and it speeds
low-cost systems that do not have them.
For clustering to be effective, the filesystem must be able to allocate large
clusters of contiguous blocks to files. If the filesystem always tried to begin allo­
cation for a file at the beginning of a large set of contiguous blocks, it would soon
use up its contiguous space. Instead, it uses an algorithm similar to that used for
the management of fragments. Initially, file blocks are allocated via the standard
algorithm described in the previous two subsections. Reallocation is invoked
when the standard algorithm does not result in a contiguous allocation. The real­
location code searches a cluster map that summarizes the available clusters of
blocks in the cylinder group. It allocates the first free cluster that is large enough
to hold the file, then moves the file to this contiguous space. This process continues
until the current allocation has grown to a size equal to the maximum permissible
contiguous set of blocks (typically 16 blocks). At that point, the I/O is done,
and the process of allocating space begins again.
Unlike fragment reallocation, block reallocation to different clusters of blocks
does not require extra I/O or memory-to-memory copying. The data to be written
are held in delayed write buffers. Within each buffer is the disk location to which
the data are to be written. When the block cluster is relocated, it
takes little time to walk the list of buffers in the cluster and to change the disk
addresses to which they are to be written. When the I/O occurs, the final destination
has been selected and will not change.
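Why relocation is cheap can be seen in a small sketch. The dictionary-based buffer and its "daddr" field are hypothetical stand-ins for the kernel's buffer header, not the actual 4.4BSD structures.

```python
def relocate_cluster(buffers, new_start):
    """Point each delayed-write buffer of a cluster at its new disk block.

    buffers is a list of dicts with a "daddr" key, in file order. Only
    the recorded disk addresses change; no data are copied and no I/O
    is started.
    """
    for offset, buf in enumerate(buffers):
        buf["daddr"] = new_start + offset
    return buffers
```

The cost is one pass over the buffers in the cluster, which is why reallocation can be attempted repeatedly while the cluster grows.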
To speed the operation of finding clusters of blocks, the filesystem maintains
a cluster map with 1 bit per block (in addition to the map with 1 bit per fragment).
It also has summary information showing how many sets of blocks there are for
each possible cluster size. The summary information allows it to avoid looking
for cluster sizes that do not exist. The cluster map is used because it is faster to
scan than is the much larger fragment bitmap. The size of the map is important
because the map must be scanned bit by bit. Unlike fragments, clusters of blocks
are not constrained to be aligned within the map. Thus, the table-lookup optimization
done for fragments cannot be used for lookup of clusters.
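The interaction between the cluster map and its summary can be sketched as follows. This is an illustration under stated assumptions, not the filesystem's code: the map is modeled as a list of booleans (True meaning a free block), and the names summarize and find_cluster are invented for the example.

```python
def summarize(cluster_map, max_size):
    """Count free runs by length; runs longer than max_size count as max_size."""
    summary = [0] * (max_size + 1)
    run = 0
    for free in list(cluster_map) + [False]:   # sentinel closes the last run
        if free:
            run += 1
        elif run:
            summary[min(run, max_size)] += 1
            run = 0
    return summary

def find_cluster(cluster_map, summary, want):
    """Return the index of the first free run of at least `want` blocks, or None."""
    if not any(summary[want:]):    # no run that large exists: skip the scan
        return None
    run = 0
    for i, free in enumerate(cluster_map):   # bit-by-bit scan, no alignment assumed
        run = run + 1 if free else 0
        if run == want:
            return i - want + 1
    return None
```

The summary check costs a few comparisons, whereas the scan is linear in the size of the map; that difference is the reason for keeping both structures.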
The filesystem relies on the allocation of contiguous blocks to achieve high
levels of performance. The fragmentation of free space may increase with time or
with filesystem utilization. This fragmentation can degrade performance as the
filesystem ages. The effects of utilization and aging were measured on over 50
filesystems at Harvard University. The measured filesystems ranged in age since
initial creation from 1 to 3 years. The fragmentation of free space on most of the
measured filesystems caused performance to degrade no more than 10 percent
from that of a newly created empty filesystem. The most severe degradation measured
was 30 percent on a highly active filesystem that had many small files and
was used to spool USENET news [Seltzer et al, 1995].
Chapter 8
Local Filestores
Synchronous Operations
If the system crashes or stops suddenly because of a power failure, the filesystem
may be in an inconsistent state. To ensure that the on-disk state of the filesystem
can always be returned deterministically to a consistent state, the system must do
three operations synchronously:
1. Write a newly allocated inode to disk before its name is entered into a directory.
2. Remove a directory name before the inode is deallocated.
3. Write a deallocated inode to disk before its blocks are placed into the cylinder-group
free list.
These synchronous operations ensure that directory names always reference valid
inodes, and that no block is ever claimed by more than one inode. Because the
filesystem must do two synchronous operations for each file that it creates, and for
each file that it deletes, the filesystem throughput is limited to the disk-write speed
when many files are created or deleted simultaneously.
Three techniques have been used to eliminate these synchronous operations:
1. Put stable store (battery-backed-up memory) on the disk-controller board.
Filesystem operations can then proceed as soon as the block to be written is
copied into the stable store. If the system fails, unfinished disk operations can
be completed from the stable store when the system is rebooted [Moran et al,
1990].
2. Keep a log of filesystem updates on a separate disk or in stable store. Filesystem
operations can then proceed as soon as the operation to be done is written
into the log. If the system fails, unfinished filesystem operations can be completed
from the log when the system is rebooted [Chutani et al, 1992].
3. Maintain a partial ordering on filesystem update operations. Before committing
a change to disk, ensure that all operations on which it depends have been
completed. For example, an operation that would write an inode with a newly
allocated block to disk would ensure that a deallocated inode that previously
owned the block had been written to disk first. Using a technique of partial
rollback to break circular dependencies, this algorithm can eliminate 95 percent
of the synchronous writes [Ganger & Patt, 1994].
The first technique ensures that the filesystem is always consistent after a crash
and can be used as soon as the system reboots. The second technique ensures that
the filesystem is consistent as soon as a log rollback has been done. The third
technique still requires that the filesystem-check program be run to restore the
consistency of the filesystem; however, it does not require any specialized hardware
or additional disk space to do logging. All these techniques have been developed
in derivatives of the FFS, although none of them are currently part of the
4.4BSD distribution.
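The third technique amounts to writing blocks in an order consistent with their dependencies. A minimal sketch follows, assuming Python 3.9 or later for the standard graphlib module. The block names are hypothetical, and the partial rollback used to break circular dependencies is not modeled; a cycle simply raises an error here.

```python
from graphlib import TopologicalSorter

def write_order(depends_on):
    """Return an order in which blocks may be written to disk.

    depends_on maps each block to the set of blocks that must reach
    the disk before it. A cycle raises graphlib.CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())

# Hypothetical file creation: the new inode must be on disk before
# the directory block that names it.
order = write_order({"directory block": {"new inode"}, "new inode": set()})
```

Any order returned this way satisfies the first of the three synchronous rules without forcing the writer to wait on each individual write.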
Section 8 . 3
The Log-Structured Filesystem
The Log-Structured Filesystem
The factors that limited the performance of the implementation of the FFS found
in historic versions of 4BSD are the FFS's requirement for synchronous I/O during
file creation and deletion, and the seek times between I/O requests for different
files. The synchronous I/O used during file creation and deletion is necessary for
filesystem recoverability after failures. The worst-case example is that it normally
takes five separate disk I/Os (two synchronous, three asynchronous), each preceded
by a seek, to create a new file in the FFS: The file inode is written twice, the
containing directory is written once, the containing directory's inode is written
once, and, of course, the file's data are written. This synchronous behavior is
rarely an issue. Unimaginative benchmarks to the contrary, few applications create
large numbers of files, and fewer still immediately delete those files.
Seek times between I/O requests to a single file are significant only when the
file has been allocated poorly on disk. The FFS does an excellent job of laying out
files on disk, and, as long as the disk remains empty enough to permit good allocation,
it can read and write individual files at roughly 50 percent of the disk bandwidth,
skipping one disk block for every one read or written. In 4.4BSD, where
clustering has been added, or when using a disk controller that supports track
caching, the FFS can transfer at close to the full bandwidth of the disk. For these
reasons, the seek times between I/O requests for different files will often dominate
performance. (As an example, on a typical disk, an average seek takes only
slightly less time than a disk rotation, so many blocks can be written in the time
that it takes to seek to a new location on the disk.)
As the main-memory buffer cache has become larger over the past decade,
applications have tended to experience this problem only when writing to the disk.
Repeated reads of data will go to the disk only the first time, after which the data
are cached and no further I/O is required. In addition, doing read-ahead further
ameliorates this problem, as sequential reads of a file will wait for only the first data
block to transfer from disk. Later reads will find the data block already in the
cache, although a separate I/O will still have been done. In summary, the problem
to be solved in modern filesystem design is that of writing a large volume of data,
from multiple files, to the disk. If the solution to this problem eliminates any synchronous
I/O, so much the better.
The LFS, as proposed by Ousterhout and Douglis [Ousterhout & Douglis,
1989], attempted to address both the problem and the issue of synchronous I/O.
The fundamental idea of the LFS is to improve filesystem performance by storing
all filesystem data in a single, contiguous log. The LFS is optimized for writing,
and no seek is required between writes, regardless of the file to which the writes
belong. It is also optimized for reading files written in their entirety over a brief
period (as is the norm in workstation environments) because the files are placed
contiguously on disk.
The FFS provides logical locality, as it attempts to place related files (e.g.,
files from the same directory) in the same cylinder group. The LFS provides temporal
locality, as it places files created at about the same time together on disk,
relying on the buffer cache to protect the application from any adverse effects of
this decision. It is important to realize that no performance characteristics of the
disk or processor are taken into account by the LFS. The assumption that the LFS
makes is that reads are cached, and that writes are always contiguous. Therefore,
a simpler model of disk activity suffices.
Organization of the Log-Structured Filesystem
The LFS is described by a superblock similar to the one used by the FFS. In addition,
to minimize the additional software needed for the LFS, FFS index structures
(inodes) and directories are used almost without change, making tools written to
analyze the FFS immediately applicable to the LFS (a useful result in itself).
Where the LFS differs from the FFS is in the layout of the inode, directory, and file
data blocks on disk.
The underlying structure of the LFS is that of a sequential, append-only log.
The disk is statically partitioned into fixed-sized contiguous segments (which are
generally 0.5 to 1 Mbyte), as shown by the disk-layout column of Fig. 8.9. The
initial superblock is in the same location as in the FFS, and is replicated throughout
the disk in selected segments. All writes to the disk are appended to the logical
end of the log. Although the log logically grows forever, portions of the log
that have already been written must be made available periodically for reuse
because the disk is not infinite in length. This process is called cleaning, and the
Figure 8.9 Log-Structured Filesystem layout. [Figure omitted. Disk layout: segment 1, segment 2, ..., segment n. Partial segment: segment summary, inode blocks, and data blocks. Segment summary: next segment, file count, inode count, file information 1 through n, inode daddr entries. File information: number of blocks, version number, inode number, last block size, logical block 1 through n.]
utility that performs this reclamation is called the cleaner. The need for cleaning
is the reason that the disk is logically divided into segments. Because the disk is
divided into reasonably large static areas, it is easy to segregate the portions of the
disk that are currently being written from those that are currently being cleaned.
The logical order of the log is not fixed, and the log should be viewed as a linked
list of segments, with segments being periodically cleaned, detached from their
current position in the log, and reattached after the end of the log.
In ideal operation, the LFS accumulates dirty blocks in memory. When
enough blocks have been accumulated to fill a segment, they are written to the
disk in a single, contiguous I/O operation. Since it makes little sense to write data
blocks contiguously and continue to require seeks and synchronous writes to
update their inode-modification times, the modified inodes are written into the
segment at the same time as the data. As a result of this design goal, inodes are no
longer in fixed locations on the disk, and the LFS requires an additional data structure
called the inode map, which maps inode numbers to the current disk
addresses of the blocks containing them. So that fast recovery after crashes is
facilitated, the inode map is also stored on disk (the inode map would be time con­
suming to recreate after system failure).
As the LFS writes dirty data blocks to the logical end of the log (that is, into
the next available segment), modified blocks will be written to the disk in locations
different from those of the original blocks. This behavior is called a no-overwrite
policy, and it is the responsibility of the cleaner to reclaim space resulting from
deleted or rewritten blocks. Generally, the cleaner reclaims space in the filesystem
by reading a segment, discarding dead blocks (blocks that belong to deleted files or
that have been superseded by rewritten blocks), and rewriting any live blocks to the
end of the log.
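The no-overwrite policy and the cleaner's job can be illustrated with a toy model. Everything in it is an assumption made for the example: the class name ToyLog, the use of a list index as a disk address, and a dictionary standing in for the inode map and block pointers. Segment boundaries and reuse of the cleaned region are not modeled.

```python
class ToyLog:
    """Append-only log of file blocks; nothing is ever overwritten in place."""

    def __init__(self):
        self.log = []        # entries are (file_id, lbn, data); index = disk address
        self.current = {}    # (file_id, lbn) -> address of the live copy

    def write(self, file_id, lbn, data):
        self.log.append((file_id, lbn, data))          # always append
        self.current[(file_id, lbn)] = len(self.log) - 1

    def delete(self, file_id):
        for key in [k for k in self.current if k[0] == file_id]:
            del self.current[key]                      # old copies become dead

    def is_live(self, addr):
        file_id, lbn, _ = self.log[addr]
        return self.current.get((file_id, lbn)) == addr

    def clean(self, start, end):
        """Rewrite live blocks in [start, end) to the end of the log.

        Returns the number of dead blocks that were discarded.
        """
        dead = 0
        for addr in range(start, end):
            file_id, lbn, data = self.log[addr]
            if self.is_live(addr):
                self.write(file_id, lbn, data)
            else:
                dead += 1
        return dead
```

After clean() returns, no live data remain in the cleaned range, so a real filesystem could mark the corresponding segment clean and reuse it.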
In a workstation environment, the LFS usually will not accumulate many dirty
data blocks before having to write at least some portion of the accumulated data.
Reasons that writes must happen include the requirement of the Network Filesys­
tem (NFS) that write operations be flushed to the disk before the write call returns,
and that UNIX filesystems (and POSIX standards) have historically guaranteed that
closing a file descriptor both updates the inode and flushes pending write opera­
tions to the disk.
Because the LFS can only rarely write full segments, each segment is further
partitioned into one or more partial segments. A partial segment can be thought
of as the result of a single write operation to disk. Each partial segment is com­
posed of a single partial-segment summary, and inode blocks and data blocks, as
shown by the partial-segment column of Fig. 8.9. The segment summary
describes the inode and data blocks in the partial segment, and is shown by the
segment-summary column of Fig. 8.9. The partial-segment summary contains the
following information:
• Checksums for the summary information and for the entire partial segment
• The time that the partial segment was written (not shown in Fig. 8.9)
• Directory-operation information (not shown in Fig. 8.9)
• The disk address of the segment to be written immediately after this segment
• The number of file-information structures and the number of inode disk addresses
that follow
• A file-information structure for each separate file for which blocks are included
in this partial segment (described next)
• A disk address for each block of inodes included in this partial segment
The checksums are necessary for the recovery agent to determine that the par­
tial segment is complete. Because disk controllers do not guarantee that data are
written to disk in the order that write operations are issued, it is necessary to be
able to determine that the entire partial segment has been written to the disk suc­
cessfully. Writing a single disk sector's worth of the partial-segment summary
after the rest of the partial segment was known to have been written successfully
would largely avoid this problem; however, it would have a catastrophic effect on
filesystem performance, as there would be a significant rotational latency between
the two writes. Instead, a checksum of 4 bytes in each block of the partial segment
is created and provides validation of the partial segment, permitting the filesystem
to write multiple partial segments without an intervening seek or rotation.
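The idea of validating a partial segment from a few bytes of each block can be sketched as follows. Two assumptions are made here and should not be read as the LFS on-disk format: the sample is taken from the first 4 bytes of each block, and CRC-32 from Python's zlib module stands in for whatever checksum function the filesystem actually uses.

```python
import zlib

def partial_segment_checksum(blocks):
    """Checksum the first 4 bytes of every block in a partial segment."""
    sample = b"".join(block[:4] for block in blocks)
    return zlib.crc32(sample)

def is_complete(blocks, stored_checksum):
    """True if the blocks found on disk match the checksum in the summary."""
    return partial_segment_checksum(blocks) == stored_checksum
```

Because the checksum is written as part of the same contiguous transfer, a recovery agent can detect a partially written segment without the filesystem having paid for a second, rotationally delayed write.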
The file-information structures and inode disk addresses describe the rest of
the partial segment. The number of file-information structures and blocks of
inodes in the partial segment is specified in the segment-summary portion of the
partial segment. The inode blocks are identical to the FFS inode blocks. The disk
address of each inode block is also specified in the partial-segment summary
information, and can be retrieved easily from that structure. Blocks in the partial
segment that are not blocks of inodes are file data blocks, in the order listed in the
partial-segment summary information.
The file-information structures are as shown by the file-information column of
Fig. 8.9. They contain the following information:
• The number of data blocks for this file contained in this partial segment
• A version number for the file, intended for use by the cleaner
• The file's inode number
• The size of the block written most recently to the file in this partial segment
• The logical block number for the data blocks in this partial segment
Index File
The final data structure in the LFS is known as the index file (shown in Fig. 8.10),
because it contains a mapping from the inode number to the disk address of the
block that contains the inode. The index file is maintained as a regular, read-only
file visible in the filesystem, named ifile by convention.
Figure 8.10 Log-Structured Filesystem index-file structure. [Figure omitted. The index file holds the number of dirty segments and the number of clean segments; segment info 1 through n, each with a live byte count and a timestamp; and inode info 1 through n, each with a version number, a disk address, and a free-list pointer.]
There are two reasons for the index file to be implemented as a regular file.
First, because the LFS does not allocate a fixed position for each inode when created,
there is no reason to limit the number of inodes in the filesystem, as is done
in the FFS. This feature permits the LFS to support a larger range of uses because
the filesystem can change from being used to store a few, large files (e.g., an X11
binary area) to storing many files (e.g., a home directory or news partition) without
the filesystem being recreated. In addition, there is no hard limit to the number
of files that can be stored in the filesystem. However, this lack of constraints
requires that the inode map be able to grow and shrink based on the filesystem's
inode usage. Using an already established mechanism (the kernel file code) minimizes
the special-case code in the kernel.
Second, the information found in the index file is used by the cleaner. The
LFS cleaner is implemented as a user-space process, so it is necessary to make the
index-file information accessible to application processes. Again, because the
index file is visible in the filesystem, no additional mechanism is required, minimizing
the special-case code in both the kernel and the cleaner.
Because the index file's inode and data blocks are themselves written to new
locations each time that they are written, there must be a fixed location on the disk
that can be used to find them. This location is the superblock. The first
superblock is always in the same position on the disk and contains enough infor­
mation for the kernel to find the disk address of the block of inodes that contains
the index file's inode.
In addition to the inode map, the index file includes the other information that
is shared between the kernel and the cleaner. The index file contains the following
information:
• It contains the number of clean and dirty segments.
• It records segment-usage information, one entry per segment (rather than per
partial segment) on the disk. The segment-usage information includes the
number of live bytes currently found in the segment; the most recent modification
time of the segment; and flags that show whether the segment is currently
being written, whether the segment was written since the most recent checkpoint
(checkpoints are described in the writing to the log subsection), whether
the segment has been cleaned, and whether the segment contains a copy of the
superblock. Because segment-based statistics are maintained on the amount of
useful information that is currently in the segment, it is possible to clean segments
that contain a high percentage of useless data, so that the maximum
amount of space is made available for reuse with the minimal amount of
cleaning.
• It maintains inode information, one entry per current inode in the filesystem.
The inode information includes the current version number of the inode, the disk
address of the block of inodes that contains the inode, and a free-list pointer if
the inode is unused and is on the current list of free inodes.
So that calculations are simplified, segment-summary-information entries and
inode-map entries are block aligned and are not permitted to span block bound­
aries, resulting in a fixed number of each type of entry per block. This alignment
makes it possible for the filesystem to calculate easily the logical block of the
index file that contains the correct entry.
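The calculation that this alignment makes possible is plain integer arithmetic. The sketch below is illustrative only: the function name locate_entry, the first_block parameter (the index-file block at which a given table starts), and the entry sizes used in the usage note are assumptions, not values taken from the LFS source.

```python
def locate_entry(index, entry_size, block_size, first_block=0):
    """Return (logical block, byte offset) of the index-th fixed-size entry.

    Entries never span a block boundary, so each block holds
    block_size // entry_size of them and any tail bytes go unused.
    """
    per_block = block_size // entry_size
    block, slot = divmod(index, per_block)
    return first_block + block, slot * entry_size
```

For example, with a hypothetical 12-byte entry and 8-Kbyte blocks, each block holds 682 entries, so entry 682 is the first entry of the table's second block.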
Reading of the Log
To clarify the relationships among these structures, we shall consider the steps
necessary to read a single block of a file if the file's inode number is known and
there is no other information available.
1. Read in the superblock. The superblock contains the index file's inode number,
and the disk address of the block of inodes that contains the index file's
inode.
2. Read in the block of inodes that contains the index file's inode. Search the
block and find the index file's inode. Inode blocks are searched linearly. No
more complicated search or data structure is used, because, on the average, in
an 8-Kbyte-block filesystem, only 32 or so memory locations need to be
checked for any given inode in a block to be located.
3. Use the disk addresses in the index file's inode and read in the block of the
index file that contains the inode-map entry for the requested file's inode.
4. Take the disk address found in the inode-map entry and use it to read in the
block of inodes that contains the inode for the requested file. Search the block
to find the file's inode.
5. Use the disk addresses found in the file's inode to read in the blocks of the
requested file.
Normally, all this information would be cached in memory, and the only real
I/O would be a single I/O operation to bring the file's data block into memory.
However, it is important to minimize the information stored in the index file to
ensure that the latter does not reserve unacceptable amounts of memory.
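The five steps can be traced against a simulated disk. In this sketch the disk is a Python dictionary from addresses to block contents, inodes are small dictionaries, and the index file is reduced to its inode map; all of these representations, and the names find_inode and read_file_block, are assumptions made for illustration. No caching is modeled, so every step touches the "disk".

```python
def find_inode(inode_block, inum):
    """Linear search of a block of inodes, as in steps 2 and 4."""
    for inode in inode_block:
        if inode["inum"] == inum:
            return inode
    raise KeyError(inum)

def read_file_block(disk, inum, lbn, entries_per_block):
    """Read logical block lbn of file inum, knowing only the inode number."""
    sb = disk["superblock"]                                   # step 1
    ifile = find_inode(disk[sb["ifile_inode_addr"]],
                       sb["ifile_inum"])                      # step 2
    map_block = disk[ifile["blocks"][inum // entries_per_block]]  # step 3
    inode_addr = map_block[inum % entries_per_block]
    inode = find_inode(disk[inode_addr], inum)                # step 4
    return disk[inode["blocks"][lbn]]                         # step 5

# A tiny hypothetical disk: the index file is inode 1, the file is inode 2.
disk = {
    "superblock": {"ifile_inum": 1, "ifile_inode_addr": 10},
    10: [{"inum": 1, "blocks": [20]}],          # block of inodes holding the index file's inode
    20: [None, 10, 30, None],                   # inode map: inum -> inode block address
    30: [{"inum": 7, "blocks": [99]}, {"inum": 2, "blocks": [40, 41]}],
    40: b"first",
    41: b"second",
}
```

Calling read_file_block(disk, 2, 1, 4) follows the chain superblock, index-file inode, inode-map entry, file inode, data block.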
Writing to the Log
When a dirty block must be flushed to the disk for whatever reason (e.g., because
of a fsync or sync system call, or because of closing a file descriptor), the LFS
gathers all the dirty blocks for the filesystem and writes them sequentially to the
disk in one or more partial segments. In addition, if the number of currently dirty
buffers approaches roughly one-quarter of the total number of buffers in the sys­
tem, the LFS will initiate a segment write regardless.
The filesystem does the write by traversing the vnode lists linked to the
filesystem mount point and collecting the dirty blocks. The dirty blocks are sorted
by file and logical block number (so that files and blocks within files will be written
as contiguously as possible), and then are assigned disk addresses. Their associated
meta-data blocks (inodes and indirect blocks) are updated to reflect the new
disk addresses, and the meta-data blocks are added to the information to be written.
This information is formatted into one or more partial segments, partial-segment
summaries are created, checksums are calculated, and the partial segments
are written into the next available segment. This process continues until all dirty
blocks in the filesystem have been written.
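The sorting and address-assignment step can be sketched in a few lines. This is a simplification under stated assumptions: dirty blocks are represented as (inode number, logical block number) pairs, addresses are consecutive integers, and the function name assign_addresses is invented. Meta-data blocks, summaries, and checksums are left out.

```python
def assign_addresses(dirty, next_addr):
    """Sort dirty blocks by (file, logical block) and give them consecutive
    disk addresses starting at next_addr.

    dirty is an iterable of (inode number, logical block number) pairs.
    Returns a list of ((inum, lbn), disk address) pairs in write order.
    """
    ordered = sorted(dirty)   # tuples sort by inode number, then block number
    return [(blk, next_addr + i) for i, blk in enumerate(ordered)]
```

Because addresses are handed out only after sorting, blocks of the same file end up adjacent on disk regardless of the order in which they were dirtied.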
Periodically, the LFS synchronizes the information on disk, such that all disk
data structures are completely consistent. This state is known as a filesystem
checkpoint. Normally, a checkpoint occurs whenever the sync system call is made
by the update utility, although there is no reason that it cannot happen more or less
often. The only effect of changing how often the filesystem checkpoints is that the
time needed to recover the filesystem after system failure is inversely proportional
to the frequency of the checkpoints. The only requirement is that the filesystem be
checkpointed between the time that a segment is last written and the time that the
segment is cleaned, to avoid a window where system failure during cleaning of a
segment could cause the loss of data that the kernel has already confirmed as
being written safely to disk.
For the filesystem to be checkpointed, additional data structures must be writ­
ten to disk. First, because each file inode is written into a new location each time
that it is written, the index file must also be updated and its dirty meta-data blocks
written. The flags in the segment usage information that note if each segment was
written since the most recent checkpoint must be toggled and written as part of
this update. Second, because the index-file inode will have been modified, it too
must be written, and the superblock must be updated to reflect its new location.
Finally, the superblock must be written to the disk. When these objects have been
updated and written successfully, the filesystem is considered checkpointed.
The amount of information needing to be written during a filesystem check­
point is proportional to the amount of effort the recovery agent is willing to make
after system failure. For example, it would be possible for the recovery agent to
detect that a file was missing an indirect block, if a data block existed for which
there was no appropriate indirect block, in which case, indirect blocks for files
would not have to be written during normal writes or checkpoints. Or, the recovery
agent could find the current block of inodes that contains the latest copy of the
index file inode by searching the segments on the disk for a known inode number,
in which case the superblock would not need to be updated during checkpoint.
More aggressively, it would be possible to rebuild the index file after system fail­
ure by reading the entire disk, so the index file would not have to be written to
complete a checkpoint. Like the decision of how often to checkpoint, the determi­
nation of the tradeoff between what is done by the system during filesystem
checkpoint and what is done by the recovery agent during system recovery is a
flexible decision.
Writes to a small fragment of an LFS are shown in Fig. 8.11. Note that the no-overwrite
policy of the LFS results in the latter using far more disk space than is
used by the FFS, a classic space-time tradeoff: Although more space is used,
because the disk I/O is contiguous on disk, it requires no intermediate seeks.
Block Accounting
Block accounting in the LFS is far more complex than in the FFS. In the FFS,
blocks are allocated as needed, and, if no blocks are available, the allocation fails.
The LFS requires two different types of block accounting.
The first form of block accounting is similar to that done by the FFS. The
LFS maintains a count of the number of disk blocks that do not currently contain
useful data. The count is decremented whenever a newly dirtied block enters the
buffer cache. Many files die in the cache, so this number must be incremented
Figure 8.11 Log-Structured Filesystem fragment. [Figure omitted: two snapshots of
segments 1 and 2, divided into partial segments.] In the first snapshot, the first partial
segment contains a segment summary (SS), two blocks from file 1 (F11 and F12), a single
block from file 2 (F21), and a block of inodes (I). The block of inodes contains the inodes
(and therefore the disk addresses) for files F1 and F2. The second partial segment contains
a segment summary and four blocks from file 3. In the second snapshot, a block has been
appended to file 2 (F22); a new file, file 4, has been written that has two blocks (F41 and
F42); and the first block of file 1 (F11) has been modified and therefore rewritten. Because
the disk addresses for files 1 and 2 have changed, and the inodes for files 3 and 4 have not
yet been written, those files' inodes are written (I). Note that this inode block still references
disk addresses in the first and second partial segments, because blocks F12 and F21,
and the blocks from file 3, are still live. Since the locations of the files' inodes have
changed, if the filesystem is to be consistent on disk, the modified blocks from the index
file (IF) must be written as well.
whenever blocks are deleted, even if the blocks were never written to disk. This
count provides a system-administration view of how much of the filesystem is
currently in use. However, this count cannot be used to authorize the acceptance
of a write from an application because the calculation implies that blocks can be
written successfully into the cache that will later fail to be written to disk. For
example, this failure could be caused by the disk filling up because the additional
blocks necessary to write dirty blocks (e.g., meta-data blocks and partial-segment
summary blocks) were not considered in this count. Even if the disk were not
full, all the available blocks might reside in uncleaned segments, and new data
could not be written.
The second form of block accounting is a count of the number of disk blocks
currently available for writing; that is, those that reside in segments that are clean and
ready to be written. This count is decremented whenever a newly dirtied block
enters the cache, and the count is not incremented until the block is discarded or
the segment into which it is written is cleaned. This accounting value is the value
that controls cleaning initiation. If an application attempts to write data, but there
is no space currently available for writing, the application will block until space is
available. Using this pessimistic accounting to authorize writing guarantees that,
if the operating system accepts a write request from the user, it will be able to do
that write, barring system failure.
The accounting support in the LFS is complex. This complexity arises
because allocation of a block must also consider the allocation of any necessary
meta-data blocks and any necessary inode and partial-segment summary blocks.
Determining the actual disk space required for any block write is difficult because
inodes are not collected into inode blocks, and indirect blocks and segment summaries
are not created until the partial segments are actually written. Every time
an inode is modified in the inode cache, a count of inodes to be written is incremented.
When blocks are dirtied, the number of available disk blocks is decremented.
To decide whether there is enough disk space to allow another write into
the cache, the system computes the number of segment summaries necessary to
write the dirty blocks already in the cache, adds the number of inode blocks necessary
to write the dirty inodes, and compares that number to the amount of space
currently available to be written. If insufficient space is available, either the
cleaner must run or dirty blocks in the cache must be deleted.
The Buffer Cache
Before the integration of the LFS into 4BSD, the buffer cache was thought to be
filesystem-independent code. However, the buffer cache contained assumptions
about how and when blocks are written to disk. The most significant problem was
that the buffer cache assumed that any single dirty block could be flushed to disk
at any time to reclaim the memory allocated to the block. There are two problems
with this assumption:
1. Flushing blocks a single block at a time would destroy any possible performance
advantage of the LFS, and, because of the modified meta-data and partial-segment
summary blocks, the LFS would use enormous amounts of disk space.
2. Also because of the modified meta-data and partial-segment summary blocks,
the LFS requires additional memory to write: If the system were completely
out of memory, it would be impossible for the LFS to write anything at all.
For these reasons, the LFS needs to guarantee that it can obtain the additional
buffers that it needs when it writes a segment, and that it can prevent the buffer
cache from attempting to flush blocks backed by a LFS. To handle these
problems, the LFS maintains its dirty buffers on the kernel LOCKED queue, instead of
on the traditional LRU queue, so that the buffer cache does not attempt to reclaim
them. Unfortunately, maintaining these buffers on the LOCKED queue exempts
most of the dirty LFS blocks from traditional buffer-cache behavior, which
undoubtedly alters system performance in unexpected ways. To prevent the LFS
from locking down all the available buffers and to guarantee that there are always
additional buffers available when they are needed for segment writing, the LFS
begins segment writing as described previously, when the number of locked-down
buffers exceeds a threshold. In addition, the kernel blocks any process attempting
to acquire a block from a LFS if the number of currently locked blocks is above a
related access threshold. Buffer allocation and management will be much more
reasonably handled by systems with better integration of the buffer cache and
virtual memory.
Another problem with the historic buffer cache was that it was a logical buffer
cache, hashed by vnode and file logical block number. In the FFS, since indirect
blocks did not have logical block numbers, they were hashed by the vnode of the
raw device (the file that represents the disk partition) and the disk address. Since
the LFS does not assign disk addresses until the blocks are written to disk, indirect
blocks have no disk addresses on which to hash. So that this problem could be
solved, the block name space had to incorporate meta-data block numbering.
Block numbers were changed to be signed integers, with negative block numbers
referencing indirect blocks and zero and positive numbers referencing data blocks.
Singly indirect blocks take on the negative block number of the first data block to
which they point. Doubly and triply indirect blocks take the next-lower negative
number of the singly or doubly indirect block to which they point. This approach
makes it possible for the filesystem to traverse the indirect block chains in either
direction, facilitating reading a block or creating indirect blocks. Because it was
possible for the FFS also to use this scheme, the current hash chains for both
filesystems are done in this fashion.
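The numbering rule can be made concrete with a short C sketch that computes the logical block number of an inode's top-level single, double, and triple indirect blocks. It applies the rule as stated in the text; the value of NDADDR (12 direct pointers) and the function name indir_lbn are assumptions for the illustration, and the number of pointers per indirect block is passed in as a parameter.

```c
#include <assert.h>

#define NDADDR 12   /* direct block pointers per inode (assumed value) */

/*
 * Logical block number of the top-level indirect block at the given
 * level (1 = single, 2 = double, 3 = triple).  nindir is the number of
 * block pointers held by one indirect block.
 */
long indir_lbn(int level, long nindir)
{
    assert(level >= 1 && level <= 3 && nindir > 0);

    long first_data = NDADDR; /* first data block reached via this level */
    long span = nindir;       /* data blocks covered by the level below */

    for (int i = 1; i < level; i++) {
        first_data += span;
        span *= nindir;
    }
    /*
     * The singly indirect block at the bottom of the chain is numbered
     * -first_data; each level above it takes the next-lower number.
     */
    return -first_data - (level - 1);
}
```

With 2048 pointers per indirect block, the single indirect block is numbered -12, the double indirect block -2061 (one below its first singly indirect block, -2060), and the triple indirect block -4196366.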
Directory Operations
Directory operations include those system calls that affect more than one inode
(typically a directory and a file). They include create, link, mkdir, mknod, remove,
rename, rmdir, and symlink. These operations pose a special problem for the LFS.
Since the basic premise of the LFS is that small I/O operations can be postponed
and then coalesced to provide larger I/O operations, retaining the synchronous
behavior of directory operations would make little sense. In addition, the UNIX
semantics of directory operations are defined to preserve ordering (e.g., if the
creation of one file precedes the creation of another, any recovery state of the
filesystem that includes the second file must also include the first). This semantic is used
in UNIX filesystems to provide mutual exclusion and other locking protocols.
Since directory operations affect multiple inodes, we must guarantee that either all
inodes and associated changes are written successfully to the disk, or that any
partially written information is ignored during recovery.
The basic unit of atomicity in LFS is the partial segment because the
checksum information guarantees that either all or none of the partial segment will be
considered valid. Although it would be possible to guarantee that the inodes for
any single directory operation would fit into a partial segment, that would require
each directory operation to be flushed to the disk before any vnode participating in
it is allowed to participate in another directory operation, or a potentially
extremely complex graph of vnode interdependencies has to be maintained.
Instead, a mechanism was introduced to permit directory operations to span
multiple partial segments. First, all vnodes participating in any directory operation are
flagged. When the partial segment containing the first of the flagged vnodes is
written, the segment summary flag SS_DIROP is set. If the directory-operation
information spans multiple partial segments, the segment summary flag SS_CONT
also is set. So that the number of partial segments participating in a set of
directory operations is minimized, vnodes are included in partial segments based on
whether they participated in a directory operation. Finally, so that directory
operations are prevented from being only partially reflected in a segment, no new
directory operations are begun while the segment writer is writing a partial segment
containing directory operations, and the segment writer will not write a partial
segment containing directory operations while any directory operation is in
progress.
During recovery, partial segments with the SS_DIROP or SS_CONT flag set are
ignored unless the partial segment completing the directory operation was written
successfully to the disk. For example, if the recovery agent finds a segment with
both SS_DIROP and SS_CONT set, it ignores all such partial segments until it finds
a later partial segment with SS_DIROP set and SS_CONT unset (i.e., the final partial
segment including any part of this set of directory operations). If no such partial
segment is ever found, then all the segments from the initial directory operation on
are discarded.
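The recovery rule for directory operations can be illustrated with a small C sketch. Given the summary flags of the valid partial segments found after the checkpoint, in write order, it returns how many of them may be applied. Only the flag names come from the text; the flag values, the function name, and the treatment of unflagged partial segments that fall inside an unfinished set (they are held back with the rest) are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder values; only the names are taken from the text. */
#define SS_DIROP 0x01u
#define SS_CONT  0x02u

/*
 * Return the number of leading partial segments that roll-forward may
 * apply.  Everything from an unfinished set of directory operations
 * onward is excluded.
 */
size_t dirop_usable_prefix(const unsigned *flags, size_t n)
{
    size_t usable = 0;  /* partial segments [0, usable) are safe */
    bool open = false;  /* inside a set that is not yet completed */

    for (size_t i = 0; i < n; i++) {
        if (flags[i] & SS_CONT)
            open = true;        /* the set continues later in the log */
        else if (flags[i] & SS_DIROP)
            open = false;       /* final partial segment of the set */
        if (!open)
            usable = i + 1;
    }
    return usable;
}
```

A log that ends while a set is still open therefore yields a prefix that stops just before the first partial segment of that set.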
Creation of a File
Creating a file in the LFS is a simple process. First, a new inode must be allocated
from the filesystem. There is a field in the superblock that points to the first free
inode in the linked list of free inodes found in the index file. If this pointer
references an inode, that inode is allocated in the index file, and the pointer is updated
from that inode's free-list pointer. Otherwise, the index file is extended by a
block, and the block is divided into index-file inode entries. The first of these
entries is then allocated as the new inode.
The inode version number is then incremented by some value. The reason for
this increment is that it makes the cleaner's task simpler. Recall that there is an
inode version number stored with each file-information structure in the segment.
When the cleaner reviews a segment for live data, mismatching version numbers
or an unallocated index file inode makes detection of file removal simple.
Conversely, deleting a file from the LFS adds a new entry to the index file's
free-inode list. Contrasted to the multiple synchronous operations required by the
FFS when a file is created, creating a file in LFS is conceptually simple and
blindingly fast. However, the LFS pays a price for avoiding the synchronous behavior:
It cannot permit segments to be written at the same time as files are being created,
and the maintenance of the allocation information is significantly more complex.
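The free-list handling described in this section can be sketched in user-level C as follows. This is an illustration under stated assumptions, not the kernel code: the structure and function names are hypothetical, inode number 0 is reserved as the end-of-list marker, an index-file block holds only four entries, and the version number is incremented by exactly one.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LFS_UNUSED_INUM   0  /* assumed end-of-list marker */
#define ENTRIES_PER_BLOCK 4  /* tiny, for illustration only */

struct ifile_entry {
    uint32_t version;    /* incremented on every allocation */
    uint32_t next_free;  /* next inode on the free list */
    int      allocated;
};

struct lfs_sketch {
    uint32_t free_head;         /* superblock field: first free inode */
    struct ifile_entry *ifile;  /* index-file inode entries */
    uint32_t nentries;
};

void lfs_sketch_init(struct lfs_sketch *fs)
{
    fs->free_head = LFS_UNUSED_INUM;
    fs->ifile = NULL;
    fs->nentries = 1;           /* entry 0 is reserved, never handed out */
}

/* Extend the index file by one block and chain its entries together. */
static int extend_ifile(struct lfs_sketch *fs)
{
    uint32_t old = fs->nentries, new_n = old + ENTRIES_PER_BLOCK;
    struct ifile_entry *p = realloc(fs->ifile, new_n * sizeof *p);

    if (p == NULL)
        return -1;
    for (uint32_t i = old; i < new_n; i++) {
        p[i].version = 0;
        p[i].allocated = 0;
        p[i].next_free = (i + 1 < new_n) ? i + 1 : LFS_UNUSED_INUM;
    }
    fs->ifile = p;
    fs->nentries = new_n;
    fs->free_head = old;        /* first new entry is allocated next */
    return 0;
}

uint32_t lfs_ialloc(struct lfs_sketch *fs)
{
    if (fs->free_head == LFS_UNUSED_INUM && extend_ifile(fs) != 0)
        return LFS_UNUSED_INUM;
    uint32_t inum = fs->free_head;
    struct ifile_entry *e = &fs->ifile[inum];
    fs->free_head = e->next_free;   /* pointer updated from the entry */
    e->next_free = LFS_UNUSED_INUM;
    e->allocated = 1;
    e->version++;                   /* lets the cleaner spot stale blocks */
    return inum;
}

void lfs_ifree(struct lfs_sketch *fs, uint32_t inum)
{
    assert(inum != LFS_UNUSED_INUM && inum < fs->nentries);
    struct ifile_entry *e = &fs->ifile[inum];
    assert(e->allocated);
    e->allocated = 0;
    e->next_free = fs->free_head;   /* push onto the free-inode list */
    fs->free_head = inum;
}
```

Neither path performs any synchronous disk write, which is the point of the comparison with the FFS made above.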
Reading and Writing to a File
Having created a file, a process can do reads or writes on it. The procedural path
through the kernel is largely identical to that of the FFS, as shown by Fig. 8.4 with
the ffs_ routines changed to lfs_. The code for ffs_read() and lfs_read(), and that
for ffs_write() and lfs_write(), is the same, with some C preprocessing #defines
added for minor tailoring. As in the FFS, each time that a process does a write
system call, the system checks to see whether the size of the file has increased. If
the file needs to be extended, the request is rounded up to the next fragment size,
and only that much space is allocated. A logical block request is handed off to
lfs_balloc(), which performs the same functions as ffs_balloc(), allocating any
necessary indirect blocks and the data block if it has not yet been allocated, and
reallocating and rewriting fragments as necessary.
Filesystem Cleaning
Because the disk is not infinite, cleaning must be done periodically to make new
segments available for writing. Cleaning is the most challenging aspect of the
LFS, in that its effect on performance and its interactions with other parts of the
system are still not fully understood.
Although a cleaner was simulated extensively in the original LFS design
[Rosenblum & Ousterhout, 1992], the simulated cleaner was never implemented,
and none of the implemented cleaners (including the one in 4BSD) have ever been
simulated. Cleaning must be done often enough that the filesystem does not fill
up; however, the cleaner can have a devastating effect on performance. Recent
research [Seltzer et al, 1995] shows that cleaning segments while the LFS is active
(i.e., writing other segments) can result in a performance degradation of about 35
to 40 percent for some transaction-processing-oriented applications. This
degradation is largely unaffected by how full the filesystem is; it occurs even when the
filesystem is half empty. However, even at 40-percent degradation, the LFS
performs comparably to the FFS on these applications. Recent research also shows
that typical workstation workloads can permit cleaning during disk idle periods
[Blackwell et al, 1995], without introducing any user-noticeable latency.
Cleaning in the LFS is implemented by a user utility named lfs_cleanerd.
This functionality was placed in user space for three major reasons.
First, experimentation with different algorithms, such as migrating rarely
accessed data to the same segment or restricting cleaning to disk idle times, prob­
ably will prove fruitful, and making this experimentation possible outside the
operating system will encourage further research. In addition, a single cleaning
algorithm is unlikely to perform equally well for all possible workloads. For
example, coalescing randomly updated files during cleaning should dramatically
improve later sequential-read performance for some workloads.
Second, the cleaner may potentially require large amounts of memory and
processor time, and previous implementations of the cleaner in the kernel have
caused noticeable latency problems in user response. When the cleaner is moved
to user space, it competes with other processes for processor time and virtual
memory, instead of tying down a significant amount of physical memory.
Third, given the negative effect that the cleaner can have on performance, and
the many possible algorithms for deciding when and what segments to clean,
running the cleaner is largely a policy decision, always best implemented outside the
kernel.
The number of live bytes of information in a segment, as determined from the
segment-usage information in the index file, is used as a measure of cleaning
importance. A simple algorithm for cleaning would be always to clean the seg­
ment that contains the fewest live bytes, based on the argument that this rule
would result in the most free disk space for the least effort. The cleaning
algorithm in the current LFS implementation is based on the simulation in Rosenblum
and Ousterhout, 1992. This simulation shows that selection of segments to clean
is an important design parameter in minimizing cleaning overhead, and that the
cost-benefit policy defined there does well for the simulated workloads. Briefly
restated, each segment is assigned a cleaning cost and benefit. The I/O cost to
clean a segment is equal to
1 + utilization,
where 1 represents the cost to read the segment to be cleaned, and utilization is the
fraction of live data in the segment that must be written back into the log. The
benefit of cleaning a segment is
free bytes generated x age of segment,
where free bytes generated is the fraction of dead blocks in the segment
(1 - utilization) and age of segment is the number of seconds since the segment was
written to disk. The selection of the age of segment metric can have dramatic
effects on the frequency with which the cleaner runs (and interferes with system
performance).
When the filesystem needs to reclaim space, the cleaner selects the segment
with the largest benefit-to-cost ratio:
((1 - utilization) x age of segment) / (1 + utilization)
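As an illustration of the cost-benefit policy, the following C sketch evaluates the benefit-to-cost ratio for each candidate segment and picks the largest. The structure, its field names, and the dirty flag are assumptions of the example; the actual lfs_cleanerd reads its inputs from the segment-usage information in the index file.

```c
#include <assert.h>
#include <stddef.h>

struct seg_usage {
    double utilization;  /* fraction of the segment that is live, 0..1 */
    double age_seconds;  /* seconds since the segment was written */
    int    dirty;        /* nonzero if the segment may be cleaned */
};

/* (1 - utilization) x age of segment, divided by (1 + utilization). */
double benefit_to_cost(const struct seg_usage *s)
{
    assert(s->utilization >= 0.0 && s->utilization <= 1.0);
    return (1.0 - s->utilization) * s->age_seconds / (1.0 + s->utilization);
}

/* Index of the dirty segment with the largest ratio, or -1 if none. */
long pick_segment(const struct seg_usage *segs, size_t n)
{
    long best = -1;
    double best_ratio = -1.0;

    for (size_t i = 0; i < n; i++) {
        if (!segs[i].dirty)
            continue;
        double r = benefit_to_cost(&segs[i]);
        if (r > best_ratio) {
            best_ratio = r;
            best = (long)i;
        }
    }
    return best;
}
```

Note that the policy can prefer an old, nearly full segment over a young, mostly empty one: a segment that is 90 percent live but 1000 seconds old scores about 52.6, whereas one that is 20 percent live but 10 seconds old scores about 6.7. The simple fewest-live-bytes rule would make the opposite choice.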
Once a segment has been selected for cleaning, by whatever mechanism,
cleaning proceeds as follows:
1. Read one (or more) target segments.
2. Determine the blocks that contain useful data. For the cleaner to determine the
blocks in a segment that are live, it must be able to identify each block in a
segment; so, the summary block of each partial segment identifies the inode
and logical block number of every block in the partial segment.
3. Write the live blocks back into the filesystem.
4. Mark the segments as clean.
The cleaner shares information with the kernel via four new system calls and
the index file. The new system calls interface to functionality that was used by the
kernel (e.g., the translation of file logical block numbers to disk addresses done by
ufs_bmap()) and to functionality that must be in the kernel to avoid races between
the cleaner and other processes.
The four system calls added for the cleaner are as follows:
1. lfs_bmapv: Take an array of inode number and logical block number pairs, and
return the current disk address, if any, for each block. If the disk address
returned to the cleaner is the one in the segment that it is considering, the
block is live.
2. lfs_markv: Take an array of inode number and logical block number pairs and
write their associated data blocks into the filesystem in the current partial
segment. Although it would be theoretically possible for the cleaner to
accomplish this task itself, the obvious race with other processes writing or deleting
the same blocks, and the need to do the write without updating the inode's
access or modification times, made it simpler for this functionality to be in the
kernel.
3. lfs_segclean: Mark a segment clean. After the cleaner has rewritten all the live
data in the segment, this system call marks the segment clean for reuse. It is a
system call so that the kernel does not have to search the index file for new
segments and so that the cleaner does not have to modify the index file.
4. lfs_segwait: Make a special-purpose sleep call. The calling process is put to
sleep until a specified timeout period has elapsed or, optionally, until a seg­
ment has been written. This operation lets the cleaner pause until there may
be a requirement for further cleaning.
When a segment is selected and read into memory, the cleaner processes each
partial segment in the segment sequentially. The segment summary specifies the
blocks that are in the partial segment. Periodically, the cleaner constructs an array
of pairs consisting of an inode number and a logical block number, for file blocks
found in the segment, and uses the lfs_bmapv system call to obtain the current
disk address for each block. If the returned disk address is the same as the
location of the block in the segment being examined, the block is live. The cleaner
uses the lfs_markv system call to rewrite each live block into another segment in
the filesystem.
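The liveness test itself is a simple address comparison, as the following C sketch shows. The structure and function names are hypothetical, and the sketch deliberately does not reproduce the real lfs_bmapv interface: the cur_addr field stands for whatever address a lookup such as lfs_bmapv returned for the block, and is filled in by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One file block named by a partial-segment summary (illustrative). */
struct block_ref {
    uint32_t inum;      /* inode number */
    int32_t  lbn;       /* logical block number within the file */
    int32_t  seg_addr;  /* where this copy sits in the segment examined */
    int32_t  cur_addr;  /* current address from the lookup; -1 if none */
};

/*
 * Keep only the live blocks, compacting them to the front of the
 * array, and return how many there are.  A block is live exactly when
 * its current disk address is its location in the segment examined.
 */
size_t select_live(struct block_ref *refs, size_t n)
{
    size_t live = 0;

    for (size_t i = 0; i < n; i++) {
        assert(refs[i].seg_addr >= 0);
        if (refs[i].cur_addr == refs[i].seg_addr)
            refs[live++] = refs[i];
    }
    return live;
}
```

The compacted prefix is the set of inode number and logical block number pairs that the cleaner would then hand to lfs_markv.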
Before rewriting these blocks, the kernel must verify that none of the blocks
have been superseded or deleted since the cleaner called lfs_bmapv. Once the call
to lfs_markv begins, only blocks specified by the cleaner are written into the log,
until the lfs_markv call completes, so that, if cleaned blocks die after the lfs_markv
call verifies that they are alive, the partial segments written after the lfs_markv
partial segments will update their status properly.
The separation of the lfs_bmapv and lfs_markv functionality was done
deliberately to make it easier for LFS to support new cleaning algorithms. There is no
requirement that the cleaner always call lfs_markv after each call to lfs_bmapv, or
that it call lfs_markv with the same arguments. For example, the cleaner might
use lfs_markv to do block coalescing from several segments.
When the cleaner has written the live blocks using lfs_markv, the cleaner calls
lfs_segclean to mark the segment clean. When the cleaner has cleaned enough
segments, it calls lfs_segwait, sleeping until the specified timeout elapses or a new
segment is written into the filesystem.
Since the cleaner is responsible for producing free space, the blocks that it
writes must get preference over all other dirty blocks to be written, so that the sys­
tem avoids running out of free space. In addition, there are degenerative cases
where cleaning a segment can consume more space than it reclaims. So that the
cleaner can always run and will eventually generate free space, all writing by any
process other than the cleaner is blocked by the kernel when the number of clean
segments drops below 3.
Filesystem Parameterization
Parameterization in the LFS is minimal. At filesystem-creation time, it is possible
to specify the filesystem block and fragment size, the segment size, and the per­
centage of space reserved from normal users. Only the last of these parameters
may be altered after filesystem creation without recreation of the filesystem.
Filesystem-Crash Recovery
Historic UNIX systems spend a significant amount of time in filesystem checks
while rebooting. As disks become ever larger, this time will continue to increase.
There are two aspects to filesystem recovery: bringing the filesystem to a
physically consistent state and verifying the logical structure of the filesystem. When
the FFS or the LFS adds a block to a file, there are several different pieces of
information that may be modified: the block itself, its inode, indirect blocks, and, of
course, the location of the most recent allocation. If the system crashes between
any of the operations, the filesystem is likely to be left in a physically inconsistent
state.
There is currently no way for the FFS to determine where on the disk or in the
filesystem hierarchy an inconsistency is likely to occur. As a result, it must
rebuild the entire filesystem state, including cylinder-group bitmaps and all meta­
data after each system failure. At the same time, the FFS verifies the filesystem
hierarchy. Traditionally, fsck is the utility that performs both of these functions.
Although the addition of filesystem-already-clean flags and tuning fsck has pro­
vided a significant decrease in the time that it takes to reboot in 4BSD, it can still
take minutes per filesystem before applications can be run.
Because writes are localized in the LFS, the recovery agent can determine
where any filesystem inconsistencies caused by the system crash are located, and
needs to check only those segments, so bringing a LFS to a consistent state nor­
mally requires only a few seconds per filesystem. The minimal time required to
achieve filesystem consistency is a major advantage for the LFS over the FFS.
However, although fast recovery from system failure is desirable, reliable recovery
from media failure is necessary. The high level of robustness that fsck provides
for the FFS is not maintained by this consistency checking. For example, fsck is
capable of recovering from the corruption of data on the disk by hardware, or by
errant software overwriting filesystem data structures such as a block of inodes.
Recovery in the LFS has been separated into two parts. The first part involves
bringing the filesystem into a consistent state after a system crash. This part of
recovery is more similar to standard database recovery than to fsck. It consists of
three steps:
1. Locate the most recent checkpoint-the last time at which the filesystem was
consistent on disk.
2. Initialize all the filesystem structures based on that checkpoint.
3. Roll forward, reading each partial segment from the checkpoint to the end of
the log, in write order, and incorporating any modifications that occurred,
except as noted previously for directory operations.
Support for rolling forward is the purpose of much of the information
included in the partial-segment summary. The next-segment pointers are provided
so that the recovery agent does not have to search the disk to find the next segment
to read. The recovery agent uses the partial-segment checksums to identify valid
partial segments (ones that were written completely to the disk). It uses the partial
segment time-stamps to distinguish partial segments written after the checkpoint
from those that were written before the checkpoint and that were later reclaimed
by the cleaner. It uses the file and block numbers in the file-information structures
to update the index file (the inode map and segment-usage information) and the
file inodes, to make the blocks in the partial segment appear in the file. The latter
actions are similar to those taken in cleaning. As happens in database recovery,
the filesystem-recovery time is proportional to the interval between filesystem
checkpoints.
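The way that roll forward decides where the usable log ends can be sketched in C. The structure below is a simplified stand-in for a partial-segment summary, with hypothetical field names; the recomputed checksum is assumed to have been calculated already from the blocks on disk, and treating a time stamp equal to the checkpoint time as valid is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a partial-segment summary (illustrative). */
struct pseg_summary {
    uint32_t checksum;     /* checksum stored when the segment was written */
    uint32_t computed;     /* checksum recomputed from the blocks on disk */
    uint64_t create_time;  /* time stamp of the partial segment */
};

/*
 * Walk the partial segments in write order, starting just after the
 * checkpoint, and return how many can be rolled forward.
 */
size_t roll_forward_count(const struct pseg_summary *log, size_t n,
                          uint64_t checkpoint_time)
{
    size_t applied = 0;

    for (size_t i = 0; i < n; i++) {
        if (log[i].checksum != log[i].computed)
            break;  /* not written completely: end of the valid log */
        if (log[i].create_time < checkpoint_time)
            break;  /* older data left in a segment that was reclaimed */
        applied++;
    }
    return applied;
}
```

Because the walk stops at the first invalid partial segment, its cost grows with the amount of log written since the last checkpoint.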
The second part of recovery in the LFS involves the filesystem-consistency
checks performed for the FFS by fsck. This check is similar to the functionality of
fsck, and, like fsck, will take a long time to run. (This functionality has not been
implemented in 4.4BSD.)
The LFS implementation permits fast recovery, and applications are able to
start running as soon as the roll forward has been completed, while basic sanity
checking of the filesystem is done in the background. There is the obvious
problem of what to do if the sanity check fails. If that happens, the filesystem must be
downgraded forcibly to read-only status, and fixed. Then, writes can be enabled
once again. The only applications affected by this downgrade are those that were
writing to the filesystem. Of course, the root filesystem must always be checked
completely after every reboot, to avoid a cycle of reboot followed by crash
followed by reboot if the root has become corrupted.
Like the FFS, the LFS replicates the superblock, copying the latter into several
segments. However, no cylinder placement is taken into account in this
replication, so it is theoretically possible that all copies of the superblock would be on the
same disk cylinder or platter.
The Memory-Based Filesystem
Memory-based filesystems have existed for a long time; they have generally been
marketed as random-access-memory disks (RAM-disk) or sometimes as software
packages that use the machine's general-purpose memory. A RAM disk is
designed to appear like any other disk peripheral connected to a machine. It is
normally interfaced to the processor through the I/O bus, and is accessed through
a device driver similar or sometimes identical to the device driver used for a
normal magnetic disk. The device driver sends requests for blocks of data to the
device, and the hardware then transfers the requested data to or from the requested
disk sectors. Instead of storing its data on a rotating magnetic disk, the RAM disk
stores its data in a large array of RAM or bubble memory. Thus, the latency of
accessing the RAM disk is nearly zero, whereas 15 to 50 milliseconds of latency
are incurred when rotating magnetic media are accessed. RAM disks also have the
benefit of being able to transfer data at the memory bandwidth of the system,
whereas magnetic disks are typically limited by the rate at which the data pass
under the disk head.
Software packages simulating RAM disks operate by allocating a fixed parti­
tion of the system memory. The software then provides a device-driver interface
similar to the one used by disk hardware. Because the memory used by the RAM
disk is not available for other purposes, software RAM-disk solutions are used
primarily for machines with limited addressing capabilities, such as 16-bit computers
that do not have an effective way to use the extra memory.
Most software RAM disks lose their contents when the system is powered
down or rebooted. The system can save the contents either by using
battery-backed-up memory, or by storing critical filesystem data structures in the
filesystem and running a consistency-check program after each reboot. These
conditions increase the hardware cost and potentially slow down the speed of the
disk. Thus, RAM-disk filesystems are not typically designed to survive power
failures; because of their volatility, their usefulness is limited to storage of transient or
easily recreated information, such as might be found in /tmp. Their primary
benefit is that they have higher throughput than do disk-based filesystems [Smith,
1981]. This improved throughput is particularly useful for utilities that make
heavy use of temporary files, such as compilers. On fast processors, nearly one­
half of the elapsed time for a compilation is spent waiting for synchronous opera­
tions required for file creation and deletion. The use of the MFS nearly eliminates
this waiting time.
Use of dedicated memory to support a RAM disk exclusively is a poor use of
resources. The system can improve overall throughput by using the memory for
the locations with high access rates. These locations may shift between
supporting process virtual address spaces and caching frequently used disk blocks.
Memory dedicated to the filesystem is used more effectively in a buffer cache than as a
RAM disk. The buffer cache permits faster access to the data because it requires
only a single memory-to-memory copy from the kernel to the user process. The
use of memory in a RAM-disk configuration may require two memory-to-memory
copies: one from the RAM disk to the buffer cache, then another from the buffer
cache to the user process.
The 4.4BSD system avoids these problems by building its RAM-disk
filesystem in pageable memory, instead of in dedicated memory. The goal is to provide
the speed benefits of a RAM disk without paying the performance penalty inherent
in dedicating to the RAM disk part of the physical memory on the machine. When
the filesystem is built in pageable memory, it competes with other processes for
the available memory. When memory runs short, the paging system pushes its
least recently used pages to backing store. Being pageable also allows the filesys­
tem to be much larger than would be practical if it were limited by the amount of
physical memory that could be dedicated to that purpose. The /tmp filesystem can
be allocated a virtual address space that is larger than the physical memory on the
machine. Such a configuration allows small files to be accessed quickly, while
still allowing /tmp to be used for big files, although at a speed more typical of nor­
mal, disk-based filesystems.
An alternative to building a MFS would be to have a filesystem that never did
operations synchronously, and that never flushed its dirty buffers to disk.
However, we believe that such a filesystem either would use a disproportionately large
percentage of the buffer-cache space, to the detriment of other filesystems, or
would require the paging system to flush its dirty pages. Waiting for other
filesystems to push dirty pages subjects all filesystems to delays while they are waiting
for the pages to be written [Ohta & Tezuka, 1990].
Organization of the Memory-Based Filesystem
The implementation of the MFS in 4.4BSD was done before the FFS had been split
into semantic and filestore modules. Thus, to avoid rewriting the semantics of the
4.4BSD filesystem, it instead used the FFS in its entirety. The current design does
not take advantage of the memory-resident nature of the filesystem. A future
implementation probably will use the existing semantic layer, but will rewrite the
filestore layer to reduce its execution expense and to make more efficient use of
the memory space.
The user creates a filesystem by invoking a modified version of the newfs
utility, with an option telling newfs to create a MFS. The newfs utility allocates a
section of virtual address space of the requested size, and builds a filesystem in the
memory, instead of on a disk partition. When the filesystem has been built, newfs
does a mount system call specifying a filesystem type of MFS. The auxiliary data
parameter to the mount call specifies a pointer to the base of the memory in which
it has built the filesystem. The mount call does not return until the filesystem is
unmounted. Thus, the newfs process provides the context to support the MFS.
The mount system call allocates and initializes a mount-table entry, and then
calls the filesystem-specific mount routine. The filesystem-specific routine is
responsible for doing the mount and for initializing the filesystem-specific portion
of the mount-table entry. It allocates a block-device vnode to represent the
memory disk device. In the private area of this vnode, it stores the base address of the
filesystem and the process identifier of the newfs process for later reference when
doing I/O. It also initializes an I/O list that it uses to record outstanding I/O
requests. It can then call the normal FFS mount system call, passing the special
block-device vnode that it has created, instead of the usual disk block-device
vnode. The mount proceeds just like any other local mount, except that requests
to read from the block device are vectored through the MFS block-device vnode,
instead of through the usual block-device I/O function. When the mount is
completed, mount does not return as most other filesystem mount system calls do;
instead, it sleeps in the kernel awaiting I/O requests. Each time an I/O request is
posted for the filesystem, a wakeup is issued for the corresponding newfs process.
When awakened, the process checks for requests on its I/O list. The filesystem
services a read request by copying to a kernel buffer data from the section of the
newfs address space corresponding to the requested disk block. Similarly, the
filesystem services a write request by copying data to the section of the newfs
address space corresponding to the requested disk block from a kernel buffer.
When all the requests have been serviced, the newfs process returns to sleep to
await more requests.
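The request-servicing step can be modeled in ordinary user-level C. In the sketch below, every name is hypothetical, the block size is assumed to be 512 bytes, memcpy() stands in for the kernel's copy between the newfs address space and a kernel buffer, and the sleep and wakeup calls are omitted: the function simply drains the I/O list once, as the newfs process does each time it is awakened.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLKSIZE 512   /* assumed block size for the sketch */

enum io_dir { IO_READ, IO_WRITE };

/* Stand-in for a buffer linked onto the MFS I/O list. */
struct io_req {
    enum io_dir dir;
    size_t blkno;         /* requested disk block */
    unsigned char *kbuf;  /* kernel buffer */
    struct io_req *next;
};

struct mfs_sketch {
    unsigned char *base;     /* base of the filesystem in newfs memory */
    size_t nblocks;
    struct io_req *io_list;  /* outstanding requests */
};

/* Service every queued request; return the number serviced. */
size_t mfs_service(struct mfs_sketch *m)
{
    size_t done = 0;
    struct io_req *r;

    while ((r = m->io_list) != NULL) {
        m->io_list = r->next;
        assert(r->blkno < m->nblocks);
        unsigned char *blk = m->base + r->blkno * BLKSIZE;
        if (r->dir == IO_READ)
            memcpy(r->kbuf, blk, BLKSIZE);  /* newfs memory to buffer */
        else
            memcpy(blk, r->kbuf, BLKSIZE);  /* buffer to newfs memory */
        done++;
    }
    return done;  /* the caller would now go back to sleep */
}
```

Requests are taken from the head of the list, so a write queued ahead of a read of the same block is visible to that read.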
Once the MFS is mounted, all operations on files are handled by the FFS code
until they get to the point where the filesystem needs to do I/O on the device.
Here, the filesystem encounters the second piece of the MFS. Instead of calling
the special-device strategy routine, it calls the memory-based strategy routine.
Usually, the filesystem services the request by linking the buffer onto the I/O list
for the MFS vnode, and issuing a wakeup to the newfs process. This wakeup
results in a context switch to the newfs process, which does a copyin or copyout,
as described previously. The strategy routine must be careful to check whether the
I/O request is coming from the newfs process itself, however. Such requests
happen during mount and unmount operations, when the kernel is reading and writing
the superblock. Here, the MFS strategy routine must do the I/O itself, to avoid
deadlock.
The final piece of kernel code to support the MFS is the close routine. After
the filesystem has been unmounted successfully, the device close routine is called.
This routine flushes any pending I/O requests, then sets the I/O list head to a
special value that is recognized by the I/O servicing loop as an indication that the
filesystem is unmounted. The mount system call exits, in turn causing the newfs
process to exit, resulting in the filesystem vanishing in a cloud of dirty pages.
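The interplay among the strategy routine, the servicing loop, and the close routine can be sketched in a few lines of Python. This is a simplified model, not the kernel code: the class and method names, the process identifiers, and the use of a byte array to stand in for the newfs address space are all assumptions made for illustration.

```python
READ, WRITE = "read", "write"
UNMOUNTED = object()          # hypothetical "special value" for the list head

class MemoryFilesystem:
    def __init__(self, server_pid, size):
        self.server_pid = server_pid        # stands in for the newfs process
        self.space = bytearray(size)        # stands in for its address space
        self.io_list = []                   # requests awaiting service

    def do_io(self, req):
        """Copy between the request buffer and the filesystem memory."""
        op, offset, buf = req
        if op == READ:
            buf[:] = self.space[offset:offset + len(buf)]
        else:
            self.space[offset:offset + len(buf)] = buf

    def strategy(self, req, current_pid):
        if current_pid == self.server_pid:
            self.do_io(req)                 # newfs cannot wait on itself
        else:
            self.io_list.append(req)        # the kernel would now wake newfs

    def close(self):
        self.service()                      # flush pending requests
        self.io_list = UNMOUNTED

    def service(self):
        """Run each time newfs is awakened; False means it should exit."""
        if self.io_list is UNMOUNTED:
            return False
        while self.io_list:
            self.do_io(self.io_list.pop(0))
        return True

mfs = MemoryFilesystem(server_pid=100, size=4096)
mfs.strategy((WRITE, 512, bytearray(b"data")), current_pid=200)
mfs.service()
out = bytearray(4)
mfs.strategy((READ, 512, out), current_pid=100)    # done directly
print(bytes(out))
mfs.close()
print(mfs.service())
```

The model omits sleeping and wakeups entirely; it shows only which path a request takes and how the sentinel ends the servicing loop.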
The paging of the filesystem does not require any additional code beyond that
already in the kernel to support virtual memory. The newfs process competes with
other processes on an equal basis for the machine's available memory. Data pages
of the filesystem that have not yet been used are zero-fill-on-demand pages that do
not occupy memory. As long as memory is plentiful, the entire contents of the
filesystem remain memory resident. When memory runs short, the oldest pages of
newfs are pushed to backing store as part of the normal paging activity. The pages
that are pushed usually hold the contents of files that have been created in the
MFS, but that have not been accessed recently (or have been deleted).
Filesystem Performance
The performance of the current MFS is determined by the memory-to-memory
copy speed of the processor. Empirically, the throughput is about 45 percent of
this memory-to-memory copy speed. The basic set of steps for each block written
is as follows:
1. Memory-to-memory copy from the user process doing the write to a kernel
   buffer
2. Context switch to the newfs process
3. Memory-to-memory copy from the kernel buffer to the newfs address space
4. Context switch back to the writing process
Thus, each write requires at least two memory-to-memory copies, accounting for
about 90 percent of the CPU time. The remaining 10 percent is consumed in the
context switches and in the filesystem-allocation and block-location code. The
actual context-switch count is only about one-half of the worst case outlined previously
because read-ahead and write-behind allow multiple blocks to be handled
with each context switch.
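The 45 percent figure is consistent with the two-copy, 90 percent breakdown. As a rough check, the arithmetic can be sketched as follows; the function name and the sample copy speed of 10 Mbyte per second are hypothetical, and the model assumes that all time not spent copying is pure overhead.

```python
def mfs_throughput(copy_speed, copies_per_block=2, copy_fraction=0.90):
    """Estimate MFS throughput in the same units as copy_speed.

    copy_speed       -- raw memory-to-memory copy speed (e.g., Mbyte/s)
    copies_per_block -- copies needed to move one block (two, per the text)
    copy_fraction    -- share of CPU time spent copying (about 90 percent)
    """
    # Time to push one unit of data through all of the copies.
    copy_time = copies_per_block / copy_speed
    # Copying is only copy_fraction of the total time, so scale up.
    total_time = copy_time / copy_fraction
    return 1.0 / total_time

# Hypothetical 10 Mbyte/s copy speed; only the ratio matters.
ratio = mfs_throughput(10.0) / 10.0
print(f"throughput is {ratio:.0%} of copy speed")
```

Two copies alone would cap throughput at 50 percent of the copy speed; the remaining overhead brings the estimate down to 0.90 / 2 = 0.45.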
The added speed of the MFS is most evident for processes that create and
delete many files. The reason for the speedup is that the filesystem must do two
synchronous operations to create a file: first, writing the allocated inode to disk;
then, creating the directory entry. Deleting a file similarly requires at least two
synchronous operations. Here, the low latency of the MFS is noticeable compared
to that of a disk-based filesystem because a synchronous operation can be done
with just two context switches, instead of incurring the disk latency.
Future Work
The most obvious shortcoming of the current implementation is that filesystem
blocks are copied twice: once between the newfs process address space and the
kernel buffer cache, and once between the kernel buffer and the requesting process.
These copies are done in different process contexts, necessitating two context
switches per group of I/O requests. When the MFS was built, the virtual-memory
system did not support paging of any part of the kernel address space.
Thus, the only way to build a pageable filesystem was to do so in the context of a
normal process. The current virtual-memory system allows parts of the kernel
address space to be paged. Thus, it is now possible to build a MFS that avoids the
double copy and context switch. One potential problem with such a scheme is that
many kernels are limited to a small address space (usually a few Mbyte). This
restriction limits the size of MFS that such a machine can support. On such a
machine, the kernel can describe a MFS that is larger than its address space and
can use a window to map the larger filesystem address space into its limited
address space. The window maintains a cache of recently accessed pages. The
problem with this scheme is that, if the working set of active pages is greater than
the size of the window, then much time is spent remapping pages and invalidating
translation buffers. Alternatively, a separate address space could be constructed
for each MFS, as in the current implementation. The memory-resident pages of
each address space could be mapped exactly as other cached pages are accessed.
The current system uses the existing local filesystem structures and code to
implement the MFS. The major advantages of this approach are the sharing of
code and the simplicity of the approach. There are several disadvantages, however.
One is that the size of the filesystem is fixed at mount time. Thus, only a
fixed number of files and data blocks can be supported. Currently, this approach
requires enough swap space for the entire filesystem and prevents expansion and
contraction of the filesystem on demand. The current design also prevents the
filesystem from taking advantage of the memory-resident character of the filesystem.
For example, the current filesystem structure is optimized for magnetic
disks. It includes replicated control structures, cylinder groups with separate allocation
maps and control structures, and data structures that optimize rotational layout
of files. None of these optimizations are useful in a MFS (at least when the
backing store for the filesystem is allocated dynamically and is not contiguous on
a single disk type). Alternatively, directories could be implemented using dynamically
allocated memory organized as linked lists or trees, rather than as files stored
in disk blocks. Allocation and location of pages for file data might use virtual-memory
primitives and data structures, rather than direct and indirect blocks.
Exercises
8.1    What are the four classes of operations handled by the datastore filesystem?
8.2    Under what circumstances can a write request avoid reading a block from
       the disk?
8.3    What is the difference between a logical block and a physical block? Why
       is this distinction important?
8.4    Give two reasons why increasing the basic block size in the old filesystem
       from 512 bytes to 1024 bytes more than doubled the system's throughput.
8.5    Why is the per-cylinder group information placed at varying offsets from
       the beginning of the cylinder group?
8.6    How many blocks and fragments are allocated to a 31,200-byte file on a
       FFS with 4096-byte blocks and 1024-byte fragments? How many blocks
       and fragments are allocated to this file on a FFS with 4096-byte blocks and
       512-byte fragments? Also answer these two questions assuming that an
       inode had only six direct block pointers, instead of 12.
8.7    Explain why the FFS maintains a 5 to 10 percent reserve of free space.
       What problems would arise if the free-space reserve were set to zero?
8.8    What is a quadratic hash? Describe for what it is used in the FFS, and why
       it is used for that purpose.
8.9    Why are the allocation policies for inodes different from those for data
       blocks?
8.10   Under what circumstances does block clustering provide benefits that cannot
       be obtained with a disk-track cache?
8.11   What are the FFS performance bottlenecks that the LFS filesystem attempts
       to address?
8.12   Why does the LFS provide on-disk checksums for partial segments?
8.13   Why does the LFS segment writer require that no directory operations occur
       while it runs?
8.14   Which three FFS operations must be done synchronously to ensure that the
       filesystem can always be recovered deterministically after a crash (barring
       unrecoverable hardware errors)?
*8.15  What problems would arise if files had to be allocated in a single contiguous
       piece of the disk? Consider the problems created by multiple processes,
       random access, and files with holes.
*8.16  Construct an example of an LFS segment where cleaning would lose, rather
       than gain, free blocks.
**8.17 Inodes could be allocated dynamically as part of a directory entry. Instead,
       inodes are allocated statically when the filesystem is created. Why is the
       latter approach used?
**8.18 The no-overwrite policy of the LFS offers the ability to support new features
       such as unrm, which offers the ability to un-remove a file. What
       changes would have to be made to the system to support this feature?
**8.19 The LFS causes wild swings in the amount of memory used by the buffer
       cache and the filesystem, as compared to the FFS. What relationship should
       the LFS have with the virtual-memory subsystem to guarantee that this
       behavior does not cause deadlock?
The Network Filesystem
This chapter is divided into three main sections. The first gives a brief history of
remote filesystems. The second describes the client and server halves of NFS and
the mechanics of how they operate. The final section describes the techniques
needed to provide reasonable performance for remote filesystems in general, and
NFS in particular.
History and Overview
When networking first became widely available in 4.2BSD, users who wanted to
share files all had to log in across the net to a central machine on which the shared
files were located. These central machines quickly became far more loaded than
the user's local machine, so demand quickly grew for a convenient way to share
files on several machines at once. The most easily understood sharing model is
one that allows a server machine to export its filesystems to one or more client
machines. The clients can then import these filesystems and present them to the
user as though they were just another local filesystem.
Numerous remote-filesystem designs and protocols were proposed
and implemented. The implementations were attempted at all levels of the kernel.
Remote access at the top of the kernel resulted in semantics that nearly matched
the local filesystem, but had terrible performance. Remote access at the bottom of
the kernel resulted in awful semantics, but great performance. Modern systems
place the remote access in the middle of the kernel at the vnode layer. This level
gives reasonable performance and acceptable semantics.
An early remote filesystem, UNIX United, was implemented near the top of
the kernel at the system-call dispatch level. It checked for file descriptors representing
remote files and sent them off to the server. No caching was done on the
client machine. The lack of caching resulted in slow performance, but in
semantics nearly identical to a local filesystem. Because the current directory and
executing files are referenced internally by vnodes rather than by descriptors,
UNIX United did not allow users to change directory into a remote filesystem and
could not execute files from a remote filesystem without first copying the files to a
local filesystem.
At the opposite extreme was Sun Microsystems' network disk, implemented
near the bottom of the kernel at the device-driver level. Here, the client's entire
filesystem and buffering code was used. Just as in the local filesystem, recently
read blocks from the disk were stored in the buffer cache. Only when a file access
requested a block that was not already in the cache would the client send a request
for the needed physical disk block to the server. The performance was excellent
because the buffer cache serviced most of the file-access requests just as it does
for the local filesystem. Unfortunately, the semantics suffered because of inco­
herency between the client and server caches. Changes made on the server would
not be seen by the client, and vice versa. As a result, the network disk could be
used only by a single client or as a read-only filesystem.
The first remote filesystem shipped with System V was RFS [Rifkin et al,
1986]. Although it had excellent UNIX semantics, its performance was poor, so it
met with little use. Research at Carnegie-Mellon led to the Andrew filesystem
[Howard, 1988]. The Andrew filesystem was commercialized by Transarc and
eventually became part of the Distributed Computing Environment promulgated
by the Open Software Foundation, and was supported by many vendors. It is
designed to handle widely distributed servers and clients and also to work well
with mobile computers that operate while detached from the network for long
periods.
The most commercially successful and widely available remote-filesystem
protocol is the network filesystem (NFS) designed and implemented by Sun
Microsystems [Walsh et al, 1985; Sandberg et al, 1985]. There are two important
components to the success of NFS. First, Sun placed the protocol specification for
NFS in the public domain. Second, Sun sells its implementation to all people
who want it, for less than the cost of implementing it themselves. Thus, most vendors
chose to buy the Sun implementation. They are willing to buy from Sun
because they know that they can always legally write their own implementation if
the price of the Sun implementation is raised to an unreasonable level. The
4.4BSD implementation was written from the protocol specification, rather than
being incorporated from Sun, because of the developers' desire to be able to redistribute
it freely in source form.
NFS was designed as a client-server application. Its implementation is
divided into a client part that imports filesystems from other machines and a server
part that exports local filesystems to other machines. The general model is shown
in Fig. 9.1. Many goals went into the NFS design:
• The protocol is designed to be stateless. Because there is no state to maintain or
recover, NFS can continue to operate even during periods of client or server failures.
Thus, it is much more robust than a system that operates with state.
Figure 9.1 The division of NFS between client and server.
• NFS is designed to support UNIX filesystem semantics. However, its design also
allows it to support the possibly less rich semantics of other filesystem types,
such as MS-DOS.
• The protection and access controls follow the UNIX semantics of having the process
present a UID and set of groups that are checked against the file's owner,
group, and other access modes. The security check is done by filesystem-dependent
code that can do more or fewer checks based on the capabilities of the
filesystem that it is supporting. For example, the MS-DOS filesystem cannot
implement the full UNIX security validation and makes access decisions solely
based on the UID.
• The protocol design is transport independent. Although it was originally built
using the UDP datagram protocol, it was easily moved to the TCP stream protocol.
It has also been ported to run over numerous other non-IP-based protocols.
Some of the design decisions limit the set of applications for which NFS is appropriate:
• The design envisions clients and servers being connected on a locally fast network.
The NFS protocol does not work well over slow links or between clients
and servers with intervening gateways. It also works poorly for mobile computing
that has extended periods of disconnected operation.
• The caching model assumes that most files will not be shared. Performance suffers
when files are heavily shared.
• The stateless protocol requires some loss of traditional UNIX semantics. Filesystem
locking (flock) has to be implemented by a separate stateful daemon. Deferral
of the release of space in an unlinked file until the final process has closed the
file is approximated with a heuristic that sometimes fails.
Despite these limitations, NFS proliferated because it makes a reasonable
tradeoff between semantics and performance; its low cost of adoption has now
made it ubiquitous.
NFS Structure and Operation
NFS operates as a typical client-server application. The server receives
remote-procedure-call (RPC) requests from its various clients. An RPC operates much
like a local procedure call: The client makes a procedure call, then waits for the
result while the procedure executes. For a remote procedure call, the parameters
must be marshalled together into a message. Marshalling includes replacing
pointers by the data to which they point and converting binary data to the canonical
network byte order. The message is then sent to the server, where it is unmarshalled
(separated out into its original pieces) and processed as a local filesystem
operation. The result must be similarly marshalled and sent back to the client.
The client splits up the result and returns that result to the calling process as
though the result were being returned from a local procedure call [Birrell & Nelson,
1984]. The NFS protocol uses Sun's RPC and external data-representation
(XDR) protocols [Reid, 1987]. Although the kernel implementation is done by
hand to get maximum performance, the user-level daemons described later in this
section use Sun's public-domain RPC and XDR libraries.
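Marshalling can be illustrated with a short Python sketch that packs the arguments of a read-like call into a single message in network byte order and then splits the message apart again. The field layout and function names here are assumptions chosen for illustration; they follow the general XDR style of 4-byte big-endian integers and padded opaque data, but they are not the actual NFS wire format.

```python
import struct

def marshal_read_args(file_handle: bytes, offset: int, count: int) -> bytes:
    """Pack (handle, offset, count) into one network-byte-order message."""
    # The handle is sent by value, prefixed with its length and padded
    # to a 4-byte boundary; no pointer ever crosses the network.
    pad = (4 - len(file_handle) % 4) % 4
    return (struct.pack("!I", len(file_handle)) + file_handle + b"\0" * pad
            + struct.pack("!II", offset, count))

def unmarshal_read_args(message: bytes):
    """Split a message built by marshal_read_args back into its pieces."""
    (length,) = struct.unpack_from("!I", message, 0)
    file_handle = message[4:4 + length]
    pos = 4 + length + (4 - length % 4) % 4     # skip the padding
    offset, count = struct.unpack_from("!II", message, pos)
    return file_handle, offset, count

msg = marshal_read_args(b"handle-1", 8192, 4096)
print(msg.hex())
print(unmarshal_read_args(msg))
```

Because the "!" format forces big-endian encoding, the same bytes are produced on any host, regardless of its native byte order.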
The NFS protocol can run over any available stream- or datagram-oriented
protocol. Common choices are the TCP stream protocol and the UDP datagram
protocol. Each NFS RPC message may need to be broken into multiple packets to
be sent across the network. A big performance problem for NFS running under
UDP on an Ethernet is that the message may be broken into up to six packets; if
any of these packets are lost, the entire message is lost and must be resent. When
running under TCP on an Ethernet, the message may also be broken into up to six
packets; however, individual lost packets, rather than the entire message, can be
retransmitted. Section 9.3 discusses performance issues in greater detail.
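The cost of losing a single packet under UDP can be estimated with a little arithmetic. The sketch below assumes a 1500-byte Ethernet MTU, a 20-byte IP header, an 8-Kbyte transfer, and a hypothetical 150 bytes of RPC and UDP header overhead; the 1 percent packet-loss rate is likewise only an example, and losses are assumed to be independent.

```python
import math

def fragments(message_bytes, mtu=1500, ip_header=20):
    """Number of IP fragments needed to carry one message."""
    payload = mtu - ip_header          # data bytes carried per frame
    return math.ceil(message_bytes / payload)

def udp_message_loss(packet_loss, n_packets):
    """Probability that at least one of n_packets is lost."""
    return 1.0 - (1.0 - packet_loss) ** n_packets

# An 8-Kbyte write plus a hypothetical 150 bytes of header overhead.
n = fragments(8192 + 150)
print(n, "packets per message")
print(f"{udp_message_loss(0.01, n):.1%} of messages lost at 1% packet loss")
```

Under these assumptions, the message-loss rate is roughly six times the packet-loss rate, and each loss costs a retransmission of the whole message rather than of one packet.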
The set of RPC requests that a client can send to a server is shown in Table
9.1. After the server handles each request, it responds with the appropriate data,
or with an error code explaining why the request could not be done. As noted in
the table, most operations are idempotent. An idempotent operation is one that
can be repeated several times without the final result being changed or an error
being caused. For example, writing the same data to the same offset in a file is
idempotent because it will yield the same result whether it is done once or many
times. However, trying to remove the same file more than once is nonidempotent
because the file will no longer exist after the first try. Idempotency is an issue
when the server is slow, or when an RPC acknowledgment is lost and the client
retransmits the RPC request. The retransmitted RPC will cause the server to try to
do the same operation twice. For a nonidempotent request, such as a request to
remove a file, the retransmitted RPC, if undetected by the server's recent-request
cache [Juszczak, 1989], will cause a "no such file" error to be returned, because
the file will have been removed already by the first RPC. The user may be confused
by the error, because they will have successfully found and removed the file.
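The difference between the two kinds of request can be seen in a toy model in which a dictionary stands in for the server's files. The function and exception names are hypothetical; the point is only that a repeated write leaves the same result, whereas a repeated remove fails.

```python
class NoSuchFile(Exception):
    pass

def nfs_write(files, name, offset, data):
    """Idempotent: repeating the call leaves the file unchanged."""
    buf = bytearray(files[name])
    if len(buf) < offset:                  # extend with zeros if needed
        buf.extend(b"\0" * (offset - len(buf)))
    buf[offset:offset + len(data)] = data
    files[name] = bytes(buf)

def nfs_remove(files, name):
    """Nonidempotent: a second call fails because the file is gone."""
    if name not in files:
        raise NoSuchFile(name)
    del files[name]

files = {"a": b"hello world"}
nfs_write(files, "a", 6, b"there")
nfs_write(files, "a", 6, b"there")     # retransmission: same result
print(files["a"])

nfs_remove(files, "a")
try:
    nfs_remove(files, "a")             # retransmission: error
except NoSuchFile:
    print("no such file")
```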
Table 9.1 NFS, Version 2, RPC requests.

Action
    get file attributes
    set file attributes
    look up file name
    read from symbolic link
    read from file
    write to file
    create file
    remove file
    rename file
    create link to file
    create symbolic link
    create directory
    remove directory
    read from directory
    get filesystem attributes

Each file on the server can be identified by a unique file handle. A file handle
is the token by which clients refer to files on a server. Handles are globally unique
and are passed in operations, such as read and write, that reference a file. A file
handle is created by the server when a pathname-translation request (lookup) is
sent from a client to the server. The server must find the requested file or directory
and ensure that the requesting user has access permission. If permission is
granted, the server returns a file handle for the requested file to the client. The file
handle identifies the file in future access requests by the client. Servers are free to
build file handles from whatever information they find convenient. In the 4.4BSD
NFS implementation, the file handle is built from a filesystem identifier, an inode
number, and a generation number. The server creates a unique filesystem identifier
for each of its locally mounted filesystems. A generation number is assigned
to an inode each time that the latter is allocated to represent a new file. Each generation
number is used only once. Most NFS implementations use a random-number
generator to select a new generation number; the 4.4BSD implementation
selects a generation number that is approximately equal to the creation time of the
file. The purpose of the file handle is to provide the server with enough information
to find the file in future requests. The filesystem identifier and inode provide
a unique identifier for the inode to be accessed. The generation number verifies
that the inode still references the same file that it referenced when the file was first
accessed. The generation number detects when a file has been deleted, and a new
file is later created using the same inode. Although the new file has the same
filesystem identifier and inode number, it is a completely different file from the
one that the previous file handle referenced. Since the generation number is
included in the file handle, the generation number in a file handle for a previous
use of the inode will not match the new generation number in the same inode.
When an old-generation file handle is presented to the server by a client, the server
refuses to accept it, and instead returns the "stale file handle" error message.
The use of the generation number ensures that the file handle is time stable.
Distributed systems define a time-stable identifier as one that refers uniquely to
some entity both while that entity exists and for a long time after it is deleted. A
time-stable identifier allows a system to remember an identity across transient fail­
ures and allows the system to detect and report errors for attempts to access
deleted entities.
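The role of the generation number can be sketched with a small model of the server's inode table. The names below are hypothetical, and the sketch draws generation numbers from a simple counter, which is an assumption made for brevity; as noted above, real implementations use a random number or a value close to the file's creation time.

```python
from collections import namedtuple

FileHandle = namedtuple("FileHandle", "fsid inum gen")

class StaleFileHandle(Exception):
    pass

class HandleServer:
    def __init__(self, fsid):
        self.fsid = fsid
        self.inode_gen = {}        # inode number -> current generation
        self.next_gen = 1          # counter: an assumption for this sketch

    def allocate(self, inum):
        """Allocate an inode for a new file and return its handle."""
        self.inode_gen[inum] = self.next_gen   # each generation used once
        self.next_gen += 1
        return FileHandle(self.fsid, inum, self.inode_gen[inum])

    def free(self, inum):
        del self.inode_gen[inum]

    def check(self, fh):
        """Accept a handle only if its generation matches the inode's."""
        if fh.fsid != self.fsid or self.inode_gen.get(fh.inum) != fh.gen:
            raise StaleFileHandle(fh)
        return True

srv = HandleServer(fsid=7)
old = srv.allocate(inum=42)
srv.free(42)
new = srv.allocate(inum=42)    # same inode, different file
print(srv.check(new))
try:
    srv.check(old)
except StaleFileHandle:
    print("stale file handle")
```

The old and new handles agree on the filesystem identifier and inode number; only the generation number distinguishes them.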
The NFS Protocol
The NFS protocol is stateless. Being stateless means that the server does not need
to maintain any information about which clients it is serving or about the files that
they currently have open. Every RPC request that is received by the server is com­
pletely self-contained. The server does not need any additional information
beyond that contained in the RPC to fulfill the request. For example, a read
request will include the credential of the user doing the request, the file handle on
which the read is to be done, the offset in the file to begin the read, and the num­
ber of bytes to be read. This information allows the server to open the file, verify­
ing that the user has permission to read it, to seek to the appropriate point, to read
the desired contents, and to close the file. In practice, the server caches recently
accessed file data. However, if there is enough activity to push the file out of the
cache, the file handle provides the server with enough information to reopen the
file.
In addition to reducing the work needed to service incoming requests, the
server cache also detects retries of previously serviced requests. Occasionally, a
UDP client will send a request that is processed by the server, but the acknowledgment
returned by the server to the client is lost. Receiving no answer, the client
will time out and resend the request. The server will use its cache to recognize that
the retransmitted request has already been serviced. Thus, the server will not
repeat the operation, but will just resend the acknowledgment. To detect such
retransmissions properly, the server cache needs to be large enough to keep track
of at least the most recent few seconds of NFS requests.
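A recent-request cache of this kind can be sketched as a bounded table of saved replies. The class name, the capacity, and the use of a (client, transaction identifier) pair as the key are assumptions for illustration; an actual server would size the cache by time and request rate rather than by a fixed count.

```python
from collections import OrderedDict

class RecentRequestCache:
    def __init__(self, capacity=128):
        self.capacity = capacity
        self.replies = OrderedDict()     # (client, xid) -> saved reply

    def handle(self, client, xid, operation):
        """Run operation once per (client, xid); replay the reply on retries."""
        key = (client, xid)
        if key in self.replies:
            return self.replies[key]     # retransmission: do not redo
        reply = operation()
        self.replies[key] = reply
        if len(self.replies) > self.capacity:
            self.replies.popitem(last=False)   # drop the oldest entry
        return reply

files = {"tmp"}

def remove_tmp():
    if "tmp" not in files:
        return "no such file"
    files.remove("tmp")
    return "ok"

cache = RecentRequestCache()
print(cache.handle("clientA", 17, remove_tmp))
print(cache.handle("clientA", 17, remove_tmp))   # retry gets the same reply
```

If the cache is too small, the saved reply is evicted before the retry arrives, and the retried remove is executed again and fails.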
The benefit of the stateless protocol is that there is no need to do state recov­
ery after a client or server has crashed and rebooted, or after the network has been
partitioned and reconnected. Because each RPC is self-contained, the server can
simply begin servicing requests as soon as it begins running; it does not need to
know which files its clients have open. Indeed, it does not even need to know
which clients are currently using it as a server.
There are drawbacks to the stateless protocol. First, the semantics of the local
filesystem imply state. When files are unlinked, they continue to be accessible
until the last reference to them is closed. Because NFS knows neither which files
are open on clients nor when those files are closed, it cannot properly know when
to free file space. As a result, it always frees the space at the time of the unlink of
the last name to the file. Clients that want to preserve the freeing-on-last-close
semantics convert unlinks of open files to renames to obscure names on the
server. The names are of the form .nfsAxxxx4.4, where the xxxx is replaced with
the hexadecimal value of the process identifier, and the A is successively incremented
until an unused name is found. When the last close is done on the client,
the client sends an unlink of the obscure filename to the server. This heuristic
works for file access on only a single client; if one client has the file open and
another client removes the file, the file will still disappear from the first client at
the time of the remove. Other stateful semantics include the advisory locking
described in Section 7.5. The locking semantics cannot be handled by the NFS
protocol. On most systems, they are handled by a separate lock manager; the
4.4BSD version of NFS does not implement them at all.
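The selection of an obscure name can be sketched as follows. The function name is hypothetical, and two details are assumptions not stated above: that the process identifier is rendered as exactly four hexadecimal digits (and so is truncated to 16 bits), and that the search stops after the letter Z.

```python
def obscure_name(pid, existing):
    """Pick an unused name of the form .nfsAxxxx4.4 for an open, unlinked file."""
    for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        # Assumption: four hex digits, so the pid is masked to 16 bits.
        name = ".nfs%s%04x4.4" % (letter, pid & 0xFFFF)
        if name not in existing:
            return name
    raise FileExistsError("no unused obscure name")

directory = {"scratch", ".nfsA01f44.4"}
name = obscure_name(500, directory)
print(name)
```

The client would rename the open file to the returned name instead of unlinking it, and would send the real unlink only after the last local close.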
The second drawback of the stateless protocol is related to performance. For
version 2 of the NFS protocol, all operations that modify the filesystem must be
committed to stable storage before the RPC can be acknowledged. Most servers
do not have battery-backed memory; the stable store requirement means that all
written data must be on the disk before they can reply to the RPC. For a growing
file, an update may require up to three synchronous disk writes: one for the inode
to update its size, one for the indirect block to add a new data pointer, and one for
the new data themselves. Each synchronous write takes several milliseconds; this
delay severely restricts the write throughput for any given client file.
Version 3 of the NFS protocol eliminates some of the synchronous writes by
adding a new asynchronous write RPC request. When such a request is received
by the server, it is permitted to acknowledge the RPC without writing the new data
to stable storage. Typically, a client will do a series of asynchronous write
requests followed by a commit RPC request when it reaches the end of the file or it
runs out of buffer space to store the file. The commit RPC request causes the
server to write any unwritten parts of the file to stable store before acknowledging
the commit RPC. The server benefits by having to write the inode and indirect
blocks for the file only once per batch of asynchronous writes, instead of on every
write RPC request. The client benefits from having higher throughput for file
writes. The client does have the added overhead of having to save copies of all
asynchronously written buffers until a commit RPC is done, because the server
may crash before having written one or more of the asynchronous buffers to stable
store. When the client sends the commit RPC, the acknowledgment to that RPC
tells which of the asynchronous blocks were written to stable store. If any of the
asynchronous writes done by the client are missing, the client knows that the
server has crashed during the asynchronous-writing period, and resends the unac­
knowledged blocks. Once all the asynchronously written blocks have been
acknowledged, they can be dropped from the client cache.
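The exchange between asynchronous writes and the commit request can be modeled in a few lines. This sketch follows the description above rather than the exact wire protocol: the class and function names are hypothetical, blocks are identified by number only, and the commit reply is modeled as the set of block numbers that are on stable storage.

```python
class AsyncWriteServer:
    def __init__(self):
        self.unstable = {}     # block number -> data held only in memory
        self.stable = {}       # block number -> data on stable storage

    def write_async(self, block, data):
        self.unstable[block] = data        # acknowledged before reaching disk

    def crash(self):
        self.unstable.clear()              # memory contents are lost

    def commit(self):
        """Flush unwritten data; report which blocks are on stable storage."""
        self.stable.update(self.unstable)
        self.unstable.clear()
        return set(self.stable)

def client_flush(server, pending):
    """Commit, resending any blocks that the server did not acknowledge."""
    while pending:
        acked = server.commit()
        for block in [b for b in pending if b in acked]:
            del pending[block]             # safe to drop from client cache
        for block, data in pending.items():
            server.write_async(block, data)    # resend the missing blocks

srv = AsyncWriteServer()
pending = {0: b"aa", 1: b"bb"}             # client keeps copies until commit
for block, data in pending.items():
    srv.write_async(block, data)
srv.crash()                                # server reboots before the commit
client_flush(srv, pending)
print(sorted(srv.stable), pending)
```

The client's retained copies are what make the server crash harmless: the first commit acknowledges nothing, so both blocks are resent and then committed.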
The NFS protocol does not specify the granularity of the buffering that should
be used when files are written. Most implementations of NFS buffer files in
8-Kbyte blocks. Thus, if an application writes 10 bytes in the middle of a block,
the client reads the entire block from the server, modifies the requested 10 bytes,
and then writes the entire block back to the server. The 4.4BSD implementation
also uses 8-Kbyte buffers, but it keeps additional information that describes which
bytes in the buffer are modified. If an application writes 10 bytes in the middle of
a block, the client reads the entire block from the server, modifies the requested 10
bytes, but then writes back only the 10 modified bytes to the server. The block
read is necessary to ensure that, if the application later reads back other unmodi­
fied parts of the block, it will get valid data. Writing back only the modified data
has two benefits:
1. Fewer data are sent over the network, reducing contention for a scarce
   resource.
2. Nonoverlapping modifications to a file are not lost. If two different clients
simultaneously modify different parts of the same file block, both modifications
will show up in the file, since only the modified parts are sent to the
server. When clients send back entire blocks to the server, changes made by
the first client will be overwritten by data read before the first modification
was made, and then will be written back by the second client.
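The second benefit can be seen in a small simulation. The sketch below is only illustrative: the Server and CachedBlock classes are hypothetical, and it assumes that the client records a single contiguous modified range per buffer.

```python
BLOCK_SIZE = 8192


class Server:
    """Hypothetical server holding one 8-Kbyte file block."""

    def __init__(self):
        self.block = bytearray(BLOCK_SIZE)

    def read_block(self):
        return bytes(self.block)

    def write(self, offset, data):
        self.block[offset:offset + len(data)] = data


class CachedBlock:
    """Client copy of a block plus the byte range that has been modified."""

    def __init__(self, server):
        self.server = server
        # The whole block is read so that later reads of other bytes are valid.
        self.data = bytearray(server.read_block())
        self.dirty_start = self.dirty_end = None   # half-open range [start, end)

    def modify(self, offset, data):
        end = offset + len(data)
        self.data[offset:end] = data
        if self.dirty_start is None:
            self.dirty_start, self.dirty_end = offset, end
        else:
            self.dirty_start = min(self.dirty_start, offset)
            self.dirty_end = max(self.dirty_end, end)

    def flush_whole_block(self):
        self.server.write(0, bytes(self.data))

    def flush_dirty_range(self):
        if self.dirty_start is not None:
            chunk = bytes(self.data[self.dirty_start:self.dirty_end])
            self.server.write(self.dirty_start, chunk)
            self.dirty_start = self.dirty_end = None


def two_clients(flush):
    """Two clients read the block, modify different parts, then write back."""
    server = Server()
    a, b = CachedBlock(server), CachedBlock(server)
    a.modify(100, b"A" * 10)
    b.modify(5000, b"B" * 10)
    flush(a)
    flush(b)
    block = server.read_block()
    return block[100:110], block[5000:5010]


whole = two_clients(CachedBlock.flush_whole_block)    # first client's bytes lost
ranged = two_clients(CachedBlock.flush_dirty_range)   # both changes survive
```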
The 4.4BSD NFS Implementation
The NFS implementation that appears in 4.4BSD was written by Rick Macklem at
the University of Guelph using the specifications of the Version 2 protocol
published by Sun Microsystems [Sun Microsystems, 1989; Macklem, 1991]. This
NFS Version 2 implementation had several 4.4BSD-only extensions added to it; the
extended version became known as the Not Quite NFS (NQNFS) protocol
[Macklem, 1994a]. This protocol provides
• Sixty-four-bit file offsets and sizes
• An access RPC that provides server permission checking on file open, rather than
  having the client guess whether the server will allow access
• An append option on the write RPC
• Extended file attributes to support 4.4BSD filesystem functionality more fully
• A variant of short-term leases with delayed-write client caching that gives
  distributed cache consistency and improved performance [Gray & Cheriton, 1989]
Many of the NQNFS extensions were incorporated into the revised NFS Version 3
specification [Sun Microsystems, 1993; Pawlowski et al, 1994]. Others, such as
leases, are still available only with NQNFS. The NFS implementation distributed
in 4.4BSD supports clients and servers running the NFS Version 2, NFS Version 3,
or NQNFS protocol [Macklem, 1994b]. The NQNFS protocol is described in
Section 9.3.
The 4.4BSD client and server implementations of NFS are kernel resident.
NFS interfaces to the network with sockets using the kernel interface available
through sosend() and soreceive() (see Chapter 11 for a discussion of the socket
interface). There are connection-management routines for support of sockets
using connection-oriented protocols; there is timeout and retransmit support for
datagram sockets on the client side.
The less time-critical operations, such as mounting and unmounting, as well
as determination of which filesystems may be exported and to what set of clients
they may be exported, are managed by user-level system daemons. For the server
side to function, the portmap, mountd, and nfsd daemons must be running. The
portmap daemon acts as a registration service for programs that provide
RPC-based services. When an RPC daemon is started, it tells the portmap daemon
on what port number it is listening and what RPC services it is prepared to serve.
When a client wishes to make an RPC call to a given service, it will first contact
the portmap daemon on the server machine to determine the port number to
which RPC messages should be sent.
The interactions between the client and server daemons when a remote
filesystem is mounted are shown in Fig. 9.2. The mountd daemon handles two
important functions:
1. On startup and after a hangup signal, mountd reads the /etc/exports file and
creates a list of hosts and networks to which each local filesystem may be
exported. It passes this list into the kernel using the mount system call; the
kernel links the list to the associated local filesystem mount structure so that
the list is readily available for consultation when an NFS request is received.
2. Client mount requests are directed to the mountd daemon. After verifying
that the client has permission to mount the requested filesystem, mountd
returns a file handle for the requested mount point. This file handle is used by
the client for later traversal into the filesystem.
Figure 9.2 Daemon interaction when a remote filesystem is mounted. Step 1: The client's
mount process sends a message to the well-known port of the server's portmap daemon,
requesting the port address of the server's mountd daemon. Step 2: The server's portmap
daemon returns the port address of its server's mountd daemon. Step 3: The client's
mount process sends a request to the server's mountd daemon with the pathname of the
filesystem that it wants to mount. Step 4: The server's mountd daemon requests a file
handle for the desired mount point from its kernel. If the request is successful, the file handle
is returned to the client's mount process. Otherwise, the error from the file-handle request
is returned. If the request is successful, the client's mount process does a mount system
call, passing in the file handle that it received from the server's mountd daemon.
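The registration and mount steps can be sketched as a toy model. Everything named here is an assumption for illustration: real portmap registrations are keyed by RPC program and version numbers rather than by a service name, the port number is arbitrary, and the get_file_handle callback merely stands in for the request that mountd makes to its kernel.

```python
class Portmap:
    """Registry mapping an RPC service to the port on which it listens."""

    def __init__(self):
        self.ports = {}

    def register(self, service, port):
        self.ports[service] = port

    def lookup(self, service):
        return self.ports.get(service)       # None if nothing is registered


class Mountd:
    def __init__(self, exports, get_file_handle):
        self.exports = exports                   # path -> set of permitted hosts
        self.get_file_handle = get_file_handle   # stand-in for the kernel request

    def mount(self, client_host, path):
        if client_host not in self.exports.get(path, set()):
            raise PermissionError(f"{client_host} may not mount {path}")
        return self.get_file_handle(path)


def client_mount(portmap, daemons, client_host, path):
    port = portmap.lookup("mountd")          # steps 1 and 2: ask portmap
    if port is None:
        raise ConnectionError("mountd is not registered")
    mountd = daemons[port]                   # step 3: contact mountd on that port
    return mountd.mount(client_host, path)   # step 4: file handle or an error


portmap = Portmap()
mountd = Mountd({"/usr": {"clienta"}}, lambda path: ("fh", path))
portmap.register("mountd", 1023)             # arbitrary port for the example
daemons = {1023: mountd}
handle = client_mount(portmap, daemons, "clienta", "/usr")
```

A client that is not on the export list for the pathname receives an error instead of a file handle.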
The nfsd master daemon forks off children that enter the kernel using the
nfssvc system call. The children normally remain kernel resident, providing a
process context for the NFS RPC daemons. Typical systems run four to six nfsd
daemons. If nfsd is providing datagram service, it will create a datagram socket
when it is started. If nfsd is providing stream service, connected stream sockets
will be passed in by the master nfsd daemon in response to connection-oriented
connection requests from clients. When a request arrives on a datagram or stream
socket, there is an upcall from the socket layer that invokes the nfsrv_rcv()
routine. The nfsrv_rcv() call takes the message from the socket receive queue and
dispatches that message to an available nfsd daemon. The nfsd daemon verifies
the sender, and then passes the request to the appropriate local filesystem for
processing. When the result returns from the filesystem, it is returned to the
requesting client. The nfsd daemon is then ready to loop back and to service another
request. The maximum degree of concurrency on the server is determined by the
number of nfsd daemons that are started.
For connection-oriented transport protocols, such as TCP, there is one
connection for each client-to-server mount point. For datagram-oriented protocols, such
as UDP, the server creates a fixed number of incoming RPC sockets when it starts
its nfsd daemons; clients create one socket for each imported mount point. The
socket for a mount point is created by the mount command on the client, which
then uses it to communicate with the mountd daemon on the server. Once the
client-to-server connection is established, the daemon processes on a
connection-oriented protocol may do additional verification, such as Kerberos authentication.
Once the connection is created and verified, the socket is passed into the kernel. If
the connection breaks while the mount point is still active, the client will attempt a
reconnect with a new socket.
The client side can operate without any daemons running, but the system
administrator can improve performance by running several nfsiod daemons
(these daemons provide the same service as the Sun biod daemons). The purpose
of the nfsiod daemons is to do asynchronous read-aheads and write-behinds.
They are typically started when the kernel begins running multiuser. They enter
the kernel using the nfssvc system call, and they remain kernel resident, providing
a process context for the NFS RPC client side. In their absence, each read or write
of an NFS file that cannot be serviced from the local client cache must be done in
the context of the requesting process. The process sleeps while the RPC is sent to
the server, the RPC is handled by the server, and a reply is sent back. No
read-aheads are done, and write operations proceed at the disk-write speed of the
server. When present, the nfsiod daemons provide a separate context in which to
issue RPC requests to a server. When a file is written, the data are copied into the
buffer cache on the client. The buffer is then passed to a waiting nfsiod that does
the RPC to the server and awaits the reply. When the reply arrives, nfsiod updates
the local buffer to mark that buffer as written. Meanwhile, the process that did the
write can continue running. The Sun Microsystems reference port of the NFS
protocol flushes all the blocks of a file to the server when that file is closed. If all the
dirty blocks have been written to the server when a process closes a file that it has
been writing, it will not have to wait for them to be flushed. The NQNFS protocol
does not flush all the blocks of a file to the server when that file is closed.
When reading a file, the client first hands a read-ahead request to the nfsiod
that does the RPC to the server. It then looks up the buffer that it has been
requested to read. If the sought-after buffer is already in the cache because of a
previous read-ahead request, then it can proceed without waiting. Otherwise, it
must do an RPC to the server and wait for the reply. The interactions between the
client and server daemons when I/O is done are shown in Fig. 9.3.
Figure 9.3 Daemon interaction when I/O is done. Step 1: The client's process does a
write system call. Step 2: The data to be written are copied into a kernel buffer on the
client, and the write system call returns. Step 3: An nfsiod daemon awakens inside the
client's kernel, picks up the dirty buffer, and sends the buffer to the server. Step 4: The
incoming write request is delivered to the next available nfsd daemon running inside the
kernel on the server. The server's nfsd daemon writes the data to the appropriate local disk,
and waits for the disk I/O to complete. Step 5: After the I/O has completed, the server's
nfsd daemon sends back an acknowledgment of the I/O to the waiting nfsiod daemon on
the client. On receipt of the acknowledgment, the client's nfsiod daemon marks the buffer
as clean.
Client-Server Interactions
A local filesystem is unaffected by network service disruptions. It is always
available to the users on the machine unless there is a catastrophic event, such as a disk
or power failure. Since the entire machine hangs or crashes, the kernel does not
need to concern itself with how to handle the processes that were accessing the
filesystem. By contrast, the client end of a network filesystem must have ways to
handle processes that are accessing remote files when the client is still running,
but the server becomes unreachable or crashes. Each NFS mount point is provided
with three alternatives for dealing with server unavailability:
1. The default is a hard mount that will continue to try to contact the server
"forever" to complete the filesystem access. This type of mount is appropriate
when processes on the client that access files in the filesystem do not tolerate
I/O system calls that return transient errors. A hard mount is used for
processes for which access to the filesystem is critical for normal system
operation. It is also useful if the client has a long-running program that simply
wants to wait for the server to resume operation (e.g., after the server is taken
down to run dumps).
2. The other extreme is a soft mount that retries an RPC a specified number of
times, and then the corresponding system call returns with a transient error.
For a connection-oriented protocol, the actual RPC request is not retransmitted;
instead, NFS depends on the protocol retransmission to do the retries. If a
response is not returned within the specified time, the corresponding system
call returns with a transient error. The problem with this type of mount is that
most applications do not expect a transient error return from I/O system calls
(since they never occur on a local filesystem). Often, they will mistakenly
interpret the transient error as a permanent error, and will exit prematurely.
An additional problem is deciding how long to set the timeout period. If it is
set too low, error returns will start occurring whenever the NFS server is slow
because of heavy load. Alternatively, a large retry limit can result in a process
hung for a long time because of a crashed server or network partitioning.
3. Most system administrators take a middle ground by using an interruptible
mount that will wait forever like a hard mount, but checks to see whether a
termination signal is pending for any process that is waiting for a server
response. If a signal (such as an interrupt) is sent to a process waiting for an
NFS server, the corresponding I/O system call returns with a transient error.
Normally, the process is terminated by the signal. If the process chooses to
catch the signal, then it can decide how to handle the transient failure. This
mount option allows interactive programs to be aborted when a server fails,
while allowing long-running processes to await the server's return.
The original NFS implementation had only the first two options. Since neither of
these two options was ideal for interactive use of the filesystem, the third option
was developed as a compromise solution.
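The difference among the three alternatives comes down to when the client stops retrying. The sketch below is a simplified model under stated assumptions: the mode labels "hard", "soft", and "intr", the exception classes, and the signal_pending callback are invented for the example, and the timeout and backoff between retries are omitted.

```python
class ServerDown(Exception):
    """Raised by the hypothetical RPC stub when no reply arrives in time."""


class TransientError(Exception):
    """Stands in for the transient error returned by the system call."""


def nfs_request(send_rpc, mode, retries=3, signal_pending=lambda: False):
    attempts = 0
    while True:
        try:
            return send_rpc()
        except ServerDown:
            attempts += 1
        if mode == "soft" and attempts >= retries:
            raise TransientError("server not responding")
        if mode == "intr" and signal_pending():
            raise TransientError("interrupted while waiting for server")
        # "hard" mounts, and "intr" mounts with no signal pending, keep trying.


def flaky(failures):
    """Build an RPC stub that fails the given number of times, then succeeds."""
    state = {"left": failures}

    def send_rpc():
        if state["left"] > 0:
            state["left"] -= 1
            raise ServerDown()
        return "ok"

    return send_rpc


hard_result = nfs_request(flaky(5), "hard")   # waits out all five failures
```

With the same five failures, a soft mount limited to three retries raises TransientError, as does an interruptible mount once signal_pending reports a signal.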
RPC Transport Issues
The NFS Version 2 protocol runs over UDP/IP transport by sending each
request-reply message in a single UDP datagram. Since UDP does not guarantee datagram
delivery, a timer is started, and if a timeout occurs before the corresponding RPC
reply is received, the RPC request is retransmitted. At best, an extraneous RPC
request retransmit increases the load on the server; at worst, it can result in
damaged files on the server or spurious errors being returned to the client when
nonidempotent RPCs are redone. A recent-request cache normally is used on the
server to minimize the negative effect of redoing a duplicate RPC request
[Juszczak, 1989].
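A recent-request cache can be sketched as a bounded table of saved replies. This is an illustration only: keying entries by client and RPC transaction identifier (xid), the capacity, and the oldest-first eviction are assumptions of the sketch, not details taken from the cited work.

```python
from collections import OrderedDict


class RecentRequestCache:
    def __init__(self, handler, capacity=128):
        self.handler = handler
        self.capacity = capacity
        self.replies = OrderedDict()         # (client, xid) -> saved reply

    def handle(self, client, xid, request):
        key = (client, xid)
        if key in self.replies:
            return self.replies[key]         # retransmission: replay old reply
        reply = self.handler(request)        # first arrival: do the operation
        self.replies[key] = reply
        if len(self.replies) > self.capacity:
            self.replies.popitem(last=False) # evict the oldest entry
        return reply


files = {"a.txt"}


def remove(name):
    """A nonidempotent operation: redoing it returns a spurious error."""
    if name not in files:
        return "ENOENT"
    files.discard(name)
    return "OK"


cache = RecentRequestCache(remove)
first = cache.handle("clienta", 7, "a.txt")
again = cache.handle("clienta", 7, "a.txt")   # retransmitted request
redone = remove("a.txt")                      # what redoing the RPC would return
```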
The amount of time that the client waits before resending an RPC request is
called the round-trip timeout (RTT). Figuring out an appropriate value for the RTT
is difficult. The RTT value is for the entire RPC operation, including transmitting
the RPC message to the server, queuing at the server for an nfsd, doing any
required I/O operations, and sending the RPC reply message back to the client. It
can be highly variable for even a moderately loaded NFS server. As a result, the
RTT interval must be a conservative (large) estimate to avoid extraneous RPC
request retransmits. Adjusting the RTT interval dynamically and applying a
congestion window on outstanding requests has been shown to be of some help with
the retransmission problem [Nowicki, 1989].
On an Ethernet with the default 8-Kbyte read-write data size, the read-write
reply-request will be an 8+-Kbyte UDP datagram that normally must be broken
into at least six fragments at the IP layer for transmission. For IP fragments to be
reassembled successfully into the IP datagram at the receive end, all fragments
must be received at the destination. If even one fragment is lost or damaged in
transit, the entire RPC message must be retransmitted, and the entire RPC redone.
This problem can be exacerbated if the server is multiple hops away from the
client through routers or slow links. It can also be nearly fatal if the network
interface on the client or server cannot handle the reception of back-to-back network
packets [Kent & Mogul, 1987].
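The fragment count follows from simple arithmetic. The sketch below assumes a 1500-byte Ethernet MTU, a 20-byte IP header without options, and an 8-byte UDP header; the RPC and NFS headers are ignored, and the 1 percent fragment-loss rate is an illustrative figure rather than a measurement.

```python
import math

ETHERNET_MTU = 1500   # bytes of IP datagram carried per Ethernet frame
IP_HEADER = 20        # bytes, assuming no IP options
UDP_HEADER = 8        # bytes


def fragments(udp_payload):
    """Number of IP fragments needed to carry one UDP datagram."""
    # Every fragment except the last carries a multiple of 8 data bytes.
    per_fragment = (ETHERNET_MTU - IP_HEADER) // 8 * 8
    return math.ceil((UDP_HEADER + udp_payload) / per_fragment)


def datagram_loss(fragment_loss, nfrags):
    """Probability that at least one of nfrags fragments is lost."""
    return 1 - (1 - fragment_loss) ** nfrags


n = fragments(8192)              # 8-Kbyte data size, before RPC headers
loss = datagram_loss(0.01, n)    # chance that the whole RPC must be redone
```

Under these assumptions, a 1 percent loss rate per fragment becomes a loss rate of nearly 6 percent for the whole datagram, which is why fragment loss is so costly for NFS over UDP.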
An alternative to all this madness is to run NFS over TCP transport, instead of
over UDP. Since TCP provides reliable delivery with congestion control, it avoids
the problems associated with UDP. Because the retransmissions are done at the
TCP level, instead of at the RPC level, the only time that a duplicate RPC will be
sent to the server is when the server crashes or there is an extended network
partition that causes the TCP connection to break after an RPC has been received but
not acknowledged to the client. Here, the client will resend the RPC after the
server reboots, because it does not know that the RPC has been received.
The use of TCP also permits the use of read and write data sizes greater than
the 8-Kbyte limit for UDP transport. Using large data sizes allows TCP to use the
full duplex bandwidth of the network effectively, before being forced to stop and
wait for RPC response from the server. NFS over TCP usually delivers comparable
to significantly better performance than NFS over UDP, unless the client or server
processor is slow. For processors running at less than 10 million instructions per
second (MIPS), the extra CPU overhead of using TCP transport becomes significant.
The main problem with using TCP transport with Version 2 of NFS is that it is
supported between only BSD and a few other vendors' clients and servers.
However, the clear superiority demonstrated by the Version 2 BSD TCP implementation
of NFS convinced the group at Sun Microsystems implementing NFS Version 3 to
make TCP the default transport. Thus, a Version 3 Sun client will first try to
connect using TCP; only if the server refuses will it fall back to using UDP.
Security Issues
NFS is not secure because the protocol was not designed with security in mind.
Despite several attempts to fix security problems, NFS security is still limited.
Encryption is needed to build a secure protocol, but robust encryption cannot be
exported from the United States. So, even if building a secure protocol were
possible, doing so would be pointless, because all the file data are sent around the net
in clear text. Even if someone is unable to get your server to send them a sensitive
file, they can just wait until a legitimate user accesses it, and then can pick it up as
it goes by on the net.
NFS export control is at the granularity of local filesystems. Associated with
each local filesystem mount point is a list of the hosts to which that filesystem
may be exported. A local filesystem may be exported to a specific host, to all
hosts that match a subnet mask, or to all other hosts (the world). For each host or
group of hosts, the filesystem can be exported read-only or read-write. In
addition, a server may specify a set of subdirectories within the filesystem that may be
mounted. However, this list of mount points is enforced by only the mountd
daemon. If a malicious client wishes to do so, it can access any part of a filesystem
that is exported to it.
The final determination of exportability is made by the list maintained in the
kernel. So, even if a rogue client manages to snoop the net and to steal a file
handle for the mount point of a valid client, the kernel will refuse to accept the file
handle unless the client presenting that handle is on the kernel's export list. When
NFS is running with TCP, the check is done once when the connection is
established. When NFS is running with UDP, the check must be done for every RPC
request.
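The export check can be sketched with Python's standard ipaddress module. The ExportList class and its first-match rule are assumptions made for illustration, and the addresses are documentation-only examples; the kernel's actual data structure is not described in this section.

```python
import ipaddress


class ExportList:
    def __init__(self):
        self.entries = []          # (network, read_only), consulted in order

    def add(self, spec, read_only=False):
        # "192.0.2.7" names a single host, "198.51.100.0/24" a subnet,
        # and "0.0.0.0/0" every IPv4 host (the world).
        self.entries.append((ipaddress.ip_network(spec), read_only))

    def permits(self, client_addr, want_write):
        addr = ipaddress.ip_address(client_addr)
        for network, read_only in self.entries:
            if addr in network:
                return not (want_write and read_only)
        return False               # not on the list: refuse the file handle


exports = ExportList()
exports.add("192.0.2.7")                          # one host, read-write
exports.add("198.51.100.0/24", read_only=True)    # a subnet, read-only
```

With TCP, a check such as permits() would run once per connection; with UDP, it would have to run for every request.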
The NFS server also permits limited remapping of user credentials. Typically,
the credential for the superuser is not trusted and is remapped to the low-privilege
user "nobody." The credentials of all other users can be accepted as given or also
mapped to a default user (typically "nobody"). Use of the client UID and GID list
unchanged on the server implies that the UID and GID space are common between
the client and server (i.e., UID N on the client must refer to the same user on the
server). The system administrator can support more complex UID and GID
mappings by using the umapfs filesystem described in Section 6.7.
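The remapping rule is small enough to state as a function. In this sketch the numeric IDs for "nobody" are placeholders, and the map_root and map_all flags are hypothetical names for the two policies described above.

```python
NOBODY_UID = 32767   # placeholder; the real ID of "nobody" is system-specific
NOBODY_GID = 32767   # placeholder


def map_credential(uid, gids, map_root=True, map_all=False):
    """Return the (uid, gids) that the server will use for a request."""
    if map_all or (map_root and uid == 0):
        return NOBODY_UID, [NOBODY_GID]
    return uid, list(gids)       # accepted as given: UID spaces must match


root_mapped = map_credential(0, [0])
user_kept = map_credential(1001, [20, 5])
```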
The system administrator can increase security by using Kerberos credentials,
instead of accepting arbitrary user credentials sent without encryption by clients of
unknown trustworthiness [Steiner et al, 1988]. When a new user on a client wants
to begin accessing files in an NFS filesystem that is exported using Kerberos, the
client must provide a Kerberos ticket to authenticate the user on the server. If
successful, the system looks up the Kerberos principal in the server's password and
group databases to get a set of credentials, and passes in to the server nfsd a local
translation of the client UID to these credentials. The nfsd daemons run entirely
within the kernel except when a Kerberos ticket is received. To avoid putting all
the Kerberos authentication into the kernel, the nfsd returns from the kernel
temporarily to verify the ticket using the Kerberos libraries, and then returns to the
kernel with the results.
The NFS implementation with Kerberos uses encrypted timestamps to avert
replay attempts. Each RPC request includes a timestamp that is encrypted by the
client and decrypted by the server using a session key that has been exchanged as
part of the initial Kerberos authentication. Each timestamp can be used only once,
and must be within a few minutes of the current time recorded by the server. This
implementation requires that the client and server clocks be kept within a few
minutes of synchronization (this requirement is already imposed to run Kerberos).
It also requires that the server keep copies of all timestamps that it has received
that are within the time range that it will accept, so that it can verify that a
timestamp is not being reused. Alternatively, the server can require that timestamps
from each of its clients be monotonically increasing. However, this algorithm will
cause RPC requests that arrive out of order to be rejected. The mechanism of
using Kerberos for authentication of NFS requests is not well defined, and the
4.4BSD implementation has not been tested for interoperability with other
vendors. Thus, Kerberos can be used only between 4.4BSD clients and servers.
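The first replay check (remember every timestamp inside the acceptance window) can be sketched as follows. The ReplayGuard class and the 300-second window are assumptions of the example, and decryption with the session key is omitted; the timestamp is taken to be already decrypted.

```python
class ReplayGuard:
    def __init__(self, window=300.0):
        self.window = window     # seconds of clock difference tolerated
        self.seen = {}           # (client, timestamp) -> timestamp

    def accept(self, client, timestamp, now):
        if abs(now - timestamp) > self.window:
            return False         # too old, or too far in the future
        # Forget timestamps that have aged out of the acceptance window;
        # a replay of one of those is rejected by the range check above.
        self.seen = {k: t for k, t in self.seen.items()
                     if now - t <= self.window}
        if (client, timestamp) in self.seen:
            return False         # timestamp reused: replayed request
        self.seen[(client, timestamp)] = timestamp
        return True


guard = ReplayGuard(window=300.0)
fresh = guard.accept("clienta", 1000.0, now=1010.0)
replayed = guard.accept("clienta", 1000.0, now=1011.0)
stale = guard.accept("clienta", 1000.0, now=2000.0)
```

Because only timestamps inside the window are retained, the memory needed is bounded by the request rate multiplied by the window length.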
9.3 Techniques for Improving Performance
Remote filesystems provide a challenging performance problem: Providing a
coherent networkwide view of the data and delivering those data quickly are often
conflicting goals. The server can maintain coherency easily by keeping a single
repository for the data and sending them out to each client when the clients need
them; this approach tends to be slow, because every data access requires the client
to wait for an RPC round-trip time. The delay is further aggravated by the huge
load that it puts on a server that must service every I/O request from its clients. To
increase performance and to reduce server load, remote filesystem protocols
attempt to cache frequently used data on the clients themselves. If the cache is
designed properly, the client will be able to satisfy many of the client's I/O
requests directly from the cache. Doing such accesses is faster than
communicating with the server, reducing latency on the client and load on the server and
network. The hard part of client caching is keeping the caches coherent, that is,
ensuring that each client quickly replaces any cached data that are modified by
writes done on other clients. If a first client writes a file that is later read by a
second client, the second client wants to see the data written by the first client, rather
than the stale data that were in the file previously. There are two main ways that
the stale data may be read accidentally:
1. If the second client has stale data sitting in its cache, the client may use those
data because it does not know that newer data are available.
2. The first client may have new data sitting in its cache, but may not yet have
written those data back to the server. Here, even if the second client asks the
server for up-to-date data, the server may return the stale data because it does
not know that one of its clients has a newer version of the file in that client's
cache.
The second of these problems is related to the way that client writing is done.
Synchronous writing requires that all writes be pushed through to the server
during the write system call. This approach is the most consistent, because the server
always has the most recently written data. It also permits any write errors, such as
"filesystem out of space," to be propagated back to the client process via the write
system-call return. With an NFS filesystem using synchronous writing, error
returns most closely parallel those from a local filesystem. Unfortunately, this
approach restricts the client to only one write per RPC round-trip time.
An alternative to synchronous writing is delayed writing, where the write
system call returns as soon as the data are cached on the client; the data are written to
the server sometime later. This approach permits client writing to occur at the rate
of local storage access up to the size of the local cache. Also, for cases where file
truncation or deletion occurs shortly after writing, the write to the server may be
avoided entirely, because the data have already been deleted. Avoiding the data
push saves the client time and reduces load on the server.
There are some drawbacks to delayed writing. To provide full consistency,
the server must notify the client when another client wants to read or write the file,
so that the delayed writes can be written back to the server. There are also
problems with the propagation of errors back to the client process that issued the write
system call. For example, a semantic change is introduced by delayed-write
caching when the file server is full. Here, delayed-write RPC requests can fail
with an "out of space" error. If the data are sent back to the server when the file
is closed, the error can be detected if the application checks the return value from
the close system call. For delayed writes, written data may not be sent back to the
server until after the process that did the write has exited, long after it can be
notified of any errors. The only solution is to modify programs writing an
important file to do an fsync system call and to check for an error return from that call,
instead of depending on getting errors from write or close. Finally, there is a risk
of the loss of recently written data if the client crashes before the data are written
back to the server.
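A program that follows this advice might look like the sketch below, which uses Python's os module wrappers for the write, fsync, and close system calls. The function name and the file contents are hypothetical; the point is only that fsync is called, and its error checked, before the program treats the data as saved.

```python
import os
import tempfile


def write_important_file(path, data):
    """Write data and surface any deferred write error before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            n = os.write(fd, view)   # may only place the data in a local cache
            view = view[n:]
        os.fsync(fd)                 # raises OSError (e.g., ENOSPC) on failure
    finally:
        os.close(fd)


path = os.path.join(tempfile.mkdtemp(), "important.dat")
write_important_file(path, b"payroll records")
```

In Python, a failed fsync appears as an OSError exception rather than as a return value, so the caller handles the error with try and except instead of testing a result code.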
A compromise between synchronous writing and delayed writing is
asynchronous writing. The write to the server is started during the write system call,
but the write system call returns before the write completes. This approach
minimizes the risk of data loss because of a client crash, but negates the possibility of
reducing server write load by discarding writes when a file is truncated or deleted.
The simplest mechanism for maintaining full cache consistency is the one
used by Sprite that disables all client caching of the file whenever concurrent write
sharing might occur [Nelson et al, 1988]. Since NFS has no way of knowing when
write sharing might occur, it tries to bound the period of inconsistency by writing
the data back when a file is closed. Files that are open for long periods are written
back at 30-second intervals when the filesystem is synchronized. Thus, the NFS
implementation does a mix of asynchronous and delayed writing, but always
pushes all writes to the server on close. Pushing the delayed writes on close
negates much of the performance advantage of delayed writing, because the delays
that were avoided in the write system calls are observed in the close system call.
With this approach, the server is always aware of all changes made by its clients
with a maximum delay of 30 seconds and usually sooner, because most files are
open only briefly for writing.
The server maintains read consistency by always having a client verify the
contents of its cache before using that cache. When a client reads data, it first
checks for the data in its cache. Each cache entry is stamped with an attribute that
shows the most recent time that the server says that the data were modified. If the
data are found in the cache, the client sends a timestamp RPC request to its server
to find out when the data were last modified. If the modification time returned by
the server matches that associated with the cache, the client uses the data in its
cache; otherwise, it arranges to replace the data in its cache with the new data.
The problem with checking with the server on every cache access is that the
client still experiences an RPC round-trip delay for each file access, and the server
is still inundated with RPC requests, although they are considerably quicker to
handle than are full I/O operations. To reduce this client latency and server load,
most NFS implementations track how recently the server has been asked about
each cache block. The client then uses a tunable parameter that is typically set at
a few seconds to delay asking the server about a cache block. If an I/O request
finds a cache block and the server has been asked about the validity of that block
within the delay period, the client does not ask the server again, but rather just
uses the block. Because certain blocks are used many times in succession, the
server will be asked about them only once, rather than on every access. For
example, the directory block for the /usr/include directory will be accessed once for
each #include in a source file that is being compiled. The drawback to this
approach is that changes made by other clients may not be noticed for up to the
delay number of seconds.
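The validity check and its delay parameter can be sketched as follows. The AttrCachedFile class, its two RPC callbacks, and the 3-second delay are assumptions for illustration; time is passed in explicitly so that the behavior is easy to follow.

```python
class AttrCachedFile:
    def __init__(self, get_mtime, fetch_data, delay=3.0):
        self.get_mtime = get_mtime     # hypothetical timestamp RPC
        self.fetch_data = fetch_data   # hypothetical read RPC
        self.delay = delay             # tunable: seconds between server checks
        self.data = None
        self.mtime = None
        self.checked_at = None

    def read(self, now):
        recently_checked = (self.checked_at is not None
                            and now - self.checked_at < self.delay)
        if not recently_checked:
            mtime = self.get_mtime()           # ask the server
            self.checked_at = now
            if mtime != self.mtime:            # server copy changed
                self.data = self.fetch_data()  # replace the cached data
                self.mtime = mtime
        return self.data


server = {"mtime": 1, "data": b"old", "rpcs": 0}


def get_mtime():
    server["rpcs"] += 1
    return server["mtime"]


f = AttrCachedFile(get_mtime, lambda: server["data"])
r1 = f.read(now=0.0)
server.update(mtime=2, data=b"new")   # another client modifies the file
r2 = f.read(now=1.0)                  # within the delay: no RPC, stale data
r3 = f.read(now=3.5)                  # delay has passed: change is noticed
```

The second read shows the drawback: it is served from the cache without asking the server, so it returns data that another client has already replaced.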
A more consistent approach used by some network filesystems is to use a
callback scheme where the server keeps track of all the files that each of its clients
has cached. When a cached file is modified, the server notifies the clients holding
that file so that they can purge it from their cache. This algorithm dramatically
reduces the number of queries from the client to the server, with the effect of
decreasing client I/O latency and server load [Howard et al, 1988]. The drawback
is that this approach introduces state into the server because the server must
remember the clients that it is serving and the set of files that they have cached. If
the server crashes, it must rebuild this state before it can begin running again.
Rebuilding the server state is a significant problem when everything is running
properly; it gets even more complicated and time consuming when it is aggravated
by network partitions that prevent the server from communicating with some of its
clients [Mogul, 1993].
The 4.4BSD NFS implementation uses asynchronous writes while a file is
open, but synchronously waits for all data to be written when the file is closed.
This approach gains the speed benefit of writing asynchronously, yet ensures that
any delayed errors will be reported no later than the point at which the file is
closed. The implementation will query the server about the attributes of a file at
most once every 3 seconds. This 3-second period reduces network traffic for files
accessed frequently, yet ensures that any changes to a file are detected with no
more than a 3-second delay. Although these heuristics provide tolerable
semantics, they are noticeably imperfect. More consistent semantics at lower cost are
available with the NQNFS lease protocol described in the next section.
The NQNFS protocol is designed to maintain full cache consistency between
clients in a crash-tolerant manner. It is an adaptation of the NFS protocol such that
the server supports both NFS and NQNFS clients while maintaining full consis­
tency between the server and NQNFS clients. The protocol maintains cache con­
sistency by using short-term leases instead of hard-state information about open
files [Gray & Cheriton, 1 989] . A lease is a ticket permitting an activity that is
valid until some expiration time. As long as a client holds a valid lease, it knows
that the server will give it a callback if the file status changes. Once the lease has
expired, the client must contact the server if it wants to use the cached data.
Leases are issued using time intervals rather than absolute times to avoid the
requirement of time-of-day clock synchronization. There are three important time
constants known to the server. The maximum_lease_term sets an upper bound on
lease duration-typically, 30 seconds to 1 minute. The clock_skew is added to all
lease terms on the server to correct for differing clock speeds between the client
and server. The write_slack is the number of seconds that the server is willing to
wait for a client with an expired write-caching lease to push dirty writes.
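A minimal sketch of how interval-based leases sidestep clock synchronization follows. The function names and the numeric values of CLOCK_SKEW and WRITE_SLACK are assumptions made for illustration, as is the choice to start the client's interval at the moment the request was sent; only the roles of the three constants come from the text.

```python
MAX_LEASE_TERM = 30.0  # seconds; the text gives 30 to 60 as typical
CLOCK_SKEW = 3.0       # assumed value for this sketch
WRITE_SLACK = 5.0      # assumed value for this sketch


def client_expiry(request_sent_at, granted_term):
    """Expiry on the client's own clock.

    Counting from the moment the request was sent (an assumption of this
    sketch) errs on the side of expiring early.
    """
    return request_sent_at + granted_term


def server_expiry(granted_at, requested_term, write_caching=False):
    """Return (term granted to the client, expiry on the server's clock)."""
    term = min(requested_term, MAX_LEASE_TERM)
    expiry = granted_at + term + CLOCK_SKEW  # allow for differing clock speeds
    if write_caching:
        expiry += WRITE_SLACK  # extra time for dirty data to arrive
    return term, expiry
```

Each side measures only an interval on its own clock, so the two time-of-day clocks never need to agree; the server simply holds its record somewhat longer than the client believes the lease to last.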
Contacting the server after the lease has expired is similar to the NFS technique
for reducing server load by checking the validity of data only every few seconds.
The main difference is that the server tracks its clients' cached files, so
there are never periods of time when the client is using stale data. Thus, the time
used for leases can be considerably longer than the few seconds that clients are
willing to tolerate possibly stale data. The effect of this longer lease time is to
reduce the number of server calls almost to the level found in a full callback
implementation such as the Andrew Filesystem [Howard et al, 1988]. Unlike the
callback mechanism, state recovery with leases is trivial. The server needs only to
wait for the lease's expiration time to pass, and then to resume operation. Once all
the leases have expired, the clients will always communicate with the server
before using any of their cached data. The lease expiration time is usually shorter
than the time it takes most servers to reboot, so the server can effectively resume
operation as soon as it is running. If the machine does manage to reboot more
quickly than the lease expiration time, then it must wait until all leases have
expired before resuming operation.
An additional benefit of using leases rather than hard state information is that
leases use much less server memory. If each piece of state requires 64 bytes, a
large server with hundreds of clients and a peak throughput of 2000 RPC requests
per second will typically only use a few hundred Kbyte of memory for leases, with
a worst case of about 3 Mbyte. Even if a server has exhausted lease storage, it can
simply wait a few seconds for a lease to expire and free up a record. By contrast,
a server with hard state must store records for all files currently open by all clients.
The memory requirements are 3 to 12 Mbyte of memory per 100 clients served.
Whenever a client wishes to cache data for a file, it must hold a valid lease.
There are three types of leases: noncaching, read caching, and write caching. A
noncaching lease requires that all file operations be done synchronously with the
server. A read-caching lease allows for client data caching, but no file modifications
may be done. A write-caching lease allows for client caching of writes for
the period of the lease. If a client has cached write data that are not yet written to
the server when a write-cache lease has almost expired, it will attempt to extend
the lease. If the extension fails, the client is required to push the written data.
If all the clients of a file are reading it, they will all be granted a read-caching
lease. A read-caching lease allows one or more clients to cache data, but they may
not make any modifications to the data. Figure 9.4 shows a typical read-caching
scenario. The vertical solid black lines depict the lease records. Note that the
time lines are not drawn to scale, since a client-server interaction will normally
take less than 100 milliseconds, whereas the normal lease duration is 30 seconds.
Figure 9.4
Read-caching leases. Solid vertical lines represent valid leases.
[Time-line diagram for client A and client B omitted.]
Every lease includes the time that the file was last modified. The client can use
this timestamp to ensure that its cached data are still current. Initially, client A
gets a read-caching lease for the file. Later, client A renews that lease and uses it
to verify that the data in its cache are still valid. Concurrently, client B is able to
obtain a read-caching lease for the same file.
If a single client wants to write a file and there are no readers of that file, the
client will be issued a write-caching lease. A write-caching lease permits delayed
write caching, but requires that all data be pushed to the server when the lease
expires or is terminated by an eviction notice. When a write-caching lease has
almost expired, the client will attempt to extend the lease if the file is still open, but
is required to push the delayed writes to the server if renewal fails (see Fig. 9.5).
The writes may not arrive at the server until after the write lease has expired on the
client. A consistency problem is avoided because the server keeps its write lease
valid for write_slack seconds longer than the time given in the lease issued to the
client. In addition, writes to the file by the lease-holding client cause the lease
expiration time to be extended to at least write_slack seconds. This write_slack
period is conservatively estimated as the extra time that the client will need to write
back any written data that it has cached. If the value selected for write_slack is too
short, a write RPC may arrive after the write lease has expired on the server.
Although this write RPC will result in another client seeing an inconsistency, that
inconsistency is no more problematic than the semantics that NFS normally provides.
Figure 9.5
Write-caching lease. Solid vertical lines represent valid leases.
[Time-line diagram for client B omitted.]
The server is responsible for maintaining consistency among the NQNFS
clients by disabling client caching whenever a server file operation would cause
inconsistencies. The possibility of inconsistencies occurs whenever a client has a
write-caching lease and any other client or a local operation on the server tries to
access the file, or when a modify operation is attempted on a file being read
cached by clients. If one of these conditions occurs, then all clients will be issued
noncaching leases. With a noncaching lease, all reads and writes will be done
through the server, so clients will always get the most recent data. Figure 9.6
shows how read and write leases are replaced by a noncaching lease when there is
the potential for write sharing. Initially, the file is read by client A. Later, it is
written by client B. While client B is still writing, client A issues another read
request. Here, the server sends an "eviction notice" message to client B, and then
waits for lease termination. Client B writes back its dirty data, then sends a
"vacated" message. Finally, the server issues noncaching leases to both clients.
In general, lease termination occurs when a "vacated" message has been received
from all the clients that have signed the lease or when the lease has expired. The
server does not wait for a reply for the message pair "eviction notice" and
Figure 9.6
Write-sharing leases. Solid vertical lines represent valid leases.
[Time-line diagram for client A and client B omitted.]
"vacated," as it does for all other RPC messages; they are sent asynchronously to
avoid the server waiting indefinitely for a reply from a dead client.
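The server's choice among the three lease types can be sketched as a small decision function. This is a simplified model under stated assumptions: the names Lease and grant are hypothetical, lease expiry is not modeled, and local operations on the server are treated as just another client.

```python
from enum import Enum


class Lease(Enum):
    NONCACHING = "noncaching"
    READ_CACHING = "read caching"
    WRITE_CACHING = "write caching"


def grant(current, client, wants_write):
    """Decide the lease for `client` on one file.

    `current` maps each client that holds a valid lease to its Lease.
    Returns (lease for client, set of clients that must first be sent
    an eviction notice).
    """
    others = {c: lease for c, lease in current.items() if c != client}
    if not others:
        # Sole user of the file: caching is safe.
        return (Lease.WRITE_CACHING if wants_write else Lease.READ_CACHING), set()
    if any(lease is Lease.NONCACHING for lease in others.values()):
        return Lease.NONCACHING, set()  # write sharing was already detected
    only_readers = all(lease is Lease.READ_CACHING for lease in others.values())
    if only_readers and not wants_write:
        return Lease.READ_CACHING, set()  # shared read caching
    # Potential write sharing: caching holders must vacate, and the
    # requester gets a noncaching lease.
    return Lease.NONCACHING, set(others)
```

For the scenario of Figure 9.6, grant({"B": Lease.WRITE_CACHING}, "A", False) selects a noncaching lease for client A and names client B as the holder to evict.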
A client gets leases either by doing a specific lease RPC or by including a
lease request with another RPC. Most NQNFS RPC requests allow a lease request
to be added to them. Combining lease requests with other RPC requests minimizes
the amount of extra network traffic. A typical combination can be done
when a file is opened. The client must do an RPC to get the handle for the file to
be opened. It can combine the lease request, because it knows at the time of the
open whether it will need a read or a write lease. All leases are at the granularity
of a file, because all NFS RPC requests operate on individual files, and NFS has no
intrinsic notion of a file hierarchy. Directories, symbolic links, and file attributes
may be read cached but are not write cached. The exception is the file-size
attribute that is updated during cached writing on the client to reflect a growing
file. Leases have the advantage that they are typically required only at times when
other I/O operations occur. Thus, lease requests can almost always be piggybacked
on other RPC requests, avoiding some of the overhead associated with the
explicit open and close RPC required by a long-term callback implementation.
The server handles operations from local processes and from remote clients
that are not using the NQNFS protocol by issuing short-term leases for the duration
of each file operation or RPC. For example, a request to create a new file will get a
short-term write lease on the directory in which the file is being created. Before
that write lease is issued, the server will vacate the read leases of all the NQNFS
clients that have cached data for that directory. Because the server gets leases for
all non-NQNFS activity, consistency is maintained between the server and NQNFS
clients, even when local or NFS clients are modifying the filesystem. The NFS
clients will continue to be no more or less consistent with the server than they
were without leases.
Crash Recovery
The server must maintain the state of all the current leases held by its clients. The
benefit of using short-term leases is that, maximum_lease_term seconds after the
server stops issuing leases, it knows that there are no current leases left. As such,
server crash recovery does not require any state recovery. After rebooting, the
server simply refuses to service any RPC requests except for writes (predominantly
from clients that previously held write leases) until write_slack seconds
after the final lease would have expired. For machines that cannot calculate the
time that they crashed, the final-lease expiration time can be estimated safely as
boot_time + maximum_lease_term + write_slack + clock_skew
Here, boot_time is the time that the kernel began running after the kernel was
booted. With a maximum_lease_term of 30 to 60 seconds, and clock_skew and
write_slack at most a few seconds, this delay amounts to about 1 minute, which
for most systems is taken up with the server rebooting process. When this time
has passed, the server will have no outstanding leases. The clients will have had at
least write_slack seconds to get written data to the server, so the server should be
up to date. After this, the server resumes normal operation.
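The recovery rule can be written out directly. In this sketch the function names are hypothetical and times are plain numbers in seconds; the formula itself is the one given above.

```python
def safe_resume_time(boot_time, maximum_lease_term, write_slack, clock_skew):
    """Earliest time at which no lease issued before the crash can be valid."""
    return boot_time + maximum_lease_term + write_slack + clock_skew


def may_service(rpc_is_write, now, resume_time):
    """During the recovery period, only write RPCs are serviced."""
    return rpc_is_write or now >= resume_time
```

With a 30-second maximum_lease_term, a 5-second write_slack, and a 3-second clock_skew (the last two values are assumed), a server that began running at time 1000 would accept only writes until time 1038.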
There is another failure condition that can occur when the server is congested.
In the worst-case scenario, the client pushes dirty writes to the server, but a large
request queue on the server delays these writes for more than write_slack seconds.
In an effort to minimize the effect of these recovery storms, the server replies "try
again later" to the RPC requests that it is not yet ready to service [Baker & Ousterhout,
1991]. The server takes two steps to ensure that all clients have been able to
write back their written data. First, a write-caching lease is terminated on the
server only when there have been no writes to the file during the previous
write_slack seconds. Second, the server will not accept any requests other than
writes until it has not been overloaded during the previous write_slack seconds. A
server is considered overloaded when there are pending RPC requests and all its
nfsd processes are busy.
Another problem that is solved by short-term leases is how to handle a
crashed or partitioned client that holds a lease that the server wishes to vacate.
The server detects this problem when it needs to vacate a lease so that it can issue
a lease to a second client, and the first client holding the lease fails to respond to
the vacate request. Here, the server can simply wait for the first client's lease to
expire before issuing the new one to the second client. When the first client
reboots or gets reconnected to the server, it simply reacquires any leases it now
needs. If a client-to-server network connection is severed just before a write-caching
lease expires, the client cannot push the dirty writes to the server. Other
clients that can contact the server will continue to be able to access the file and
will see the old data. Since the write-caching lease has expired on the client, the
client will synchronize with the server as soon as the network connection has been
re-established. This delay can be avoided with a write-through policy.
A detailed comparison of the effects of leases on performance is given in
[Macklem, 1994a]. Briefly, leases are most helpful when a server or network is
loaded heavily. Here, leases allow up to 30 to 50 percent more clients to use a net­
work and server before beginning to experience a level of congestion equal to
what they would on a network and server that were not using leases. In addition,
leases provide better consistency and lower latency for clients, independent of the
load. Although leases are new enough that they are not widely used in commercial
implementations of NFS today, leases or a similar mechanism will need to be
added to commercial versions of NFS if NFS is to be able to compete effectively
against other remote filesystems, such as Andrew.
Exercises
9.1  Describe the functions done by an NFS client.
9.2  Describe the functions done by an NFS server.
9.3  Describe three benefits that NFS derives from being stateless.
9.4  Give two reasons why TCP is a better protocol to use than is UDP for
     handling the NFS RPC protocol.
9.5  Describe the contents of a file handle in 4.4BSD. How is a file handle used?
9.6  When is a new generation number assigned to a file? What purpose does
     the generation number serve?
9.7  Describe the three ways that an NFS client can handle filesystem-access
     attempts when its server crashes or otherwise becomes unreachable.
9.8  Give two reasons why leases are given a limited lifetime.
9.9  What is a callback? When is it used?
9.10  A server may issue three types of leases: noncaching, read caching, and
     write caching. Describe what a client can do with each of these leases.
9.11  Describe how an NQNFS server recovers after a crash.
*9.12  Suppose that there is a client that supports both versions 2 and 3 of the NFS
     protocol running on both the TCP and UDP protocols, but a server that
     supports only version 2 of NFS running on UDP. Show the protocol negotiation
     between the client and server, assuming that the client prefers to run
     using version 3 of NFS using TCP.
**9.13  Assume that leases have an unlimited lifetime. Design a system for
     recovering the lease state after a client or server crash.
References
Baker & Ousterhout, 1991.
M. Baker & J. Ousterhout, "Availability in the Sprite Distributed File System,"
ACM Operating System Review, vol. 25, no. 2, pp. 95-98, April 1991.
Birrell & Nelson, 1984.
A. D. Birrell & B. J. Nelson, "Implementing Remote Procedure Calls,"
ACM Transactions on Computer Systems, vol. 2, no. 1, pp. 39-59,
Association for Computing Machinery, February 1984.
Gray & Cheriton, 1989.
C. Gray & D. Cheriton, "Leases: An Efficient Fault-Tolerant Mechanism
for Distributed File Cache Consistency," Proceedings of the Twelfth
Symposium on Operating Systems Principles, pp. 202-210, December 1989.
Howard, 1988.
J. Howard, "An Overview of the Andrew File System," USENIX Association
Conference Proceedings, pp. 23-26, January 1988.
Howard et al, 1988.
J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R.
Sidebotham, & M. West, "Scale and Performance in a Distributed File
System," ACM Transactions on Computer Systems, vol. 6, no. 1, pp. 51-81,
Association for Computing Machinery, February 1988.
Juszczak, 1989.
C. Juszczak, "Improving the Performance and Correctness of an NFS Server,"
USENIX Association Conference Proceedings, pp. 53-63, January 1989.
Kent & Mogul, 1987.
C. Kent & J. Mogul, "Fragmentation Considered Harmful," Research
Report 87/3, Digital Equipment Corporation Western Research Laboratory,
Palo Alto, CA, December 1987.
Macklem, 1991.
R. Macklem, "Lessons Learned Tuning the 4.3BSD-Reno Implementation
of the NFS Protocol," USENIX Association Conference Proceedings, pp.
53-64, January 1991.
Macklem, 1994a.
R. Macklem, "Not Quite NFS, Soft Cache Consistency for NFS," USENIX
Association Conference Proceedings, pp. 261-278, January 1994.
Macklem, 1994b.
R. Macklem, "The 4.4BSD NFS Implementation," in 4.4BSD System Manager's
Manual, pp. 6:1-14, O'Reilly & Associates, Inc., Sebastopol, CA, 1994.
Mogul, 1993.
J. Mogul, "Recovery in Spritely NFS," Research Report 93/2, Digital
Equipment Corporation Western Research Laboratory, Palo Alto, CA, June
1993.
Nelson et al, 1988.
M. Nelson, B. Welch, & J. Ousterhout, "Caching in the Sprite Network File
System," ACM Transactions on Computer Systems, vol. 6, no. 1, pp.
134-154, Association for Computing Machinery, February 1988.
Nowicki, 1989.
B. Nowicki, "Transport Issues in the Network File System," Computer
Communications Review, vol. 19, no. 2, pp. 16-20, April 1989.
Pawlowski et al, 1994.
B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, & D. Hitz,
"NFS Version 3: Design and Implementation," USENIX Association
Conference Proceedings, pp. 137-151, June 1994.
Reid, 1987.
Irving Reid, "RPCC: A Stub Compiler for Sun RPC," USENIX Association
Conference Proceedings, pp. 357-366, June 1987.
Rifkin et al, 1986.
A. Rifkin, M. Forbes, R. Hamilton, M. Sabrio, S. Shah, & K. Yueh, "RFS
Architectural Overview," USENIX Association Conference Proceedings, pp.
248-259, June 1986.
Sandberg et al, 1985.
R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, & B. Lyon, "Design and
Implementation of the Sun Network Filesystem," USENIX Association
Conference Proceedings, pp. 119-130, June 1985.
Steiner et al, 1988.
J. Steiner, C. Neuman, & J. Schiller, "Kerberos: An Authentication Service
for Open Network Systems," USENIX Association Conference Proceedings,
pp. 191-202, February 1988.
Sun Microsystems, 1989.
Sun Microsystems, "NFS: Network File System Protocol Specification,"
RFC 1094, available by anonymous FTP from, March 1989.
Sun Microsystems, 1993.
Sun Microsystems, NFS: Network File System Version 3 Protocol
Specification, Sun Microsystems, Mountain View, CA, June 1993.
Walsh et al, 1985.
D. Walsh, B. Lyon, G. Sager, J. Chang, D. Goldberg, S. Kleiman, T. Lyon,
R. Sandberg, & P. Weiss, "Overview of the Sun Network File System,"
USENIX Association Conference Proceedings, pp. 117-124, January 1985.
Terminal Handling
A common type of peripheral device found on 4.4BSD systems is a hardware
interface supporting one or more terminals. The most common type of interface is
a terminal multiplexer, a device that connects multiple, asynchronous RS-232
serial lines, which may be used to connect terminals, modems, printers, and similar
devices. Unlike the block storage devices described in Section 6.2 and the network
devices to be considered in Chapter 11, terminal devices commonly process
data one character at a time. Like other character devices described in Section 6.3,
terminal multiplexers are supported by device drivers specific to the actual hardware.
Terminal interfaces interrupt the processor asynchronously to present input,
which is independent of process requests to read user input. Data are processed
when they are received, and then are stored until a process requests them, thus
allowing type-ahead. Many terminal ports attach local or remote terminals on
which users may log in to the system. When used in this way, terminal input represents
the keystrokes of users, and terminal output is printed on the users' screens
or printers. We shall deal mostly with this type of terminal line usage in this chapter.
Asynchronous serial lines also connect modems for computer-to-computer
communications or serial-interface printers. When serial interfaces are used for
these purposes, they generally use a subset of the system's terminal-handling
capability. Sometimes, they use special processing modules for higher efficiency.
We shall discuss alternate terminal modules at the end of this chapter.
The most common type of user session in 4.4BSD uses a pseudo-terminal, or
pty. The pseudo-terminal driver provides support for a device-pair, termed the
master and slave devices. The slave device provides a process an interface identical
to the one described for terminals in this chapter. However, whereas all other
devices that provide this interface are supported by a hardware device of some
sort, the slave device has, instead, another process manipulating it through the
master half of the pseudo-terminal. That is, anything written on the master device
is provided to the slave device as input, and anything written on the slave device is
presented to the master device as input. The driver for the master device emulates
all specific hardware support details described in the rest of this chapter.
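The master/slave relationship can be observed from user space. The following is a minimal sketch, assuming a POSIX system that provides pseudo-terminals; it relies on Python's os.openpty() and tty.setraw(), and the helper names read_exactly and pty_round_trip are made up for this illustration. The slave is placed in raw mode so that echoing and line buffering do not interfere with the demonstration.

```python
import os
import tty


def read_exactly(fd, count):
    """Read until `count` bytes have arrived (or end of file)."""
    data = b""
    while len(data) < count:
        chunk = os.read(fd, count - len(data))
        if not chunk:
            break
        data += chunk
    return data


def pty_round_trip():
    master_fd, slave_fd = os.openpty()
    try:
        tty.setraw(slave_fd)  # no echo, no line buffering, no output processing
        os.write(master_fd, b"typed")      # master writes ...
        seen_by_slave = read_exactly(slave_fd, 5)   # ... slave sees input
        os.write(slave_fd, b"output")      # slave writes ...
        seen_by_master = read_exactly(master_fd, 6)  # ... master sees input
        return seen_by_slave, seen_by_master
    finally:
        os.close(master_fd)
        os.close(slave_fd)
```

A process holding the slave descriptor cannot tell, from this interface alone, whether the other end is hardware or another process.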
Terminal-Processing Modes
4.4BSD supports several modes of terminal processing. Much of the time, terminals
are in canonical mode (also commonly referred to as cooked mode or line
mode), in which input characters are echoed by the operating system as they are
typed by the user, but are buffered internally until a newline character is received.
Only after the receipt of a newline character is the entire line made available to the
shell or other process reading from the terminal. If the process attempts to read
from the terminal line before a complete line is ready, the process will sleep until a
newline character is received, regardless of a partial line already having been
received. (The common case where a carriage return behaves like a newline character
and causes the line to be made available to the waiting process is implemented
by the operating system, and is configurable by the user or process.) In
canonical mode, the user may correct typing errors, deleting the most recently
typed character with the erase character, deleting the most recent word with the
word-erase character, or deleting the entire current line with the kill character.
Other special characters generate signals sent to processes associated with the terminal;
these signals may abort processing or may suspend it. Additional characters
start and stop output, flush output, or prevent special interpretation of the succeeding
character. The user can type several lines of input, up to an implementation-defined
limit, without waiting for input to be read and then removed from the input
queue. The user can specify the special processing characters or can selectively
disable them.
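The line-editing behavior of canonical mode can be modeled in a few lines. This is a user-level simulation, not kernel code: the function name canonical_input is hypothetical, the erase, word-erase, and kill characters are assumed to be DEL, control-W, and control-U, and echoing, signals, and flow control are ignored.

```python
ERASE, WERASE, KILL = "\x7f", "\x17", "\x15"  # assumed: DEL, ^W, ^U


def canonical_input(chars):
    """Return the complete lines that a reader would be given, in order."""
    lines, current = [], []
    for ch in chars:
        if ch == ERASE:
            if current:
                current.pop()  # delete the most recently typed character
        elif ch == WERASE:
            while current and current[-1] == " ":
                current.pop()  # skip blanks after the word
            while current and current[-1] != " ":
                current.pop()  # delete the most recent word
        elif ch == KILL:
            current.clear()    # delete the entire current line
        else:
            current.append(ch)
            if ch == "\n":     # only now does the line become readable
                lines.append("".join(current))
                current.clear()
    return lines
```

Characters typed after the last newline stay in the partial line and are not returned, which mirrors a reader sleeping until the line is complete.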
Screen editors and programs that communicate with other computers generally
run in noncanonical mode (also commonly referred to as raw mode or character-at-a-time
mode). In this mode, the system makes each typed character available to be
read as input as soon as that character is received. All special-character input processing
is disabled, no erase or other line-editing processing is done, and all characters
are passed to the program reading from the terminal.
It is possible to configure the terminal in thousands of combinations between
these two extremes. For example, a screen editor that wanted to receive user interrupts
asynchronously might enable the special characters that generate signals, but
otherwise run in noncanonical mode.
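The screen-editor configuration just described might be set up as follows through Python's termios module, which wraps the POSIX tc interface. This is a sketch under assumptions: the function name cbreak_with_signals is hypothetical, and the choice to turn echo off as well is ours rather than the text's.

```python
import os
import termios

LFLAG = 3  # position of the local-mode flags in the list from tcgetattr()
CC = 6     # position of the special-character array in that list


def cbreak_with_signals(fd):
    """Noncanonical input without echo, keeping signal-generating characters."""
    attrs = termios.tcgetattr(fd)
    attrs[LFLAG] &= ~(termios.ICANON | termios.ECHO)  # no line editing, no echo
    attrs[LFLAG] |= termios.ISIG                      # interrupt characters still work
    attrs[CC][termios.VMIN] = 1    # a read returns as soon as one character arrives
    attrs[CC][termios.VTIME] = 0   # no inter-character timer
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
    return attrs
```

A careful program would save the list returned by the first tcgetattr() call and restore it with tcsetattr() before exiting.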
In addition to processing input characters, terminal interface drivers must do
certain processing on output. Most of the time, this processing is simple: Newline
characters are converted to a carriage return plus a line feed, and the interface
hardware is programmed to generate appropriate parity bits on output characters.
In addition to doing character processing, the terminal output routines must manage
flow control, both with the user (using stop and start characters) and with the
process. Because terminal devices are slow in comparison with other computer
peripherals, a program writing to the terminal may produce output much faster
than that output can be sent to the terminal. When a process has filled the terminal
output queue, it will be put to sleep; it will be restarted when enough output has
drained.
Line Disciplines
Most of the character processing done for terminal interfaces is independent of the
type of hardware device used to connect the terminals to the computer. Therefore,
most of this processing is done by common routines in the tty driver or terminal
handler. Each hardware interface type is supported by a specific device driver.
The hardware driver is a device driver like those described in Chapter 6; it is
responsible for programming the hardware multiplexer. It is responsible for
receiving and transmitting characters, and for handling some of the synchronization
with the process doing output. The hardware driver is called by the tty driver
to do output; in turn, it calls the tty driver with input characters as they are
received. Because serial lines may be used for more than just connection of terminals,
a modular interface between the hardware driver and the tty driver allows
either part to be replaced with alternate versions. The tty driver interfaces with the
rest of the system as a line discipline. A line discipline is a processing module
used to provide semantics on an asynchronous serial interface (or, as we shall see,
on a software emulation of such an interface). It is described by a procedural
interface, the linesw (line-switch) structure.
The linesw structure specifies the entry points of a line discipline, much as the
character-device switch cdevsw lists the entry points of a character-device driver.
The entry points of a line discipline are listed in Table 10.1. Like all device
drivers, a terminal driver is divided into the top half, which runs synchronously
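Table 10.1
Entry points of a line discipline.
Routine    Called from    Usage
l_open     above          initial entry to discipline
l_close    above          exit from discipline
l_read     above          read from line
l_write    above          write to line
l_ioctl    above          control operations
l_rint     below          received character
l_start    below          completion of transmission
l_modem    below          modem carrier transition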
when called to process a system call, and the bottom half, which runs
asynchronously when device interrupts occur. The line discipline provides routines
that do common terminal processing for both the top and bottom halves of a
terminal driver.
Device drivers for serial terminal interfaces support the normal set of character-device-driver
entry points specified by the character-device switch. Several
of the standard driver entry points (read, write, and ioctl) immediately transfer
control to the line discipline when called. (The standard tty select routine
ttselect() usually is used as the device driver select entry in the character-device
switch.) The open and close routines are similar; the line-discipline open entry is
called when a line first enters a discipline, either at initial open of the line or when
the discipline is changed. Similarly, the discipline close routine is called to exit
from a discipline. All these routines are called from above, in response to a corresponding
system call. The remaining line-discipline entries are called by the bottom
half of the device driver to report input or status changes detected at interrupt
time. The l_rint (receiver interrupt) entry is called with each character received on
a line. The corresponding entry for transmit-complete interrupts is the l_start routine,
which is called when output operations complete. This entry gives the line
discipline a chance to start additional output operations. For the normal terminal
line discipline, this routine simply calls the driver's output routine to start the next
block of output. Transitions in modem-control lines (see Section 1 0.7) may be
detected by the hardware driver, in which case the l_modem routine is called with
an indication of the new state.
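The way a driver hands control to the current discipline through the line switch can be modeled as a table of function references. The sketch below is a toy model, not the kernel's C structures: only two entry points are shown, and the names LineDiscipline, Tty, t_line, rawq, tty_read, tty_rint, driver_read, and driver_rx_interrupt are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LineDiscipline:
    """One row of a toy line switch: entry points of a discipline."""
    l_read: Callable   # called from above, for a read system call
    l_rint: Callable   # called from below, for each received character


@dataclass
class Tty:
    t_line: int = 0                            # index of the current discipline
    rawq: list = field(default_factory=list)   # queued input characters


def tty_rint(ch, tp):
    tp.rawq.append(ch)  # bottom half: queue a received character


def tty_read(tp):
    data, tp.rawq = "".join(tp.rawq), []  # top half: hand queued input to the reader
    return data


linesw = [LineDiscipline(l_read=tty_read, l_rint=tty_rint)]


def driver_read(tp):
    """Driver read entry: transfer control to the line discipline."""
    return linesw[tp.t_line].l_read(tp)


def driver_rx_interrupt(ch, tp):
    """Driver receive interrupt: report the character to the discipline."""
    linesw[tp.t_line].l_rint(ch, tp)
```

Because the driver indexes the table through the per-line discipline number, installing a different discipline changes the behavior of a line without any change to the hardware driver.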
The system includes several different types of line disciplines. Most lines use
the terminal-oriented discipline described in Section 10.3. Other disciplines in the
system support graphics tablets on serial lines and asynchronous serial network
interfaces.
User Interface
The terminal line discipline used by default on most terminal lines is derived from
a discipline that was present in System V, as modified by the POSIX standard, and
then was modified further to provide reasonable compatibility with previous
Berkeley line disciplines. The base structure used to describe terminal state in
System V was the termio structure. The base structure used by POSIX and by
4.4BSD is the termios structure.
The standard programmatic interface for control of the terminal line discipline
is the ioctl system call. This call changes disciplines, sets and gets values for special
processing characters and modes, sets and gets hardware serial line parameters,
and performs other control operations. Most ioctl operations require one
argument in addition to a file descriptor and the command; the argument is the
address of an integer or structure from which the system gets parameters, or into
which information is placed. Because the POSIX Working Group thought that the
ioctl system call was difficult and undesirable to specify (because of its use of
arguments that varied in size, in type, and in whether they were being read or
written), the group members chose to introduce new interfaces for each of the
ioctl calls that they believed were necessary for application portability. Each of
these calls is named with a tc prefix. In the 4.4BSD system, each of these calls is
translated (possibly after preprocessing) into an ioctl call.
The following set of ioctl commands apply specifically to the standard terminal
line discipline, although all line disciplines must support at least the first two.
Other disciplines generally support other ioctl commands. This list is not exhaustive,
although it presents all the commands that are used commonly.
TIOCGETD (TIOCSETD)
    Get (set) the line discipline for this line.
TIOCGETA (TIOCSETA)
    Get (set) the termios parameters for this line, including line
    speed, behavioral parameters, and special characters (e.g., erase
    and kill characters).
TIOCSETAW
    Set the termios parameters for this line after waiting for the output
    buffer to drain (but without discarding any characters from
    the input buffer).
TIOCSETAF
    Set the termios parameters for this line after waiting for the output
    buffer to drain and discarding any characters from the input
    buffer.
TIOCFLUSH
    Discard all characters from the input and output buffers.
TIOCDRAIN
    Wait for the output buffer to drain.
TIOCEXCL (TIOCNXCL)
    Get (release) exclusive use of the line.
TIOCCBRK (TIOCSBRK)
    Clear (set) the terminal hardware BREAK condition for the line.
TIOCCDTR (TIOCSDTR)
    Clear (set) data terminal ready on the line.
TIOCGPGRP (TIOCSPGRP)
    Get (set) the process group associated with this terminal (see
    Section 10.5).
TIOCOUTQ
    Return the number of characters in the terminal's output buffer.
TIOCSTI
    Enter characters into the terminal's input buffer as though they
    were typed by the user.
TIOCNOTTY
    Disassociate the current controlling terminal from the process
    (see Section 10.5).
TIOCSCTTY
    Make the terminal the controlling terminal for the process (see
    Section 10.5).
TIOCSTART (TIOCSTOP)
    Start (stop) output on the terminal.
TIOCGWINSZ (TIOCSWINSZ)
    Get (set) the terminal or window size for the terminal line; the
    window size includes width and height in characters and
    (optionally, on graphical displays) in pixels.
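As a user-level illustration of two of these commands, the following minimal sketch fetches the window size with TIOCGWINSZ and the termios parameters with tcgetattr(), the POSIX call that corresponds to TIOCGETA. The helper name query_terminal is hypothetical, and the sketch assumes a POSIX system that provides <sys/ioctl.h> and <termios.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <termios.h>
#include <unistd.h>

/*
 * Hypothetical helper: fetch the window size and the termios state for fd.
 * Returns 0 on success, or -1 if fd is not a terminal or a call fails.
 */
int
query_terminal(int fd, struct winsize *ws, struct termios *t)
{
    assert(ws != NULL && t != NULL);
    if (!isatty(fd))
        return (-1);
    if (ioctl(fd, TIOCGWINSZ, ws) == -1)    /* get window size */
        return (-1);
    if (tcgetattr(fd, t) == -1)             /* POSIX form of TIOCGETA */
        return (-1);
    return (0);
}
```

A caller would typically pass STDIN_FILENO and then inspect ws_row and ws_col; a size of zero means that the size is unknown.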
The tty Structure
Each terminal hardware driver has a data structure to contain the state of each line
that it supports. This structure, the tty structure (see Table 10.2), contains state
information, the input and output queues, the modes and options set by the ioctl
operations listed in Section 10.3, and the line-discipline number. The tty structure
is shared by the hardware driver and the line discipline. The calls to the line disci­
pline all require a tty structure as a parameter; the driver locates the correct tty
according to the minor device number. This structure also contains information
about the device driver needed by the line discipline.
The sections of the tty structure include:
• State information about the hardware terminal line. The t_state field includes
line state (open, carrier present, or waiting for carrier) and major file options
(e.g., signal-driven I/O). Transient state for flow control and synchronization is
also stored here.
Table 10.2 The tty structure.
character queues
raw input queue
canonical input queue
device output queue
high/low watermarks
hardware parameters
device number
start/stop output functions
set hardware state function
process selecting for reading
process selecting for writing
termios state
process group
terminal column number
number of rows and columns
• Input and output queues. The hardware driver transmits characters placed in the
output queue, t_outq. Line disciplines generally use the t_rawq and t_canq
(noncanonical and canonical queues) for input; in line mode, the canonical queue
contains full lines, and the noncanonical queue contains any current partial line.
In addition, t_hiwat and t_lowat provide boundaries where processes attempting
to write to the terminal will be put to sleep, waiting for the output queue to drain.
• Hardware and software modes and parameters, and special characters. The
t_termios structure contains the information set by TIOCSETA, TIOCSETAF, and
TIOCSETAW. Specifically, line speed appears in the c_ispeed and c_ospeed fields
of the t_termios structure, control information in the c_iflag, c_oflag, c_cflag, and
c_lflag fields, and special characters (end-of-file, end-of-line, alternate end-of-line,
erase, word-erase, kill, reprint, interrupt, quit, suspend, start, stop,
escape-next-character, status-interrupt, flush-output, and VMIN and VTIME information)
in the c_cc field.
• Hardware driver information. This information includes t_oproc and t_stop, the
driver procedures that start (stop) transmissions after data are placed in the out­
put queue; t_param, the driver procedure that sets the hardware state; and t_dev,
the device number of the terminal line.
• Terminal line-discipline software state. This state includes the terminal column
number and counts for tab and erase processing (t_column, t_rocount and
t_rocol), the process group of the terminal (t_pgrp), the session associated with
the terminal (t_session), and information about any processes selecting for input
or output (t_rsel and t_wsel).
• Terminal or window size (t_winsize). This information is not used by the kernel,
but it is stored here to present consistent and correct information to applications.
In addition, 4.4BSD supplies the SIGWINCH signal (derived from Sun Microsystems'
SunOS) that can be sent when the size of a window changes. This signal
is useful for windowing packages such as X Window System [Scheifler & Gettys,
1986] that allow users to resize windows dynamically; programs such as text
editors running in such a window need to be informed that something has
changed and that they should recheck the window size.
The tty structure is initialized by the hardware terminal driver's open routine and
by the line-discipline open routine.
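The fields named above can be gathered into a simplified declaration. This is a sketch only: the field names come from the text, but the types, the struct names tty_sketch and clist_hdr, and the helper tty_sketch_init are assumptions made for illustration and do not reproduce the kernel's actual declaration.

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>      /* struct winsize */
#include <sys/types.h>      /* dev_t, pid_t */
#include <termios.h>        /* struct termios */

/* Simplified queue header: a count plus first and last character pointers. */
struct clist_hdr {
    int c_cc;
    char *c_cf;
    char *c_cl;
};

struct tty_sketch {
    /* input and output queues */
    struct clist_hdr t_rawq, t_canq, t_outq;
    int t_hiwat, t_lowat;               /* output watermarks */
    /* hardware driver information */
    dev_t t_dev;
    void (*t_oproc)(struct tty_sketch *);
    void (*t_stop)(struct tty_sketch *, int);
    int (*t_param)(struct tty_sketch *, struct termios *);
    /* state, line-discipline number, modes */
    int t_state;
    int t_line;
    struct termios t_termios;
    /* line-discipline software state */
    int t_column, t_rocount, t_rocol;
    pid_t t_pgrp;                       /* assumed type */
    void *t_session;                    /* assumed opaque pointer */
    void *t_rsel, *t_wsel;              /* assumed opaque pointers */
    struct winsize t_winsize;
};

/* Hypothetical initializer: clear everything, then record the device. */
void
tty_sketch_init(struct tty_sketch *tp, dev_t dev)
{
    assert(tp != NULL);
    memset(tp, 0, sizeof(*tp));
    tp->t_dev = dev;
}
```

Clearing the structure leaves the window size at zero, which matches the convention that a zero size means unknown.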
Process Groups, Sessions, and Terminal Control
The process-control (job-control) facilities described in Section 4.8 depend on the
terminal I/O system to control access to the terminal. Each job (a process group
that is manipulated as a single entity) is known by a process-group ID.
Each terminal structure contains a pointer to an associated session. When a
process creates a new session, that session has no associated terminal. To acquire
an associated terminal, the session leader must make an ioctl system call using a
file descriptor associated with the terminal and specifying the TIOCSCTTY flag.
When the ioctl succeeds, the session leader is known as the controlling process.
In addition, each terminal structure contains the process group ID of the fore­
ground process group. When a session leader acquires an associated terminal, the
terminal process group is set to the process group of the session leader. The termi­
nal process group may be changed by making an ioctl system call using a file
descriptor associated with the terminal and specifying the TIOCSPGRP flag. Any
process group in the session is permitted to become the foreground process group
for the terminal.
Signals that are generated by characters typed at the terminal are sent to all
the processes in the terminal's foreground process group. By default, some of
those signals cause the process group to stop. The shell creates jobs as process
groups, setting the process group ID to be the PID of the first process in the process
group. Each time it places a new job in the foreground, the shell sets the terminal
process group to the new process group. Thus, the terminal process group
is the identifier for the process group that is currently in control of the terminal;
that is, for the process group running in the foreground. Other process groups may
run in the background. If a background process attempts to read from the terminal,
its process group is sent another signal (SIGTTIN), which stops the process group.
Optionally, background processes that attempt terminal output may be stopped as
well (with SIGTTOU). These rules for control of input and output operations apply
to only those operations on the controlling terminal.
When carrier is lost for the terminal (for example, at modem disconnect), the
session leader of the session associated with the terminal is sent a SIGHUP signal.
If the session leader exits, the controlling terminal is revoked, and that invalidates
any open file descriptors in the system for the terminal. This revocation ensures
that processes holding file descriptors for a terminal cannot still access the terminal
after the terminal is acquired by another user. The revocation operates at the vnode
layer. It is possible for a process to have a read or write sleeping for some reason
(for example, because it was in a background process group). Since such a process
would have already resolved the file descriptor through the vnode layer, a single
read or write by the sleeping process could complete after the revoke system call.
To avoid this security problem, the system checks a tty generation number when a
process wakes up from sleeping on a terminal, and, if the number has changed,
restarts the read or write system call.
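The generation-number check can be modeled in a few lines. This is a sketch under stated assumptions: the names tty_gen, tty_revoke, tty_check_after_sleep, and RESTART_SKETCH are hypothetical, and the real kernel sleeps between sampling the number and rechecking it.

```c
#include <assert.h>
#include <stddef.h>

#define RESTART_SKETCH  (-1)    /* hypothetical "restart the system call" code */

struct tty_gen {
    int t_gen;                  /* incremented whenever the terminal is revoked */
};

/* Model of revocation: bump the generation number. */
void
tty_revoke(struct tty_gen *tp)
{
    assert(tp != NULL);
    tp->t_gen++;
}

/*
 * Called after waking from a sleep on the terminal.  gen_before is the
 * generation number sampled before the process went to sleep.
 */
int
tty_check_after_sleep(const struct tty_gen *tp, int gen_before)
{
    assert(tp != NULL);
    return (tp->t_gen == gen_before ? 0 : RESTART_SKETCH);
}
```

Because the restarted call resolves the file descriptor again through the vnode layer, it sees the revocation instead of completing on the old terminal.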
C-Lists
The terminal I/O system deals with data in blocks of widely varying sizes. Most
input and output operations deal with single characters (typed input characters and
their output echoes). Input characters are usually aggregated with previous input
to form lines of varying sizes. Some output operations involve larger numbers of
data, such as screen updates or other command output. The data structures
originally designed for terminal drivers, the character block, C-block, and
character list, C-list, are still in use in 4.4BSD. Each C-block is a fixed-size buffer
that contains a linkage pointer and space for buffered characters and quoting information.
Its size is a power of 2, and it is aligned such that the system can compute
boundaries between blocks by masking off the low-order bits of a pointer. 4.4BSD
uses 64-byte C-blocks, storing 52 characters and an array of quoting flags (1 bit
per character). A queue of input or output characters is described by a C-list,
which contains pointers to the first and final characters, and a count of the number
of characters in the queue (see Fig. 10.1). Both of the pointers point to characters
stored in C-blocks. When a character is removed from a C-list queue, the count is
decremented, and the pointer to the first character is incremented. If the pointer
has advanced beyond the end of the first C-block on the queue, the pointer to the
next C-block is obtained from the forward pointer at the start of the current
C-block. After the forward pointer is updated, the empty C-block is placed on a free
chain. A similar process adds a character to a queue. If there is no room in the
current buffer, another buffer is allocated from the free list, the linkage pointer of
the last buffer is set to point at the new buffer, and the tail pointer is set to the first
storage location of the new buffer. The character is stored where indicated by the
tail pointer, the tail pointer is incremented, and the character count is incremented.

Figure 10.1 A C-list structure.
A set of utility routines manipulates C-lists: getc() removes the next character
from a C-list and returns that character; putc() adds a character to the end of a
C-list. The getc() routine returns an integer, and the putc() routine takes an integer
as an argument. The lower 8 bits of this value are the actual character. The upper
bits are used to provide quoting and other information. Groups of characters may
be added to or removed from C-lists with b_to_q() and q_to_b(), respectively, in
which case no additional information (e.g., quoting information) can be specified
or returned. The terminal driver also requires the ability to remove a character
from the end of a queue with unputc(), to examine characters in the queue with
nextc(), and to concatenate queues with catq().
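The queue operations described above can be sketched in user-level C. This is a simplified model, not the kernel code: it allocates blocks with calloc() instead of a free chain, tracks positions with explicit indices instead of masking aligned pointers, and uses hypothetical names (clist_putc, clist_getc, Q_QUOTE) so that it does not clash with the standard I/O routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CBSIZE   52         /* characters per C-block, as in the text */
#define Q_QUOTE  0x100      /* hypothetical quoting bit above the low 8 bits */

struct cblock {
    struct cblock *c_next;                      /* forward linkage pointer */
    unsigned char c_quote[(CBSIZE + 7) / 8];    /* 1 quoting bit per character */
    unsigned char c_info[CBSIZE];               /* buffered characters */
};

struct clist {
    int c_cc;                   /* number of characters in the queue */
    struct cblock *c_first;     /* block holding the first character */
    struct cblock *c_last;      /* block holding the final character */
    int c_head;                 /* index of the first character in c_first */
    int c_tail;                 /* index one past the final character in c_last */
};

/* Add c (low 8 bits plus optional Q_QUOTE) to the end of q; -1 if no memory. */
int
clist_putc(int c, struct clist *q)
{
    assert(q != NULL);
    if (q->c_last == NULL || q->c_tail == CBSIZE) {
        struct cblock *bp = calloc(1, sizeof(*bp));

        if (bp == NULL)
            return (-1);
        if (q->c_last != NULL)
            q->c_last->c_next = bp;     /* link the new block after the last */
        else {
            q->c_first = bp;
            q->c_head = 0;
        }
        q->c_last = bp;
        q->c_tail = 0;
    }
    q->c_last->c_info[q->c_tail] = (unsigned char)(c & 0xff);
    if (c & Q_QUOTE)
        q->c_last->c_quote[q->c_tail / 8] |= (unsigned char)(1 << (q->c_tail % 8));
    q->c_tail++;
    q->c_cc++;
    return (0);
}

/* Remove and return the next character from q, or -1 if q is empty. */
int
clist_getc(struct clist *q)
{
    struct cblock *bp;
    int c;

    assert(q != NULL);
    if (q->c_cc == 0)
        return (-1);
    bp = q->c_first;
    c = bp->c_info[q->c_head];
    if (bp->c_quote[q->c_head / 8] & (1 << (q->c_head % 8)))
        c |= Q_QUOTE;
    q->c_head++;
    q->c_cc--;
    if (q->c_cc == 0 || q->c_head == CBSIZE) {
        /* The first block is used up: follow the forward pointer, free it. */
        q->c_first = bp->c_next;
        q->c_head = 0;
        if (q->c_first == NULL) {
            q->c_last = NULL;
            q->c_tail = 0;
        }
        free(bp);
    }
    return (c);
}
```

A queue starts out as a zeroed struct clist. Each call handles one character, which is the per-character cost that the discussion of C-lists refers to.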
When UNIX was developed on computers with small address spaces, the
design of buffers for the use of terminal drivers was a challenge. The C-list and
C-block provided an elegant solution to the problem of storing arbitrary-length
queues of data for terminal input and output queues when the latter were designed
for machines with small memories. On modern machines that have far larger
address spaces, it would be better to use a data structure that uses less CPU time
per character at a cost of reduced space efficiency. 4.4BSD still uses the original
C-list data structure because of the high labor cost of converting to a new data
structure; a change to the queue structure would require changes to all the line dis­
ciplines and to all the terminal device drivers, which would be a substantial
amount of work. The developers could just change the implementations of the
interface routines, but the routines would still be called once per character unless
the actual interface was changed, and changing the interface would require chang­
ing the drivers.
RS-232 and Modem Control
Most terminals and modems are connected via asynchronous RS-232 serial ports.
This type of connection supports several lines, in addition to those that transmit and
receive data. The system typically supports only a few of these lines. The most
commonly used lines are those showing that the equipment on each end is ready for
data transfer. The RS-232 electrical specification is asymmetrical: Each line is
driven by one of the two devices connected and is sampled by the other device.
Thus, one end in any normal connection must be wired as data-terminal equipment
(DTE), such as a terminal, and the other as data-communications equipment (DCE),
such as a modem. Note that terminal in DTE means endpoint: A terminal on which
people type is a DTE, and a computer also is a DTE. The data-terminal ready (DTR)
line is the output of the DTE end that serves as a ready indicator. In the other direc­
tion, the data-carrier detect (DCD) line indicates that the DCE device is ready for
data transfer. Historically, VAX terminal interfaces were all wired as DTE (they
may be connected directly to modems, or connected to local terminals with null
modem cables). The terminology used in the 4.4BSD terminal drivers and commands
reflects this orientation, even though many computers incorrectly use the
opposite convention.
When terminal devices are opened, the DTR output is asserted so that the con­
nected modem or other equipment may begin operation. If modem control is sup­
ported on a line, the open does not complete unless the O_NONBLOCK option was
specified or the CLOCAL control flag is set for the line, and no data are transferred
until the DCD input carrier is detected or the CLOCAL flag is set. Thus, an open
on a line connected to a modem will block until a connection is made; the connec­
tion commonly occurs when a call is received from a remote modem. Data then
can be transferred for as long as carrier remains on. If the modem loses the connection,
the DCD line is turned off, and subsequent reads and writes fail.
Ports that are used with local terminals or other DTE equipment are connected
with a null-modem cable that connects DTR on each end to DCD on the other end.
Alternatively, the DTR output on the host port can be looped back to the DCD
input. If the cable or device does not support modem control, the system will
ignore the state of the modem control signals when the CLOCAL control flag is set
for the line. Finally, some drivers may be configured to ignore modem-control
signals.
Terminal Operations
Now that we have examined the overall structure of the terminal I/O system and
have described that system's data structures, as well as the hardware that the system
controls, we continue with a description of the terminal I/O system operation.
We shall examine the operation of a generalized terminal hardware device driver
and the usual terminal line discipline. We shall not cover the autoconfiguration
routines present in each driver; they function in the same way as do those
described in Section 14.4.
Open
Each time that the special file for a terminal-character device is opened, the hardware
driver's open routine is called. The open routine checks that the requested
device was configured into the system and was located during autoconfiguration,
then initializes the tty structure. If the device was not yet open, the default modes
and line speed are set. The tty state is set to TS_WOPEN, waiting for open. Then,
if the device supports modem-control lines, the open routine enables the DTR output
line. If the CLOCAL control flag is not set for the terminal and the open call
did not specify the O_NONBLOCK flag, the open routine blocks awaiting assertion
of the DCD input line. Some drivers support device flags to override modem control;
these flags are set in the system-configuration file and are stored in the driver
data structures. If the bit corresponding to a terminal line number is set in a device's
flags, modem-control lines are ignored on input. When a carrier signal is
detected on the line, the TS_CARR_ON bit is set in the terminal state. The driver
then passes control to the initial (or current) line discipline through its open entry.
The default line discipline when a device is first opened is the termios termi­
nal-driver discipline. If the line was not already open, the terminal-size informa­
tion for the line is set to zero, indicating an unknown size. The line is then marked
as open (state bit TS_OPEN).
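The open sequence can be summarized as a small state model. This is a sketch under stated assumptions: the bit values, the struct line, and the function line_open are hypothetical, the flag softcar stands for the configuration flag that overrides modem control, and clearing TS_WOPEN once the open completes is an assumption not stated in the text. A real driver would sleep where the model returns 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values; only the names come from the text. */
#define TS_WOPEN    0x01    /* waiting for open to complete */
#define TS_CARR_ON  0x02    /* carrier is present */
#define TS_OPEN     0x04    /* line is open */

struct line {
    int state;
    int clocal;     /* CLOCAL control flag set for the line */
    int softcar;    /* configuration flag: ignore modem-control lines */
    int dcd;        /* current state of the DCD input */
    int dtr;        /* current state of the DTR output */
};

/*
 * Model of open: returns 1 if the caller would have to block awaiting
 * carrier, or 0 if the open completes and the line is marked open.
 */
int
line_open(struct line *lp, int nonblock)
{
    assert(lp != NULL);
    lp->state |= TS_WOPEN;
    lp->dtr = 1;                        /* enable the DTR output */
    if (lp->softcar || lp->dcd)
        lp->state |= TS_CARR_ON;
    if (!(lp->state & TS_CARR_ON) && !lp->clocal && !nonblock)
        return (1);                     /* a real driver would sleep here */
    lp->state = (lp->state & ~TS_WOPEN) | TS_OPEN;
    return (0);
}
```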
Output Line Discipline
After a line has been opened, a write on the resulting file descriptor produces output
to be transmitted on the terminal line. Writes to character devices result in
calls to the device write entry, d_write, with a device number, a uio structure
describing the data to be written, and a flag specifying whether the I/O is
nonblocking. Terminal hardware drivers use the device number to locate the correct
tty structure, then call the line discipline l_write entry with the tty structure
and uio structure as parameters.
The line-discipline write routine does most of the work of output translation
and flow control. It is responsible for copying data into the kernel from the user
process calling the routine and for placing the translated data onto the terminal's
output queue for the hardware driver. The terminal-driver write routine, ttwrite(),
first checks that the terminal line still has carrier asserted (or that modem control
is being ignored). If carrier is significant and not asserted, the process will be put
to sleep awaiting carrier if the terminal has not yet been opened, or an error will be
returned. If carrier is being ignored or is asserted, ttwrite() then checks whether
the current process is allowed to write to the terminal at this time. The user may
set a tty option to allow only the foreground process (see Section 10.5) to do output.
If this option is set, and if the terminal line is the controlling terminal for the
process, then the process should do output immediately only if it is in the foreground
process group (i.e., if the process groups of the process and of the terminal
are the same). If the process is not in the foreground process group, and a SIGTTOU
signal would cause the process to be suspended, a SIGTTOU signal is sent to
the process group of the process. In this case, the write will be attempted again
when the user moves the process group to the foreground. If the process is in the
foreground process group, or a SIGTTOU signal would not suspend the process,
the write proceeds as usual.
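The permission check for background writers reduces to a short predicate. This is a sketch with hypothetical names (wproc, wtty, check_write): the field tostop stands for the tty option that restricts output to the foreground process group (the TOSTOP termios flag), and ttou_would_stop summarizes whether a SIGTTOU would actually suspend the process.

```c
#include <assert.h>
#include <stddef.h>

enum write_action {
    WRITE_PROCEED,          /* copy the data to the output queue now */
    WRITE_STOP_AND_RETRY    /* send SIGTTOU to the process group, retry later */
};

struct wproc {
    int pgid;               /* process group of the writing process */
    int is_ctty;            /* terminal is the process's controlling terminal */
    int ttou_would_stop;    /* SIGTTOU is neither ignored nor blocked */
};

struct wtty {
    int pgid;               /* foreground process group of the terminal */
    int tostop;             /* only the foreground group may write */
};

enum write_action
check_write(const struct wproc *p, const struct wtty *tp)
{
    int background;

    assert(p != NULL && tp != NULL);
    background = p->is_ctty && p->pgid != tp->pgid;
    if (tp->tostop && background && p->ttou_would_stop)
        return (WRITE_STOP_AND_RETRY);
    return (WRITE_PROCEED);
}
```

A background process that ignores SIGTTOU falls through to WRITE_PROCEED, which matches the final case described above.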
When ttwrite() has confirmed that the write is permitted, it enters a loop that
copies the data to be written into the kernel, checks for any output translation that
is required, and places the data on the output queue for the terminal. It prevents
the queue from becoming overfull by blocking if the queue fills before all characters
have been processed. The limit on the queue size, the high watermark, is
dependent on the output line speed; the difference between the low watermark and
high watermark is approximately 1 second's worth of output. When forced to
wait for output to drain before proceeding, ttwrite() sets a flag in the tty structure
state, TS_ASLEEP, so that the transmit-complete interrupt handler will awaken it
when the queue is reduced to the low watermark. The check of the queue size and
subsequent sleep must be ordered such that any interrupt is guaranteed to occur
after the sleep. See Fig. 10.2 for an example, presuming a uniprocessor machine.

    struct tty *tp;

    ttstart(tp);
    s = spltty();                       /* block terminal interrupts */
    if (tp->t_outq.c_cc > high-water-mark) {
            tp->t_state |= TS_ASLEEP;   /* ask to be awakened */
            ttysleep(&tp->t_outq);
    }
    splx(s);                            /* restore the interrupt level */

Figure 10.2 Pseudocode for checking the output queue in a line discipline.
Once errors, permissions, and flow control have been checked, ttwrite()
copies the user's data into a local buffer in chunks of at most 100 characters using
uiomove(). (A value of 100 is used because the buffer is stored on the stack, and
so cannot be large.) When the terminal driver is configured in noncanonical
mode, no per-character translations are done, and the entire buffer is processed at
once. In canonical mode, the terminal driver locates groups of characters requiring
no translation by scanning through the output string, looking up each character
in turn in a table that marks characters that might need translation (e.g., newline),
or characters that need expansion (e.g., tabs). Each group of characters that
requires no special processing is placed into the output queue using b_to_q().
Trailing special characters are output with ttyoutput(). In either case, ttwrite()
must check that enough C-list blocks are available; if they are not, it waits for a
short time (by sleeping on lbolt for up to 1 second), then retries.
The routine that does output with translation is ttyoutput(), which accepts a
single character, processes that character as necessary, and places the result on the
output queue. The following translations may be done, depending on the terminal
mode:
• Tabs may be expanded to spaces.
• Newlines may be replaced with a carriage return plus a line feed.
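The two translations listed above can be sketched as a per-character routine that tracks the output column. This is an illustrative model, not the kernel's ttyoutput(): the names out_state, emit, and output_char are hypothetical, the result goes to a caller-supplied buffer rather than a C-list, and tab stops every 8 columns are an assumption.

```c
#include <assert.h>
#include <stddef.h>

struct out_state {
    int column;         /* current output column */
    int expand_tabs;    /* nonzero: expand tabs to spaces */
    int nl_to_crnl;     /* nonzero: newline becomes carriage return + line feed */
};

/* Append one byte to buf; returns -1 if the buffer is full. */
static int
emit(char *buf, size_t *len, size_t cap, int c)
{
    if (*len >= cap)
        return (-1);
    buf[(*len)++] = (char)c;
    return (0);
}

/* Translate one character and append the result to buf. */
int
output_char(struct out_state *st, int c, char *buf, size_t *len, size_t cap)
{
    assert(st != NULL && buf != NULL && len != NULL);
    if (c == '\t' && st->expand_tabs) {
        int n = 8 - (st->column & 7);   /* spaces to the next tab stop */

        while (n-- > 0) {
            if (emit(buf, len, cap, ' ') == -1)
                return (-1);
            st->column++;
        }
        return (0);
    }
    if (c == '\n' && st->nl_to_crnl) {
        if (emit(buf, len, cap, '\r') == -1)
            return (-1);
        st->column = 0;
    }
    if (emit(buf, len, cap, c) == -1)
        return (-1);
    if (c == '\r')
        st->column = 0;
    else if (c == '\t')
        st->column = (st->column + 8) & ~7;
    else if (c >= ' ' && c < 0x7f)
        st->column++;
    return (0);
}
```

Tab expansion depends on the current column, so the routine must track the column for every character that it emits.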
As soon as data are placed on the output queue of a tty, ttstart() is called to
initiate output. Unless output is already in progress or has been suspended by
receipt of a stop character, ttstart() calls the hardware-driver start routine specified
in the tty's t_oproc field. Once all the data have been processed and have been
placed into the output queue, ttwrite() returns an indication that the write completed
successfully, and the actual serial character transmission is managed asynchronously
by the device driver.
Output Top Half
The device driver handles the hardware-specific operation of character transmis­
sion, as well as synchronization and flow control for output. The structure of the
start() routine varies little from one driver to another. There are two general
classes of output mechanisms, depending on the type of hardware device. The
first class operates on devices that are capable of direct-memory-access (DMA)
output, which can fetch the data directly from the C-list block. For this class of
device, the device fetches the data from main memory, transmits each of the char­
acters in turn, and interrupts the CPU when the transmission is complete. Because
the hardware fetches data directly from main memory, there may be additional
requirements on where the C-lists can be located in physical memory.
At the other extreme are terminal interfaces that do programmed I/O,
potentially on a character-by-character basis. One or more characters are loaded
into the device's output-character register for transmission. The CPU must then
wait for the transmit-complete interrupt before sending more characters. Because
of the many interrupts generated in this mode of operation, several variants have
been developed to minimize the overhead of terminal I/O.
One approach is to compute in advance as much as possible of the informa­
tion needed at interrupt time. (Generally, the information needed is a pointer to
the next character to be transmitted, the number of characters to be transmitted,
and the address of the hardware device register that will receive the next character.)
This strategy is known as pseudo-DMA; the precomputed information is
stored in a pdma structure. A small assembly-language routine receives each
hardware transmit-complete interrupt, transmits the next character, and returns.
When there are no characters left to transmit, it calls a C-language interrupt rou­
tine with an indication of the line that completed transmission. The normal driver
thus has the illusion of DMA output, because it is not called until the entire block
of characters has been transmitted.
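The pseudo-DMA idea can be sketched as follows. This is a user-level model only: the struct pdma_sketch and its field names are hypothetical, a plain variable stands in for the device register, and pdma_intr() plays the role of the small assembly-language routine. The routine xmit_done() is a stand-in for the driver's C-language interrupt routine.

```c
#include <assert.h>
#include <stddef.h>

struct pdma_sketch {
    volatile unsigned char *p_reg;  /* device output-character register */
    const unsigned char *p_next;    /* next character to be transmitted */
    const unsigned char *p_end;     /* one past the final character */
    void (*p_done)(int line);       /* C-language completion routine */
    int p_line;                     /* line number passed to p_done */
};

int completed_line = -1;            /* records the last completed line */

/* Stand-in for the driver's normal transmit-complete routine. */
void
xmit_done(int line)
{
    completed_line = line;
}

/* Run once per hardware transmit-complete interrupt. */
void
pdma_intr(struct pdma_sketch *dp)
{
    assert(dp != NULL);
    if (dp->p_next < dp->p_end)
        *dp->p_reg = *dp->p_next++;     /* transmit the next character */
    else
        dp->p_done(dp->p_line);         /* whole block sent: call the driver */
}
```

The per-interrupt path needs only a comparison, a store, and an increment. The driver routine runs once per block, as it would with true DMA.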
Another approach is found on hardware that supports periodic polling inter­
rupts instead of per-character interrupts. Usually, the period is settable based on
the line speed. A final variation is found in hardware that can buffer several characters
at a time in a silo and that will interrupt only when the silo has been emptied
completely. In addition, some hardware devices are capable of both DMA and
a variant of character-at-a-time I/O, and can be programmed by the operating system
to operate in either mode.
After an output operation is started, the terminal state is marked with
TS_BUSY so that new transmissions will not be attempted until the current one
completes.
Output Bottom Half
When transmission of a block of characters has been completed, the hardware
multiplexer interrupts the CPU; the transmit interrupt routine is then called with
the unit number of the device. Usually, the device has a register that the driver can
read to determine which of the device's lines have completed transmit operations.
For each line that has finished output, the interrupt routine clears the TS_BUSY
flag. The characters that have been transmitted were removed from the C-list
when copied to a local buffer by the device driver using getc() or q_to_b(); or if
they were not, the driver removes them from the output queue using ndflush().
These steps complete one section of output.
The line-discipline start routine is called to start the next operation; as noted,
this routine generally does nothing but call the driver start routine specified in the
terminal t_oproc field. The start routine now checks to see whether the output
queue has been reduced to the low watermark, and, if it has been, whether the top
half is waiting for space in the output queue. If the TS_ASLEEP flag is set, the output
process is awakened. In addition, selwakeup() is called, and, if a process is
recorded in t_wsel as selecting for output, that process is notified. Then, if the
output queue is not empty, the next operation is started as before.
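The bottom-half sequence can be modeled with counters in place of real wakeups and hardware. This is a sketch under stated assumptions: the struct otty, the bit values, and the routines start_sketch and xmit_intr_sketch are hypothetical, and the queue is reduced to a character count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values; only the names come from the text. */
#define TS_BUSY     0x1     /* a transmission is in progress */
#define TS_ASLEEP   0x2     /* the top half is waiting for queue space */

struct otty {
    int state;
    int outq_cc;    /* characters remaining in the output queue */
    int lowat;      /* low watermark */
    int wakeups;    /* counts wakeups of the sleeping writer */
    int starts;     /* counts transmissions handed to the hardware */
};

/* Model of the driver start routine. */
void
start_sketch(struct otty *tp)
{
    assert(tp != NULL);
    if (tp->state & TS_BUSY)
        return;                         /* output already in progress */
    if (tp->outq_cc <= tp->lowat && (tp->state & TS_ASLEEP)) {
        tp->state &= ~TS_ASLEEP;
        tp->wakeups++;                  /* kernel: awaken the output process */
    }
    if (tp->outq_cc > 0) {
        tp->state |= TS_BUSY;
        tp->starts++;                   /* start the next operation */
    }
}

/* Model of the transmit interrupt for one line; nsent characters were sent. */
void
xmit_intr_sketch(struct otty *tp, int nsent)
{
    assert(tp != NULL && nsent <= tp->outq_cc);
    tp->state &= ~TS_BUSY;
    tp->outq_cc -= nsent;               /* flush the transmitted characters */
    start_sketch(tp);
}
```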
Input Bottom Half
Unlike output, terminal input is not initiated by a system call, but rather arrives
asynchronously when the terminal line receives characters from the keyboard or
other input device. Thus, the input processing in the terminal system occurs
mostly at interrupt time. Most hardware multiplexers interrupt each time that a
character is received on any line. They usually provide a silo that stores received
characters, along with the line number on which the characters were received and
any associated status information, until the device handler retrieves the characters.
Use of the silo prevents characters from being lost if the CPU has not processed a
received-character interrupt by the time that the next character arrives. On many
devices, the system can avoid per-character interrupts by programming the device
to interrupt only after the silo is partially or completely full. However, the driver
must then check the device periodically so that characters do not stagnate in the
silo if additional input does not trigger an interrupt. If the device can also be programmed
to interrupt a short time after the first character enters the silo, regardless
of additional characters arriving, these periodic checks of the device by the driver
can be avoided. Characters cannot be allowed to stagnate because input flow-con­
trol characters must be processed without much delay, and users will notice any
significant delay in character echo as well. The drivers in 4.4BSD for devices with
such timers always use the silo interrupts. Other terminal drivers use per-character
interrupts until the input rate is high enough to warrant the use of the silo alarm
and a periodic scan of the silo.
When a device receiver interrupt occurs, or when a timer routine detects
input, the receiver-interrupt routine reads each character from the input silo, along
with the latter's line number and status information. Normal characters are passed
as input to the terminal line discipline for the receiving tty through the latter's
l_rint entry:

    (*linesw[tp->t_line].l_rint)(input-character, tp);

The input character is passed to the l_rint routine as an integer. The bottom 8 bits
of the integer are the actual character. Characters received with hardware-detected
parity errors, break characters, or framing errors have flags set in the upper bits of
the integer to indicate these conditions.
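The dispatch through the line-discipline switch and the layout of the integer can be sketched as follows. The mask and flag values shown are assumptions modeled on the layout described above (character in the low 8 bits, error flags in the upper bits), and the names tty_in, linesw_sketch, sketch_rint, and receive_char are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values: character in the low 8 bits, error flags in upper bits. */
#define TTY_CHARMASK    0x000000ff
#define TTY_FE          0x01000000  /* framing error or break */
#define TTY_PE          0x02000000  /* parity error */

struct tty_in {
    int t_line;         /* index of the current line discipline */
    int last_char;      /* most recent character delivered */
    int errors;         /* characters that arrived with an error flag */
};

struct linesw_entry {
    int (*l_rint)(int c, struct tty_in *tp);
};

/* A toy receiver-interrupt entry for discipline 0. */
static int
sketch_rint(int c, struct tty_in *tp)
{
    if (c & (TTY_FE | TTY_PE))
        tp->errors++;
    tp->last_char = c & TTY_CHARMASK;
    return (0);
}

struct linesw_entry linesw_sketch[] = {
    { sketch_rint },
};

/* What a driver's receiver-interrupt routine does for each character. */
void
receive_char(struct tty_in *tp, int c)
{
    assert(tp != NULL && tp->t_line == 0);
    (*linesw_sketch[tp->t_line].l_rint)(c, tp);
}
```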
The receiver-interrupt (l_rint) routine for the normal terminal line discipline
is ttyinput(). When a break condition is detected (a longer-than-normal character
with only 0 bits), it is ignored, or an interrupt character or a null is passed to the
process, depending on the terminal mode. The interpretation of terminal input
described in Section 10.1 is done here. Input characters are echoed if desired. In
noncanonical mode, characters are placed into the raw input queue without interpretation.
Otherwise, most of the work done by ttyinput() is to check for characters
with special meanings and to take the requested actions. Other characters are
placed into the raw queue. In canonical mode, if the received character is a carriage
return or another character that causes the current line to be made available
to the program reading the terminal, the contents of the raw queue are added to the
canonicalized queue and ttwakeup() is called to notify any process waiting for
input. In noncanonical mode, ttwakeup() is called when each character is
processed. It will awaken any process sleeping on the raw queue awaiting input
for a read and will notify processes selecting for input. If the terminal has been
set for signal-driven I/O using fcntl and the FASYNC flag, a SIGIO signal is sent to
the process group controlling the terminal.
Ttyinput() must also check that the input queue does not become too large,
exhausting the supply of C-list blocks; input characters are discarded when the
limit (1024 characters) is reached. If the IXOFF termios flag is set, end-to-end
flow control is invoked when the queue reaches half full by output of a stop character
(normally XOFF or control-S).
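The two input thresholds can be modeled with a counter. This is a sketch: the struct iq and the function input_enqueue are hypothetical, the queue is reduced to a character count, and sending the stop character is represented by incrementing stops_sent.

```c
#include <assert.h>
#include <stddef.h>

#define INPUT_LIMIT 1024    /* discard limit given in the text */

struct iq {
    int cc;             /* characters currently queued */
    int ixoff;          /* IXOFF flag: input flow control enabled */
    int blocked;        /* a stop character has been sent */
    int stops_sent;     /* number of stop characters sent */
    int dropped;        /* characters discarded at the limit */
};

/* Account for one received character. */
void
input_enqueue(struct iq *q)
{
    assert(q != NULL);
    if (q->cc >= INPUT_LIMIT) {
        q->dropped++;                   /* queue is full: discard */
        return;
    }
    q->cc++;
    if (q->ixoff && !q->blocked && q->cc >= INPUT_LIMIT / 2) {
        q->blocked = 1;
        q->stops_sent++;                /* kernel: output the stop character */
    }
}
```

The blocked flag keeps the model from sending a second stop character while the sender is already held off.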
Up to this point, all processing is asynchronous, and occurs independent of
whether a read call is pending on the terminal device. In this way, type-ahead is
allowed to the limit of the input queues.
Input Top Half
Eventually, a read call is made on the file descriptor for the terminal device. Like
all calls to read from a character-special device, this one results in a call to the device
driver's d_read entry with a device number, a uio structure describing the
data to be read, and a flag specifying whether the I/O is nonblocking. Terminal
device drivers use the device number to locate the tty structure for the device, then
call the line discipline l_read entry to process the system call.
The l_read entry for the terminal driver is ttread(). Like ttwrite(), ttread()
first checks that the terminal line still has carrier (and that carrier is significant); if
not, it goes to sleep or returns an error. It then checks to see whether the process
is part of the session and the process group currently associated with the terminal.
If the process is a member of the session currently associated with the terminal, if
any, and is a member of the current process group, the read proceeds. Otherwise,
if a SIGTTIN would suspend the process, a SIGTTIN is sent to that process group.
In this case, the read will be attempted again when the user moves the process
group to the foreground. Otherwise, an error is returned. Finally, ttread() checks
for data in the appropriate queue (the canonical queue in canonical mode, the raw
queue in noncanonical mode). If no data are present, ttread() returns the error
EWOULDBLOCK if the terminal is using nonblocking I/O; otherwise, it sleeps on
the address of the raw queue. When ttread() is awakened, it restarts processing
from the beginning because the terminal state or process group might have
changed while it was asleep.
When characters are present in the queue for which ttread() is waiting, they
are removed from the queue one at a time with getc() and are copied out to the
user's buffer with ureadc(). In canonical mode, certain characters receive special
processing as they are removed from the queue: The delayed-suspension character
causes the current process group to be stopped with signal SIGTSTP, and the
end-of-file character terminates the read without being passed back to the user
program. If there was no previous character, the end-of-file character results in the
read returning zero characters, and that is interpreted by user programs as
indicating end-of-file. However, most special processing of input characters is done
when the character is entered into the queue. For example, translating carriage
returns to newlines based on the ICRNL flag must be done when the character is
first received because the newline character wakes up waiting processes in
canonical mode. In noncanonical mode, the characters are not examined as they are
processed.
Characters are processed and returned to the user until the character count in
the uio structure reaches zero, the queue is exhausted, or, if in canonical mode, a
line terminator is reached. When the read() call returns, the returned character
count will be the amount by which the requested count was decremented as
characters were processed.
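The three termination conditions can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the struct inq type and the sketch_read() function are hypothetical stand-ins for the kernel's queue and uio machinery, a newline is the only line terminator considered, and the special handling of the end-of-file and delayed-suspension characters is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a terminal input queue. */
struct inq {
	const unsigned char *data;	/* queued characters */
	size_t len;			/* number of characters queued */
	size_t head;			/* index of the next character to remove */
};

/*
 * Copy characters from q into buf until resid reaches zero, the queue
 * is exhausted, or (canonical mode only) a newline has been copied.
 * The return value is the amount by which the requested count was
 * decremented, which is what the caller sees as the read count.
 */
size_t
sketch_read(struct inq *q, unsigned char *buf, size_t resid, bool canonical)
{
	size_t requested = resid;

	while (resid > 0 && q->head < q->len) {
		unsigned char c = q->data[q->head++];

		*buf++ = c;
		resid--;
		if (canonical && c == '\n')
			break;		/* line terminator ends the read */
	}
	return (requested - resid);
}
```

With the five characters "ab", newline, "cd" queued, a canonical-mode call with a large buffer returns three characters and leaves "cd" in the queue for the next read.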
After the read completes, if terminal output was blocked by a stop character
being sent because the queue was filling up, and the queue is now less than
20-percent full, a start character (normally XON, control-Q) is sent.
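The low-water test can be sketched as follows. The structure, its field names, and maybe_unblock() are hypothetical; only the 20-percent threshold and the control-Q start character come from the text.

```c
#include <stdbool.h>
#include <stddef.h>

#define CTRL_Q	0x11		/* XON, control-Q */

/* Hypothetical summary of the state that the check consults. */
struct line_state {
	bool	blocked;	/* a stop character was sent earlier */
	size_t	queued;		/* characters now in the input queue */
	size_t	capacity;	/* limit of the input queue */
};

/*
 * Return the start character to send to the terminal, or -1 if
 * nothing should be sent.
 */
int
maybe_unblock(struct line_state *ls)
{
	/* queued * 5 < capacity is "less than 20-percent full". */
	if (ls->blocked && ls->queued * 5 < ls->capacity) {
		ls->blocked = false;
		return (CTRL_Q);
	}
	return (-1);
}
```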
The stop Routine
Character output on terminal devices is done in blocks as large as possible, for
efficiency. However, there are two events that should cause a pending output
operation to be stopped. The first event is the receipt of a stop character, which should
stop output as quickly as possible; sometimes, the device receiving output is a
printer or other output device with a limited buffer size. The other event that stops
output is the receipt of a special character that causes output to be discarded,
possibly because of a signal. In either case, the terminal line discipline calls the
character device driver's d_stop entry to stop any current output operation. Two
parameters are provided: a tty structure and a flag that indicates whether output is
to be flushed or suspended. Theoretically, if output is flushed, the terminal
discipline removes all the data in the output queue after calling the device stop routine.
More practically, the flag is ignored by most current device drivers.
The implementation of the d_stop routine is hardware dependent. Different
drivers stop output by disabling the transmitter, thus suspending output, or by
changing the current character count to zero. Drivers using pseudo-DMA may
change the limit on the current block of characters so that the pseudo-DMA routine
will call the transmit-complete interrupt routine after the current character is
transmitted. Most drivers set a flag in the tty state, TS_FLUSH, when a stop is to flush
data, and the aborted output operation will cause an interrupt. When the
transmit-complete interrupt routine runs, it checks the TS_FLUSH flag, and avoids updating
the output-queue character count (the queue has probably already been flushed by
the time the interrupt occurs). If output is to be stopped but not flushed, the
TS_TTSTOP flag is set in the tty state; the driver must stop output such that it may
be resumed from the current position.
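The interplay between the stop routine and the transmit-complete interrupt might look roughly like the sketch below. It is a user-space model rather than code from any real driver. The struct stop_tty type and both function names are hypothetical, the flag values are arbitrary, and TS_BUSY is assumed here to mean that an output operation is in progress. As in the text, the flush-or-suspend flag argument is ignored, and the routine infers a flush from the absence of TS_TTSTOP.

```c
/* Illustrative state bits; the values are arbitrary. */
#define TS_BUSY		0x1	/* assumed: output operation in progress */
#define TS_TTSTOP	0x2	/* output stopped, to be resumed later */
#define TS_FLUSH	0x4	/* aborted output was being flushed */

/* Hypothetical, much-reduced tty structure. */
struct stop_tty {
	int	state;
	int	outq_count;	/* characters in the output queue */
};

/* Analogue of a d_stop entry. */
void
sketch_stop(struct stop_tty *tp)
{
	if (tp->state & TS_BUSY) {
		/* A real driver would abort the current block here. */
		if ((tp->state & TS_TTSTOP) == 0)
			tp->state |= TS_FLUSH;
	}
}

/* Analogue of a transmit-complete interrupt for `sent' characters. */
void
sketch_xint(struct stop_tty *tp, int sent)
{
	tp->state &= ~TS_BUSY;
	if (tp->state & TS_FLUSH)
		tp->state &= ~TS_FLUSH;	/* queue already flushed; leave count */
	else
		tp->outq_count -= sent;	/* account for characters sent */
}
```

When TS_TTSTOP is set, the queue count is still updated by the interrupt routine, so output can resume from the current position once the stop condition clears.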
The ioctl Routine
Section 10.3 described the user interface to terminal drivers and line disciplines,
most of which is accessed via the ioctl system call. Most of these calls manipulate
software options in the terminal line discipline; some of them also affect the
operation of the asynchronous serial port hardware. In particular, the hardware
line speed, word size, and parity are derived from these settings. So, ioctl calls are
processed both by the current line discipline and by the hardware driver.

    error = (*linesw[tp->t_line].l_ioctl)(tp, cmd, data, flag);
    if (error >= 0)
            return (error);
    error = ttioctl(tp, cmd, data, flag);
    if (error >= 0)
            return (error);
    switch (cmd) {
    case TIOCSBRK:          /* hardware specific commands */
            return (0);
    case TIOCCBRK:
            return (0);
    default:
            return (ENOTTY);
    }

Figure 10.3 Handling of an error return from a line discipline.

The device driver d_ioctl routine is called with a device number, an ioctl
command, and a pointer to a data buffer when an ioctl is done on a character-special
file, among other arguments. Like the read and write routines, most
terminal-driver ioctl routines locate the tty structure for the device, then pass control to the
line discipline. The line-discipline ioctl routine does discipline-specific actions,
including change of line discipline. If the line-discipline routine fails, the driver
will immediately return an error, as shown in Fig. 10.3. Otherwise, the driver will
then call the ttioctl() routine that does most common terminal processing,
including changing terminal parameters. If ttioctl() fails, the driver will immediately
return an error. Otherwise, some drivers implement additional ioctl commands
that do hardware-specific processing (for example, manipulating modem-control
outputs). These commands are not recognized by the line discipline, or by
common terminal processing, and thus must be handled by the driver. The ioctl
routine returns an error number if an error is detected, or returns zero if the command
has been processed successfully. The errno variable is set to ENOTTY if the
command is not recognized.
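From a user program, the ENOTTY result is easy to observe by making a terminal request on a descriptor that is not a terminal. The sketch below uses the POSIX tcgetattr() call on a pipe instead of a raw ioctl; the helper name is hypothetical, and the example shows only the user-visible error, not the driver path described above.

```c
#include <errno.h>
#include <termios.h>
#include <unistd.h>

/*
 * Ask for terminal attributes on the read end of a pipe.  Returns the
 * resulting errno value, 0 if the call unexpectedly succeeds, or -1 if
 * the pipe could not be created.
 */
int
errno_for_tty_request_on_pipe(void)
{
	int fds[2];
	struct termios t;
	int result = 0;

	if (pipe(fds) == -1)
		return (-1);
	if (tcgetattr(fds[0], &t) == -1)
		result = errno;		/* save before close() can change it */
	close(fds[0]);
	close(fds[1]);
	return (result);
}
```

On a POSIX system, the helper is expected to return ENOTTY, because a pipe does not recognize terminal requests.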
Modem Transitions
The way in which the system uses modem-control lines on terminal lines was
introduced in Section 10.7. Most terminal multiplexers support at least the set of
modem-control lines used by 4.4BSD; those that do not act instead as though
carrier were always asserted. When a device is opened, the DTR output is enabled,
and then the state of the carrier input is checked. If the state of the carrier input
changes later, this change must be detected and processed by the driver. Some
devices have a separate interrupt that reports changes in modem-control status;
others report such changes along with other status information with received
characters. Some devices do not interrupt when modem-control lines change, and the
driver must check their status periodically. When a change is detected, the line
discipline is notified by a call to its l_modem routine with the new state of the
carrier input.
The normal terminal-driver modem routine, ttymodem(), maintains the state
of the TS_CARR_ON flag in the tty structure and processes corresponding state
changes. When carrier establishment is detected, a wakeup is issued for any
process waiting for an open to complete. When carrier drops on an open line, the
leader of the session associated with the terminal (if any) is sent a hangup signal,
SIGHUP, and the terminal queues are flushed. The return value of ttymodem()
indicates whether the driver should maintain its DTR output. If the value is zero,
DTR should be turned off. Ttymodem() also implements an obscure terminal
option to use the carrier line for flow-control handshaking, stopping output when
carrier drops and resuming when it returns.
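The carrier transitions just described can be modeled as a small state machine. The following is a sketch, not the kernel routine: struct modem_tty and sketch_modem() are hypothetical, the flag values are arbitrary, TS_ISOPEN is assumed to mark an open line, the wakeup, signal, and flush are recorded as booleans rather than performed, and the flow-control option is left out.

```c
#include <stdbool.h>

#define TS_CARR_ON	0x1	/* carrier is present */
#define TS_ISOPEN	0x2	/* assumed: the line is open */

/* Hypothetical tty model that records the actions taken. */
struct modem_tty {
	int	state;
	bool	opener_woken;	/* wakeup issued for a pending open */
	bool	hangup_sent;	/* SIGHUP sent to the session leader */
	bool	queues_flushed;	/* terminal queues flushed */
};

/*
 * Process a carrier change.  Returns nonzero if the driver should
 * keep DTR asserted, or zero if DTR should be turned off.
 */
int
sketch_modem(struct modem_tty *tp, bool carrier)
{
	if (carrier) {
		tp->state |= TS_CARR_ON;
		tp->opener_woken = true;
		return (1);
	}
	tp->state &= ~TS_CARR_ON;
	if (tp->state & TS_ISOPEN) {
		tp->hangup_sent = true;
		tp->queues_flushed = true;
		return (0);
	}
	return (1);
}
```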
Closing of Terminal Devices
When the final reference to a terminal device is closed, or the revoke system call is
made on the device, the device-driver close routine is called. Both the line discipline
and the hardware driver may need to close down gracefully. The device-driver
routine first calls the line-discipline close routine. The standard line-discipline close
entry, ttylclose(), waits for any pending output to drain (if the terminal was not
opened with the O_NONBLOCK flag set and the carrier is still on), then flushes the
input and output queues. (Note that the close may be interrupted by a signal while
waiting for output to complete.) The hardware driver may clear any pending
operations, such as transmission of a break. If the state bit TS_HUPCLS has been set with
the TIOCHPCL ioctl, DTR is disabled to hang up the line. Finally, the device-driver
routine calls ttyclose(), which flushes all the queues, increments the generation
number so that pending reads and writes can detect reuse of the terminal, and clears
the terminal state.
Other Line Disciplines
We have examined the operation of the terminal I/O system using the standard
terminal-oriented line-discipline routines. For completeness, we now describe two
other line disciplines in the system. Note that the preceding discussion of the
operation of the terminal multiplexer drivers applies when these disciplines are
used, as well as when the terminal-oriented disciplines are used.
Serial Line IP Discipline
The serial line IP (SLIP) line discipline is used by networking software to
encapsulate and transfer Internet Protocol (IP) datagrams over asynchronous serial lines
[Romkey, 1988]. (See Chapter 13 for information about IP.) The slattach
program opens a serial line, sets the line's speed, and enters the SLIP line discipline.
The SLIP line discipline's open routine associates the terminal line with a
preconfigured network interface and prepares to send and receive network packets. Once
the interface's network address is set with the ifconfig program, the network will
route packets through the SLIP line to the system to which it connects. Packets are
framed with a simple scheme; a framing character (0300 octal) separates packets.
Framing characters that occur within packets are quoted with an escape character
(0333 octal) and are transposed (to 0334 octal). Escape characters within the
packet are escaped and transposed (to 0335 octal).
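The framing scheme is compact enough to show in full. The sketch below applies the quoting rules from the preceding paragraph to one packet. The function name and the macro names are chosen for illustration, and the choice to emit the framing character only after the packet (rather than also before it) is an assumption of this example.

```c
#include <stddef.h>

#define FRAME_END		0300	/* framing character */
#define FRAME_ESCAPE		0333	/* escape character */
#define TRANS_FRAME_END		0334	/* transposed framing character */
#define TRANS_FRAME_ESCAPE	0335	/* transposed escape character */

/*
 * Write the framed form of pkt into out, which must have room for
 * 2 * len + 1 bytes.  Returns the number of bytes written.
 */
size_t
slip_frame(const unsigned char *pkt, size_t len, unsigned char *out)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++) {
		switch (pkt[i]) {
		case FRAME_END:
			out[n++] = FRAME_ESCAPE;
			out[n++] = TRANS_FRAME_END;
			break;
		case FRAME_ESCAPE:
			out[n++] = FRAME_ESCAPE;
			out[n++] = TRANS_FRAME_ESCAPE;
			break;
		default:
			out[n++] = pkt[i];
			break;
		}
	}
	out[n++] = FRAME_END;		/* separates this packet from the next */
	return (n);
}
```

A four-byte packet containing one framing character and one escape character therefore occupies seven bytes on the line: two ordinary bytes, two two-byte escape sequences, and the trailing framing character.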
The output path is started every time a packet is output to the SLIP interface.
Packets are enqueued on one of two queues: one for interactive traffic and one for
other traffic. Interactive traffic takes precedence over other traffic. The SLIP
discipline places the framing character and the data of the next packet onto the output
queue of the tty, escaping the framing and the escape characters as needed, and in
some cases compressing packet headers. It then starts transmission by calling
ttstart(), which in turn calls the device's start routine referenced in the tty t_oproc
field. It may place multiple packets onto the output queue before returning, as
long as the system is not running short of C-list blocks. However, it stops moving
packets into the tty output queue when the character count has reached a fairly low
limit (60 bytes), so that future interactive traffic is not blocked by noninteractive
traffic already in the output queue. When transmission completes, the device
driver calls the SLIP start routine, which continues to place data onto the output
queue until all packets have been sent or the queue hits the limit again.
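The effect of the two queues and the low output limit can be seen in a simplified model. Everything here is hypothetical except the 60-byte limit: packets are represented only by their lengths, each is charged one extra byte for its framing character, escaping and header compression are ignored, and the queues are plain arrays that never wrap.

```c
#include <stddef.h>

#define OUTQ_LIMIT	60	/* limit on the tty output queue, from the text */
#define MAXPKTS		8	/* arbitrary capacity for this sketch */

/* Hypothetical packet queue holding only packet lengths. */
struct pktq {
	size_t	len[MAXPKTS];
	size_t	head, tail;	/* entries head..tail-1 are pending */
};

struct slip_out {
	struct pktq fast;	/* interactive traffic */
	struct pktq normal;	/* other traffic */
	size_t	tty_count;	/* characters in the tty output queue */
};

static int
pktq_empty(const struct pktq *q)
{
	return (q->head == q->tail);
}

/*
 * Move packets to the tty output queue, interactive traffic first,
 * until the queue reaches the limit or no packets remain.  Returns
 * the number of packets moved.
 */
size_t
slip_fill(struct slip_out *so)
{
	size_t moved = 0;

	while (so->tty_count < OUTQ_LIMIT) {
		struct pktq *q;

		if (!pktq_empty(&so->fast))
			q = &so->fast;
		else if (!pktq_empty(&so->normal))
			q = &so->normal;
		else
			break;
		so->tty_count += q->len[q->head++] + 1;	/* data plus framing */
		moved++;
	}
	return (moved);
}
```

Calling slip_fill() again after the device has drained the tty queue models the transmit-complete path: the remaining packets move only once the count is back under the limit.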
When characters are received on a line that is using the SLIP discipline, the
escaped characters are translated and data characters are placed into a network
buffer. When a framing character ends the packet, the packet header is
uncompressed if necessary, the packet is presented to the network protocol, and the
buffer is reinitialized.
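The receive side can be sketched as a routine that is handed one character at a time, much as a line-discipline input entry would be. The struct slip_rx type, slip_input(), the macro names, and the buffer size are assumptions of this example; header decompression and error accounting are omitted, and characters beyond the buffer size are silently dropped.

```c
#include <stdbool.h>
#include <stddef.h>

#define FRAME_END		0300	/* framing character */
#define FRAME_ESCAPE		0333	/* escape character */
#define TRANS_FRAME_END		0334	/* transposed framing character */
#define TRANS_FRAME_ESCAPE	0335	/* transposed escape character */
#define SLIP_RX_MAX		1024	/* arbitrary buffer size */

/* Hypothetical receive state for one SLIP line. */
struct slip_rx {
	unsigned char buf[SLIP_RX_MAX];
	size_t	len;		/* data characters collected so far */
	bool	escaped;	/* previous character was the escape */
};

/*
 * Process one received character.  When a framing character ends a
 * packet, return the packet length and reinitialize the buffer; the
 * packet contents stay in rx->buf until the next data character
 * arrives.  Otherwise return 0.
 */
size_t
slip_input(struct slip_rx *rx, unsigned char c)
{
	size_t done;

	if (c == FRAME_END) {
		done = rx->len;
		rx->len = 0;
		rx->escaped = false;
		return (done);
	}
	if (rx->escaped) {
		rx->escaped = false;
		if (c == TRANS_FRAME_END)
			c = FRAME_END;
		else if (c == TRANS_FRAME_ESCAPE)
			c = FRAME_ESCAPE;
	} else if (c == FRAME_ESCAPE) {
		rx->escaped = true;
		return (0);
	}
	if (rx->len < SLIP_RX_MAX)
		rx->buf[rx->len++] = c;
	return (0);
}
```

A framing character that arrives with an empty buffer yields a length of zero, so back-to-back framing characters do not produce empty packets.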
The SLIP discipline allows moderate-speed network connection