A Systems Approach
The Morgan Kaufmann Series in Networking
Series Editor, David Clark, M.I.T.
Computer Networks: A Systems Approach, 3e
Larry L. Peterson and Bruce S. Davie
Network Architecture, Analysis, and Design, 2e
James D. McCabe
MPLS Network Management: MIBs, Tools, and Techniques
Thomas D. Nadeau
Developing IP-Based Services: Solutions for Service Providers and Vendors
Monique Morrow and Kateel Vijayananda
Telecommunications Law in the Internet Age
Sharon K. Black
Optical Networks: A Practical Perspective, 2e
Rajiv Ramaswami and Kumar N. Sivarajan
Internet QoS: Architectures and Mechanisms
Zheng Wang
TCP/IP Sockets in Java: Practical Guide for Programmers
Michael J. Donahoo and Kenneth L. Calvert
TCP/IP Sockets in C: Practical Guide for Programmers
Kenneth L. Calvert and Michael J. Donahoo
Multicast Communication: Protocols, Programming, and Applications
Ralph Wittmann and Martina Zitterbart
MPLS: Technology and Applications
Bruce Davie and Yakov Rekhter
High-Performance Communication Networks, 2e
Jean Walrand and Pravin Varaiya
Internetworking Multimedia
Jon Crowcroft, Mark Handley, and Ian Wakeman
Understanding Networked Applications: A First Course
David G. Messerschmitt
Integrated Management of Networked Systems: Concepts, Architectures,
and their Operational Application
Heinz-Gerd Hegering, Sebastian Abeck, and Bernhard Neumair
Virtual Private Networks: Making the Right Connection
Dennis Fowler
Networked Applications: A Guide to the New Computing Infrastructure
David G. Messerschmitt
Modern Cable Television Technology: Video, Voice, and Data Communications
Walter Ciciora, James Farmer, and David Large
Switching in IP Networks: IP Switching, Tag Switching, and Related Technologies
Bruce S. Davie, Paul Doolan, and Yakov Rekhter
Wide Area Network Design: Concepts and Tools for Optimization
Robert S. Cahn
Frame Relay Applications: Business and Technology Case Studies
James P. Cavanagh
For further information on these books and for a list of forthcoming titles, please visit
our website at
Larry L. Peterson & Bruce S. Davie
A Systems Approach
Senior Editor Rick Adams
Publishing Services Manager Simon Crump
Developmental Editor Karyn Johnson
Cover Design Ross Carron Design
Cover Image Vasco de Gama Bridge, Lisbon, Portugal
Composition/Illustration International Typesetting and Composition
Copyeditor Ken DellaPenta
Proofreader Jennifer McClain
Indexer Steve Rath
Printer Courier Corporation
Designations used by companies to distinguish their products are often claimed as trademarks
or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a
claim, the product names appear in initial capital or all capital letters. Readers, however, should
contact the appropriate companies for more complete information regarding trademarks and registration.
Morgan Kaufmann Publishers
An Imprint of Elsevier Science
340 Pine Street, Sixth Floor
San Francisco, CA 94104-3205
© 2003 by Elsevier Science (USA)
All rights reserved
Printed in the United States of America
07 06 05 04 03
5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopying, or otherwise—without the
prior written permission of the publisher.
Library of Congress Control Number: xxxxxxxxxx
ISBN: 1-55860-832-X (Casebound)
ISBN: 1-55860-833-8 (Paperback)
This book is printed on acid-free paper.
To Lee Peterson and Robert Davie
David Clark
Massachusetts Institute of Technology
This third edition represents another major upgrade to this classic networking
book. The field continues to change fast, and new concepts emerge with amazing speed. This version expands its discussion of a lot of important new topics, including peer-to-peer networks, IPv6, overlay and content distribution networks,
MPLS and switching, wireless and mobile technology, and more. It also contains an
earlier and stronger focus on applications, which reflects the student and professional’s
increased familiarity with a wide range of networked applications. The book continues
its tradition of giving you the facts you need to understand today’s world.
But it has not lost track of its larger goal, to tell you not only the facts but the
why behind the facts. The philosophy of the book remains the same: to be timely but
timeless. What this book will teach you in today’s networked world will give you the
insight needed to work in tomorrow’s landscape. And that is important, since there
is no reason to believe that the evolution of networks is going to slow down anytime soon.
It is hard to remember what the world looked like only ten years ago. Back then
the Internet was not really a commercial reality. Ten megabits per second was really
fast. We didn’t worry about spam and virus attacks—we left our computers unguarded
and hardly worried. Those times were simpler, but today may be more exciting. And
you better believe that tomorrow will be different from today: at least as exciting, with
luck no less trustworthy, and certainly bigger, faster and filled with fresh innovation.
So I hope Larry and Bruce can relax for a little before they have to start the next
revision. Meanwhile, use this book to learn about today and get ready for tomorrow.
Have fun.
David Clark
Massachusetts Institute of Technology
The term spaghetti code is universally understood as an insult. All good computer
scientists worship the god of modularity, since modularity brings many benefits,
including the all-powerful benefit of not having to understand all parts of a
problem at the same time in order to solve it. Modularity thus plays a role in presenting
ideas in a book, as well as in writing code. If a book’s material is organized effectively—
modularly—the reader can start at the beginning and actually make it to the end.
The field of network protocols is perhaps unique in that the “proper” modularity
has been handed down to us in the form of an international standard: the seven-layer
reference model of network protocols from the ISO. This model, which reflects a
layered approach to modularity, is almost universally used as a starting point for
discussions of protocol organization, whether the design in question conforms to the
model or deviates from it.
It seems obvious to organize a networking book around this layered model.
However, there is a peril to doing so, because the OSI model is not really successful
at organizing the core concepts of networking. Such basic requirements as reliability,
flow control, or security can be addressed at most, if not all, of the OSI layers. This
fact has led to great confusion in trying to understand the reference model. At times it
even requires a suspension of disbelief. Indeed, a book organized strictly according to
a layered model has some of the attributes of spaghetti code.
Which brings us to this book. Peterson and Davie follow the traditional layered
model, but they do not pretend that this model actually helps in the understanding of
the big issues in networking. Instead, the authors organize discussion of fundamental
concepts in a way that is independent of layering. Thus, after reading the book, readers
will understand flow control, congestion control, reliability enhancement, data representation, and synchronization, and will separately understand the implications of
addressing these issues in one or another of the traditional layers.
This is a timely book. It looks at the important protocols in use today—especially
the Internet protocols. Peterson and Davie have a long involvement in and much
experience with the Internet. Thus their book reflects not just the theoretical issues in
protocol design, but the real factors that matter in practice. The book looks at some of
the protocols that are just emerging now, so the reader can be assured of an up-to-date
perspective. But most importantly, the discussion of basic issues is presented in a way
that derives from the fundamental nature of the problem, not the constraints of the
layered reference model or the details of today’s protocols. In this regard, what this
book presents is both timely and timeless. The combination of real-world relevance,
current examples, and careful explanation of fundamentals makes this book unique.
Foreword to the First Edition
1 Foundation
Problem: Building a Network
1.2.1 Connectivity
1.2.2 Cost-Effective Resource Sharing
1.2.3 Support for Common Services
Network Architecture
1.3.1 Layering and Protocols
1.3.2 OSI Architecture
1.3.3 Internet Architecture
Implementing Network Software
1.4.1 Application Programming Interface (Sockets)
1.4.2 Example Application
1.4.3 Protocol Implementation Issues
1.5.1 Bandwidth and Latency
1.5.2 Delay × Bandwidth Product
1.5.3 High-Speed Networks
1.5.4 Application Performance Needs
Open Issue: Ubiquitous Networking
Further Reading
2 Direct Link Networks
Problem: Physically Connecting Hosts
Hardware Building Blocks
2.1.1 Nodes
2.1.2 Links
Encoding (NRZ, NRZI, Manchester, 4B/5B)
2.3.1 Byte-Oriented Protocols (BISYNC, PPP, DDCMP)
2.3.2 Bit-Oriented Protocols (HDLC)
2.3.3 Clock-Based Framing (SONET)
Error Detection
2.4.1 Two-Dimensional Parity
2.4.2 Internet Checksum Algorithm
2.4.3 Cyclic Redundancy Check
Reliable Transmission
2.5.1 Stop-and-Wait
2.5.2 Sliding Window
2.5.3 Concurrent Logical Channels
Ethernet (802.3)
2.6.1 Physical Properties
2.6.2 Access Protocol
2.6.3 Experience with Ethernet
Token Rings (802.5, FDDI)
2.7.1 Physical Properties
2.7.2 Token Ring Media Access Control
2.7.3 Token Ring Maintenance
2.7.4 Frame Format
2.7.5 FDDI
Wireless (802.11)
2.8.1 Physical Properties
2.8.2 Collision Avoidance
2.8.3 Distribution System
2.8.4 Frame Format
Network Adaptors
2.9.1 Components
2.9.2 View from the Host
2.9.3 Memory Bottleneck
2.10 Summary
Open Issue: Does It Belong in Hardware?
Further Reading
3 Packet Switching
Problem: Not All Networks Are Directly Connected
Switching and Forwarding
3.1.1 Datagrams
3.1.2 Virtual Circuit Switching
3.1.3 Source Routing
Bridges and LAN Switches
3.2.1 Learning Bridges
3.2.2 Spanning Tree Algorithm
3.2.3 Broadcast and Multicast
3.2.4 Limitations of Bridges
Cell Switching (ATM)
3.3.1 Cells
3.3.2 Segmentation and Reassembly
3.3.3 Virtual Paths
3.3.4 Physical Layers for ATM
3.3.5 ATM in the LAN
Implementation and Performance
3.4.1 Ports
3.4.2 Fabrics
Open Issue: The Future of ATM
Further Reading
4 Internetworking
Problem: There Is More Than One Network
Simple Internetworking (IP)
4.1.1 What Is an Internetwork?
4.1.2 Service Model
4.1.3 Global Addresses
4.1.4 Datagram Forwarding in IP
4.1.5 Address Translation (ARP)
4.1.6 Host Configuration (DHCP)
4.1.7 Error Reporting (ICMP)
4.1.8 Virtual Networks and Tunnels
4.2.1 Network as a Graph
4.2.2 Distance Vector (RIP)
4.2.3 Link State (OSPF)
4.2.4 Metrics
4.2.5 Routing for Mobile Hosts
Global Internet
4.3.1 Subnetting
4.3.2 Classless Routing (CIDR)
4.3.3 Interdomain Routing (BGP)
4.3.4 Routing Areas
4.3.5 IP Version 6 (IPv6)
4.4.1 Link-State Multicast
4.4.2 Distance-Vector Multicast
4.4.3 Protocol Independent Multicast (PIM)
Multiprotocol Label Switching (MPLS)
4.5.1 Destination-Based Forwarding
4.5.2 Explicit Routing
4.5.3 Virtual Private Networks and Tunnels
Open Issue: Deployment of IPv6
Further Reading
5 End-to-End Protocols
Problem: Getting Processes to Communicate
Simple Demultiplexer (UDP)
Reliable Byte Stream (TCP)
5.2.1 End-to-End Issues
5.2.2 Segment Format
5.2.3 Connection Establishment and Termination
5.2.4 Sliding Window Revisited
5.2.5 Triggering Transmission
5.2.6 Adaptive Retransmission
5.2.7 Record Boundaries
5.2.8 TCP Extensions
5.2.9 Alternative Design Choices
Remote Procedure Call
5.3.1 Bulk Transfer (BLAST)
5.3.2 Request/Reply (CHAN)
5.3.3 Dispatcher (SELECT)
5.3.4 Putting It All Together (SunRPC, DCE)
Open Issue: Application-Specific Protocols
Further Reading
6 Congestion Control and Resource Allocation
Problem: Allocating Resources
Issues in Resource Allocation
6.1.1 Network Model
6.1.2 Taxonomy
6.1.3 Evaluation Criteria
Queuing Disciplines
6.2.1 FIFO
6.2.2 Fair Queuing
TCP Congestion Control
6.3.1 Additive Increase/Multiplicative Decrease
6.3.2 Slow Start
6.3.3 Fast Retransmit and Fast Recovery
Congestion-Avoidance Mechanisms
6.4.1 DECbit
6.4.2 Random Early Detection (RED)
6.4.3 Source-Based Congestion Avoidance
Quality of Service
6.5.1 Application Requirements
6.5.2 Integrated Services (RSVP)
6.5.3 Differentiated Services (EF, AF)
6.5.4 ATM Quality of Service
6.5.5 Equation-Based Congestion Control
Open Issue: Inside versus Outside the Network
Further Reading
7 End-to-End Data
Problem: What Do We Do with the Data?
Presentation Formatting
7.1.1 Taxonomy
7.1.2 Examples (XDR, ASN.1, NDR)
7.1.3 Markup Languages (XML)
Data Compression
7.2.1 Lossless Compression Algorithms
7.2.2 Image Compression (JPEG)
7.2.3 Video Compression (MPEG)
7.2.4 Transmitting MPEG over a Network
7.2.5 Audio Compression (MP3)
Open Issue: Computer Networks Meet Consumer Electronics
Further Reading
8 Network Security
Problem: Securing the Data
Cryptographic Algorithms
8.1.1 Requirements
8.1.2 Secret Key Encryption (DES)
8.1.3 Public Key Encryption (RSA)
8.1.4 Message Digest Algorithms (MD5)
8.1.5 Implementation and Performance
Security Mechanisms
8.2.1 Authentication Protocols
8.2.2 Message Integrity Protocols
8.2.3 Public Key Distribution (X.509)
Example Systems
8.3.1 Pretty Good Privacy (PGP)
8.3.2 Secure Shell (SSH)
8.3.3 Transport Layer Security (TLS, SSL, HTTPS)
8.3.4 IP Security (IPSEC)
8.4.1 Filter-Based Firewalls
8.4.2 Proxy-Based Firewalls
8.4.3 Limitations
Open Issue: Denial-of-Service Attacks
Further Reading
9 Applications
Problem: Applications Need Their Own Protocols
Name Service (DNS)
9.1.1 Domain Hierarchy
9.1.2 Name Servers
9.1.3 Name Resolution
Traditional Applications
9.2.1 Electronic Mail (SMTP, MIME, IMAP)
9.2.2 World Wide Web (HTTP)
9.2.3 Network Management (SNMP)
Multimedia Applications
9.3.1 Real-time Transport Protocol (RTP)
9.3.2 Session Control and Call Control (SDP, SIP, H.323)
Overlay Networks
9.4.1 Routing Overlays
9.4.2 Peer-to-Peer Networks
9.4.3 Content Distribution Networks
Open Issue: New Network Architecture
Further Reading
Solutions to Selected Exercises
About the Authors
When the first edition of this book was published in 1996, it was a novelty to
be able to order merchandise on the Internet, and a company that advertised
its domain name was considered cutting edge. Today, Internet commerce is
a fact of life, and “.com” stocks have gone through an entire boom and bust cycle.
A host of new technologies ranging from optical switches to wireless networks are
now becoming mainstream. It seems the only predictable thing about the Internet is
constant change.
Despite these changes the question we asked in the first edition is just as valid
today: What are the underlying concepts and technologies that make the Internet
work? The answer is that much of the TCP/IP architecture continues to function just
as was envisioned by its creators nearly 30 years ago. This isn’t to say that the Internet
architecture is uninteresting; quite the contrary. Understanding the design principles
that underlie an architecture that has not only survived but fostered the kind of growth
and change that the Internet has seen over the past three decades is precisely the right
place to start. Like the previous editions, the third edition makes the “why” of the
Internet architecture its cornerstone.
Our intent is that the book should serve as the text for a comprehensive networking
class, at either the graduate or upper-division undergraduate level. We also believe that
the book’s focus on core concepts should be appealing to industry professionals who
are retraining for network-related assignments, as well as current network practitioners
who want to understand the “whys” behind the protocols they work with every day
and to see the big picture of networking.
It is our experience that both students and professionals learning about networks
for the first time often have the impression that network protocols are some sort of edict
handed down from on high, and that their job is to learn as many TLAs (three-letter
acronyms) as possible. In fact, protocols are the building blocks of a complex system
developed through the application of engineering design principles. Moreover, they
are constantly being refined, extended, and replaced based on real-world experience.
With this in mind, our goal with this book is to do more than survey the protocols
in use today. Instead, we explain the underlying principles of sound network design.
We feel that this grasp of underlying principles is the best tool for handling the rate of
change in the networking field.
Changes in the Third Edition
Even though our focus is on the underlying principles of networking, we illustrate
these principles using examples from today’s working Internet. Therefore, we added a
significant amount of new material to track many of the important recent advances in
networking. We also deleted, reorganized, and changed the focus of existing material
to reflect changes that have taken place over the past seven years.
Perhaps the most significant change we have noticed since writing the first edition
is that almost every reader now has some familiarity with networked applications such
as the World Wide Web and email. For this reason, we have increased the focus on
applications, starting in the first chapter. We use applications as the motivation for
the study of networking, and to derive a set of requirements that a useful network
must meet if it is to support both current and future applications on a global scale.
However, we retain the problem-solving approach of previous editions that starts with
the problem of interconnecting hosts and works its way up the layers to conclude with
a detailed examination of application-layer issues. We believe it is important to make
the topics covered in the book relevant by starting with applications and their needs. At
the same time, we feel that higher-layer issues, such as application-layer and transport-layer protocols, are best understood after the basic problems of connecting hosts and
switching packets have been explained.
Another important change in this edition is in the exercises. We have increased
the number and quality of exercises; we have attempted to identify those that are
especially difficult or that require above-average levels of mathematical knowledge
(these are marked with an icon); and in each chapter we have added a number of
exercises with worked solutions that are included in the book. As before, the complete
set of exercise solutions is available only to instructors.
As we did in the second edition, we have added or increased coverage of important new topics and brought other topics up to date. Major new or substantially
updated topics in this edition are
■ a new section on Multiprotocol Label Switching (MPLS), including coverage
of traffic engineering and virtual private networks
■ a new section on overlay networks, including “peer-to-peer” networking and
“content distribution networks”
■ greatly expanded coverage on protocols for multimedia applications, such as
Session Initiation Protocol (SIP) and Session Description Protocol (SDP)
■ updated coverage of congestion-control mechanisms, including selective acknowledgments for TCP, equation-based congestion control, and explicit congestion notification
■ updated security coverage, including distributed denial of service (DDoS) attacks
■ updated material on wireless technology, including spread spectrum techniques and the emerging 802.11 standards
Finally, the book is now supplemented by a comprehensive set of laboratory exercises designed to illustrate the key concepts through simulation experiments. Sections
that discuss material covered by the laboratory exercises are marked with the icon
shown in the margin. Details on this new feature of the book appear below.
For an area that’s as dynamic and changing as computer networks, the most important
thing a textbook can offer is perspective—to distinguish between what’s important and
what’s not, and between what’s lasting and what’s superficial. Based on our experience over the past 20 years doing research that has led to new networking technology,
teaching undergraduate and graduate students about the latest trends in networking, and delivering advanced networking products to market, we have developed a
perspective—which we call the systems approach—that forms the soul of this book.
The systems approach has several implications:
■ Rather than accept existing artifacts as gospel, we start with first principles
and walk you through the thought process that led to today’s networks. This
allows us to explain why networks look like they do. It is our experience that
once you understand the underlying concepts, any new protocol that you are
confronted with will be relatively easy to digest.
■ Although the material is loosely organized around the traditional network
layers, starting at the bottom and moving up the protocol stack, we do not
adopt a rigid layered approach. Many topics—congestion control and security
are good examples—have implications up and down the hierarchy, and so
we discuss them outside the traditional layered model. In short, we believe
layering makes a good servant but a poor master; it’s more often useful to
take an end-to-end perspective.
■ Rather than explain how protocols work in the abstract, we use the most
important protocols in use today—many of them from the TCP/IP Internet—
to illustrate how networks work in practice. This allows us to include real-world experiences in the discussion.
■ Although at the lowest levels networks are constructed from commodity hardware that can be bought from computer vendors and communication services
that can be leased from the phone company, it is the software that allows networks to provide new services and adapt quickly to changing circumstances.
It is for this reason that we emphasize how network software is implemented,
rather than stopping with a description of the abstract algorithms involved.
We also include code segments taken from a working protocol stack to illustrate how you might implement certain protocols and algorithms.
■ Networks are constructed from many building-block pieces, and while it is
necessary to be able to abstract away uninteresting elements when solving
a particular problem, it is essential to understand how all the pieces fit together to form a functioning network. We therefore spend considerable time
explaining the overall end-to-end behavior of networks, not just the individual components, so that it is possible to understand how a complete network
operates, all the way from the application to the hardware.
■ The systems approach implies doing experimental performance studies, and
then using the data you gather both to quantitatively analyze various design
options and to guide you in optimizing the implementation. This emphasis on
empirical analysis pervades the book.
■ Networks are like other computer systems—for example, operating systems,
processor architectures, distributed and parallel systems, and so on. They
are all large and complex. To help manage this complexity, system builders
often draw on a collection of design principles. We highlight these design
principles as they are introduced throughout the book, illustrated, of course,
with examples from computer networks.
Pedagogy and Features
The third edition retains several features that we encourage you to take advantage of:
■ Problem statements. At the start of each chapter, we describe a problem that
identifies the next set of issues that must be addressed in the design of a
network. This statement introduces and motivates the issues to be explored
in the chapter.
■ Shaded sidebars. Throughout the text, shaded sidebars elaborate on the topic
being discussed or introduce a related advanced topic. In many cases, these
sidebars relate real-world anecdotes about networking.
■ Highlighted paragraphs. These paragraphs summarize an important nugget
of information that we want you to take away from the discussion, such as a
widely applicable system design principle.
■ Real protocols. Even though the book’s focus is on core concepts rather than
existing protocol specifications, real protocols are used to illustrate most of the
important ideas. As a result, the book can be used as a source of reference for
many protocols. To help you find the descriptions of the protocols, each applicable section heading parenthetically identifies the protocols described in that
section. For example, Section 5.2, which describes the principles of reliable
end-to-end protocols, provides a detailed description of TCP, the canonical
example of such a protocol.
■ Open issues. We conclude the main body of each chapter with an important
issue that is currently being debated in the research community, the commercial world, or society as a whole. We have found that discussing these issues
helps to make the subject of networking more relevant and exciting.
■ Further reading. These highly selective lists appear at the end of each chapter.
Each list generally contains the seminal papers on the topics just discussed.
We strongly recommend that advanced readers (e.g., graduate students) study
the papers in this reading list to supplement the material covered in the chapter.
Road Map and Course Use
The book is organized as follows:
■ Chapter 1 introduces the set of core ideas that are used throughout the rest
of the text. Motivated by widespread applications, it discusses what goes into
network architecture, and it defines the quantitative performance metrics that
often drive network design.
■ Chapter 2 surveys a wide range of low-level network technologies, ranging
from Ethernet to token ring to wireless. It also describes many of the issues
that all data link protocols must address, including encoding, framing, and
error detection.
■ Chapter 3 introduces the basic models of switched networks (datagrams versus
virtual circuits) and describes one prevalent switching technology (ATM) in
some detail. It also discusses the design of hardware-based switches.
■ Chapter 4 introduces internetworking and describes the key elements of the
Internet Protocol (IP). A central question addressed in this chapter is how
networks that scale to the size of the Internet are able to route packets.
■ Chapter 5 moves up to the transport level, describing in detail both the Internet’s Transmission Control Protocol (TCP) and Remote Procedure Call (RPC), which is used to
build client/server applications.
■ Chapter 6 discusses congestion control and resource allocation. The issues
in this chapter cut across both the network level (Chapters 3 and 4) and the
transport level (Chapter 5). Of particular note, this chapter describes how
congestion control works in TCP, and it introduces the mechanisms used by
both the Internet and ATM to provide quality of service.
■ Chapter 7 considers the data sent through a network. This includes the problems of both presentation formatting and data compression. The discussion
of compression includes explanations of how MPEG video compression and
MP3 audio compression work.
■ Chapter 8 discusses network security, ranging from an overview of cryptography protocols (DES, RSA, MD5), to protocols for security services (authentication, digital signature, message integrity), to complete security systems
(privacy enhanced email, IPSEC). The chapter also discusses pragmatic issues
like firewalls.
■ Chapter 9 describes a representative sample of network applications and the
protocols they use, including traditional applications like email and the Web,
multimedia applications such as IP telephony and video streaming, and overlay
networks like peer-to-peer file sharing and content distribution networks.
For an undergraduate course, extra class time will most likely be needed to help
students digest the introductory material in the first chapter, probably at the expense
of the more advanced topics covered in Chapters 6 through 8. Chapter 9 then returns
to the popular topic of network applications. In contrast, the instructor for a graduate
course should be able to cover the first chapter in only a lecture or two—with students
studying the material more carefully on their own—thereby freeing up additional
class time to cover the last four chapters in depth. Both graduate and undergraduate
classes will want to cover the core material contained in the middle four chapters
(Chapters 2–5), although an undergraduate class might choose to skim the more advanced sections (e.g., Sections 2.2, 2.9, 3.4, and 4.4).
For those of you using the book in self-study, we believe that the topics we have
selected cover the core of computer networking, and so we recommend that the book
be read sequentially, from front to back. In addition, we have included a liberal supply
of references to help you locate supplementary material that is relevant to your specific
areas of interest, and we have included solutions to selected exercises.
The book takes a unique approach to the topic of congestion control by pulling
all topics related to congestion control and resource allocation together in a single
place—Chapter 6. We do this because the problem of congestion control cannot be
solved at any one level, and we want you to consider the various design options at
the same time. (This is consistent with our view that strict layering often obscures
important design trade-offs.) A more traditional treatment of congestion control is
possible, however, by studying Section 6.2 in the context of Chapter 3 and Section 6.3
in the context of Chapter 5.
Significant effort has gone into improving the exercises in both the second and third
editions. In the second edition we greatly increased the number of problems and, based
on class testing, dramatically improved their quality. In this edition, we added a few
more exercises, but made two other important changes:
■ For those exercises that we feel are particularly challenging or require special
knowledge not provided in the book (e.g., probability expertise), we have
added an icon to indicate the extra level of difficulty.
■ In each chapter we added some extra representative exercises for which worked
solutions are provided in the back of the book. These exercises, marked with an icon,
are intended to provide some help in tackling the other exercises in the book.
The current sets of exercises are of several different styles:
■ Analytical exercises that ask the student to do simple algebraic calculations
that demonstrate their understanding of fundamental relationships
■ Design questions that ask the student to propose and evaluate protocols for
various circumstances
■ Hands-on questions that ask the student to write a few lines of code to test
an idea or to experiment with an existing network utility
■ Library research questions that ask the student to learn more about a particular topic
Also, as described in more detail below, socket-based programming assignments,
as well as simulation labs, are available online.
Supplemental Materials and Online Resources
To assist instructors, we have prepared an instructor’s manual that contains solutions
to selected exercises. The manual is available from the publisher.
Additional support materials, including lecture slides, figures from the text,
socket-based programming assignments, and sample exams and programming assignments are available through the Morgan Kaufmann Web site at
(search for Computer Networks). We suggest that you visit the page for this book
every few weeks, as we will be adding support materials and establishing links to
networking-related sites on a regular basis.
And finally, new with the third edition, a set of laboratory experiments supplements the book. These labs, developed by Professor Emad Aboelela from the University
of Massachusetts Dartmouth, use simulation to explore the behavior, scalability, and
performance of protocols covered in the book. The simulations use the OPNET simulation toolset, which is available for free to anyone using Computer Networks in their
This book would not have been possible without the help of many people. We would
like to thank them for their efforts in improving the end result. Before we do so,
however, we should mention that we have done our best to correct the mistakes that
the reviewers have pointed out and to accurately describe the protocols and mechanisms that our colleagues have explained to us. We alone are responsible for any
remaining errors. If you should find any of these, please send email to our publisher,
Morgan Kaufmann, at [email protected], and we will endeavor to correct them in
future printings of this book.
First, we would like to thank the many people who reviewed drafts of all or
parts of the manuscript. In addition to those who reviewed prior editions, we wish
to thank Carl Emberger, Isaac Ghansah, and Bobby Bhattacharjee for their thorough reviews. Thanks also to Peter Druschel, Limin Wang, Aki Nakao, Dave Oran,
George Swallow, Peter Lei, and Michael Ramalho for their reviews of various sections. We also wish to thank all those who provided feedback and input to help us
decide what to do in this edition: Chedley Aouriri, Peter Steenkiste, Esther A. Hughes,
Ping-Tsai Chung, Doug Szajda, Mark Andersland, Leo Tam, C. P. Watkins,
Brian L. Mark, Miguel A. Labrador, Gene Chase, Harry W. Tyrer, Robert Siegfried,
Harlan B. Russell, John R. Black, Robert Y. Ling, Julia Johnson, Karen Collins, Clark
Verbrugge, Monjy Rabemanantsoa, Kerry D. LaViolette, William Honig, Kevin Mills,
Murat Demirer, J Rufinus, Manton Matthews, Errin W. Fulp, Wayne Daniel, Luiz
DaSilva, Don Yates, Raouf Boules, Nick McKeown, Neil T. Spring, Kris Verma, Szuecs
Laszlo, Ted Herman, Mark Sternhagen, Zongming Fei, Dulal C. Kar, Mingyan Liu,
Ken Surendran, Rakesh Arya, Mario J. Gonzalez, Annie Stanton, Tim Batten, and Paul
Second, several members of the Network Systems Group at Princeton contributed
ideas, examples, corrections, data, and code to this book. In particular, we would like
to thank Andy Bavier, Tammo Spalink, Mike Wawrzoniak, Zuki Gottlieb, George
Tzanetakis, and Chad Mynhier. As before, we want to thank the Defense Advanced
Research Projects Agency, the National Science Foundation, Intel Corporation, and
Cisco Systems, Inc. for supporting our networking research over the past several years.
Third, we would like to thank our series editor, David Clark, as well as all
the people at Morgan Kaufmann who helped shepherd us through the book-writing
process. A special thanks is due to our original sponsoring editor, Jennifer Mann; our
editor for the third edition, Rick Adams; our developmental editor, Karyn Johnson;
and our production manager, Simon Crump. The whole crew at MKP has been a
delight to work with.
I must Create a System, or be enslav’d by another Man’s; I will not
Reason and Compare: my business is to Create.
—William Blake
Problem: Building a Network
Suppose you want to build a computer network, one that has the potential to
grow to global proportions and to support applications as diverse as teleconferencing, video-on-demand, electronic commerce, distributed computing, and
digital libraries. What available technologies would serve as the underlying building blocks, and what kind of software architecture would you design to integrate
these building blocks into an effective communication service? Answering this question is the overriding
goal of this book: to describe the available building materials and then
to show how they can be used to construct a network from the ground up.
Before we can understand how to design a computer network, we should first
agree on exactly what a computer network is. At one time, the term network meant
the set of serial lines used to attach dumb terminals to mainframe computers. To
some, the term implies the voice telephone network. To others, the only interesting
network is the cable network used to disseminate video signals. The main thing these
networks have in common is that they are specialized to handle one particular kind of
data (keystrokes, voice, or video) and they typically connect to special-purpose devices
(terminals, hand receivers, and television sets).
What distinguishes a computer network from these other types of networks?
Probably the most important characteristic of a computer network is its generality.
Computer networks are built primarily from general-purpose programmable hardware, and they are not optimized for a particular application like making phone calls or
delivering television signals. Instead, they are able to carry many different types of data,
and they support a wide, and ever-growing, range of applications. This chapter looks
at some typical applications of computer networks and
discusses the requirements that a network designer who
wishes to support such applications must be aware of.
Once we understand the requirements, how do we
proceed? Fortunately, we will not be building the first network. Others, most notably the community of researchers
responsible for the Internet, have gone before us. We will
use the wealth of experience generated from the Internet
to guide our design. This experience is embodied in a network architecture that identifies the available hardware
and software components and shows how they can be
arranged to form a complete network system.
To start us on the road toward understanding how
to build a network, this chapter does four things. First, it
explores the requirements that different applications and
different communities of people (such as network users
and network operators) place on the network. Second, it
introduces the idea of a network architecture, which lays
the foundation for the rest of the book. Third, it introduces some of the key elements in the implementation of
computer networks. Finally, it identifies the key metrics
that are used to evaluate the performance of computer networks.
1 Foundation
1.1 Applications
Most people know the Internet through its applications: the World Wide Web, email,
streaming audio and video, chat rooms, and music (file) sharing. The Web, for example,
presents an intuitively simple interface. Users view pages full of textual and graphical
objects, click on objects that they want to learn more about, and a corresponding new
page appears. Most people are also aware that just under the covers, each selectable
object on a page is bound to an identifier for the next page to be viewed. This identifier,
called a uniform resource locator (URL), uniquely names every possible page that can
be viewed from your Web browser. For example,
is the URL for a page representing this book at Morgan Kaufmann: The string http
indicates that the HyperText Transfer Protocol (HTTP) should be used to download
the page, is the name of the machine that serves the page, and pd3e
uniquely identifies the page at the publisher’s site.
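As a rough illustration of how a URL breaks down into the parts just described, the sketch below uses Python's standard urllib.parse module. The host name www.example.com is a placeholder assumption, not the publisher's actual server; only the page name pd3e comes from the text.

```python
from urllib.parse import urlsplit

# Hypothetical URL: the host is a placeholder; only "pd3e" appears in the text.
url = "http://www.example.com/pd3e"

parts = urlsplit(url)
scheme = parts.scheme            # protocol used to download the page
host = parts.hostname            # name of the machine that serves the page
page = parts.path.lstrip("/")    # identifies the page at that site

print(scheme, host, page)
```

The same three pieces (protocol, server name, page identifier) are what a browser needs before it can begin the message exchange described next.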
What most Web users are not aware of, however, is that by clicking on just one
such URL, as many as 17 messages may be exchanged over the Internet, and this
assumes the page itself is small enough to fit in a single message. This number includes
up to six messages to translate the server name ( into its Internet address
(, three messages to set up a Transmission Control Protocol (TCP)
connection between your browser and this server, four messages for your browser
to send the HTTP “get” request and the server to respond with the requested page
(and for each side to acknowledge receipt of that message), and four messages to tear
down the TCP connection. Of course, this does not include the millions of messages
exchanged by Internet nodes throughout the day, just to let each other know that they
exist and are ready to serve Web pages, translate names to addresses, and forward
messages toward their ultimate destination.
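The count of 17 is simply the sum of the per-phase upper bounds listed above. A minimal tally, using only the numbers given in the text (the phase labels are our own wording):

```python
# Upper bounds on the number of messages in each phase, as listed in the text.
messages = {
    "name translation": 6,
    "TCP connection setup": 3,
    "HTTP request and response (with acknowledgments)": 4,
    "TCP connection teardown": 4,
}

total = sum(messages.values())
for phase, count in messages.items():
    print(f"{phase}: up to {count}")
print("total:", total)
```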
Although not yet as common as surfing the Web, another emerging application
of the Internet is streaming audio and video. Although an entire video file could first
be fetched from a remote machine and then played on the local machine, similar to
the process of downloading and displaying a Web page, this would entail waiting for
the last second of the video file to be delivered before starting to look at it. Streaming
video implies that the sender and the receiver are, respectively, the source and the sink
for the video stream. That is, the source generates a video stream (perhaps using a
video capture card), sends it across the Internet in messages, and the sink displays the
stream as it arrives.
To be more precise, video is not an application; it is a type of data. One example
of a video application is video-on-demand, which reads a preexisting movie from disk
and transmits it over the network. Another kind of application is videoconferencing,
which is actually the more interesting case because it has very tight timing constraints.
Just as when using the telephone, the interactions among the participants must be
timely. When a person at one end gestures, then that action must be displayed at
the other end as quickly as possible. Too much delay makes the system unusable. In
contrast, with video-on-demand, if it takes several seconds from the time the user starts the video until the
first image is displayed, then the service is still deemed satisfactory. Also, interactive
video usually implies that video is flowing in both directions, while a video-on-demand
application is most likely sending video in only one direction.
The Unix application vic is an example of a popular videoconferencing tool.
Figure 1.1 shows the control panel for a vic session. Note that vic is actually one
of a suite of conferencing tools designed at Lawrence Berkeley Laboratory and
UC Berkeley. The others include a whiteboard application (wb) that allows users to
send sketches and slides to each other, a visual audio tool called vat, and a session
directory (sdr) that is used to create and advertise videoconferences. All these tools
run on Unix (hence their lowercase names) and are freely available on the Internet.
Similar tools are available for other operating systems.
Figure 1.1 The vic video application.
Although they are just two examples, downloading pages from the Web and
participating in a videoconference demonstrate the diversity of applications that can
be built on top of the Internet and hint at the complexity of the Internet’s design.
Starting from the beginning, and addressing one problem at a time, the rest of this
book explains how to build a network that supports such a wide range of applications.
Chapter 9 concludes the book by revisiting these two specific applications, as well as
several others that have become popular on today’s Internet.
1.2 Requirements
We have just established an ambitious goal for ourselves: to understand how to build
a computer network from the ground up. Our approach to accomplishing this goal
will be to start from first principles, and then ask the kinds of questions we would
naturally ask if building an actual network. At each step, we will use today’s protocols to illustrate various design choices available to us, but we will not accept these
existing artifacts as gospel. Instead, we will be asking (and answering) the question
of why networks are designed the way they are. While it is tempting to settle for just
understanding the way it’s done today, it is important to recognize the underlying concepts because networks are constantly changing as the technology evolves and new
applications are invented. It is our experience that once you understand the fundamental ideas, any new protocol that you are confronted with will be relatively easy to digest.
The first step is to identify the set of constraints and requirements that influence
network design. Before getting started, however, it is important to understand that the
expectations you have of a network depend on your perspective:
■ An application programmer would list the services that his or her application
needs, for example, a guarantee that each message the application sends will
be delivered without error within a certain amount of time.
■ A network designer would list the properties of a cost-effective design, for
example, that network resources are efficiently utilized and fairly allocated to
different users.
■ A network provider would list the characteristics of a system that is easy to
administer and manage, for example, in which faults can be easily isolated
and where it is easy to account for usage.
This section attempts to distill these different perspectives into a high-level
introduction to the major considerations that drive network design, and in doing
so, identifies the challenges addressed throughout the rest of this book.
Starting with the obvious, a network must provide connectivity among a set of computers. Sometimes it is enough to build a limited network that connects only a few
select machines. In fact, for reasons of privacy and security, many private (corporate)
networks have the explicit goal of limiting the set of machines that are connected. In
contrast, other networks (of which the Internet is the prime example) are designed
to grow in a way that allows them the potential to connect all the computers in the
world. A system that is designed to support growth to an arbitrarily large size is said
to scale. Using the Internet as a model, this book addresses the challenge of scalability.
Links, Nodes, and Clouds
Network connectivity occurs at many different levels. At the lowest level, a network
can consist of two or more computers directly connected by some physical medium,
such as a coaxial cable or an optical fiber. We call such a physical medium a link, and
we often refer to the computers it connects as nodes. (Sometimes a node is a more
specialized piece of hardware rather than a computer, but we overlook that distinction
for the purposes of this discussion.) As illustrated in Figure 1.2, physical links are
sometimes limited to a pair of nodes (such a link is said to be point-to-point), while
in other cases, more than two nodes may share a single physical link (such a link is
said to be multiple access). Whether a given link supports point-to-point or multiple-access connectivity depends on how the node is attached to the link. It is also the case
that multiple-access links are often limited in size, in terms of both the geographical
distance they can cover and the number of nodes they can connect. The exception is
a satellite link, which can cover a wide geographic area.
Figure 1.2 Direct links: (a) point-to-point; (b) multiple-access.
Figure 1.3 Switched network.
If computer networks were limited to situations in which all nodes are directly
connected to each other over a common physical medium, then either networks would
be very limited in the number of computers they could connect, or the number of wires
coming out of the back of each node would quickly become both unmanageable and
very expensive. Fortunately, connectivity between two nodes does not necessarily imply
a direct physical connection between them—indirect connectivity may be achieved
among a set of cooperating nodes. Consider the following two examples of how a
collection of computers can be indirectly connected.
Figure 1.3 shows a set of nodes, each of which is attached to one or more point-to-point links. Those nodes that are attached to at least two links run software that forwards data received on one link out on another. If organized in a systematic way, these
forwarding nodes form a switched network. There are numerous types of switched networks, of which the two most common are circuit switched and packet switched. The
former is most notably employed by the telephone system, while the latter is used for
the overwhelming majority of computer networks and will be the focus of this book.
The important feature of packet-switched networks is that the nodes in such a network
send discrete blocks of data to each other. Think of these blocks of data as corresponding to some piece of application data such as a file, a piece of email, or an image. We
call each block of data either a packet or a message, and for now we use these terms
interchangeably; we discuss the reason they are not always the same in Section 1.2.2.
Packet-switched networks typically use a strategy called store-and-forward. As
the name suggests, each node in a store-and-forward network first receives a complete
packet over some link, stores the packet in its internal memory, and then forwards
the complete packet to the next node. In contrast, a circuit-switched network first
establishes a dedicated circuit across a sequence of links and then allows the source
node to send a stream of bits across this circuit to a destination node. The major
reason for using packet switching rather than circuit switching in a computer network
is efficiency, discussed in the next subsection.
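The store-and-forward behavior described above can be sketched in a few lines. This is a toy model under our own assumptions: the Node class, its fields, and the three-node chain are hypothetical and exist only to show that each node holds a complete packet before passing it on.

```python
from collections import deque

class Node:
    """A hypothetical store-and-forward node with at most one outgoing neighbor."""

    def __init__(self, name, next_node=None):
        self.name = name
        self.next_node = next_node
        self.buffer = deque()   # internal memory holding complete packets
        self.delivered = []     # packets whose journey ends at this node

    def receive(self, packet):
        # Store: the complete packet is held before anything is sent on.
        self.buffer.append(packet)

    def forward(self):
        # Forward: hand each stored packet, whole, to the next node.
        while self.buffer:
            packet = self.buffer.popleft()
            if self.next_node is None:
                self.delivered.append(packet)
            else:
                self.next_node.receive(packet)

dest = Node("destination")
switch = Node("switch", next_node=dest)
source = Node("source", next_node=switch)

source.receive(b"hello")
for node in (source, switch, dest):
    node.forward()
print(dest.delivered)
```

A circuit-switched model would instead set up the whole path first and then pass a stream of bits along it without buffering complete packets at each hop.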
The cloud in Figure 1.3 distinguishes between the nodes on the inside that
implement the network (they are commonly called switches, and their sole function is to store and forward packets) and the nodes on the outside of the cloud that
use the network (they are commonly called hosts, and they support users and run
application programs). Also note that the cloud in Figure 1.3 is one of the most
important icons of computer networking. In general, we use a cloud to denote any
type of network, whether it is a single point-to-point link, a multiple-access link, or a
switched network. Thus, whenever you see a cloud used in a figure, you can think of
it as a placeholder for any of the networking technologies covered in this book.
A second way in which a set of computers can be indirectly connected is shown in
Figure 1.4. In this situation, a set of independent networks (clouds) are interconnected
to form an internetwork, or internet for short. We adopt the Internet’s convention
of referring to a generic internetwork of networks as a lowercase i internet, and the
currently operational TCP/IP Internet as the capital I Internet. A node that is connected
to two or more networks is commonly called a router or gateway, and it plays much
the same role as a switch: it forwards messages from one network to another. Note
that an internet can itself be viewed as another kind of network, which means that an
internet can be built from an interconnection of internets. Thus, we can recursively
build arbitrarily large networks by interconnecting clouds to form larger clouds.
Figure 1.4 Interconnection of networks.
Just because a set of hosts are directly or indirectly connected to each other does
not mean that we have succeeded in providing host-to-host connectivity. The final
requirement is that each node must be able to say which of the other nodes on the
network it wants to communicate with. This is done by assigning an address to each
node. An address is a byte string that identifies a node; that is, the network can use
a node’s address to distinguish it from the other nodes connected to the network.
When a source node wants the network to deliver a message to a certain destination
node, it specifies the address of the destination node. If the sending and receiving
nodes are not directly connected, then the switches and routers of the network use this
address to decide how to forward the message toward the destination. The process
of determining systematically how to forward messages toward the destination node
based on its address is called routing.
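To make the idea of forwarding by address concrete, here is a minimal sketch that builds a next-hop table for one destination and then follows it. The topology, the node names, and the use of breadth-first search are all our own assumptions for illustration; the routing algorithms actually used in networks are a separate topic.

```python
from collections import deque

# Hypothetical topology: each node maps to its directly connected neighbors.
links = {
    "A": ["S1"],
    "B": ["S2"],
    "C": ["S3"],
    "S1": ["A", "S2", "S3"],
    "S2": ["B", "S1"],
    "S3": ["C", "S1"],
}

def next_hops(destination):
    """Return {node: neighbor to forward to} for messages addressed to destination."""
    table = {}
    visited = {destination}
    queue = deque([destination])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                table[neighbor] = node   # neighbor reaches destination via node
                queue.append(neighbor)
    return table

def route(source, destination):
    """Follow next-hop entries from source until the destination is reached."""
    table = next_hops(destination)
    path = [source]
    while path[-1] != destination:
        path.append(table[path[-1]])
    return path

print(route("A", "C"))
```

Each intermediate node consults only the destination address and its own table entry, which is the essence of the forwarding decision described above.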
This brief introduction to addressing and routing has presumed that the source
node wants to send a message to a single destination node (unicast). While this is
the most common scenario, it is also possible that the source node might want to
broadcast a message to all the nodes on the network. Or a source node might want
to send a message to some subset of the other nodes, but not all of them, a situation
called multicast. Thus, in addition to node-specific addresses, another requirement of
a network is that it support multicast and broadcast addresses.
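The three kinds of address differ only in the set of nodes that should receive the message. A small sketch, in which the node names, the group name, and the reserved broadcast address are all hypothetical:

```python
nodes = {"n1", "n2", "n3", "n4"}

# Hypothetical multicast group: a named subset of the nodes.
groups = {"video-group": {"n2", "n4"}}

BROADCAST = "all"   # hypothetical reserved broadcast address

def recipients(source, address):
    """Return the set of nodes that should receive a message sent to address."""
    if address == BROADCAST:
        return nodes - {source}            # every other node on the network
    if address in groups:
        return groups[address] - {source}  # only the members of the group
    if address in nodes:
        return {address}                   # unicast: a single destination
    return set()                           # unknown address: nobody

print(recipients("n1", "n3"))
print(recipients("n1", "video-group"))
print(recipients("n1", BROADCAST))
```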
The main idea to take away from this discussion is that we can define a network
recursively as consisting of two or more nodes connected by a physical link, or as two
or more networks connected by a node. In other words, a network can be constructed
from a nesting of networks, where at the bottom level, the network is implemented by
some physical medium. One of the key challenges in providing network connectivity is
to define an address for each node that is reachable on the network (including support
for broadcast and multicast connectivity), and to be able to use this address to route
messages toward the appropriate destination node(s).
Cost-Effective Resource Sharing
As stated above, this book focuses on packet-switched networks. This section explains
the key requirement of computer networks—efficiency—that leads us to packet switching as the strategy of choice.
Given a collection of nodes indirectly connected by a nesting of networks, it is
possible for any pair of hosts to send messages to each other across a sequence of
links and nodes. Of course, we want to do more than support just one pair of communicating hosts—we want to provide all pairs of hosts with the ability to exchange
messages. The question then is, How do all the hosts that want to communicate share
the network, especially if they want to use it at the same time? And, as if that problem
isn’t hard enough, how do several hosts share the same link when they all want to use
it at the same time?
To understand how hosts share a network, we need to introduce a fundamental
concept, multiplexing, which means that a system resource is shared among multiple
users. At an intuitive level, multiplexing can be explained by analogy to a timesharing
computer system, where a single physical CPU is shared (multiplexed) among multiple
jobs, each of which believes it has its own private processor. Similarly, data being sent
by multiple users can be multiplexed over the physical links that make up a network.
To see how this might work, consider the simple network illustrated in Figure 1.5,
where the three hosts on the left side of the network (L1–L3) are sending data to the
three hosts on the right (R1–R3) by sharing a switched network that contains only
one physical link. (For simplicity, assume that host L1 is communicating with host R1,
and so on.) In this situation, three flows of data—corresponding to the three pairs of
hosts—are multiplexed onto a single physical link by switch 1 and then demultiplexed
back into separate flows by switch 2. Note that we are being intentionally vague about
exactly what a “flow of data” corresponds to. For the purposes of this discussion,
assume that each host on the left has a large supply of data that it wants to send to its
counterpart on the right.
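The multiplex and demultiplex steps performed by the two switches can be sketched as follows. Tagging each unit of data with its source host is our own assumption about how switch 2 tells the flows apart; the text deliberately leaves that detail open, and the data values are invented.

```python
from itertools import zip_longest

# Data each left-hand host wants to send to its counterpart (L1 to R1, and so on).
flows = {
    "L1": ["a1", "a2", "a3"],
    "L2": ["b1", "b2"],
    "L3": ["c1", "c2", "c3"],
}
peer = {"L1": "R1", "L2": "R2", "L3": "R3"}

def multiplex(flows):
    """Switch 1: merge the flows onto one link, tagging each unit with its source."""
    link = []
    for units in zip_longest(*flows.values()):
        for source, unit in zip(flows, units):
            if unit is not None:
                link.append((source, unit))
    return link

def demultiplex(link):
    """Switch 2: split the shared link back into one flow per destination."""
    received = {dest: [] for dest in peer.values()}
    for source, unit in link:
        received[peer[source]].append(unit)
    return received

shared_link = multiplex(flows)
print(shared_link)
print(demultiplex(shared_link))
```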
There are several different methods for multiplexing multiple flows onto one physical link. One common method is synchronous time-division multiplexing (STDM).
The idea of STDM is to divide time into equal-sized quanta and, in a round-robin
fashion, give each flow a chance to send its data over the physical link. In other words,
during time quantum 1, data from the first flow is transmitted; during time quantum
2, data from the second flow is transmitted; and so on. This process continues until all
the flows have had a turn, at which time the first flow gets to go again, and the process
repeats. Another method is frequency-division multiplexing (FDM). The idea of
FDM is to transmit each flow over the physical link at a different frequency, much the
same way that the signals for different TV stations are transmitted at a different
frequency on a physical cable TV link.
Figure 1.5 Multiplexing multiple logical flows over a single physical link.
Although simple to understand, both STDM and FDM are limited in two ways.
First, if one of the flows (host pairs) does not have any data to send, its share of the
physical link (that is, its time quantum or its frequency) remains idle, even if one of the
other flows has data to transmit. For computer communication, the amount of time
that a link is idle can be very large; for example, consider the amount of time you
spend reading a Web page (leaving the link idle) compared to the time you spend
fetching the page. Second, both STDM and FDM are limited to situations in which the
maximum number of flows is fixed and known ahead of time. It is not practical to resize the
quantum or to add additional quanta in the case of STDM or to add new frequencies in
the case of FDM.
Sidebar: SANs, LANs, MANs, and WANs
One way to characterize networks is according to their size. Two well-known examples
are LANs (local area networks) and WANs (wide area networks); the former typically
extend less than 1 km, while the latter can be worldwide. Other networks are classified
as MANs (metropolitan area networks), which usually span tens of kilometers. The reason
such classifications are interesting is that the size of a network often has implications
for the underlying technology that can be used, with a key factor being the amount of
time it takes for data to propagate from one end of the network to the other; we discuss
this issue more in later chapters.
An interesting historical note is that the term “wide area network” was not applied to the
first WANs because there was no other sort of network to differentiate them from. When
computers were incredibly rare and expensive, there was no point in thinking about how
to connect all the computers in the local area: there was only one computer in that area.
Only as computers began to proliferate did LANs become necessary, and the term “WAN”
was then introduced to describe the larger networks that interconnected geographically
distant computers.
Another kind of network that we need to be aware of is SANs (system area networks).
SANs are usually confined to a single room and connect the various components of a
large computing system. For example, HiPPI (High Performance Parallel Interface) and
Fiber Channel are two common SAN technologies used to connect massively parallel
processors to scalable storage servers and data vaults. (Because they often connect
computers to storage servers, SANs are sometimes defined as storage area networks.)
Although this book does not describe such networks in detail, they are worth knowing
about because they are often at the leading edge in terms of performance, and because
it is increasingly common to connect such networks into LANs and WANs.
The form of multiplexing that we make most use of in this book is called
statistical multiplexing. Although the name is not all that helpful for understanding
the concept, statistical multiplexing is really quite simple, with two key ideas. First,
it is like STDM in that the physical link is shared over time: first data from one
flow is transmitted over the physical link,
then data from another flow is transmitted, and so on. Unlike STDM, however, data is
transmitted from each flow on demand rather than during a predetermined time slot.
Thus, if only one flow has data to send, it gets to transmit that data without waiting
for its quantum to come around and thus without having to watch the quanta assigned
to the other flows go by unused. It is this avoidance of idle time that gives packet
switching its efficiency.
As defined so far, however, statistical multiplexing has no mechanism to ensure
that all the flows eventually get their turn to transmit over the physical link. That is, once
a flow begins sending data, we need some way to limit the transmission, so that the
other flows can have a turn. To account for this need, statistical multiplexing defines an
upper bound on the size of the block of data that each flow is permitted to transmit at a
given time. This limited-size block of data is typically referred to as a packet, to
distinguish it from the arbitrarily large message that an application program might want
to transmit. Because a packet-switched network limits the maximum size of packets,
a host may not be able to send a complete message in one packet. The source may
need to fragment the message into several packets, with the receiver reassembling the
packets back into the original message.
In other words, each flow sends a sequence of packets over the physical link,
with a decision made on a packet-by-packet basis as to which flow’s packet to send next.
Notice that if only one flow has data to send, then it can send a sequence of packets
back-to-back. However, should more than one of the flows have data to send, then their
packets are interleaved on the link. Figure 1.6 depicts a switch multiplexing packets from
multiple sources onto a single shared link.
Figure 1.6 A switch multiplexing packets from multiple sources onto one shared link.
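The efficiency argument for statistical multiplexing over STDM can be checked with a small slot-counting sketch. The model is our own simplification: time is divided into slots, each slot carries one unit of data, and the backlog figures are invented. It counts how many slots pass before all data has been sent.

```python
from collections import deque

def stdm_slots(backlogs):
    """Slots needed when each flow may send only in its own fixed slot."""
    queues = [deque(range(n)) for n in backlogs]
    slot = 0
    while any(queues):
        owner = slot % len(queues)      # this slot is reserved for one flow
        if queues[owner]:
            queues[owner].popleft()     # the owner sends one unit
        # otherwise the slot goes by unused
        slot += 1
    return slot

def statistical_slots(backlogs):
    """Slots needed when any flow with data may use the next slot."""
    queues = [deque(range(n)) for n in backlogs]
    slot = 0
    turn = 0
    while any(queues):
        # Skip flows with nothing to send instead of idling the link.
        while not queues[turn % len(queues)]:
            turn += 1
        queues[turn % len(queues)].popleft()
        turn += 1
        slot += 1
    return slot

backlogs = [6, 0, 0]   # only the first of three flows has data to send
print(stdm_slots(backlogs), statistical_slots(backlogs))
```

When every flow has the same amount of data the two schemes take the same number of slots; the difference appears only when some flows are idle, which is the common case for computer traffic.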
The decision as to which packet to send next on a shared link can be made in a
number of different ways. For example, in a network consisting of switches interconnected by links such as the one in Figure 1.5, the decision would be made by
the switch that transmits packets onto the shared link. (As we will see later, not all
packet-switched networks actually involve switches, and they may use other mechanisms to determine whose packet goes onto the link next.) Each switch in a packet-switched network makes this decision independently, on a packet-by-packet basis.
One of the issues that faces a network designer is how to make this decision in a
fair manner. For example, a switch could be designed to service packets on a first-in-first-out (FIFO) basis. Another approach would be to service the different flows
in a round-robin manner, just as in STDM. This might be done to ensure that certain flows receive a particular share of the link’s bandwidth, or that they never have
their packets delayed in the switch for more than a certain length of time. A network that allows flows to request such treatment is said to support quality of service (QoS).
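The difference between the two service disciplines shows up as soon as one flow sends a burst. The sketch below is a simplified model that assumes all packets are already waiting at the switch; the flow names and arrival order are invented for illustration.

```python
from collections import deque

# Arrival order at the switch: (flow, packet number). Flow "A" sends a burst.
arrivals = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("C", 1)]

def fifo(arrivals):
    """Send packets strictly in arrival order."""
    return list(arrivals)

def round_robin(arrivals):
    """Keep one queue per flow and visit the queues in turn."""
    queues = {}
    for flow, number in arrivals:
        queues.setdefault(flow, deque()).append((flow, number))
    order = []
    while any(queues.values()):
        for queue in queues.values():
            if queue:
                order.append(queue.popleft())
    return order

print(fifo(arrivals))
print(round_robin(arrivals))
```

Under FIFO, flows B and C wait behind the whole burst from A; under round-robin, each gets a packet onto the link before A sends its second one.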
Also, notice in Figure 1.6 that since the switch has to multiplex three incoming
packet streams onto one outgoing link, it is possible that the switch will receive packets
faster than the shared link can accommodate. In this case, the switch is forced to buffer
these packets in its memory. Should a switch receive packets faster than it can send them
for an extended period of time, then the switch will eventually run out of buffer space,
and some packets will have to be dropped. When a switch is operating in this state,
it is said to be congested.
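A back-of-the-envelope model shows how sustained overload turns into dropped packets. All the numbers here (arrival rate, link rate, buffer size, and the notion of a discrete tick) are assumptions chosen only to make the effect visible.

```python
def run_switch(arrivals_per_tick, sends_per_tick, buffer_size, ticks):
    """Return (packets sent, packets dropped, packets still buffered)."""
    buffered = sent = dropped = 0
    for _ in range(ticks):
        # Arriving packets are buffered while space remains; the rest are dropped.
        space = buffer_size - buffered
        accepted = min(arrivals_per_tick, space)
        dropped += arrivals_per_tick - accepted
        buffered += accepted
        # The outgoing link drains at most sends_per_tick packets per tick.
        outgoing = min(sends_per_tick, buffered)
        buffered -= outgoing
        sent += outgoing
    return sent, dropped, buffered

# Three incoming packets per tick, but the shared link carries only one.
print(run_switch(arrivals_per_tick=3, sends_per_tick=1, buffer_size=4, ticks=5))
```

When the arrival rate matches the link rate, nothing is dropped and the buffer stays empty; drops begin only once the buffer has filled.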
The bottom line is that statistical multiplexing defines a cost-effective way for
multiple users (e.g., host-to-host flows of data) to share network resources (links and
nodes) in a fine-grained manner. It defines the packet as the granularity with which the
links of the network are allocated to different flows, with each switch able to schedule
the use of the physical links it is connected to on a per-packet basis. Fairly allocating
link capacity to different flows and dealing with congestion when it occurs are the key
challenges of statistical multiplexing.
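Because the packet is the unit of allocation, a message larger than the maximum packet size has to be split by the source and rebuilt by the receiver, as noted earlier in this section. A minimal sketch follows; the 4-byte limit and the use of sequence numbers to restore order are our own assumptions, not details given in the text.

```python
MAX_PACKET_SIZE = 4   # hypothetical upper bound on payload bytes per packet

def fragment(message, max_size=MAX_PACKET_SIZE):
    """Split a message into (sequence number, payload) packets."""
    return [
        (seq, message[start:start + max_size])
        for seq, start in enumerate(range(0, len(message), max_size))
    ]

def reassemble(packets):
    """Rebuild the message, tolerating packets that arrive out of order."""
    return b"".join(payload for _, payload in sorted(packets))

message = b"hello, network"
packets = fragment(message)
print(packets)
print(reassemble(reversed(packets)))
```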
Support for Common Services
While the previous section outlined the challenges involved in providing cost-effective
connectivity among a group of hosts, it is overly simplistic to view a computer network
as simply delivering packets among a collection of computers. It is more accurate to
think of a network as providing the means for a set of application processes that are
distributed over those computers to communicate. In other words, the next requirement of a computer network is that the application programs running on the hosts
connected to the network must be able to communicate in a meaningful way.
When two application programs need to communicate with each other, there
are a lot of complicated things that need to happen beyond simply sending a message from one host to another. One option would be for application designers to
build all that complicated functionality into each application program. However, since
many applications need common services, it is much more logical to implement those
common services once and then to let the application designer build the application
using those services. The challenge for a network designer is to identify the right set
of common services. The goal is to hide the complexity of the network from the application without overly constraining the application designer.
Intuitively, we view the network as providing logical channels over which
application-level processes can communicate with each other; each channel provides
the set of services required by that application. In other words, just as we use a cloud
to abstractly represent connectivity among a set of computers, we now think of a channel as connecting one process to another. Figure 1.7 shows a pair of application-level
processes communicating over a logical channel that is, in turn, implemented on top
of a cloud that connects a set of hosts. We can think of the channel as being like a
pipe connecting two applications, so that a sending application can put data in one
end and expect that data to be delivered by the network to the application at the other
end of the pipe.
The challenge is to recognize what functionality the channels should provide
to application programs. For example, does the application require a guarantee that
messages sent over the channel are delivered, or is it acceptable if some messages fail to
arrive? Is it necessary that messages arrive at the recipient process in the same order in
which they are sent, or does the recipient not care about the order in which messages
arrive? Does the network need to ensure that no third parties are able to eavesdrop
on the channel, or is privacy not a concern? In general, a network provides a variety
of different types of channels, with each application selecting the type that best meets
its needs. The rest of this section illustrates the thinking involved in defining useful
channels.
Figure 1.7 Processes communicating over an abstract channel.
Identifying Common Communication Patterns
Designing abstract channels involves first understanding the communication needs
of a representative collection of applications, then extracting their common communication requirements, and finally incorporating the functionality that meets these
requirements in the network.
One of the earliest applications supported on any network is a file access program like FTP (File Transfer Protocol) or NFS (Network File System). Although many
details vary—for example, whether whole files are transferred across the network or
only single blocks of the file are read/written at a given time—the communication
component of remote file access is characterized by a pair of processes, one that requests that a file be read or written and a second process that honors this request. The
process that requests access to the file is called the client, and the process that supports
access to the file is called the server.
Reading a file involves the client sending a small request message to a server and
the server responding with a large message that contains the data in the file. Writing
works in the opposite way—the client sends a large message containing the data to be
written to the server, and the server responds with a small message confirming that the
write to disk has taken place. A digital library, as exemplified by the World Wide Web,
is another application that behaves in a similar way: A client process makes a request,
and a server process responds by returning the requested data.
Using file access, a digital library, and the two video applications described in the
introduction (videoconferencing and video-on-demand) as a representative sample, we
might decide to provide the following two types of channels: request/reply channels
and message stream channels. The request/reply channel would be used by the file
transfer and digital library applications. It would guarantee that every message sent
by one side is received by the other side and that only one copy of each message is
delivered. The request/reply channel might also protect the privacy and integrity of the
data that flows over it, so that unauthorized parties cannot read or modify the data
being exchanged between the client and server processes.
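One way to picture the "every message is received, and only one copy is delivered" guarantee is a sender that retransmits until it is acknowledged, paired with a receiver that suppresses duplicates. The sketch below is a minimal illustration under assumptions that are not in the text: the message identifiers, the acknowledgment, the class name `Receiver`, and the scripted `outcomes` list that stands in for an unreliable network are all hypothetical.

```python
class Receiver:
    """Receiving side of a hypothetical request/reply channel."""

    def __init__(self):
        self.seen_ids = set()     # message ids already delivered
        self.delivered = []       # what the application actually sees

    def receive(self, msg_id, data):
        # Deliver each message at most once, but acknowledge every copy.
        if msg_id not in self.seen_ids:
            self.seen_ids.add(msg_id)
            self.delivered.append(data)
        return msg_id             # the acknowledgment


def send_reliably(receiver, msg_id, data, outcomes):
    """Retransmit until an acknowledgment comes back.

    `outcomes` scripts what the simulated network does to each attempt:
    "lose_msg", "lose_ack", or "ok". Returns the number of attempts used.
    """
    for attempt, outcome in enumerate(outcomes, start=1):
        if outcome == "lose_msg":
            continue              # message never arrived: retry
        ack = receiver.receive(msg_id, data)
        if outcome == "lose_ack":
            continue              # arrived, but the sender cannot tell: retry
        if ack == msg_id:
            return attempt
    raise TimeoutError("no acknowledgment received")
```

If the first copy is lost and the acknowledgment for the second is lost, the message reaches the receiver twice, yet the application sees it once.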
The message stream channel could be used by both the video-on-demand and
videoconferencing applications, provided it is parameterized to support both one-way
and two-way traffic and to support different delay properties. The message stream
channel might not need to guarantee that all messages are delivered, since a video application can operate adequately even if some frames are not received. It would, however,
need to ensure that those messages that are delivered arrive in the same order in which
they were sent, to avoid displaying frames out of sequence. Like the request/reply
channel, the message stream channel might want to ensure the privacy and integrity of
the video data. Finally, the message stream channel might need to support multicast,
so that multiple parties can participate in the teleconference or view the video.
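The ordering property of the message stream channel can be sketched as follows. This is an illustrative assumption about how a receiver might behave, not a mechanism given in the text: the per-frame sequence numbers and the function name `deliver_stream` are hypothetical.

```python
def deliver_stream(arrivals):
    """Return the frames a toy message stream receiver would display.

    `arrivals` is a list of (sequence_number, frame) pairs in the order the
    network happened to deliver them. Missing frames are tolerated; a frame
    that shows up after a later one has been delivered is discarded.
    """
    delivered = []
    highest_seen = -1
    for seq, frame in arrivals:
        if seq > highest_seen:
            delivered.append(frame)
            highest_seen = seq
        # Otherwise the frame is stale or a duplicate and is silently dropped.
    return delivered
```

Here frame 4 is never received and frame 2 arrives late, so the receiver displays frames 0, 1, 3, and 5 in order rather than showing frame 2 out of sequence.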
While it is common for a network designer to strive for the smallest number
of abstract channel types that can serve the largest number of applications, there is
a danger in trying to get away with too few channel abstractions. Simply stated, if
you have a hammer, then everything looks like a nail. For example, if all you have
are message stream and request/reply channels, then it is tempting to use them for the
next application that comes along, even if neither type provides exactly the semantics
needed by the application. Thus, network designers will probably be inventing new
types of channels—and adding options to existing channels—for as long as application
programmers are inventing new applications.
Also note that independent of exactly what functionality a given channel provides, there is the question of where that functionality is implemented. In many cases,
it is easiest to view the host-to-host connectivity of the underlying network as simply
providing a bit pipe, with any high-level communication semantics provided at the
end hosts. The advantage of this approach is that it keeps the switches in the middle of the
network as simple as possible—they simply forward packets—but it requires the end
hosts to take on much of the burden of supporting semantically rich process-to-process
channels. The alternative is to push additional functionality onto the switches, thereby
allowing the end hosts to be “dumb” devices (e.g., telephone handsets). We will see this
question of how various network services are partitioned between the packet switches
and the end hosts (devices) as a recurring issue in network design.
As suggested by the examples just considered, reliable message delivery is one of the
most important functions that a network can provide. It is difficult to determine how
to provide this reliability, however, without first understanding how networks can fail.
The first thing to recognize is that computer networks do not exist in a perfect world.
Machines crash and later are rebooted, fibers are cut, electrical interference corrupts
bits in the data being transmitted, switches run out of buffer space, and if these sorts
of physical problems aren’t enough to worry about, the software that manages the
hardware sometimes forwards packets into oblivion. Thus, a major requirement of a
network is to mask (hide) certain kinds of failures, so as to make the network appear
more reliable than it really is to the application programs using it.
There are three general classes of failure that network designers have to worry
about. First, as a packet is transmitted over a physical link, bit errors may be introduced
into the data; that is, a 1 is turned into a 0 or vice versa. Sometimes single bits are
corrupted, but more often than not, a burst error occurs—several consecutive bits are
corrupted. Bit errors typically occur because outside forces, such as lightning strikes,
power surges, and microwave ovens, interfere with the transmission of data. The good
news is that such bit errors are fairly rare, affecting on average only one out of every
10^6 to 10^7 bits on a typical copper-based cable and one out of every 10^12 to 10^14
bits on a typical optical fiber. As we will see, there are techniques that detect these bit
errors with high probability. Once detected, it is sometimes possible to correct for such
errors—if we know which bit or bits are corrupted, we can simply flip them—while
in other cases the damage is so bad that it is necessary to discard the entire packet. In
such a case, the sender may be expected to retransmit the packet.
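To get a feel for these error rates, the short calculation below estimates how often a whole packet is hit by at least one bit error, and shows a single parity bit catching a one-bit corruption. It is a sketch under assumptions that go beyond the text: bit errors are treated as independent (the text notes that bursts are common, so this is a simplification), the 1500-byte packet size is chosen only for illustration, and the function names are hypothetical. Parity is only the simplest detection technique; it misses any even number of flipped bits.

```python
import math


def packet_error_probability(bit_error_rate, packet_bits):
    """P(at least one bit error) = 1 - (1 - p)^n, assuming independent errors."""
    # expm1/log1p keep precision when p is as small as 1e-14.
    return -math.expm1(packet_bits * math.log1p(-bit_error_rate))


def parity_bit(data: bytes) -> int:
    """Even-parity bit: 1 if the number of 1 bits in data is odd, else 0."""
    return sum(bin(byte).count("1") for byte in data) % 2


PACKET_BITS = 1500 * 8  # hypothetical packet size

copper = packet_error_probability(1e-7, PACKET_BITS)   # roughly 1.2e-3
fiber = packet_error_probability(1e-12, PACKET_BITS)   # roughly 1.2e-8

original = b"hello"
corrupted = bytes([original[0] ^ 0x01]) + original[1:]  # flip one bit
error_detected = parity_bit(original) != parity_bit(corrupted)
```

Under these assumptions, roughly one packet in a thousand is damaged on the copper link at the worse end of the quoted range, against about one in a hundred million on fiber.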
The second class of failure is at the packet, rather than the bit, level; that is, a
complete packet is lost by the network. One reason this can happen is that the packet
contains an uncorrectable bit error and therefore has to be discarded. A more likely
reason, however, is that one of the nodes that has to handle the packet—for example,
a switch that is forwarding it from one link to another—is so overloaded that it has
no place to store the packet, and therefore is forced to drop it. This is the problem of
congestion mentioned in Section 1.2.2. Less commonly, the software running on one of
the nodes that handles the packet makes a mistake. For example, it might incorrectly
forward a packet out on the wrong link, so that the packet never finds its way to the
ultimate destination. As we will see, one of the main difficulties in dealing with lost
packets is distinguishing between a packet that is indeed lost and one that is merely
late in arriving at the destination.
The third class of failure is at the node and link level; that is, a physical link is cut,
or the computer it is connected to crashes. This can be caused by software that crashes,
a power failure, or a reckless backhoe operator. While such failures can eventually be
corrected, they can have a dramatic effect on the network for an extended period of
time. However, they need not totally disable the network. In a packet-switched network, for example, it is sometimes possible to route around a failed node or link. One of
the difficulties in dealing with this third class of failure is distinguishing between a failed
computer and one that is merely slow, or in the case of a link, between one that has been
cut and one that is very flaky and therefore introducing a high number of bit errors.
The key idea to take away from this discussion is that defining useful channels involves both understanding the applications’ requirements and recognizing the
limitations of the underlying technology. The challenge is to fill in the gap between
what the application expects and what the underlying technology can provide. This is
sometimes called the semantic gap.
1.3 Network Architecture
In case you hadn’t noticed, the previous section established a pretty substantial set of
requirements for network design: a computer network must provide general, cost-effective, fair, and robust connectivity among a large number of computers. As if this
weren’t enough, networks do not remain fixed at any single point in time, but must
evolve to accommodate changes in both the underlying technologies upon which they
are based as well as changes in the demands placed on them by application programs.
Designing a network to meet these requirements is no small task.
To help deal with this complexity, network designers have developed general
blueprints—usually called a network architecture—that guide the design and implementation of networks. This section defines more carefully what we mean by a network
architecture by introducing the central ideas that are common to all network architectures. It also introduces two of the most widely referenced architectures—the OSI
architecture and the Internet architecture.
Figure 1.8 Example of a layered network system (layers, from the top: application programs, process-to-process channels, host-to-host connectivity).
Layering and Protocols
When the system gets complex, the system designer introduces another level of abstraction. The idea of an abstraction is to define a unifying model that can capture
some important aspect of the system, encapsulate this model in an object that provides
an interface that can be manipulated by other components of the system, and hide the
details of how the object is implemented from the users of the object. The challenge
is to identify abstractions that simultaneously provide a service that proves useful in a
large number of situations and that can be efficiently implemented in the underlying
system. This is exactly what we were doing when we introduced the idea of a channel
in the previous section: We were providing an abstraction for applications that hides
the complexity of the network from application writers.
Abstractions naturally lead to layering, especially in network systems. The general idea is that you start with the services offered by the underlying hardware, and
then add a sequence of layers, each providing a higher (more abstract) level of service. The services provided at the high layers are implemented in terms of the services
provided by the low layers. Drawing on the discussion of requirements given in the
previous section, for example, we might imagine a network as having two layers of
abstraction sandwiched between the application program and the underlying hardware, as illustrated in Figure 1.8. The layer immediately above the hardware in this
case might provide host-to-host connectivity, abstracting away the fact that there may
be an arbitrarily complex network topology between any two hosts. The next layer up
builds on the available host-to-host communication service and provides support for
process-to-process channels, abstracting away the fact that the network occasionally
loses messages, for example.
Layering provides two nice features. First, it decomposes the problem of building
a network into more manageable components. Rather than implementing a monolithic
piece of software that does everything you will ever want, you can implement several
layers, each of which solves one part of the problem. Second, it provides a more modular design. If you decide that you want to add some new service, you may only need
to modify the functionality at one layer, reusing the functions provided at all the other
layers.
Figure 1.9 Layered system with alternative abstractions available at a given layer (request/reply and message stream channels side by side, between the application programs and host-to-host connectivity).
Thinking of a system as a linear sequence of layers is an oversimplification,
however. Many times there are multiple abstractions provided at any given level of
the system, each providing a different service to the higher layers but building on the
same low-level abstractions. To see this, consider the two types of channels discussed
in Section 1.2.3: One provides a request/reply service, and one supports a message
stream service. These two channels might be alternative offerings at some level of a
multilevel networking system, as illustrated in Figure 1.9.
Using this discussion of layering as a foundation, we are now ready to discuss the
architecture of a network more precisely. For starters, the abstract objects that make
up the layers of a network system are called protocols. That is, a protocol provides
a communication service that higher-level objects (such as application processes, or
perhaps higher-level protocols) use to exchange messages. For example, we could imagine a network that supports a request/reply protocol and a message stream protocol,
corresponding to the request/reply and message stream channels discussed above.
Each protocol defines two different interfaces. First, it defines a service interface to the other objects on the same computer that want to use its communication
services. This service interface defines the operations that local objects can perform
on the protocol. For example, a request/reply protocol would support operations by
which an application can send and receive messages. Second, a protocol defines a peer
interface to its counterpart (peer) on another machine. This second interface defines
the form and meaning of messages exchanged between protocol peers to implement
the communication service. This would determine the way in which a request/reply
protocol on one machine communicates with its peer on another machine. In other
words, a protocol defines a communication service that it exports locally, along with
a set of rules governing the messages that the protocol exchanges with its peer(s) to
implement this service. This situation is illustrated in Figure 1.10.
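The distinction between the two interfaces can be made concrete with a small sketch. Everything here is hypothetical and chosen only for illustration: the class name `ToyProtocol`, the three-byte header layout, and the Python list that stands in for whatever lower-level service carries messages between the two machines.

```python
import struct


class ToyProtocol:
    """Hypothetical protocol with its two interfaces made explicit."""

    # Peer interface: the form of the messages exchanged between peers.
    # Assumed header layout: message type (1 byte), body length (2 bytes).
    HEADER = struct.Struct("!BH")
    DATA = 1

    def __init__(self, wire):
        self.wire = wire  # a list standing in for the lower-level service

    # Service interface: the operations offered to local objects.
    def send(self, body: bytes) -> None:
        self.wire.append(self.HEADER.pack(self.DATA, len(body)) + body)

    def receive(self) -> bytes:
        message = self.wire.pop(0)
        msg_type, length = self.HEADER.unpack_from(message)
        if msg_type != self.DATA:
            raise ValueError("unknown message type")
        start = self.HEADER.size
        return message[start:start + length]


wire = []
host1, host2 = ToyProtocol(wire), ToyProtocol(wire)
host1.send(b"read block 7")
reply = host2.receive()
```

An application only ever calls `send` and `receive`; the header format is visible only to the two peers, which is why either side could be reimplemented as long as it keeps producing and accepting the same bytes.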
Figure 1.10 Service and peer interfaces.
Except at the hardware level where peers directly communicate with each other
over a link, peer-to-peer communication is indirect—each protocol communicates with
its peer by passing messages to some lower-level protocol, which in turn delivers the
message to its peer. In addition, there are potentially multiple protocols at any given
level, each providing a different communication service. We therefore represent the
suite of protocols that make up a network system with a protocol graph. The nodes of
the graph correspond to protocols, and the edges represent a depends-on relation. For
example, Figure 1.11 illustrates a protocol graph for the hypothetical layered system
we have been discussing—protocols RRP (Request/Reply Protocol) and MSP (Message Stream Protocol) implement two different types of process-to-process channels,
and both depend on HHP (Host-to-Host Protocol), which provides a host-to-host
connectivity service.
In this example, suppose that the file access program on host 1 wants to send
a message to its peer on host 2 using the communication service offered by protocol
RRP. In this case, the file application asks RRP to send the message on its behalf.
To communicate with its peer, RRP then invokes the services of HHP, which in turn
transmits the message to its peer on the other machine. Once the message has arrived
at protocol HHP on host 2, HHP passes the message up to RRP, which in turn delivers
the message to the file application. In this particular case, the application is said to
employ the services of the protocol stack RRP/HHP.
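The protocol graph and the path a message takes through it can be written down directly. The dictionary below encodes the depends-on edges of the hypothetical RRP/MSP/HHP system; the function names, and the simplifying assumption that each protocol depends on at most one lower-level protocol, are ours rather than the text's.

```python
# Depends-on edges of the hypothetical protocol graph described in the text.
PROTOCOL_GRAPH = {
    "RRP": ["HHP"],
    "MSP": ["HHP"],
    "HHP": [],
}


def protocol_stack(graph, top):
    """Follow depends-on edges downward from `top` (first dependency only)."""
    stack = [top]
    while graph[stack[-1]]:
        stack.append(graph[stack[-1]][0])
    return stack


def message_path(graph, top):
    """Protocols visited by one message: down on host 1, then up on host 2."""
    down = protocol_stack(graph, top)
    return down + down[::-1]


stack_name = "/".join(protocol_stack(PROTOCOL_GRAPH, "RRP"))
```

For the file access example, `stack_name` is "RRP/HHP" and `message_path` visits RRP and HHP on host 1, then HHP and RRP on host 2.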
Note that the term protocol is used in two different ways. Sometimes it refers to
the abstract interfaces—that is, the operations defined by the service interface and the
form and meaning of messages exchanged between peers—and sometimes it refers to
the module that actually implements these two interfaces. To distinguish between the
interfaces and the module that implements these interfaces, we generally refer to the
former as a protocol specification. Specifications are generally expressed using a combination of prose, pseudocode, state transition diagrams, pictures of packet formats, and
other abstract notations. It should be the case that a given protocol can be implemented
in different ways by different programmers, as long as each adheres to the specification.
The challenge is ensuring that two different implementations of the same specification
can successfully exchange messages. Two or more protocol modules that do accurately
implement a protocol specification are said to interoperate with each other.
Figure 1.11 Example of a protocol graph.
We can imagine many different protocols and protocol graphs that satisfy the
communication requirements of a collection of applications. Fortunately, there exist
standardization bodies, such as the International Organization for Standardization (ISO) and
the Internet Engineering Task Force (IETF), that establish policies for a particular protocol graph. We call the set of rules governing the form and content of a protocol
graph a network architecture. Although beyond the scope of this book, standardization bodies such as the ISO and the IETF have established well-defined procedures for
introducing, validating, and finally approving protocols in their respective architectures. We briefly describe the architectures defined by the ISO and the IETF shortly,
but first there are two additional things we need to explain about the mechanics of a
protocol graph.
Consider what happens in Figure 1.11 when one of the application programs sends a
message to its peer by passing the message to protocol RRP. From RRP’s perspective,
the message it is given by the application is an uninterpreted string of bytes. RRP does
not care that these bytes represent an array of integers, an email message, a digital
image, or whatever; it is simply charged with sending them to its peer. However, RRP
must communicate control information to its peer, instructing it how to handle the
message when it is received. RRP does this by attaching a header to the message.
Generally speaking, a header is a small data structure—from a few bytes to a few
dozen bytes—that is used among peers to communicate with each other. As the name
suggests, headers are usually attached to the front of a message. In some cases, however,
this peer-to-peer control information is sent at the end of the message, in which case
it is called a trailer. The exact format for the header attached by RRP is defined by
its protocol specification. The rest of the message—that is, the data being transmitted
on behalf of the application—is called the message’s body or payload. We say that the
application’s data is encapsulated in the new message created by protocol RRP.
This process of encapsulation is then repeated at each level of the protocol graph;
for example, HHP encapsulates RRP’s message by attaching a header of its own. If we
now assume that HHP sends the message to its peer over some network, then when the
message arrives at the destination host, it is processed in the opposite order: HHP first
strips its header off the front of the message, interprets it (i.e., takes whatever action
is appropriate given the contents of the header), and passes the body of the message
up to RRP, which removes the header that its peer attached, takes whatever action
is indicated by that header, and passes the body of the message up to the application
program. The message passed up from RRP to the application on host 2 is exactly the
same message as the application passed down to RRP on host 1; the application does
not see any of the headers that have been attached to it to implement the lower-level
communication services. This whole process is illustrated in Figure 1.12. Note that in
this example, nodes in the network (e.g., switches and routers) may inspect the HHP
header at the front of the message.
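The encapsulation and decapsulation steps can be sketched with Python's `struct` module. The header layouts below are invented for illustration, since the text does not define the formats of the RRP or HHP headers, and the function names are likewise hypothetical.

```python
import struct

# Hypothetical header layouts (network byte order, no padding).
RRP_HEADER = struct.Struct("!BI")   # message type (1 byte), message id (4 bytes)
HHP_HEADER = struct.Struct("!BBH")  # source host, destination host, body length


def rrp_encapsulate(app_data: bytes, msg_type: int, msg_id: int) -> bytes:
    """RRP treats app_data as uninterpreted bytes and prepends its header."""
    return RRP_HEADER.pack(msg_type, msg_id) + app_data


def hhp_encapsulate(rrp_message: bytes, src: int, dst: int) -> bytes:
    """HHP in turn treats the whole RRP message as its body."""
    return HHP_HEADER.pack(src, dst, len(rrp_message)) + rrp_message


def hhp_decapsulate(packet: bytes):
    """Strip and interpret the HHP header; return (header fields, body)."""
    src, dst, length = HHP_HEADER.unpack_from(packet)
    start = HHP_HEADER.size
    return (src, dst), packet[start:start + length]


def rrp_decapsulate(message: bytes):
    """Strip and interpret the RRP header; return (header fields, body)."""
    msg_type, msg_id = RRP_HEADER.unpack_from(message)
    return (msg_type, msg_id), message[RRP_HEADER.size:]


app_data = b"an array of integers, an email message, or whatever"
packet = hhp_encapsulate(rrp_encapsulate(app_data, 1, 42), src=1, dst=2)

hosts, rrp_message = hhp_decapsulate(packet)           # on host 2, HHP goes first
rrp_fields, received = rrp_decapsulate(rrp_message)    # then RRP
```

The bytes handed to the application on host 2 (`received`) are identical to `app_data`; the nine bytes of headers added on the way down never reach it.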
Note that when we say a low-level protocol does not interpret the message it is
given by some high-level protocol, we mean that it does not know how to extract any
meaning from the data contained in the message. It is sometimes the case, however,
that the low-level protocol applies some simple transformation to the data it is given,
such as to compress or encrypt it. In this case, the protocol is transforming the entire
body of the message, including both the original application’s data and all the headers
attached to that data by higher-level protocols.
Figure 1.12 High-level messages are encapsulated inside of low-level messages.
Multiplexing and Demultiplexing
Recall from Section 1.2.2 that a fundamental idea of packet switching is to multiplex
multiple flows of data over a single physical link. This same idea applies up and down
the protocol graph, not just to switching nodes. In Figure 1.11, for example, we can
think of RRP as implementing a logical communication channel, with messages from
two different applications multiplexed over this channel at the source host and then
demultiplexed back to the appropriate application at the destination host.
Practically speaking, all this means is that the header that RRP attaches to its
messages contains an identifier that records the application to which the message
belongs. We call this identifier RRP’s demultiplexing key, or demux key for short.
At the source host, RRP includes the appropriate demux key in its header. When the
message is delivered to RRP on the destination host, it strips its header, examines the
demux key, and demultiplexes the message to the correct application.
RRP is not unique in its support for multiplexing; nearly every protocol implements this mechanism. For example, HHP has its own demux key to determine which
messages to pass up to RRP and which to pass up to MSP. However, there is no uniform
agreement among protocols—even those within a single network architecture—on exactly what constitutes a demux key. Some protocols use an 8-bit field (meaning they can
support only 256 high-level protocols), and others use 16- or 32-bit fields. Also, some
protocols have a single demultiplexing field in their header, while others have a pair of
demultiplexing fields. In the former case, the same demux key is used on both sides of
the communication, while in the latter case, each side uses a different key to identify the
high-level protocol (or application program) to which the message is to be delivered.
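A demux key amounts to a lookup table at the receiver. The sketch below assumes a single 16-bit key at the front of each message, one of the field sizes mentioned above; the class name `Demultiplexer` and the handler registration scheme are hypothetical and not taken from any particular protocol.

```python
import struct

DEMUX_KEY = struct.Struct("!H")  # assumed: one 16-bit demux key per message


class Demultiplexer:
    """Toy multiplexing and demultiplexing for a protocol such as RRP."""

    def __init__(self):
        self.handlers = {}  # demux key -> callable that accepts the body

    def register(self, key, handler):
        self.handlers[key] = handler

    def mux(self, key, body: bytes) -> bytes:
        # Source host: record which application the message belongs to.
        return DEMUX_KEY.pack(key) + body

    def demux(self, message: bytes) -> None:
        # Destination host: strip the header, examine the key, hand the
        # body to the matching application.
        (key,) = DEMUX_KEY.unpack_from(message)
        self.handlers[key](message[DEMUX_KEY.size:])


file_app, library_app = [], []
rrp = Demultiplexer()
rrp.register(1, file_app.append)
rrp.register(2, library_app.append)

# Two applications share one channel; each message finds its way back.
for msg in (rrp.mux(1, b"read file"), rrp.mux(2, b"get page"), rrp.mux(1, b"write file")):
    rrp.demux(msg)
```

A 16-bit key allows 65,536 distinct values, compared with the 256 available to a protocol that uses an 8-bit field.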
OSI Architecture
The ISO was one of the first organizations to formally define a common way to connect
computers. Their architecture, called the Open Systems Interconnection (OSI) architecture and illustrated in Figure 1.13, defines a partitioning of network functionality into
seven layers, where one or more protocols implement the functionality assigned to a
given layer. In this sense, the schematic given in Figure 1.13 is not a protocol graph, per
se, but rather a reference model for a protocol graph. The ISO, usually in conjunction
with a second standards organization known as the International Telecommunication
Union (ITU)[1], publishes a series of protocol specifications based on the OSI architecture. This series is sometimes called the “X dot” series since the protocols are given
names like X.25, X.400, X.500, and so on. There have been several networks based on
these standards, including the public X.25 network and private networks like Tymnet.
Figure 1.13 OSI network architecture (two end hosts connected through one or more nodes within the network).
Starting at the bottom and working up, the physical layer handles the transmission of raw bits over a communications link. The data link layer then collects a stream
of bits into a larger aggregate called a frame. Network adaptors, along with device
drivers running in the node’s OS, typically implement the data link level. This means
that frames, not raw bits, are actually delivered to hosts. The network layer handles
routing among nodes within a packet-switched network. At this layer, the unit of data
exchanged among nodes is typically called a packet rather than a frame, although
they are fundamentally the same thing. The lower three layers are implemented on all
network nodes, including switches within the network and hosts connected along the
exterior of the network. The transport layer then implements what we have up to this
point been calling a process-to-process channel. Here, the unit of data exchanged is
commonly called a message rather than a packet or a frame. The transport layer and
higher layers typically run only on the end hosts and not on the intermediate switches
or routers.
There is less agreement about the definition of the top three layers. Skipping
ahead to the top (seventh) layer, we find the application layer. Application layer protocols include things like the File Transfer Protocol (FTP), which defines a protocol by
which file transfer applications can interoperate. Below that, the presentation layer is
concerned with the format of data exchanged between peers, for example, whether an
integer is 16, 32, or 64 bits long and whether the most significant bit is transmitted
first or last, or how a video stream is formatted. Finally, the session layer provides a
name space that is used to tie together the potentially different transport streams that
are part of a single application. For example, it might manage an audio stream and a
video stream that are being combined in a teleconferencing application.
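The kind of agreement the presentation layer is concerned with can be illustrated with integer encodings. The text speaks of bit order; the sketch below uses byte order within an integer instead, a closely related choice that Python's `struct` module exposes directly. The helper names `encode_int` and `decode_int` are hypothetical.

```python
import struct

_CODES = {16: "H", 32: "I", 64: "Q"}  # struct codes for unsigned integers


def encode_int(value, bits, most_significant_first=True):
    """Encode an unsigned integer as 16, 32, or 64 bits in the chosen byte order."""
    order = ">" if most_significant_first else "<"
    return struct.pack(order + _CODES[bits], value)


def decode_int(data, bits, most_significant_first=True):
    """Decode bytes produced by encode_int, given the same two conventions."""
    order = ">" if most_significant_first else "<"
    return struct.unpack(order + _CODES[bits], data)[0]


sent = encode_int(1, 16)                                          # b"\x00\x01"
agreed = decode_int(sent, 16)                                     # 1
disagreed = decode_int(sent, 16, most_significant_first=False)    # 256
```

When the two peers disagree about the convention, the receiver reads the value 1 as 256, which is why the format has to be part of what the peers agree on.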
Internet Architecture
The Internet architecture, which is also sometimes called the TCP/IP architecture after
its two main protocols, is depicted in Figure 1.14. An alternative representation is given
in Figure 1.15. The Internet architecture evolved out of experiences with an earlier
packet-switched network called the ARPANET. Both the Internet and the ARPANET
were funded by the Advanced Research Projects Agency (ARPA), one of the R&D
funding agencies of the U.S. Department of Defense. The Internet and ARPANET
were around before the OSI architecture, and the experience gained from building
them was a major influence on the OSI reference model.
Figure 1.14 Internet protocol graph.
Figure 1.15 Alternative view of the Internet architecture.
[1] A subcommittee of the ITU on telecommunications (ITU-T) replaces an earlier subcommittee of the ITU, which
was known by its French name, Comité Consultatif International Téléphonique et Télégraphique (CCITT).
While the seven-layer OSI model can, with some imagination, be applied to the
Internet, a four-layer model is often used instead. At the lowest level are a wide variety
of network protocols, denoted NET1, NET2, and so on. In practice, these protocols
are implemented by a combination of hardware (e.g., a network adaptor) and software (e.g., a network device driver). For example, you might find Ethernet or Fiber
Distributed Data Interface (FDDI) protocols at this layer. (These protocols in turn
may actually involve several sublayers, but the Internet architecture does not presume
anything about them.) The second layer consists of a single protocol—the Internet
Protocol (IP). This is the protocol that supports the interconnection of multiple networking technologies into a single, logical internetwork. The third layer contains two
main protocols—the Transmission Control Protocol (TCP) and the User Datagram
Protocol (UDP). TCP and UDP provide alternative logical channels to application
programs: TCP provides a reliable byte-stream channel, and UDP provides an unreliable datagram delivery channel (datagram may be thought of as a synonym for
message). In the language of the Internet, TCP and UDP are sometimes called end-to-end protocols, although it is equally correct to refer to them as transport protocols.
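The two channel types are visible in the standard sockets interface, where `SOCK_DGRAM` selects UDP and `SOCK_STREAM` selects TCP for Internet addresses. The sketch below uses Python's `socket` module over the loopback address; the helper names are hypothetical, and it assumes loopback networking is available and that the payload is small enough to fit in the socket buffers.

```python
import socket


def udp_roundtrip(payload: bytes) -> bytes:
    """Send one datagram to ourselves over UDP and return what arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        receiver.settimeout(2.0)          # UDP gives no delivery guarantee
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(65535)
        return data                       # one sendto yields one message


def tcp_roundtrip(payload: bytes) -> bytes:
    """Send bytes to ourselves over TCP and return everything that arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        with socket.create_connection(listener.getsockname()) as client:
            conn, _addr = listener.accept()
            with conn:
                conn.settimeout(2.0)
                client.sendall(payload)
                client.shutdown(socket.SHUT_WR)   # signal end of the stream
                chunks = []
                while True:
                    chunk = conn.recv(4096)       # may return any number of bytes
                    if not chunk:
                        break
                    chunks.append(chunk)
                return b"".join(chunks)
```

The UDP receiver gets the payload as a single message or not at all, while the TCP receiver reads an ordered stream of bytes with no message boundaries and must loop until the sender closes its side.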
Running above the transport layer is a range of application protocols, such as
FTP, TFTP (Trivial File Transfer Protocol), Telnet (remote login), and SMTP (Simple
Mail Transfer Protocol, or electronic mail), that enable the interoperation of popular
applications. To understand the difference between an application layer protocol and
an application, think of all the different World Wide Web browsers that are available
(e.g., Mosaic, Netscape, Internet Explorer, and Lynx). There is a similarly large
number of different implementations of Web servers. You can use any
one of these application programs to access a particular site on the Web because
they all conform to the same application layer protocol: HTTP (Hypertext Transfer
Protocol). Confusingly, the same word sometimes applies to both an application and
the application layer protocol that it uses (e.g., FTP).
The Internet architecture has three features that are worth highlighting. First, as
best illustrated by Figure 1.15, the Internet architecture does not imply strict layering.
The application is free to bypass the defined transport layers and to directly use IP or
one of the underlying networks. In fact, programmers are free to define new channel
abstractions or applications that run on top of any of the existing protocols.
Second, if you look closely at the protocol graph in Figure 1.14, you will notice
an hourglass shape—wide at the top, narrow in the middle, and wide at the bottom.
This shape actually reflects the central philosophy of the architecture. That is, IP serves
as the focal point for the architecture—it defines a common method for exchanging
packets among a wide collection of networks. Above IP can be arbitrarily many transport protocols, each offering a different channel abstraction to application programs.
Thus, the issue of delivering messages from host to host is completely separated from
the issue of providing a useful process-to-process communication service. Below IP,
the architecture allows for arbitrarily many different network technologies, ranging
from Ethernet to FDDI to ATM to single point-to-point links.
A final attribute of the Internet architecture (or more accurately, of the IETF
culture) is that in order for someone to propose a new protocol to be included in the
architecture, they must produce both a protocol specification and at least one (and
preferably two) representative implementations of the specification. The existence of
working implementations is required for standards to be adopted by the IETF. This
cultural assumption of the design community helps to ensure that the architecture’s
protocols can be efficiently implemented. Perhaps the value the Internet culture places
on working software is best exemplified by a quote on T-shirts commonly worn at
IETF meetings:
We reject kings, presidents, and voting. We believe in rough consensus and running code.
(Dave Clark)
Of these three attributes of the Internet architecture, the hourglass design philosophy is important enough to bear repeating. The hourglass’s narrow waist represents
a minimal and carefully chosen set of global capabilities that allows both higher-level
applications and lower-level communication technologies to coexist, share capabilities, and evolve rapidly. The narrow-waisted model is critical to the Internet’s ability
to adapt rapidly to new user demands and changing technologies.
1.4 Implementing Network Software
Network architectures and protocol specifications are essential things, but a good
blueprint is not enough to explain the phenomenal success of the Internet: The number
of computers connected to the Internet has been doubling every year since 1981 and is
now approaching 200 million; the number of people who use the Internet is estimated
at well over 600 million; and it is believed that the number of bits transmitted over the
Internet surpassed the corresponding figure for the voice phone system sometime in
What explains the success of the Internet? There are certainly many contributing
factors (including a good architecture), but one thing that has made the Internet such
a runaway success is the fact that so much of its functionality is provided by software
running in general-purpose computers. The significance of this is that new functionality can be added readily with “just a small matter of programming.” As a result,
new applications and services—electronic commerce, videoconferencing, and packet
telephony, to name a few—have been showing up at a phenomenal pace.
A related factor is the massive increase in computing power available in commodity machines. Although computer networks have always been capable in principle of
transporting any kind of information, such as digital voice samples, digitized images,
and so on, this potential was not particularly interesting if the computers sending and
receiving that data were too slow to do anything useful with the information. Virtually
all of today’s computers are capable of playing back digitized voice at full speed and can
display video at a speed and resolution that is useful for some (but by no means all)
applications. Thus, today’s networks have begun to support multimedia, and their
support for it will only improve as computing hardware becomes faster.
The point to take away from this is that knowing how to implement network
software is an essential part of understanding computer networks. With this in mind,
this section first introduces some of the issues involved in implementing an application
program on top of a network, and then goes on to identify the issues involved in
implementing the protocols running within the network. In many respects, network
applications and network protocols are very similar—the way an application engages
the services of the network is pretty much the same as the way a high-level protocol
invokes the services of a low-level protocol. As we will see later in the section, however,
there are a couple of important differences.
Application Programming Interface (Sockets)
The place to start when implementing a network application is the interface exported
by the network. Since most network protocols are implemented in software (especially those high in the protocol stack), and nearly all computer systems implement
their network protocols as part of the operating system, when we refer to the interface “exported by the network,” we are generally referring to the interface that the
OS provides to its networking subsystem. This interface is often called the network
application programming interface (API).
Although each operating system is free to define its own network API (and most
have), over time certain of these APIs have become widely supported; that is, they
have been ported to operating systems other than their native system. This is what has
happened with the socket interface originally provided by the Berkeley distribution of
Unix, which is now supported in virtually all popular operating systems. The advantage
of industrywide support for a single API is that applications can be easily ported from
one OS to another. It is important to keep in mind, however, that application programs
typically interact with many parts of the OS other than the network; for example, they
read and write files, fork concurrent processes, and output to the graphical display.
Just because two systems support the same network API does not mean that their
file system, process, or graphic interfaces are the same. Still, understanding a widely
adopted API like Unix sockets gives us a good place to start.
Before describing the socket interface, it is important to keep two concerns separate in your mind. Each protocol provides a certain set of services, and the API
provides a syntax by which those services can be invoked in this particular OS.
The implementation is then responsible for mapping the tangible set of operations
and objects defined by the API onto the abstract set of services defined by the protocol. If you have done a good job of defining the interface, then it will be possible
to use the syntax of the interface to invoke the services of many different protocols.
Such generality was certainly a goal of the socket interface, although it’s far from
The main abstraction of the socket interface, not surprisingly, is the socket. A
good way to think of a socket is as the point where a local application process attaches
to the network. The interface defines operations for creating a socket, attaching the
socket to the network, sending/receiving messages through the socket, and closing the
socket. To simplify the discussion, we will limit ourselves to showing how sockets are
used with TCP.
The first step is to create a socket, which is done with the following operation:
int socket(int domain, int type, int protocol)
The reason that this operation takes three arguments is that the socket interface was
designed to be general enough to support any underlying protocol suite. Specifically, the
domain argument specifies the protocol family that is going to be used: PF_INET denotes
the Internet family, PF_UNIX denotes the Unix pipe facility, and PF_PACKET denotes
direct access to the network interface (i.e., it bypasses the TCP/IP protocol stack). The
type argument indicates the semantics of the communication. SOCK_STREAM is used to
denote a byte stream. SOCK_DGRAM is an alternative that denotes a message-oriented
service, such as that provided by UDP. The protocol argument identifies the specific
protocol that is going to be used. In our case, this argument is UNSPEC (passed as 0 in
the example programs) because the
combination of PF_INET and SOCK_STREAM implies TCP. Finally, the return value from
socket is a handle for the newly created socket, that is, an identifier by which we can
refer to the socket in the future. It is given as an argument to subsequent operations
on this socket.
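To make the three arguments concrete, here is a minimal sketch in C. The helper names open_inet_socket and socket_type are ours, introduced only for illustration; socket, getsockopt, and SO_TYPE are standard. The protocol argument is passed as 0, which asks the system for the default protocol of the chosen family and type.

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical helper: create an Internet-family socket of the given type
 * (SOCK_STREAM or SOCK_DGRAM). Returns the socket handle, or -1 on error. */
int open_inet_socket(int type)
{
    assert(type == SOCK_STREAM || type == SOCK_DGRAM);
    /* Protocol 0 selects the default for this family and type:
     * TCP for SOCK_STREAM, UDP for SOCK_DGRAM. */
    return socket(PF_INET, type, 0);
}

/* Hypothetical helper: ask the system which type an existing socket has.
 * Returns the type, or -1 on error. */
int socket_type(int s)
{
    int type = -1;
    socklen_t len = sizeof(type);

    if (getsockopt(s, SOL_SOCKET, SO_TYPE, &type, &len) < 0)
        return -1;
    return type;
}
```

Calling open_inet_socket(SOCK_STREAM) should yield a handle for which socket_type reports SOCK_STREAM; the same call with SOCK_DGRAM yields a datagram socket instead.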
The next step depends on whether you are a client or a server. On a server
machine, the application process performs a passive open—the server says that it is
prepared to accept connections, but it does not actually establish a connection. The
server does this by invoking the following three operations:
int bind(int socket, struct sockaddr *address, int addr_len)
int listen(int socket, int backlog)
int accept(int socket, struct sockaddr *address, int *addr_len)
The bind operation, as its name suggests, binds the newly created socket to the
specified address. This is the network address of the local participant—the server. Note
that, when used with the Internet protocols, address is a data structure that includes
both the IP address of the server and a TCP port number. (As we will see in Chapter 5,
ports are used to indirectly identify processes. They are a form of demux keys as defined in Section 1.3.1.) The port number is usually some well-known number specific
to the service being offered; for example, Web servers commonly accept connections
on port 80.
The listen operation then defines how many connections can be pending on
the specified socket. Finally, the accept operation carries out the passive open. It is
a blocking operation that does not return until a remote participant has established
a connection, and when it does complete, it returns a new socket that corresponds
to this just-established connection, and the address argument contains the remote
participant’s address. Note that when accept returns, the original socket that was
given as an argument still exists and still corresponds to the passive open; it is used in
future invocations of accept.
On the client machine, the application process performs an active open; that is,
it says who it wants to communicate with by invoking the following single operation:
int connect(int socket, struct sockaddr *address, int addr_len)
This operation does not return until TCP has successfully established a connection, at
which time the application is free to begin sending data. In this case, address contains
the remote participant’s address. In practice, the client usually specifies only the remote
participant’s address and lets the system fill in the local information. Whereas a server
usually listens for messages on a well-known port, a client typically does not care
which port it uses for itself; the OS simply selects an unused one.
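The following sketch illustrates the last point, that the OS picks the client's port. It assumes a machine with a working loopback interface, and local_port, listen_on_loopback, and connect_to_loopback are hypothetical helper names. The client side never calls bind, yet getsockname reports a nonzero local port once connect has returned.

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return the local port bound to socket s (host byte order), or -1. */
int local_port(int s)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (getsockname(s, (struct sockaddr *)&addr, &len) < 0)
        return -1;
    return ntohs(addr.sin_port);
}

/* Fill addr with 127.0.0.1 and the given port. */
static void loopback_addr(struct sockaddr_in *addr, int port)
{
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = htons((unsigned short)port);
}

/* Passive open on the loopback address. Port 0 asks the OS to choose a
 * free port, which keeps the sketch independent of any fixed port. */
int listen_on_loopback(void)
{
    struct sockaddr_in addr;
    int s = socket(PF_INET, SOCK_STREAM, 0);

    if (s < 0)
        return -1;
    loopback_addr(&addr, 0);
    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(s, 1) < 0) {
        close(s);
        return -1;
    }
    return s;
}

/* Active open to 127.0.0.1:port. Only the remote address is given;
 * the local address and port are left for the OS to fill in. */
int connect_to_loopback(int port)
{
    struct sockaddr_in addr;
    int s;

    assert(port > 0 && port < 65536);
    s = socket(PF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return -1;
    loopback_addr(&addr, port);
    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(s);
        return -1;
    }
    return s;
}
```

A caller would first obtain the listener's port with local_port(listen_on_loopback()), then pass it to connect_to_loopback and inspect the client's own port with local_port.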
Once a connection is established, the application processes invoke the following
two operations to send and receive data:
int send(int socket, char *message, int msg_len, int flags)
int recv(int socket, char *buffer, int buf_len, int flags)
The first operation sends the given message over the specified socket, while the second
operation receives a message from the specified socket into the given buffer. Both
operations take a set of flags that control certain details of the operation.
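Because TCP offers a byte stream rather than a sequence of messages, a single recv may return fewer bytes than were asked for, and send can likewise report a short count. A common idiom, sketched here with the hypothetical helpers send_all and recv_all, is to loop until the requested number of bytes has been transferred. The sketch uses the standard prototypes, which take a size_t length and return ssize_t rather than the int types of the simplified signatures above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Keep calling send until all len bytes have been handed to the socket.
 * Returns 0 on success, -1 on error. */
int send_all(int s, const char *msg, size_t len)
{
    size_t sent = 0;

    assert(msg != NULL);
    while (sent < len) {
        ssize_t n = send(s, msg + sent, len - sent, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        sent += (size_t)n;
    }
    return 0;
}

/* Receive exactly len bytes into buf. Returns 0 on success, -1 on error
 * or if the peer closes the connection before len bytes arrive. */
int recv_all(int s, char *buf, size_t len)
{
    size_t got = 0;

    assert(buf != NULL);
    while (got < len) {
        ssize_t n = recv(s, buf + got, len - got, 0);
        if (n == 0)
            return -1;          /* orderly shutdown by the peer */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        got += (size_t)n;
    }
    return 0;
}
```

The flags argument is left at 0 in both helpers; an application that needs particular flags would pass them through instead.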
Example Application
We now show the implementation of a simple client/server program that uses the socket
interface to send messages over a TCP connection. The program also uses other Unix
networking utilities, which we introduce as we go. Our application allows a user on
one machine to type in and send text to a user on another machine. It is a simplified
version of the Unix talk program, which is similar to the program at the core of a Web
chat room.
We start with the client side, which takes the name of the remote machine as an
argument. It calls the Unix utility gethostbyname to translate this name into the remote
host’s IP address. The next step is to construct the address data structure (sin) expected
by the socket interface. Notice that this data structure specifies that we’ll be using the
socket to connect to the Internet (AF_INET). In our example, we use TCP port 5432
as the well-known server port; this happens to be a port that has not been assigned to
any other Internet service. The final step in setting up the connection is to call socket
and connect. Once the connect operation returns, the connection is established and the
client program enters its main loop, which reads text from standard input and sends
it over the socket.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

#define SERVER_PORT 5432
#define MAX_LINE 256

int main(int argc, char *argv[])
{
    struct hostent *hp;
    struct sockaddr_in sin;
    char *host;
    char buf[MAX_LINE];
    int s;
    int len;

    if (argc == 2) {
        host = argv[1];
    } else {
        fprintf(stderr, "usage: simplex-talk host\n");
        exit(1);
    }

    /* translate host name into peer's IP address */
    hp = gethostbyname(host);
    if (!hp) {
        fprintf(stderr, "simplex-talk: unknown host: %s\n", host);
        exit(1);
    }

    /* build address data structure */
    bzero((char *)&sin, sizeof(sin));
    sin.sin_family = AF_INET;
    bcopy(hp->h_addr, (char *)&sin.sin_addr, hp->h_length);
    sin.sin_port = htons(SERVER_PORT);

    /* active open */
    if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
        perror("simplex-talk: socket");
        exit(1);
    }
    if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        perror("simplex-talk: connect");
        close(s);
        exit(1);
    }

    /* main loop: get and send lines of text */
    while (fgets(buf, sizeof(buf), stdin)) {
        buf[MAX_LINE-1] = '\0';
        len = strlen(buf) + 1;   /* include the terminating NUL */
        send(s, buf, len, 0);
    }
    close(s);
    return 0;
}
The server is equally simple. It first constructs the address data structure by filling in
its own port number (SERVER_PORT). By not specifying an IP address, the application
program is willing to accept connections on any of the local host’s IP addresses. Next,
the server performs the preliminary steps involved in a passive open: creates the socket,
binds it to the local address, and sets the maximum number of pending connections
to be allowed. Finally, the main loop waits for a remote host to try to connect, and
when one does, receives and prints out the characters that arrive on the connection.
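#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define SERVER_PORT 5432
#define MAX_PENDING 5   /* assumed backlog; the value is not shown in this text */
#define MAX_LINE 256

int main(void)
{
    struct sockaddr_in sin;
    char buf[MAX_LINE];
    socklen_t len;
    int s, new_s, n;
    /* build address data structure */
    bzero((char *)&sin, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = INADDR_ANY;
    sin.sin_port = htons(SERVER_PORT);
    /* setup passive open */
    if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
        perror("simplex-talk: socket");
        exit(1);
    }
    if ((bind(s, (struct sockaddr *)&sin, sizeof(sin))) < 0) {
        perror("simplex-talk: bind");
        exit(1);
    }
    listen(s, MAX_PENDING);
    /* wait for connection, then receive and print text */
    while (1) {
        len = sizeof(sin);   /* accept reads this as the size of the address buffer */
        if ((new_s = accept(s, (struct sockaddr *)&sin, &len)) < 0) {
            perror("simplex-talk: accept");
            exit(1);
        }
        while ((n = recv(new_s, buf, sizeof(buf) - 1, 0)) > 0) {
            buf[n] = '\0';   /* terminate the received bytes before printing */
            fputs(buf, stdout);
        }
        close(new_s);
    }
}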
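The client above resolves the host name with gethostbyname, which later editions of POSIX dropped in favor of getaddrinfo. As a hedged alternative, the sketch below fills in the same sockaddr_in structure using getaddrinfo. The function name resolve_ipv4 is hypothetical, and the sketch is deliberately restricted to IPv4 so that it could stand in for the address-building step of the client.

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Fill *sin with the IPv4 address of host and the given port.
 * Returns 0 on success, -1 if the name cannot be resolved. */
int resolve_ipv4(const char *host, unsigned short port, struct sockaddr_in *sin)
{
    struct addrinfo hints, *res;

    assert(host != NULL && sin != NULL);
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;        /* IPv4 only, to match sockaddr_in */
    hints.ai_socktype = SOCK_STREAM;  /* we intend to use TCP */

    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        return -1;

    /* Use the first result; with AF_INET it is a struct sockaddr_in
     * whose address is already in network byte order. */
    memcpy(sin, res->ai_addr, sizeof(*sin));
    sin->sin_port = htons(port);
    freeaddrinfo(res);
    return 0;
}
```

In the client, a call such as resolve_ipv4(host, SERVER_PORT, &sin) would replace the gethostbyname, bzero, and bcopy steps; the socket and connect calls stay the same.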
Protocol Implementation Issues
As mentioned at the beginning of this section, the way application programs interact
with the underlying network is similar to the way a high-level protocol interacts with
a low-level protocol. For example, TCP needs an interface to send outgoing messages
to IP, and IP needs to be able to deliver incoming messages to TCP. This is exactly the
service interface introduced in Section 1.3.1.
Since we already have a network API (e.g., sockets), we might be tempted to use
this same interface between every pair of protocols in the protocol stack. Although
certainly an option, in practice the socket interface is not used in this way. The reason
is that there are inefficiencies built into the socket interface that protocol implementers
are not willing to tolerate. Application programmers tolerate them because they simplify their programming task and because the inefficiency only has to be tolerated once,
but protocol implementers are often obsessed with performance and must worry about
getting a message through several layers of protocols. The rest of this section discusses
the two primary differences between the network API and the protocol-to-protocol
interface found lower in the protocol graph.
Process Model
Most operating systems provide an abstraction called a process, or alternatively, a
thread. Each process runs largely independently of other processes, and the OS is
responsible for making sure that resources, such as address space and CPU cycles,
are allocated to all the current processes. The process abstraction makes it fairly
straightforward to have a lot of things executing concurrently on one machine; for
example, each user application might execute in its own process, and various things
inside the OS might execute as other processes. When the OS stops one process from
executing on the CPU and starts up another one, we call the change a context switch.
Figure 1.16 Alternative process models: (a) process-per-protocol; (b) process-per-message.
When designing the network subsystem, one of the first questions to answer
is, “Where are the processes?” There are essentially two choices, as illustrated in
Figure 1.16. In the first, which we call the process-per-protocol model, each protocol
is implemented by a separate process. This means that as a message moves up or down
the protocol stack, it is passed from one process/protocol to another—the process that
implements protocol i processes the message, then passes it to protocol i − 1, and so
on. How one process/protocol passes a message to the next process/protocol depends
on the support the host OS provides for interprocess communication. Typically, there
is a simple mechanism for enqueuing a message with a process. The important point,
however, is that a context switch is required at each level of the protocol graph—
typically a time-consuming operation.
The alternative, which we call the process-per-message model, treats each protocol as a static piece of code and associates the processes with the messages.
That is, when a message arrives from the network, the OS dispatches a process that
it makes responsible for the message as it moves up the protocol graph. At each level,
the procedure that implements that protocol is invoked, which eventually results in the
procedure for the next protocol being invoked, and so on. For outbound messages, the
application’s process invokes the necessary procedure calls until the message is delivered. In both directions, the protocol graph is traversed in a sequence of procedure calls.
Although the process-per-protocol model is sometimes easier to think about—
I implement my protocol in my process, and you implement your protocol in your
process—the process-per-message model is generally more efficient for a simple reason:
A procedure call is an order of magnitude more efficient than a context switch on most
computers. The former model requires the expense of a context switch at each level,
while the latter model costs only a procedure call per level.
Now think about the relationship between the service interface as defined above
and the process model. For an outgoing message, the high-level protocol invokes a
send operation on the low-level protocol. Because the high-level protocol has the
message in hand when it calls send, this operation can be easily implemented as a
procedure call; no context switch is required. For incoming messages, however, the
high-level protocol invokes the receive operation on the low-level protocol, and then
must wait for a message to arrive at some unknown future time; this basically forces a
context switch. In other words, the process running in the high-level protocol receives a
message from the process running in the low-level protocol. This isn’t a big deal if only
the application process receives messages from the network subsystem (in fact, it’s the
right interface for the network API, since application programs already have a process-centric view of the world), but it does have a significant impact on performance if such
a context switch occurs at each layer of the protocol stack.
It is for this reason that most protocol implementations replace the receive operation with a deliver operation. That is, the low-level protocol does an upcall—a procedure call up the protocol stack—to deliver the message to the high-level protocol.
Figure 1.17 shows the resulting interface between two adjacent protocols, TCP and IP
in this case. In general, messages move down the protocol graph through a sequence of
send operations, and up the protocol graph through a sequence of deliver operations.
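The send/deliver interface can be sketched as a pair of procedures per protocol. The toy below is not how any real stack is written: the one-byte "headers", the wire and app_inbox variables, and the function names send_tcp, send_ip, deliver_ip, and deliver_tcp are all illustrative assumptions. What it does show is the control flow, with outgoing messages descending through send calls and incoming messages ascending through deliver upcalls, and no context switch in between.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MSG_MAX 64

/* Toy message: a few bytes plus a length. It copies data on every header
 * operation, which a real network subsystem would avoid. */
struct message {
    char data[MSG_MAX];
    size_t len;
};

struct message wire;       /* stands in for the network below IP */
struct message app_inbox;  /* stands in for the application above TCP */

/* Prepend a one-byte header tag to the message. */
static void push_header(struct message *m, char tag)
{
    assert(m->len + 1 <= MSG_MAX);
    memmove(m->data + 1, m->data, m->len);
    m->data[0] = tag;
    m->len++;
}

/* Remove the one-byte header, checking that it carries the expected tag. */
static void strip_header(struct message *m, char tag)
{
    assert(m->len >= 1 && m->data[0] == tag);
    memmove(m->data, m->data + 1, m->len - 1);
    m->len--;
}

/* Downward path: each layer adds its header, then calls the layer below. */
void send_ip(struct message *m)
{
    push_header(m, 'I');
    wire = *m;              /* "transmit" the message */
}

void send_tcp(struct message *m)
{
    push_header(m, 'T');
    send_ip(m);             /* plain procedure call down the stack */
}

/* Upward path: each layer strips its header, then upcalls the layer above. */
void deliver_tcp(struct message *m)
{
    strip_header(m, 'T');
    app_inbox = *m;         /* hand the payload to the application */
}

void deliver_ip(struct message *m)
{
    strip_header(m, 'I');
    deliver_tcp(m);         /* upcall: procedure call up the stack */
}
```

Sending a two-byte payload with send_tcp leaves a four-byte message on the wire, headed by the IP tag and then the TCP tag; passing a copy of that message to deliver_ip returns the original two bytes to app_inbox.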
Message Buffers
A second inefficiency of the socket interface is that the application process provides
the buffer that contains the outbound message when calling send, and similarly it
provides the buffer into which an incoming message is copied when invoking the
receive operation. This forces the topmost protocol to copy the message from the application’s buffer into a network buffer, and vice versa, as shown in Figure 1.18. It turns
out that copying data from one buffer to another is one of the most expensive things a
protocol implementation can do. This is because while processors are becoming faster
at an incredible pace, memory is not getting faster as quickly as processors are.
Figure 1.17 Protocol-to-protocol interface (TCP invokes sendIP(message) on IP; IP invokes deliverTCP(message) on TCP).
Figure 1.18 Copying incoming/outgoing messages between application buffer and
network buffer (the application process sits above the topmost protocol).
Instead of copying message data from one buffer to another at each layer in the
protocol stack, most network subsystems define an abstract data type for messages
that is shared by all protocols in the protocol graph. Not only does this abstraction
permit messages to be passed up and down the protocol graph without copying, but
it usually provides copy-free ways of manipulating messages in other ways, such as
adding and stripping headers, fragmenting large messages into a set of small messages,
and reassembling a collection of small messages into a single large message. The exact
form of this message abstraction differs from OS to OS, but it generally involves a
linked list of pointers to message buffers, similar to the one shown in Figure 1.19.
We leave it as an exercise for you to define a general copy-free message abstraction.
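As one possible starting point for that exercise, the sketch below shows a much simpler layout than the linked list of Figure 1.19: a single buffer that reserves headroom in front of the payload, so that headers can be added and stripped by moving an offset rather than copying data. The type and function names (struct msg, msg_init, msg_push, msg_pop) and the sizes are assumptions made for illustration, and the sketch does not attempt fragmentation or reassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MSG_CAPACITY 256
#define MSG_HEADROOM 64   /* space reserved in front of the payload */

/* Hypothetical message type: one buffer, valid bytes in the middle. */
struct msg {
    unsigned char buf[MSG_CAPACITY];
    size_t start;   /* offset of the first valid byte */
    size_t len;     /* number of valid bytes */
};

/* Load the payload once, leaving headroom for headers added later.
 * Returns 0 on success, -1 if the payload does not fit. */
int msg_init(struct msg *m, const void *payload, size_t n)
{
    assert(m != NULL && payload != NULL);
    if (n > MSG_CAPACITY - MSG_HEADROOM)
        return -1;
    m->start = MSG_HEADROOM;
    m->len = n;
    memcpy(m->buf + m->start, payload, n);
    return 0;
}

/* Add an n-byte header by moving start backward; the bytes already in the
 * message are not touched. Returns a pointer to the new header space for
 * the caller to fill in, or NULL if the headroom is exhausted. */
unsigned char *msg_push(struct msg *m, size_t n)
{
    assert(m != NULL);
    if (n > m->start)
        return NULL;
    m->start -= n;
    m->len += n;
    return m->buf + m->start;
}

/* Strip an n-byte header by moving start forward. Returns a pointer to
 * the stripped header, or NULL if the message is shorter than n bytes. */
unsigned char *msg_pop(struct msg *m, size_t n)
{
    unsigned char *hdr;

    assert(m != NULL);
    if (n > m->len)
        return NULL;
    hdr = m->buf + m->start;
    m->start += n;
    m->len -= n;
    return hdr;
}
```

On the way down the protocol graph each protocol would call msg_push and write its header into the returned space; on the way up each would call msg_pop and inspect the header it gets back. The only copy is the one in msg_init.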
Figure 1.19 Example message data structure.
1.5 Performance
Up to this point, we have focused primarily on the functional aspects of networks.
Like any computer system, however, computer networks are also expected to perform
well, since the effectiveness of computations
distributed over the network often depends
directly on the efficiency with which the network delivers the computation’s data. While
the old programming adage “First get it
right and then make it fast” is valid in many
settings, in networking it is usually necessary to “design for performance.” It is therefore important to understand the various
factors that impact network performance.
1.5.1 Bandwidth and Latency
Network performance is measured in two
fundamental ways: bandwidth (also called
throughput) and latency (also called delay).
The bandwidth of a network is given by
the number of bits that can be transmitted
over the network in a certain period of time.
For example, a network might have a bandwidth of 10 million bits/second (Mbps),
meaning that it is able to deliver 10 million
bits every second. It is sometimes useful to think of bandwidth in terms of how long it
takes to transmit each bit of data. On a 10-Mbps network, for example, it takes 0.1
microsecond (μs) to transmit each bit.
While you can talk about the bandwidth of the network as a whole, sometimes
you want to be more precise, focusing, for example, on the bandwidth of a single
physical link or of a logical process-to-process channel. At the physical level, bandwidth is constantly improving, with no end in sight. Intuitively, if you think of a second
of time as a distance you could measure with a ruler, and bandwidth as how many
bits fit in that distance, then you can think of each bit as a pulse of some width. For
example, each bit on a 1-Mbps link is 1 μs wide, while each bit on a 2-Mbps link
is 0.5 μs wide, as illustrated in Figure 1.20. The more sophisticated the transmitting
and receiving technology, the narrower each bit can become, and thus, the higher the
bandwidth. For logical process-to-process channels, bandwidth is also influenced by
other factors, including how many times the software that implements the channel has to
handle, and possibly transform, each bit of data.
Figure 1.20 Bits transmitted at a particular bandwidth can be regarded as having some
width: (a) bits transmitted at 1 Mbps (each bit 1 μs wide); (b) bits transmitted at 2 Mbps
(each bit 0.5 μs wide).
Bandwidth and Throughput
Bandwidth and throughput are two of the most confusing terms used in networking.
While we could try to give you a precise definition of each term, it is important that you
know how other people might use them and for you to be aware that they are often used
interchangeably. First of all, bandwidth is literally a measure of the width of a frequency
band. For example, a voice-grade telephone line supports a frequency band ranging
from 300 to 3300 Hz; it is said to have a bandwidth of 3300 Hz − 300 Hz = 3000 Hz.
If you see the word “bandwidth” used in a situation in which it is being measured
in hertz, then it probably refers to the range of signals that can be
When we talk about the bandwidth of a communication link, we normally refer to the
number of bits per second that can be transmitted on the link. We might say that the
bandwidth of an Ethernet is 10 Mbps. A useful distinction might be made, however, between
the bandwidth that is available on the link and the number of bits per second that we can
actually transmit over the link in practice. We tend to use the word “throughput” to refer
to the measured performance of a system. Thus, because of various inefficiencies of
implementation, a pair of nodes connected by a link with a bandwidth of 10 Mbps might
achieve a throughput of only 2 Mbps. This would mean that an application on one host
could send data to the other host at 2 Mbps.
Finally, we often talk about the bandwidth requirements of an application: the number
of bits per second that it needs to transmit over the network to perform acceptably. For
some applications, this might be “whatever I can get”; for others, it might be some fixed
number (preferably no more than the available link bandwidth); and for others, it might
be a number that varies with time. We will provide more on this topic later in this
The second performance metric, latency, corresponds to how long it takes a
message to travel from one end of a network to the other. (As with bandwidth, we could
be focused on the latency of a single link or an end-to-end channel.) Latency is measured
strictly in terms of time. For example, a transcontinental network might have a latency
of 24 milliseconds (ms); that is, it takes a message 24 ms to travel from one end of
North America to the other. There are many situations in which it is more important to
know how long it takes to send a message from one end of a network to the other and
back, rather than the one-way latency. We call this the round-trip time (RTT) of the
network.
We often think of latency as having three components. First, there is the speed-of-light
propagation delay. This delay occurs because nothing, including a bit on a wire, can
travel faster than the speed of light. If you know the distance between two points, you
can calculate the speed-of-light latency, although you have to be careful because light
travels across different mediums at different speeds: It travels at 3.0 × 10^8 m/s in a
vacuum, 2.3 × 10^8 m/s in a cable, and 2.0 × 10^8 m/s in a fiber. Second, there is the
amount of time it takes to transmit a unit of data. This is a function of the network
bandwidth and the size of the packet in which the data is carried. Third, there may be
queuing delays inside the network, since packet switches generally need to store packets
for some time before forwarding them on an outbound link, as discussed in Section 1.2.2.
So, we could define the total latency as
Latency = Propagation + Transmit + Queue
Propagation = Distance/SpeedOfLight
Transmit = Size/Bandwidth
where Distance is the length of the wire over which the data will travel, SpeedOfLight
is the effective speed of light over that wire, Size is the size of the packet, and
Bandwidth is the bandwidth at which the packet is transmitted. Note that if the
message contains only one bit and we are talking about a single link (as opposed to
a whole network), then the Transmit and Queue terms are not relevant, and latency
corresponds to the propagation delay only.
Bandwidth and latency combine to define the performance characteristics of a
given link or channel. Their relative importance, however, depends on the application.
For some applications, latency dominates
bandwidth. For example, a client that sends a 1-byte message to a server and receives
a 1-byte message in return is latency bound. Assuming that no serious computation
is involved in preparing the response, the application will perform much differently
on a transcontinental channel with a 100-ms RTT than it will on an across-the-room
channel with a 1-ms RTT. Whether the channel is 1 Mbps or 100 Mbps is relatively insignificant, however, since the former implies that the time to transmit a byte (Transmit)
is 8 μs and the latter implies Transmit = 0.08 μs.
In contrast, consider a digital library program that is being asked to fetch a
25-megabyte (MB) image—the more bandwidth that is available, the faster it will be
able to return the image to the user. Here, the bandwidth of the channel dominates
performance. To see this, suppose that the channel has a bandwidth of 10 Mbps. It will
take 20 seconds to transmit the image, making it relatively unimportant if the image
is on the other side of a 1-ms channel or a 100-ms channel; the difference between a
20.001-second response time and a 20.1-second response time is negligible.
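The arithmetic in these two examples follows directly from the latency formulas, and can be sketched as three small C functions. The function names are ours, and all quantities are assumed to be in base units (bits, bits per second, meters, meters per second, seconds).

```c
#include <assert.h>

/* Transmit = Size / Bandwidth, with size in bits and bandwidth in bits/s. */
double transmit_time(double size_bits, double bandwidth_bps)
{
    assert(bandwidth_bps > 0.0);
    return size_bits / bandwidth_bps;
}

/* Propagation = Distance / SpeedOfLight, in meters and meters/second. */
double propagation_time(double distance_m, double speed_mps)
{
    assert(speed_mps > 0.0);
    return distance_m / speed_mps;
}

/* Latency = Propagation + Transmit + Queue, all in seconds. */
double latency(double propagation_s, double transmit_s, double queue_s)
{
    return propagation_s + transmit_s + queue_s;
}
```

With these definitions, a 1-byte (8-bit) message gives a Transmit of 8 μs at 1 Mbps and 0.08 μs at 100 Mbps. A 25-MB image, taking 1 MB as 10^6 bytes (the reading that reproduces the 20-second figure), gives a Transmit of 20 s at 10 Mbps; adding a 1-ms delay with no queuing yields 20.001 s.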
Figure 1.21 gives you a sense of how latency or bandwidth can dominate performance in different circumstances. The graph shows how long it takes to move objects
of various sizes (1 byte, 2 KB, 1 MB) across networks with RTTs ranging from 1 to
100 ms and link speeds of either 1.5 or 10 Mbps. We use logarithmic scales to show
relative performance. For a 1-byte object (say, a keystroke), latency remains almost
exactly equal to the RTT, so that you cannot distinguish between a 1.5-Mbps network
and a 10-Mbps network. For a 2-KB object (say, an email message), the link speed
makes quite a difference on a 1-ms RTT network but a negligible difference on a 100-ms RTT network. And for a 1-MB object (say, a digital image), the RTT makes no
difference; it is the link speed that dominates performance across the full range of RTT.
Figure 1.21 Perceived latency (response time) versus round-trip time for various object sizes and link speeds. Axes: perceived latency (ms) against RTT (ms). Curves: 1-MB, 2-KB, and 1-byte objects, each on a 1.5-Mbps and a 10-Mbps link.
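The trends described for Figure 1.21 can be reproduced with a very simple model. The text does not state the formula behind the graph, so the sketch below assumes that perceived latency is one round-trip time plus the time to transmit the object; the function name perceived_latency_ms is hypothetical.

```c
#include <assert.h>

/* Assumed model: perceived latency = RTT + Size / Bandwidth.
 * size_bytes is the object size in bytes, link_bps the link speed in
 * bits per second, rtt_ms the round-trip time in milliseconds.
 * The result is in milliseconds. */
double perceived_latency_ms(double size_bytes, double link_bps, double rtt_ms)
{
    assert(link_bps > 0.0);
    return rtt_ms + (size_bytes * 8.0 / link_bps) * 1000.0;
}
```

Taking 1 KB as 1024 bytes, a 2-KB object on a 1-ms RTT comes out near 11.9 ms at 1.5 Mbps versus about 2.6 ms at 10 Mbps, while on a 100-ms RTT the two values are roughly 110.9 ms and 101.6 ms, which is consistent with the discussion above.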
Note that throughout this book we use the terms latency and delay in a generic
way, that is, to denote how long it takes to perform a particular function such as
delivering a message or moving an object. When we are referring to the specific
amount of time it takes a signal to propagate from one end of a link to another,
we use the term propagation delay. Also, we make it clear in the context of the
discussion whether we are referring to the one-way latency or the round-trip time.
As an aside, computers are becoming so fast that when we connect them to networks, it is
sometimes useful to think, at least figuratively, in terms of instructions per mile.
Consider what happens when a computer that is able to execute 1 billion instructions per
second sends a message out on a channel with a 100-ms RTT. (To make the math easier,
assume that the message covers a distance of 5000 miles.) If that computer sits idle the
full 100 ms waiting for a reply message, then it has forfeited the ability to execute
100 million instructions, or 20,000 instructions per mile. It had better have been worth
going over the network to justify this waste.
How Big Is a Mega?
There are several pitfalls you need to be aware of when working with the common units of
networking: MB, Mbps, KB, and Kbps. The first is to distinguish carefully between bits
and bytes. Throughout this book, we always use a lowercase b for bits and a capital B for
bytes. The second is to be sure you are using the appropriate definition of mega (M) and
kilo (K). Mega, for example, can mean either $2^{20}$ or $10^6$. Similarly, kilo can be
either $2^{10}$ or $10^3$. What is worse, in networking we typically use both definitions.
Here’s why.
Network bandwidth, which is specified in terms of Mbps, is governed by the speed of the
clock that paces the transmission of the bits. A clock that is running at 10 MHz is used
to transmit bits at 10 Mbps. Because the mega in MHz means $10^6$ hertz, Mbps is usually
also defined as $10^6$ bits per second. (Similarly, Kbps is $10^3$ bits per second.) On
the other hand, when we talk about a message that we want to transmit, we often give its
size in kilobytes. Because messages are stored in the computer’s memory, and memory is
typically measured in powers of two, the K in KB is usually taken to mean $2^{10}$.
(Similarly, MB usually means $2^{20}$.) When you put the two together, it is not uncommon
to talk about sending a 32-KB message over a 10-Mbps channel, which should be interpreted
to mean $32 \times 2^{10} \times 8$ bits are being transmitted at a rate of
$10 \times 10^6$ bits per second. This is the interpretation we use throughout the book,
unless explicitly stated otherwise.
The good news is that many times we are satisfied with a back-of-the-envelope
calculation, in which case it is perfectly reasonable to pretend that a byte has 10 bits
in it (making it easy to convert between bits and bytes) and that $10^6$ is really equal
to $2^{20}$ (making it easy to convert between the two definitions of mega). Notice that
the first approximation introduces a 20% error, while the latter introduces only a 5%
error.
To help you in your quick-and-dirty calculations, 100 ms is a reasonable number to use
for a cross-country round-trip time (at least when the country in question is the United
States), and 1 ms is a good approximation of an RTT across a local area network. In the
case of the former, we increase the 48-ms round-trip time implied by the speed of light
over a fiber to 100 ms because there are, as we have said, other sources of delay, such
as the processing time in the switches inside the network. You can also be sure that the
path taken by the fiber between two points will not be a straight line.
1.5.2 Delay × Bandwidth
It is also useful to talk about the product of these two metrics, often called the
delay × bandwidth product. Intuitively, if we think of a channel between a pair of
processes as a hollow pipe (see Figure 1.22), where the latency corresponds to the length
of the pipe and the bandwidth gives the diameter of the pipe, then the delay × bandwidth
product gives the volume of the pipe, that is, the number of bits it holds. Said another
way, if latency (measured in time) corresponds to the length of the pipe, then given the
width of each bit (also measured in time), you can calculate how many bits fit in the
pipe. For example, a transcontinental channel with a one-way latency of 50 ms and a
bandwidth of 45 Mbps is able to hold
$50 \times 10^{-3}\ \text{seconds} \times 45 \times 10^{6}\ \text{bits/second} = 2.25 \times 10^{6}\ \text{bits}$
or approximately 280 KB of data. In other words, this example channel (pipe) holds as
many bytes as the memory of a personal computer from the early 1980s could hold.
Figure 1.22 Network as a pipe.
The delay × bandwidth product is important to know when constructing high-performance
networks because it corresponds to how many bits the sender must transmit before the
first bit arrives at the receiver. If the sender is expecting the receiver to somehow
signal that bits are starting to arrive, and it takes another channel latency for this
signal to propagate back to the sender (i.e., we are interested in the channel’s RTT
rather than just its one-way latency), then the sender can send up to two delay ×
bandwidth’s worth of data before hearing from the receiver that all is well. The bits in
the pipe are said to be “in flight,” which means that if the receiver tells the sender to
stop transmitting, it might receive up to a delay × bandwidth’s worth of data before the
sender manages to respond. In our example above, with the 100-ms RTT as the delay, that
amount corresponds to $4.5 \times 10^6$ bits (about 549 KB) of data. On the other hand,
if the sender does not fill the pipe (that is, send a whole delay × bandwidth product’s
worth of data before it stops to wait for a signal), the sender will not fully utilize
the network.
Note that most of the time we are interested in the RTT scenario, which we simply refer
to as the delay × bandwidth product, without explicitly saying that this product is
multiplied by two. Again, whether the “delay” in “delay × bandwidth” means one-way
latency or RTT is made clear by the context.
High-Speed Networks
The bandwidths available on today’s networks are increasing at a dramatic rate, and
there is eternal optimism that network bandwidth will continue to improve. This causes
network designers to start thinking about what happens in the limit, or stated another
way, what is the impact on network design of having infinite bandwidth available.
Although high-speed networks bring a dramatic change in the bandwidth available to
applications, in many respects their impact on how we think about networking comes in
what does not change as bandwidth increases: the speed of light. To quote Scotty from
Star Trek, “You cannae change the laws of physics.” In other words, “high speed” does not
mean that latency improves at the same rate as bandwidth; the transcontinental RTT of a
1-Gbps link is the same 100 ms as it is for a 1-Mbps link.
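The delay × bandwidth arithmetic for the 50-ms, 45-Mbps transcontinental channel, together with the unit conventions from the “How Big Is a Mega?” sidebar, can be sketched in a few lines. This is an illustrative calculation only: the helper name `delay_bandwidth_bits` is ours, and we assume Mbps means 10^6 bits per second and KB means 2^10 bytes.

```python
# Unit conventions assumed from the sidebar: Mbps = 10**6 bits/s, KB = 2**10 bytes.
MBPS = 10**6
KB = 2**10


def delay_bandwidth_bits(delay_s: float, bandwidth_bps: float) -> float:
    """Number of bits the 'pipe' holds for a given delay and bandwidth."""
    return delay_s * bandwidth_bps


one_way = delay_bandwidth_bits(50e-3, 45 * MBPS)      # one-way latency of 50 ms
round_trip = delay_bandwidth_bits(100e-3, 45 * MBPS)  # RTT of 100 ms

print(f"one-way:    {one_way:.3g} bits = {one_way / 8 / KB:.0f} KB")
print(f"round-trip: {round_trip:.3g} bits = {round_trip / 8 / KB:.0f} KB")

# Transmit time for a 32-KB message on a 10-Mbps channel.
message_bits = 32 * KB * 8
print(f"32 KB at 10 Mbps: {message_bits / (10 * MBPS) * 1e3:.2f} ms")
```

Using the RTT rather than the one-way latency simply doubles the product, which is why the context must say which delay is meant.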
Figure 1.23 Relationship between bandwidth and latency. With a 1-MB file, (a) the
1-Mbps cross-country link has 80 pipes full of data; (b) the 1-Gbps cross-country link
has 1/12 of one pipe full of data.
To appreciate the significance of ever-increasing bandwidth in the face of fixed
latency, consider what is required to transmit a 1-MB file over a 1-Mbps network
versus over a 1-Gbps network, both of which have an RTT of 100 ms. In the case
of the 1-Mbps network, it takes 80 round-trip times to transmit the file; during each
RTT, 1.25% of the file is sent. In contrast, the same 1-MB file doesn’t even come close
to filling 1 RTT’s worth of the 1-Gbps link, which has a delay × bandwidth product
of 12.5 MB.
Figure 1.23 illustrates the difference between the two networks. In effect, the
1-MB file looks like a stream of data that needs to be transmitted across a 1-Mbps
network, while it looks like a single packet on a 1-Gbps network. To help drive this
point home, consider that a 1-MB file is to a 1-Gbps network what a 1-KB packet is
to a 1-Mbps network.
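The comparison between the two links can be checked by dividing the file size by the delay × bandwidth product of each link. The sketch below assumes the back-of-the-envelope value of 10^6 bytes for 1 MB (which is what yields exactly 80 RTTs); the function name `pipes_full` is hypothetical.

```python
def pipes_full(file_bits: float, bandwidth_bps: float, rtt_s: float) -> float:
    """How many RTT-long pipes the file fills (file size / delay x bandwidth)."""
    return file_bits / (bandwidth_bps * rtt_s)


FILE_BITS = 8e6  # 1 MB, using the back-of-the-envelope value of 10**6 bytes
RTT = 0.100      # 100 ms

slow = pipes_full(FILE_BITS, 1e6, RTT)  # 1-Mbps link
fast = pipes_full(FILE_BITS, 1e9, RTT)  # 1-Gbps link

print(f"1 Mbps: {slow:.0f} RTTs, {100 / slow:.2f}% of the file per RTT")
print(f"1 Gbps: {fast:.2f} of one RTT (pipe holds {1e9 * RTT / 8 / 1e6:.1f} MB)")
```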
Another way to think about the situation is that more data can be transmitted
during each RTT on a high-speed network, so much so that a single RTT becomes a
significant amount of time. Thus, while you wouldn’t think twice about the difference
between a file transfer taking 101 RTTs rather than 100 RTTs (a relative difference
of only 1%), suddenly the difference between 1 RTT and 2 RTTs is significant—a
100% increase. In other words, latency, rather than throughput, starts to dominate
our thinking about network design.
Perhaps the best way to understand the relationship between throughput and
latency is to return to basics. The effective end-to-end throughput that can be achieved
over a network is given by the simple relationship
Throughput = TransferSize/TransferTime
where TransferTime includes not only the elements of one-way Latency identified earlier
in this section, but also any additional time spent requesting or setting up the transfer.
Generally, we represent this relationship as
TransferTime = RTT + 1/Bandwidth × TransferSize
We use RTT in this calculation to account for a request message being sent across the
network and the data being sent back. For example, consider a situation where a user
wants to fetch a 1-MB file across a 1-Gbps network with a round-trip time of 100 ms.
The TransferTime includes both the transmit time for 1 MB (1/1 Gbps × 1 MB = 8 ms),
and the 100-ms RTT, for a total transfer time of 108 ms. This means that the effective
throughput will be
1 MB/108 ms = 74.1 Mbps
not 1 Gbps. Clearly, transferring a larger amount of data will help improve the effective throughput, where in the limit, an infinitely large transfer size will cause the
effective throughput to approach the network bandwidth. On the other hand, having
to endure more than 1 RTT—for example, to retransmit missing packets—will hurt
the effective throughput for any transfer of finite size and will be most noticeable for
small transfers.
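The two relationships above translate directly into code. The following sketch reproduces the 1-MB example under the assumption that 1 MB is taken as 10^6 bytes (the value implied by the 8-ms transmit time); the function names are ours.

```python
def transfer_time_s(size_bits: float, bandwidth_bps: float, rtt_s: float) -> float:
    """TransferTime = RTT + (1 / Bandwidth) x TransferSize."""
    return rtt_s + size_bits / bandwidth_bps


def effective_throughput_bps(size_bits: float, bandwidth_bps: float, rtt_s: float) -> float:
    """Throughput = TransferSize / TransferTime."""
    return size_bits / transfer_time_s(size_bits, bandwidth_bps, rtt_s)


ONE_MB_BITS = 8e6  # assumption: 1 MB taken as 10**6 bytes, as in the 8-ms figure

t = transfer_time_s(ONE_MB_BITS, 1e9, 0.100)
tput = effective_throughput_bps(ONE_MB_BITS, 1e9, 0.100)
print(f"transfer time: {t * 1e3:.0f} ms, effective throughput: {tput / 1e6:.1f} Mbps")

# Larger transfers amortize the RTT and approach the link bandwidth.
for size_mb in (1, 10, 100, 1000):
    bps = effective_throughput_bps(size_mb * 8e6, 1e9, 0.100)
    print(f"{size_mb:>5} MB -> {bps / 1e6:7.1f} Mbps")
```

The loop illustrates the limiting behavior: as the transfer grows, the fixed RTT matters less and the effective throughput climbs toward, but never reaches, 1 Gbps.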
Application Performance Needs
The discussion in this section has taken a network-centric view of performance; that
is, we have talked in terms of what a given link or channel will support. The unstated
assumption has been that application programs have simple needs—they want as much
bandwidth as the network can provide. This is certainly true of the aforementioned
digital library program that is retrieving a 25-MB image; the more bandwidth that is
available, the faster the program will be able to return the image to the user.
However, some applications are able to state an upper limit on how much bandwidth they need. Video applications are a prime example. Suppose you want to stream
a video image that is one-quarter the size of a standard TV image; that is, it has a
resolution of 352 by 240 pixels. If each pixel is represented by 24 bits of information,
as would be the case for 24-bit color, then the size of each frame would be
(352 × 240 × 24)/8 = 247.5 KB
If the application needs to support a frame rate of 30 frames per second, then it might
request a throughput rate of about 61 Mbps. The ability of the network to provide more
bandwidth is of no interest to such an application because it has only so much data to
transmit in a given period of time.
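The frame-size and bit-rate arithmetic for this video example can be sketched as follows. We assume uncompressed frames, KB as 2^10 bytes, and Mbps as 10^6 bits per second; the function names are hypothetical. By this arithmetic, 30 frames per second comes to roughly 61 Mbps of raw data.

```python
def frame_bytes(width: int, height: int, bits_per_pixel: int) -> float:
    """Size of one uncompressed frame in bytes."""
    return width * height * bits_per_pixel / 8


def stream_rate_bps(width: int, height: int, bits_per_pixel: int, fps: int) -> int:
    """Raw bit rate of an uncompressed video stream."""
    return width * height * bits_per_pixel * fps


size = frame_bytes(352, 240, 24)
rate = stream_rate_bps(352, 240, 24, 30)
print(f"frame size: {size / 2**10:.1f} KB")          # KB taken as 2**10 bytes
print(f"raw rate at 30 fps: {rate / 1e6:.1f} Mbps")  # Mbps taken as 10**6 bits/s
```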
Unfortunately, the situation is not as simple as this example suggests. Because
the difference between any two adjacent frames in a video stream is often small, it is
possible to compress the video by transmitting only the differences between adjacent
frames. This compressed video does not flow at a constant rate, but varies with time
according to factors such as the amount of action and detail in the picture and the
compression algorithm being used. Therefore, it is possible to say what the average
bandwidth requirement will be, but the instantaneous rate may be more or less.
The key issue is the time interval over which the average is computed. Suppose
that this example video application can be compressed down to the point that it needs
only 2 Mbps, on average. If it transmits 1 megabit in a 1-second interval and 3 megabits
in the following 1-second interval, then over the 2-second interval it is transmitting at
an average rate of 2 Mbps; however, this will be of little consolation to a channel that
was engineered to support no more than 2 megabits in any one second. Clearly, just
knowing the average bandwidth needs of an application will not always suffice.
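The effect of the averaging interval can be illustrated with the 1-megabit and 3-megabit intervals from the example. The sketch below is a minimal illustration with hypothetical helper names; each list entry is the number of megabits sent in one 1-second interval.

```python
def average_rate_mbps(megabits_per_slot: list[float]) -> float:
    """Average rate over the whole window, one list entry per 1-second slot."""
    return sum(megabits_per_slot) / len(megabits_per_slot)


def slots_over_capacity(megabits_per_slot: list[float], capacity_mbps: float) -> list[int]:
    """Indices of 1-second slots whose traffic exceeds the channel capacity."""
    return [i for i, mb in enumerate(megabits_per_slot) if mb > capacity_mbps]


traffic = [1.0, 3.0]  # megabits sent in two consecutive 1-second intervals
print(average_rate_mbps(traffic))         # average over the 2-second window
print(slots_over_capacity(traffic, 2.0))  # slots in which a 2-Mbps channel is exceeded
```

The average over two seconds matches the channel's capacity, yet the second interval on its own exceeds it, which is exactly the problem the text describes.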
Generally, however, it is possible to put an upper bound on how big of a burst an
application like this is likely to transmit. A burst might be described by some peak rate
that is maintained for some period of time. Alternatively, it could be described as the
number of bytes that can be sent at the peak rate before reverting to the average rate
or some lower rate. If this peak rate is higher than the available channel capacity, then
the excess data will have to be buffered somewhere, to be transmitted later. Knowing
how big of a burst might be sent allows the network designer to allocate sufficient
buffer capacity to hold the burst. We will return to the subject of describing bursty
traffic accurately in Chapter 6.
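One way to see how a burst description feeds into buffer sizing is to compute the excess data that accumulates while the peak rate exceeds the channel capacity. The sketch below is a simplified model of our own (constant peak rate for a fixed duration, then a constant lower rate); the function names and the example numbers are hypothetical.

```python
def burst_buffer_bits(peak_bps: float, capacity_bps: float, burst_s: float) -> float:
    """Bits that must be buffered while a burst at peak rate exceeds the channel capacity."""
    excess_bps = max(0.0, peak_bps - capacity_bps)
    return excess_bps * burst_s


def drain_time_s(buffered_bits: float, capacity_bps: float, fallback_bps: float) -> float:
    """Time to empty the buffer once the source falls back to a lower rate."""
    spare_bps = capacity_bps - fallback_bps
    if spare_bps <= 0:
        raise ValueError("channel has no spare capacity to drain the buffer")
    return buffered_bits / spare_bps


# Hypothetical numbers: a 3-Mbps burst lasting 1 s on a 2.5-Mbps channel,
# after which the source drops to 1 Mbps.
buffered = burst_buffer_bits(3e6, 2.5e6, 1.0)
print(f"buffer needed: {buffered / 1e3:.0f} kilobits")
print(f"time to drain: {drain_time_s(buffered, 2.5e6, 1e6):.3f} s")
```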
Analogous to the way an application’s bandwidth needs can be something other
than “all it can get,” an application’s delay requirements may be more complex than
simply “as little delay as possible.” In the case of delay, it sometimes doesn’t matter so
much whether the one-way latency of the network is 100 ms or 500 ms as how much
the latency varies from packet to packet. The variation in latency is called jitter.
Consider the situation in which the source sends a packet once every 33 ms,
as would be the case for a video application transmitting frames 30 times a second.
If the packets arrive at the destination spaced out exactly 33 ms apart, then we can
deduce that the delay experienced by each packet in the network was exactly the same.
If the spacing between when packets arrive at the destination (sometimes called the
interpacket gap) is variable, however, then the delay experienced by the sequence of
packets must have also been variable, and the network is said to have introduced
jitter into the packet stream, as shown in Figure 1.24. Such variation is generally
not introduced in a single physical link, but it can happen when packets experience
different queuing delays in a multihop packet-switched network. This queuing delay
corresponds to the Queue component of latency defined earlier in this section, which
varies with time.
Figure 1.24 Network-induced jitter.
To understand the relevance of jitter, suppose that the packets being transmitted
over the network contain video frames, and in order to display these frames on the
screen the receiver needs to receive a new one every 33 ms. If a frame arrives early,
then it can simply be saved by the receiver until it is time to display it. Unfortunately,
if a frame arrives late, then the receiver will not have the frame it needs in time to
update the screen, and the video quality will suffer; it will not be smooth. Note that it
is not necessary to eliminate jitter, only to know how bad it is. The reason for this is
that if the receiver knows the upper and lower bounds on the latency that a packet can
experience, it can delay the time at which it starts playing back the video (i.e., displays
the first frame) long enough to ensure that in the future it will always have a frame to
display when it needs it. The receiver delays the frame, effectively smoothing out the
jitter, by storing it in a buffer. We return to the topic of jitter in Chapter 9.
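The playback-delay idea can be sketched with a small simulation. We assume, for simplicity, that the receiver knows each frame's send time (for example, from a timestamp) and the upper bound on latency; the latency values and the function name `playout_times` are hypothetical.

```python
def playout_times(send_times_ms, latencies_ms, max_latency_ms):
    """Return (arrival, playout) pairs when playback is delayed by the latency upper bound."""
    schedule = []
    for sent, latency in zip(send_times_ms, latencies_ms):
        arrival = sent + latency
        playout = sent + max_latency_ms  # fixed offset from the send time
        schedule.append((arrival, playout))
    return schedule


FRAME_INTERVAL_MS = 33
latencies = [100, 140, 110, 160, 120]  # hypothetical per-packet one-way latencies
sends = [i * FRAME_INTERVAL_MS for i in range(len(latencies))]

jitter = max(latencies) - min(latencies)
schedule = playout_times(sends, latencies, max(latencies))

print(f"latency varies by {jitter} ms")
for arrival, playout in schedule:
    print(f"arrives {arrival:4d} ms, displayed {playout:4d} ms, buffered {playout - arrival:3d} ms")
```

Because every frame is displayed at a fixed offset equal to the worst-case latency, no frame is ever late, and frames that arrive early simply wait in the buffer.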
1.6 Summary
Computer networks like the Internet have experienced enormous growth over the
past decade and are now positioned to provide a wide range of services—remote file
access, digital libraries, videoconferencing—to hundreds of millions of users. Much
of this growth can be attributed to the general-purpose nature of computer networks,
and in particular to the ability to add new functionality to the network by writing
software that runs on affordable, high-performance computers. With this in mind,
the overriding goal of this book is to describe computer networks in such a way that
when you finish reading it, you should feel that if you had an army of programmers
at your disposal, you could actually build a fully functional computer network from
the ground up. This chapter lays the foundation for realizing this goal.
The first step we have taken toward this goal is to carefully identify exactly
what we expect from a network. For example, a network must first provide cost-effective connectivity among a set of computers. This is accomplished through a nested
interconnection of nodes and links, and by sharing this hardware base through the use
of statistical multiplexing. This results in a packet-switched network, on top of which
we then define a collection of process-to-process communication services.
The second step is to define a layered architecture that will serve as a blueprint for
our design. The central objects of this architecture are network protocols. Protocols
both provide a communication service to higher-level protocols and define the form
and meaning of messages exchanged with their peers running on other machines. We
have briefly surveyed two of the most widely used architectures: the OSI architecture
and the Internet architecture. This book most closely follows the Internet architecture,
both in its organization and as a source of examples.
The third step is to implement the network’s protocols and application programs,
usually in software. Both protocols and applications need an interface by which they
invoke the services of other protocols in the network subsystem. The socket interface is
the most widely used interface between application programs and the network subsystem, but a slightly different interface is typically used within the network subsystem.
Finally, the network as a whole must offer high performance, where the two
performance metrics we are most interested in are latency and throughput. As we will
see in later chapters, it is the product of these two metrics—the so-called delay ×
bandwidth product—that often plays a critical role in protocol design.
Open Issue: Ubiquitous Networking
There is little doubt that computer networks are becoming an integral part of the
everyday lives of vast numbers of people. What began over 20 years ago as experimental
systems like the ARPANET (connecting mainframe computers over long-distance telephone
lines) has turned into big business. And where there is big business, there are lots of
players. In this case, there is the computing industry, which has
become increasingly involved in supporting packet-switched networking products; the
telephone carriers, which recognize the market for carrying all sorts of data, not just
voice; and the cable TV industry, which currently owns the entertainment portion of
the market.
Assuming that the goal is ubiquitous networking—to bring the network into
every household—the first problem that must be addressed is how to establish the
necessary physical links. Although it could be argued that the ultimate answer is to
bring an optical fiber into every home, at an estimated $1000 per house and 100 million
homes in the U.S. alone, this is a $100 billion proposition. The most widely discussed
alternatives make use of either the existing cable TV facilities or the copper pairs used
to deliver telephone service. Each of these approaches has its own set of problems. For
example, today’s cable facilities are asymmetric—you can deliver 150 channels into
every home, but the outgoing bandwidth is severely limited. Such asymmetry implies
that there are a small number of information providers, but that most of us are simply
information consumers. Many people would argue that in a democracy we should
all have an equal opportunity to provide information. Digital subscriber line (DSL)
technology need not be asymmetric, but can only offer high-bandwidth connections
to a subset of consumers over the existing telephone wires.
How the struggle between the computer companies, the telephone companies,
and the cable industry will play out in the marketplace is anyone’s guess. (If we knew
the answer, we’d be charging a lot more for this book.) All we know is that there
are many technical obstacles—issues of connectivity, levels of service, performance,
reliability, and fairness—that stand between the current state of the art and the sort of
global, ubiquitous, heterogeneous network that we believe is possible and desirable.
It is these challenges that are the focus of this book.
Further Reading
Computer networks are not the first communication-oriented technology to have
found their way into the everyday fabric of our society. For example, the early part
of this century saw the introduction of the telephone, and then during the 1950s
television became widespread. When considering the future of networking (how widely
it will spread and how we will use it), it is instructive to study this history. Our first
reference is a good starting point for doing this (the entire issue is devoted to the first
100 years of telecommunications).
The second and third papers are the seminal papers on the OSI and Internet
architectures, respectively. The Zimmerman paper introduces the OSI architecture, and
the Clark paper is a retrospective. The final two papers are not specific to networking,
but ones that every systems person should read. The Saltzer et al. paper motivates
and describes one of the most widely applied rules of system design—the end-to-end
argument. The paper by Mashey describes the thinking behind RISC architectures; as
we will soon discover, making good judgments about where to place functionality in
a complex system is what system design is all about.
■ Pierce, J. Telephony—a personal view. IEEE Communications 22(5):116–120,
May 1984.
■ Zimmerman, H. OSI reference model—the ISO model of architecture for
open systems interconnection. IEEE Transactions on Communications COM-28(4):425–432, April 1980.
■ Clark, D. The design philosophy of the DARPA Internet protocols. Proceedings of the SIGCOMM ’88 Symposium, pages 106–114, August 1988.
■ Saltzer, J., D. Reed, and D. Clark. End-to-end arguments in system design.
ACM Transactions on Computer Systems 2(4):277–288, November 1984.
■ Mashey, J. RISC, MIPS, and the motion of complexity. UniForum 1986 Conference Proceedings, pages 116–124, 1986.
Several texts offer an introduction to computer networking: Stallings gives an
encyclopedic treatment of the subject, with an emphasis on the lower levels of the OSI
hierarchy [Sta00a]; Tanenbaum uses the OSI architecture as an organizational model
[Tan02]; Comer gives an overview of the Internet architecture [Com00]; and Bertsekas
and Gallager discuss networking from a performance modeling perspective [BG92].
To put computer networking into a larger context, two books—one dealing
with the past and the other looking toward the future—are must reading. The first is
Holzmann and Pehrson’s The Early History of Data Networks [HP95]. Surprisingly,
many of the ideas covered in the book you are now reading were invented during
the 1700s. The second is Realizing the Information Future: The Internet and Beyond,
a book prepared by the Computer Science and Telecommunications Board of the
National Research Council [NRC94].
To follow the history of the Internet from its beginning, you are encouraged to
peruse the Internet’s Request for Comments (RFC) series of documents. These documents, which include everything from the TCP specification to April Fools’ jokes, are
retrievable at For example, the protocol specifications for
TCP, UDP, and IP are available in RFC 793, 768, and 791, respectively.
To gain a better appreciation for the Internet philosophy and culture, two references are must reading; both are also quite entertaining. Padlipsky gives a good
description of the early days, including a pointed comparison of the Internet and OSI
architectures [Pad85]. For a more up-to-date account of what really happens behind
the scenes at the Internet Engineering Task Force, we recommend Boorsook’s article
There is a wealth of articles discussing various aspects of protocol implementations. A good starting point is to understand two complete protocol implementation
environments: the Stream mechanism from System V Unix [Rit84] and the x-kernel
[HP91]. In addition, [LMKQ89] and [SW95] describe the widely used Berkeley Unix
implementation of TCP/IP.
More generally, there is a large body of work addressing the issue of structuring
and optimizing protocol implementations. Clark was one of the first to discuss the
relationship between modular design and protocol performance [Cla82]. Later papers then introduce the use of upcalls in structuring protocol code [Cla85] and study
the processing overheads in TCP [CJRS89]. Finally, [WM87] describes how to gain
efficiency through appropriate design and implementation choices.
Several papers have introduced specific techniques and mechanisms that can be
used to improve protocol performance. For example, [HMPT89] describes some of
the mechanisms used in the x-kernel, [MD93] discusses various implementations of
demultiplexing tables, [VL87] introduces the timing wheel mechanism used to manage
protocol events, and [DP93] describes an efficient buffer management strategy. Also,
the performance of protocols running on parallel processors—locking is a key issue in
such environments—is discussed in [BG93] and [NYKT94].
Because many aspects of protocol implementation depend on an understanding
of the basics of operating systems, we recommend Finkel [Fin88], Bic and Shaw [BS88],
and Tanenbaum [Tan01] for an introduction to OS concepts.
Finally, we conclude the “Further Reading” section of each chapter with a set
of live references, that is, URLs for locations on the World Wide Web where you can
learn more about the topics discussed in that chapter. Since these references are live,
it is possible that they will not remain active for an indefinite period of time. For this
reason, we limit the set of live references at the end of each chapter to sites that either
export software, provide a service, or report on the activities of an ongoing working
group or standardization body. In other words, we only give URLs for the kinds of
material that cannot easily be referenced using standard citations. For this chapter, we
include four live references:
■ information about this book, including supplements,
addendums, and so on
■ status of various networking standards, including those of the IETF, ISO, and IEEE
■ information about the IETF and its working groups
■˜ hgs/netbib/: searchable bibliography of network-related research papers
Exercises
1 Use anonymous FTP to connect to (directory in-notes), and retrieve the
RFC index. Also retrieve the protocol specifications for TCP, IP, and UDP.
2 Look up the Web site
Here you can read about current network research under way at Princeton University and see a picture of author Larry Peterson. Follow links to find a picture
of author Bruce Davie.
3 Use a Web search tool to locate useful, general, and noncommercial information
about the following topics: MBone, ATM, MPEG, IPv6, and Ethernet.
4 The Unix utility whois can be used to find the domain name corresponding to
an organization, or vice versa. Read the man page documentation for whois and
experiment with it. Try whois and whois princeton, for starters.
5 Calculate the total time required to transfer a 1000-KB file in the following cases,
assuming an RTT of 100 ms, a packet size of 1 KB and an initial 2 × RTT of
“handshaking” before data is sent.
(a) The bandwidth is 1.5 Mbps, and data packets can be sent continuously.
(b) The bandwidth is 1.5 Mbps, but after we finish sending each data packet we
must wait one RTT before sending the next.
(c) The bandwidth is “infinite,” meaning that we take transmit time to be zero,
and up to 20 packets can be sent per RTT.
(d) The bandwidth is infinite, and during the first RTT we can send one packet
($2^{1-1}$), during the second RTT we can send two packets ($2^{2-1}$), during the third
we can send four ($2^{3-1}$), and so on. (A justification for such an exponential
increase will be given in Chapter 6.)
6 Calculate the total time required to transfer a 1.5-MB file in the following cases,
assuming an RTT of 80 ms, a packet size of 1 KB and an initial 2 × RTT of “handshaking” before data is sent.
(a) The bandwidth is 10 Mbps, and data packets can be sent continuously.
(b) The bandwidth is 10 Mbps, but after we finish sending each data packet we
must wait one RTT before sending the next.
(c) The link allows infinitely fast transmit, but limits bandwidth such that only
20 packets can be sent per RTT.
(d) Zero transmit time as in (c), but during the first RTT we can send one packet,
during the second RTT we can send two packets, during the third we can send
four = $2^{3-1}$, and so on. (A justification for such an exponential increase will
be given in Chapter 6.)
7 Consider a point-to-point link 2 km in length. At what bandwidth would propagation delay (at a speed of $2 \times 10^8$ m/s) equal transmit delay for 100-byte packets?
What about 512-byte packets?
8 Consider a point-to-point link 50 km in length. At what bandwidth would propagation delay (at a speed of $2 \times 10^8$ m/s) equal transmit delay for 100-byte packets?
What about 512-byte packets?
9 What properties of postal addresses would be likely to be shared by a network
addressing scheme? What differences might you expect to find? What properties
of telephone numbering might be shared by a network addressing scheme?
10 One property of addresses is that they are unique; if two nodes had the same
address it would be impossible to distinguish between them. What other properties
might be useful for network addresses to have? Can you think of any situations
in which network (or postal or telephone) addresses might not be unique?
11 Give an example of a situation in which multicast addresses might be beneficial.
12 What differences in traffic patterns account for the fact that STDM is a cost-effective form of multiplexing for a voice telephone network and FDM is a cost-effective form of multiplexing for television and radio networks, yet we reject both
as not being cost-effective for a general-purpose computer network?
13 How “wide” is a bit on a 1-Gbps link? How long is a bit in copper wire, where
the speed of propagation is $2.3 \times 10^8$ m/s?
14 How long does it take to transmit x KB over a y-Mbps link? Give your answer as
a ratio of x and y.
15 Suppose a 100-Mbps point-to-point link is being set up between Earth and a new
lunar colony. The distance from the moon to Earth is approximately 385,000 km,
and data travels over the link at the speed of light, $3 \times 10^8$ m/s.
(a) Calculate the minimum RTT for the link.
(b) Using the RTT as the delay, calculate the delay × bandwidth product for the
link.
(c) What is the significance of the delay × bandwidth product computed in (b)?
(d) A camera on the lunar base takes pictures of Earth and saves them in digital
format to disk. Suppose Mission Control on Earth wishes to download the
most current image, which is 25 MB. What is the minimum amount of time
that will elapse between when the request for the data goes out and the transfer
is finished?
16 Suppose a 128-Kbps point-to-point link is set up between Earth and a rover on
Mars. The distance from Earth to Mars (when they are closest together) is approximately 55 Gm, and data travels over the link at the speed of light, $3 \times 10^8$ m/s.
(a) Calculate the minimum RTT for the link.
(b) Calculate the delay × bandwidth product for the link.
(c) A camera on the rover takes pictures of its surroundings and sends these to
Earth. How quickly after a picture is taken can it reach Mission Control on
Earth? Assume that each image is 5 Mb in size.
17 For each of the following operations on a remote file server, discuss whether they
are more likely to be delay sensitive or bandwidth sensitive.
(a) Open a file.
(b) Read the contents of a file.
(c) List the contents of a directory.
(d) Display the attributes of a file.
18 Calculate the latency (from first bit sent to last bit received) for the following:
(a) 10-Mbps Ethernet with a single store-and-forward switch in the path, and
a packet size of 5000 bits. Assume that each link introduces a propagation
delay of 10 μs and that the switch begins retransmitting immediately after it
has finished receiving the packet.
(b) Same as (a) but with three switches.
(c) Same as (a) but assume the switch implements “cut-through” switching: It is
able to begin retransmitting the packet after the first 200 bits have been received.
19 Calculate the latency (from first bit sent to last bit received) for the following:
(a) 1-Gbps Ethernet with a single store-and-forward switch in the path, and a
packet size of 5000 bits. Assume that each link introduces a propagation delay
of 10 μs and that the switch begins retransmitting immediately after it has
finished receiving the packet.
(b) Same as (a) but with three switches.
(c) Same as (b) but assume the switch implements “cut-through” switching: It
is able to begin retransmitting the packet after the first 128 bits have been received.
20 Calculate the effective bandwidth for the following cases. For (a) and (b) assume
there is a steady supply of data to send; for (c) simply calculate the average over
12 hours.
(a) 10-Mbps Ethernet through three store-and-forward switches as in Exercise
18(b). Switches can send on one link while receiving on the other.
(b) Same as (a) but with the sender having to wait for a 50-byte acknowledgment
packet after sending each 5000-bit data packet.
(c) Overnight (12-hour) shipment of 100 compact disks (650 MB each).
21 Calculate the bandwidth × delay product for the following links. Use one-way
delay, measured from first bit sent to first bit received.
(a) 10-Mbps Ethernet with a delay of 10 μs.
(b) 10-Mbps Ethernet with a single store-and-forward switch like that of
Exercise 18(a), packet size 5000 bits, and 10 μs per link propagation delay.
(c) 1.5-Mbps T1 link, with a transcontinental one-way delay of 50 ms.
(d) 1.5-Mbps T1 link through a satellite in geosynchronous orbit, 35,900 km
high. The only delay is speed-of-light propagation delay.
22 Hosts A and B are each connected to a switch S via 10-Mbps links as in
Figure 1.25. The propagation delay on each link is 20 μs. S is a store-and-forward device; it begins retransmitting a received packet 35 μs after it has finished
receiving it. Calculate the total time required to transmit 10,000 bits from
A to B
(a) as a single packet
(b) as two 5000-bit packets sent one right after the other
Figure 1.25 Diagram for Exercise 22.
23 Suppose a host has a 1-MB file that is to be sent to another host. The file takes
1 second of CPU time to compress 50%, or 2 seconds to compress 60%.
(a) Calculate the bandwidth at which each compression option takes the same
total compression + transmission time.
(b) Explain why latency does not affect your answer.
24 Suppose that a certain communications protocol involves a per-packet overhead
of 100 bytes for headers and framing. We send 1 million bytes of data using this
protocol; however, one data byte is corrupted and the entire packet containing it
is thus lost. Give the total number of overhead + loss bytes for packet data sizes
of 1000, 5000, 10,000, and 20,000 bytes. Which size is optimal?
25 Assume you wish to transfer an n-byte file along a path composed of the source,
destination, seven point-to-point links, and five switches. Suppose each link has a
propagation delay of 2 ms, bandwidth of 4 Mbps, and that the switches support
both circuit and packet switching. Thus you can either break the file up into 1-KB
packets, or set up a circuit through the switches and send the file as one contiguous
bit stream. Suppose that packets have 24 bytes of packet header information and
1000 bytes of payload, that store-and-forward packet processing at each switch
incurs a 1-ms delay after the packet has been completely received, that packets
may be sent continuously without waiting for acknowledgments, and that circuit
setup requires a 1-KB message to make one round-trip on the path incurring a
1-ms delay at each switch after the message has been completely received. Assume
switches introduce no delay to data traversing a circuit. You may also assume that
file size is a multiple of 1000 bytes.
(a) For what file size n bytes is the total number of bytes sent across the network
less for circuits than for packets?
(b) For what file size n bytes is the total latency incurred before the entire file
arrives at the destination less for circuits than for packets?
(c) How sensitive are these results to the number of switches along the path? To
the bandwidth of the links? To the ratio of packet size to packet header size?
(d) How accurate do you think this model of the relative merits of circuits and
packets is? Does it ignore important considerations that discredit one or the
other approach? If so, what are they?
26 Consider a closed-loop network (e.g., token ring) with bandwidth 100 Mbps and
propagation speed of 2 × 10^8 m/s. What would the circumference of the loop be
to exactly contain one 250-byte packet, assuming nodes do not introduce delay?
What would the circumference be if there was a node every 100 m, and each node
introduced 10 bits of delay?
27 Compare the channel requirements for voice traffic with the requirements for the
real-time transmission of music, in terms of bandwidth, delay, and jitter. What
would have to improve? By approximately how much? Could any channel requirements be relaxed?
28 For the following, assume that no data compression is done; this would in practice almost never be the case. For (a)–(c), calculate the bandwidth necessary for
transmitting in real time:
(a) Video at a resolution of 640 × 480, 3 bytes/pixel, 30 frames/second.
(b) 160 × 120 video, 1 byte/pixel, 5 frames/second.
(c) CD-ROM music, assuming one CD holds 75 minutes’ worth and takes
650 MB.
(d) Assume a fax transmits an 8 × 10-inch black-and-white image at a resolution
of 72 pixels per inch. How long would this take over a 14.4-Kbps modem?
29 For the following, as in the previous problem, assume that no data compression
is done. Calculate the bandwidth necessary for transmitting in real time:
(a) HDTV high-definition video at a resolution of 1920 × 1080, 24 bits/pixel,
30 frames/second.
(b) POTS (plain old telephone service) voice audio of 8-bit samples at 8 KHz.
(c) GSM mobile voice audio of 260-bit samples at 50 Hz.
(d) HDCD high-definition audio of 24-bit samples at 88.2 KHz.
30 Discuss the relative performance needs of the following applications, in terms of
average bandwidth, peak bandwidth, latency, jitter, and loss tolerance:
(a) File server
(b) Print server
(c) Digital library
(d) Routine monitoring of remote weather instruments
(e) Voice
(f) Video monitoring of a waiting room
(g) Television broadcasting
31 Suppose a shared medium M offers to hosts A1, A2, ..., AN in round-robin fashion
an opportunity to transmit one packet; hosts that have nothing to send immediately
relinquish M. How does this differ from STDM? How does network utilization
of this scheme compare with STDM?
32 Consider a simple protocol for transferring files over a link. After some initial
negotiation, A sends data packets of size 1 KB to B; B then replies with an acknowledgment. A always waits for each ACK before sending the next data packet;
this is known as stop-and-wait. Packets that are overdue are presumed lost and
are retransmitted.
(a) In the absence of any packet losses or duplications, explain why it is not
necessary to include any “sequence number” data in the packet headers.
(b) Suppose that the link can lose occasional packets, but that packets that do
arrive always arrive in the order sent. Is a 2-bit sequence number (that is, N
mod 4) enough for A and B to detect and resend any lost packets? Is a 1-bit
sequence number enough?
(c) Now suppose that the link can deliver out of order, and that sometimes a
packet can be delivered as much as 1 minute after subsequent packets. How
does this change the sequence number requirements?
33 Suppose hosts A and B are connected by a link. Host A continuously transmits the
current time from a high-precision clock, at a regular rate, fast enough to consume
all the available bandwidth. Host B reads these time values and writes them each
paired with its own time from a local clock synchronized with A’s. Give qualitative
examples of B’s output assuming the link has
(a) high bandwidth, high latency, low jitter
(b) low bandwidth, high latency, high jitter
(c) high bandwidth, low latency, low jitter, occasional lost data
For example, a link with zero jitter, a bandwidth high enough to write on every
other clock tick, and a latency of 1 tick might yield something like (0000, 0001),
(0002, 0003), (0004, 0005).
34 Obtain and build the simplex-talk sample socket program shown in the text. Start
one server and one client, in separate windows. While the first client is running,
start 10 other clients that connect to the same server; these other clients should
most likely be started in the background with their input redirected from a file.
What happens to these 10 clients? Do their connect()s fail, or time out, or succeed?
Do any other calls block? Now let the first client exit. What happens? Try this
with the server value MAX_PENDING set to 1 as well.
35 Modify the simplex-talk socket program so that each time the client sends a line
to the server, the server sends the line back to the client. The client (and server)
will now have to make alternating calls to recv() and send().
36 Modify the simplex-talk socket program so that it uses UDP as the transport protocol, rather than TCP. You will have to change SOCK_STREAM to SOCK_DGRAM
in both client and server. Then, in the server, remove the calls to listen() and accept(), and replace the two nested loops at the end with a single loop that calls
recv() with socket s. Finally, see what happens when two such UDP clients simultaneously connect to the same UDP server, and compare this to the TCP behavior.
37 Investigate the different options and parameters that you can set for a TCP connection. (Do man tcp on Unix.) Experiment with various parameter settings to see
how they affect TCP performance.
38 The Unix utility ping can be used to find the RTT to various Internet hosts. Read
the man page for ping, and use it to find the RTT to in New
Jersey and in California. Measure the RTT values at different times
of day, and compare the results. What do you think accounts for the differences?
39 The Unix utility traceroute, or its Windows equivalent tracert, can be used to find
the sequence of routers through which a message is routed. Use this to find the
path from your site to some others. How well does the number of hops correlate
with the RTT times from ping? How well does the number of hops correlate with
geographical distance?
40 Use traceroute, above, to map out some of the routers within your organization
(or to verify none are used).
Direct Link Networks
It is a mistake to look too far ahead. Only one link in the chain of
destiny can be handled at a time.
—Winston Churchill
Problem: Physically Connecting Hosts
The simplest network possible is one in which all the hosts are directly connected
by some physical medium. This may be a wire or a fiber, and it may cover a small
area (e.g., an office building) or a wide area (e.g., transcontinental). Connecting
two or more nodes with a suitable medium is only the first step, however. There are
five additional problems that must be addressed before the nodes can successfully
exchange packets.
The first is encoding bits onto the wire or fiber so that they can be understood
by a receiving host. Second is the matter of delineating the sequence of bits
transmitted over the link into complete messages that can be delivered to the end
node. This is
called the framing problem, and the messages delivered to the end hosts are often
called frames. Third, because frames are sometimes corrupted during transmission, it
is necessary to detect these errors and take the appropriate action; this is the error
detection problem. The fourth issue is making a link appear reliable in spite of the
fact that it corrupts frames from time to time. Finally, in those cases where the link is
shared by multiple hosts—as opposed to a simple point-to-point link—it is necessary
to mediate access to this link. This is the media access control problem.
Although these five issues—encoding, framing, error detection, reliable delivery,
and access mediation—can be discussed in the abstract, they are very real problems
that are addressed in different ways by different networking technologies. This chapter
considers these issues in the context of four specific network technologies: point-to-point links, Carrier Sense Multiple Access (CSMA) networks (of which Ethernet is the
most famous example), token rings (of which IEEE Standard 802.5 and FDDI are the
most famous examples), and wireless (for which 802.11 is
an emerging standard). The goal of this chapter is simultaneously to survey the available network technology and
to explore these five fundamental issues.
Before tackling the specific issues of connecting hosts,
this chapter begins by examining the building blocks that
will be used: nodes and links. We then explore the first
three issues—encoding, framing, and error detection—in
the context of a simple point-to-point link. The techniques
introduced in these three sections are general and therefore apply equally well to multiple-access networks. The
problem of reliable delivery is considered next. Since link-level reliability is usually not implemented in shared-access
networks, this discussion focuses on point-to-point links
only. Finally, we address the media access problem in the
context of CSMA, token rings, and wireless.
Note that these five functions are, in general, implemented in a network adaptor—a board that plugs into a
host’s I/O bus on one end and into the physical medium
on the other end. In other words, bits are exchanged between adaptors, but correct frames are exchanged between
nodes. This adaptor is controlled by software running on
the node—the device driver—which, in turn, is typically
represented as the bottom protocol in a protocol graph.
This chapter concludes with a concrete example of a network adaptor and sketches the device driver for such an adaptor.
2 Direct Link Networks
2.1 Hardware Building Blocks
As we saw in Chapter 1, networks are constructed from two classes of hardware
building blocks: nodes and links. This statement is just as true for the simplest possible
network—one in which a single point-to-point link connects a pair of nodes—as it is for
a worldwide internet. This section gives a brief overview of what we mean by nodes
and links and, in so doing, defines the underlying technology that we will assume
throughout the rest of this book.
Nodes are often general-purpose computers, like a desktop workstation, a multiprocessor, or a PC. For our purposes, let’s assume it’s a workstation-class machine. This
workstation can serve as a host that users run application programs on, it might be
used inside the network as a switch that forwards messages from one link to another,
or it might be configured as a router that forwards internet packets from one network
to another. In some cases, a network node—most commonly a switch or router inside
the network, rather than a host—is implemented by special-purpose hardware. This
is usually done for reasons of performance and cost: It is generally possible to build
custom hardware that performs a particular function faster and cheaper than a general-purpose processor can perform it. When this happens, we will first describe the basic
function being performed by the node as though this function is being implemented
in software on a general-purpose workstation, and then explain why and how this
functionality might instead be implemented by special hardware.
Although we could leave it at that, it is useful to know a little bit about what a
workstation looks like on the inside. This information becomes particularly important
when we become concerned about how well the network performs. Figure 2.1 gives
a simple block diagram of the workstation-class machine we assume throughout this
book. There are three key features of this figure that are worth noting.
First, the memory on any given machine is finite. It may be 4 MB or it may be
128 MB, but it is not infinite. As pointed out in Section 1.2.2, this is important because
memory turns out to be one of the two scarce resources in the network (the other is
link bandwidth) that must be carefully managed if we are to provide a fair amount of
network capacity to each user. Memory is a scarce resource because on a node that
serves as a switch or router, packets must be buffered in memory while waiting their
turn to be transmitted over an outgoing link.
Second, each node connects to the network via a network adaptor. This adaptor
generally sits on the system’s I/O bus and delivers data between the workstation’s
memory and the network link. A software module running on the workstation—the
device driver—manages this adaptor. It issues commands to the adaptor, telling it,
for example, from what memory location outgoing data should be transmitted and
into what memory location incoming data should be stored. Adaptors are discussed
in more detail in Section 2.9.
Figure 2.1 Example workstation architecture.
Finally, while CPUs are becoming faster at an unbelievable pace, the same is
not true of memory. Recent performance trends show processor speeds doubling every
18 months, but memory latency improving at a rate of only 7% each year. The relevance
of this difference is that as a network node, a workstation runs at memory speeds, not
processor speeds, to a first approximation. This means that the network software needs
to be careful about how it uses memory and, in particular, about how many times it
accesses memory as it processes each message. We do not have the luxury of being
sloppy just because processors are becoming infinitely fast.
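To get a feel for how quickly this gap widens, the two growth rates quoted above can be compounded over a few years. The sketch below assumes both trends hold steadily, which is an idealization, and the function names are illustrative only.

```python
def processor_speedup(years: float, doubling_months: float = 18.0) -> float:
    """Relative processor speed after `years`, doubling every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


def memory_speedup(years: float, annual_rate: float = 0.07) -> float:
    """Relative memory latency improvement after `years` at `annual_rate` per year."""
    return (1.0 + annual_rate) ** years


if __name__ == "__main__":
    # Compare the two trends at a few horizons and show how the gap grows.
    for years in (1.5, 3, 6, 9):
        cpu = processor_speedup(years)
        mem = memory_speedup(years)
        print(f"{years:>4} years: processor x{cpu:6.1f}, "
              f"memory x{mem:4.2f}, gap x{cpu / mem:6.1f}")
```

Under these assumptions the processor-to-memory gap grows by more than an order of magnitude within a decade, which is why the number of memory accesses per message matters more than the number of instructions.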
Network links are implemented on a variety of different physical media, including
twisted pair (the wire that your phone connects to), coaxial cable (the wire that your
TV connects to), optical fiber (the medium most commonly used for high-bandwidth,
long-distance links), and space (the stuff that radio waves, microwaves, and infrared
beams propagate through). Whatever the physical medium, it is used to propagate
signals. These signals are actually electromagnetic waves traveling at the speed of
light. (The speed of light is, however, medium dependent—electromagnetic waves
traveling through copper and fiber do so at about two-thirds the speed of light in a vacuum.)
One important property of an electromagnetic wave is the frequency, measured in
hertz, with which the wave oscillates. The distance between a pair of adjacent maxima
or minima of a wave, typically measured in meters, is called the wave’s wavelength.
Since all electromagnetic waves travel at the speed of light, that speed divided by the
wave’s frequency is equal to its wavelength. We have already seen the example of a
voice-grade telephone line, which carries continuous electromagnetic signals ranging
between 300 Hz and 3300 Hz; a 300-Hz wave traveling through copper would have
a wavelength of
SpeedOfLightInCopper ÷ Frequency
= 2/3 × 3 × 10^8 ÷ 300
= 667 × 10^3 meters
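The same calculation can be repeated for other frequencies with a small helper. This is a minimal sketch that uses the text's rounded speed of light and its two-thirds approximation for copper; the constant and function names are made up for illustration.

```python
SPEED_OF_LIGHT_VACUUM = 3.0e8        # m/s, the rounded value used in the text
COPPER_VELOCITY_FACTOR = 2.0 / 3.0   # the text's "about two-thirds" approximation


def wavelength_m(frequency_hz: float,
                 velocity_factor: float = COPPER_VELOCITY_FACTOR) -> float:
    """Wavelength in meters: propagation speed divided by frequency."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return velocity_factor * SPEED_OF_LIGHT_VACUUM / frequency_hz


if __name__ == "__main__":
    # The two edges of the voice-grade band mentioned in the text.
    for f in (300, 3300):
        print(f"{f} Hz in copper: {wavelength_m(f) / 1e3:.1f} km")
```

At the upper edge of the voice band (3300 Hz) the wavelength is about 11 times shorter than at 300 Hz, since wavelength is inversely proportional to frequency.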
Generally, electromagnetic waves span a much wider range of frequencies, ranging
from radio waves, to infrared light, to visible light, to X rays and gamma rays. Figure
2.2 depicts the electromagnetic spectrum and shows which media are commonly used
to carry which frequency bands.
So far we understand a link to be a physical medium carrying signals in the form
of electromagnetic waves. Such links provide the foundation for transmitting all sorts
of information, including the kind of data we are interested in transmitting—binary
data (1s and 0s). We say that the binary data is encoded in the signal. The problem of
encoding binary data onto electromagnetic signals is a complex topic. To help make
the topic more manageable, we can think of it as being divided into two layers. The
lower layer is concerned with modulation—varying the frequency, amplitude, or phase
of the signal to effect the transmission of information. A simple example of modulation
is to vary the power (amplitude) of a single wavelength. Intuitively, this is equivalent
to turning a light on and off. Because the issue of modulation is secondary to our
discussion of links as a building block for computer networks, we simply assume
that it is possible to transmit a pair of distinguishable signals (think of them as a
“high” signal and a “low” signal) and we consider only the upper layer, which is
concerned with the much simpler problem of encoding binary data onto these two
signals. Section 2.2 discusses such encodings.
Figure 2.2 Electromagnetic spectrum.
Another attribute of a link is how many bit streams can be encoded on it at
a given time. If the answer is only one, then the nodes connected to the link must
share access to the link. This is the case for the multiple-access links described in
Sections 2.6 and 2.7. For point-to-point links, however, it is often the case that two bit
streams can be transmitted over the link at the same time, one going in
each direction. Such a link is said to be full-duplex. A point-to-point link that supports
data flowing in only one direction at a time—such a link is called half-duplex—requires
that the two nodes connected to the link alternate using it. For the purposes of this
book, we assume that all point-to-point links are full-duplex.
The only other property of a link that we are interested in at this stage is a very
pragmatic one—how do you go about getting one? The answer depends on how far
the link needs to reach, how much money you have to spend, and whether or not you
know how to operate earth-moving equipment. The following is a survey of different
link types you might use to build a computer network.
Cables
If the nodes you want to connect are in the same room, in the same building, or even
on the same site (e.g., a campus), then you can buy a piece of cable and physically
string it between the nodes. Exactly what type of cable you choose to install depends
on the technology you plan to use to transmit data over the link; we’ll see several
examples later in this chapter. For now, a list of the common cable (fiber) types is
given in Table 2.1.
Of these, Category 5 (Cat-5) twisted pair—it uses a thicker gauge than the twisted
pair you find in your home—is quickly becoming the within-building norm. Because
of the difficulty and cost in pulling new cable through a building, every effort is made
to make new technologies use existing cable; Gigabit Ethernet, for example, has been
designed to run over Cat-5 wiring. Fiber is typically used to connect buildings at a site.
Leased Lines
If the two nodes you want to connect are on opposite sides of the country, or even
across town, then it is not practical to install the link yourself. Your only option is to
lease a dedicated link from the telephone company, in which case all you’ll need to be
able to do is conduct an intelligent conversation with the phone company customer
service representative. Table 2.2 gives the common services that can be leased from
the phone company. Again, more details are given throughout this chapter.
Cable                     Typical Bandwidths   Distances
Category 5 twisted pair   10–100 Mbps          100 m
Thin-net coax             10–100 Mbps          200 m
Thick-net coax            10–100 Mbps          500 m
Multimode fiber           100 Mbps             2 km
Single-mode fiber         100–2400 Mbps        40 km
Table 2.1 Common types of cables and fibers available for local links.
Service    Bandwidth
DS1        1.544 Mbps
DS3        44.736 Mbps
STS-1      51.840 Mbps
STS-3      155.520 Mbps
STS-12     622.080 Mbps
STS-48     2.488320 Gbps
STS-192    9.953280 Gbps
Table 2.2 Common bandwidths available from the carriers.
While these bandwidths appear somewhat arbitrary, there is actually some
method to the madness. DS1 and DS3 (they are also sometimes called T1 and T3,
respectively) are relatively old technologies that were originally defined for copper-based transmission media. DS1 is equal to the aggregation of 24 digital voice circuits of
64 Kbps each, and DS3 is equal to 28 DS1 links. All the STS-N links are for optical fiber
(STS stands for Synchronous Transport Signal). STS-1 is the base link speed, and each
STS-N has N times the bandwidth of STS-1. An STS-N link is also sometimes called
an OC-N link (OC stands for optical carrier). The difference between STS and OC is
subtle: The former refers to the electrical transmission on the devices connected to the
link, and the latter refers to the actual optical signal that is propagated over the fiber.
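The arithmetic behind these rates can be checked with a short sketch. The constants and function names below are illustrative. The 8-Kbps DS1 framing channel is not mentioned in the text; it is included here as an assumption that accounts for the gap between 24 × 64 Kbps and 1.544 Mbps. Likewise, the leftover DS3 bandwidth is simply computed as a difference and is presumably multiplexing overhead.

```python
VOICE_CIRCUIT_BPS = 64_000   # one digital voice circuit
DS1_FRAMING_BPS = 8_000      # assumption: framing overhead not described in the text
DS3_BPS = 44_736_000         # DS3 rate
STS1_BPS = 51_840_000        # STS-1 base rate


def ds1_bps() -> int:
    """DS1 rate: 24 voice circuits plus the assumed framing channel."""
    return 24 * VOICE_CIRCUIT_BPS + DS1_FRAMING_BPS


def ds3_overhead_bps() -> int:
    """DS3 bandwidth left over after carrying 28 DS1 links."""
    return DS3_BPS - 28 * ds1_bps()


def sts_bps(n: int) -> int:
    """STS-N rate: N times the STS-1 base rate."""
    return n * STS1_BPS


if __name__ == "__main__":
    print(f"DS1: {ds1_bps() / 1e6:.3f} Mbps")
    print(f"DS3 beyond 28 DS1s: {ds3_overhead_bps() / 1e6:.3f} Mbps")
    for n in (1, 3, 12, 48, 192):
        print(f"STS-{n}: {sts_bps(n) / 1e6:.2f} Mbps")
```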
Keep in mind that the phone company does not implement the “link” we just
ordered as a single, unbroken piece of cable or fiber. Instead, it implements the link
on its own network. Although the telephone network has historically looked much
different from the kind of network described in this book—it was built primarily to
provide a voice service and used circuit-switching technology—the current trend is
toward the style of networking described in this book, including the asynchronous
transfer mode (ATM) network described in Chapter 3. This is not surprising—the
potential market for carrying data, voice, and video is huge.
In any case, whether the link is physical or a logical connection through the
telephone network, the problem of building a computer network on top of a collection
of such links remains the same. So, we will proceed as though each link is implemented
by a single cable/fiber, and only when we are done will we worry about whether we
have just built a computer network on top of the underlying telephone network, or
the computer network we have just built could itself serve as the backbone for the
telephone network.
Last-Mile Links
If you can’t afford a dedicated leased line—they range in price from roughly a thousand
dollars a month for a cross-country DS1 link to “if you have to ask, you can’t afford
it”—then there are less expensive options available. We call these “last-mile” links
because they often span the last mile from the home to a network service provider.
These services, which are summarized in Table 2.3, typically connect a home to an
existing network. This means they are probably not suitable for use in building a complete network from scratch, but if you’ve already succeeded in building a network—and
“you” happen to be either the telephone company or the cable company—then you
can use these links to reach millions of customers.
The first option is a conventional modem over POTS (plain old telephone service). Today it is possible to buy a modem that transmits data at 56 Kbps over a
standard voice-grade line for less than a hundred dollars. The technology is already
at its bandwidth limit, however, which has led to the development of the second
option: ISDN (Integrated Services Digital Network). An ISDN connection includes two
64-Kbps channels, one that can be used to transmit data and another that can be
used for digitized voice. (A device that encodes analog voice into a digital ISDN link
is called a CODEC, for coder/decoder.) When the voice channel is not in use, it can
be combined with the data channel to support up to 128 Kbps of data bandwidth.
Service   Bandwidth
POTS      28.8–56 Kbps
ISDN      64–128 Kbps
xDSL      16 Kbps–55.2 Mbps
CATV      20–40 Mbps
Table 2.3 Common services available to connect your home.
Figure 2.3 ADSL connects the subscriber to the central office via the local loop.
For many years ISDN was viewed as the future for modest bandwidth into the home.
ISDN has now been largely overtaken, however, by two newer technologies: xDSL
(digital subscriber line) and cable modems. The former is actually a collection of
technologies that are able to transmit data at high speeds over the standard twisted
pair lines that currently come into most homes in the United States (and many other
places). The one in most widespread use today is ADSL (asymmetric digital subscriber
line). As its name implies, ADSL provides a different bandwidth from the subscriber
to the telephone company’s central office (upstream) than it does from the central
office to the subscriber (downstream). The exact bandwidth depends on the length of
the line running from the subscriber to the central office. This line is called the
local loop, as illustrated in Figure 2.3, and runs over existing copper. Downstream
bandwidths range from 1.544 Mbps (18,000 feet) to 8.448 Mbps (9000 feet), while
upstream bandwidths range from 16 Kbps to 640 Kbps.
An alternative technology that has yet to be widely deployed, very high data rate
digital subscriber line (VDSL), is symmetric, with data rates ranging from 12.96 Mbps
to 55.2 Mbps. VDSL runs over much shorter distances (1000 to 4500 feet), which
means that it will not typically reach from the home to the central office. Instead,
the telephone company would have to put VDSL transmission hardware in
neighborhoods, with some other technology (e.g., STS-N running over fiber) connecting
the neighborhood to the central office, as illustrated in Figure 2.4. This is sometimes
called “fiber to the neighborhood” (contrasting with more ambitious schemes such as
“fiber to the home” and “fiber to the curb”).
Figure 2.4 VDSL connects the subscriber to the optical network that reaches the neighborhood.
Cable modems are an alternative to the various types of DSL. As the name suggests,
this technology uses the cable TV (CATV) infrastructure, which currently reaches 95%
of the households in the United States. (Only 65% of U.S. homes actually subscribe.)
In this approach, some subset of the available CATV channels is made available for
transmitting digital data, where a single CATV channel has a bandwidth of 6 MHz.
CATV, like ADSL, is used in an asymmetric way, with downstream rates much greater
than upstream rates. The technology is currently able to achieve 40 Mbps downstream
on a single CATV channel, with 100 Mbps as the theoretical capacity. The upstream
rate is roughly half the downstream rate (i.e., 20 Mbps) due to a 1000-fold decrease
in the signal-to-noise ratio. It is also the case that fewer CATV channels are
dedicated to upstream traffic than to downstream traffic. Unlike DSL, the bandwidth
is shared among all subscribers in a neighborhood (a fact that led to some amusing
advertising from DSL providers). This means that some method for arbitrating access
to the shared medium, similar to the 802 standards described later in this chapter,
needs to be used. Finally, like DSL, it is unlikely that cable modems will be used to
connect arbitrary node A at one site to arbitrary node B at some other site. Instead,
cable modems are seen as a means to connect node A in your home to the cable
company, with the cable company then defining what the rest of the network looks
like.
Shannon’s Theorem Meets Your Modem
There has been an enormous body of work done in the related areas of signal
processing and information theory, studying everything from how signals degrade over
distance to how much data a given signal can effectively carry. The most notable
piece of work in this area is a formula known as Shannon’s theorem. Simply stated,
Shannon’s theorem gives an upper bound to the capacity of a link, in terms of bits
per second (bps), as a function of the signal-to-noise ratio of the link, measured in
decibels (dB).
Shannon’s theorem can be used to determine the data rate at which a modem can be
expected to transmit binary data over a voice-grade phone line without suffering from
too high an error rate. For example, we assume that a voice-grade phone connection
supports a frequency range of 300 Hz to 3300 Hz.
Shannon’s theorem is typically given by the following formula:
C = B log2(1 + S/N)
where C is the achievable channel capacity measured in bits per second, B is the
bandwidth of the line in hertz (3300 Hz − 300 Hz = 3000 Hz), S is the average signal
power, and N is the average noise power. The signal-to-noise ratio (S/N) is usually
expressed in decibels, related as follows:
dB = 10 × log10(S/N)
Wireless Links
The field of wireless communication is exploding, both economically and technologically. The Advanced Mobile Phone System (AMPS) has been the standard for cellular
phones in the United States for several years. AMPS, which is based on analog technology, is rapidly giving way to digital cellular–PCS (Personal Communication Services)
in the United States and Canada, and GSM (Global System for Mobile Communication) in the rest of the world. All three
systems currently use a system of towers
Assuming a typical decibel ratio
to transmit signals, although some signifiof 30 dB, this means that S/N =
cant efforts have been made to supplement
1000. Thus, we have
this infrastructure by ringing the globe with
a grid of medium- and low-orbit satellites.
C = 3000 × log2 (1001)
These projects—which include ICO, Globwhich equals approximately 30
alstar, Iridium, and Teledesic—have had
Kbps, roughly the limit of a 28.8mixed success. Those that are still viable are
Kbps modem.
mostly focusing on delivery of telephone serGiven this fundamental limit,
vice to those increasingly rare parts of the
why is it possible to buy 56-Kbps
globe where cellular service is not available.
modems at any electronics store?
Thinking a bit less globally, frequency
One reason is that such rates debands from the radio and infrared portions
pend on improved line quality, that
of the electromagnetic spectrum can be used
is, a higher signal-to-noise ratio
to provide wireless links over short disthan 30 dB. Another reason is
tances, such as inside office buildings, cofthat changes within the phone sysfee shops, building complexes, and camtem have largely eliminated analog
puses. In the case of infrared, signals with
lines that are bandwidth-limited to
wavelengths in the 850–950-nanometer
3300 Hz.
range can be used to transmit data at
1-Mbps rates over distances of about 10 m.
2.2 Encoding (NRZ, NRZI, Manchester, 4B/5B)
This technology does not require line of sight, but is limited to in-building environments. In the case of radio, several different bands are currently being made available
for data communication. For example, bands at 5.2 GHz and 17 GHz are allocated
to HIPERLAN (High Performance European Radio LAN) in Europe. Similarly, bandwidth at 2.4 GHz has been set aside in many countries for use with the IEEE 802.11
standard for wireless LANs. (Additional bandwidth is available at 5 GHz, but unfortunately it is subject to interference from microwave ovens.) IEEE 802.11, which is an
evolving standard that supports data rates of up to 54 Mbps, will be discussed more
fully in Section 2.8.
Another interesting development in the wireless arena is the Bluetooth radio
interface that operates in the 2.45-GHz frequency band. Bluetooth is designed for
short distances (on the order of 10 m) with a bandwidth of 1 Mbps. Its developers
envision it being used in all devices (e.g., printers, workstations, laptops, projectors,
PDAs, mobile phones), thereby eliminating the need for wires and cables in the office
(or between the various devices on your body, perhaps). Networks of such devices are
starting to be called piconets.
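As a quick numerical check of the Shannon's theorem sidebar in this section, the following sketch evaluates C = B log2 (1 + S/N) for the 3000-Hz voice-grade line at 30 dB. The function names `db_to_ratio` and `shannon_capacity` are illustrative choices, not names from the text.

```python
import math


def db_to_ratio(db):
    """Convert a signal-to-noise ratio in decibels to a plain power ratio."""
    return 10 ** (db / 10)


def shannon_capacity(bandwidth_hz, snr_db):
    """Upper bound on capacity in bits per second: C = B * log2(1 + S/N)."""
    return bandwidth_hz * math.log2(1 + db_to_ratio(snr_db))


if __name__ == "__main__":
    # Voice-grade line: 300 Hz to 3300 Hz, with a 30 dB signal-to-noise ratio.
    capacity = shannon_capacity(3300 - 300, 30)
    print(f"{capacity / 1000:.1f} Kbps")  # prints "29.9 Kbps"
```

Raising `snr_db` above 30 in this sketch raises the bound, which matches the sidebar's first explanation for why faster modems are possible on better lines.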
2.2 Encoding (NRZ, NRZI, Manchester, 4B/5B)
The first step in turning nodes and links into usable building blocks is to understand
how to connect them in such a way that bits can be transmitted from one node to the
other. As mentioned in the preceding section, signals propagate over physical links.
The task, therefore, is to encode the binary data that the source node wants to send
into the signals that the links are able to carry, and then to decode the signal back
into the corresponding binary data at the receiving node. We ignore the details of
modulation and assume we are working with two discrete signals: high and low. In
practice, these signals might correspond to two different voltages on a copper-based
link, or two different power levels on an optical link.
As we have said, most of the functions discussed in this chapter are performed by
a network adaptor—a piece of hardware that connects a node to a link. The network
adaptor contains a signalling component that actually encodes bits into signals at the
sending node and decodes signals into bits at the receiving node. Thus, as illustrated
in Figure 2.5, signals travel over a link between two signalling components, and bits
flow between network adaptors.
Let’s return to the problem of encoding bits onto signals. The obvious thing to
do is to map the data value 1 onto the high signal and the data value 0 onto the low
signal. This is exactly the mapping used by an encoding scheme called, cryptically
enough, non-return to zero (NRZ). For example, Figure 2.6 schematically depicts the
NRZ-encoded signal (bottom) that corresponds to the transmission of a particular
sequence of bits (top).
Figure 2.5 Signals travel between signalling components; bits flow between adaptors.
Figure 2.6 NRZ encoding of a bit stream (bits: 0 0 1 0 1 1 1 1 0 1 0 0 0 0 1 0).
The problem with NRZ is that a sequence of several consecutive 1s means that the signal stays high on the link for an extended period of time, and similarly, several consecutive 0s means that the signal stays low for a long time. There are two fundamental problems caused by long strings of 1s or 0s. The first is that it leads to a situation known as baseline wander. Specifically, the receiver keeps an average of the signal it has seen so far, and then uses this average to distinguish between low and high signals. Whenever the signal is significantly lower than this average, the receiver concludes that it has just seen a 0, and likewise, a signal that is significantly higher than the average is interpreted to be a 1. The problem, of course, is that too many consecutive 1s or 0s cause this average to change, making it more difficult to detect a significant change in the signal.
The second problem is that frequent transitions from high to low and vice versa are necessary to enable clock recovery. Intuitively, the clock recovery problem is that both the encoding and the decoding processes are driven by a clock—every clock cycle the sender transmits a bit and the receiver recovers a bit. The sender’s and the receiver’s clocks have to be precisely synchronized in order for the receiver to recover the same bits the sender transmits. If the receiver’s clock is even slightly faster or slower than the sender’s clock, then it does not correctly decode the signal. You could imagine sending the clock to the receiver over a separate wire, but this is typically avoided because it makes the cost of cabling twice as high. So instead, the receiver derives the clock from the received signal—the clock recovery process. Whenever the signal changes, such as on a transition from 1 to 0 or from 0 to 1, then the receiver knows it is at a clock cycle boundary, and it can resynchronize itself. However, a long period of time without such a transition leads to clock drift. Thus, clock recovery depends on having lots of transitions in the signal, no matter what data is being sent.
One approach that addresses this problem, called non-return to zero inverted (NRZI), has the sender make a transition from the current signal to encode a 1 and stay at the current signal to encode a 0. This solves the problem of consecutive 1s, but obviously does nothing for consecutive 0s. NRZI is illustrated in Figure 2.7. An alternative, called Manchester encoding, does a more explicit job of merging the clock with the signal by transmitting the exclusive-OR of the NRZ-encoded data and the clock. (Think of the local clock as an internal signal that alternates from low to high; a low/high pair is considered one clock cycle.) The Manchester encoding is also illustrated in Figure 2.7. Observe that the Manchester encoding results in 0 being encoded as a low-to-high transition and 1 being encoded as a high-to-low transition. Because both 0s and 1s result in a transition to the signal, the clock can be effectively recovered at the receiver. (There is also a variant of the Manchester encoding, called differential Manchester, in which a 1 is encoded with the first half of the signal equal to the last half of the previous bit’s signal and a 0 is encoded with the first half of the signal opposite to the last half of the previous bit’s signal.)
Figure 2.7 Different encoding strategies (bit stream: 0 0 1 0 1 1 1 1 0 1 0 0 0 0 1 0).
The problem with the Manchester encoding scheme is that it doubles the rate at which signal transitions are made on the link, which means that the receiver has half the time to detect each pulse of the signal. The rate at which the signal changes is called the link’s baud rate. In the case of the Manchester encoding, the bit rate is half the baud rate, so the encoding is considered only 50% efficient. Keep in mind that if the receiver had been able to keep up with the faster baud rate required by the Manchester encoding in Figure 2.7, then both NRZ and NRZI could have been able to transmit twice as many bits in the same time period.
Bit Rates and Baud Rates
Many people use the terms bit rate and baud rate interchangeably, even though as we see with the Manchester encoding, they are not the same thing. While the Manchester encoding is an example of a case in which a link’s baud rate is greater than its bit rate, it is also possible to have a bit rate that is greater than the baud rate. This would imply that more than one bit is encoded on each pulse sent over the link.
To see how this might happen, suppose you could transmit four distinguished signals over a link rather than just two. On an analog link, for example, these four signals might correspond to four different frequencies. Given four different signals, it is possible to encode two bits of information on each signal. That is, the first signal means 00, the second signal means 01, and so on. Now, a sender (receiver) that is able to transmit (detect) 1000 pulses per second would be able to send (receive) 2000 bits of information per second. That is, it would be a 1000-baud/2000-bps link.
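The three encodings described above (NRZ, NRZI, and Manchester) can be compared with a short sketch. It models the two signals as the integers 0 (low) and 1 (high), uses the bit stream from Figure 2.7, and assumes that the NRZI line starts low; that starting level and the helper `longest_run` are assumptions made for illustration.

```python
def nrz(bits):
    """NRZ: 1 maps to high (1), 0 maps to low (0); one level per bit."""
    return list(bits)


def nrzi(bits, start_level=0):
    """NRZI: toggle the current level to send a 1, hold it to send a 0."""
    level, out = start_level, []
    for bit in bits:
        if bit == 1:
            level ^= 1
        out.append(level)
    return out


def manchester(bits):
    """Manchester: XOR each NRZ bit with a clock that goes low, then high.

    Each data bit yields two half-bit levels: 0 -> (0, 1), 1 -> (1, 0).
    """
    clock = (0, 1)
    out = []
    for bit in bits:
        out.extend(bit ^ tick for tick in clock)
    return out


def longest_run(levels):
    """Length of the longest stretch of signal with no transition."""
    if not levels:
        return 0
    best = run = 1
    for prev, cur in zip(levels, levels[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


if __name__ == "__main__":
    data = [0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0]
    for name, encode in (("NRZ", nrz), ("NRZI", nrzi), ("Manchester", manchester)):
        signal = encode(data)
        print(f"{name:10s} longest run without a transition: {longest_run(signal)}")
```

The Manchester output is twice as long as the input, which is the 50% efficiency noted above, but it never holds a level for more than two half-bit periods.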
A final encoding that we consider, called 4B/5B, attempts to address the inefficiency of the Manchester encoding without suffering from the problem of having
extended durations of high or low signals. The idea of 4B/5B is to insert extra bits
into the bit stream so as to break up long sequences of 0s or 1s. Specifically, every
4 bits of actual data are encoded in a 5-bit code that is then transmitted to the
receiver; hence the name 4B/5B. The 5-bit codes are selected in such a way that each
one has no more than one leading 0 and no more than two trailing 0s. Thus, when sent
back-to-back, no pair of 5-bit codes results in more than three consecutive 0s being
transmitted. The resulting 5-bit codes are then transmitted using the NRZI encoding,
which explains why the code is only concerned about consecutive 0s—NRZI already
solves the problem of consecutive 1s. Note that the 4B/5B encoding results in 80% efficiency.
Table 2.4 gives the 5-bit codes that correspond to each of the 16 possible 4-bit
data symbols. Notice that since 5 bits are enough to encode 32 different codes, and
we are using only 16 of these for data, there are 16 codes left over that we can use
for other purposes. Of these, code 11111 is used when the line is idle, code 00000
corresponds to when the line is dead, and 00100 is interpreted to mean halt. Of the
remaining 13 codes, 7 of them are not valid because they violate the “one leading 0,
two trailing 0s” rule, and the other 6 represent various control symbols. As we will
see later in this chapter, some framing protocols (e.g., FDDI) make use of these control symbols.
Table 2.4 4B/5B encoding (columns: 4-Bit Data Symbol, 5-Bit Code).
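A minimal sketch of the 4B/5B substitution step follows. The mapping in `FOUR_TO_FIVE` is the data-symbol table as it is commonly published for 4B/5B; it is supplied here as an assumption rather than copied from Table 2.4. The NRZI step that follows 4B/5B on the wire is not modeled.

```python
# Commonly published 4B/5B data-symbol mapping (an assumption, see lead-in).
FOUR_TO_FIVE = {
    "0000": "11110", "0001": "01001", "0010": "10100", "0011": "10101",
    "0100": "01010", "0101": "01011", "0110": "01110", "0111": "01111",
    "1000": "10010", "1001": "10011", "1010": "10110", "1011": "10111",
    "1100": "11010", "1101": "11011", "1110": "11100", "1111": "11101",
}


def encode_4b5b(bits):
    """Replace every 4-bit group of a bit string with its 5-bit code."""
    if len(bits) % 4 != 0:
        raise ValueError("input length must be a multiple of 4 bits")
    return "".join(FOUR_TO_FIVE[bits[i:i + 4]] for i in range(0, len(bits), 4))


def longest_zero_run(bits):
    """Length of the longest run of consecutive 0s in a bit string."""
    return max(len(run) for run in bits.split("1"))


if __name__ == "__main__":
    data = "0000" * 4                      # sixteen 0s in a row before encoding
    line = encode_4b5b(data)
    print(line, "longest run of 0s:", longest_zero_run(line))
```

Every code in this table has at most one leading 0 and at most two trailing 0s, so any two codes sent back-to-back contain at most three consecutive 0s, as the text describes.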
2.3 Framing
Now that we have seen how to transmit a sequence of bits over a point-to-point link—
from adaptor to adaptor—let’s consider the scenario illustrated in Figure 2.8. Recall
from Chapter 1 that we are focusing on packet-switched networks, which means that
blocks of data (called frames at this level), not bit streams, are exchanged between
nodes. It is the network adaptor that enables the nodes to exchange frames. When
node A wishes to transmit a frame to node B, it tells its adaptor to transmit a frame
from the node’s memory. This results in a sequence of bits being sent over the link.
The adaptor on node B then collects together the sequence of bits arriving on the link
and deposits the corresponding frame in B’s memory. Recognizing exactly what set of
bits constitutes a frame—that is, determining where the frame begins and ends—is the
central challenge faced by the adaptor.
Figure 2.8 Bits flow between adaptors, frames between hosts (Node A, Node B).
Figure 2.9 BISYNC frame format.
There are several ways to address the framing problem. This section uses several
different protocols to illustrate the various points in the design space. Note that while
we discuss framing in the context of point-to-point links, the problem is a fundamental
one that must also be addressed in multiple-access networks like Ethernet and token ring.
2.3.1 Byte-Oriented Protocols (BISYNC, PPP, DDCMP)
One of the oldest approaches to framing—it has its roots in connecting terminals to
mainframes—is to view each frame as a collection of bytes (characters) rather than
a collection of bits. Such a byte-oriented approach is exemplified by the BISYNC
(Binary Synchronous Communication) protocol developed by IBM in the late 1960s,
and the DDCMP (Digital Data Communication Message Protocol) used in Digital
Equipment Corporation’s DECNET. Sometimes these protocols assume a particular
character set—for example, BISYNC can support ASCII, EBCDIC, and IBM’s 6-bit
Transcode—but this is not necessarily the case.
Although similar in many respects, these two protocols are examples of two
different framing techniques, the sentinel approach and the byte-counting approach.
Sentinel Approach
The BISYNC protocol illustrates the sentinel approach to framing; its frame format is
depicted in Figure 2.9. This figure is the first of many that you will see in this book
that are used to illustrate frame or packet formats, so a few words of explanation
are in order. We show a packet as a sequence of labeled fields. Above each field is a
number indicating the length of that field in bits. Note that the packets are transmitted
beginning with the leftmost field.
The beginning of a frame is denoted by sending a special SYN (synchronization)
character. The data portion of the frame is then contained between special sentinel
characters: STX (start of text) and ETX (end of text). The SOH (start of header)
field serves much the same purpose as the STX field. The problem with the sentinel
approach, of course, is that the ETX character might appear in the data portion of the
frame. BISYNC overcomes this problem by “escaping” the ETX character by preceding
it with a DLE (data-link-escape) character whenever it appears in the body of a frame;
the DLE character is also escaped (by preceding it with an extra DLE) in the frame
body. (C programmers may notice that this is analogous to the way a quotation mark
is escaped by the backslash when it occurs inside a string.) This approach is often
called character stuffing because extra characters are inserted in the data portion of
the frame.
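Character stuffing as described above can be sketched in a few lines. The frame layout used here (two SYN characters, STX, the stuffed body, ETX) is a simplification assumed for illustration; it omits the header and CRC fields of a real BISYNC frame. The constants are the ASCII control codes, and the function names are hypothetical.

```python
SYN, STX, ETX, DLE = 0x16, 0x02, 0x03, 0x10  # ASCII control codes


def stuff(body):
    """Precede every ETX or DLE byte in the body with an extra DLE."""
    out = bytearray()
    for byte in body:
        if byte in (ETX, DLE):
            out.append(DLE)
        out.append(byte)
    return bytes(out)


def build_frame(body):
    """Simplified frame: SYN SYN STX <stuffed body> ETX (no header, no CRC)."""
    return bytes([SYN, SYN, STX]) + stuff(body) + bytes([ETX])


def extract_body(frame):
    """Return the body that lies between STX and the first unescaped ETX."""
    i = frame.index(STX) + 1
    out = bytearray()
    while i < len(frame):
        byte = frame[i]
        if byte == ETX:
            return bytes(out)
        if byte == DLE:
            i += 1                      # the byte after DLE is taken literally
            if i == len(frame):
                raise ValueError("frame ends with a dangling DLE")
            byte = frame[i]
        out.append(byte)
        i += 1
    raise ValueError("no ETX found")


if __name__ == "__main__":
    body = bytes([0x41, ETX, 0x42, DLE, 0x43])
    frame = build_frame(body)
    print(frame.hex(" "))
    print(extract_body(frame) == body)
```

Because each escaped byte costs one extra byte on the link, the frame length depends on the data being sent.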
The frame format also includes a field labeled CRC (cyclic redundancy check)
that is used to detect transmission errors; various algorithms for error detection are
presented in Section 2.4. Finally, the frame contains additional header fields that
are used for, among other things, the link-level reliable delivery algorithm. Examples
of these algorithms are given in Section 2.5.
The more recent Point-to-Point Protocol (PPP), which is commonly run over dialup modem links, is similar to BISYNC in that it uses character stuffing. The format for
a PPP frame is given in Figure 2.10. The special start-of-text character, denoted as the
Flag field in Figure 2.10, is 01111110. The Address and Control fields usually contain
default values, and so are uninteresting. The Protocol field is used for demultiplexing:
It identifies the high-level protocol such as IP or IPX (an IP-like protocol developed
by Novell). The frame payload size can be negotiated, but it is 1500 bytes by default.
The Checksum field is either 2 (by default) or 4 bytes long.
The PPP frame format is unusual in that several of the field sizes are negotiated
rather than fixed. This negotiation is conducted by a protocol called LCP (Link Control
Protocol). PPP and LCP work in tandem: LCP sends control messages encapsulated
in PPP frames—such messages are denoted by an LCP identifier in the PPP Protocol field—and then turns around and changes PPP’s frame format based on the information contained in those control messages. LCP is also involved in establishing a link between two peers when both sides detect the carrier signal.
Figure 2.10 PPP frame format.
Figure 2.11 DDCMP frame format.
Byte-Counting Approach
As every Computer Sciences 101 student
knows, the alternative to detecting the end
of a file with a sentinel value is to include the
number of items in the file at the beginning
of the file. The same is true in framing—the
number of bytes contained in a frame can
be included as a field in the frame header.
DECNET’s DDCMP protocol uses this
approach, as illustrated in Figure 2.11. In
this example, the COUNT field specifies how
many bytes are contained in the frame’s body.
One danger with this approach is
that a transmission error could corrupt the
COUNT field, in which case the end of the
frame would not be correctly detected. (A
similar problem exists with the sentinel-based approach if the ETX field becomes
corrupted.) Should this happen, the receiver
will accumulate as many bytes as the bad
COUNT field indicates and then use the error
detection field to determine that the frame
is bad. This is sometimes called a framing
error. The receiver will then wait until it sees
the next SYN character to start collecting
the bytes that make up the next frame. It is therefore possible that a framing error will cause back-to-back frames to be incorrectly received.
What’s in a Layer?
One of the important contributions of the OSI reference model presented in Chapter 1 was to provide some vocabulary for talking about protocols and, in particular, protocol layers. This vocabulary has provided fuel for plenty of arguments along the lines of “Your protocol does function X at layer Y, and the OSI reference model says it should be done at layer Z—that’s a layer violation.” In fact, figuring out the right layer at which to perform a given function can be very difficult, and the reasoning is usually a lot more subtle than “What does the OSI model say?” It is partly for this reason that this book avoids a rigidly layerist approach. Instead, it shows you a lot of functions that need to be performed by protocols and looks at some ways that they have been successfully implemented.
In spite of our nonlayerist approach, sometimes we need convenient ways to talk about classes of protocols, and the name of the layer at which they operate is often the best choice. Thus, for example, this chapter focuses primarily on link-layer protocols. (Bit encoding, described in Section 2.2, is the exception, being considered a physical-layer function.) Link-layer protocols can be identified by the fact that they run over single links—the type of network discussed in this chapter. Network-layer protocols, by contrast, run over switched networks that contain lots of links interconnected by switches or routers. Topics related to network-layer protocols are discussed in Chapters 3 and 4.
Note that protocol layers are supposed to be helpful—they provide helpful ways to talk about classes of protocols, and they help us divide the problem of building networks into manageable subtasks. However, they are not meant to be overly restrictive—the mere fact that something is a layer violation does not end the argument about whether it is a worthwhile thing to do. In other words, layering makes a good slave, but a poor master. A particularly interesting argument about the best layer to place a certain function comes up when we look at congestion control in Chapter 6.
2.3.2 Bit-Oriented Protocols
Unlike these byte-oriented protocols, a bit-oriented protocol is not concerned with byte boundaries—it simply views the frame as a collection of bits. These bits might come from some character set, such as ASCII, they might be pixel values in an image, or they could be instructions and operands from an executable file. The Synchronous Data Link Control (SDLC) protocol developed by IBM is an example of a bit-oriented protocol; SDLC was later standardized by the ISO as the High-Level Data Link Control (HDLC) protocol. In the following discussion, we use HDLC as an example; its frame format is given in Figure 2.12.
Figure 2.12 HDLC frame format (field labels: Beginning sequence, Header, CRC).
HDLC denotes both the beginning and the end of a frame with the distinguished bit sequence 01111110. This sequence is also transmitted during any times that the link is idle so that the sender and receiver can keep their clocks synchronized. In this way, both protocols essentially use the sentinel approach. Because this sequence might appear anywhere in the body of the frame—in fact, the bits 01111110 might cross byte boundaries—bit-oriented protocols use the analog of the DLE character, a technique known as bit stuffing.
Bit stuffing in the HDLC protocol works as follows. On the sending side, any time five consecutive 1s have been transmitted from the body of the message (i.e., excluding when the sender is trying to transmit the distinguished 01111110 sequence), the sender inserts a 0 before transmitting the next bit. On the receiving side, should five consecutive 1s arrive, the receiver makes its decision based on the next bit it sees (i.e., the bit following the five 1s). If the next bit is a 0, it must have been stuffed, and so the receiver removes it. If the next bit is a 1, then one of two things is true: Either this is the end-of-frame marker or an error has been introduced into the bit stream. By looking at the next bit, the receiver can distinguish between these two cases: If it sees a 0 (i.e., the last eight bits it has looked at are 01111110), then it is the end-of-frame marker; if it sees a 1 (i.e., the last eight bits it has looked at are 01111111), then there must have been an error and the whole frame is discarded. In the latter case, the receiver has to wait for the next 01111110 before it can start receiving again, and as a consequence, there is the potential that the receiver will fail to receive two consecutive frames. Obviously, there are still ways that framing errors can go undetected, such as when an entire spurious end-of-frame pattern is generated by errors, but these failures are relatively unlikely. Robust ways of detecting errors are discussed in Section 2.4.
An interesting characteristic of bit stuffing, as well as character stuffing, is that the size of a frame is dependent on the data that is being sent in the payload of the frame. It is in fact not possible to make all frames exactly the same size, given that the data that might be carried in any frame is arbitrary. (To convince yourself of this, consider what happens if the last byte of a frame’s body is the ETX character.) A form of framing that ensures that all frames are the same size is described in the next subsection.
2.3.3 Clock-Based Framing (SONET)
A third approach to framing is exemplified by the Synchronous Optical Network (SONET) standard. For lack of a widely accepted generic term, we refer to this approach simply as clock-based framing.
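Before turning to SONET, the HDLC bit-stuffing rule from the preceding subsection can be made concrete with a small sketch, since it is the data-dependent behavior that clock-based framing avoids. Bits are modeled as characters in a string, and `unstuff_until_flag` is a hypothetical receiver that is handed the bits following an opening flag; neither the names nor the string representation come from the text.

```python
FLAG = "01111110"


def bit_stuff(body):
    """Sender side: insert a 0 after every run of five consecutive 1s."""
    out, ones = [], 0
    for bit in body:
        out.append(bit)
        ones = ones + 1 if bit == "1" else 0
        if ones == 5:
            out.append("0")   # stuffed bit
            ones = 0
    return "".join(out)


def unstuff_until_flag(stream):
    """Receiver side: return the body that precedes the closing flag.

    After five 1s, a 0 is a stuffed bit and is dropped. A sixth 1 followed
    by 0 is the end-of-frame flag; a sixth 1 followed by 1 is an error.
    """
    out, ones, i = [], 0, 0
    while i < len(stream):
        bit = stream[i]
        if ones == 5:
            if bit == "0":
                ones = 0      # drop the stuffed 0
                i += 1
                continue
            if i + 1 < len(stream) and stream[i + 1] == "0":
                # out currently ends with the flag's leading 0 and five 1s
                return "".join(out[:-6])
            raise ValueError("framing error: frame discarded")
        out.append(bit)
        ones = ones + 1 if bit == "1" else 0
        i += 1
    raise ValueError("no closing flag seen")


if __name__ == "__main__":
    body = "0111111011111"
    stuffed = bit_stuff(body)
    print(stuffed)
    print(unstuff_until_flag(stuffed + FLAG) == body)
```

The stuffed body never contains six consecutive 1s, so the flag pattern cannot appear inside it, at the cost of a frame length that varies with the payload.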
SONET was first proposed by Bell Communications Research (Bellcore), and then
developed under the American National Standards Institute (ANSI) for digital transmission over optical fiber; it has since been adopted by the ITU-T. Who standardized
what and when is not the interesting issue, though. The thing to remember about
SONET is that it is the dominant standard for long-distance transmission of data over
optical networks.
An important point to make about SONET before we go any further is that the
full specification is substantially larger than this book. Thus, the following discussion
will necessarily cover only the high points of the standard. Also, SONET addresses
both the framing problem and the encoding problem. It also addresses a problem
that is very important for phone companies—the multiplexing of several low-speed
links onto one high-speed link. We begin with framing and discuss the other issues afterward.
As with the previously discussed framing schemes, a SONET frame has some
special information that tells the receiver where the frame starts and ends. However,
that is about as far as the similarities go. Notably, no bit stuffing is used, so that a
frame’s length does not depend on the data being sent. So the question to ask is, How
does the receiver know where each frame starts and ends? We consider this question
for the lowest-speed SONET link, which is known as STS-1 and runs at 51.84 Mbps.
An STS-1 frame is shown in Figure 2.13. It is arranged as nine rows of 90 bytes each,
and the first 3 bytes of each row are overhead, with the rest being available for data
that is being transmitted over the link. The first 2 bytes of the frame contain a special
bit pattern, and it is these bytes that enable the receiver to determine where the frame
starts. However, since bit stuffing is not used, there is no reason why this pattern will
not occasionally turn up in the payload portion of the frame. To guard against this,
the receiver looks for the special bit pattern consistently, hoping to see it appearing once every 810 bytes, since each frame is 9 × 90 = 810 bytes long. When the special pattern turns up in the right place enough times, the receiver concludes that it is in sync and can then interpret the frame correctly.
Figure 2.13 A SONET STS-1 frame (9 rows, 90 columns).
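The synchronization idea just described, looking for the pattern at the same position in successive 810-byte frames, might be sketched as follows. The two pattern bytes `0xF6 0x28` and the threshold of three consecutive matches are assumptions for illustration; the text says only that a special 2-byte pattern must appear in the right place "enough times."

```python
FRAME_LEN = 9 * 90                    # bytes in an STS-1 frame
PATTERN = bytes([0xF6, 0x28])         # assumed 2-byte framing pattern


def find_frame_offset(stream, required=3):
    """Return the first offset at which PATTERN recurs every FRAME_LEN bytes
    for `required` consecutive frames, or None if no such offset exists."""
    for offset in range(min(FRAME_LEN, len(stream))):
        positions = range(offset, offset + required * FRAME_LEN, FRAME_LEN)
        if all(stream[p:p + 2] == PATTERN for p in positions):
            return offset
    return None


if __name__ == "__main__":
    frame = PATTERN + bytes(FRAME_LEN - 2)          # pattern, then zero bytes
    junk = bytearray(b"\x55" * 100)
    junk[10:12] = PATTERN                           # a decoy that does not repeat
    stream = bytes(junk) + frame * 4
    print(find_frame_offset(stream))                # prints 100
```

A single chance occurrence of the pattern in the payload, like the decoy above, is rejected because it does not recur 810 bytes later.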
One of the things we are not describing due to the complexity of SONET is the
detailed use of all the other overhead bytes. Part of this complexity can be attributed
to the fact that SONET runs across the carrier’s optical network, not just over a single
link. (Recall that we are glossing over the fact that the carriers implement a network,
and we are instead focusing on the fact that we can lease a SONET link from them and
then use this link to build our own packet-switched network.) Additional complexity
comes from the fact that SONET provides a considerably richer set of services than
just data transfer. For example, 64 Kbps of a SONET link’s capacity is set aside for a
voice channel that is used for maintenance.
The overhead bytes of a SONET frame are encoded using NRZ, the simple
encoding described in the previous section where 1s are high and 0s are low. However,
to ensure that there are plenty of transitions to allow the receiver to recover the sender’s
clock, the payload bytes are scrambled. This is done by calculating the exclusive-OR
(XOR) of the data to be transmitted and a well-known bit pattern. The bit
pattern, which is 127 bits long, has plenty of transitions from 1 to 0, so that XORing
it with the transmitted data is likely to yield a signal with enough transitions to enable
clock recovery.
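To illustrate how XORing with a fixed 127-bit pattern introduces transitions, the sketch below generates such a pattern with a 7-bit shift register and applies it to a run of 0s. The tap positions and the all-ones starting state are assumptions chosen to give a sequence that repeats every 127 bits; the sketch is not claimed to reproduce the SONET scrambler bit for bit, and it restarts the pattern at the beginning of the data for simplicity.

```python
def scrambler_bits(n, seed=(1, 1, 1, 1, 1, 1, 1)):
    """Generate n bits from a 7-bit shift register (assumed taps: stages 6 and 7)."""
    reg = list(seed)
    out = []
    for _ in range(n):
        out.append(reg[6])                 # output is the last stage
        feedback = reg[5] ^ reg[6]         # XOR of the last two stages
        reg = [feedback] + reg[:6]         # shift, feeding back into stage 1
    return out


def scramble(bits):
    """XOR the data with the pattern; applying it twice restores the data."""
    pattern = scrambler_bits(len(bits))
    return [b ^ p for b, p in zip(bits, pattern)]


def transitions(bits):
    """Count positions where the signal level changes."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)


if __name__ == "__main__":
    data = [0] * 64                        # worst case for NRZ: no transitions
    line = scramble(data)
    print("transitions before:", transitions(data), "after:", transitions(line))
    print("round trip ok:", scramble(line) == data)
```

Since XOR is its own inverse, the receiver recovers the payload by applying the same pattern again.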
SONET supports the multiplexing of multiple low-speed links in the following
way. A given SONET link runs at one of a finite set of possible rates, ranging from
51.84 Mbps (STS-1) to 2488.32 Mbps (STS-48) and beyond. (See Table 2.2 in Section
2.1 for the full set of SONET data rates.) Note that all of these rates are integer
multiples of STS-1. The significance for framing is that a single SONET frame can
contain subframes for multiple lower-rate channels. A second related feature is that
each frame is 125 μs long. This means that at STS-1 rates, a SONET frame is 810 bytes
long, while at STS-3 rates, each SONET frame is 2430 bytes long. Notice the synergy
between these two features: 3 × 810 = 2430, meaning that three STS-1 frames fit
exactly in a single STS-3 frame.
Intuitively, the STS-N frame can be thought of as consisting of N STS-1 frames,
where the bytes from these frames are interleaved; that is, a byte from the first frame is
transmitted, then a byte from the second frame is transmitted, and so on. The reason for
interleaving the bytes from each STS-N frame is to ensure that the bytes in each STS-1
frame are evenly paced; that is, bytes show up at the receiver at a smooth 51 Mbps,
rather than all bunched up during one particular 1/Nth of the 125-μs interval.
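The arithmetic behind these rates, and the byte interleaving just described, can be checked with a short sketch. It treats each STS-1 frame as an opaque 810-byte string and ignores how overhead bytes are actually handled during multiplexing, which is a simplifying assumption; the function names are illustrative.

```python
STS1_FRAME_BYTES = 9 * 90          # 810 bytes per STS-1 frame
FRAMES_PER_SECOND = 8000           # one frame every 125 microseconds


def sts_rate_bps(n):
    """Line rate of STS-N in bits per second."""
    return n * STS1_FRAME_BYTES * 8 * FRAMES_PER_SECOND


def interleave(frames):
    """Byte-interleave N STS-1 frames into one STS-N frame:
    a byte from frame 0, a byte from frame 1, and so on, then repeat."""
    n = len(frames)
    if any(len(f) != STS1_FRAME_BYTES for f in frames):
        raise ValueError("every input frame must be 810 bytes long")
    return bytes(frames[i % n][i // n] for i in range(n * STS1_FRAME_BYTES))


if __name__ == "__main__":
    for n in (1, 3, 48):
        print(f"STS-{n}: {sts_rate_bps(n) / 1e6:.2f} Mbps")
    sts3 = interleave([b"A" * 810, b"B" * 810, b"C" * 810])
    print(len(sts3), sts3[:6])
```

Because every frame lasts 125 μs regardless of N, each constituent STS-1 contributes one byte in every group of N bytes, which is what keeps its bytes evenly paced.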
Although it is accurate to view an STS-N signal as being used to multiplex N STS-1 frames, the payload from these STS-1 frames can be linked together to form a larger STS-N payload; such a link is denoted STS-Nc (for concatenated). One of the fields in the overhead is used for this purpose. Figure 2.14 schematically depicts concatenation in the case of three STS-1 frames being concatenated into a single STS-3c frame. The significance of a SONET link being designated as STS-3c rather than STS-3 is that, in the former case, the user of the link can view it as a single 155.52-Mbps pipe, whereas an STS-3 should really be viewed as three 51.84-Mbps links that happen to share a fiber.
Figure 2.14 Three STS-1 frames multiplexed onto one STS-3c frame.
Figure 2.15 SONET frames out of phase (labels: 87 columns, 9 rows, Frame 0, Frame 1).
Finally, the preceding description of SONET is overly simplistic in that it
assumes that the payload for each frame is completely contained within the frame.
(Why wouldn’t it be?) In fact, we should view the STS-1 frame just described as simply
a placeholder for the frame, where the actual payload may float across frame boundaries. This situation is illustrated in Figure 2.15. Here we see both the STS-1 payload
floating across two STS-1 frames, and the payload shifted some number of bytes to
the right and, therefore, wrapped around. One of the fields in the frame overhead
points to the beginning of the payload. The value of this capability is that it simplifies
the task of synchronizing the clocks used throughout the carriers’ networks, which is
something that carriers spend a lot of their time worrying about.
2.4 Error Detection
As discussed in Chapter 1, bit errors are sometimes introduced into frames. This
happens, for example, because of electrical interference or thermal noise. Although
errors are rare, especially on optical links, some mechanism is needed to detect these
errors so that corrective action can be taken. Otherwise, the end user is left wondering
why the C program that successfully compiled just a moment ago now suddenly has
a syntax error in it, when all that happened in the interim is that it was copied across
a network file system.
There is a long history of techniques for dealing with bit errors in computer
systems, dating back to Hamming and Reed-Solomon codes that were developed
for use when storing data on magnetic disks and in early core memories. This section describes some of the error detection techniques most commonly used in networking.
Detecting errors is only one part of the problem. The other part is correcting
errors once detected. There are two basic approaches that can be taken when the
recipient of a message detects an error. One is to notify the sender that the message
was corrupted so that the sender can retransmit a copy of the message. If bit errors
are rare, then in all probability the retransmitted copy will be error-free. Alternatively,
there are some types of error detection algorithms that allow the recipient to reconstruct the correct message even after it has been corrupted; such algorithms rely on
error-correcting codes, discussed below.
One of the most common techniques for detecting transmission errors is a technique known as the cyclic redundancy check (CRC). It is used in nearly all the link-level
protocols discussed in the previous section—for example, HDLC, DDCMP—as well
as in the CSMA and token ring protocols described later in this chapter. Section 2.4.3
outlines the basic CRC algorithm. Before discussing that approach, we consider two
simpler schemes that are also widely used: two-dimensional parity and checksums.
The former is used by the BISYNC protocol when it is transmitting ASCII characters
(CRC is used as the error code when BISYNC is used to transmit EBCDIC), and the
latter is used by several Internet protocols.
The basic idea behind any error detection scheme is to add redundant information
to a frame that can be used to determine if errors have been introduced. In the extreme,
we could imagine transmitting two complete copies of the data. If the two copies are
identical at the receiver, then it is probably the case that both are correct. If they
differ, then an error was introduced into one (or both) of them, and they must be
discarded. This is a rather poor error detection scheme for two reasons. First, it sends
n redundant bits for an n-bit message. Second, many errors will go undetected—any
error that happens to corrupt the same bit positions in the first and second copies of
the message.
Fortunately, we can do a lot better than this simple scheme. In general, we can
provide quite strong error detection capability while sending only k redundant bits for
an n-bit message, where k ≪ n. On an Ethernet, for example, a frame carrying up to
12,000 bits (1500 bytes) of data requires only a 32-bit CRC code, or as it is commonly
expressed, uses CRC-32. Such a code will catch the overwhelming majority of errors,
as we will see below.
We say that the extra bits we send are redundant because they add no new
information to the message. Instead, they are derived directly from the original message using some well-defined algorithm. Both the sender and the receiver know exactly
what that algorithm is. The sender applies the algorithm to the message to generate the
redundant bits. It then transmits both the message and those few extra bits. When the
receiver applies the same algorithm to the received message, it should (in the absence of
errors) come up with the same result as the sender. It compares the result with the one
sent to it by the sender. If they match, it can conclude (with high likelihood) that no
errors were introduced in the message during transmission. If they do not match, it can
be sure that either the message or the redundant bits were corrupted, and it must take
appropriate action, that is, discarding the message, or correcting it if that is possible.
One note on the terminology for these extra bits. In general, they are referred
to as error-detecting codes. In specific cases, when the algorithm to create the code
is based on addition, they may be called a checksum. We will see that the Internet
checksum is appropriately named: It is an error check that uses a summing algorithm.
Unfortunately, the word “checksum” is often used imprecisely to mean any form of
error-detecting code, including CRCs. This can be confusing, so we urge you to use
the word “checksum” only to apply to codes that actually do use addition and to use
“error-detecting code” to refer to the general class of codes described in this section.
Two-Dimensional Parity
Two-dimensional parity is exactly what the name suggests. It is based on “simple”
(one-dimensional) parity, which usually involves adding one extra bit to a 7-bit code
to balance the number of 1s in the byte. For example, odd parity sets the eighth bit to
1 if needed to give an odd number of 1s in the byte, and even parity sets the eighth bit
to 1 if needed to give an even number of 1s in the byte. Two-dimensional parity does
a similar calculation for each bit position across each of the bytes contained in the
frame. This results in an extra parity byte for the entire frame, in addition to a parity
bit for each byte. Figure 2.16 illustrates how two-dimensional even parity works for
an example frame containing 6 bytes of data. Notice that the third bit of the parity
byte is 1 since there are an odd number of 1s in the third bit across the 6 bytes in the
frame. It can be shown that two-dimensional parity catches all 1-, 2-, and 3-bit errors,
and most 4-bit errors. In this case, we have added 14 bits of redundant information
to a 42-bit message, and yet we have stronger protection against common errors than
the “repetition code” described above.
Figure 2.16 Two-dimensional parity.
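The parity calculation described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function names parity7 and parity2d are hypothetical, each data byte is assumed to carry a 7-bit code in its low bits, and even parity is used throughout.

```c
#include <stddef.h>
#include <stdint.h>

/* Even parity of the low 7 bits of v: returns 1 when the number of 1s is
 * odd, so that appending the returned bit makes the total even. */
unsigned parity7(uint8_t v)
{
    unsigned p = 0;
    for (int i = 0; i < 7; i++)
        p ^= (v >> i) & 1u;
    return p;
}

/* Hypothetical helper: two-dimensional even parity over n 7-bit codes.
 * row_parity[i] receives the parity bit for data[i]; the return value is
 * the 7-bit parity byte, holding one parity bit per bit position. */
uint8_t parity2d(const uint8_t *data, size_t n, uint8_t *row_parity)
{
    uint8_t column = 0;
    for (size_t i = 0; i < n; i++) {
        row_parity[i] = (uint8_t)parity7(data[i]);
        column ^= data[i] & 0x7F;   /* XOR accumulates even parity per column */
    }
    return column;
}
```

For the six sample codes 0101001, 1101001, 1011110, 0001110, 0110100, and 1011111, this sketch yields the per-byte parity bits 1, 0, 1, 1, 1, 0 and the parity byte 1111011. Six per-byte bits, seven parity-byte bits, and one parity bit for the parity byte itself account for the 14 redundant bits.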
Internet Checksum Algorithm
A second approach to error detection is exemplified by the Internet checksum.
Although it is not used at the link level, it nevertheless provides the same sort of
functionality as CRCs and parity, so we discuss it here. We will see examples of its use
in Sections 4.1, 5.1, and 5.2.
The idea behind the Internet checksum is very simple—you add up all the words
that are transmitted and then transmit the result of that sum. The result is called
the checksum. The receiver performs the same calculation on the received data and
compares the result with the received checksum. If any transmitted data, including the
checksum itself, is corrupted, then the results will not match, so the receiver knows
that an error occurred.
You can imagine many different variations on the basic idea of a checksum. The
exact scheme used by the Internet protocols works as follows. Consider the data being
checksummed as a sequence of 16-bit integers. Add them together using 16-bit ones
complement arithmetic (explained below) and then take the ones complement of the
result. That 16-bit number is the checksum.
In ones complement arithmetic, a negative integer −x is represented as the
complement of x; that is, each bit of x is inverted. When adding numbers in ones
complement arithmetic, a carryout from the most significant bit needs to be added
to the result. Consider, for example, the addition of −5 and −3 in ones complement
arithmetic on 4-bit integers. +5 is 0101, so −5 is 1010; +3 is 0011, so −3 is 1100.
If we add 1010 and 1100 ignoring the carry, we get 0110. In ones complement arithmetic, the fact that this operation caused a carry from the most significant bit causes
us to increment the result, giving 0111, which is the ones complement representation
of −8 (obtained by inverting the bits in 1000), as we would expect.
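The 4-bit example above can be checked with a short sketch. The function name ones_add4 is an assumption made for illustration; it adds two 4-bit values and folds a carry out of the most significant bit back into the result.

```c
#include <stdint.h>

/* Adds two 4-bit ones complement values. A carry out of bit 3 is added
 * back into the low four bits (the "end-around carry"). */
uint8_t ones_add4(uint8_t a, uint8_t b)
{
    unsigned sum = (a & 0xFu) + (b & 0xFu);
    if (sum & 0x10u)                 /* carry out of the most significant bit */
        sum = (sum & 0xFu) + 1u;     /* increment the truncated result */
    return (uint8_t)(sum & 0xFu);
}
```

Calling ones_add4(0xA, 0xC), that is, 1010 plus 1100, returns 0x7 (0111), the ones complement representation of −8.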
The following routine gives a straightforward implementation of the Internet’s
checksum algorithm. The count argument gives the length of buf measured in 16-bit
units. The routine assumes that buf has already been padded with 0s to a 16-bit
boundary.
    #include <sys/types.h>   /* u_short, u_long */
    u_short
    cksum(u_short *buf, int count)
    {
        u_long sum = 0;
        while (count--) {
            sum += *buf++;
            if (sum & 0xFFFF0000) {
                /* carry occurred, so wrap around */
                sum &= 0xFFFF;
                sum++;
            }
        }
        return (u_short)~(sum & 0xFFFF);
    }
This code ensures that the calculation uses ones complement arithmetic, rather
than the twos complement that is used in most machines. Note the if statement inside
the while loop. If there is a carry into the top 16 bits of sum, then we increment sum
just as in the previous example.
Compared to our repetition code, this algorithm scores well for using a small
number of redundant bits—only 16 for a message of any length—but it does not score
extremely well for strength of error detection. For example, a pair of single-bit errors,
one of which increments a word, one of which decrements another word by the same
amount, will go undetected. The reason for using an algorithm like this in spite of its
relatively weak protection against errors (compared to a CRC, for example) is simple:
This algorithm is much easier to implement in software. Experience in the ARPANET
suggested that a checksum of this form was adequate. One reason it is adequate is that
this checksum is the last line of defense in an end-to-end protocol; the majority of errors
are picked up by stronger error detection algorithms, such as CRCs, at the link level.
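Both the receiver's check and the weakness just described can be illustrated with a small sketch. It restates the checksum using fixed-width types so that it stands alone; the names inet_cksum and cksum_matches are assumptions for this example, not part of any protocol's API.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum over count 16-bit words: ones complement sum,
 * then the ones complement of that sum. */
uint16_t inet_cksum(const uint16_t *buf, size_t count)
{
    uint32_t sum = 0;
    while (count--) {
        sum += *buf++;
        if (sum & 0xFFFF0000u) {    /* carry out of the low 16 bits */
            sum &= 0xFFFFu;
            sum++;                  /* end-around carry */
        }
    }
    return (uint16_t)~sum;
}

/* Receiver side: recompute the checksum over the received words and compare
 * it with the checksum that arrived with them. Returns 1 on a match. */
int cksum_matches(const uint16_t *buf, size_t count, uint16_t received)
{
    return inet_cksum(buf, count) == received;
}
```

With the two words 0x0004 and 0x0003, the checksum is 0xFFF8. If only the first word is corrupted to 0x0005, the check fails as intended. If the low bit of both words is flipped, giving 0x0005 and 0x0002, one word has gone up by 1 and the other down by 1, the sum is unchanged, and cksum_matches still reports a match.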
Cyclic Redundancy Check
It should be clear by now that a major goal
in designing error detection algorithms is to
maximize the probability of detecting errors
using only a small number of redundant
bits. Cyclic redundancy checks use some
fairly powerful mathematics to achieve
this goal. For example, a 32-bit CRC gives
strong protection against common bit
errors in messages that are thousands of
bytes long. The theoretical foundation of
the cyclic redundancy check is rooted in a
branch of mathematics called finite fields.
While this may sound daunting, the basic
ideas can be easily understood.
To start, think of an (n + 1)-bit message as being represented by a polynomial
of degree n, that is, a polynomial whose highest-order term is $x^n$. The message is
represented by a polynomial by using the value of each bit in the message as the
coefficient for each term in the polynomial, starting with the most significant bit to
represent the highest-order term. For example, an 8-bit message consisting of the bits
10011010 corresponds to the polynomial
$$M(x) = 1 \times x^7 + 0 \times x^6 + 0 \times x^5 + 1 \times x^4 + 1 \times x^3 + 0 \times x^2 + 1 \times x^1 + 0 \times x^0$$
$$= x^7 + x^4 + x^3 + x^1$$
We can thus think of a sender and a receiver as exchanging polynomials with each other.
For the purposes of calculating a CRC, a sender and receiver have to agree on
a divisor polynomial, C(x). C(x) is a polynomial of degree k. For example, suppose
$C(x) = x^3 + x^2 + 1$. In this case, k = 3. The answer to the question “Where did
C(x) come from?” is, in most practical cases, “You look it up in a book.” In fact, the
choice of C(x) has a significant impact on what types of errors can be reliably detected,
as we discuss below. There are a handful of divisor polynomials that are very good
choices for various environments, and the exact choice is normally made as part of the
protocol design. For example, the Ethernet standard uses a well-known polynomial of
degree 32.
When a sender wishes to transmit a message M(x) that is n + 1 bits long, what
is actually sent is the (n + 1)-bit message plus k bits. We call the complete transmitted
message, including the redundant bits, P(x). What we are going to do is contrive
to make the polynomial representing P(x) exactly divisible by C(x); we explain how
this is achieved below. If P(x) is transmitted over a link and there are no errors
introduced during transmission, then the receiver should be able to divide P(x) by C(x)
exactly, leaving a remainder of zero. On the other hand, if some error is introduced
into P(x) during transmission, then in all likelihood the received polynomial will no
longer be exactly divisible by C(x), and thus the receiver will obtain a nonzero
remainder, implying that an error has occurred.
Simple Probability
When dealing with network errors and other unlikely (we hope) events, we often have
use for simple back-of-the-envelope probability estimates. A useful approximation
here is that if two independent events have small probabilities p and q, then the
probability of either event is p + q; the exact answer is
$1 - (1 - p)(1 - q) = p + q - pq$. For p = q = .01, this estimate is .02, while the
exact value is .0199.
For a simple application of this, suppose that the per-bit error rate on a link is
1 in $10^7$. Assuming bit errors are all independent (which they aren’t), we can estimate
that the probability of at least one error in a 10,000-bit packet is
$10^4 / 10^7 = 10^{-3}$. The exact answer, computed as 1 − P(no errors), would be
$1 - (1 - 10^{-7})^{10{,}000} = .00099950$.
For a slightly more complex application, we compute the probability of two errors in
such a packet; this is the probability of an error that would sneak past a 1-parity-bit
checksum. Let $E_{ij}$ be the event that bits i and j are bad, for $0 \le i < j < 10^4$;
the probability of this event is about $p = 10^{-7} \times 10^{-7} = 10^{-14}$.
For a fixed j, the number of events $E_{ij}$ with i < j is j; adding up the number of
these events for all $j < 10^4$, we get
$1 + 2 + \cdots + (10^4 - 1) \approx \frac{1}{2} \times 10^8$. The final probability is thus
$\frac{1}{2} \times 10^8 \times 10^{-14} = \frac{1}{2} \times 10^{-6}$.
Note that had we attempted to estimate P(two errors) = P(first error) × P(second error),
and taken these last two to be P(one error) = $10^{-3}$, we would have obtained
$10^{-6}$ here, which is rather far off; the problem with this approach is that not all
i are equally likely to be the position of the first error. Or, looked at another way, we
have overstated the true probability by a factor of two because we counted errors at
positions (i, j) and (j, i) separately when they should only be counted once.
It will help to understand the following if you know a little about polynomial
arithmetic; it is just slightly different from normal integer arithmetic. We are dealing
with a special class of polynomial arithmetic here, where coefficients may be only one
or zero, and operations on the coefficients are performed using modulo 2 arithmetic.
This is referred to as “polynomial arithmetic modulo 2.” Since this is a networking
book, not a mathematics text, let’s focus on the key properties of this type of arithmetic
for our purposes (which we ask you to accept on faith):
■ Any polynomial B(x) can be divided by a divisor polynomial C(x) if B(x) is
of higher degree than C(x).
■ Any polynomial B(x) can be divided once by a divisor polynomial C(x) if
B(x) is of the same degree as C(x).
■ The remainder obtained when B(x) is divided by C(x) is obtained by subtracting C(x) from B(x).
■ To subtract C(x) from B(x), we simply perform the exclusive-OR (XOR)
operation on each pair of matching coefficients.
For example, the polynomial $x^3 + 1$ can be divided by $x^3 + x^2 + 1$ (because they
are both of degree 3) and the remainder would be $0 \times x^3 + 1 \times x^2 + 0 \times x^1 + 0 \times x^0 = x^2$
(obtained by XORing the coefficients of each term). In terms of messages, we could
say that 1001 can be divided by 1101 and leaves a remainder of 0100. You should be
able to see that the remainder is just the bitwise exclusive-OR of the two messages.
Now that we know the basic rules for dividing polynomials, we are able to do long
division, which is necessary to deal with longer messages. An example appears below.
Recall that we wanted to create a polynomial for transmission that is derived
from the original message M(x), is k bits longer than M(x), and is exactly divisible by
C(x). We can do this in the following way:
1 Multiply M(x) by $x^k$; that is, add k zeroes at the end of the message. Call this
zero-extended message T(x).
2 Divide T(x) by C(x) and find the remainder.
3 Subtract the remainder from T(x).
It should be obvious that what is left at this point is a message that is exactly
divisible by C(x). We may also note that the resulting message consists of M(x) followed by the remainder obtained in step 2, because when we subtracted the remainder
(which can be no more than k bits long), we were just XORing it with the k zeroes
added in step 1. This part will become clearer with an example.
Consider the message $x^7 + x^4 + x^3 + x^1$, or 10011010. We begin by multiplying
by $x^3$, since our divisor polynomial is of degree 3. This gives 10011010000. We divide
this by C(x), which corresponds to 1101 in this case. Figure 2.17 shows the polynomial
long division operation. Given the rules of polynomial arithmetic described above, the
long division operation proceeds much as it would if we were dividing integers. Thus
in the first step of our example, we see that the divisor 1101 divides once into the
first four bits of the message (1001), since they are of the same degree, and leaves
a remainder of 100 (1101 XOR 1001). The next step is to bring down a digit from
the message polynomial until we get another polynomial with the same degree as
C(x), in this case 1001. We calculate the remainder again (100) and continue until the
calculation is complete. Note that the “result” of the long division, which appears at
the top of the calculation, is not really of much interest; it is the remainder at the end
that matters.
Figure 2.17 CRC calculation using polynomial long division. (Divisor 1101, dividend 10011010000.)
You can see from the very bottom of Figure 2.17 that the remainder of the
example calculation is 101. So we know that 10011010000 minus 101 would be
exactly divisible by C(x), and this is what we send. The minus operation in polynomial
arithmetic is the logical XOR operation, so we actually send 10011010101. As noted
above, this turns out to be just the original message with the remainder from the long
division calculation appended to it. The recipient divides the received polynomial by
C(x) and, if the result is 0, concludes that there were no errors. If the result is nonzero,
it may be necessary to discard the errored message; with some codes, it may be possible
to correct a small error (e.g., if the error affected only one bit). A code that enables
error correction is called an error-correcting code (ECC).
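The three-step procedure and the long division above can be sketched in C as follows. The function names mod2_remainder and crc_encode are assumptions for this illustration, and the sketch only handles messages for which the message length plus k fits in 32 bits; real link-level implementations work on byte streams and usually use tables or hardware.

```c
#include <stdint.h>

/* Remainder of the nbits-bit value divided by the (k+1)-bit divisor poly,
 * using polynomial arithmetic modulo 2. Assumes nbits <= 32 and k < 32. */
uint32_t mod2_remainder(uint32_t value, int nbits, uint32_t poly, int k)
{
    uint32_t rem = 0;
    for (int i = nbits - 1; i >= 0; i--) {
        rem = (rem << 1) | ((value >> i) & 1u);  /* bring down the next bit */
        if (rem & (1u << k))                     /* same degree as the divisor */
            rem ^= poly;                         /* subtract, that is, XOR */
    }
    return rem;                                  /* fits in k bits */
}

/* Builds the transmitted codeword from an nbits-bit message. */
uint32_t crc_encode(uint32_t msg, int nbits, uint32_t poly, int k)
{
    uint32_t t = msg << k;                               /* step 1: append k zeroes */
    uint32_t r = mod2_remainder(t, nbits + k, poly, k);  /* step 2: find remainder */
    return t ^ r;                                        /* step 3: subtract it */
}
```

For the message 10011010 (0x9A) and the divisor 1101 (0xD, k = 3), mod2_remainder on 10011010000 returns 101, and crc_encode returns 10011010101 (0x4D5). A receiver can run mod2_remainder over all 11 received bits and treat a result of 0 as "no error detected."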
Now we will consider the question of where the polynomial C(x) comes from.
Intuitively, the idea is to select this polynomial so that it is very unlikely to divide evenly
into a message that has errors introduced into it. If the transmitted message is P(x), we
may think of the introduction of errors as the addition of another polynomial E(x),
so the recipient sees P(x) + E(x). The only way that an error could slip by undetected
would be if the received message could be evenly divided by C(x), and since we know
that P(x) can be evenly divided by C(x), this could only happen if E(x) can be divided
evenly by C(x). The trick is to pick C(x) so that this is very unlikely for common types
of errors.
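To make the role of E(x) concrete, the following sketch XORs an error pattern into a codeword and reports whether the receiver's division would notice. The helper names are assumptions for this example, and the division routine is repeated so that the block stands on its own.

```c
#include <stdint.h>

/* Remainder of the nbits-bit value divided by the (k+1)-bit divisor poly,
 * using polynomial arithmetic modulo 2. Assumes nbits <= 32 and k < 32. */
uint32_t mod2_remainder(uint32_t value, int nbits, uint32_t poly, int k)
{
    uint32_t rem = 0;
    for (int i = nbits - 1; i >= 0; i--) {
        rem = (rem << 1) | ((value >> i) & 1u);
        if (rem & (1u << k))
            rem ^= poly;
    }
    return rem;
}

/* Returns 1 if the error pattern e, added (XORed) to the nbits-bit codeword p,
 * leaves a nonzero remainder at the receiver, that is, if it is detected. */
int error_detected(uint32_t p, uint32_t e, int nbits, uint32_t poly, int k)
{
    return mod2_remainder(p ^ e, nbits, poly, k) != 0;
}
```

Using the codeword 10011010101 (0x4D5) and the divisor 1101 from the running example, any single flipped bit is detected. An error pattern that is itself a multiple of C(x), such as 1101 shifted left by two positions (0x34), leaves a remainder of zero and slips through, which is exactly the case a good choice of C(x) tries to make unlikely.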
CRC          C(x)
CRC-8        $x^8 + x^2 + x^1 + 1$
CRC-10       $x^{10} + x^9 + x^5 + x^4 + x^1 + 1$
CRC-12       $x^{12} + x^{11} + x^3 + x^2 + 1$
CRC-16       $x^{16} + x^{15} + x^2 + 1$
CRC-CCITT    $x^{16} + x^{12} + x^5 + 1$
CRC-32       $x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1$
Table 2.5 Common CRC polynomials.
One common type of error is a single-bit error, which can be expressed as
$E(x) = x^i$ when it affects bit position i. If we select C(x) such that the first and the last term are
nonzero, then we already have a two-term polynomial that cannot divide evenly into
the one term E(x). Such a C(x) can, therefore, detect all single-bit errors. In general,
it is possible to prove that the following types of errors can be detected by a C(x) with
the stated properties:
■ All single-bit errors, as long as the $x^k$ and $x^0$ terms have nonzero coefficients.
■ All double-bit errors, as long as C(x) has a factor with at least three terms.
■ Any odd number of errors, as long as C(x) contains the factor (x + 1).
■ Any “burst” error (i.e., sequence of consecutive errored bits) for which the length
of the burst is less than k bits. (Most burst errors of larger than k bits can also be
detected.)
Six versions of C(x) are widely used in link-level protocols (shown in Table 2.5).
For example, the Ethernet and 802.5 networks described later in this chapter use
CRC-32, while HDLC uses CRC-CCITT. ATM, as described in Chapter 3, uses CRC-8,
CRC-10, and CRC-32.
Finally, we note that the CRC algorithm, while seemingly complex, is easily
implemented in hardware using a k-bit shift register and XOR gates. The number of
bits in the shift register equals the degree of the generator polynomial (k). Figure 2.18
shows the hardware that would be used for the generator $x^3 + x^2 + 1$ from our previous
example. The message is shifted in from the left, beginning with the most significant
bit and ending with the string of k zeroes that is attached to the message, just as
in the long division example. When all the bits have been shifted in and appropriately
XORed, the register contains the remainder, that is, the CRC (most significant bit on the
right). The position of the XOR gates is determined as follows: If the bits in the shift
register are labelled 0 through k − 1, left to right, then put an XOR gate in front of
bit n if there is a term $x^n$ in the generator polynomial. Thus, we see an XOR gate in
front of positions 0 and 2 for the generator $x^3 + x^2 + x^0$.
Figure 2.18 CRC calculation using shift register. (Figure label: XOR gate.)
Error Detection or Error Correction?
We have mentioned that it is possible to use codes that not only detect the presence of
errors but also enable errors to be corrected. Since the details of such codes require yet
more complex mathematics than that required to understand CRCs, we will not dwell on
them here. However, it is worth considering the merits of correction versus detection.
At first glance, it would seem that correction is always better, since with detection we
are forced to throw away the message and, in general, ask for another copy to be
transmitted. This uses up bandwidth and may introduce latency while waiting for the
retransmission. However, there is a downside to correction: It generally requires a
greater number of redundant bits to send an error-correcting code that is as strong
(that is, able to cope with the same range of errors) as a code that only detects errors.
Thus, while error detection requires more bits to be sent when errors occur, error
correction requires more bits to be sent all the time. As a result,
error correction tends to be most useful when (1) errors are quite probable, as they
may be, for example, in a wireless environment, or (2) the cost of retransmission is
too high, for example, because of the latency involved in retransmitting a packet over
a satellite link.
The use of error-correcting codes in networking is sometimes referred to as forward
error correction (FEC) because the correction of errors is handled “in advance” by
sending extra information, rather than waiting for errors to happen and dealing with
them later by retransmission.
2.5 Reliable Transmission
As we saw in the previous section, frames are sometimes corrupted while in transit,
with an error code like CRC used to detect such errors. While some error codes are
strong enough also to correct errors, in practice the overhead is typically too
large to handle the range of bit and burst errors that can be introduced on a network
link. Even when error-correcting codes are used (e.g., on wireless links), some errors
will be too severe to be corrected. As a result, some corrupt frames must be discarded.
A link-level protocol that wants to deliver frames reliably must somehow recover from
these discarded (lost) frames.
This is usually accomplished using a combination of two fundamental
mechanisms: acknowledgments and timeouts. An acknowledgment (ACK for short)
is a small control frame that a protocol sends back to its peer saying that it has received
an earlier frame. By control frame we mean a header without any data, although a
protocol can piggyback an ACK on a data frame it just happens to be sending in the
opposite direction. The receipt of an acknowledgment indicates to the sender of the
original frame that its frame was successfully delivered. If the sender does not receive
an acknowledgment after a reasonable amount of time, then it retransmits the original
frame. This action of waiting a reasonable amount of time is called a timeout.
The general strategy of using acknowledgments and timeouts to implement reliable
delivery is sometimes called automatic repeat request (normally abbreviated
ARQ). This section describes three different ARQ algorithms using generic language;
that is, we do not give detailed information about a particular protocol’s header fields.
Stop-and-Wait
The simplest ARQ scheme is the stop-and-wait algorithm. The idea of stop-and-wait
is straightforward: After transmitting one frame, the sender waits for an
acknowledgment before transmitting the next frame. If the acknowledgment does not
arrive after a certain period of time, the sender times out and retransmits the original
frame.
Figure 2.19 illustrates four different scenarios that result from this basic algorithm.
This figure is a timeline, a common way to depict a protocol’s behavior. The sending
side is represented on the left, the receiving side is depicted on the right, and time
flows from top to bottom. Figure 2.19(a) shows the situation in which the ACK is
received before the timer expires, (b) and (c) show the situation in which
the original frame and the ACK, respectively, are lost, and (d) shows the situation in
which the timeout fires too soon. Recall that by “lost” we mean that the frame was
corrupted while in transit, that this corruption was detected by an error code on the
receiver, and that the frame was subsequently discarded.
There is one important subtlety in the stop-and-wait algorithm. Suppose the
sender sends a frame and the receiver acknowledges it, but the acknowledgment is
either lost or delayed in arriving. This situation is illustrated in timelines (c) and (d)
of Figure 2.19. In both cases, the sender times out and retransmits the original frame,
but the receiver will think that it is the next frame, since it correctly received and
acknowledged the first frame. This has the potential to cause duplicate copies of a
frame to be delivered. To address this problem, the header for a stop-and-wait protocol
usually includes a 1-bit sequence number (that is, the sequence number can take on the
values 0 and 1), and the sequence numbers used for each frame alternate, as illustrated
in Figure 2.20. Thus, when the sender retransmits frame 0, the receiver can determine
that it is seeing a second copy of frame 0 rather than the first copy of frame 1 and
therefore can ignore it (the receiver still acknowledges it, in case the first ACK was lost).
Figure 2.19 Timeline showing four different scenarios for the stop-and-wait algorithm.
(a) The ACK is received before the timer expires; (b) the original frame is lost; (c) the
ACK is lost; (d) the timeout fires too soon.
Figure 2.20 Timeline for stop-and-wait with 1-bit sequence number.
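The receiver's side of this 1-bit scheme might look like the sketch below. The structure and function names are assumptions made for illustration; framing, error checking, and the sender's timer are left out.

```c
#include <stdint.h>

/* Receiver state for stop-and-wait with a 1-bit sequence number. */
struct sw_receiver {
    uint8_t expected;   /* sequence number (0 or 1) of the next new frame */
};

/* Handles an arriving frame with sequence number seq (0 or 1).
 * Sets *deliver to 1 if the frame is new and should be passed up,
 * or to 0 if it is a duplicate. Returns the sequence number to ACK. */
uint8_t sw_receive(struct sw_receiver *r, uint8_t seq, int *deliver)
{
    if (seq == r->expected) {
        *deliver = 1;
        r->expected = (uint8_t)(r->expected ^ 1u);  /* alternate 0, 1, 0, ... */
    } else {
        *deliver = 0;   /* second copy of a frame already received */
    }
    return seq;         /* acknowledge duplicates too, in case the ACK was lost */
}
```

If frame 0 arrives twice because its ACK was lost, the first call delivers it and the second call only repeats the ACK, so the duplicate is never handed to the higher layer.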
The main shortcoming of the stop-and-wait algorithm is that it allows the sender
to have only one outstanding frame on the link at a time, and this may be far below
the link’s capacity. Consider, for example, a 1.5-Mbps link with a 45-ms round-trip
time. This link has a delay × bandwidth product of 67.5 Kb, or approximately 8 KB.
Since the sender can send only one frame per RTT, and assuming a frame size of 1 KB,
this implies a maximum sending rate of
BitsPerFrame ÷ TimePerFrame
= 1024 × 8 ÷ 0.045
= 182 Kbps
or about one-eighth of the link’s capacity. To use the link fully, then, we’d like the sender
to be able to transmit up to eight frames before having to wait for an acknowledgment.
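The arithmetic behind these figures can be written out step by step. The block below only restates the numbers given above (a 1.5-Mbps link, a 45-ms RTT, and 1-KB frames):

```latex
\begin{align*}
\text{delay} \times \text{bandwidth}
  &= 0.045\ \text{s} \times 1.5 \times 10^{6}\ \text{bps}
   = 67{,}500\ \text{bits} \approx 8\ \text{KB} \\[4pt]
\text{maximum sending rate}
  &= \frac{1024 \times 8\ \text{bits}}{0.045\ \text{s}}
   \approx 182{,}044\ \text{bps} \approx 182\ \text{Kbps} \\[4pt]
\frac{182\ \text{Kbps}}{1.5\ \text{Mbps}}
  &\approx 0.12 \approx \tfrac{1}{8}
\end{align*}
```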
The significance of the bandwidth × delay product is that it represents the
amount of data that could be in transit. We would like to be able to send this much
data without waiting for the first acknowledgment. The principle at work here is often
referred to as keeping the pipe full. The algorithms presented in the following two
subsections do exactly this.
Sliding Window
Consider again the scenario in which the link has a delay × bandwidth product of
8 KB and frames are of 1-KB size. We would like the sender to be ready to transmit the
ninth frame at pretty much the same moment that the ACK for the first frame arrives.
The algorithm that allows us to do this is called sliding window, and an illustrative
timeline is given in Figure 2.21.
The Sliding Window Algorithm
The sliding window algorithm works as follows. First, the sender assigns a sequence
number, denoted SeqNum, to each frame. For now, let’s ignore the fact that SeqNum is
implemented by a finite-size header field and instead assume that it can grow infinitely
large. The sender maintains three variables: The send window size, denoted SWS,
gives the upper bound on the number of outstanding (unacknowledged) frames that
the sender can transmit; LAR denotes the sequence number of the last acknowledgment
received; and LFS denotes the sequence number of the last frame sent. The sender also
maintains the following invariant:
LFS − LAR ≤ SWS
This situation is illustrated in Figure 2.22.
Figure 2.21 Timeline for the sliding window algorithm.
Figure 2.22 Sliding window on sender.
Figure 2.23 Sliding window on receiver.
When an acknowledgment arrives, the sender moves LAR to the right, thereby
allowing the sender to transmit another frame. Also, the sender associates a timer with
each frame it transmits, and it retransmits the frame should the timer expire before an
ACK is received. Notice that the sender has to be willing to buffer up to SWS frames
since it must be prepared to retransmit them until they are acknowledged.
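A minimal sketch of the sender's bookkeeping is shown below, under the same simplifying assumption of unbounded sequence numbers. The structure and function names are hypothetical; the per-frame timers and the retransmission buffer are omitted, and the first frame is assumed to carry sequence number 1. The sender is taken to keep LFS − LAR no larger than SWS.

```c
#include <stdint.h>

struct sw_sender {
    uint32_t sws;   /* send window size */
    uint32_t lar;   /* sequence number of the last acknowledgment received */
    uint32_t lfs;   /* sequence number of the last frame sent */
};

/* Returns 1 if one more frame can be sent while keeping LFS - LAR <= SWS. */
int sender_can_send(const struct sw_sender *s)
{
    return s->lfs - s->lar < s->sws;
}

/* Records the transmission of the next frame and returns its sequence number.
 * The caller is expected to check sender_can_send() first. */
uint32_t sender_send(struct sw_sender *s)
{
    return ++s->lfs;
}

/* Processes an acknowledgment; LAR moves right only for new, valid ACKs. */
void sender_ack(struct sw_sender *s, uint32_t acked)
{
    if (acked > s->lar && acked <= s->lfs)
        s->lar = acked;
}
```

With SWS = 3, the sender can transmit frames 1, 2, and 3 and must then wait; an acknowledgment for frame 2 moves LAR to 2 and opens the window for frames 4 and 5.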
The receiver maintains the following three variables: The receive window size,
denoted RWS, gives the upper bound on the number of out-of-order frames that the
receiver is willing to accept; LAF denotes the sequence number of the largest acceptable
frame; and LFR denotes the sequence number of the last frame received. The receiver
also maintains the following invariant:
LAF − LFR ≤ RWS
This situation is illustrated in Figure 2.23.
When a frame with sequence number SeqNum arrives, the receiver takes the
following action. If SeqNum ≤ LFR or SeqNum > LAF, then the frame is outside
the receiver’s window and it is discarded. If LFR < SeqNum ≤ LAF, then the frame is
within the receiver’s window and it is accepted. Now the receiver needs to decide
whether or not to send an ACK. Let SeqNumToAck denote the largest sequence
number not yet acknowledged, such that all frames with sequence numbers less
than or equal to SeqNumToAck have been received. The receiver acknowledges the
receipt of SeqNumToAck, even if higher-numbered packets have been received.
This acknowledgment is said to be cumulative. It then sets LFR = SeqNumToAck
and adjusts LAF = LFR + RWS.
For example, suppose LFR = 5 (i.e., the last ACK the receiver sent was for
sequence number 5), and RWS = 4. This implies that LAF = 9. Should frames 7 and 8
arrive, they will be buffered because they are within the receiver’s window. However,
no ACK needs to be sent since frame 6 is yet to arrive. Frames 7 and 8 are said to have
arrived out of order. (Technically, the receiver could resend an ACK for frame 5 when
frames 7 and 8 arrive.) Should frame 6 then arrive—perhaps it is late because it was
lost the first time and had to be retransmitted, or perhaps it was simply delayed—the
receiver acknowledges frame 8, bumps LFR to 8, and sets LAF to 12. If frame 6 was
in fact lost, then a timeout will have occurred at the sender, causing it to retransmit
frame 6.
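The receiver's rules and the example above can be sketched as follows. This is an illustration under stated assumptions: the names are hypothetical, RWS is fixed at 4 to match the example, sequence numbers are treated as unbounded, and the buffer records only whether a frame has arrived, not its contents.

```c
#include <stdint.h>
#include <string.h>

#define RWS 4                   /* receive window size (example value) */

struct sw_rcv {
    uint32_t lfr;               /* last frame received (and acknowledged) */
    uint32_t laf;               /* largest acceptable frame, lfr + RWS */
    uint8_t  have[RWS];         /* have[s % RWS] != 0: frame s is buffered */
};

void rcv_init(struct sw_rcv *r, uint32_t lfr)
{
    memset(r, 0, sizeof *r);
    r->lfr = lfr;
    r->laf = lfr + RWS;
}

/* Handles frame seq. Returns 1 and stores the cumulative ACK in *ack when an
 * ACK should be sent; returns 0 when the frame is discarded or only buffered. */
int rcv_frame(struct sw_rcv *r, uint32_t seq, uint32_t *ack)
{
    if (seq <= r->lfr || seq > r->laf)
        return 0;                       /* outside the window: discard */
    r->have[seq % RWS] = 1;             /* inside the window: accept */
    if (seq != r->lfr + 1)
        return 0;                       /* a gap remains; nothing new to ACK */
    while (r->have[(r->lfr + 1) % RWS]) {   /* advance over in-order frames */
        r->have[(r->lfr + 1) % RWS] = 0;
        r->lfr++;
    }
    r->laf = r->lfr + RWS;
    *ack = r->lfr;                      /* SeqNumToAck */
    return 1;
}
```

Starting from LFR = 5, frames 7 and 8 are buffered without an ACK; when frame 6 arrives, the sketch acknowledges frame 8, sets LFR to 8, and sets LAF to 12, as in the example.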
We observe that when a timeout occurs, the amount of data in transit decreases,
since the sender is unable to advance its window until frame 6 is acknowledged. This
means that when packet losses occur, this scheme is no longer keeping the pipe full.
The longer it takes to notice that a packet loss has occurred, the more severe this
problem becomes.
Notice that in this example, the receiver could have sent a negative acknowledgment (NAK) for frame 6 as soon as frame 7 arrived. However, this is unnecessary
since the sender’s timeout mechanism is sufficient to catch this situation, and sending
NAKs adds complexity to the receiver. Also, as we mentioned, it would
have been legitimate to send additional acknowledgments of frame 5 when frames 7
and 8 arrived; in some cases, a sender can use duplicate ACKs as a clue that a frame
was lost. Both approaches help to improve performance by allowing early detection
of packet losses.
Yet another variation on this scheme would be to use selective acknowledgments.
That is, the receiver could acknowledge exactly those frames it has received, rather
than just the highest-numbered frame received in order. So, in the above example, the
receiver could acknowledge the receipt of frames 7 and 8. Giving more information
to the sender makes it potentially easier for the sender to keep the pipe full, but adds
complexity to the implementation.
The sending window size is selected according to how many frames we want to
have outstanding on the link at a given time; SWS is easy to compute for a given delay ×
bandwidth product. (Easy, that is, if we know the delay and the bandwidth. Sometimes we do not, and estimating them well is a
challenge to protocol designers. We discuss this further in Chapter 5.) On the other hand, the receiver can set RWS to whatever it wants.
Two common settings are RWS = 1, which implies that the receiver will not buffer any
frames that arrive out of order, and RWS = SWS, which implies that the receiver can
buffer any of the frames the sender transmits. It makes no sense to set RWS > SWS
since it’s impossible for more than SWS frames to arrive out of order.
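As a rough illustration of how SWS follows from the delay × bandwidth product, the sketch below divides the number of bits in flight by the frame size and rounds up. The function name `sws_for_pipe` and its parameters are hypothetical, and taking the delay to be the round-trip time in milliseconds is an assumption of this sketch.

```c
#include <assert.h>

/*
 * Hypothetical helper: smallest number of frames that covers the
 * delay x bandwidth product, i.e., enough frames to keep the pipe full.
 *   bandwidth_bps : link bandwidth in bits per second
 *   rtt_ms        : round-trip delay in milliseconds (assumed meaning of "delay")
 *   frame_bytes   : size of one frame in bytes
 */
static unsigned long long
sws_for_pipe(unsigned long long bandwidth_bps,
             unsigned long long rtt_ms,
             unsigned long long frame_bytes)
{
    assert(frame_bytes > 0);

    unsigned long long bits_in_flight = bandwidth_bps * rtt_ms / 1000;
    unsigned long long frame_bits = frame_bytes * 8;

    /* Round up so that a partially filled last frame still counts. */
    return (bits_in_flight + frame_bits - 1) / frame_bits;
}
```

For instance, a 1.5-Mbps link with a 45-ms round-trip time holds 67,500 bits, so with 1024-byte frames this function yields a window of 9 frames.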
Finite Sequence Numbers and Sliding Window
We now return to the one simplification we introduced into the algorithm—our assumption that sequence numbers can grow infinitely large. In practice, of course, a
frame’s sequence number is specified in a header field of some finite size. For example,
a 3-bit field means that there are eight possible sequence numbers, 0 . . . 7. This makes
it necessary to reuse sequence numbers or, stated another way, sequence numbers wrap
around. This introduces the problem of being able to distinguish between different incarnations of the same sequence numbers, which implies that the number of possible
sequence numbers must be larger than the number of outstanding frames allowed. For
example, stop-and-wait allowed one outstanding frame at a time and had two distinct
sequence numbers.
Suppose we have one more number in our space of sequence numbers than
we have potentially outstanding frames; that is, SWS ≤ MaxSeqNum − 1, where
MaxSeqNum is the number of available sequence numbers. Is this sufficient? The
answer depends on RWS. If RWS = 1, then MaxSeqNum ≥ SWS + 1 is sufficient. If
RWS is equal to SWS, then having a MaxSeqNum just one greater than the sending
window size is not good enough. To see this, consider the situation in which we
have the eight sequence numbers 0 through 7, and SWS = RWS = 7. Suppose the
sender transmits frames 0..6, they are successfully received, but the ACKs are lost. The
receiver is now expecting frames 7, 0..5, but the sender times out and sends frames 0..6.
Unfortunately, the receiver is expecting the second incarnation of frames 0..5, but gets
the first incarnation of these frames. This is exactly the situation we wanted to avoid.
It turns out that the sending window size can be no more than half as big
as the number of available sequence numbers when RWS = SWS, or stated more
precisely,
    SWS < (MaxSeqNum + 1)/2
Intuitively, what this is saying is that the sliding window protocol alternates between
the two halves of the sequence number space, just as stop-and-wait alternates between
sequence numbers 0 and 1. The only difference is that it continually slides between the
two halves rather than discretely alternating between them.
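The failure scenario and the resulting bound can be checked mechanically. The sketch below is an illustration only: `max_sws_equal_windows`, `in_recv_window`, and `retransmit_is_ambiguous` are hypothetical helpers, and the scenario modeled is the one described above (every frame in the first window arrives, every ACK is lost, and frame 0 is retransmitted).

```c
#include <assert.h>
#include <stdbool.h>

/* Largest SWS satisfying SWS < (MaxSeqNum + 1)/2, for the RWS = SWS case. */
static int max_sws_equal_windows(int max_seq_num)
{
    assert(max_seq_num >= 2);
    return max_seq_num / 2;   /* integer division gives floor(MaxSeqNum/2) */
}

/*
 * Is seq inside the receive window of rws sequence numbers starting at
 * nfe (next frame expected), with numbers wrapping modulo max_seq_num?
 */
static bool in_recv_window(int seq, int nfe, int rws, int max_seq_num)
{
    int offset = (seq - nfe + max_seq_num) % max_seq_num;
    return offset < rws;
}

/*
 * Replays the scenario from the text with SWS = RWS = sws: the sender
 * transmits frames 0..sws-1, all arrive, every ACK is lost, and frame 0
 * is retransmitted. Returns true if the receiver would mistake the old
 * frame 0 for a new one.
 */
static bool retransmit_is_ambiguous(int sws, int max_seq_num)
{
    int nfe = sws % max_seq_num;   /* receiver has advanced past 0..sws-1 */
    return in_recv_window(0, nfe, sws, max_seq_num);
}
```

With eight sequence numbers, a window of 7 (or even 5) lets the retransmitted frame 0 fall inside the receiver's new window, whereas a window of 4 does not.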
Note that this rule is specific to the situation where RWS = SWS. We leave
it as an exercise to determine the more general rule that works for arbitrary values of RWS and SWS. Also note that the relationship between the window size and
the sequence number space depends on an assumption that is so obvious that it
is easy to overlook, namely, that frames are not reordered in transit. This cannot
happen on a direct point-to-point link since there is no way for one frame to overtake
another during transmission. However, we will see the sliding window algorithm used
in a different environment in Chapter 5, and we will need to devise another rule.
Implementation of Sliding Window
The following routines illustrate how we might implement the sending and receiving
sides of the sliding window algorithm. The routines are taken from a working protocol
named, appropriately enough, sliding window protocol (SWP). So as not to concern
ourselves with the adjacent protocols in the protocol graph, we denote the protocol
sitting above SWP as HLP (high-level protocol) and the protocol sitting below SWP
as LINK (link-level protocol).
We start by defining a pair of data structures. First, the frame header is very
simple: It contains a sequence number (SeqNum) and an acknowledgment number
(AckNum). It also contains a Flags field that indicates whether the frame is an ACK or
carries data.
    typedef u_char SwpSeqno;

    typedef struct {
        SwpSeqno SeqNum;   /* sequence number of this frame */
        SwpSeqno AckNum;   /* ack of received frame */
        u_char   Flags;    /* up to 8 bits' worth of flags */
    } SwpHdr;
Next, the state of the sliding window algorithm has the following structure. For
the sending side of the protocol, this state includes variables LAR and LFS, as described
earlier in this section, as well as a queue that holds frames that have been transmitted but not yet acknowledged (sendQ). The sending state also includes a counting
semaphore called sendWindowNotFull. We will see how this is used below, but generally a semaphore is a synchronization primitive that supports semWait and semSignal operations. Every invocation of semSignal increments the semaphore by 1, and
every invocation of semWait decrements the semaphore by 1, with the calling process blocked (suspended) should decrementing the semaphore cause its value to become less than 0. A
process that is blocked during its call to semWait will be allowed to resume as soon as
enough semSignal operations have been performed to raise the value of the semaphore
above 0.
For the receiving side of the protocol, the state includes the variable NFE. This is
the next frame expected—the frame with a sequence number one more than the last
frame received (LFR), described earlier in this section. There is also a queue that holds
frames that have been received out of order (recvQ). Finally, although not shown,
the sender and receiver sliding window sizes are defined by constants SWS and RWS,
respectively.
    typedef struct {
        /* sender side state: */
        SwpSeqno  LAR;                 /* seqno of last ACK received */
        SwpSeqno  LFS;                 /* last frame sent */
        Semaphore sendWindowNotFull;
        SwpHdr    hdr;                 /* preinitialized header */
        struct sendQ_slot {
            Event timeout;             /* event associated with send-timeout */
            Msg   msg;
        } sendQ[SWS];

        /* receiver side state: */
        SwpSeqno  NFE;                 /* seqno of next frame expected */
        struct recvQ_slot {
            int   received;            /* is msg valid? */
            Msg   msg;
        } recvQ[RWS];
    } SwpState;
The sending side of SWP is implemented by procedure sendSWP. This routine
is rather simple. First, semWait causes this process to block on a semaphore until it
is OK to send another frame. Once allowed to proceed, sendSWP sets the sequence
number in the frame’s header, saves a copy of the frame in the transmit queue (sendQ),
schedules a timeout event to handle the case in which the frame is not acknowledged,
and sends the frame to the next-lower-level protocol, which we denote as LINK.
One detail worth noting is the call to store_swp_hdr just before the call to
msgAddHdr. This routine translates the C structure that holds the SWP header (state->hdr)
into a byte string that can be safely attached to the front of the message (hbuf).
This routine (not shown) must translate each integer field in the header into network
byte order and remove any padding that the compiler has added to the C structure.
The issue of byte order is discussed more fully in Section 7.1, but for now it is enough
to assume that this routine places the most significant byte of a multibyte integer in
the byte with the lowest address. Also, we assume an abstract data type, denoted
Msg, that holds a message. The Msg type supports operations such as msgAddHdr,
msgStripHdr, and msgSaveCopy, all of which appear in the routines in this section.
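Since the header translation routines are not shown, here is one way they might look. This is a sketch under stated assumptions, not the routines from SWP itself: it assumes each of the three header fields occupies exactly one byte on the wire (so HLEN is 3 and no byte swapping is needed), and it uses an unsigned byte buffer for hbuf. The type definitions are repeated only so that the example stands alone.

```c
#include <assert.h>

typedef unsigned char u_char;   /* defined here so the sketch stands alone */
typedef u_char SwpSeqno;

typedef struct {
    SwpSeqno SeqNum;   /* sequence number of this frame */
    SwpSeqno AckNum;   /* ack of received frame */
    u_char   Flags;    /* up to 8 bits' worth of flags */
} SwpHdr;

#define HLEN 3   /* assumed wire size: one byte per header field */

/* Copy the header into hbuf field by field, so the wire layout never
 * depends on compiler padding. Single-byte fields need no byte swapping. */
static void store_swp_hdr(SwpHdr hdr, u_char *hbuf)
{
    assert(hbuf != 0);
    hbuf[0] = hdr.SeqNum;
    hbuf[1] = hdr.AckNum;
    hbuf[2] = hdr.Flags;
}

/* Inverse of store_swp_hdr: rebuild the C structure from the byte string. */
static void load_swp_hdr(SwpHdr *hdr, const u_char *hbuf)
{
    assert(hdr != 0 && hbuf != 0);
    hdr->SeqNum = hbuf[0];
    hdr->AckNum = hbuf[1];
    hdr->Flags  = hbuf[2];
}
```

Writing the fields out one at a time, rather than copying the structure wholesale, is what makes the byte string independent of how a particular compiler lays out SwpHdr in memory.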
Another piece of complexity in this routine is the use of semWait and the sendWindowNotFull semaphore. sendWindowNotFull is initialized to the size of the sender’s
sliding window, SWS (this initialization is not shown). Each time the sender transmits a frame, the semWait operation decrements this count and blocks the sender
if the window is already full, that is, if the count has reached 0. Each time an ACK is received, the semSignal operation
invoked in deliverSWP (see below) increments this count, thus unblocking any waiting sender.
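    static int
    sendSWP(SwpState *state, Msg *frame)
    {
        struct sendQ_slot *slot;
        char hbuf[HLEN];

        /* wait for send window to open */
        semWait(&state->sendWindowNotFull);
        state->hdr.SeqNum = ++state->LFS;
        slot = &state->sendQ[state->hdr.SeqNum % SWS];
        store_swp_hdr(state->hdr, hbuf);
        msgAddHdr(frame, hbuf, HLEN);
        msgSaveCopy(&slot->msg, frame);
        /* retransmit if no ACK arrives in time; SWP_SEND_TIMEOUT is an
           assumed name for the timeout interval */
        slot->timeout = evSchedule(swpTimeout, slot,
                                   SWP_SEND_TIMEOUT);
        return sendLINK(frame);
    }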
Now to SWP’s protocol-specific implementation of the deliver operation, which
is given in procedure deliverSWP. This routine actually handles two different kinds
of incoming messages: ACKs for frames sent earlier from this node and data frames
arriving at this node. In a sense, the ACK half of this routine is the counterpart to the
sender side of the algorithm given in sendSWP. A decision as to whether the incoming
message is an ACK or a data frame is made by checking the Flags field in the header.
Note that this particular implementation does not support piggybacking ACKs on data
frames.
When the incoming frame is an ACK, deliverSWP simply finds the slot in the
transmit queue (sendQ) that corresponds to the ACK, cancels the timeout event, and
frees the frame saved in that slot. This work is actually done in a loop since the
ACK may be cumulative. The only other thing to notice about this case is the call
to subroutine swpInWindow. This subroutine, which is given below, ensures that the
sequence number for the frame being acknowledged is within the range of ACKs that
the sender currently expects to receive.
When the incoming frame contains data, deliverSWP first calls msgStripHdr and
load_swp_hdr to extract the header from the frame. Routine load_swp_hdr is the counterpart to store_swp_hdr discussed earlier; it translates a byte string into the C data
structure that holds the SWP header. deliverSWP then calls swpInWindow to make
sure the sequence number of the frame is within the range of sequence numbers that it
expects. If it is, the routine loops over the set of consecutive frames it has received and
passes them up to the higher-level protocol by invoking the deliverHLP routine. It also
sends a cumulative ACK back to the sender, but does so by looping over the receive
queue (it does not use the SeqNumToAck variable used in the prose description given
earlier in this section).
    static int
    deliverSWP(SwpState *state, Msg *frame)
    {
        SwpHdr hdr;
        char  *hbuf;

        /* evCancel and msgDestroy below are assumed names for the
           event-cancel and message-free operations */
        hbuf = msgStripHdr(frame, HLEN);
        load_swp_hdr(&hdr, hbuf);
        if (hdr.Flags & FLAG_ACK_VALID) {
            /* received an acknowledgment: do SENDER side */
            if (swpInWindow(hdr.AckNum, state->LAR + 1,
                            state->LFS)) {
                do {
                    struct sendQ_slot *slot;

                    slot = &state->sendQ[++state->LAR % SWS];
                    evCancel(slot->timeout);    /* stop the send timeout */
                    msgDestroy(&slot->msg);     /* free the saved frame */
                    semSignal(&state->sendWindowNotFull);
                } while (state->LAR != hdr.AckNum);
            }
        }
        if (hdr.Flags & FLAG_HAS_DATA) {
            struct recvQ_slot *slot;

            /* received data packet: do RECEIVER side */
            slot = &state->recvQ[hdr.SeqNum % RWS];
            if (!swpInWindow(hdr.SeqNum, state->NFE,
                             state->NFE + RWS - 1)) {
                /* drop the message */
                return SUCCESS;
            }
            msgSaveCopy(&slot->msg, frame);
            slot->received = TRUE;
            if (hdr.SeqNum == state->NFE) {
                Msg m;

                /* pass up the run of consecutive frames, in order */
                while (slot->received) {
                    deliverHLP(&slot->msg);
                    msgDestroy(&slot->msg);
                    slot->received = FALSE;
                    slot = &state->recvQ[++state->NFE % RWS];
                }
                /* send ACK: */
                prepare_ack(&m, state->NFE - 1);
                sendLINK(&m);
                msgDestroy(&m);
            }
        }
        return SUCCESS;
    }
Finally, swpInWindow is a simple subroutine that checks to see if a given sequence
number falls between some minimum and maximum sequence number.
    static bool
    swpInWindow(SwpSeqno seqno, SwpSeqno min, SwpSeqno max)
    {
        SwpSeqno pos, maxpos;

        pos    = seqno - min;       /* pos *should* be in range [0..MAX) */
        maxpos = max - min + 1;     /* maxpos is in range [0..MAX] */
        return pos < maxpos;
    }
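It may not be obvious why this test still works when the window wraps around the end of the sequence number space. The standalone variant below, a sketch that assumes an 8-bit SwpSeqno and uses the hypothetical name `swp_in_window`, makes the modulo-256 arithmetic explicit with casts so that it can be compiled and tried on its own.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned char SwpSeqno;   /* assumed 8-bit sequence number */

/* Same logic as swpInWindow: the subtraction is reduced modulo 256 by the
 * cast, so the comparison stays correct when the window straddles the
 * wrap from 255 back to 0. */
static bool swp_in_window(SwpSeqno seqno, SwpSeqno min, SwpSeqno max)
{
    SwpSeqno pos    = (SwpSeqno)(seqno - min);     /* distance from window start */
    SwpSeqno maxpos = (SwpSeqno)(max - min + 1);   /* number of slots in window */
    return pos < maxpos;
}
```

For a window running from 254 through 2, sequence number 1 lies at distance 3 from the start of a five-slot window and is accepted, while sequence number 3 lies at distance 5 and is rejected.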
Frame Order and Flow Control
The sliding window protocol is perhaps the best-known algorithm in computer networking. What can be confusing about the algorithm, however, is that it can be used
to serve three different roles. The first role is the one we have been concentrating on
in this section—to reliably deliver frames across an unreliable link. (In general, the
algorithm can be used to reliably deliver messages across an unreliable network.) This
is the core function of the algorithm.
The second role that the sliding window algorithm can serve is to preserve the
order in which frames are transmitted. This is easy to do at the receiver—since each
frame has a sequence number, the receiver just makes sure that it does not pass a
frame up to the next-higher-level protocol until it has already passed up all frames
with a smaller sequence number. That is, the receiver buffers (i.e., does not pass along)
out-of-order frames. The version of the sliding window algorithm described in this
section does preserve frame order, although we could imagine a variation in which
the receiver passes frames to the next protocol without waiting for all earlier frames
to be delivered. A question we should ask ourselves is whether we really need the
sliding window protocol to keep the frames in order, or whether, instead, this is unnecessary functionality at the link level. Unfortunately, we have not yet seen enough
of the network architecture to answer this question; we first need to understand how
a sequence of point-to-point links is connected by switches to form an end-to-end
path.
The third role that the sliding window algorithm sometimes plays is to support
flow control—a feedback mechanism by which the receiver is able to throttle the
sender. Such a mechanism is used to keep the sender from overrunning the receiver,
that is, from transmitting more data than the receiver is able to process. This is usually
accomplished by augmenting the sliding window protocol so that the receiver not only
acknowledges frames it has received, but also informs the sender of how many frames
it has room to receive. The number of frames that the receiver is capable of receiving
corresponds to how much free buffer space it has. As in the case of ordered delivery, we
need to make sure that flow control is necessary at the link level before incorporating
it into the sliding window protocol.
One important concept to take away from this discussion is the system design
principle we call separation of concerns. That is, you must be careful to distinguish
between different functions that are sometimes rolled together in one mechanism,
and you must make sure that each function is necessary and being supported in the
most effective way. In this particular case, reliable delivery, ordered delivery, and flow
control are sometimes combined in a single sliding window protocol, and we should
ask ourselves if this is the right thing to do at the link level. With this question in mind,
we revisit the sliding window algorithm in Chapter 3 (we show how X.25 networks
use it to implement hop-by-hop flow control) and in Chapter 5 (we describe how TCP
uses it to implement a reliable byte-stream channel).
Concurrent Logical Channels
The data link protocol used in the ARPANET provides an interesting alternative to
the sliding window protocol, in that it is able to keep the pipe full while still using the
simple stop-and-wait algorithm. One important consequence of this approach is that
the frames sent over a given link are not kept in any particular order. The protocol
also implies nothing about flow control.
The idea underlying the ARPANET protocol, which we refer to as concurrent
logical channels, is to multiplex several logical channels onto a single point-to-point
link and to run the stop-and-wait algorithm on each of these logical channels. There is
no relationship maintained among the frames sent on any of the logical channels, yet
because a different frame can be outstanding on each of the several logical channels,
the sender can keep the link full.
More precisely, the sender keeps 3 bits of state for each channel: a boolean, saying
whether the channel is currently busy; the 1-bit sequence number to use the next time
a frame is sent on this logical channel; and the next sequence number to expect on
a frame that arrives on this channel. When the node has a frame to send, it uses the
lowest idle channel, and otherwise it behaves just like stop-and-wait.
In practice, the ARPANET supported 8 logical channels over each ground link
and 16 over each satellite link. In the ground-link case, the header for each frame
included a 3-bit channel number and a 1-bit sequence number, for a total of 4 bits.
This is exactly the number of bits the sliding window protocol requires to support up
to eight outstanding frames on the link when RWS = SWS.
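The per-channel state and the "lowest idle channel" rule might be modeled as follows. This is a minimal sketch rather than the ARPANET implementation: the `Channel` type and the functions `channel_acquire` and `channel_on_ack` are hypothetical names, and timeouts and retransmission are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CHANNELS 8   /* ground-link case from the text */

/* The 3 bits of per-channel state described above. */
typedef struct {
    bool     busy;          /* a frame is outstanding on this channel */
    unsigned send_seq : 1;  /* 1-bit seqno for the next frame sent */
    unsigned recv_seq : 1;  /* 1-bit seqno expected on the next arrival */
} Channel;

/* Claim the lowest idle channel for a new frame. Returns the channel
 * number to place in the frame header, or -1 if all channels are busy. */
static int channel_acquire(Channel ch[NUM_CHANNELS])
{
    for (int i = 0; i < NUM_CHANNELS; i++) {
        if (!ch[i].busy) {
            ch[i].busy = true;
            return i;
        }
    }
    return -1;
}

/* Handle an ACK for channel i, as in stop-and-wait: accept it only if it
 * matches the outstanding frame, then flip the sequence bit and go idle. */
static bool channel_on_ack(Channel ch[NUM_CHANNELS], int i, unsigned ack_seq)
{
    assert(i >= 0 && i < NUM_CHANNELS);
    if (!ch[i].busy || ack_seq != ch[i].send_seq)
        return false;                 /* stale or unexpected ACK */
    ch[i].send_seq ^= 1u;
    ch[i].busy = false;
    return true;
}
```

Because each channel is acknowledged independently, a frame sent later on channel 1 can be acknowledged before an earlier frame on channel 0, which is why the scheme does not preserve frame order.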
2.6 Ethernet (802.3)
The Ethernet is easily the most successful local area networking technology of the last
20 years. Developed in the mid-1970s by researchers at the Xerox Palo Alto Research
Center (PARC), the Ethernet is a working example of the more general Carrier Sense
Multiple Access with Collision Detect (CSMA/CD) local area network technology.
As indicated by the CSMA name, the Ethernet is a multiple-access network,
meaning that a set of nodes send and receive frames over a shared link. You can,
therefore, think of an Ethernet as being like a bus that has multiple stations plugged
into it. The “carrier sense” in CSMA/CD means that all the nodes can distinguish
between an idle and a busy link, and “collision detect” means that a node listens as
it transmits and can therefore detect when a frame it is transmitting has interfered
(collided) with a frame transmitted by another node.
The Ethernet has its roots in an early packet radio network, called Aloha, developed at the University of Hawaii to support computer communication across the
Hawaiian Islands. As with the Aloha network, the fundamental problem faced by the
Ethernet is how to mediate access to a shared medium fairly and efficiently (in Aloha
the medium was the atmosphere, while in Ethernet the medium is a coax cable). That
is, the core idea in both Aloha and the Ethernet is an algorithm that controls when
each node can transmit.
Digital Equipment Corporation and Intel Corporation joined Xerox to define a
10-Mbps Ethernet standard in 1978. This standard then formed the basis for IEEE
standard 802.3. With one exception that we will see in Section 2.6.2, it is fair to view the
1978 Ethernet standard as a proper subset of the 802.3 standard; 802.3 additionally
defines a much wider collection of physical media over which Ethernet can operate,
and more recently, it has been extended to include a 100-Mbps version called Fast
Ethernet and a 1000-Mbps version called Gigabit Ethernet. The rest of this section
focuses on 10-Mbps Ethernet, since it is typically used in multiple-access mode and
we are interested in how multiple hosts share a single link. Both 100-Mbps and 1000-Mbps Ethernets are designed to be used in full-duplex, point-to-point configurations,
which means that they are typically used in switched networks, as described in the
next chapter.
Figure 2.24 Ethernet transceiver and adaptor.
Physical Properties
An Ethernet segment is implemented on a coaxial cable of up to 500 m. This cable
is similar to the type used for cable TV, except that it typically has an impedance
of 50 ohms instead of cable TV’s 75 ohms. Hosts connect to an Ethernet segment by
tapping into it; taps must be at least 2.5 m apart. A transceiver—a small device directly
attached to the tap—detects when the line is idle and drives the signal when the host
is transmitting. It also receives incoming signals. The transceiver is, in turn, connected
to an Ethernet adaptor, which is plugged into the host. All the logic that makes up the
Ethernet protocol, as described in this section, is implemented in the adaptor (not the
transceiver). This configuration is shown in Figure 2.24.
Multiple Ethernet segments can be joined together by repeaters. A repeater is a device that forwards digital signals, much like an amplifier forwards analog signals. However, no more than four repeaters may be positioned between any pair of hosts, meaning
that an Ethernet has a total reach of only 2500 m (four repeaters join at most five 500-m segments). For example, using just two repeaters between any pair of hosts supports a configuration similar to the one illustrated
in Figure 2.25, that is, a segment running down the spine of a building with a segment
on each floor. All told, an Ethernet is limited to supporting a maximum of 1024 hosts.
Figure 2.25 Ethernet repeater.
Any signal placed on the Ethernet by a host is broadcast over the entire network;
that is, the signal is propagated in both directions, and repeaters forward the signal
on all outgoing segments. Terminators attached to the end of each segment absorb
the signal and keep it from bouncing back and interfering with trailing signals. The
Ethernet uses the Manchester encoding scheme described in Section 2.2.
In addition to the system of segments and repeaters just described, alternative
technologies have been introduced over the years. For example, rather than using a
50-ohm coax cable, an Ethernet can be constructed from a thinner cable known as
10Base2; the original cable is called 10Base5 (the two cables are commonly called thin-net and thick-net, respectively). The “10” in 10Base2 means that the network operates
at 10 Mbps, “Base” refers to the fact that the cable is used in a baseband system, and
the “2” means that a given segment can be no longer than 200 m (a segment of the
original 10Base5 cable can be up to 500 m long). Today, a third cable technology is
predominantly used, called 10BaseT, where the “T” stands for twisted pair. Typically,
Category 5 twisted pair wiring is used. A 10BaseT segment is usually limited to under
100 m in length. (Both 100-Mbps and 1000-Mbps Ethernets also run over Category
5 twisted pair, up to distances of 100 m.)
Figure 2.26 Ethernet hub.
Because the cable is so thin, you do not tap into a 10Base2 or 10BaseT cable in the
same way as you would with 10Base5 cable. With 10Base2, a T-joint is spliced into the
cable. In effect, 10Base2 is used to daisy-chain a set of hosts together. With 10BaseT,
the common configuration is to have several point-to-point segments coming out of
a multiway repeater, sometimes called a hub, as illustrated in Figure 2.26. Multiple
100-Mbps Ethernet segments can also be connected by a hub, but the same is not true
of 1000-Mbps segments.
It is important to understand that whether a given Ethernet spans a single segment, a linear sequence of segments connected by repeaters, or multiple segments
connected in a star configuration by a hub, data transmitted by any one host on that
Ethernet reaches all the other hosts. This is the good news. The bad news is that all
these hosts are competing for access to the same link, and as a consequence, they are
said to be in the same collision domain.
Access Protocol
We now turn our attention to the algorithm that controls access to the shared Ethernet
link. This algorithm is commonly called the Ethernet’s media access control (MAC).
It is typically implemented in hardware on the network adaptor. We will not describe
the hardware per se, but instead focus on the algorithm it implements. First, however,
we describe the Ethernet’s frame format and addresses.
Frame Format
Each Ethernet frame is defined by the format given in Figure 2.27. The 64-bit preamble
allows the receiver to synchronize with the signal; it is a sequence of alternating 0s
and 1s. Both the source and destination hosts are identified with a 48-bit address.
The packet type field serves as the demultiplexing key; that is, it identifies to which
of possibly many higher-level protocols this frame should be delivered. Each frame
contains up to 1500 bytes of data. Minimally, a frame must contain at least 46 bytes of
data, even if this means the host has to pad the frame before transmitting it. The reason
for this minimum frame size is that the frame must be long enough to detect a collision;
we discuss this more below. Finally, each frame includes a 32-bit CRC. Like the HDLC
protocol described in Section 2.3.2, the Ethernet is a bit-oriented framing protocol.
Note that from the host’s perspective, an Ethernet frame has a 14-byte header: two
6-byte addresses and a 2-byte type field. The sending adaptor attaches the preamble,
CRC, and postamble before transmitting, and the receiving adaptor removes them.
Figure 2.27 Ethernet frame format.
The frame format just described is taken from the Digital-Intel-Xerox Ethernet
standard. The 802.3 frame format is exactly the same, except it substitutes a 16-bit
length field for the 16-bit type field. 802.3 is usually paired with an encapsulation
standard that defines a type field used to demultiplex incoming frames. This type field
is the first thing in the data portion of the 802.3 frames; that is, it immediately follows
the 802.3 header. Fortunately, since the Ethernet standard has avoided using any type
values less than 1500 (the maximum length found in an 802.3 header), and the type and
length fields are in the same location in the header, it is possible for a single device to
accept both formats, and for the device driver running on the host to interpret the last
16 bits of the header as either a type or a length. In practice, most hosts follow the
Digital-Intel-Xerox format and interpret this field as the frame’s type.
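A device driver's decision between the two interpretations could look something like the sketch below. It follows the rule as stated in the text (values up to 1500 are lengths, anything larger is a type); the names `classify_type_or_length`, `FieldKind`, and `MAX_8023_LENGTH` are hypothetical, and the header is assumed to be the 14-byte host view with the 16-bit field in its last two bytes, most significant byte first.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_8023_LENGTH 1500   /* maximum length found in an 802.3 header */

typedef enum { FIELD_IS_LENGTH, FIELD_IS_TYPE } FieldKind;

/* Read the last 16 bits of the 14-byte header (bytes 12 and 13), stored
 * most significant byte first, and decide how to interpret them. */
static FieldKind classify_type_or_length(const uint8_t header[14],
                                         uint16_t *value)
{
    assert(header != 0 && value != 0);
    *value = (uint16_t)((header[12] << 8) | header[13]);
    return (*value <= MAX_8023_LENGTH) ? FIELD_IS_LENGTH : FIELD_IS_TYPE;
}
```

For example, a field holding 0x0800 (the type value assigned to IP) is well above 1500 and is therefore read as a type, while a field holding 46 is read as a length.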
Each host on an Ethernet—in fact, every Ethernet host in the world—has a unique
Ethernet address. Technically, the address belongs to the adaptor, not the host; it is
usually burned into ROM. Ethernet addresses are typically printed in a form humans
can read as a sequence of six numbers separated by colons. Each number corresponds
to 1 byte of the 6-byte address and is given by a pair of hexadecimal digits, one for each
of the 4-bit nibbles in the byte; leading 0s are dropped. For example, 8:0:2b:e4:b1:2 is
the human-readable representation of Ethernet address
00001000 00000000 00101011 11100100 10110001 00000010
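The human-readable form described here maps directly onto printf-style formatting, since the `%x` conversion prints a byte in hexadecimal without leading 0s. The function name `eth_addr_to_string` is a hypothetical one chosen for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ETH_ADDR_LEN 6

/* Format a 6-byte Ethernet address as colon-separated hexadecimal with
 * leading 0s dropped. buf must hold at least 18 bytes: six fields of up
 * to two digits, five colons, and the terminating NUL. */
static void eth_addr_to_string(const unsigned char addr[ETH_ADDR_LEN],
                               char *buf, size_t buflen)
{
    assert(buf != 0 && buflen >= 18);
    snprintf(buf, buflen, "%x:%x:%x:%x:%x:%x",
             addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]);
}
```

Applied to the six bytes shown above (0x08, 0x00, 0x2b, 0xe4, 0xb1, 0x02), this produces the string 8:0:2b:e4:b1:2.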
To ensure that every adaptor gets a unique address, each manufacturer of Ethernet devices is allocated a different prefix that must be prepended to the address on
every adaptor they build. For example, Advanced Micro Devices has been assigned the
24-bit prefix x080020 (or 8:0:20). A given manufacturer then makes sure the address
suffixes it produces are unique.
Each frame transmitted on an Ethernet is received by every adaptor connected
to that Ethernet. Each adaptor recognizes those frames addressed to its address and
passes only those frames on to the host. (An adaptor can also be programmed to run
in promiscuous mode, in which case it delivers all received frames to the host, but this
is not the normal mode.) In addition to these unicast addresses, an Ethernet address
consisting of all 1s is treated as a broadcast address; all adaptors pass frames addressed
to the broadcast address up to the host. Similarly, an address that has the first bit set
to 1 but is not the broadcast address is called a multicast address. A given host can
program its adaptor to accept some set of multicast addresses. Multicast addresses are
used to send messages to some subset of the hosts on an Ethernet (e.g., all file servers).
To summarize, an Ethernet adaptor receives all frames and accepts
■ frames addressed to its own address
■ frames addressed to the broadcast address
■ frames addressed to a multicast address, if it has been instructed to listen to
that address
■ all frames, if it has been placed in promiscuous mode
It passes to the host only the frames that it accepts.
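The four acceptance rules translate into a short filter function. The sketch below is illustrative only: the `Adaptor` structure and the function names are hypothetical, and it assumes that the "first bit" of an address means the first bit transmitted on the wire, which is the least significant bit of the first byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ETH_ADDR_LEN 6
#define MAX_MCAST    8   /* illustrative limit on subscribed groups */

/* Hypothetical adaptor configuration. */
typedef struct {
    unsigned char addr[ETH_ADDR_LEN];              /* adaptor's own address */
    unsigned char mcast[MAX_MCAST][ETH_ADDR_LEN];  /* multicast addresses to accept */
    int           n_mcast;                         /* entries in use in mcast[] */
    bool          promiscuous;
} Adaptor;

static bool is_broadcast(const unsigned char *dst)
{
    for (int i = 0; i < ETH_ADDR_LEN; i++)
        if (dst[i] != 0xff)
            return false;
    return true;
}

/* Assumption: the "first bit" is the first bit transmitted, i.e., the
 * least significant bit of the first byte of the address. */
static bool is_multicast(const unsigned char *dst)
{
    return (dst[0] & 0x01) != 0 && !is_broadcast(dst);
}

/* Should a frame with destination address dst be passed up to the host? */
static bool adaptor_accepts(const Adaptor *a, const unsigned char *dst)
{
    assert(a->n_mcast >= 0 && a->n_mcast <= MAX_MCAST);

    if (a->promiscuous)
        return true;                               /* all frames */
    if (memcmp(dst, a->addr, ETH_ADDR_LEN) == 0)
        return true;                               /* its own address */
    if (is_broadcast(dst))
        return true;                               /* broadcast address */
    if (is_multicast(dst)) {
        for (int i = 0; i < a->n_mcast; i++)
            if (memcmp(dst, a->mcast[i], ETH_ADDR_LEN) == 0)
                return true;                       /* subscribed multicast */
    }
    return false;
}
```

Checking promiscuous mode first keeps the common path short; the remaining tests mirror the order of the list above.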
Transmitter Algorithm
As we have just seen, the receiver side of the Ethernet protocol is simple; the real smarts
are implemented at the sender’s side. The transmitter algorithm is defined as follows.
When the adaptor has a frame to send and the line is idle, it transmits the frame
immediately; there is no negotiation with the other adaptors. The upper bound of
1500 bytes in the message means that the adaptor can occupy the line for only a fixed
length of time.
When an adaptor has a frame to send and the line is busy, it waits for the line to go
idle and then transmits immediately. (To be more precise, all adaptors wait 9.6 μs after the end of one frame before beginning to transmit the next
frame. This is true for the sender of the first frame, as well as those nodes listening for the line to become idle.) The Ethernet is said to be a 1-persistent protocol
because an adaptor with a frame to send transmits with probability 1 whenever a busy
line goes idle. In general, a p-persistent algorithm transmits with probability 0 ≤ p ≤ 1
after a line becomes idle, and defers with probability q = 1 − p. The reasoning behind
choosing a p < 1 is that there might be multiple adaptors waiting for the busy line
to become idle, and we don’t want all of them to begin transmitting at the same time.
If each adaptor transmits immediately with a probability of, say, 33%, then up to
three adaptors can be waiting to transmit and the odds are that only one will begin
transmitting when the line becomes idle. Despite this reasoning, an Ethernet adaptor
always transmits immediately after noticing that the network has become idle and has
been very effective in doing so.
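To put numbers on the 33% example, the sketch below computes the probability that exactly k of n waiting adaptors transmit when the line goes idle. It assumes the adaptors decide independently, which gives a binomial distribution; the function name `prob_exactly_k_transmit` is hypothetical.

```c
#include <assert.h>

/* Probability that exactly k of n waiting adaptors transmit when each
 * one independently transmits with probability p (binomial distribution). */
static double prob_exactly_k_transmit(int n, int k, double p)
{
    assert(n >= 0 && k >= 0 && k <= n);

    double result = 1.0;

    /* Binomial coefficient C(n, k), built up incrementally. */
    for (int i = 1; i <= k; i++)
        result = result * (n - k + i) / i;

    for (int i = 0; i < k; i++)
        result *= p;             /* k adaptors transmit */
    for (int i = 0; i < n - k; i++)
        result *= 1.0 - p;       /* the other n - k defer */

    return result;
}
```

With three waiting adaptors and p = 1/3, exactly one transmits with probability 4/9 (about 0.44), none transmits with probability 8/27 (about 0.30), and two or more collide with probability 7/27 (about 0.26). A lone transmitter is thus the single most likely outcome, though not a certainty. With p = 1, as in the Ethernet, all three transmit and collide with probability 1.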
To complete the story about p-persistent protocols for the case when p < 1, you
might wonder how long a sender that loses the coin flip (i.e., decides to defer) has
to wait before it can transmit. The answer for the Aloha network, which originally
developed this style of protocol, was to divide time into discrete slots, with each slot
corresponding to the length of time it takes to transmit a full frame. Whenever a node
has a frame to send and it senses an empty (idle) slot, it transmits with probability p
and defers until the next slot with probability q = 1 − p. If that next slot is also empty,
the node again decides to transmit or defer, with probabilities p and q, respectively. If
that next slot is not empty—that is, some other station has decided to transmit—then
the node simply waits for the next idle slot and the algorithm repeats.
Returning to our discussion of the Ethernet, because there is no centralized control it is possible for two (or more) adaptors to begin transmitting at the same time,
either because both found the line to be idle or because both had been waiting for a
busy line to become idle. When this happens, the two (or more) frames are said to
collide on the network. Each sender, because the Ethernet supports collision detection, is able to determine that a collision is in progress. At the moment an adaptor
detects that its frame is colliding with another, it first makes sure to transmit a 32-bit
jamming sequence and then stops the transmission. Thus, a transmitter will minimally send 96 bits in the case of a collision: 64-bit preamble plus 32-bit jamming
sequence.
One way that an adaptor will send only 96 bits, which is sometimes called
a runt frame, is if the two hosts are close to each other. Had the two hosts been
farther apart, they would have had to transmit longer, and thus send more bits, before
detecting the collision. In fact, the worst-case scenario happens when the two hosts are
at opposite ends of the Ethernet. To know for sure that the frame it just sent did not
collide with another frame, the transmitter may need to send as many as 512 bits. Not
coincidentally, every Ethernet frame must be at least 512 bits (64 bytes) long: 14 bytes
of header plus 46 bytes of data plus 4 bytes of CRC.
Why 512 bits? The answer is related to another question you might ask about
an Ethernet: Why is its length limited to only 2500 m? Why not 10 or 1000 km? The
answer to both questions has to do with the fact that the farther apart two nodes
are, the longer it takes for a frame sent by one to reach the other, and the network is
vulnerable to a collision during this time.
Figure 2.28 Worst-case scenario: (a) A sends a frame at time t; (b) A’s frame arrives at
B at time t + d; (c) B begins transmitting at time t + d and collides with A’s frame; (d) B’s
runt (32-bit) frame arrives at A at time t + 2d.
Figure 2.28 illustrates the worst-case scenario, where hosts A and B are at opposite ends of the network. Suppose host A begins transmitting a frame at time t, as
shown in (a). It takes it one link latency (let’s denote the latency as d) for the frame to
reach host B. Thus, the first bit of A’s frame arrives at B at time t + d, as shown in (b).
Suppose an instant before host A’s frame arrives (i.e., B still sees an idle line), host B
begins to transmit its own frame. B’s frame will immediately collide with A’s frame,
and this collision will be detected by host B (c). Host B will send the 32-bit jamming
sequence, as described above. (B’s frame will be a runt.) Unfortunately, host A will not
know that the collision occurred until B’s frame reaches it, which will happen one link
latency later, at time t + 2 × d, as shown in (d). Host A must continue to transmit until
this time in order to detect the collision. In other words, host A must transmit for 2 × d
to be sure that it detects all possible collisions. Considering that a maximally configured Ethernet is 2500 m long, and that there may be up to four repeaters between
any two hosts, the round-trip delay has been determined to be 51.2 μs, which on
a 10-Mbps Ethernet corresponds to 512 bits. The other way to look at this situation is that we need to limit the Ethernet’s maximum latency to a fairly small value
(e.g., 51.2 μs) for the access algorithm to work; hence, an Ethernet’s maximum length
must be something on the order of 2500 m.
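The arithmetic linking the round-trip delay to the minimum frame size can be written out as a short sketch. The constant and function names are illustrative assumptions; the numbers are the ones quoted in the text.

```python
RATE_MBPS = 10        # classic Ethernet data rate
ROUND_TRIP_US = 51.2  # worst-case round-trip delay quoted in the text


def bits_sent_during(rate_mbps: float, interval_us: float) -> int:
    """Bits a sender emits in interval_us at rate_mbps (Mbps x microseconds = bits)."""
    return round(rate_mbps * interval_us)


# Minimum frame: 14-byte header + 46 bytes of data + 4-byte CRC.
MIN_FRAME_BYTES = 14 + 46 + 4

print(bits_sent_during(RATE_MBPS, ROUND_TRIP_US))  # bits sent in one round trip
print(MIN_FRAME_BYTES * 8)                         # minimum frame size in bits
```

Both values come out to 512 bits, which is why a sender that transmits a minimum-size frame is still transmitting when news of the farthest possible collision reaches it.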
Once an adaptor has detected a collision and stopped its transmission, it waits
a certain amount of time and tries again. Each time it tries to transmit but fails,
the adaptor doubles the amount of time it waits before trying again. This strategy of
doubling the delay interval between each retransmission attempt is a general technique
known as exponential backoff. More precisely, the adaptor first delays either 0 or
51.2 μs, selected at random. If this effort fails, it then waits 0, 51.2, 102.4, or 153.6 μs
(selected randomly) before trying again; this is k × 51.2 for k = 0..3. After the third
collision, it waits k × 51.2 for k = 0..2^3 − 1, again selected at random. In general, the
algorithm randomly selects a k between 0 and 2^n − 1 and waits k × 51.2 μs, where
n is the number of collisions experienced so far. The adaptor gives up after a given
number of tries and reports a transmit error to the host. Adaptors typically retry up
to 16 times, although the backoff algorithm caps n in the above formula at 10.
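The backoff rule can be sketched in a few lines. This is a minimal illustration and not adaptor firmware: the function name is hypothetical, and treating the sixteenth failure as the point at which the adaptor gives up is one reading of "retry up to 16 times."

```python
import random

SLOT_US = 51.2     # backoff unit from the text
MAX_ATTEMPTS = 16  # assumption: give up once this many attempts have failed
BACKOFF_CAP = 10   # n is capped at 10 in the formula


def backoff_delay_us(collisions, rng=random):
    """Return the delay in microseconds before the next attempt.

    collisions is the number of collisions experienced so far (>= 1).
    Returns None when the adaptor should give up and report an error.
    """
    if collisions < 1:
        raise ValueError("backoff applies only after at least one collision")
    if collisions >= MAX_ATTEMPTS:
        return None
    n = min(collisions, BACKOFF_CAP)
    k = rng.randrange(2 ** n)  # uniform over 0 .. 2^n - 1
    return k * SLOT_US


rng = random.Random(42)  # seeded only so that the demo is repeatable
for c in (1, 2, 3, 12):
    print(c, backoff_delay_us(c, rng))
```

Passing the random source as a parameter keeps the sketch easy to test; a real adaptor would draw k from its own hardware or driver source of randomness.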
Experience with Ethernet
Because Ethernets have been around for so many years and are so popular, we have a
great deal of experience in using them. One of the most important observations people
have made about Ethernets is that they work best under lightly loaded conditions. This
is because under heavy loads—typically, a utilization of over 30% is considered heavy
on an Ethernet—too much of the network’s capacity is wasted by collisions.
Fortunately, most Ethernets are used in a far more conservative way than the
standard allows. For example, most Ethernets have fewer than 200 hosts connected to
them, which is far fewer than the maximum of 1024. (See if you can discover a reason
for this upper limit of around 200 hosts in Chapter 4.) Similarly, most Ethernets are
far shorter than 2500 m, with a round-trip delay of closer to 5 μs than 51.2 μs.
Another factor that makes Ethernets practical is that, even though Ethernet adaptors
do not implement link-level flow control, the hosts typically provide an end-to-end
flow-control mechanism. As a result, it is rare to find situations in which any one host
is continuously pumping frames onto the network.
Finally, it is worth saying a few words about why Ethernets have been so successful, so that we can understand the properties we should emulate with any LAN
technology that tries to replace it. First, an Ethernet is extremely easy to administer
and maintain: There are no switches that can fail, no routing or configuration tables
that have to be kept up-to-date, and it is easy to add a new host to the network. It is
hard to imagine a simpler network to administer. Second, it is inexpensive: Cable is
cheap, and the only other cost is the network adaptor on each host. Any switch-based
approach will involve an investment in some relatively expensive infrastructure (the
switches), in addition to the incremental cost of each adaptor. As we will see in the
next chapter, the most successful LAN switching technology in use today is itself based
on Ethernet.
2.7 Token Rings (802.5, FDDI)
Alongside the Ethernet, token rings are the other significant class of shared-media
network. There are more different types of token rings than there are types of Ethernets;
this section will discuss the type that was for years the most prevalent, known as the
IBM Token Ring. Like the Xerox Ethernet, IBM’s Token Ring has a nearly identical
IEEE standard, known as 802.5. Where necessary, we note the differences between the
IBM and 802.5 token rings.
Most of the general principles of token ring networks can be understood once
the IBM and 802.5 standards have been discussed. However, the FDDI (Fiber Distributed Data Interface) standard—a newer, faster type of token ring—warrants some
discussion, which we provide at the end of this section. At the time of writing, yet
another token ring standard, called Resilient Packet Ring or 802.17, is nearing completion.
As the name suggests, a token ring network consists of a set of nodes connected
in a ring (see Figure 2.29). Data always flows in a particular direction around the ring,
with each node receiving frames from its upstream neighbor and then forwarding them
to its downstream neighbor. This ring-based topology is in contrast to the Ethernet’s
bus topology. Like the Ethernet, however, the ring is viewed as a single shared medium;
it does not behave as a collection of independent point-to-point links that just happen
Figure 2.29 Token ring network.
to be configured in a loop. Thus, a token ring shares two key features with an Ethernet:
First, it involves a distributed algorithm that controls when each node is allowed to
transmit, and second, all nodes see all frames, with the node identified in the frame
header as the destination saving a copy of the frame as it flows past.
The word “token” in token ring comes from the way access to the shared
ring is managed. The idea is that a token, which is really just a special sequence
of bits, circulates around the ring; each node receives and then forwards the token.
When a node that has a frame to transmit sees the token, it takes the token off the
ring (i.e., it does not forward the special bit pattern) and instead inserts its frame
into the ring. Each node along the way simply forwards the frame, with the destination node saving a copy and forwarding the message onto the next node on the
ring. When the frame makes its way back around to the sender, this node strips its
frame off the ring (rather than continuing to forward it) and reinserts the token.
In this way, some node downstream will have the opportunity to transmit a frame.
The media access algorithm is fair in the sense that as the token circulates around
the ring, each node gets a chance to transmit. Nodes are serviced in a round-robin fashion.
Physical Properties
One of the first things you might worry about with a ring topology is that any link
or node failure would render the whole network useless. This problem is addressed
by connecting each station into the ring using an electromechanical relay. As long as
the station is healthy, the relay is open and the station is included in the ring. If the
station stops providing power, the relay closes and the ring automatically bypasses the
station. This is illustrated in Figure 2.30.
Figure 2.30 Relay used on a token ring: (a) relay open—host active; (b) relay closed—
host bypassed.
Figure 2.31 Multistation access unit.
Several of these relays are usually packed into a single box, known as a multistation access unit (MSAU). This has the interesting effect of making a token ring
actually look more like a star topology, as shown in Figure 2.31. It also makes it very
easy to add stations to and remove stations from the network, since they can just be
plugged into or unplugged from the nearest MSAU, while the overall wiring of the
network can be left unchanged. One of the small differences between the IBM Token
Ring specification and 802.5 is that the former actually requires the use of MSAUs,
while the latter does not. In practice, MSAUs are almost always used because of the
need for robustness and ease of station addition and removal.
There are a few other physical details to know about 802.5 and IBM Token
Rings. The data rate may be either 4 Mbps or 16 Mbps. The encoding of bits uses
differential Manchester encoding, as described in Section 2.2. IBM Token Rings may
have up to 260 stations per ring, while 802.5 sets the limit at 250. The physical medium
is twisted pair for IBM, but is not specified in 802.5.
Token Ring Media Access Control
It is now time to look a little more closely at how the MAC protocol operates on a
token ring. The network adaptor for a token ring contains a receiver, a transmitter, and
one or more bits of data storage between them. When none of the stations connected
to the ring has anything to send, the token circulates around the ring. Obviously, the
ring has to have enough “storage capacity” to hold an entire token. For example, the
802.5 token is 24 bits long. If every station could hold only 1 bit (as is the norm for
802.5 networks), and the stations were close enough together that the time for a bit to
propagate from one station to another was negligible, we would need to have at least
24 stations on the ring before it would operate correctly. This situation is avoided by
having one designated station, called the monitor, add some additional bits of delay to
the ring if necessary. The operation of the monitor is described in more detail below.
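The storage requirement can be expressed as a small calculation. The sketch below is illustrative only: the function name and the propagation_bits parameter (bits "in flight" on the wires) are assumptions introduced here.

```python
TOKEN_BITS = 24  # length of an 802.5 token


def monitor_delay_bits(stations, bits_per_station=1, propagation_bits=0):
    """Extra bits of delay the monitor must add so the ring can hold a token."""
    capacity = stations * bits_per_station + propagation_bits
    return max(0, TOKEN_BITS - capacity)


print(monitor_delay_bits(10))  # a small ring needs extra delay
print(monitor_delay_bits(24))  # 24 one-bit stations are just enough
```

For a ring of 10 stations holding 1 bit each and negligible propagation delay, the monitor would have to add 14 bits of delay.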
As the token circulates around the ring, any station that has data to send may
“seize” the token, that is, drain it off the ring and begin sending data. In 802.5 networks, the seizing process involves simply modifying 1 bit in the second byte of the token; the
first 2 bytes of the modified token now become the preamble for the subsequent data
packet. Once a station has the token, it is allowed to send one or more packets—exactly
how many more depends on some factors described below.
Each transmitted packet contains the destination address of the intended receiver;
it may also contain a multicast (or broadcast) address if it is intended to reach more
than one (or all) receivers. As the packet flows past each node on the ring, each node
looks inside the packet to see if it is the intended recipient. If so, it copies the packet
into a buffer as it flows through the network adaptor, but it does not remove the packet
from the ring. The sending station has the responsibility of removing the packet from
the ring. For any packet that is longer than the number of bits that can be stored in
the ring, the sending station will be draining the first part of the packet from the ring
while still transmitting the latter part.
One issue we must address is how much data a given node is allowed to transmit
each time it possesses the token, or said another way, how long a given node is allowed
to hold the token. We call this the token holding time (THT). If we assume that most
nodes on the network do not have data to send at any given time—a reasonable
assumption, and certainly one that the Ethernet takes advantage of—then we could
make a case for letting a node that possesses the token transmit as much data as it
has before passing the token on to the next node. This would mean setting the THT
to infinity. It would be silly in this case to limit a node to sending a single message
and to force it to wait until the token circulates all the way around the ring before
getting a chance to send another message. Of course, “as much data as it has” would
be dangerous because a single station could keep the token for an arbitrarily long time,
but we could certainly set the THT to significantly more than the time to send one packet.
It is easy to see that the more bytes a node can send each time it has the token, the
better the utilization of the ring you can achieve in the situation in which only a single
node has data to send. The downside, of course, is that this strategy does not work
well when multiple nodes have data to send—it favors nodes that have a lot of data
to send over nodes that have only a small message to send, even when it is important
to get this small message delivered as soon as possible. The situation is analogous to
finding yourself in line at the bank behind a customer who is taking out a car loan,
even though you simply want to cash a check. In 802.5 networks, the default THT is
10 ms.
There is a little subtlety to the use of the THT. Before putting each packet onto
the ring, the station must check that the amount of time it would take to transmit the
packet would not cause it to exceed the token holding time. This means keeping track
of how long it has already held the token, and looking at the length of the next packet
that it wants to send.
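The check a station performs before each packet might look like the following sketch. The function and parameter names are hypothetical; the 10-ms default is the 802.5 value given above.

```python
DEFAULT_THT_S = 0.010  # default 802.5 token holding time (10 ms)


def may_send(held_s, packet_bits, rate_bps, tht_s=DEFAULT_THT_S):
    """True if sending the packet would not push the hold time past the THT.

    held_s is how long the station has already held the token, in seconds.
    """
    transmit_time_s = packet_bits / rate_bps
    return held_s + transmit_time_s <= tht_s


# A 4500-byte packet on a 4-Mbps ring takes 9 ms to transmit.
print(may_send(0.0, 4500 * 8, 4_000_000))    # fits within 10 ms
print(may_send(0.002, 4500 * 8, 4_000_000))  # would end at 11 ms, so it must wait
```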
From the token holding time we can derive another useful quantity, the token
rotation time (TRT), which is the amount of time it takes a token to traverse the ring
as viewed by a given node. It is easy to see that
TRT ≤ ActiveNodes × THT + RingLatency
where RingLatency denotes how long it takes the token to circulate around the ring
when no one has data to send, and ActiveNodes denotes the number of nodes that
have data to transmit.
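As a worked instance of this bound, the sketch below uses integer microseconds so that the arithmetic is exact. The ring size and latency are made-up example values, not figures from the standard.

```python
def max_trt_us(active_nodes, tht_us, ring_latency_us):
    """Upper bound on token rotation time: ActiveNodes x THT + RingLatency."""
    return active_nodes * tht_us + ring_latency_us


# Hypothetical ring: 5 active nodes, 10-ms THT, 100-microsecond ring latency.
print(max_trt_us(5, 10_000, 100))
# With no active nodes the token simply circulates at the ring latency.
print(max_trt_us(0, 10_000, 100))
```

In this example a node would wait at most 50.1 ms between successive sightings of the token.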
The 802.5 protocol provides a form of reliable delivery using 2 bits in the packet
trailer, the A and C bits. These are both 0 initially. When a station sees a frame for
which it is the intended recipient, it sets the A bit in the frame. When it copies the frame
into its adaptor, it sets the C bit. If the sending station sees the frame come back over
the ring with the A bit still 0, it knows that the intended recipient is absent or not
functioning. If the A bit is set but not the C bit, this implies that for some reason (e.g.,
lack of buffer space) the destination could not accept the frame. Thus, the frame might
reasonably be retransmitted later in the hope that buffer space had become available.
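The sender's interpretation of the returning A and C bits can be summarized as a small decision function. This is a sketch with hypothetical names; the combination A = 0, C = 1 is not discussed in the text and is treated here like any other frame with A = 0.

```python
def delivery_status(a_bit, c_bit):
    """Interpret the A and C bits of a frame that has returned to its sender."""
    if not a_bit:
        # No station recognized the destination address.
        return "recipient absent or not functioning"
    if not c_bit:
        # Recognized but not copied (e.g., no buffer space): retry later.
        return "recipient present, frame not copied"
    return "frame copied by recipient"


for a, c in ((0, 0), (1, 0), (1, 1)):
    print(a, c, delivery_status(a, c))
```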
Another detail of the 802.5 protocol concerns the support of different levels of
priority. The token contains a 3-bit priority field, so we can think of the token having
a certain priority n at any time. Each device that wants to send a packet assigns a
priority to that packet, and the device can only seize the token to transmit a packet if
the packet’s priority is at least as great as the token’s. The priority of the token changes
over time due to the use of three reservation bits in the frame header. For example, a
station X waiting to send a priority n packet may set these bits to n if it sees a data
frame going past and the bits have not already been set to a higher value. This causes
the station that currently holds the token to elevate its priority to n when it releases it.
Station X is responsible for lowering the token priority to its old value when it is done.
Note that this is a strict priority scheme, in the sense that no lower-priority packets get sent when higher-priority packets are waiting. This may cause lower-priority
packets to be locked out of the ring for extended periods if there is a sufficient supply
of high-priority packets.
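The two priority rules, seizing and reserving, can be sketched as follows. The function names are hypothetical, and priorities are assumed to be the integers 0 through 7 that fit in a 3-bit field.

```python
def can_seize(packet_priority, token_priority):
    """A station may seize the token only for a packet of at least its priority."""
    return packet_priority >= token_priority


def reserve(reservation_bits, wanted_priority):
    """New value of the reservation bits after a station requests wanted_priority.

    The bits are left alone if they already hold a higher value.
    """
    return max(reservation_bits, wanted_priority)


print(can_seize(3, 3), can_seize(2, 3))
print(reserve(0, 4), reserve(6, 4))
```

The second pair of calls shows that a priority-4 reservation replaces an empty reservation but does not displace an existing priority-6 one.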
Figure 2.32 Token release: (a) early versus (b) delayed.
One final issue will complete our discussion of the MAC protocol, which is the
matter of exactly when the sending node releases the token. As illustrated in Figure
2.32, the sender can insert the token back onto the ring immediately following its
frame (this is called early release) or after the frame it transmits has gone all the way
around the ring and been removed (this is called delayed release). Clearly, early release
allows better bandwidth utilization, especially on large rings. 802.5 originally used
delayed token release, but support for early release was subsequently added.
Token Ring Maintenance
As we noted above, token rings have a designated monitor station. The monitor’s job
is to ensure the health of the ring. Any station on the ring can become the monitor,
and there are defined procedures by which the monitor is elected when the ring is first
connected or on the failure of the current monitor. A healthy monitor periodically
announces its presence with a special control message; if a station fails to see such a
message for some period of time, it will assume that the monitor has failed and will try
to become the monitor. The procedures for electing a monitor are the same whether
the ring has just come up or the active monitor has just failed.
When a station decides that a new monitor is needed, it transmits a “claim token”
frame, announcing its intent to become the new monitor. If that token circulates back
to the sender, it can assume that it is OK for it to become the monitor. If some other
station is also trying to become the monitor at the same instant, the sender might see
a claim token message from that other station first. In this case, it will be necessary to
break the tie using some well-defined rule like “highest address wins.”
Once the monitor is agreed upon, it plays a number of roles. We have already
seen that it may need to insert additional delay into the ring. It is also responsible
for making sure that there is always a token somewhere in the ring, either circulating
or currently held by a station. It should be clear that a token may vanish for several
reasons, such as a bit error, or a crash on the part of a station that was holding it. To
detect a missing token, the monitor watches for a passing token and maintains a timer
equal to the maximum possible token rotation time. This interval equals
NumStations × THT + RingLatency
where NumStations is the number of stations on the ring, and RingLatency is the total
propagation delay of the ring. If the timer expires without the monitor seeing a token,
it creates a new one.
The monitor also checks for corrupted or orphaned frames. The former have
checksum errors or invalid formats, and without monitor intervention, they could
circulate forever on the ring. The monitor drains them off the ring before reinserting
the token. An orphaned frame is one that was transmitted correctly onto the ring but
whose “parent” died; that is, the sending station went down before it could remove
the frame from the ring. These are detected using another header bit, the “monitor”
bit. This is 0 on transmission and set to 1 the first time the packet passes the monitor.
If the monitor sees a packet with this bit set, it knows the packet is going by for the
second time and it drains the packet off the ring.
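A minimal sketch of the orphan check follows. The Frame class and function name are hypothetical stand-ins; a real adaptor operates on bits in the frame as they stream past.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    payload: bytes
    monitor_bit: int = 0  # 0 on transmission


def monitor_sees(frame):
    """Return the frame to forward, or None if the monitor drains it."""
    if frame.monitor_bit == 1:
        # Second pass by the monitor: the sender never removed the frame.
        return None
    frame.monitor_bit = 1  # mark the first pass
    return frame


f = Frame(b"data")
print(monitor_sees(f) is f)  # first pass: forwarded with the bit set
print(monitor_sees(f))       # second pass: drained
```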
One additional ring maintenance function is the detection of dead stations. The
relays in the MSAU can automatically bypass a station that has been disconnected
or powered down, but may not detect more subtle failures. If any station suspects a
failure on the ring, it can send a beacon frame to the suspect destination. Based on
how far this frame gets, the status of the ring can be established, and malfunctioning
stations can be bypassed by the relays in the MSAU.
Frame Format
We are now ready to define the 802.5 frame format, which is depicted in Figure 2.33.
As noted above, 802.5 uses differential Manchester encoding. This fact is used by the
frame format, which uses “illegal” Manchester codes in the start and end delimiters.
Figure 2.33 802.5/token ring frame format.
After the start delimiter comes the access control byte, which includes the frame priority
and the reservation priority mentioned above. The frame control byte is a demux key
that identifies the higher-layer protocol.
Similar to the Ethernet, 802.5 addresses are 48 bits long. The standard actually
allows for smaller 16-bit addresses, but 48-bit addresses are typically used. When 48-bit addresses are used, they are interpreted in exactly the same way as on an Ethernet.
The frame also includes a 32-bit CRC. This is followed by the frame status byte, which
includes the A and C bits for reliable delivery.
FDDI
In many respects, FDDI is similar to 802.5 and IBM Token Rings. However, there
are significant differences—some arising because it runs on fiber, not copper, and some
arising from innovations that were made subsequent to the invention of the IBM Token
Ring. We discuss some of the significant differences below.
Physical Properties
Unlike 802.5 networks, an FDDI network consists of a dual ring—two independent
rings that transmit data in opposite directions, as illustrated in Figure 2.34(a). The
second ring is not used during normal operation but instead comes into play only if
the primary ring fails, as depicted in Figure 2.34(b). That is, the ring loops back on
the secondary fiber to form a complete ring, and as a consequence, an FDDI network
is able to tolerate a single break in the cable or the failure of one station.
Because of the expense of the dual-ring configuration, FDDI allows nodes to
attach to the network by means of a single cable. Such nodes are called single
Figure 2.34 Dual-fiber ring: (a) normal operation; (b) failure of the primary ring.
Figure 2.35 SASs connected to a concentrator.
attachment stations (SAS); their dual-connected counterparts are called, not surprisingly, dual attachment stations (DAS). A concentrator is used to attach several SASs
to the dual ring, as illustrated in Figure 2.35. Notice how the single-cable (two-fiber)
connection into an SAS forms a connected piece of the ring. Should this SAS fail, the
concentrator detects this situation and uses an optical bypass to isolate the failed SAS,
thereby keeping the ring connected. This is analogous to the relays inside MSAUs used
in 802.5 rings. Note that in this illustration, the second (backup) ring is denoted with
a dotted line.
As in 802.5, each network adaptor holds some number of bits between its input
and output interfaces. Unlike 802.5, however, the buffer can be of different sizes in
different stations, although never less than 9 bits nor more than 80 bits. It is also
possible for a station to start transmitting bits out of this buffer before it is full. Of
course, the total time it takes for a token to pass around the network is a function of
the size of these buffers. For example, because FDDI is a 100-Mbps network, it has
a 10-nanosecond (ns) bit time (each bit is 10 ns wide). If each station implements a
10-bit buffer and waits for the buffer to be half full before starting to transmit, then
each station introduces a 5 × 10 ns = 50-ns delay into the total ring rotation time.
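The per-station delay in this example can be reproduced with a short calculation. The function name is hypothetical, and the 20-station total is only an illustrative extension.

```python
def station_delay_ns(rate_mbps, buffered_bits):
    """Delay a station adds when it holds buffered_bits before transmitting."""
    bit_time_ns = 1000 / rate_mbps  # 100 Mbps gives a 10-ns bit time
    return buffered_bits * bit_time_ns


# A 10-bit buffer that starts transmitting when half full holds 5 bits.
per_station = station_delay_ns(100, 5)
print(per_station)       # delay per station, in ns
print(20 * per_station)  # total buffering delay on a hypothetical 20-station ring
```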
FDDI has other physical characteristics. For example, the standard limits a single
network to at most 500 stations (hosts), with a maximum distance of 2 km between
any pair of stations. Overall, the network is limited to a total of 200 km of fiber, which
means that, because of the dual nature of the ring, the total amount of cable connecting
all stations is limited to 100 km. Also, although the “F” in FDDI implies that optical
fiber serves as the underlying physical medium, the standard has been defined to run
over a number of different physical media, including coax and twisted pair. Of course,
you still have to be careful about the total distance covered by the ring. As we will
see below, the amount of time it takes the token to traverse the network plays an
important role in the access control algorithm.
FDDI uses 4B/5B encoding, as discussed in Section 2.2. Since FDDI was the
first popular networking technology to use fiber, and 4B/5B chip sets operating at
FDDI rates became widely available, 4B/5B has enjoyed considerable popularity as an
encoding scheme for fiber.
Timed Token Algorithm
The rules governing token holding times are a little more complex in FDDI than in
802.5. The THT for each node is defined as before and is configured to some suitable
value. In addition, to ensure that a given node has the opportunity to transmit within
a certain amount of time—that is, to put an upper bound on the TRT observed by any
node—we define a target token rotation time (TTRT), and all nodes agree to live within
the limits of the TTRT. (How the nodes agree to a particular TTRT is described in the
next subsection.) Specifically, each node measures the time between successive arrivals
of the token. We call this the node’s measured TRT. If this measured TRT is greater
than the agreed-upon TTRT, then the token is late, and the node does not transmit any
data. If this measured TRT is less than the TTRT, then the token is early, and the node
is allowed to hold the token for the difference between TTRT and the measured TRT.
Although it may seem that we are now done, the algorithm we have just developed
does not ensure that a node concerned with sending a frame with a bounded delay will
actually be able to do so. The problem is that a node with lots of data to send has the
opportunity, upon seeing an early token, to hold the token for so long that by the time
a downstream node gets the token, its measured TRT is equal to or exceeds the TTRT,
meaning that it still cannot transmit its frame. To account for this possibility, FDDI
defines two classes of traffic: synchronous and asynchronous.[3] When a node receives
a token, it is always allowed to send synchronous data, without regard for whether
the token is early or late. In contrast, a node can send asynchronous traffic only when
the token is early.
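The decision a node makes on each token arrival can be sketched as below. Names are hypothetical and times are integer microseconds; the sketch covers only the early/late rule and not the separate bound on total synchronous traffic.

```python
def send_budget_us(measured_trt_us, ttrt_us):
    """Return (sync_allowed, async_hold_us) for one token arrival.

    Synchronous data may always be sent. Asynchronous data may be sent
    only when the token is early, for at most TTRT minus the measured TRT.
    """
    async_hold_us = max(0, ttrt_us - measured_trt_us)
    return True, async_hold_us


print(send_budget_us(5_000, 8_000))  # early token
print(send_budget_us(9_000, 8_000))  # late token
```

An early token (5 ms measured against an 8-ms target) leaves 3 ms for asynchronous traffic, whereas a late token leaves none.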
Note that the terms synchronous and asynchronous are somewhat misleading.
By synchronous, FDDI means that the traffic is delay sensitive. For example, you
would send voice or video as synchronous traffic on an FDDI network. In contrast,
asynchronous means that the application is more interested in throughput than delay.
A file transfer application would be asynchronous FDDI traffic.
[3] Originally, FDDI defined two subclasses of asynchronous traffic: restricted and unrestricted. In practice, however,
the restricted asynchronous case is not supported, and so we describe only the unrestricted case and refer to it
simply as “asynchronous.”
Are we done yet? Not quite. Because synchronous traffic can transmit without
regard to whether the token is early or late, it would seem that if each node had a sizable
amount of synchronous data to send, then the target rotation time would again be
meaningless. To account for this, the total amount of synchronous data that can be sent
during one token rotation is also bounded by TTRT. This means that in the worst case,
the nodes with asynchronous traffic first use up one TTRT’s worth of time, and then the
nodes with synchronous data consume another TTRT’s worth of time, meaning that
it is possible for the measured TRT at any given node to be as much as 2 × TTRT.
Note that if the synchronous traffic has already consumed one TTRT’s worth of time,
then the nodes with asynchronous traffic will not send any data because the token
will be late. Thus, while it is possible for a single rotation of the token to take as long
as 2 × TTRT, it is not possible to have back-to-back rotations that take 2 × TTRT
amount of time.
One final detail concerns precisely how a node determines if it can send asynchronous traffic. As stated above, a node sends if the measured TRT is less than the
TTRT. The question then arises: What if the measured TRT is less than the TTRT,
but by such a small amount that it’s not possible to send the full message without
exceeding the TTRT? The answer is that the node is allowed to send in this case. As a
consequence, the measured TRT is actually bounded by TTRT plus the time it takes
to send a full FDDI frame.
Token Maintenance
The FDDI mechanisms for ensuring that a valid token is always in circulation are also
different from those in 802.5, as they are intertwined with the process of setting the
TTRT. First, all nodes on an FDDI ring monitor the ring to be sure that the token
has not been lost. Observe that in a correctly functioning ring, each node should see
a valid transmission—either a data frame or the token—every so often. The greatest
idle time between valid transmissions that a given node should experience is equal to
the ring latency plus the time it takes to transmit a full frame, which on a maximally
sized ring is a little less than 2.5 ms. Therefore, each node sets a timer event that fires
after 2.5 ms. If this timer expires, the node suspects that something has gone wrong
and transmits a “claim” frame. Every time a valid transmission is received, however,
the node resets the timer back to 2.5 ms.
The claim frames in FDDI differ from those in 802.5 because they contain the
node’s bid for the TTRT, that is, the token rotation time that the node needs so that the
applications running on the node can meet their timing constraints. A node can send
a claim frame without holding the token and typically does so whenever it suspects a
failure or when it first joins the network. If this claim frame makes it all the way
around the ring, then the sender removes it, knowing that its TTRT bid was the lowest.
That node now holds the token—that is, it is responsible for inserting a valid token
on the ring—and may proceed with the normal token algorithm.
Figure 2.36 FDDI frame format.
When a node receives a claim frame, it checks to see if the TTRT bid in the frame
is less than its own. If it is, then the node resets its local definition of the TTRT to that
contained in the claim frame and forwards the frame to the next node. If the bid TTRT
is greater than that node’s minimum required TTRT, then the claim frame is removed
from the ring and the node enters the bidding process by putting its own claim frame
on the ring. Should the bid TTRT be equal to the node’s required TTRT, the node
compares the address of the claim frame’s sender with its own and the higher address
wins. Thus, if a claim frame makes it all the way back around to the original sender,
that node knows that it is the only active bidder and that it can safely claim the token.
At the same time, all nodes are now in agreement about the TTRT that will be short
enough to keep all nodes happy.
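The rules a node applies to an incoming claim frame can be collected into one function. This is a sketch under stated assumptions: the function name, the string action labels, and the use of integers for addresses and TTRT bids are all illustrative.

```python
def handle_claim(own_addr, own_ttrt, frame_addr, frame_ttrt):
    """Return (action, local_ttrt) for a node that receives a claim frame.

    action is "won" (our own frame came back), "forward" (pass it on),
    or "replace" (remove it and send our own claim frame instead).
    """
    if frame_addr == own_addr:
        return "won", own_ttrt        # our bid survived a full trip around the ring
    if frame_ttrt < own_ttrt:
        return "forward", frame_ttrt  # adopt the lower bid and pass it on
    if frame_ttrt > own_ttrt:
        return "replace", own_ttrt    # our bid is lower, so enter the bidding
    # Equal bids: the higher address wins.
    if frame_addr > own_addr:
        return "forward", frame_ttrt
    return "replace", own_ttrt


print(handle_claim(own_addr=7, own_ttrt=8, frame_addr=3, frame_ttrt=4))
print(handle_claim(own_addr=7, own_ttrt=8, frame_addr=3, frame_ttrt=8))
```

In the first call the node adopts the lower bid of 4 and forwards the frame; in the second the bids tie, and the node with the higher address (7) replaces the frame with its own.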
Frame Format
The FDDI frame format, depicted in Figure 2.36, differs in very few ways from that
for 802.5. Because FDDI uses 4B/5B encoding instead of Manchester, it uses 4B/5B
control symbols rather than illegal Manchester symbols in the start- and end-of-frame
markers. The other significant differences are the presence of a bit in the header to
distinguish synchronous from asynchronous traffic, and the lack of the access control
bits of 802.5.
2.8 Wireless (802.11)
Wireless networking is a rapidly evolving technology for connecting computers. As we
saw earlier in this chapter, the possibilities for building wireless networks are almost
endless, ranging from using infrared signals within a single building to constructing
a global network from a grid of low-orbit satellites. This section takes a closer look
at a specific technology centered around the emerging IEEE 802.11 standard. Like its
Ethernet and token ring siblings, 802.11 is designed for use in a limited geographical
area (homes, office buildings, campuses), and its primary challenge is to mediate access
to a shared communication medium—in this case, signals propagating through space.
802.11 supports additional features (e.g., time-bounded services, power management,
and security mechanisms), but we focus our discussion on its base functionality.
2 Direct Link Networks
Physical Properties
802.11 was designed to run over three different physical media—two based on spread
spectrum radio and one based on diffused infrared. The radio-based versions currently
run at 11 Mbps, but may soon run at 54 Mbps.
The idea behind spread spectrum is to spread the signal over a wider frequency
band than normal, so as to minimize the impact of interference from other devices.
(Spread spectrum was originally designed for military use, so these “other devices”
were often attempting to jam the signal.) For example, frequency hopping is a spread
spectrum technique that involves transmitting the signal over a random sequence of
frequencies; that is, first transmitting at one frequency, then a second, then a third,
and so on. The sequence of frequencies is not truly random, but is instead computed
algorithmically by a pseudorandom number generator. The receiver uses the same
algorithm as the sender—and initializes it with the same seed—and hence is able to
hop frequencies in sync with the transmitter to correctly receive the frame.
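To make the shared-seed idea concrete, the following sketch drives a hop sequence from a small linear congruential generator. The generator and its constants are an assumption chosen for illustration; they are not the hop pattern defined by 802.11. The names `hopper`, `hopper_init`, and `hopper_next` are hypothetical, and the 79 channels match the 802.11 frequency hopping physical layer described in this section.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CHANNELS 79u   /* number of 1-MHz-wide hop channels */

typedef struct {
    uint32_t state;        /* current pseudorandom generator state */
} hopper;

/* Sender and receiver both call this with the same seed. */
void hopper_init(hopper *h, uint32_t seed)
{
    assert(h);
    h->state = seed;
}

/* Advance the generator and return the next channel index (0..78).
 * The LCG constants are illustrative only. */
unsigned hopper_next(hopper *h)
{
    assert(h);
    h->state = h->state * 1664525u + 1013904223u;
    return (unsigned)((h->state >> 16) % NUM_CHANNELS);
}
```

Because the sequence depends only on the seed, two `hopper` instances initialized with the same value visit the same channels in the same order, which is what lets the receiver follow the transmitter.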
A second spread spectrum technique, called direct sequence, achieves the same
effect by representing each bit in the frame by multiple bits in the transmitted signal.
For each bit the sender wants to transmit, it actually sends the exclusive-OR of that
bit and n random bits. As with frequency hopping, the sequence of random bits is
generated by a pseudorandom number generator known to both the sender and the
receiver. The transmitted values, known as an n-bit chipping code, spread the signal
across a frequency band that is n times wider than the frame would have otherwise
required. Figure 2.37 gives an example of a 4-bit chipping sequence.
802.11 defines one physical layer using frequency hopping (over 79 1-MHz-wide
frequency bandwidths) and a second using direct sequence (using an 11-bit chipping
sequence). Both standards run in the 2.4-GHz frequency band of the electromagnetic
spectrum. In both cases, spread spectrum also has the interesting characteristic of
making the signal look like noise to any receiver that does not know the pseudorandom
sequence.
Data stream: 1010
Random sequence: 0100101101011001
XOR of the two: 1011101110101001
Figure 2.37 Example 4-bit chipping sequence.
Figure 2.38 Example wireless network.
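The chipping operation in Figure 2.37 can be sketched in a few lines. The spreading function follows the description above (each data bit is XORed with n pseudorandom bits). The despreading function is an assumption added for illustration: it XORs with the same sequence and takes a majority vote over the n chips, which is one plausible way a receiver could tolerate a damaged chip. Bits are stored one per byte for clarity, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Spread nbits data bits (each 0 or 1) using n chips per bit.
 * chips[] holds nbits * n pseudorandom bits; out[] receives nbits * n bits. */
void dsss_spread(const unsigned char *data, size_t nbits, size_t n,
                 const unsigned char *chips, unsigned char *out)
{
    assert(data && chips && out && n > 0);
    for (size_t i = 0; i < nbits; i++)
        for (size_t j = 0; j < n; j++)
            out[i * n + j] = (unsigned char)((data[i] ^ chips[i * n + j]) & 1u);
}

/* Recover the data bits: XOR with the same chips, then majority vote.
 * The majority vote is an illustrative assumption, not taken from the text. */
void dsss_despread(const unsigned char *rx, size_t nbits, size_t n,
                   const unsigned char *chips, unsigned char *data)
{
    assert(rx && chips && data && n > 0);
    for (size_t i = 0; i < nbits; i++) {
        size_t ones = 0;
        for (size_t j = 0; j < n; j++)
            ones += (size_t)((rx[i * n + j] ^ chips[i * n + j]) & 1u);
        data[i] = (unsigned char)(2 * ones > n);
    }
}
```

With the data stream 1010 and the random sequence 0100101101011001 from Figure 2.37, `dsss_spread` with n = 4 produces 1011101110101001.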
The third physical standard for 802.11 is based on infrared signals. The transmission is diffused, meaning that the sender and receiver do not have to be aimed at
each other and do not need a clear line of sight. This technology has a range of up to
about 10 m and is limited to the inside of buildings only.
Collision Avoidance
At first glance, it might seem that a wireless protocol would follow exactly the same
algorithm as the Ethernet—wait until the link becomes idle before transmitting and
back off should a collision occur—and to a first approximation, this is exactly what
802.11 does. The problem is more complicated in a wireless network, however, because
not all nodes are always within reach of each other.
Consider the situation depicted in Figure 2.38, where each of four nodes is able
to send and receive signals that reach just the nodes to its immediate left and right.
For example, B can exchange frames with A and C but it cannot reach D, while C
can reach B and D but not A. (A and D’s reach is not shown in the figure.) Suppose
both A and C want to communicate with B and so they each send it a frame. A and C
are unaware of each other since their signals do not carry that far. These two frames
collide with each other at B, but unlike an Ethernet, neither A nor C is aware of this
collision. A and C are said to be hidden nodes with respect to each other.
A related problem, called the exposed node problem, occurs under the following
circumstances. Suppose B is sending to A in Figure 2.38. Node C is aware of this
communication because it hears B’s transmission. It would be a mistake for C to
conclude that it cannot transmit to anyone just because it can hear B’s transmission.
For example, suppose C wants to transmit to node D. This is not a problem since
C’s transmission to D will not interfere with A’s ability to receive from B. (It would
interfere with A sending to B, but B is transmitting in our example.)
802.11 addresses these two problems with an algorithm called Multiple Access
with Collision Avoidance (MACA). The idea is for the sender and receiver to exchange
control frames with each other before the sender actually transmits any data. This
exchange informs all nearby nodes that a transmission is about to begin. Specifically,
the sender transmits a Request to Send (RTS) frame to the receiver; the RTS frame
includes a field that indicates how long the sender wants to hold the medium (i.e., it
specifies the length of the data frame to be transmitted). The receiver then replies with
a Clear to Send (CTS) frame; this frame echoes this length field back to the sender.
Any node that sees the CTS frame knows that it is close to the receiver, and therefore
cannot transmit for the period of time it takes to send a frame of the specified length.
Any node that sees the RTS frame but not the CTS frame is not close enough to the
receiver to interfere with it, and so is free to transmit.
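The rule a bystander applies can be written as a one-line test on what it overheard. The sketch below assumes a hypothetical record `overheard_exchange` and expresses the deferral period in bytes of the announced frame rather than in time. It only captures the decision described above, not the full 802.11 MAC.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool heard_rts;   /* overheard the sender's RTS */
    bool heard_cts;   /* overheard the receiver's CTS */
    unsigned length;  /* length field carried in the RTS/CTS, in bytes */
} overheard_exchange;

/* Number of bytes' worth of transmission time a third-party node must
 * stay silent. Hearing only the RTS does not force the node to defer. */
unsigned maca_defer_bytes(const overheard_exchange *x)
{
    assert(x);
    if (x->heard_cts)
        return x->length;   /* close to the receiver: stay quiet */
    return 0;               /* out of the receiver's range: free to send */
}
```

In Figure 2.38, if A sends an RTS to B, node C hears only B's CTS and defers, which addresses the hidden node problem. If instead B sends an RTS to A, node C hears the RTS but not A's CTS and remains free to transmit to D, which addresses the exposed node problem.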
There are two more details to complete the picture. First, the receiver sends an
ACK to the sender after successfully receiving a frame. All nodes must wait for this
ACK before trying to transmit.[4] Second, should two or more nodes detect an idle link
and try to transmit an RTS frame at the same time, their RTS frames will collide with
each other. 802.11 does not support collision detection, but instead the senders realize
the collision has happened when they do not receive the CTS frame after a period
of time, in which case they each wait a random amount of time before trying again.
The amount of time a given node delays is defined by the same exponential backoff
algorithm used on the Ethernet (see Section 2.6.2).
Distribution System
As described so far, 802.11 would be suitable for an ad hoc configuration of nodes
that may or may not be able to communicate with all other nodes, depending on how
far apart they are. Moreover, since one of the advantages of a wireless network is
that nodes are free to move around—they are not tethered by wire—the set of directly
reachable nodes may change over time. To help deal with this mobility and partial
connectivity, 802.11 defines additional structure on a set of nodes. Nodes are free to
directly communicate with each other as just described, but in practice, they operate
within this structure.
Instead of all nodes being created equal, some nodes are allowed to roam (e.g.,
your laptop) and some are connected to a wired network infrastructure. The latter
are called access points (AP), and they are connected to each other by a so-called
distribution system. Figure 2.39 illustrates a distribution system that connects three
access points, each of which services the nodes in some region. Each of these regions
is analogous to a cell in a cellular phone system, with the APs playing the same role
as a base station. The details of the distribution system are not important to this
discussion; it could be an Ethernet or a token ring, for example. The only important
point is that the distribution network runs at layer 2 of the ISO architecture; that is,
it does not depend on any higher-level protocols.
Figure 2.39 Access points connected to a distribution network.
[4] This ACK was not part of the original MACA algorithm, but was instead proposed in an extended version called
MACAW: MACA for Wireless LANs.
Although two nodes can communicate directly with each other if they are within
reach of each other, the idea behind this configuration is that each node associates
itself with one access point. For node A to communicate with node E, for example,
A first sends a frame to its access point (AP-1), which forwards the frame across
the distribution system to AP-3, which finally transmits the frame to E. How AP-1
knew to forward the message to AP-3 is beyond the scope of 802.11; it may have
used the bridging protocol described in the next chapter (Section 3.2). What 802.11
does specify is how nodes select their access points and, more interestingly, how this
algorithm works in light of nodes moving from one cell to another.
The technique for selecting an AP is called scanning and involves the following
four steps:
1 The node sends a Probe frame.
2 All APs within reach reply with a Probe Response frame.
3 The node selects one of the access points and sends that AP an Association
Request frame.
4 The AP replies with an Association Response frame.
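The four steps above can be viewed as a small state machine on the roaming node. The following is a sketch under simplifying assumptions: the node associates with the first AP whose Probe Response it processes (the text leaves the selection criterion open), and all state, event, and action names are hypothetical.

```c
#include <assert.h>

typedef enum { SCAN_IDLE, SCAN_PROBING, SCAN_ASSOCIATING, SCAN_ASSOCIATED } scan_state;
typedef enum { EV_START_SCAN, EV_PROBE_RESPONSE, EV_ASSOC_RESPONSE } scan_event;
typedef enum { ACT_NONE, ACT_SEND_PROBE, ACT_SEND_ASSOC_REQUEST } scan_action;

/* Advance the node's scanning state and report which frame, if any,
 * the node should transmit next. Unexpected events are ignored. */
scan_action scan_step(scan_state *s, scan_event ev)
{
    assert(s);
    switch (*s) {
    case SCAN_IDLE:
    case SCAN_ASSOCIATED:           /* a node may rescan when unhappy with its AP */
        if (ev == EV_START_SCAN) {
            *s = SCAN_PROBING;
            return ACT_SEND_PROBE;              /* step 1 */
        }
        break;
    case SCAN_PROBING:
        if (ev == EV_PROBE_RESPONSE) {          /* step 2 received */
            *s = SCAN_ASSOCIATING;
            return ACT_SEND_ASSOC_REQUEST;      /* step 3 */
        }
        break;
    case SCAN_ASSOCIATING:
        if (ev == EV_ASSOC_RESPONSE)            /* step 4 received */
            *s = SCAN_ASSOCIATED;
        break;
    }
    return ACT_NONE;
}
```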
Figure 2.40 Node mobility.
A node engages this protocol whenever it joins the network, as well as when it becomes
unhappy with its current AP. This might happen, for example, because the signal from
its current AP has weakened due to the node moving away from it. Whenever a node
acquires a new AP, the new AP notifies the old AP of the change (this happens in step 4)
via the distribution system.
Consider the situation shown in Figure 2.40, where node C moves from the cell
serviced by AP-1 to the cell serviced by AP-2. As it moves, it sends Probe frames, which
eventually result in Probe Response frames from AP-2. At some point, C prefers AP-2
over AP-1, and so it associates itself with that access point.
The mechanism just described is called active scanning since the node is actively
searching for an access point. APs also periodically send a Beacon frame that advertises
the capabilities of the access point; these include the transmission rates supported by
the AP. This is called passive scanning, and a node can change to this AP based on the
Beacon frame simply by sending it an Association Request frame back to the access
point.
Frame Format
Most of the 802.11 frame format, which is depicted in Figure 2.41, is exactly what
we would expect. The frame contains the source and destination node addresses, each
of which is 48 bits long; up to 2312 bytes of data; and a 32-bit CRC. The Control
field contains three subfields of interest (not shown): a 6-bit Type field that indicates
whether the frame carries data, is an RTS or CTS frame, or is being used by the
scanning algorithm; and a pair of 1-bit fields, called ToDS and FromDS, that are
described below.
Figure 2.41 802.11 frame format.
The peculiar thing about the 802.11 frame format is that it contains four, rather
than two, addresses. How these addresses are interpreted depends on the settings of the
ToDS and FromDS bits in the frame’s Control field. This is to account for the possibility
that the frame had to be forwarded across the distribution system, which would mean
that the original sender is not necessarily the same as the most recent transmitting
node. Similar reasoning applies to the destination address. In the simplest case, when
one node is sending directly to another, both the DS bits are 0, Addr1 identifies the
target node, and Addr2 identifies the source node. In the most complex case, both DS
bits are set to 1, indicating that the message went from a wireless node onto the distribution system, and then from the distribution system to another wireless node.
With both bits set, Addr1 identifies the ultimate destination, Addr2 identifies the
immediate sender (the one that forwarded the frame from the distribution system to
the ultimate destination), Addr3 identifies the intermediate destination (the one that accepted the frame from a wireless node and forwarded it across the distribution system),
and Addr4 identifies the original source. In terms of the example given in Figure 2.39,
Addr1 corresponds to E, Addr2 identifies AP-3, Addr3 corresponds to AP-1, and Addr4
identifies A.
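The two cases described above can be captured in a pair of lookup functions. This sketch covers only those two settings of the DS bits; for the mixed settings, which the text does not describe, the functions return NULL rather than guess. The struct layout and names (`dot11_hdr`, `final_destination`, `original_source`) are hypothetical and do not reflect the on-the-wire encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t b[6]; } mac_addr;   /* 48-bit address */

typedef struct {
    bool to_ds;
    bool from_ds;
    mac_addr addr1, addr2, addr3, addr4;
} dot11_hdr;

/* Addr1 is the target in both described cases. */
const mac_addr *final_destination(const dot11_hdr *h)
{
    assert(h);
    return (h->to_ds == h->from_ds) ? &h->addr1 : NULL;
}

/* Addr2 is the source for a direct transmission; Addr4 holds the
 * original source once the frame has crossed the distribution system. */
const mac_addr *original_source(const dot11_hdr *h)
{
    assert(h);
    if (!h->to_ds && !h->from_ds)
        return &h->addr2;
    if (h->to_ds && h->from_ds)
        return &h->addr4;
    return NULL;   /* mixed settings are not covered by this sketch */
}
```

For the frame from A to E in Figure 2.39, both bits are set, so `final_destination` yields Addr1 (E) and `original_source` yields Addr4 (A).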
2.9 Network Adaptors
Nearly all the networking functionality described in this chapter is implemented in the
network adaptor: framing, error detection, and the media access protocol. The only
exceptions are the point-to-point automatic repeat request (ARQ) schemes described
in Section 2.5, which are typically implemented in the lowest-level protocol running
on the host. We conclude this chapter by describing the design of a generic network
adaptor and the device driver software that controls it.
When reading this section, keep in mind that no two network adaptors are
exactly alike; they vary in countless small details. Our focus, therefore, is on their
general characteristics, although we do include some examples from an actual adaptor
to make the discussion more tangible.
Figure 2.42 Block diagram of a typical network adaptor.
A network adaptor serves as an interface between the host and the network, and as
a result, it can be thought of as having two main components: a bus interface that
understands how to communicate with the host and a link interface that speaks the
correct protocol on the network. There must also be a communication path between
these two components, over which incoming and outgoing data is passed. A simple
block diagram of a network adaptor is depicted in Figure 2.42.
Network adaptors are always designed for a specific I/O bus, which often precludes moving an adaptor from one vendor’s machine to another.[5] Each bus, in effect,
defines a protocol that is used by the host’s CPU to program the adaptor, by the adaptor to interrupt the host’s CPU, and by the adaptor to read and write memory on the
host. One of the main features of an I/O bus is the data transfer rate that it supports.
For example, a typical bus might have a 32-bit-wide data path (i.e., it can transfer
32 bits of data in parallel) running at 33 MHz (i.e., the bus’s cycle time is about 30 ns),
giving it a peak transfer rate of approximately 1 Gbps, which would be enough to
support a (unidirectional) 622-Mbps STS-12 link. Of course, the peak rate tells us
almost nothing about the average rate, which may be much lower.
The link-half of the adaptor implements the link-level protocol. For fairly mature
technologies like Ethernet, the link-half of the adaptor is implemented by a chip set
that can be purchased on the commodity market. For newer link technologies, however, the link-level protocol may be implemented in software on a general-purpose
microprocessor or perhaps with some form of programmable hardware, such as a
field-programmable gate array (FPGA). These approaches generally add to the cost of
the adaptor but make it more flexible: it is easier to modify software than hardware
and easier to reprogram FPGAs than to redesign boards.
[5] Fortunately, there are standards in bus design just as there are in networking, so some adaptors can be used on
machines from several vendors.
Because the host’s bus and the network link are, in all probability, running at
different speeds, there is a need to put a small amount of buffering between the two
halves of the adaptor. Typically, a small FIFO (byte queue) is enough to hide the
asynchrony between the bus and the link.
View from the Host
Since we have spent most of this chapter discussing various protocols that are implemented by the link-half of the adaptor, we now turn our attention to the host’s view
of the network adaptor.
Control Status Register
A network adaptor, like any other device, is ultimately programmed by software running on the CPU. From the CPU’s perspective, the adaptor exports a control status
register (CSR) that is readable and writable from the CPU. The CSR is typically located
at some address in the memory, thereby making it possible for the CPU to read and
write just like any other memory location. The CPU writes to the CSR to instruct it to
transmit and/or receive a frame and reads from the CSR to learn the current state of
the adaptor.
The following is an example CSR from the Lance Ethernet device, which is
manufactured by Advanced Micro Devices (AMD). The Lance device actually has four
different control status registers; the following shows the bit masks used to interpret
the 16-bit CSR0. To set a bit on the adaptor, the CPU does an inclusive-OR of CSR0
and the mask corresponding to the bit it wants to set. To determine if a particular bit
is set, the CPU compares the AND of the contents of CSR0 and the mask against 0.
/*
 * Control and status bits for CSR0.
 *
 * Legend:
 *   RO  - Read Only
 *   RC  - Read/Clear (writing 1 clears, writing 0 has no effect)
 *   RW  - Read/Write
 *   W1  - Write-1-only (writing 1 sets, writing 0 has no effect)
 *   RW1 - Read/Write-1-only (writing 1 sets, writing 0 has no effect)
 */
#define LE_ERR  0x8000 /* RO BABL | CERR | MISS | MERR */
#define LE_BABL 0x4000 /* RC transmitted too many bits */
#define LE_CERR 0x2000 /* RC No Heartbeat */
#define LE_MISS 0x1000 /* RC Missed an incoming packet */
#define LE_MERR 0x0800 /* RC Memory Error; no acknowledge */
#define LE_RINT 0x0400 /* RC Received packet Interrupt */
#define LE_TINT 0x0200 /* RC Transmitted packet Interrupt */
#define LE_IDON 0x0100 /* RC Initialization Done */
#define LE_INEA 0x0040 /* RW Interrupt Enable */
#define LE_RXON 0x0020 /* RO Receiver On */
#define LE_TXON 0x0010 /* RO Transmitter On */
#define LE_TDMD 0x0008 /* W1 Transmit Demand (send it now) */
#define LE_STOP 0x0004 /* RW1 Stop */
#define LE_STRT 0x0002 /* RW1 Start */
#define LE_INIT 0x0001 /* RW1 Initialize */
This definition says, for example, that the host writes a 1 to the least significant
bit of CSR0 (0x0001) to initialize the Lance chip. Similarly, if the host sees a 1 in the
sixth least significant bit (0x0020) and in the fifth least significant bit (0x0010), then it knows that
the Lance chip is enabled to receive and transmit frames, respectively.
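The set and test operations on CSR0 can be sketched as two small helpers. Here an ordinary variable stands in for the memory-mapped register, and the mask names other than those given in the listing header are assumptions; the mask values themselves (0x0001, 0x0010, 0x0020, 0x0400) come from the text. A real driver would also have to respect the read/clear semantics in the legend, which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LE_RINT 0x0400u  /* received packet interrupt */
#define LE_RXON 0x0020u  /* receiver on */
#define LE_TXON 0x0010u  /* transmitter on */
#define LE_INIT 0x0001u  /* initialize */

/* Set a bit: inclusive-OR of the register contents and the mask. */
void csr_set(volatile uint16_t *csr, uint16_t mask)
{
    assert(csr);
    *csr = (uint16_t)(*csr | mask);
}

/* Test a bit: compare the AND of the register and the mask against 0. */
bool csr_test(const volatile uint16_t *csr, uint16_t mask)
{
    assert(csr);
    return (*csr & mask) != 0;
}
```

With these helpers, initializing the chip amounts to `csr_set(csr0, LE_INIT)`, and checking whether it can receive and transmit amounts to testing `LE_RXON` and `LE_TXON`.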
The host CPU could sit in a tight loop reading the adaptor’s control status register
until something interesting happens and then take the appropriate action. On the
Lance chip, for example, it could continually watch for a 1 in the 11th least significant bit
(0x0400), which would indicate that a frame has just arrived. This is called polling,
and although it is not an unreasonable design in certain situations (e.g., a network
router that has nothing better to do than wait for the next frame), it is not typically
done on end hosts that could better spend their time running application programs.
Instead of polling, most hosts only pay attention to the network device when the
adaptor interrupts the host. The device raises an interrupt when an event that requires
host intervention occurs—for example, a frame has been successfully transmitted or
received, or an error occurred when the device was attempting to transmit or receive
a frame. The host’s architecture includes a mechanism that causes a particular procedure inside the operating system to be invoked when such an interrupt occurs. This
procedure is known as an interrupt handler, and it inspects the CSR to determine the
cause of the interrupt and then takes the appropriate action.
While servicing an interrupt, the host typically disables additional interrupts.
This keeps the device driver from having to service multiple interrupts at one time.
Because interrupts are disabled, the device driver must finish its job quickly (it does not
have the time to execute the entire protocol stack), and under no circumstances can it
afford to block (that is, suspend execution while awaiting some event). For example,
this might be accomplished by having the interrupt handler dispatch a process to take
care of the frame and then return. Thus, the handler makes sure that the frame will get
processed without having to spend valuable time actually processing the frame itself.
Direct Memory Access versus Programmed I/O
One of the most important issues in network adaptor design is how the bytes of a
frame are transferred between the adaptor and the host memory. There are two basic
mechanisms: direct memory access (DMA) and programmed I/O (PIO). With DMA,
the adaptor directly reads and writes the host’s memory without any CPU involvement;
the host simply gives the adaptor a memory address and the adaptor reads from (writes
to) it. With PIO, the CPU is directly responsible for moving data between the adaptor
and the host memory: To send a frame, the CPU sits in a tight loop that first reads a
word from host memory and then writes it to the adaptor; to receive a frame, the CPU
reads words from the adaptor and writes them to memory. We now consider DMA
and PIO in more detail.
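The PIO loops are simple enough to sketch directly. The code below assumes the adaptor exposes its frame memory as an array of 32-bit words that the CPU can address; the function names are hypothetical and an ordinary array stands in for the device memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Send: read each word from host memory, then write it to the adaptor. */
void pio_send(volatile uint32_t *adaptor_mem, const uint32_t *host_buf,
              size_t nwords)
{
    assert(adaptor_mem && host_buf);
    for (size_t i = 0; i < nwords; i++)
        adaptor_mem[i] = host_buf[i];
}

/* Receive: read each word from the adaptor, then write it to host memory. */
void pio_receive(uint32_t *host_buf, const volatile uint32_t *adaptor_mem,
                 size_t nwords)
{
    assert(host_buf && adaptor_mem);
    for (size_t i = 0; i < nwords; i++)
        host_buf[i] = adaptor_mem[i];
}
```

Every word of the frame passes through the CPU in both directions, which is exactly the work that DMA offloads to the adaptor.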
When using DMA, there is no need to buffer frames on the adaptor; the adaptor
reads and writes host memory. (A few bytes of buffering are needed to stage data
between the bus and the link, as described above, but complete frames are not buffered
on the adaptor.) The CPU is therefore responsible for giving the adaptor a pair of buffer
descriptor lists: one to transmit out of and one to receive into. A buffer descriptor list
is an array of address/length pairs, as illustrated in Figure 2.43.
When receiving frames, the adaptor uses as many buffers as it needs to hold the
incoming frame. For example, the descriptor illustrated in Figure 2.43 would cause
an Ethernet adaptor that was attempting to receive a 1450-byte frame to put the first
100 bytes in the first buffer and the next 1350 bytes in the second buffer. If a second
1500-byte frame arrived immediately after the first, it would be placed entirely in the
third buffer. That is, separate frames are placed in separate buffers, although a single
frame may be scattered across multiple buffers. This latter feature is usually called
scatter-read. In practice, scatter-read is used when the network’s maximum frame size is
so large that it is wasteful to allocate all buffers big enough to contain the largest
possible arriving frame. An OS-specific message data structure, similar to the ones
described in Section 1.4.3, would then be used to link together all the buffers that make
up a single frame. Scatter-read is typically not used on an Ethernet because preallocating
1500-byte buffers does not excessively waste memory.
Figure 2.43 Buffer descriptor list.
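A scatter-read over a buffer descriptor list might be sketched as follows. The descriptor type `buf_desc`, the function `scatter_read`, and the `filled` bookkeeping array are hypothetical names introduced for this example. In a real adaptor this logic lives in hardware and writes through DMA rather than `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned char *addr;  /* start of the buffer in host memory */
    size_t len;           /* buffer length in bytes */
} buf_desc;

/* Place one frame into the descriptor list starting at index *next.
 * Each frame starts in a fresh buffer and may span several buffers.
 * filled[i] records how many bytes landed in buffer i.
 * Returns the number of buffers used, or 0 if the frame does not fit. */
size_t scatter_read(const buf_desc *list, size_t nbufs, size_t *next,
                    const unsigned char *frame, size_t frame_len,
                    size_t *filled)
{
    assert(list && next && frame && filled);

    /* Make sure the remaining buffers can hold the whole frame. */
    size_t room = 0;
    for (size_t i = *next; i < nbufs; i++)
        room += list[i].len;
    if (frame_len == 0 || room < frame_len)
        return 0;

    size_t used = 0;
    while (frame_len > 0) {
        const buf_desc *d = &list[*next];
        size_t n = frame_len < d->len ? frame_len : d->len;
        memcpy(d->addr, frame, n);
        filled[*next] = n;
        frame += n;
        frame_len -= n;
        (*next)++;
        used++;
    }
    return used;
}
```

Assuming buffers of 100, 1400, and 1500 bytes (sizes chosen here for illustration), a 1450-byte frame occupies the first two buffers with 100 and 1350 bytes, and a following 1500-byte frame lands entirely in the third.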
Output works in a similar way. When
the host has a frame to transmit, it puts a
pointer to the buffer that contains the frame
in the transmit descriptor list. Devices that
support gather-write allow the frame to be
fragmented across multiple physical buffers.
In practice, gather-write is more widely used
than scatter-read because outgoing frames
are often constructed in a piecemeal fashion,
with more than one protocol contributing
a buffer. For example, by the time a message makes it down the protocol stack and
is ready to be transmitted, it consists of a
buffer that contains the aggregate header
(the collection of headers attached by various protocols that processed the message)
and a separate buffer that contains the application’s data.
Frames, Buffers, and Messages
As this section has suggested, the
network adaptor is the place where
the network comes in physical contact with the host. It also happens
to be the place where three different worlds intersect: the network,
the host architecture, and the host
operating system. It turns out that
each of these has a different terminology for talking about the same
thing. It is important to recognize
when this is happening.
From the network’s perspective, the adaptor transmits frames
from the host and receives frames
into the host. Most of this chapter
has been presented from the network perspective, so you should
have a good understanding of
what the term “frame” means.
From the perspective of the host
architecture, each frame is received into or transmitted from a
buffer, which is simply a region
of main memory of some length
and starting at some address. Finally, from the operating system’s
perspective, a message is an abstract object that holds network
frames. Messages are implemented
by a data structure that includes
pointers to different memory locations (buffers). We saw an example of a message data structure in
Chapter 1.
Figure 2.44 Programmed I/O.
In the case of PIO, the network adaptor must contain some amount of buffering—
the CPU copies frames between host memory and this adaptor memory, as illustrated
in Figure 2.44. The basic fact that necessitates buffering is that, with most operating
systems, you can never be sure when the CPU will get around to doing something,
so you need to be prepared to wait for it. One important question that must be addressed is how much memory is needed on the adaptor. There certainly needs to be
at least one frame’s worth of memory in both the transmit and the receive direction.
In addition, adaptors that use PIO usually have additional memory that can be used
to hold a small number of incoming frames until the CPU can get around to copying them into host memory. Although the computer system axiom that “memory is
cheap” would seem to suggest putting a huge amount of memory on the adaptor, this
memory must be of the more expensive dual-ported type because both the CPU and
the adaptor read/write it. PIO-based adaptors typically have something on the order
of 64–256 KB of adaptor memory, although there are adaptors with as much as 1 MB
of memory.
Device Drivers
A device driver is a collection of operating system routines that effectively anchor the
protocol stack to the network hardware. It typically includes routines to initialize the
device, transmit frames on the link, and field interrupts. The code is often difficult to
read because it’s full of device-specific details, but the overall logic is actually quite
simple.
For example, a transmit routine first makes sure there is a free transmit buffer
on the device to handle the message. If not, it has to block the process until one is
available. Once there is an available transmit buffer, the invoking process disables
interrupts to protect itself from interference. It then translates the message from the
internal OS format to that expected by the device, sets the CSR to instruct the device
to transmit, and enables interrupts.
The logic for the interrupt handler is equally simple. It first disables additional
interrupts that might interfere with the processing of this interrupt. It then inspects the
CSR to determine what caused the interrupt. There are three possibilities: (1) an error
has occurred, (2) a transmit request has completed, or (3) a frame has been received.
In the first case, the handler prints a message and clears the error bits. In the second
case, we know that a transmit request that was queued earlier by the transmit routine
has completed, meaning that there is now a free transmit buffer that can be reused. In
the third case, the handler calls a receive routine to extract the incoming frame from
the receive buffer list and place it in the OS’s internal message data structure, and then
start a process to shepherd the message up the protocol stack.
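The three-way dispatch in the interrupt handler can be sketched against the Lance CSR0 masks. Several things here are assumptions made for illustration: the mask names and the values for MISS, MERR, and TINT are inferred from the bit ordering of CSR0, counters stand in for the real actions (printing a message, recycling a transmit buffer, calling the receive routine), and the function returns the bits to write back rather than touching hardware.

```c
#include <assert.h>
#include <stdint.h>

#define LE_ERR  0x8000u  /* summary of the four error bits below */
#define LE_BABL 0x4000u
#define LE_CERR 0x2000u
#define LE_MISS 0x1000u
#define LE_MERR 0x0800u
#define LE_RINT 0x0400u  /* received packet interrupt */
#define LE_TINT 0x0200u  /* transmitted packet interrupt */

typedef struct {
    unsigned errors;        /* case 1: error reported */
    unsigned tx_completed;  /* case 2: a transmit buffer is free again */
    unsigned rx_frames;     /* case 3: a frame awaits the receive routine */
} driver_counters;

/* Inspect a snapshot of CSR0 and return the read/clear bits that should
 * be written back (writing 1 clears them). */
uint16_t handle_interrupt(uint16_t csr0, driver_counters *c)
{
    assert(c);
    uint16_t ack = 0;
    if (csr0 & LE_ERR) {
        c->errors++;
        ack = (uint16_t)(ack | (csr0 & (LE_BABL | LE_CERR | LE_MISS | LE_MERR)));
    }
    if (csr0 & LE_TINT) {
        c->tx_completed++;
        ack = (uint16_t)(ack | LE_TINT);
    }
    if (csr0 & LE_RINT) {
        c->rx_frames++;
        ack = (uint16_t)(ack | LE_RINT);
    }
    return ack;
}
```

Note that the handler does no protocol processing itself. It records what happened and returns quickly, leaving the remaining work to a process that is dispatched later.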
Memory Bottleneck
As discussed in Section 2.1.1, host memory performance is often the limiting factor in
network performance. Nowhere is this possibility more critical than at the host/adaptor
interface. To help drive this point home, consider Figure 2.45. This diagram shows
the bandwidth available between various components of a modern PC. While the I/O
bus is fast enough to transfer frames between the network adaptor and host memory
at gigabit rates, there are two potential problems.
Figure 2.45 Memory bandwidth on a modern PC-class machine.
The first is that the advertised I/O bus speed corresponds to its peak bandwidth;
it is the product of the bus’s width and clock speed (e.g., a 32-bit-wide bus running
at 33 MHz has a peak transfer rate of 1056 Mbps). The real limitation is the size
of the data block that is being transferred across the I/O bus, since there is a certain
amount of overhead involved in each bus transfer. On some architectures, for example,
it takes 8 clock cycles to acquire the bus for the purpose of transferring data from the
adaptor to host memory. This overhead is independent of the number of data bytes
transferred. Thus, if you want to transfer a 64-byte payload across the I/O bus—this
happens to be the size of a minimum Ethernet packet—then the whole transfer takes
24 cycles: 8 cycles to acquire the bus and 16 cycles to transfer the data. (The bus
is 32 bits wide, which means that it can transfer a 4-byte word during each clock
cycle; 64 bytes divided by 4 bytes per cycle equals 16 cycles.) This means that the
maximum bandwidth you can achieve is
16 ÷ (8 + 16) × 1056 = 704 Mbps
not the peak 1056 Mbps.
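The same calculation generalizes to other payload sizes. The helper below is a sketch with a hypothetical name and parameter list; it assumes a fixed acquisition overhead per transfer and one bus-width word moved per clock cycle, as in the example above.

```c
#include <assert.h>

/* Effective bandwidth (in the same units as peak) for one bus transfer.
 * bus_bytes:     bytes moved per clock cycle (4 for a 32-bit bus)
 * setup_cycles:  cycles spent acquiring the bus
 * payload_bytes: size of the block being transferred */
double effective_bus_bandwidth(double peak, unsigned bus_bytes,
                               unsigned setup_cycles, unsigned payload_bytes)
{
    assert(bus_bytes > 0 && payload_bytes > 0);
    unsigned data_cycles = (payload_bytes + bus_bytes - 1) / bus_bytes;
    return peak * data_cycles / (double)(setup_cycles + data_cycles);
}
```

For the 64-byte payload above, `effective_bus_bandwidth(1056.0, 4, 8, 64)` gives 704 Mbps. For a 1500-byte payload the fixed overhead matters far less, and the result is close to 1034 Mbps.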
The second problem is that the memory/CPU bandwidth, which is 235 MBps
(1880 Mbps), is the same order of magnitude as the bandwidth of the I/O bus. Fortunately, this is a measured number rather than an advertised peak rate. The ramification
is that while it is possible to deliver frames across the I/O bus and into memory and
then to load the data from memory into the CPU’s registers at network bandwidths, it
is impractical for the device driver, operating system, and application to go to memory multiple times for each word of data in a network packet, possibly because it
needs to copy the data from one buffer to another. In particular, if the memory/CPU
path is crossed n times, then it might be the case that the bandwidth your application sees is 235 MBps/n. (The performance might be better if the data is cached, but
often caches don’t help with data arriving from the network.) For example, if the
various software layers need to copy the message from one buffer to another four
times—not an uncommon situation—then the application might see a throughput of
58.75 MBps (470 Mbps), a far cry from the 1056 Mbps we thought this machine could
sustain.
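The arithmetic behind these numbers is a simple division, sketched below with hypothetical function names. It models only the worst case described above, in which every crossing of the memory/CPU path costs the full measured bandwidth and caching does not help.

```c
#include <assert.h>

/* Throughput seen by the application, in MBps, when each word of data
 * crosses the memory/CPU path `crossings` times. */
double app_throughput_MBps(double mem_MBps, unsigned crossings)
{
    assert(crossings > 0);
    return mem_MBps / crossings;
}

/* Convert megabytes per second to megabits per second. */
double MBps_to_Mbps(double MBps)
{
    return MBps * 8.0;
}
```

With a measured 235 MBps and four copies, `app_throughput_MBps(235.0, 4)` is 58.75 MBps, which `MBps_to_Mbps` converts to 470 Mbps.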
As an aside, it is important to recognize that there are many parallels between
moving a message to and from memory and moving a message across a network. In
particular, the effective throughput of the memory system is defined by the same two
formulas given in Section 1.5.
Throughput = TransferSize/TransferTime
TransferTime = RTT + 1/Bandwidth × TransferSize
In the case of the memory system, however, the transfer size corresponds to how
big a unit of data we can move across the bus in one transfer (i.e., cache line versus
small cells versus large message), and the RTT corresponds to the memory latency, that
is, whether the memory is on-chip cache, off-chip cache, or main memory. Just as in the
case of the network, the larger the transfer size and the smaller the latency, the better
the effective throughput. Also similar to a network, the effective memory throughput
does not necessarily equal the peak memory bandwidth (i.e., the bandwidth that can
be achieved with an infinitely large transfer).
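The two formulas translate directly into code. In this sketch the function names are hypothetical, and the units are seconds, bytes, and bytes per second. Any concrete latency or bandwidth values passed in are the caller's assumptions, not measurements from this chapter.

```c
#include <assert.h>

/* TransferTime = RTT + 1/Bandwidth x TransferSize */
double transfer_time(double rtt, double bandwidth, double transfer_size)
{
    assert(bandwidth > 0.0);
    return rtt + transfer_size / bandwidth;
}

/* Throughput = TransferSize / TransferTime */
double throughput(double rtt, double bandwidth, double transfer_size)
{
    double t = transfer_time(rtt, bandwidth, transfer_size);
    assert(t > 0.0);
    return transfer_size / t;
}
```

For a fixed latency, calling `throughput` with a larger transfer size always yields a value closer to, but still below, the peak bandwidth, which is the behavior described above.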
The main point of this discussion is that we must be aware of the limits memory
bandwidth places on network performance. If carefully designed, the system can work
around these limits. For example, it is possible to integrate the buffers used by the
device driver, the operating system, and the application in a way that minimizes data
copies. The system also needs to be aware of when data is brought into the cache, so
it can perform all necessary operations on the data before it gets bumped from the
cache. The details of how this is accomplished are beyond the scope of this book, but
can be found in papers referenced at the end of the chapter.
Finally, there is a second important lesson lurking in this discussion: when the
network isn’t performing as well as you think it should, it’s not always the network’s
fault. In many cases, the actual bottleneck in the system is one of the machines connected to the network. For example, when it takes a long time for a Web page to
appear on your browser, it might be network congestion, but it’s just as likely the case
that the server at the other end of the network can’t keep up with the workload.
2.10 Summary
This chapter introduced the hardware building blocks of a computer network—nodes
and links—and discussed the five key problems that must be solved so that two or
more nodes that are directly connected by a physical link can exchange messages with
each other.
First, physical links carry signals. It is therefore necessary to encode the bits that
make up a binary message into the signal at the source node and then to recover the
bits from the signal at the receiving node. This is the encoding problem, and it is made
challenging by the need to keep the sender’s and receiver’s clocks synchronized. We
discussed four different encoding techniques—NRZ, NRZI, Manchester, and 4B/5B—
which differ largely in how they encode clock information along with the data being
transmitted. One of the key attributes of an encoding scheme is its efficiency, that is,
the ratio of signal pulses to encoded bits.
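As a reminder of how two of these schemes carry clock information, here is a minimal sketch of NRZI and Manchester encoders. The function names are hypothetical. The sketch assumes that NRZI makes a transition to encode a 1 and holds the level to encode a 0, and that Manchester sends the exclusive-OR of the data bit with a clock that is low and then high within each bit time, so that 0 becomes low-to-high and 1 becomes high-to-low.

```python
def nrzi(bits, start=0):
    """NRZI: toggle the signal level for a 1, hold it for a 0."""
    level = start
    signal = []
    for b in bits:
        if b == 1:
            level ^= 1
        signal.append(level)
    return signal


def manchester(bits):
    """Manchester: XOR each bit with a clock that is low, then high.
    Produces two signal elements per data bit."""
    signal = []
    for b in bits:
        signal.extend((b ^ 0, b ^ 1))
    return signal


data = [1, 0, 1, 1]
print(nrzi(data))
print(manchester(data))
```

Manchester emits two signal elements for every data bit, which is why it is less efficient than NRZ or NRZI even though it guarantees a transition in every bit time.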
Once it is possible to transmit bits between nodes, the next step is to figure out
how to package these bits into frames. This is the framing problem, and it boils down
to being able to recognize the beginning and end of each frame. Again, we looked at
several different techniques, including byte-oriented protocols, bit-oriented protocols,
and clock-based protocols.
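For the bit-oriented case, the core of the technique is bit stuffing. The sketch below assumes the HDLC-style rule in which the sender inserts a 0 after any five consecutive 1s in the frame body, and the receiver removes a 0 that follows five consecutive 1s. The function names are hypothetical, and flag handling is reduced to raising an error.

```python
def stuff(bits):
    """Insert a 0 after every run of five consecutive 1s."""
    out = []
    run = 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 5:
            out.append(0)  # stuffed bit
            run = 0
    return out


def unstuff(bits):
    """Remove the 0 that follows every run of five consecutive 1s."""
    out = []
    run = 0
    expect_stuffed = False
    for b in bits:
        if expect_stuffed:
            if b != 0:
                # Six 1s in a row cannot be frame data under this rule.
                raise ValueError("six consecutive 1s: flag or error, not data")
            expect_stuffed = False
            run = 0
            continue
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 5:
            expect_stuffed = True
    return out


frame = [0, 1, 1, 1, 1, 1, 1, 0]
sent = stuff(frame)
print(sent)
print(unstuff(sent) == frame)
```

Because the stuffed stream never contains six 1s in a row, a receiver can treat such a run as a frame delimiter or as evidence of corruption.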
Assuming that each node is able to recognize the collection of bits that make up a
frame, the third problem is to determine if those bits are in fact correct, or if they have
possibly been corrupted in transit. This is the error detection problem, and we looked
at three different approaches: cyclic redundancy check, two-dimensional parity, and
checksums. Of these, the CRC approach gives the strongest guarantees and is the most
widely used at the link level.
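The CRC computation itself is polynomial long division with exclusive-OR in place of subtraction. The following sketch works on strings of '0' and '1' characters for readability; the function names and the example message and generator are illustrative assumptions, and a real implementation would operate on bytes, typically in hardware.

```python
def mod2_remainder(bits, generator):
    """Remainder of `bits` divided by `generator` using modulo-2 arithmetic.
    `generator` must start with '1' and `bits` must be longer than the remainder."""
    k = len(generator) - 1              # degree of the generator polynomial
    work = [int(b) for b in bits]
    gen = [int(b) for b in generator]
    for i in range(len(work) - k):
        if work[i]:                     # leading 1: subtract (XOR) the generator here
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return "".join(str(b) for b in work[-k:])


def crc_remainder(message, generator):
    """CRC bits the sender appends: remainder of message followed by k zeros."""
    k = len(generator) - 1
    return mod2_remainder(message + "0" * k, generator)


message, generator = "10011010", "1101"   # assumed example; 1101 is x^3 + x^2 + 1
crc = crc_remainder(message, generator)
codeword = message + crc
print(crc)
print(mod2_remainder(codeword, generator))       # all zeros when no error occurred
print(mod2_remainder("0" + codeword[1:], generator))  # nonzero after flipping the first bit
```

The receiver divides the received codeword by the same generator; a nonzero remainder signals that the frame was corrupted.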
Given that some frames will arrive at the destination node containing errors
and thus will have to be discarded, the next problem is how to recover from such
losses. The goal is to make the link appear reliable. The general approach to this
problem is called ARQ and involves using a combination of acknowledgments and
timeouts. We looked at three specific ARQ algorithms: stop-and-wait, sliding window,
and concurrent channels. What makes these algorithms interesting is how effectively
they use the link, with the goal being to keep the pipe full.
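What "keeping the pipe full" means can be quantified with the delay × bandwidth product. The sketch below uses hypothetical function names and an assumed example link; it compares how many frames a sliding window must keep outstanding with the fraction of the link that stop-and-wait can use.

```python
import math


def frames_to_fill_pipe(bandwidth_bps, rtt_s, frame_bits):
    """Frames that must be outstanding to cover one RTT of link capacity."""
    return math.ceil(bandwidth_bps * rtt_s / frame_bits)


def stop_and_wait_utilization(bandwidth_bps, rtt_s, frame_bits):
    """Fraction of link capacity used when one frame is sent, then the
    sender waits a full RTT for the acknowledgment before sending again."""
    transmit_s = frame_bits / bandwidth_bps
    return transmit_s / (transmit_s + rtt_s)


# Assumed example link: 1 Mbps, 0.5-s round-trip time, 8000-bit frames.
print(frames_to_fill_pipe(1_000_000, 0.5, 8000))
print(round(stop_and_wait_utilization(1_000_000, 0.5, 8000), 4))
```

On this assumed link, stop-and-wait leaves more than 98% of the capacity idle, which is the motivation for allowing several frames to be in flight at once.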
The final problem is not relevant to point-to-point links, but it is the central
issue in multiple-access links: how to mediate access to a shared link so that all nodes
eventually have a chance to transmit their data. In this case, we looked at three different media access protocols—Ethernet, token ring, and wireless—which have been
put to practical use in building local area networks. What these technologies have in
common is that control over the network is distributed over all the nodes connected
to the network; there is no dependence on a central arbitrator.
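As an illustration of distributed control, here is a minimal sketch of the exponential backoff each Ethernet station runs on its own after a collision. It assumes the usual truncated rule: after the nth collision a station waits k slot times, with k drawn uniformly from 0 to 2^n − 1, the exponent is capped at 10, and the station gives up at the 16th collision. The names are hypothetical, and the 51.2-μs slot time is the value for 10-Mbps Ethernet.

```python
import random

SLOT_TIME_US = 51.2   # slot time for 10-Mbps Ethernet
MAX_EXPONENT = 10     # assumed cap on the backoff exponent
MAX_ATTEMPTS = 16     # assumed attempt limit before the frame is dropped


def backoff_slots(collisions, rng=random):
    """Number of slot times to wait after the given number of collisions."""
    if collisions < 1:
        raise ValueError("backoff only applies after at least one collision")
    if collisions >= MAX_ATTEMPTS:
        raise RuntimeError("too many collisions: giving up on this frame")
    exponent = min(collisions, MAX_EXPONENT)
    return rng.randrange(2 ** exponent)   # uniform over 0 .. 2^exponent - 1


def backoff_delay_us(collisions, rng=random):
    """Backoff delay in microseconds."""
    return backoff_slots(collisions, rng) * SLOT_TIME_US


rng = random.Random(1)   # seeded only to make the demonstration repeatable
for n in (1, 2, 3):
    print(n, backoff_delay_us(n, rng))
```

Each station makes this choice independently, so no central arbitrator is needed to spread retransmissions out in time.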
We concluded the chapter by observing that, in practice, most of the algorithms
that address these five problems are implemented on the adaptor that connects the
host to the link. It turns out that the design of this adaptor is of critical importance in
how well the network, as a whole, performs.
Open Issue: Does It Belong in Hardware?
One of the most important questions in the design of any computer system
is, What belongs in hardware and what belongs in software? In the case
of networking, the network adaptor finds itself at the heart of this question. For example, why is the Ethernet algorithm, presented in Section 2.6 of this chapter, typically implemented on the
network adaptor, while the higher-level protocols discussed later in this book are not?
It is certainly possible to put a general-purpose microprocessor on the network
adaptor, which gives you the opportunity to move high-level protocols there, such as
TCP/IP. The reason that this is typically not done is complicated, but it comes down to
the economics of computer design: The host processor is usually the fastest processor
on a computer, and it would be a shame if this fast host processor had to wait for a
slower adaptor processor to run TCP/IP when it could have done the job faster itself.
On the flip side, some protocol processing does belong on the network adaptor. The
general rule of thumb is that any processing for which a fixed processor can keep pace
with the link speed—that is, a faster processor would not improve the situation—is a
good candidate for being moved to the adaptor. In other words, any function that is
already limited by the link speed, as opposed to the processor at the end of the link,
might be effectively implemented on the adaptor.
Historically, the decision as to what functionality belonged on the network adaptor and what belonged on the host computer was a complex one that generated quite
a body of research. In modern systems, it is almost always the case that the MAC layer
and below are performed by the adaptor, while the IP layer and above are performed
on the host. Interestingly, however, the same debate about how much hardware assistance is required above the MAC layer continues in the design of switches and routers,
which is a topic for the next two chapters.
Independent of exactly what protocols are implemented on the network adaptor,
generally the data will eventually find its way onto the main computer, and when it
does, the efficiency with which the data is moved between the adaptor and the computer’s memory is very important. Recall from Section 2.9.3 that memory bandwidth—
the rate at which data can be moved from one memory location to another—has the
potential to be a limiting factor in how a workstation-class machine performs. An
inefficient host/adaptor data transfer mechanism can, therefore, limit the throughput rate seen by application programs running on the host. First, there is the issue of
whether DMA or programmed I/O is used; each has advantages in different situations.
Second, there is the issue of how well the network adaptor is integrated with the operating system’s buffer mechanism; a carefully integrated system is usually able to avoid
copying data at a higher level of the protocol graph, thereby improving application-to-application throughput.
Further Reading
One of the most important contributions in computer networking over the last 20 years
is the original paper by Metcalfe and Boggs (1976) introducing the Ethernet. Many
years later, Boggs, Mogul, and Kent (1988) reported their practical experiences with
Ethernet, debunking many of the myths that had found their way into the literature
over the years. Both papers are must reading. The third and fourth papers discuss the
issues involved in integrating high-speed network adaptors with system software.
■ Metcalfe, R., and D. Boggs. Ethernet: Distributed packet switching for local
computer networks. Communications of the ACM 19(7):395–403, July 1976.
■ Boggs, D., J. Mogul, and C. Kent. Measured capacity of an Ethernet. Proceedings of the SIGCOMM ’88 Symposium, pages 222–234, August 1988.
■ Metcalfe, R. Computer/network interface design lessons from Arpanet and Ethernet. IEEE Journal on Selected Areas in Communications (JSAC) 11(2):173–180, February 1993.
■ Druschel, P., M. Abbott, M. Pagels, and L. L. Peterson. Network subsystem
design. IEEE Network (Special Issue on End-System Support for High Speed
Networks) 7(4):8–17, July 1993.
There are countless textbooks with a heavy emphasis on the lower levels of
the network hierarchy, with a particular focus on telecommunications—networking
from the phone company’s perspective. Books by Spragins et al. [SHP91] and Minoli
[Min93] are two good examples. Several other books concentrate on various local area
network technologies. Of these, Stallings’s book is the most comprehensive [Sta00b],
while Jain gives a thorough description of FDDI [Jai94]. Jain’s book also gives a good
introduction to the low-level details of optical communication. Also, a comprehensive
overview of FDDI can be found in Ross’s article [Ros86].
For an introduction to information theory, Blahut’s book is a good place to start
[Bla87], along with Shannon’s seminal paper on link capacity [Sha48].
For a general introduction to the mathematics behind error codes, Rao and
Fujiwara [RF89] is recommended. For a detailed discussion of the mathematics of
CRCs in particular, along with some more information about the hardware used to
calculate them, see Peterson and Brown [PB61].
On the topic of network adaptor design, much work was done in the early 1990s
by researchers trying to connect hosts to networks running at higher and higher rates.
In addition to the two examples given in the reading list, see Traw and Smith [TS93],
Ramakrishnan [Ram93], Edwards et al. [EWL+94], Druschel et al. [DPD94], Kanakia
and Cheriton [KC88], Cohen et al. [CFFD93], and Steenkiste [Ste94a]. Recently, a new
generation of interface cards, ones that utilize network processors, is coming onto
the market. Spalink et al. demonstrate how these new processors can be programmed
to implement various network functionality [SKPG01].
For general information on computer architecture, Hennessy and Patterson’s
book [HP02] is an excellent reference.
Finally, we recommend the following live reference:
■ status of various IEEE network-related standards
Exercises
1 Show the NRZ, Manchester, and NRZI encodings for the bit pattern shown in
Figure 2.46. Assume that the NRZI signal starts out low.
Figure 2.46 Diagram for Exercise 1. Bits: 1 0 0 1 1 1 1 1 0 0 0 1 0 0 0 1
2 Show the 4B/5B encoding, and the resulting NRZI signal, for the following bit
sequence:
1110 0101 0000 0011
3 Show the 4B/5B encoding, and the resulting NRZI signal, for the following bit
sequence:
1101 1110 1010 1101 1011 1110 1110 1111
4 In the 4B/5B encoding (Table 2.4), only two of the 5-bit codes used end in two 0s.
How many possible 5-bit sequences are there (used by the existing code or not)
that meet the stronger restriction of having at most one leading and at most one
trailing 0? Could all 4-bit sequences be mapped to such 5-bit sequences?
5 Assuming a framing protocol that uses bit stuffing, show the bit sequence transmitted over the link when the frame contains the following bit sequence:
Mark the stuffed bits.
6 Suppose the following sequence of bits arrives over a link:
Show the resulting frame after any stuffed bits have been removed. Indicate any
errors that might have been introduced into the frame.
7 Suppose the following sequence of bits arrives over a link:
Show the resulting frame after any stuffed bits have been removed. Indicate any
errors that might have been introduced into the frame.
8 Suppose you want to send some data using the BISYNC framing protocol, and
the last 2 bytes of your data are DLE and ETX. What sequence of bytes would be
transmitted immediately prior to the CRC?
9 For each of the following framing protocols, give an example of a byte/bit sequence
that should never appear in a transmission.
(b) HDLC
10 Assume that a SONET receiver resynchronizes its clock whenever a 1 bit appears;
otherwise, the receiver samples the signal in the middle of what it believes is the
bit’s time slot.
(a) What relative accuracy of the sender’s and receiver’s clocks is required in order
to receive correctly 48 zero bytes (one ATM AAL5 cell’s worth) in a row?
(b) Consider a forwarding station A on a SONET STS-1 line, receiving frames
from the downstream end B and retransmitting them upstream. What relative
accuracy of A’s and B’s clocks is required to keep A from accumulating more
than one extra frame per minute?
11 Show that two-dimensional parity allows detection of all 3-bit errors.
12 Give an example of a 4-bit error that would not be detected by two-dimensional
parity, as illustrated in Figure 2.16. What is the general set of circumstances under
which 4-bit errors will be undetected?
13 Show that two-dimensional parity provides the receiver enough information to
correct any 1-bit error (assuming the receiver knows only 1 bit is bad), but not
any 2-bit error.
14 Show that the Internet checksum will never be 0xFFFF (that is, the final value of
sum will not be 0x0000) unless every byte in the buffer is 0. (Internet specifications
in fact require that a checksum of 0x0000 be transmitted as 0xFFFF; the value
0x0000 is then reserved for an omitted checksum. Note that, in ones complement
arithmetic, 0x0000 and 0xFFFF are both representations of the number 0.)
15 Prove the Internet checksum computation shown in the text is independent of
byte order (host order or network order) except that the bytes in the final checksum should be swapped later to be in the correct order. Specifically, show that
the sum of 16-bit word integers can be computed in either byte order. For example, if the ones complement sum (denoted by +′) of 16-bit words is represented as
[A, B] +′ [C, D] +′ · · · +′ [Y, Z]
the following swapped sum is the same as the original sum above:
[B, A] +′ [D, C] +′ · · · +′ [Z, Y]
16 Suppose that one byte in a buffer covered by the Internet checksum algorithm needs
to be decremented (e.g., a header hop count field). Give an algorithm to compute
the revised checksum without rescanning the entire buffer. Your algorithm should
consider whether the byte in question is low order or high order.
17 Show that the Internet checksum can be computed by first taking the 32-bit ones
complement sum of the buffer in 32-bit units, then taking the 16-bit ones complement sum of the upper and lower halfwords, and finishing as before by complementing the result. (To take a 32-bit ones complement sum on 32-bit twos
complement hardware, you need access to the “overflow” bit.)
18 Suppose we want to transmit the message 11001001 and protect it from errors
using the CRC polynomial x^3 + 1.
(a) Use polynomial long division to determine the message that should be transmitted.
(b) Suppose the leftmost bit of the message is inverted due to noise on the transmission link. What is the result of the receiver’s CRC calculation? How does
the receiver know that an error has occurred?
19 Suppose we want to transmit the message 1011 0010 0100 1011 and protect it from
errors using the CRC-8 polynomial x^8 + x^2 + x^1 + 1.
(a) Use polynomial long division to determine the message that should be transmitted.
(b) Suppose the leftmost bit of the message is inverted due to noise on the transmission link. What is the result of the receiver’s CRC calculation? How does
the receiver know that an error has occurred?
20 The CRC algorithm as presented in this chapter requires lots of bit manipulations.
It is, however, possible to do polynomial long division taking multiple bits at a
time, via a table-driven method, that enables efficient software implementations of
CRC. We outline the strategy here for long division 3 bits at a time (see Table 2.6);
in practice we would divide 8 bits at a time, and the table would have 256 entries.
Let the divisor polynomial C = C(x) be x^3 + x^2 + 1, or 1101. To build the table
for C, we take each 3-bit sequence, p, append three trailing 0s, and then find the
quotient q = p⌢000 ÷ C, ignoring the remainder. The third column is the product
C × q, the first 3 bits of which should equal p.
(a) Verify, for p = 110, that the quotients p⌢000 ÷ C and p⌢111 ÷ C are the
same; that is, it doesn’t matter what the trailing bits are.
(b) Fill in the missing entries in the table.
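p      q = p⌢000 ÷ C    C × q
000    000              000 000
001    001              001 101
010    ___              010 ___
011    ___              011 ___
100    111              100 011
101    110              101 110
110    ___              110 ___
111    ___              111 ___
Table 2.6 Table-driven CRC calculation.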
(c) Use the table to divide 101 001 011 001 100 by C. Hint: The first 3 bits of
the dividend are p = 101, so from the table the corresponding first 3 bits
of the quotient are q = 110. Write the 110 above the second 3 bits of the dividend, and subtract C × q = 101 110, again from the table, from the first 6 bits
of the dividend. Keep going in groups of 3 bits. There should be no remainder.
21 With 1 parity bit we can detect all 1-bit errors. Show that at least one generalization
fails, as follows:
(a) Show that if messages m are 8 bits long, then there is no error detection code
e = e(m) of size 2 bits that can detect all 2-bit errors. Hint: Consider the set
M of all 8-bit messages with a single 1 bit; note that any message from M can
be transmuted into any other with a 2-bit error, and show that some pair of
messages m1 and m2 in M must have the same error code e.
(b) Find an N (not necessarily minimal) such that no 32-bit error detection code
applied to N-bit blocks can detect all errors altering up to 8 bits.
22 Consider an ARQ protocol that uses only negative acknowledgments (NAKs),
but no positive acknowledgments (ACKs). Describe what timeouts would need
to be scheduled. Explain why an ACK-based protocol is usually preferred to a
NAK-based protocol.
23 Consider an ARQ algorithm running over a 20-km point-to-point fiber link.
(a) Compute the propagation delay for this link, assuming that the speed of light
is 2 × 10^8 m/s in the fiber.
(b) Suggest a suitable timeout value for the ARQ algorithm to use.
(c) Why might it still be possible for the ARQ algorithm to time out and retransmit
a frame, given this timeout value?
24 Suppose you are designing a sliding window protocol for a 1-Mbps point-to-point
link to the moon, which has a one-way latency of 1.25 seconds. Assuming that
each frame carries 1 KB of data, what is the minimum number of bits you need
for the sequence number?
25 Suppose you are designing a sliding window protocol for a 1-Mbps point-to-point
link to a stationary satellite revolving around the earth at 3 × 10^4 km altitude.
Assuming that each frame carries 1 KB of data, what is the minimum number of
bits you need for the sequence number in the following cases? Assume the speed
of light is 3 × 10^8 m/s.
(a) RWS = 1
(b) RWS = SWS
26 The text suggests that the sliding window protocol can be used to implement flow
control. We can imagine doing this by having the receiver delay ACKs, that is, not
send the ACK until there is free buffer space to hold the next frame. In doing so,
each ACK would simultaneously acknowledge the receipt of the last frame and
tell the source that there is now free buffer space available to hold the next frame.
Explain why implementing flow control in this way is not a good idea.
27 Implicit in the stop-and-wait scenarios of Figure 2.19 is the notion that the receiver will retransmit its ACK immediately on receipt of the duplicate data frame.
Suppose instead that the receiver keeps its own timer and retransmits its ACK
only after the next expected frame has not arrived within the timeout interval.
Draw timelines illustrating the scenarios in Figure 2.19(b)–(d); assume the receiver’s timeout value is twice the sender’s. Also redraw (c) assuming the receiver’s
timeout value is half the sender’s.
28 In stop-and-wait transmission, suppose that both sender and receiver retransmit
their last frame immediately on receipt of a duplicate ACK or data frame; such
a strategy is superficially reasonable because receipt of such a duplicate is most
likely to mean the other side has experienced a timeout.
(a) Draw a timeline showing what will happen if the first data frame is somehow
duplicated, but no frame is lost. How long will the duplications continue?
This situation is known as the Sorcerer’s Apprentice bug.
(b) Suppose that, like data, ACKs are retransmitted if there is no response within
the timeout period. Suppose also that both sides use the same timeout interval.
Identify a reasonably likely scenario for triggering the Sorcerer’s Apprentice bug.
29 Give some details of how you might augment the sliding window protocol with
flow control by having ACKs carry additional information that reduces the SWS
as the receiver runs out of buffer space. Illustrate your protocol with a timeline
for a transmission; assume the initial SWS and RWS are 4, the link speed is instantaneous, and the receiver can free buffers at the rate of one per second (i.e., the
receiver is the bottleneck). Show what happens at T = 0, T = 1, . . . , T = 4 seconds.
30 Describe a protocol combining the sliding window algorithm with selective ACKs.
Your protocol should retransmit promptly, but not if a frame simply arrives one or
two positions out of order. Your protocol should also make explicit what happens
if several consecutive frames are lost.
31 Draw a timeline diagram for the sliding window algorithm with SWS = RWS =
3 frames for the following two situations. Use a timeout interval of about 2 × RTT.
(a) Frame 4 is lost.
(b) Frames 4–6 are lost.
32 Draw a timeline diagram for the sliding window algorithm with SWS = RWS = 4
frames for the following two situations. Assume the receiver sends a duplicate
acknowledgement if it does not receive the expected frame. For example, it sends
DUPACK[2] when it expects to see FRAME[2] but receives FRAME[3] instead. Also,
the receiver sends a cumulative acknowledgment after it receives all the outstanding frames. For example, it sends ACK[5] when it receives the lost frame FRAME[2]
after it already received FRAME[3], FRAME[4], and FRAME[5]. Use a timeout interval of about 2 × RTT.
(a) Frame 2 is lost. Retransmission takes place upon timeout (as usual).
(b) Frame 2 is lost. Retransmission takes place either upon receipt of the first
DUPACK or upon timeout. Does this scheme reduce the transaction time? Note
that some end-to-end protocols (e.g., variants of TCP) use a similar scheme
for fast retransmission.
33 Suppose that we attempt to run the sliding window algorithm with SWS = RWS =
3 and with MaxSeqNum = 5. The Nth packet DATA[N] thus actually contains N
mod 5 in its sequence number field. Give an example in which the algorithm
becomes confused; that is, a scenario in which the receiver expects DATA[5] and
accepts DATA[0]—which has the same transmitted sequence number—in its stead.
No packets may arrive out of order. Note this implies MaxSeqNum ≥ 6 is necessary
as well as sufficient.
34 Consider the sliding window algorithm with SWS = RWS = 3, with no out-of-order arrivals, and with infinite-precision sequence numbers.
(a) Show that if DATA[6] is in the receive window, then DATA[0] (or in general
any older data) cannot arrive at the receiver (and hence that MaxSeqNum = 6
would have sufficed).
(b) Show that if ACK[6] may be sent (or, more literally, that DATA[5] is in the
sending window), then ACK[2] (or earlier) cannot be received.
These amount to a proof of the formula given in Section 2.5.2, particularized to
the case SWS = 3. Note that part (b) implies that the scenario of the previous
problem cannot be reversed to involve a failure to distinguish ACK[0] and ACK[5].
Figure 2.47 Diagram for Exercises 36–38.
35 Suppose that we run the sliding window algorithm with SWS = 5 and RWS = 3,
and no out-of-order arrivals.
(a) Find the smallest value for MaxSeqNum. You may assume that it suffices to
find the smallest MaxSeqNum such that if DATA[MaxSeqNum] is in the receive
window, then DATA[0] can no longer arrive.
(b) Give an example showing that MaxSeqNum − 1 is not sufficient.
(c) State a general rule for the minimum MaxSeqNum in terms of SWS and RWS.
36 Suppose A is connected to B via an intermediate router R, as shown in Figure 2.47.
The A–R and R–B links each accept and transmit only one packet per second in
each direction (so two packets take 2 seconds), and the two directions transmit independently. Assume A sends to B using the sliding window protocol with
SWS = 4.
(a) For Time = 0, 1, 2, 3, 4, 5, state what packets arrive at and leave each node,
or label them on a timeline.
(b) What happens if the links have a propagation delay of 1.0 seconds, but accept
immediately as many packets as are offered (i.e., latency = 1 second but
bandwidth is infinite)?
37 Suppose A is connected to B via an intermediate router R, as in the previous
problem. The A–R link is instantaneous, but the R–B link transmits only one
packet each second, one at a time (so two packets take 2 seconds). Assume A sends
to B using the sliding window protocol with SWS = 4. For Time = 0, 1, 2, 3, 4,
state what packets arrive at and are sent from A and B. How large does the queue
at R grow?
38 Consider the situation in the previous exercise, except this time assume that the
router has a queue size of 1; that is, it can hold one packet in addition to the one
it is sending (in each direction). Let A’s timeout be 5 seconds, and let SWS again
be 4. Show what happens at each second from T = 0 until all four packets from
the first windowful are successfully delivered.
39 Why is it important for protocols configured on top of the Ethernet to have a
length field in their header, indicating how long the message is?
40 What kinds of problems can arise when two hosts on the same Ethernet share
the same hardware address? Describe what happens and why that behavior is a problem.
41 The 1982 Ethernet specification allowed between any two stations up to 1500 m
of coaxial cable, 1000 m of other point-to-point link cable, and two repeaters.
Each station or repeater connects to the coaxial cable via up to 50 m of “drop
cable.” Typical delays associated with each device are given in Table 2.7 (where
c = speed of light in a vacuum = 3 × 10^8 m/s). What is the worst-case round-trip
propagation delay, measured in bits, due to the sources listed? (This list is not
complete; other sources of delay include sense time and signal rise time.)
42 Coaxial cable Ethernet was limited to a maximum of 500 m between repeaters,
which regenerate the signal to 100% of its original amplitude. Along one 500-m
segment, the signal could decay to no less than 14% of its original value (8.5 dB).
Along 1500 m, then, the decay might be (0.14)^3 = 0.3%. Such a signal, even
along 2500 m, is still strong enough to be read; why then are repeaters required
every 500 m?
Item               Delay
Coaxial cable      propagation speed .77c
Link/drop cable    propagation speed .65c
Repeaters          approximately 0.6 μs each
Transceivers       approximately 0.2 μs each
Table 2.7 Typical delays associated with various devices (Exercise 41).
43 Suppose the round-trip propagation delay for Ethernet is 46.4 μs. This yields a
minimum packet size of 512 bits (464 bits corresponding to propagation delay +
48 bits of jam signal).
(a) What happens to the minimum packet size if the delay time is held constant,
and the signalling rate rises to 100 Mbps?
(b) What are the drawbacks to so large a minimum packet size?
(c) If compatibility were not an issue, how might the specifications be written so
as to permit a smaller minimum packet size?
44 Let A and B be two stations attempting to transmit on an Ethernet. Each has a
steady queue of frames ready to send; A’s frames will be numbered A1, A2, and
so on, and B’s similarly. Let T = 51.2 μs be the exponential backoff base unit.
Suppose A and B simultaneously attempt to send frame 1, collide, and happen to
choose backoff times of 0 × T and 1 × T, respectively, meaning A wins the race
and transmits A1 while B waits. At the end of this transmission, B will attempt
to retransmit B1 while A will attempt to transmit A2. These first attempts will
collide, but now A backs off for either 0 × T or 1 × T, while B backs off for time
equal to one of 0 × T, ..., 3 × T.
(a) Give the probability that A wins this second backoff race immediately after
this first collision; that is, A’s first choice of backoff time k × 51.2 is less than
B’s.
(b) Suppose A wins this second backoff race. A transmits A3, and when it is
finished, A and B collide again as A tries to transmit A4 and B tries once
more to transmit B1. Give the probability that A wins this third backoff race
immediately after the first collision.
(c) Give a reasonable lower bound for the probability that A wins all the remaining
backoff races.
(d) What then happens to the frame B1?
This scenario is known as the Ethernet capture effect.
45 Suppose the Ethernet transmission algorithm is modified as follows: After each successful transmission attempt, a host waits one or two slot times before attempting
to transmit again, and otherwise backs off the usual way.
(a) Explain why the capture effect of the previous exercise is now much less likely.
(b) Show how the strategy above can now lead to a pair of hosts capturing the
Ethernet, alternating transmissions, and locking out a third.
(c) Propose an alternative approach, for example, by modifying the exponential
backoff. What aspects of a station’s history might be used as parameters to
the modified backoff?
46 Ethernets use Manchester encoding. Assuming that hosts sharing the Ethernet are
not perfectly synchronized, why does this allow collisions to be detected soon after
they occur, without waiting for the CRC at the end of the packet?
47 Suppose A, B, and C all make their first carrier sense, as part of an attempt to
transmit, while a fourth station D is transmitting. Draw a timeline showing one
possible sequence of transmissions, attempts, collisions, and exponential backoff
choices. Your timeline should also meet the following criteria: (i) initial transmission attempts should be in the order A, B, C, but successful transmissions should
be in the order C, B, A, and (ii) there should be at least four collisions.
48 Repeat the previous exercise, now with the assumption that Ethernet is p-persistent
with p = 0.33 (that is, a waiting station transmits immediately with probability
p when the line goes idle, and otherwise defers one 51.2-μs slot time and repeats
the process). Your timeline should meet criterion (i) of the previous problem, but
in lieu of criterion (ii), you should show at least one collision and at least one run
of four deferrals on an idle line. Again, note that many solutions are possible.
49 Suppose Ethernet physical addresses are chosen at random (using true random bits).
(a) What is the probability that on a 1024-host network, two addresses will be
the same?
(b) What is the probability that the above event will occur on some one or more
of 2^20 networks?
(c) What is the probability that of the 2^30 hosts in all the networks of (b), some
pair has the same address?
Hint: The calculation for (a) and (c) is a variant of that used in solving the so-called Birthday Problem: Given N people, what is the probability that two of their
birthdays (addresses) will be the same? The second person has probability $1 - \frac{1}{365}$
of having a different birthday from the first, the third has probability $1 - \frac{2}{365}$
of having a different birthday from the first two, and so on. The probability all
birthdays are different is thus
$$\left(1 - \frac{1}{365}\right) \times \left(1 - \frac{2}{365}\right) \times \cdots \times \left(1 - \frac{N-1}{365}\right)$$
which for smallish N is about
$$1 - \frac{1 + 2 + \cdots + (N - 1)}{365}$$
50 Suppose five stations are waiting for another packet to finish on an Ethernet. All
transmit at once when the packet is finished and collide.
(a) Simulate this situation up until the point when one of the five waiting stations
succeeds. Use coin flips or some other genuine random source to determine
backoff times. Make the following simplifications: Ignore interframe spacing,
ignore variability in collision times (so that retransmission is always after an
exact integral multiple of the 51.2-μs slot time), and assume that each collision
uses up exactly one slot time.
(b) Discuss the effect of the listed simplifications in your simulation versus the
behavior you might encounter on a real Ethernet.
51 Write a program to implement the simulation discussed above, this time with N
stations waiting to transmit. Again model time as an integer, T, in units of slot
times, and again treat collisions as taking one slot time (so a collision at time
T followed by a backoff of k = 0 would result in a retransmission attempt at
time T + 1). Find the average delay before one station transmits successfully, for
N = 20, N = 40, and N = 100. Does your data support the notion that the delay
is linear in N? Hint: For each station, keep track of that station’s NextTimeToSend
and CollisionCount. You are done when you reach a time T for which there is only
one station with NextTimeToSend == T. If there is no such station, increment T. If
there are two or more, schedule the retransmissions and try again.
52 Suppose that N Ethernet stations, all trying to send at the same time, require N/2
slot times to sort out who transmits next. Assuming the average packet size is 5
slot times, express the available bandwidth as a function of N.
53 Consider the following Ethernet model. Transmission attempts are at random
times with an average spacing of λ slot times; specifically, the interval between
consecutive attempts is an exponential random variable x = −λ log u, where
u is chosen randomly in the interval 0 ≤ u ≤ 1. An attempt at time t results in a collision if there is another attempt in the range from t − 1 to t + 1,
where t is measured in units of the 51.2-μs slot time; otherwise the attempt succeeds.
(a) Write a program to simulate, for a given value of λ, the average number of slot
times needed before a successful transmission, called the contention interval.
Find the minimum value of the contention interval. Note that you will have
to find one attempt past the one that succeeds, in order to determine if there
was a collision. Ignore retransmissions, which probably do not fit the random
model above.
(b) The Ethernet alternates between contention intervals and successful transmissions. Suppose the average successful transmission lasts 8 slot times (512
bytes). Using your minimum length of the contention interval from above,
what fraction of the theoretical 10-Mbps bandwidth is available for transmissions?
54 What conditions would have to hold for a corrupted frame to circulate forever on
a token ring without a monitor? How does the monitor fix this problem?
55 An IEEE 802.5 token ring has five stations and a total wire length of 230 m. How
many bits of delay must the monitor insert into the ring? Do this for both 4 Mbps
and 16 Mbps; use a propagation rate of $2.3 \times 10^8$ m/s.
56 Consider a token ring network like FDDI in which a station is allowed to hold the
token for some period of time (the token holding time, or THT). Let RingLatency
denote the time it takes the token to make one complete rotation around the
network when none of the stations have any data to send.
(a) In terms of THT and RingLatency, express the efficiency of this network when
only a single station is active.
(b) What setting of THT would be optimal for a network that had only one station
active (with data to send) at a time?
(c) In the case where N stations are active, give an upper bound on the token
rotation time, or TRT, for the network.
57 Consider a token ring with a ring latency of 200 μs. Assuming that the delayed
token release strategy is used, what is the effective throughput rate that can be
achieved if the ring has a bandwidth of 4 Mbps? What is the effective throughput
rate that can be achieved if the ring has a bandwidth of 100 Mbps? Answer for
both a single active host and for “many” hosts; for the latter, assume there are
sufficiently many hosts transmitting that the time spent advancing the token can
be ignored. Assume a packet size of 1 KB.
58 For a 100-Mbps token ring network with a token rotation time of 200 μs that
allows each station to transmit one 1-KB packet each time it possesses the token,
calculate the maximum effective throughput rate that any one host can achieve.
Do this assuming (a) immediate release and (b) delayed release.
59 Suppose a 100-Mbps delayed-release token ring has 10 stations, a ring latency of
30 μs, and an agreed-upon TTRT of 350 μs.
(a) How many synchronous frame bytes could each station send, assuming all are
allocated the same amount?
(b) Assume stations A, B, C are in increasing order on the ring. Due to uniform
synchronous traffic, the TRT without asynchronous data is 300 μs. B sends a
200-μs (2.5-Kb) asynchronous frame. What TRT will A, B, and C then see on
their next measurement? Who may transmit such a frame next?
Packet Switching
Nature seems . . . to reach many of her ends by long circuitous routes.
—Rudolph Lotze
Not All Networks Are Directly Connected
The directly connected networks described in the previous chapter suffer from
two limitations. First, there is a limit to how many hosts can be attached. For
example, only two hosts can be attached to a point-to-point link, and an Ethernet can connect up to only 1024 hosts. Second, there is a limit to how large of a
geographic area a single network can serve. For example, an Ethernet can span only
2500 m, and even though point-to-point links can be quite long, they do
not really serve the area between the two ends. Since our goal is to build
networks that can be global in scale, the next problem is therefore to enable
communication between hosts that are not directly connected.
This problem is not unlike the one addressed in the telephone network: Your
phone is not directly connected to every person you might want to call, but instead
is connected to an exchange that contains a switch. It is the switches that create the
impression that you have a connection to the person at the other end of the call. Similarly, computer networks use packet switches (as distinct from the circuit switches
used for telephony) to enable packets to travel from one host to another, even when
no direct connection exists between those hosts. This chapter introduces the major
concepts of packet switching, which lies at the heart of computer networking.
A packet switch is a device with several inputs and outputs leading to and from
the hosts that the switch interconnects. The core job of a switch is to take packets
that arrive on an input and forward (or switch) them to the right output so that they
will reach their appropriate destination. There are a variety of ways that the switch
can determine the “right” output for a packet, which can be broadly categorized as
connectionless and connection-oriented approaches.
A key problem that a switch must deal with is the
finite bandwidth of its outputs. If packets destined for a
certain output arrive at a switch and their arrival rate exceeds the capacity of that output, then we have a problem
of contention. The switch queues (buffers) packets until
the contention subsides, but if it lasts too long, the switch
will run out of buffer space and be forced to discard packets. When packets are discarded too frequently, the switch
is said to be congested. The ability of a switch to handle
contention is a key aspect of its performance, and many
high-performance switches use exotic hardware to reduce
the effects of contention.
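The difference between contention (packets wait in a queue) and congestion (packets are discarded) can be illustrated with a toy model. This is only a sketch: the `OutputPort` class, the buffer size, and the service rate of one packet per tick are assumptions chosen for illustration, not a description of any real switch.

```python
from collections import deque

class OutputPort:
    """Toy output port: finite FIFO buffer, drains one packet per tick."""

    def __init__(self, buffer_slots):
        self.buffer_slots = buffer_slots
        self.queue = deque()
        self.dropped = 0
        self.sent = 0

    def enqueue(self, packet):
        # Contention: the packet waits if there is room; otherwise it is discarded.
        if len(self.queue) < self.buffer_slots:
            self.queue.append(packet)
        else:
            self.dropped += 1

    def tick(self):
        # The output link transmits at most one queued packet per tick.
        if self.queue:
            self.queue.popleft()
            self.sent += 1

def simulate(arrivals_per_tick, ticks, buffer_slots):
    """Offer a fixed number of packets per tick and return the port afterwards."""
    port = OutputPort(buffer_slots)
    for t in range(ticks):
        for i in range(arrivals_per_tick):
            port.enqueue((t, i))
        port.tick()
    return port
```

With one arrival per tick the port keeps up and nothing is lost. With two arrivals per tick the queue fills after a few ticks, and from then on every tick discards a packet.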
This chapter introduces the issues of forwarding and
contention in packet switches. We begin by considering the
various approaches to switching, including the connectionless and connection-oriented models. We then examine two particular technologies in detail. The first is LAN
switching, which has evolved from Ethernet bridging to
become one of the dominant technologies in today’s LAN
environments. The second noteworthy switching technology is asynchronous transfer mode (ATM), which is popular among telecommunications service providers in wide
area networks. Finally, we consider some of the aspects
of switch design that must be taken into account when
building large-scale networks.
3 Packet Switching
3.1 Switching and Forwarding
In the simplest terms, a switch is a mechanism that allows us to interconnect links to
form a larger network. A switch is a multi-input, multi-output device, which transfers
packets from an input to one or more outputs. Thus, a switch adds the star topology (see Figure 3.1) to the point-to-point link, bus (Ethernet), and ring (802.5 and
FDDI) topologies established in the last chapter. A star topology has several attractive properties:
■ Even though a switch has a fixed number of inputs and outputs, which limits
the number of hosts that can be connected to a single switch, large networks
can be built by interconnecting a number of switches.
■ We can connect switches to each other and to hosts using point-to-point links,
which typically means that we can build networks of large geographic scope.
■ Adding a new host to the network by connecting it to a switch does not
necessarily mean that the hosts already connected will get worse performance
from the network.
This last claim cannot be made for the shared-media networks discussed in the
last chapter. For example, it is impossible for two hosts on the same Ethernet to transmit
continuously at 10 Mbps because they share the same transmission medium. Every host
on a switched network has its own link to the switch, so it may be entirely possible for
many hosts to transmit at the full link speed (bandwidth), provided that the switch is
designed with enough aggregate capacity. Providing high aggregate throughput is one
of the design goals for a switch; we return to this topic below. In general, switched
networks are considered more scalable (i.e., more capable of growing to large numbers
of nodes) than shared-media networks because of this ability to support many hosts
at full speed.
Figure 3.1 A switch provides a star topology.
Figure 3.2 Example protocol graph running on a switch.
Figure 3.3 Example switch with three input and output ports.
A switch is connected to a set of links and, for each of these links, runs the
appropriate data link protocol to communicate with the node at the other end of the
link. A switch’s primary job is to receive incoming packets on one of its links and to
transmit them on some other link. This function is sometimes referred to as either
switching or forwarding, and in terms of the OSI architecture, it is the main function
of the network layer. Figure 3.2 shows the protocol graph that would run on a switch
that is connected to two T3 links and one STS-1 SONET link. A representation of this
same switch is given in Figure 3.3. In this figure, we have split the input and output
halves of each link, and we refer to each input or output as a port. (In general, we
assume that each link is bidirectional, and hence supports both input and output.) In
other words, this example switch has three input ports and three output ports.
The question then is, How does the switch decide which output port to place
each packet on? The general answer is that it looks at the header of the packet
for an identifier that it uses to make the decision. The details of how it uses this
identifier vary, but there are two common approaches. The first is the datagram or
connectionless approach. The second is the virtual circuit or connection-oriented approach. A third approach, source routing, is less common than these other two, but it
is simple to explain and does have some useful applications.
One thing that is common to all networks is that we need to have a way to identify the end nodes. Such identifiers are usually called addresses. We have already seen
examples of addresses in the previous chapter, for example, the 48-bit address used
for Ethernet. The only requirement for Ethernet addresses is that no two nodes on a
network have the same address. This is accomplished by making sure that all Ethernet
cards are assigned a globally unique identifier. For the following discussions, we assume
that each host has a globally unique address. Later on, we consider other useful properties that an address might have, but global uniqueness is adequate to get us started.
Another assumption that we need to make is that there is some way to identify the
input and output ports of each switch. There are at least two sensible ways to identify
ports: One is to number each port, and the other is to identify the port by the name of
the node (switch or host) to which it leads. For now, we use numbering of the ports.
Datagrams
The idea behind datagrams is incredibly simple: You just make sure that every packet
contains enough information to enable any switch to decide how to get it to its destination. That is, every packet contains the complete destination address. Consider
the example network illustrated in Figure 3.4, in which the hosts have addresses A,
B, C, and so on. To decide how to forward a packet, a switch consults a forwarding
table (sometimes called a routing table), an example of which is depicted in Table 3.1.
Table 3.1 Forwarding table for switch 2.
Figure 3.4 Datagram forwarding: an example network.
This particular table shows the forwarding information that switch 2 needs to forward
datagrams in the example network. It is pretty easy to figure out such a table when you
have a complete map of a simple network like that depicted here; we could imagine a
network operator configuring the tables statically. It is a lot harder to create the forwarding tables in large, complex networks with dynamically changing topologies and
multiple paths between destinations. That harder problem is known as routing and is
the topic of Section 4.2. We can think of routing as a process that takes place in the
background so that, when a data packet turns up, we will have the right information
in the forwarding table to be able to forward, or switch, the packet.
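A forwarding table in the datagram model can be sketched as a plain mapping from destination address to output port. The entries below are hypothetical placeholders and are not taken from Table 3.1; the helper names `forward` and `update_route` are likewise assumptions made for this illustration.

```python
# Hypothetical forwarding table: destination address -> output port number.
# These entries are illustrative assumptions, not the contents of Table 3.1.
FORWARDING_TABLE = {"A": 1, "B": 2, "C": 0}

def forward(table, packet):
    """Return the output port for a packet, or None if no entry exists."""
    # A datagram carries its complete destination address in the header.
    return table.get(packet["dst"])

def update_route(table, dst, port):
    """Stand-in for routing, which fills in the table in the background."""
    table[dst] = port
```

Forwarding is a single lookup per packet. Routing is the separate, harder job of deciding what the table should contain.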
Connectionless (datagram) networks have the following characteristics:
■ A host can send a packet anywhere at any time, since any packet that turns
up at a switch can be immediately forwarded (assuming a correctly populated
forwarding table). As we will see, this contrasts with most connection-oriented
networks, in which some “connection state” needs to be established before
the first data packet is sent.
■ When a host sends a packet, it has no way of knowing if the network is capable
of delivering it or if the destination host is even up and running.
■ Each packet is forwarded independently of previous packets that might have
been sent to the same destination. Thus, two successive packets from host A
to host B may follow completely different paths (perhaps because of a change
in the forwarding table at some switch in the network).
■ A switch or link failure might not have any serious effect on communication
if it is possible to find an alternate route around the failure and to update the
forwarding table accordingly.
This last fact is particularly important to the history of datagram networks. One
of the important goals of the ARPANET, forerunner to the Internet, was to develop
networking technology that would be robust in a military environment, where you
might expect links and nodes to fail because of active attacks such as bombing. It was
the ability to route around failures that led to a datagram-based design.
Virtual Circuit Switching
A widely used technique for packet switching, which differs significantly from the
datagram model, uses the concept of a virtual circuit (VC). This approach, which is also
called a connection-oriented model, requires that we first set up a virtual connection
from the source host to the destination host before any data is sent. To understand how
this works, consider Figure 3.5, where host A again wants to send packets to host B.
We can think of this as a two-stage process. The first stage is “connection setup.” The
second is data transfer. We consider each in turn.
Figure 3.5 An example of a virtual circuit network.
In the connection setup phase, it is necessary to establish “connection state” in
each of the switches between the source and destination hosts. The connection state
for a single connection consists of an entry in a “VC table” in each switch through
which the connection passes. One entry in the VC table on a single switch contains
■ a virtual circuit identifier (VCI) that uniquely identifies the connection at this
switch and that will be carried inside the header of the packets that belong to
this connection
■ an incoming interface on which packets for this VC arrive at the switch
■ an outgoing interface on which packets for this VC leave the switch
■ a potentially different VCI that will be used for outgoing packets
The semantics of one such entry is as follows: If a packet arrives on the designated
incoming interface and that packet contains the designated VCI value in its header,
then that packet should be sent out the specified outgoing interface with the specified
outgoing VCI value first having been placed in its header.
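The semantics of a VC table entry can be sketched as a lookup keyed on the pair (incoming interface, incoming VCI). The single entry below uses the values from the switch 1 example later in this section (interface 2, VCI 5 in; interface 1, VCI 11 out). The dictionary layout and the `switch_packet` helper are assumptions made for illustration.

```python
# VC table for one switch:
# (incoming interface, incoming VCI) -> (outgoing interface, outgoing VCI).
VC_TABLE = {(2, 5): (1, 11)}

def switch_packet(vc_table, in_interface, packet):
    """Look up the entry, rewrite the VCI, and return (out_interface, packet)."""
    entry = vc_table.get((in_interface, packet["vci"]))
    if entry is None:
        return None  # no such connection at this switch, so the packet is dropped
    out_interface, out_vci = entry
    # The outgoing VCI is placed in the header before the packet is sent.
    rewritten = dict(packet, vci=out_vci)
    return out_interface, rewritten
```

Because the key includes the incoming interface, the same VCI value can be reused on different links without ambiguity.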
Note that the combination of the VCI of packets as they are received at the switch
and the interface on which they are received uniquely identifies the virtual connection.
There may of course be many virtual connections established in the switch at one
time. Also, we observe that the incoming and outgoing VCI values are generally not
the same. Thus, the VCI is not a globally significant identifier for the connection;
rather, it has significance only on a given link—that is, it has link local scope.
Whenever a new connection is created, we need to assign a new VCI for that
connection on each link that the connection will traverse. We also need to ensure that
the chosen VCI on a given link is not currently in use on that link by some existing connection.
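One simple way to satisfy the "not currently in use on that link" requirement is to keep a set of active VCIs per link and hand out the lowest free value. The text allows any unused value, so the lowest-free policy, the `max_vci` bound, and the function names here are assumptions made for this sketch.

```python
def allocate_vci(in_use, max_vci=1023):
    """Return the lowest VCI not in `in_use` on this link and mark it used."""
    # max_vci is an arbitrary illustrative bound, not a value from the text.
    for vci in range(max_vci + 1):
        if vci not in in_use:
            in_use.add(vci)
            return vci
    raise RuntimeError("no free VCI on this link")

def release_vci(in_use, vci):
    """Free a VCI when its connection is torn down."""
    in_use.discard(vci)
```

Each link keeps its own `in_use` set, which reflects the link-local scope of VCIs.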
There are two broad classes of approach to establishing connection state. One is
to have a network administrator configure the state, in which case the virtual circuit is
“permanent.” Of course, it can also be deleted by the administrator, so a permanent
virtual circuit (PVC) might best be thought of as a long-lived or administratively
configured VC. Alternatively, a host can send messages into the network to cause the
state to be established. This is referred to as signalling, and the resulting virtual circuits
are said to be switched. The salient characteristic of a switched virtual circuit (SVC)
is that a host may set up and delete such a VC dynamically without the involvement
of a network administrator. Note that an SVC should more accurately be called a
“signalled” VC, since it is the use of signalling (not switching) that distinguishes an
SVC from a PVC.
Let’s assume that a network administrator wants to manually create a new virtual
connection from host A to host B. First, the administrator needs to identify a path
through the network from A to B. In the example network of Figure 3.5, there is only
one such path, but in general this may not be the case. The administrator then picks a
VCI value that is currently unused on each link for the connection. For the purposes
of our example, let’s suppose that the VCI value 5 is chosen for the link from host A
to switch 1, and that 11 is chosen for the link from switch 1 to switch 2. In that case,
switch 1 needs to have an entry in its VC table configured as shown in Table 3.2(a).
Table 3.2 Virtual circuit table entries for (a) switch 1, (b) switch 2, and (c) switch 3.
Similarly, suppose that the VCI of 7 is chosen to identify this connection on
the link from switch 2 to switch 3, and that a VCI of 4 is chosen for the link from
switch 3 to host B. In that case, switches 2 and 3 need to be configured with VC table
entries as shown in Table 3.2. Note that the “outgoing” VCI value at one switch is the
“incoming” VCI value at the next switch.
Once the VC tables have been set up, the data transfer phase can proceed, as
illustrated in Figure 3.6. For any packet that it wants to send to host B, A puts the VCI
value of 5 in the header of the packet and sends it to switch 1. Switch 1 receives any
such packet on interface 2, and it uses the combination of the interface and the VCI
in the packet header to find the appropriate VC table entry. As shown in Table 3.2,
the table entry in this case tells switch 1 to forward the packet out of interface 1 and
to put the VCI value 11 in the header when the packet is sent. Thus, the packet will
arrive at switch 2 on interface 3 bearing VCI 11. Switch 2 looks up interface 3 and VCI
11 in its VC table (as shown in Table 3.2) and sends the packet on to switch 3 after
updating the VCI value in the packet header appropriately, as shown in Figure 3.7.
This process continues until the packet arrives at host B with the VCI value of 4 in its header.
To host B, this identifies the packet as having come from host A.
Figure 3.6 A packet is sent into a virtual circuit network.
Figure 3.7 A packet makes its way through a virtual circuit network.
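The hop-by-hop VCI rewriting described above can be traced with a short sketch. The VCIs 5, 11, 7, and 4, the interfaces 2 and 1 at switch 1, and incoming interface 3 at switch 2 come from the text. The remaining interface numbers are assumptions made only so the example is complete, and `trace_vcis` is a hypothetical helper name.

```python
# Each hop: (switch, incoming interface, incoming VCI, outgoing interface, outgoing VCI).
# Switch 2's outgoing interface and both of switch 3's interfaces are assumed values.
PATH = [
    ("switch 1", 2, 5, 1, 11),
    ("switch 2", 3, 11, 2, 7),
    ("switch 3", 0, 7, 1, 4),
]

def trace_vcis(path, first_vci):
    """Follow a packet along the path and return the VCI it carries on each link."""
    vci = first_vci
    seen = [vci]
    for name, _in_if, in_vci, _out_if, out_vci in path:
        if vci != in_vci:
            raise LookupError(f"{name}: no VC table entry for VCI {vci}")
        vci = out_vci  # the switch rewrites the header before forwarding
        seen.append(vci)
    return seen

if __name__ == "__main__":
    print(trace_vcis(PATH, 5))
```

The trace makes the chaining rule visible: the outgoing VCI at one switch is the incoming VCI at the next.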
In real networks of reasonable size, the burden of configuring VC tables correctly in a large number of switches would quickly become excessive using the above
procedures. Thus, some sort of signalling is almost always used, even when setting up
“permanent” VCs. In the case of PVCs, signalling is initiated by the network administrator, while SVCs are usually set up using signalling by one of the hosts. We consider
now how the same VC just described could be set up by signalling from the host.
To start the signalling process, host A sends a setup message into the network,
that is, to switch 1. The setup message contains, among other things, the complete
destination address of host B. The setup message needs to get all the way to B to create
the necessary connection state in every switch along the way. We can see that getting
the setup message to B is a lot like getting a datagram to B, in that the switches have
to know which output to send the setup message to so that it eventually reaches B.
For now, let’s just assume that the switches know enough about the network topology
to figure out how to do that, so that the setup message flows on to switches 2 and 3
before finally reaching host B.
When switch 1 receives the connection request, in addition to sending it on to
switch 2, it creates a new entry in its virtual circuit table for this new connection. This
entry is exactly the same as shown previously in Table 3.2. The main difference is that
now the task of assigning an unused VCI value on the interface is performed by the
switch. In this example, the switch picks the value 5. The virtual circuit table now
has the following information: “When packets arrive on port 2 with identifier 5, send
them out on port 1.” Another issue is that, somehow, host A will need to learn that it
should put the VCI value of 5 in packets that it wants to send to B; we will see how
that happens below.
When switch 2 receives the setup message, it performs a similar process; in this
example it picks the value 11 as the incoming VCI value. Similarly, switch 3 picks
7 as the value for its incoming VCI. Each switch can pick any number it likes, as
long as that number is not currently in use for some other connection on that port of
that switch. As noted above, VCIs have “link local scope”; that is, they have no global significance.
Finally, the setup message arrives at host B. Assuming that B is healthy and willing
to accept a connection from host A, it too allocates an incoming VCI value, in this
case 4. This VCI value can be used by B to identify all packets coming from host A.
Now, to complete the connection, everyone needs to be told what their downstream neighbor is using as the VCI for this connection. Host B sends an acknowledgment of the connection setup to switch 3 and includes in that message the VCI that it
chose (4). Now switch 3 can complete the virtual circuit table entry for this connection,
since it knows the outgoing value must be 4. Switch 3 sends the acknowledgment on to
switch 2, specifying a VCI of 7. Switch 2 sends the message on to switch 1, specifying
a VCI of 11. Finally, switch 1 passes the acknowledgment on to host A, telling it to
use the VCI of 5 for this connection.
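The two passes of signalling (setup travelling toward B, acknowledgment travelling back toward A) can be sketched as follows. This is a simplified model under stated assumptions: each node's choice of incoming VCI is supplied by a caller-provided function, routing of the setup message is taken as given, and `signal_setup` is a hypothetical name.

```python
def signal_setup(nodes, pick_vci):
    """Simulate the setup (forward) and acknowledgment (backward) passes.

    nodes: names in path order, ending with the destination host.
    pick_vci: callable(name) -> incoming VCI chosen by that node.
    Returns (state, source_vci), where state maps each node name to its
    incoming and outgoing VCI, and source_vci is what the source host must use.
    """
    state = {}
    # Forward pass: each node that receives the setup message picks an incoming VCI.
    for name in nodes:
        state[name] = {"in_vci": pick_vci(name), "out_vci": None}
    # Backward pass: the acknowledgment tells each node its downstream neighbor's VCI.
    for upstream, downstream in reversed(list(zip(nodes, nodes[1:]))):
        state[upstream]["out_vci"] = state[downstream]["in_vci"]
    # The first switch reports its own incoming VCI to the source host.
    return state, state[nodes[0]]["in_vci"]

if __name__ == "__main__":
    choices = {"switch 1": 5, "switch 2": 11, "switch 3": 7, "host B": 4}
    state, vci_for_a = signal_setup(list(choices), choices.get)
    print(vci_for_a, state)
```

Using the values chosen in the text, switch 1 ends up with incoming VCI 5 and outgoing VCI 11, and host A is told to use VCI 5.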
At this point, everyone knows all that is necessary to allow traffic to flow from
host A to host B. Each switch has a complete virtual circuit table entry for the connection. Furthermore, host A has a firm acknowledgment that everything is in place
all the way to host B. At this point, the connection table entries are in place in all
three switches just as in the administratively configured example above, but the whole
process happened automatically in response to the signalling message sent from A.
The data transfer phase can now begin and is identical to that used in the PVC case.
When host A no longer wants to send data to host B, it tears down the connection
by sending a teardown message to switch 1. The switch removes the relevant entry from
its table and forwards the message on to the other switches in the path, which similarly
delete the appropriate table entries. At this point, if host A were to send a packet with
a VCI of 5 to switch 1, it would be dropped as if the connection had never existed.
There are several things to note about virtual circuit switching:
■ Since host A has to wait for the connection request to reach the far side of the
network and return before it can send its first data packet, there is at least one
RTT of delay before data is sent.[1]
■ While the connection request contains the full address for host B (which might
be quite large, being a global identifier on the network), each data packet
contains only a small identifier, which is only unique on one link. Thus, the
per-packet overhead caused by the header is reduced relative to the datagram model.
■ If a switch or a link in a connection fails, the connection is broken and a new
one will need to be established. Also, the old one needs to be torn down to
free up table storage space in the switches.
■ The issue of how a switch decides which link to forward the connection request
on has been glossed over. In essence, this is the same problem as building up
the forwarding table for datagram forwarding, which requires some sort of
routing algorithm. Routing is described in Section 4.2, and the algorithms
described there are generally applicable to routing setup requests as well as datagrams.
One of the nice aspects of virtual circuits is that by the time the host gets the
go-ahead to send data, it knows quite a lot about the network—for example, that
there really is a route to the receiver and that the receiver is willing and able to receive
data. It is also possible to allocate resources to the virtual circuit at the time it is
established. For example, an X.25 network—a packet-switched network that uses the
connection-oriented model—employs the following three-part strategy:
1 Buffers are allocated to each virtual circuit when the circuit is initialized.
2 The sliding window protocol is run between each pair of nodes along the virtual
circuit, and this protocol is augmented with flow control to keep the sending
node from overrunning the buffers allocated at the receiving node.
3 The circuit is rejected by a given node if not enough buffers are available at that
node when the connection request message is processed.
[1] This is not strictly true. Some people have proposed “optimistically” sending a data packet immediately after
sending the connection request. However, most current implementations wait for connection setup to complete
before sending data.
In doing these three things, each node is ensured of having the buffers it needs to queue
the packets that arrive on that circuit. This basic strategy is usually called hop-by-hop
flow control.
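The buffer-reservation part of this strategy (steps 1 and 3) can be sketched as a per-node admission check. The sliding window protocol of step 2 is not modeled here. The class and function names, and the decision to roll back reservations at upstream nodes when a downstream node rejects the circuit, are assumptions of this sketch rather than details from X.25.

```python
class Node:
    """Toy node that reserves buffers per virtual circuit at setup time."""

    def __init__(self, total_buffers):
        self.free_buffers = total_buffers
        self.reserved = {}  # circuit id -> buffers held for that circuit

    def admit(self, circuit_id, buffers_needed):
        # Reject the circuit if this node cannot reserve enough buffers.
        if buffers_needed > self.free_buffers:
            return False
        self.free_buffers -= buffers_needed
        self.reserved[circuit_id] = buffers_needed
        return True

    def release(self, circuit_id):
        # Return the circuit's buffers to the free pool (no-op if unknown).
        self.free_buffers += self.reserved.pop(circuit_id, 0)

def setup_circuit(path, circuit_id, buffers_needed):
    """Admit the circuit at every node on the path, or roll back and fail."""
    admitted = []
    for node in path:
        if not node.admit(circuit_id, buffers_needed):
            for n in admitted:
                n.release(circuit_id)  # assumed cleanup, not specified in the text
            return False
        admitted.append(node)
    return True
```

Because a circuit is admitted only when every node can reserve its buffers, an established circuit never finds a node without room for its packets.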
By comparison, a datagram network has no connection establishment phase,
and each switch processes each packet independently, making it less obvious how
a datagram network would allocate resources in a meaningful way. Instead, each
arriving packet competes with all other
packets for buffer space. If there are no free
Introduction to Congestion
buffers, the incoming packet must be disRecall the distinction between concarded. We observe, however, that even in
tention and congestion: Contention
a datagram-based network, a source host
occurs when multiple packets have
often sends a sequence of packets to the
to be queued at a switch besame destination host. It is possible for each
cause they are competing for the
switch to distinguish among the set of packsame output link, while congestion
ets it currently has queued, based on the
means that the switch has so many
source/destination pair, and thus for the
packets queued that it runs out of
switch to ensure that the packets belonging
buffer space and has to start dropto each source/destination pair are receivping packets. We return to the topic
ing a fair share of the switch’s buffers. We
of congestion in Chapter 6, after
discuss this idea in much greater depth in
we have seen the transport protoChapter 6.
col component of the network arIn the virtual circuit model, we could
chitecture. At this point, however,
imagine providing each circuit with a differwe observe that the decision as to
ent quality of service (QoS). In this setting,
whether your network uses virtual
the term “quality of service” is usually taken
to mean that the network gives the user some kind of performance-related guarantee, which
in turn implies that switches set aside the resources they need to meet this guarantee.
For example, the switches along a given virtual circuit might allocate a percentage of
each outgoing link’s bandwidth to that circuit. As another example, a sequence of switches
might ensure that packets belonging to a particular circuit not be delayed (queued) for
more than a certain amount of time. We return to the topic of quality of service in
Section 6.5.
The most popular examples of virtual circuit technologies are Frame Relay and
asynchronous transfer mode (ATM). ATM has a number of interesting properties that we
discuss in Section 3.3. Frame Relay is a rather straightforward implementation of virtual
circuit technology, and its simplicity has made it extremely popular. Many network service
providers offer Frame Relay PVC services. One of the applications of Frame Relay is the
construction of virtual private networks (VPNs), a subject discussed in Section 4.1.8.
Frame Relay provides some basic quality of service and congestion-avoidance features,
but these are rather lightweight compared to X.25 and ATM. The Frame Relay packet format
(see Figure 3.8) provides a good example of a packet used for virtual circuit switching.
Figure 3.8 Frame Relay packet format.

circuits or datagrams has an impact on how you deal with congestion.
On the one hand, suppose that each switch allocates enough buffers to handle the packets
belonging to each virtual circuit it supports, as is done in an X.25 network. In this case,
the network has defined away the problem of congestion: a switch never encounters a
situation in which it has more packets to queue than it has buffer space, since it does
not allow the connection to be established in the first place unless it can dedicate
enough resources to it to avoid this situation. The problem with this approach, however,
is that it is extremely conservative; it is unlikely that all the circuits will need to
use all of their buffers at the same time, and as a consequence, the switch is potentially
underutilized.
On the other hand, the datagram model seemingly invites congestion: you do not know that
there is enough contention at a switch to cause congestion until you run out of buffers.
At that point, it is too late to prevent the congestion, and your only choice is to try
to recover from it. The good news, of course, is that you may be able to get better
utilization out of your switches since you are not holding buffers in reserve for a
worst-case scenario that is unlikely to happen.
As is quite often the case, nothing is strictly black and white; there are design
advantages for defining congestion away (as the X.25 model does) and for doing nothing
about congestion until after it happens (as the simple datagram model does). We describe
some of these design points in Chapter 6.
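The X.25-style reservation described in the congestion discussion above can be sketched as a simple admission check at connection setup. This is only an illustration under assumed names: the Switch structure, its fields, and admitCircuit are hypothetical and are not taken from X.25 or from this book's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one switch (names are assumptions). */
typedef struct {
    int totalBuffers;   /* buffers the switch owns */
    int reserved;       /* buffers already dedicated to admitted circuits */
} Switch;

/* Admit a new virtual circuit only if its worst-case buffer need can be
 * dedicated to it; otherwise refuse the setup request. */
bool
admitCircuit(Switch *s, int buffersNeeded)
{
    assert(s != NULL && buffersNeeded > 0);
    if (s->totalBuffers - s->reserved < buffersNeeded)
        return false;               /* refuse rather than risk congestion */
    s->reserved += buffersNeeded;
    return true;
}
```

Because every admitted circuit holds its worst-case reservation, the switch can turn away new circuits while most of its buffers sit idle, which is the underutilization the text points out.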
Source Routing
A third approach to switching that uses neither virtual circuits nor conventional
datagrams is known as source routing. The name derives from the fact that all the
information about network topology that is required to switch a packet across the network
is provided by the source host.
There are various ways to implement source routing. One would be to assign a number to
each output of each switch and to place that number in the header of the packet. The
switching function is then very simple: For each packet that arrives on an input, the
switch would read the port number in the header and transmit the packet on that output.
However, since there will in general be more than one switch in the path between the
sending and the receiving host, the header for the packet needs to contain enough
information to allow every switch in the path to determine which output the packet needs
to be placed on. One way to do this would be to put an ordered list of switch ports in the
header and to rotate the list so that the next switch in the path is always at the front
of the list. Figure 3.9 illustrates this idea.
Figure 3.9 Source routing in a switched network (where the switch reads the rightmost number).
In this example, the packet needs to traverse three switches to get from host A
to host B. At switch 1, it needs to exit on port 1, at the next switch it needs to exit
at port 0, and at the third switch it needs to exit at port 3. Thus, the original header
when the packet leaves host A contains the list of ports (3, 0, 1), where we assume
that each switch reads the rightmost element of the list. To make sure that the next
switch gets the appropriate information, each switch rotates the list after it has read
its own entry. Thus, the packet header as it leaves switch 1 en route to switch 2 is now
(1, 3, 0); switch 2 performs another rotation and sends out a packet with (0, 1, 3) in
the header. Although not shown, switch 3 performs yet another rotation, restoring the
header to what it was when host A sent it.
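The rotation just described can be sketched in a few lines of C. The SourceRoute layout, the MAX_HOPS bound, and the function name nextPortRotate are assumptions made for illustration; the text does not define a concrete header format.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HOPS 8   /* assumed upper bound on path length */

/* Hypothetical header: an ordered list of output ports. Each switch reads
 * the rightmost (last) entry, as in the example above. */
typedef struct {
    size_t len;               /* number of ports in the list */
    int    ports[MAX_HOPS];   /* ports[len - 1] is the entry for this switch */
} SourceRoute;

/* Return this switch's output port, then rotate the list right by one so
 * that the next switch finds its own entry in the rightmost position. */
int
nextPortRotate(SourceRoute *h)
{
    assert(h != NULL && h->len > 0 && h->len <= MAX_HOPS);
    int out = h->ports[h->len - 1];
    for (size_t i = h->len - 1; i > 0; i--)
        h->ports[i] = h->ports[i - 1];
    h->ports[0] = out;
    return out;
}
```

Starting from the header (3, 0, 1), three successive calls return ports 1, 0, and 3 and leave the header as (1, 3, 0), then (0, 1, 3), then (3, 0, 1) again, matching the walk-through.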
There are several things to note about this approach. First, it assumes that host A knows
enough about the topology of the network to form a header that has all the right
directions in it for every switch in the path. This is somewhat analogous to the problem
of building the forwarding tables in a datagram network or figuring out where to send a
setup packet in a virtual circuit network. Second, observe that we cannot predict how big
the header needs to be, since it must be able to hold one word of information for every
switch on the path. This implies that headers are probably of variable length with no
upper bound, unless we can predict with absolute certainty the maximum number of switches
through which a packet will ever need to pass. Third, there are some variations on this
approach. For example, rather than rotate the header, each switch could just strip the
first element as it uses it. Rotation has an advantage over stripping, however: Host B
gets a copy of the complete header, which may help it figure out how to get back to
host A. Yet another alternative is to have the header carry a pointer to the current
“next port” entry, so that each switch just updates the pointer rather than rotating the
header; this may be more efficient to implement. We show these three approaches in
Figure 3.10. In each case, the entry that this switch needs to read is A, and the entry
that the next switch needs to read is B.
Figure 3.10 Three ways to handle headers for source routing: (a) rotation; (b) stripping;
(c) pointer. The labels are read right to left.
Source routing can be used in both datagram networks and virtual circuit networks. For
example, the Internet Protocol, which is a datagram protocol, includes a source route
option that allows selected packets to be source routed, while the majority are switched
as conventional datagrams. Source routing is also used in some virtual circuit networks as
the means to get the initial setup request along the path from source to destination.
Finally, we note that source routing suffers from a scaling problem. In any reasonably
large network, it is very hard for a host to get the complete path information it needs to
construct correct headers.

Optical Switching
To a casual observer of the networking industry around the year 2000, it might have
appeared that the most interesting sort of switching was optical switching. Indeed,
optical switching did become an important technology in the late 1990s, due to a
confluence of several factors. One factor was the commercial availability of dense
wavelength division multiplexing (DWDM) equipment, which makes it possible to send a great
deal of information down a single fiber by transmitting on a large number of optical
wavelengths (or colors) at once. Thus, for example, you might send data on 100 or more
different wavelengths, and each wavelength might carry as much as 10 Gbps of data.
A second factor was the commercial availability of optical amplifiers. Optical signals are
attenuated as they pass through fiber, and after some distance (about 40 km or so) they
need to be made stronger in some way. Before optical amplifiers, it was necessary to place
repeaters in the path to recover the optical signal, convert it to a digital electronic
signal, and then convert it back to optical again. Before you could get the data into a
repeater, you would have to demultiplex it using a DWDM terminal. Thus, a large number of
DWDM terminals would be needed just to drive a single fiber pair for a long distance.
Optical amplifiers, unlike repeaters, are analog devices that boost whatever signal is
sent along the fiber, even if it is sent on a hundred different wavelengths. Optical
amplifiers therefore made DWDM gear much more attractive, because now a pair of DWDM
terminals could talk to each other when separated by a distance of hundreds of kilometers.
Furthermore, you could even upgrade the DWDM gear at the ends without touching the optical
amplifiers in the middle of the path, because they will amplify 100 wavelengths as easily
as 50 wavelengths.
With DWDM and optical amplifiers, it became possible to build optical networks of huge
capacity. But at least one more type of device is needed to make these networks useful:
the optical switch. Most so-called optical switches today actually perform their switching
function electronically, and from an architectural point of view they have more in common
with the circuit switches of the telephone network than the packet switches described in
this chapter. A typical optical switch has a large number of interfaces that understand
SONET framing and is able to cross-connect a SONET channel from an incoming interface to
an outgoing interface. Thus, with an optical switch, it becomes possible to provide SONET
channels from point A to point B via point C even if there is no direct fiber path from A
to B; there just needs to be a path from A to C, a switch at C, and a path from C to B. In
this respect, an optical switch bears some relationship to the switches in Figure 3.5, in
that it creates the illusion of a connection between two points even when there is no
direct physical connection between them. However, optical switches do not provide virtual
circuits; they provide “real” circuits (e.g., a SONET channel). There are even some newer
types of optical switches that use microscopic mirrors to deflect all the light from one
switch port to another, so that there could be an uninterrupted optical channel from
point A to point B.
We don’t cover optical networking extensively in this book, in part because of space
considerations. For many practical purposes, you can think of optical networks as a piece
of the infrastructure that enables telephone companies to provide SONET links or other
types of circuits where and when you need them. However, it is worth noting that many of
the technologies that are discussed later in this book, such as routing protocols and
Multiprotocol Label Switching, do have application to the world of optical networking.

3.2 Bridges and LAN Switches
Having discussed some of the basic ideas behind switching, we now focus more closely on
some specific switching technologies. We begin by considering a class of switches that is
used to forward packets between shared-media LANs such as Ethernets. Such switches are
sometimes known by the obvious name of LAN switches; historically they have also been
referred to as bridges.
Suppose you have a pair of Ethernets that you want to interconnect. One approach you might
try is to put a repeater between them, as described in Chapter 2. This would not be a
workable solution, however, if doing so exceeded the physical limitations of the Ethernet.
(Recall that no more than four repeaters between any pair of hosts and no more than a
total of 2500 m in length is allowed.) An alternative would be to put a node between the
two Ethernets and have the node forward frames from one Ethernet to the other. This node
would be in promiscuous mode, accepting all frames transmitted on either of the Ethernets,
so it could forward them to the other.
The node we have just described is typically called a bridge, and a collection of LANs
connected by one or more bridges is usually said to form an extended LAN. In their
simplest variants, bridges simply accept LAN frames on their inputs and forward them out
on all other outputs. This simple strategy was used by early bridges, but has since been
refined to make bridges a more effective mechanism for interconnecting a set of LANs. The
rest of this section fills in the more interesting details.
Note that a bridge meets our definition of a switch from the previous section: a
multi-input, multi-output device, which transfers packets from an input to one or more
outputs. And recall that this provides a way to increase the total bandwidth of a network.
For example, while a single Ethernet segment can carry only 10 Mbps of total traffic, an
Ethernet bridge can carry as much as 10n Mbps, where n is the number of ports (inputs and
outputs) on the bridge.

Learning Bridges
The first optimization we can make to a bridge is to observe that it need not forward all
frames that it receives. Consider the bridge in Figure 3.11. Whenever a frame from host A
that is addressed to host B arrives on port 1, there is no need for the bridge to forward
the frame out over port 2. The question, then, is, How does a bridge come to learn on
which port the various hosts reside?
Figure 3.11 Illustration of a learning bridge.
One option would be to have a human download a table into the bridge similar to the one
given in Table 3.3. Then, whenever the bridge receives a frame on port 1 that is addressed
to host A, it would not forward the frame out on port 2; there would be no need because
host A would have already directly received the frame on the LAN connected to port 1.
Anytime a frame addressed to host A was received on port 2, the bridge would forward the
frame out on port 1.
Table 3.3 Forwarding table maintained by a bridge.
Note that a bridge using such a table would be using the datagram (or connectionless)
model of forwarding described in Section 3.1.1. Each packet carries a global address, and
the bridge decides which output to send a packet on by looking up that address in a table.
Having a human maintain this table is quite a burden, especially considering that there is
a simple trick by which a bridge can learn this information for itself. The idea is for
each bridge to inspect the source address in all the frames it receives. Thus, when host A
sends a frame to a host on either side of the bridge, the bridge receives this frame and
records the fact that a frame from host A was just received on port 1. In this way, the
bridge can build a table just like Table 3.3.
When a bridge first boots, this table is empty; entries are added over time. Also, a
timeout is associated with each entry, and the bridge discards the entry after a specified
period of time. This is to protect against the situation in which a host (and as a
consequence, its LAN address) is moved from one network to another. Thus, this table is
not necessarily complete. Should the bridge receive a frame that is addressed to a host
not currently in the table, it goes ahead and forwards the frame out on all the other
ports. In other words, this table is simply an optimization that filters out some frames;
it is not required for correctness.
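The per-frame decision implied by this description (filter, forward on one port, or flood) can be sketched as follows. This is a minimal illustration that leaves out entry timeouts, and every name in it (Mac, Table, learn, decide, FLOOD, FILTER) is an assumption rather than part of the book's code, which uses a Map abstraction.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 16     /* illustrative capacity */
#define FLOOD  (-1)       /* send on every port except the arrival port */
#define FILTER (-2)       /* destination is on the arrival LAN: do not forward */

typedef struct { unsigned char b[6]; } Mac;    /* 48-bit LAN address */
typedef struct { Mac addr; int port; } Entry;
typedef struct { Entry e[TABLE_SIZE]; size_t n; } Table;

/* Learn from the source address: remember the port it was last seen on. */
void
learn(Table *t, const Mac *src, int inPort)
{
    assert(t != NULL && src != NULL);
    for (size_t i = 0; i < t->n; i++) {
        if (memcmp(t->e[i].addr.b, src->b, sizeof src->b) == 0) {
            t->e[i].port = inPort;
            return;
        }
    }
    if (t->n < TABLE_SIZE) {          /* a full table simply stops learning */
        t->e[t->n].addr = *src;
        t->e[t->n].port = inPort;
        t->n++;
    }
}

/* Decide what to do with a frame for dst that arrived on inPort. */
int
decide(const Table *t, const Mac *dst, int inPort)
{
    assert(t != NULL && dst != NULL);
    for (size_t i = 0; i < t->n; i++) {
        if (memcmp(t->e[i].addr.b, dst->b, sizeof dst->b) == 0)
            return (t->e[i].port == inPort) ? FILTER : t->e[i].port;
    }
    return FLOOD;                     /* unknown destination */
}
```

An unknown destination always yields FLOOD, which is why an incomplete table affects only efficiency and never correctness.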
The code that implements the learning bridge algorithm is quite simple, and we sketch it
here. Structure BridgeEntry defines a single entry in the bridge’s forwarding table; these
are stored in a Map structure (which supports mapCreate, mapBind, and mapResolve
operations) to enable entries to be efficiently located when packets arrive from sources
already in the table. The constant MAX_TTL specifies how long an entry is kept in the
table before it is discarded.
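#define BRIDGE_TAB_SIZE 1024    /* max. size of bridging table */
#define MAX_TTL          120    /* time (in seconds) before an
                                   entry is flushed */

/* The type names in the declarations below were lost in extraction;
   u_short, Binding, int, and Map are assumed. */
typedef struct {
    MacAddr     destination;    /* MAC address of a node */
    int         ifnumber;       /* interface to reach it */
    u_short     TTL;            /* time to live */
    Binding     binding;        /* binding in the Map */
} BridgeEntry;

int     numEntries = 0;
Map     bridgeMap = mapCreate(BRIDGE_TAB_SIZE, ...);  /* remaining argument(s)
                                                         truncated in the source */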
The routine that updates the forwarding table when a new packet arrives is
given by updateTable. The arguments passed are the source MAC address contained
in the packet and the interface number on which it was received. Another routine, not
shown here, is invoked at regular intervals, scans the entries in the forwarding table,
and decrements the TTL (time to live) field of each entry, discarding any entries whose
TTL has reached 0. Note that the TTL is reset to MAX_TTL every time a packet arrives
to refresh an existing table entry, and that the interface on which the destination can
be reached is updated to reflect the most recently received packet.
void
updateTable (MacAddr src, int inif)
{
    BridgeEntry *b;

    if (mapResolve(bridgeMap, &src, (void **)&b) == FALSE)
    {
        /* this address is not in the table, so try to add it */
        if (numEntries < BRIDGE_TAB_SIZE)
        {
            b = NEW(BridgeEntry);
            b->binding = mapBind( bridgeMap, &src, b);
            /* use source address of packet as dest. address in
               table */
            b->destination = src;
            numEntries++;
        }
        else
        {
            /* can't fit this address in the table now, so give
               up */
            return;
        }
    }
    /* reset TTL and use most recent input interface */
    b->TTL = MAX_TTL;
    b->ifnumber = inif;
}
Note that this implementation adopts a simple strategy in the case where the
bridge table has become full to capacity—it simply fails to add the new address. Recall
that completeness of the bridge table is not necessary for correct forwarding; it just
optimizes performance. If there is some entry in the table that is not currently being
used, it will eventually time out and be removed, creating space for a new entry. An
alternative approach would be to invoke some sort of cache replacement algorithm
on finding the table full; for example, we might locate and remove the entry with the
smallest TTL to accommodate the new entry.
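The alternative mentioned above, evicting the entry with the smallest TTL, together with the periodic aging routine, might look roughly like the following. To stay self-contained the sketch uses a flat array instead of the Map abstraction; the Slot type, the tiny TAB_SIZE, and both function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TAB_SIZE 4      /* tiny capacity, for illustration only */
#define MAX_TTL  120    /* seconds, as in the text */

typedef struct {
    unsigned long addr; /* stand-in for a 48-bit MAC address */
    int ifnumber;       /* interface on which addr was last seen */
    int ttl;            /* remaining lifetime; 0 means the slot is free */
} Slot;

/* Aging pass, assumed to run once per second: decrement each TTL.
 * An entry whose TTL reaches 0 is treated as discarded. */
void
ageTable(Slot tab[], size_t n)
{
    assert(tab != NULL);
    for (size_t i = 0; i < n; i++)
        if (tab[i].ttl > 0)
            tab[i].ttl--;
}

/* Refresh the entry for src if present; otherwise overwrite the slot with
 * the smallest TTL. Free slots have TTL 0, so they are chosen first. */
void
updateWithEviction(Slot tab[], size_t n, unsigned long src, int inif)
{
    assert(tab != NULL && n > 0);
    size_t victim = 0;
    for (size_t i = 0; i < n; i++) {
        if (tab[i].ttl > 0 && tab[i].addr == src) {
            victim = i;             /* existing entry: refresh it in place */
            break;
        }
        if (tab[i].ttl < tab[victim].ttl)
            victim = i;
    }
    tab[victim].addr = src;
    tab[victim].ifnumber = inif;
    tab[victim].ttl = MAX_TTL;
}
```

Evicting the smallest TTL approximates least-recently-seen replacement, since the TTL is reset whenever a source sends a frame.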
Spanning Tree Algorithm
The preceding strategy works just fine until the extended LAN has a loop in it, in
which case it fails in a horrible way—frames potentially loop through the extended
LAN forever. This is easy to see in the example depicted in Figure 3.12, where, for
example, bridges B1, B4, and B6 form a loop. How does an extended LAN come to
have a loop in it? One possibility is that the network is managed by more than one
administrator, for example, because it spans multiple departments in an organization.
In such a setting, it is possible that no single person knows the entire configuration of
the network, meaning that a bridge that closes a loop might be added without anyone
knowing. A second, more likely scenario is that loops are built into the network on
purpose—to provide redundancy in case of failure.
Whatever the cause, bridges must be able to correctly handle loops. This problem
is addressed by having the bridges run a distributed spanning tree algorithm. If you
think of the extended LAN as being represented by a graph that possibly has loops
(cycles), then a spanning tree is a subgraph of this graph that covers (spans) all the
vertices, but contains no cycles. That is, a spanning tree keeps all of the vertices of the
original graph, but throws out some of the edges. For example, Figure 3.13 shows a
cyclic graph on the left and one of possibly many spanning trees on the right.
Figure 3.12 Extended LAN with loops.
Figure 3.13 Example of (a) a cyclic graph; (b) a corresponding spanning tree.
The spanning tree algorithm, which was developed by Radia Perlman at Digital, is a
protocol used by a set of bridges to agree upon a spanning tree for a particular extended
LAN. (The IEEE 802.1 specification for LAN bridges is based on this algorithm.) In
practice, this means that each bridge decides the ports over which it is and is not
willing to forward frames. In a sense, it is by removing ports from the topology that the
extended LAN is reduced to an acyclic tree (see footnote 2). It is even possible that an
entire bridge will not participate in forwarding frames, which seems strange when you
consider that the one reason we intentionally have loops in the network in the first place
is to provide redundancy. The algorithm is dynamic, however, meaning that the bridges are
always prepared to reconfigure themselves into a new spanning tree should some bridge fail.
Footnote 2: Representing an extended LAN as an abstract graph is a bit awkward. Basically,
you let both the bridges and the LANs correspond to the vertices of the graph and the
ports correspond to the graph’s edges. However, the spanning tree we are going to compute
for this graph needs to span only those nodes that correspond to networks. It is possible
that nodes corresponding to bridges will be disconnected from the rest of the graph. This
corresponds to a situation in which all the ports connecting a bridge to various networks
get removed by the algorithm.
The main idea of the spanning tree is for the bridges to select the ports over
which they will forward frames. The algorithm selects ports as follows. Each bridge
has a unique identifier; for our purposes, we use the labels B1, B2, B3, and so on. The
algorithm first elects the bridge with the smallest id as the root of the spanning tree;
exactly how this election takes place is described below. The root bridge always forwards frames out over all of its ports. Next, each bridge computes the shortest path
to the root and notes which of its ports is on this path. This port is also selected as the
bridge’s preferred path to the root. Finally, all the bridges connected to a given LAN
elect a single designated bridge that will be responsible for forwarding frames toward
the root bridge. Each LAN’s designated bridge is the one that is closest to the root, and
if two or more bridges are equally close to the root, then the bridges’ identifiers are
used to break ties; the smallest id wins. Of course, each bridge is connected to more
than one LAN, so it participates in the election of a designated bridge for each LAN
it is connected to. In effect, this means that each bridge decides if it is the designated
bridge relative to each of its ports. The bridge forwards frames over those ports for
which it is the designated bridge.
Figure 3.14 shows the spanning tree that corresponds to the extended LAN
shown in Figure 3.12. In this example, B1 is the root bridge, since it has the smallest id.
Notice that both B3 and B5 are connected to LAN A, but B5 is the designated bridge
since it is closer to the root. Similarly, both B5 and B7 are connected to LAN B, but
in this case, B5 is the designated bridge since it has the smaller id; both are an equal
distance from B1.
While it is possible for a human to look at the extended LAN given in Figure 3.12
and to compute the spanning tree given in Figure 3.14 according to the rules given
above, the bridges in an extended LAN do not have the luxury of being able to see
the topology of the entire network, let alone peek inside other bridges to see their ids.
Instead, the bridges have to exchange configuration messages with each other and then
decide whether or not they are the root or a designated bridge based on these messages.
Specifically, the configuration messages contain three pieces of information:
1 the id for the bridge that is sending the message
2 the id for what the sending bridge believes to be the root bridge
3 the distance, measured in hops, from the sending bridge to the root bridge
Each bridge records the current “best” configuration message it has seen on each of
its ports (“best” is defined below), including both messages it has received from other
bridges and messages that it has itself transmitted.
Figure 3.14 Spanning tree with some ports not selected.
Initially, each bridge thinks it is the root, and so it sends a configuration message
out on each of its ports identifying itself as the root and giving a distance to the root of
0. Upon receiving a configuration message over a particular port, the bridge checks to
see if that new message is better than the current best configuration message recorded
for that port. The new configuration message is considered “better” than the currently
recorded information if
■ it identifies a root with a smaller id or
■ it identifies a root with an equal id but with a shorter distance or
■ the root id and distance are equal, but the sending bridge has a smaller id.
If the new message is better than the currently recorded information, the bridge discards
the old information and saves the new information. However, it first adds 1 to the
distance-to-root field since the bridge is one hop farther away from the root than the
bridge that sent the message.
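The three-part comparison and the add-one step can be captured in a short sketch. The ConfigMsg layout and the names isBetter and relay are assumptions for illustration; real bridge protocol messages carry more fields than the three listed in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three pieces of information carried by a configuration message. */
typedef struct {
    int root;       /* id of the bridge believed to be the root */
    int distance;   /* hops from the sending bridge to that root */
    int sender;     /* id of the sending bridge */
} ConfigMsg;

/* True if cand is "better" than best: smaller root id, then shorter
 * distance, then smaller sender id. */
bool
isBetter(const ConfigMsg *cand, const ConfigMsg *best)
{
    assert(cand != NULL && best != NULL);
    if (cand->root != best->root)
        return cand->root < best->root;
    if (cand->distance != best->distance)
        return cand->distance < best->distance;
    return cand->sender < best->sender;
}

/* What bridge selfId records and passes on after accepting msg: the same
 * root, one hop farther away, with itself as the sender. */
ConfigMsg
relay(const ConfigMsg *msg, int selfId)
{
    assert(msg != NULL);
    ConfigMsg out = { msg->root, msg->distance + 1, selfId };
    return out;
}
```

For example, a bridge with id 3 that initially advertises itself as root at distance 0 would find a message naming root 2 at distance 0 better, and relay would then give it root 2 at distance 1 with sender 3.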
When a bridge receives a configuration message indicating that it is not the root
bridge—that is, a message from a bridge with a smaller id—the bridge stops generating
configuration messages on its own and instead only forwards configuration messages
from other bridges, after first adding 1 to the distance field. Likewise, when a bridge
receives a configuration message that indicates it is not the designated bridge for that
port—that is, a message from a bridge that is closer to the root or equally far from
the root but with a smaller id—the bridge stops sending configuration messages over
that port. Thus, when the system stabilizes, only the root bridge is still generating
configuration messages, and the other bridges are forwarding these messages only
over ports for which they are the designated bridge.
To make this more concrete, consider what would happen in Figure 3.14 if the
power had just been restored to the building housing this network, so that all the
bridges boot at about the same time. All the bridges would start off by claiming to
be the root. We denote a configuration message from node X in which it claims to
be distance d from root node Y as (Y, d, X). Focusing on the activity at node B3, a
sequence of events would unfold as follows:
1 B3 receives (B2, 0, B2).
2 Since 2 < 3, B3 accepts B2 as root.
3 B3 adds one to the distance advertised by B2 (0) and thus sends (B2, 1, B3)
toward B5.
4 Meanwhile, B2 accepts B1 as root because it has the lower id, and it sends
(B1, 1, B2) toward B3.
5 B5 accepts B1 as root and sends (B1, 1, B5) toward B3.
6 B3 accepts B1 as root, and it notes that both B2 and B5 are closer to the root
than it is. Thus B3 stops forwarding messages on both its interfaces.
This leaves B3 with both ports not selected, as shown in Figure 3.14.
Even after the system has stabilized, the root bridge continues to send configuration messages periodically, and the other bridges continue to forward these messages as
described in the previous paragraph. Should a particular bridge fail, the downstream
bridges will not receive these configuration messages, and after waiting a specified
period of time, they will once again claim to be the root, and the algorithm just described will kick in again to elect a new root and new designated bridges.
One important thing to notice is that although the algorithm is able to reconfigure the spanning tree whenever a bridge fails, it is not able to forward frames over
alternative paths for the sake of routing around a congested bridge.
Broadcast and Multicast
The preceding discussion has focused on how bridges forward unicast frames from
one LAN to another. Since the goal of a bridge is to transparently extend a LAN
across multiple networks, and since most LANs support both broadcast and multicast,
then bridges must also support these two features. Broadcast is simple—each bridge
forwards a frame with a destination broadcast address out on each active (selected)
port other than the one on which the frame was received.
Multicast can be implemented in exactly the same way, with each host deciding
for itself whether or not to accept the message. This is exactly what is done in practice.
Notice, however, that since not all the LANs in an extended LAN necessarily have
a host that is a member of a particular multicast group, it is possible to do better.
Specifically, the spanning tree algorithm can be extended to prune networks over which
multicast frames need not be forwarded. Consider a frame sent to group M by a host
on LAN A in Figure 3.14. If there is no host on LAN J that belongs to group M, then
there is no need for bridge B4 to forward the frames over that network. On the other
hand, not having a host on LAN H that belongs to group M does not necessarily mean
that bridge B1 can avoid forwarding multicast frames onto LAN H. It all depends on
whether or not there are members of group M on LANs I and J.
How does a given bridge learn whether it should forward a multicast frame over
a given port? It learns exactly the same way that a bridge learns whether it should forward a unicast frame over a particular port—by observing the source addresses that it
receives over that port. Of course, groups are not typically the source of frames, so we
have to cheat a little. In particular, each host that is a member of group M must periodically send a frame with the address for group M in the source field of the frame header.
This frame would have as its destination address the multicast address for the bridges.
Note that while the multicast extension just described has been proposed, it is
not widely adopted. Instead, multicast is implemented in exactly the same way as
broadcast on today’s extended LANs.
Limitations of Bridges
The bridge-based solution just described is meant to be used in only a fairly limited
setting—to connect a handful of similar LANs. The main limitations of bridges become
apparent when we consider the issues of scale and heterogeneity.
On the issue of scale, it is not realistic to connect more than a few LANs by
means of bridges, where in practice “few” typically means “tens of.” One reason for
this is that the spanning tree algorithm scales linearly; that is, there is no provision for
imposing a hierarchy on the extended LAN. A second reason is that bridges forward
all broadcast frames. While it is reasonable for all hosts within a limited setting (say,
a department) to see each other’s broadcast messages, it is unlikely that all the hosts
in a larger environment (say, a large company or university) would want to have to be
bothered by each other’s broadcast messages. Said another way, broadcast does not
scale, and as a consequence, extended LANs do not scale.
Figure 3.15 Two virtual LANs share a common backbone.
One approach to increasing the scalability of extended LANs is the virtual LAN
(VLAN). VLANs allow a single extended LAN to be partitioned into several seemingly
separate LANs. Each virtual LAN is assigned an identifier (sometimes called a color),
and packets can only travel from one segment to another if both segments have the
same identifier. This has the effect of limiting the number of segments in an extended
LAN that will receive any given broadcast packet.
We can see how VLANs work with an example. Figure 3.15 shows four hosts on
four different LAN segments. In the absence of VLANs, any broadcast packet from
any host will reach all the other hosts. Now let’s suppose that we define the segments
connected to hosts W and X as being in one VLAN, which we’ll call VLAN 100. We
also define the segments that connect to hosts Y and Z as being in VLAN 200. To do
this, we need to configure a VLAN ID on each port of bridges B1 and B2. The link
between B1 and B2 is considered to be in both VLANs.
When a packet sent by host X arrives at bridge B2, the bridge observes that it
came in a port that was configured as being in VLAN 100. It inserts a VLAN header
between the Ethernet header and its payload. The interesting part of the VLAN header
is the VLAN ID; in this case, that ID is set to 100. The bridge now applies its normal
rules for forwarding to the packet, with the extra restriction that the packet may not
be sent out an interface that is not part of VLAN 100. Thus, under no circumstances
will the packet—even a broadcast packet—be sent out the interface to host Z, which
is in VLAN 200. The packet is, however, forwarded to bridge B1, which follows the
same rules, and thus may forward the packet to host W but not to host Y.
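The extra forwarding restriction can be sketched as a per-port check applied on top of the normal rules. The Port structure, the TRUNK marker for the link that belongs to every VLAN, and the function names are assumptions for illustration; the VLAN header itself is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TRUNK 0   /* hypothetical marker: the port carries every VLAN */

typedef struct {
    int vlan;     /* VLAN ID configured on the port, or TRUNK */
} Port;

/* A frame tagged with frameVlan may leave only on ports in that VLAN. */
bool
mayForward(const Port *out, int frameVlan)
{
    return out->vlan == TRUNK || out->vlan == frameVlan;
}

/* Collect the ports a broadcast frame is sent out on: every port except
 * the arrival port that also passes the VLAN check. Returns the count. */
size_t
floodSet(const Port ports[], size_t n, size_t inPort, int frameVlan,
         size_t out[])
{
    assert(ports != NULL && out != NULL && inPort < n);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (i != inPort && mayForward(&ports[i], frameVlan))
            out[count++] = i;
    return count;
}
```

With a bridge whose ports are configured as VLAN 100, VLAN 200, and a trunk, a broadcast arriving on the VLAN 100 port is sent only on the trunk, so it never reaches the VLAN 200 segment. Moving a segment to another VLAN amounts to changing one vlan value.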
An attractive feature of VLANs is that it is possible to change the logical topology
without moving any wires or changing any addresses. For example, if we wanted to
make the segment that connects to host Z be part of VLAN 100, and thus enable X,
W, and Z to be on the same virtual LAN, we would just need to change one piece of
configuration on bridge B2.
On the issue of heterogeneity, bridges are fairly limited in the kinds of networks
they can interconnect. In particular, bridges make use of the network’s frame header
and so can support only networks that have exactly the same format for addresses.
Thus, bridges can be used to connect Ethernets to Ethernets, 802.5 to 802.5, and
Ethernets to 802.5 rings, since both networks support the same 48-bit address format.
Bridges do not readily generalize to other kinds of networks, such as ATM (see footnote 3).
Despite their limitations, bridges are a very important part of the complete networking picture. Their main advantage is that they allow multiple LANs to be transparently connected; that is, the networks can be connected without the end hosts
having to run any additional protocols (or even be aware, for that matter). The one
potential exception is when the hosts are expected to announce their membership in a
multicast group, as described in Section 3.2.3.
Notice, however, that this transparency can be dangerous. If a host, or more precisely, the application and transport protocol running on that host, is programmed
under the assumption that it is running on a single LAN, then inserting bridges
between the source and destination hosts can have unexpected consequences. For
example, if a bridge becomes congested, it may have to drop frames; in contrast, it
is rare that a single Ethernet ever drops a frame. As another example, the latency
between any pair of hosts on an extended LAN becomes both larger and more highly
variable; in contrast, the physical limitations of a single Ethernet make the latency both
small and predictable. As a final example, it is possible (although unlikely) that frames
will be reordered in an extended LAN; in contrast, frame order is never shuffled on
a single Ethernet. The bottom line is that it is never safe to design network software
under the assumption that it will run over a single Ethernet segment. Bridges happen.
3.3 Cell Switching (ATM)
Another switching technology that deserves special attention is asynchronous transfer
mode (ATM). ATM became an important technology in the 1980s and early 1990s for
a variety of reasons, not the least of which is that it was embraced by the telephone
industry, which has historically been less than active in data communications except
as a supplier of links on top of which other people have built networks. ATM also
happened to be in the right place at the right time, as a high-speed switching technology
that appeared on the scene just when shared media like Ethernet and 802.5 were
starting to look a bit too slow for many users of computer networks. In some ways
ATM is a competing technology with Ethernet switching, but the areas of application
for these two technologies only partially overlap.
³ As we will see in Section 3.3.5, there are techniques to make ATM networks look more like “conventional” LANs
such as Ethernets, and bridges do have a role in this environment.
ATM is a connection-oriented, packet-switched technology, which is to say, it
uses virtual circuits very much in the manner described in Section 3.1.2. In ATM
terminology, the connection setup phase is called signalling. The main ATM signalling
protocol is known as Q.2931. In addition to discovering a suitable route across an
ATM network, Q.2931 is also responsible for allocating resources at the switches
along the circuit. This is done in an effort to ensure the circuit a particular quality
of service. Indeed, the QoS capabilities of ATM are one of its greatest strengths. We
return to this topic in Chapter 6, where we discuss it in the context of similar efforts
to implement QoS.
When any virtual connection is set up, it is necessary to put the address of the
destination in the signalling message. In ATM, this address can be in one of several
formats, the most common ones being E.164 and NSAP (network service access point);
the details are not terribly important here, except to note that they are different from
the MAC addresses used in traditional LANs.
One thing that makes ATM really unusual is that the packets that are switched in
an ATM network are of fixed length. That length happens to be 53 bytes—5 bytes of
header followed by 48 bytes of payload—a rather interesting choice that is discussed
in more detail below. To distinguish these fixed-length packets from the more common
variable-length packets normally used in computer networks, they are given a special
name: cells. ATM may be thought of as the canonical example of cell switching.
All the packet-switching technologies we have looked at so far have used variable-length packets. Variable-length packets are normally constrained to fall within some
bounds. The lower bound is set by the minimum amount of information that needs to
be contained in the packet, which is typically a header with no optional extensions.
The upper bound may be set by a variety of factors; the maximum FDDI packet size,
for example, determines how long each station is allowed to transmit without passing
on the token, and thus determines how long a station might have to wait for the token
to reach it. Cells, in contrast, are both fixed in length and small in size. While this
seems like a simple enough design choice, there are actually a lot of factors involved,
as explained in the following paragraphs.
Cell Size
Variable-length packets have some nice characteristics. If you only have 1 byte to send
(e.g., to acknowledge the receipt of a packet), you put it in a minimum-sized packet.
If you have a large file to send, however, you break it up into as many maximum-sized packets as you need. You do not need to send any extraneous padding in the
first case, and in the second, you drive down the ratio of header to data bytes, thus
increasing bandwidth efficiency. You also minimize the total number of packets sent,
thereby minimizing the total processing incurred by per-packet operations. This can be
particularly important in obtaining high throughput, since many network devices are
limited not by how many bits per second they can process but rather by the number
of packets per second.
So, why use fixed-length cells? One of the main reasons was to facilitate the
implementation of hardware switches. When ATM was being created in the mid- and
late 1980s, 10-Mbps Ethernet was the cutting-edge technology in terms of speed. To go
much faster, most people thought in terms of hardware. Also, in the telephone world,
people think big when they think of switches—telephone switches often serve tens of
thousands of customers. Fixed-length packets turn out to be a very helpful thing if you
want to build fast, highly scalable switches. There are two main reasons for this:
1. It is easier to build hardware to do simple jobs, and the job of processing packets
is simpler when you already know how long each one will be.
2. If all packets are the same length, then you can have lots of switching elements
all doing much the same thing in parallel, each of them taking the same time to
do its job.
This second reason, the enabling of parallelism, greatly improves the scalability of
switch designs. It would be overstating the case to say that fast parallel hardware
switches can only be built using fixed-length cells. However, it is certainly true that
cells ease the task of building such hardware and that there was a lot of knowledge
available about how to build cell switches in hardware at the time the ATM standards
were being defined.
Another nice property of cells relates to the behavior of queues. Queues build
up in a switch when traffic from several inputs may be heading for a single output. In
general, once you extract a packet from a queue and start transmitting it, you need
to continue until the whole packet is transmitted; it is not practical to preempt the
transmission of a packet. The longest time that a queue output can be tied up is equal
to the time it takes to transmit a maximum-sized packet. Fixed-length cells mean that
a queue output is never tied up for more than the time it takes to transmit one cell,
which is almost certainly shorter than the maximum-sized packet on a variable-length
packet network. Thus, if tight control over the latency that is being experienced by
cells when they pass through a queue is important, cells provide some advantage. Of
course, long queues can still build up, and there is no getting around the fact that some
cells will have to wait their turn. What you get from cells is not much shorter queues
but potentially finer control over the behavior of queues.
An example will help to clarify this idea. Imagine a network with variable-length
packets, where the maximum packet length is 4 KB and the link speed is 100 Mbps.
The time to transmit a maximum-sized packet is 4096 × 8/100 = 327.68 μs. Thus, a
high-priority packet that arrives just after the switch starts to transmit a 4-KB packet
will have to sit in the queue 327.68 μs waiting for access to the link. In contrast, if the
switch were forwarding 53-byte cells, the longest wait would be 53×8/100 = 4.24 μs.
This may not seem like a big deal, but the ability to control delay and especially to
control its variation with time (jitter) can be important for some applications.
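The arithmetic in this example is easy to reproduce. The short sketch below uses a helper name, transmit_time_us, that is our own invention; it simply divides the number of bits by the link speed, using the fact that 1 Mbps is 1 bit per microsecond.

```python
def transmit_time_us(size_bytes, link_mbps):
    """Time to clock size_bytes onto a link, in microseconds."""
    # 1 Mbps is 1 bit per microsecond, so bits / Mbps yields microseconds.
    return size_bytes * 8 / link_mbps


packet_wait = transmit_time_us(4096, 100)  # worst-case wait behind a 4-KB packet
cell_wait = transmit_time_us(53, 100)      # worst-case wait behind one cell
print(f"{packet_wait:.2f} us versus {cell_wait:.2f} us")
```

The ratio between the two waits is just the ratio of the sizes, so it does not depend on the link speed chosen.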
Queues of cells also tend to be a little shorter than queues of packets, for the
following reason. When a packet begins to arrive in an empty queue, it is typical for the
switch to have to wait for the whole packet to arrive before it can start transmitting
the packet on an outgoing link. This means that the link sits idle while the packet
arrives. However, if you imagine a large packet being replaced by a “train” of small
cells, then as soon as the first cell in the train has entered the queue, the switch can
transmit it. Imagine in the example above what would happen if two 4-KB packets
arrived in a queue at about the same time. The link would sit idle for 327.68 μs while
these two packets arrive, and at the end of that period we would have 8 KB in the
queue. Only then could the queue start to empty. If those same two packets were sent
as trains of cells, then transmission of the cells could start 4.24 μs after the first train
started to arrive. At the end of 327.68 μs, the link would have been active for a little
over 323 μs, and there would be just over 4 KB of data left in the queue, not 8 KB as
before. Shorter queues mean less delay for all the traffic.
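The queue occupancy quoted above can be checked with a back-of-the-envelope calculation. The sketch below makes the same simplifications as the example, which we state as assumptions: the two packets arrive simultaneously on separate 100-Mbps inputs, the output link also runs at 100 Mbps, and a cell train carries exactly the same number of bytes as the packet it replaces.

```python
LINK_MBPS = 100      # input and output link speed used in the example
PACKET_BYTES = 4096  # each of the two arriving packets
CELL_BYTES = 53

arrival_us = PACKET_BYTES * 8 / LINK_MBPS   # both packets have fully arrived
first_cell_us = CELL_BYTES * 8 / LINK_MBPS  # first cell of a train has arrived

# Whole packets: nothing can leave until a packet has completely arrived.
queued_as_packets = 2 * PACKET_BYTES

# Cell trains: the output link is busy from first_cell_us until arrival_us.
active_us = arrival_us - first_cell_us
sent_bytes = active_us * LINK_MBPS / 8
queued_as_cells = 2 * PACKET_BYTES - sent_bytes

print(queued_as_packets, round(active_us, 2), round(queued_as_cells))
```

Under these assumptions the link is busy for about 323 μs and roughly 4149 bytes remain queued, which is the "just over 4 KB" figure.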
Having decided to use small, fixed-length packets, the next question is, What is
the right length to fix them at? If you make them too short, then the amount of header
information that needs to be carried around relative to the amount of data that fits
in one cell gets larger, so the percentage of link bandwidth that is actually used to
carry data goes down. Even more seriously, if you build a device that processes cells at
some maximum number of cells per second, then as cells get shorter, the total data rate
drops in direct proportion to cell size. An example of such a device might be a network
adaptor that reassembles cells into larger units before handing them up to the host.
The performance of such a device depends directly on cell size. On the other hand, if
you make the cells too big, then there is a problem of wasted bandwidth caused by the
need to pad transmitted data to fill a complete cell. If the cell payload size is 48 bytes
and you want to send 1 byte, you’ll need to send 47 bytes of padding. If this happens
a lot, then the utilization of the link will be very low.
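Both sides of this trade-off show up in one small formula. The following sketch (the function name link_efficiency and the sample sizes are ours, chosen only for illustration) computes the fraction of transmitted bytes that are user data when a message is carried in fixed-size cells with a 5-byte header.

```python
import math


def link_efficiency(data_bytes, payload_bytes, header_bytes=5):
    """Fraction of bytes on the wire that are user data (data_bytes > 0)."""
    cells = math.ceil(data_bytes / payload_bytes)  # last cell is padded
    return data_bytes / (cells * (payload_bytes + header_bytes))


one_byte = link_efficiency(1, 48)        # 47 bytes of padding plus the header
bulk_48 = link_efficiency(4800, 48)      # long transfer, 48-byte payloads
bulk_8 = link_efficiency(4800, 8)        # same transfer, very short cells
print(f"{one_byte:.3f} {bulk_48:.3f} {bulk_8:.3f}")
```

Very short cells waste bandwidth on headers even for bulk transfers, while a single byte in a 48-byte payload wastes nearly the whole cell on padding.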
Efficient link utilization is not the only factor that influences cell size. For example, cell size has a particular effect on voice traffic, and since ATM grew out of
the telephony community, one of the major concerns was that it be able to carry
voice effectively. The standard digital encoding of voice is done at 64 Kbps (8-bit
samples taken at 8 KHz). To maximize efficiency, you want to collect a full cell’s
worth of voice samples before transmitting
a cell. A sampling rate of 8 KHz means that
1 byte is sampled every 125 μs, so the time
it takes to fill an n-byte cell with samples
is n × 125 μs. If cells are, say, 1000 bytes
long, it would take 125 ms just to collect
a full cell of samples before you even start
to transmit it to the receiver. That amount
of latency starts to be quite noticeable to a
human listener. Even considerably smaller
latencies create problems for voice, particularly in the form of echoes. Echoes can be
eliminated by a piece of technology called an
echo canceler, but this adds cost to a telephone network that many network operators would rather avoid.
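The fill time n × 125 μs is simple enough to tabulate. In the sketch below the function name fill_time_ms and the particular payload sizes in the loop are illustrative choices of ours.

```python
SAMPLE_INTERVAL_US = 125  # one 8-bit sample every 125 us at 8 KHz


def fill_time_ms(payload_bytes):
    """Milliseconds needed to collect payload_bytes voice samples."""
    return payload_bytes * SAMPLE_INTERVAL_US / 1000


for size in (32, 48, 64, 1000):
    print(size, "bytes:", fill_time_ms(size), "ms")
```

A few dozen bytes of payload cost only a few milliseconds of delay, whereas a 1000-byte cell costs 125 ms before the first bit is even transmitted.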
All of the above factors caused a great
deal of debate in the international standards
bodies when ATM was being standardized,
and the fact that no length was perfect in all
cases was used by those opposed to ATM to
argue that fixed-length cells were a bad idea
in the first place. As is so often the case with
standards, the end result was a compromise
that pleased almost no one: 48 bytes was
chosen as the length for the ATM cell payload. Probably the greatest tragedy of this
choice is that it is not a power of two, which
means that it is quite a mismatch to most
things that computers handle, like pages and
cache lines. Rather less controversially, the
header was fixed at 5 bytes. The format
of an ATM cell is shown in Figure 3.16;
note that this figure shows the field lengths
in bits.
A Compromise of 48 Bytes
The explanation for why the payload of an ATM cell is 48 bytes
is an interesting one and is an excellent case study in the process
of standardization. As the ATM
standard was evolving, the U.S.
telephone companies were pushing
for a 64-byte cell size, while the
European companies were advocating 32-byte cells. The reason
that the Europeans wanted the
smaller size was that since the countries they served were of a small
enough size, they would not have
to install echo cancelers if they were
able to keep the latency induced
by generating a complete cell small
enough. Thirty-two-byte cells were
adequate for this purpose. In contrast, the United States is a large
enough country that the phone
companies had to install echo cancelers anyway, and so the larger cell
size reflected a desire to improve the
header-to-payload ratio.
Averaging is a classic form
of compromise—48 bytes is simply the average of 64 bytes and
32 bytes. So as not to leave the
false impression that this use of
compromise-by-averaging is an isolated incident, we note that the
seven-layer OSI model was actually a compromise between six and
eight layers.
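Figure 3.16 ATM cell format at the UNI. Fields, with lengths in bits: GFC (4), VPI (8), VCI (16), Type (3), CLP (1), HEC (CRC-8) (8), Payload (384, i.e., 48 bytes).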
Cell Format
The ATM cell actually comes in two different formats, depending on where you look
in the network. The one shown in Figure 3.16 is called the UNI (user-network interface) format; the alternative is the NNI (network-network interface). The UNI format
is used, of course, at the user-to-network interface. This is likely to be the interface
between a telephone company and one of its customers. The network-to-network
interface is likely to be between a pair of phone companies. The only significant difference in cell formats is that the NNI format replaces the GFC field with 4 extra bits
of VPI. Clearly, understanding all the three-letter acronyms (TLAs) is a key part of
understanding ATM.
Starting from the leftmost byte of the cell (which is the first one transmitted), the
UNI cell has 4 bits for generic flow control (GFC). These bits have not been widely used,
but they were intended to have local significance at a site and could be overwritten in
the network. The basic idea behind the GFC bits was to provide a means to arbitrate
access to the link if the local site used some shared medium to connect to ATM.
The next 24 bits contain an 8-bit virtual path identifier (VPI) and a 16-bit virtual
circuit identifier (VCI). The difference between the two is explained below, but for
now it is adequate to think of them as a single 24-bit identifier that is used to identify
a virtual connection, just as in Section 3.1.2. Following the VPI/VCI is a 3-bit Type
field that has eight possible values. Four of them, when the first bit in the field is set,
relate to management functions. When that bit is clear, it means that the cell contains
user data. In this case, the second bit is the “explicit forward congestion indication”
(EFCI) bit, and the third is the “user signalling” bit. The former can be set by a
congested switch to tell an end node that it is congested; it has its roots in the DECbit
described in Section 6.4.1; in ATM, it is used for congestion control in conjunction
with the available bit rate (ABR) service class described in Section 6.5.4. The third bit
is used primarily in conjunction with ATM Adaptation Layer 5 to delineate frames, as
discussed below.
Next is a bit to indicate cell loss priority (CLP); a user or network element may set
this bit to indicate cells that should be dropped preferentially in the event of overload.
For example, a video coding application could set this bit for cells that, if dropped,
would not dramatically degrade the quality of the video. A network element might set
this bit for cells that have been transmitted by a user in excess of the amount that was negotiated.
The last byte of the header is an 8-bit CRC, known as the header error check
(HEC). It uses the CRC-8 polynomial given in Section 2.4.3 and provides error detection
and single-bit error correction capability on the cell header only. Protecting the cell
header is particularly important because an error in the VCI will cause the cell to be misdelivered.
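To make the UNI header layout concrete, here is a minimal sketch that packs the fields described above into 5 bytes and unpacks them again. The function names and the dictionary keys are our own, and the HEC is treated as an opaque 8-bit value rather than being computed.

```python
def pack_uni_header(gfc, vpi, vci, ptype, clp, hec=0):
    """Pack the UNI header fields into 5 bytes, first transmitted byte first."""
    widths = ((gfc, 4), (vpi, 8), (vci, 16), (ptype, 3), (clp, 1), (hec, 8))
    for value, bits in widths:
        if not 0 <= value < (1 << bits):
            raise ValueError("field value does not fit in its width")
    word = ((gfc << 36) | (vpi << 28) | (vci << 12)
            | (ptype << 9) | (clp << 8) | hec)
    return word.to_bytes(5, "big")


def unpack_uni_header(header):
    """Split a 5-byte UNI header back into its fields."""
    word = int.from_bytes(header, "big")
    ptype = (word >> 9) & 0x7
    return {
        "gfc": (word >> 36) & 0xF,
        "vpi": (word >> 28) & 0xFF,
        "vci": (word >> 12) & 0xFFFF,
        "type": ptype,
        "management": bool(ptype & 0b100),  # first bit of the Type field
        "clp": (word >> 8) & 0x1,
        "hec": word & 0xFF,
    }


header = pack_uni_header(gfc=0, vpi=5, vci=42, ptype=0b001, clp=1)
fields = unpack_uni_header(header)
print(header.hex(), fields)
```

For the NNI format the same approach would apply with a 12-bit VPI and no GFC field.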
Segmentation and Reassembly
Up to this point, we have assumed that a low-level protocol could just accept the
packet handed down to it by a high-level protocol, attach its own header, and pass the
packet on down. This is not possible with ATM, however, since the packets handed
down from above are often larger than 48 bytes, and thus, will not fit in the payload
of an ATM cell. The solution to this problem is to fragment the high-level message
into low-level packets at the source, transmit the individual low-level packets over the
network, and then reassemble the fragments back together at the destination. This
general technique is usually called fragmentation and reassembly. In the case of ATM,
however, it is often called segmentation and reassembly (SAR).
Segmentation is not unique to ATM, but it is much more of a problem than in a
network with a maximum packet size of, say, 1500 bytes. To address the issue, a protocol layer was added that sits between ATM and the variable-length packet protocols
that might use ATM, such as IP. This layer is called the ATM Adaptation Layer (AAL),
and to a first approximation, the AAL header simply contains the information needed
by the destination to reassemble the individual cells back into the original message.
The relationship between the AAL and ATM is illustrated in Figure 3.17.
Because ATM was designed to support all sorts of services, including voice,
video, and data, it was felt that different services would have different AAL needs.
Figure 3.17 Segmentation and reassembly in ATM.
Thus, four adaptation layers were originally defined: 1 and 2 were designed to support
applications, like voice, that require guaranteed bit rates, while 3 and 4 were intended
to provide support for packet data running over ATM. The idea was that AAL3 would
be used by connection-oriented packet services (such as X.25) and AAL4 would be
used by connectionless services (such as IP). Eventually, the reasons for having different
AALs for these two types of service were found to be insufficient, and the AALs
merged into one that is inconveniently known as AAL3/4. Meanwhile, some perceived
shortcomings in AAL3/4 caused a fifth AAL to be proposed, called AAL5. Thus, there
are now four AALs: 1, 2, 3/4, and 5. The two that support computer communications
are described below.
ATM Adaptation Layer 3/4
The main function of AAL3/4 is to provide enough information to allow variable-length packets to be transported across the ATM network as a series of fixed-length
cells. That is, the AAL supports the segmentation and reassembly process. Since we
are now working at a new layer of the network hierarchy, convention requires us
to introduce a new name for a packet—in this case, we call it a protocol data unit
(PDU). The task of segmentation/reassembly involves two different packet formats.
The first of these is the convergence sublayer protocol data unit (CS-PDU), as depicted
in Figure 3.18. The CS-PDU defines a way of encapsulating variable-length PDUs prior
to segmenting them into cells. The PDU passed down to the AAL layer is encapsulated
by adding a header and a trailer, and the resultant CS-PDU is segmented into ATM cells.
The CS-PDU format begins with an 8-bit common part indicator (CPI), which
indicates which version of the CS-PDU format is in use. Only the value 0 is currently
defined. The next 8 bits contain the beginning tag (Btag), which is supposed to match
the end tag (Etag) for a given PDU. This protects against the situation in which the
loss of the last cell of one PDU and the first cell of another causes two PDUs to be
inadvertently joined into a single PDU and passed up to the next layer in the protocol
stack. The buffer allocation size (BASize) field is not necessarily the length of the PDU
(which appears in the trailer); it is supposed to be a hint to the reassembly process as
to how much buffer space to allocate for the reassembly. The reason for not including
the actual length here is that the sending host might not have known how long the
CS-PDU was when it transmitted the header. Before adding the CS-PDU trailer, the
user data is padded to one byte less than a multiple of 4 bytes, by adding up to 3 bytes
of padding. This padding, plus the 0-filled byte, ensures that the trailer is aligned on
a 32-bit boundary, making for more efficient processing. The CS-PDU trailer itself
contains the Etag and the real length of the PDU (Len).
Figure 3.18 ATM Adaptation Layer 3/4 packet format. Fields, in order: CPI, Btag, BASize, User data (< 64 KB), Pad, 0, Etag, Len.
Figure 3.19 ATM cell format for AAL3/4. Fields, in order: ATM header, Type, SEQ, MID, Payload (352 bits, i.e., 44 bytes), Length, CRC-10.
Table 3.4 AAL3/4 Type field.
Value  Name  Meaning
10     BOM   Beginning of message
00     COM   Continuation of message
01     EOM   End of message
11     SSM   Single-segment message
In addition to the CS-PDU header and trailer, AAL3/4 specifies a header and
trailer that are carried in each cell, as depicted in Figure 3.19. Thus, the CS-PDU is
actually segmented into 44-byte chunks; an AAL3/4 header and trailer is attached to
each one, bringing it up to 48 bytes, which is then carried as the payload of an ATM cell.
The first two bits of the AAL3/4 header contain the Type field, which indicates
if this is the first cell of a CS-PDU, the last cell of a CS-PDU, a cell in the middle of
a CS-PDU, or a single-cell PDU (in which case it is both first and last). The official
names for these four conditions are shown in Table 3.4, along with the bit encodings.
Next is a 4-bit sequence number (SEQ), which is intended simply to detect cell
loss or misordering so that reassembly can be aborted. Clearly, a sequence number this
small can miss cell losses if the number of lost cells is large enough. This is followed
by a multiplexing identifier (MID), which can be used to multiplex several PDUs onto
a single connection. The 6-bit Length field shows the number of bytes of PDU that are
contained in the cell; it must equal 44 for BOM and COM cells. Finally, a 10-bit CRC
is used to detect errors anywhere in the 48-byte cell payload.
Figure 3.20 Encapsulation and segmentation for AAL3/4. Labels: User data; 44-byte segments, with a final segment of ≤ 44 bytes; each cell payload consists of an AAL header, a segment, and an AAL trailer, preceded by the ATM header.
Figure 3.20 shows the entire encapsulation and segmentation process for AAL3/4.
At the top, the user data is encapsulated with the CS-PDU header and trailer. The
CS-PDU is then segmented into 44-byte payloads, which are encapsulated as ATM
cells by adding the AAL3/4 header and trailer as well as the 5-byte ATM header. Note
that the last cell is only partially filled whenever the CS-PDU is not an exact multiple
of 44 bytes.
One thing to note about AAL3/4 is that it exacerbates the fixed per-cell overhead
that we discussed above. With 44 bytes of data to 9 bytes of header, the best possible
bandwidth utilization would be 83%. Note that the efficiency can be considerably less
than that, as illustrated by Figure 3.20, because of the CS-PDU encapsulation and the
partial filling of the last cell.
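The segmentation step and its cost can be sketched as follows. This is an illustration only: the function names are ours, the sequence number is assumed to start at 0 for each CS-PDU, and the CS-PDU length is taken as given, so the CS-PDU header, padding, and trailer are not modeled.

```python
import math


def aal34_segments(cs_pdu_len):
    """Return (type, seq, length) for each cell carrying a CS-PDU."""
    count = math.ceil(cs_pdu_len / 44)
    segments = []
    for i in range(count):
        length = min(44, cs_pdu_len - 44 * i)  # only the last cell may be short
        if count == 1:
            kind = "SSM"
        elif i == 0:
            kind = "BOM"
        elif i == count - 1:
            kind = "EOM"
        else:
            kind = "COM"
        segments.append((kind, i % 16, length))  # SEQ is only 4 bits wide
    return segments


def utilization(cs_pdu_len):
    """CS-PDU bytes divided by bytes on the wire (53 per cell)."""
    return cs_pdu_len / (len(aal34_segments(cs_pdu_len)) * 53)


print(aal34_segments(100))
print(f"{utilization(44):.3f} {utilization(45):.3f}")
```

A CS-PDU of exactly 44 bytes reaches the 83% best case, while one of 45 bytes needs a second, almost empty cell and drops to roughly half that.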
ATM Adaptation Layer 5
One thing you may have noticed in the discussion of AAL3/4 is that it seems to
take a lot of fields and thus a lot of overhead to perform the conceptually simple
function of segmentation and reassembly. This observation was, in fact, made by
several people in the early days of ATM, and numerous competing proposals arose
for an AAL to support computer communications over ATM. There was a movement,
known informally as “Back the Bit,” that argued that if we could just have 1 bit in
the ATM header (as opposed to the AAL header) to delineate the end of a frame, then
segmentation and reassembly could be accomplished without using any of the 48-byte
ATM payload for segmentation/reassembly information. This movement eventually led
to the definition of the user signalling bit described above and to the standardization
of AAL5.
Figure 3.21 ATM Adaptation Layer 5 packet format. Fields, in order: Data (< 64 KB), Pad (0–47 bytes), Reserved, Len, CRC-32.
Figure 3.22 Encapsulation and segmentation for AAL5. Labels: User data, cut into 48-byte segments; each segment is the cell payload that follows an ATM header.
What AAL5 does is replace the 2-bit Type field of AAL3/4 with 1 bit of framing
information in the ATM cell header. By setting that 1 bit, we can identify the last cell
of a PDU; the next cell is assumed to be the first cell of the next PDU, and subsequent
cells are assumed to be COM cells until another cell is received with the user signalling bit set. All the pieces of AAL3/4 that provide protection against lost, corrupt,
or misordered cells, including the loss of an EOM cell, are provided by the AAL5
CS-PDU packet format depicted in Figure 3.21.
The AAL5 CS-PDU consists simply of the data portion (the PDU handed down
by the higher-layer protocol) and an 8-byte trailer. To make sure that the trailer always
falls at the tail end of an ATM cell, there may be up to 47 bytes of padding between
the data and the trailer. It is necessary to force the trailer to be at the end of a cell, as
otherwise there would be no way for the entity performing reassembly of the CS-PDU
to find the trailer. The first 2 bytes of the trailer are currently reserved and must be
0. The length field (Len) is the number of bytes carried in the PDU, not including the
trailer or any padding before the trailer. Finally, there is a 32-bit CRC.
Figure 3.22 shows the encapsulation and segmentation process for AAL5. Just
like AAL3/4, the user data is encapsulated to form a CS-PDU (although using only a
trailer in this case). The resulting PDU is then cut up into 48-byte chunks, which are
carried directly inside the payload of ATM cells without any further encapsulation.
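The AAL5 encapsulation can be sketched in a few lines. Several things here are assumptions made for the example: the function names are ours, the end-of-PDU indication is modeled as a boolean flag paired with each 48-byte payload rather than as a bit in a real cell header, and zlib.crc32 is used purely as a stand-in 32-bit checksum, with no claim that it is bit-for-bit the CRC that AAL5 equipment computes.

```python
import struct
import zlib


def aal5_cs_pdu(data):
    """Data, 0 to 47 bytes of padding, then the 8-byte trailer."""
    if len(data) >= 65536:
        raise ValueError("data must be shorter than 64 KB")
    pad_len = -(len(data) + 8) % 48  # make the total a multiple of 48
    # Trailer so far: 2 reserved bytes (zero) and the 2-byte Len field.
    body = data + bytes(pad_len) + struct.pack("!HH", 0, len(data))
    crc = zlib.crc32(body)  # stand-in checksum, see the assumption above
    return body + struct.pack("!I", crc)


def aal5_cells(data):
    """Return (payload, is_last) pairs; is_last models the framing bit."""
    pdu = aal5_cs_pdu(data)
    chunks = [pdu[i:i + 48] for i in range(0, len(pdu), 48)]
    return [(chunk, i == len(chunks) - 1) for i, chunk in enumerate(chunks)]


cells = aal5_cells(b"x" * 41)
print(len(cells), [last for _, last in cells])
```

Because the trailer always occupies the final 8 bytes of the last cell, the receiver can find Len and the checksum as soon as it sees the cell with the flag set.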
Somewhat surprisingly, AAL5 provides almost the same functionality as AAL3/4
without using an extra 4 bytes out of every cell. For example, the CRC-32 detects lost
or misordered cells as well as bit errors in the data. In fact, having a checksum over
the entire PDU rather than doing it on a per-cell basis as in AAL3/4 provides stronger
protection. For example, it protects against the loss of 16 consecutive cells, an event
that would not be picked up by the sequence number checking of AAL3/4. Also, a
32-bit CRC protects against longer burst errors than a 10-bit CRC.
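The blind spot of a 4-bit sequence number is easy to demonstrate. In the sketch below the function name and the choice of a 40-cell stream are ours; the check simply verifies that each received sequence number is one more than its predecessor, modulo 16.

```python
def gap_detected(seqs):
    """True if consecutive 4-bit sequence numbers ever fail to advance by 1."""
    return any((b - a) % 16 != 1 for a, b in zip(seqs, seqs[1:]))


sent = [i % 16 for i in range(40)]   # SEQ values of 40 consecutive cells

lost_15 = sent[:5] + sent[20:]       # cells 5 through 19 are lost
lost_16 = sent[:5] + sent[21:]       # cells 5 through 20 are lost

print(gap_detected(lost_15), gap_detected(lost_16))
```

Losing 15 cells breaks the sequence and is noticed, but losing exactly 16 leaves the numbering looking perfectly continuous, which is the case a check over the whole PDU is able to catch.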
The main feature missing from AAL5 is the ability to provide an additional layer
of multiplexing onto one virtual circuit using the MID. It is not clear whether this is
a significant loss. It is still possible to multiplex traffic from many applications and
higher-layer protocols onto a single VC using AAL5 by carrying a demux key of the
sort described in Section 1.3.1. It just becomes necessary to do the multiplexing on a
packet-by-packet, rather than a cell-by-cell, basis.
There are positive and negative aspects to multiplexing traffic from a lot of
different applications onto a single VC. For example, if you are being charged for
every virtual circuit you set up across a network, then multiplexing traffic from lots of
different applications onto one connection might be a plus. However, this approach
has the drawback that all applications will have to live with whatever quality of service
(e.g., delay and bandwidth guarantees) has been chosen for that one connection, which
may mean that some applications are not receiving appropriate service.
In general, AAL5 has been wholeheartedly embraced by the computer communications community (at least by that part of the community that has embraced ATM
at all). For example, it is the preferred AAL in the IETF for transmitting IP datagrams
over ATM. Its more efficient use of bandwidth and simple design are the main features
that make it more appealing than AAL3/4.
Virtual Paths
As mentioned above, ATM uses a 24-bit identifier for virtual circuits, and these circuits
operate almost exactly like the ones described in Section 3.1.2. The one twist is that the
24-bit identifier is split into two parts: an 8-bit virtual path identifier (VPI) and a 16-bit
virtual circuit identifier (VCI). This effectively creates a two-level hierarchy of virtual
connections. To understand how such a hierarchy might work, consider the following
example. (We ignore the fact that in some places there might be a network-network
interface with a different-sized VPI; just assume that 8-bit VPIs are used everywhere.)
Suppose that a corporation has two sites that connect to a public ATM network,
and that at each site the corporation has a network of ATM switches. We could imagine
establishing a virtual path between two sites using only the VPI field. Thus, the switches
in the public network would use the VPI as the only field on which to make forwarding
decisions. From their point of view, this is a virtual circuit network with 8-bit circuit
identifiers. The 16-bit VCI is of no interest to these public switches, and they neither
use the field for switching nor remap it. Within the corporate sites, however, the full
24-bit space is used for switching. Any traffic that needs to flow between the two sites
is routed to a switch that has a connection to the public network, and its top 8 bits
(the VPI) are mapped onto the appropriate value to get the data to the other site. This
idea is illustrated in Figure 3.23. Note that the virtual path acts like a fat pipe that
contains a bundle of virtual circuits, all of which have the same 8 bits in their most
significant byte.
Figure 3.23 Example of a virtual path. Labels: Network A, Public network, Network B.
The advantage of this approach is clear: Although there may be thousands or
millions of virtual connections across the public network, the switches in the public
network behave as if there is only one connection. This means that there needs to be
much less connection-state information stored in the switches, avoiding the need for
big, expensive tables of per-VCI information.
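A sketch of what a switch in the public network does makes the saving in state visible. The table layout, port numbers, and VPI values below are hypothetical; the point is only that the lookup key contains the VPI and not the VCI.

```python
def public_switch(vp_table, in_port, vpi, vci):
    """Forward on the VPI alone; the VCI passes through unchanged."""
    out_port, out_vpi = vp_table[(in_port, vpi)]
    return out_port, out_vpi, vci


# One entry describes the whole virtual path between the two sites.
vp_table = {(1, 17): (4, 99)}

forwarded = [public_switch(vp_table, 1, 17, vci) for vci in (0, 1234, 65535)]
print(len(vp_table), forwarded)
```

A switch inside one of the corporate sites would instead key its table on the full 24-bit VPI/VCI pair and so would need one entry per virtual circuit.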
Physical Layers for ATM
While the layered approach to protocol design might lead you to think that we do not
need to worry about what type of point-to-point link ATM runs on top of, this turns
out not to be the case. From a simple pragmatic point of view, when you buy an ATM
adaptor for a workstation or an ATM switch, it comes with some physical medium
over which ATM cells will be sent. Of course, this is also true for other networking
protocols such as 802.5 and Ethernet. Like these protocols, ATM can also run over
several different physical media and physical-layer protocols.
From early in the process of standardizing ATM, it was assumed that ATM would
run on top of a SONET physical layer (see Section 2.3.3). Some people even get ATM
and SONET confused because they have been so tightly coupled for so long. While
it is true that standard ways of carrying ATM cells inside a SONET frame have been
defined, and that you can now buy ATM-over-SONET products, the two are entirely
separable. For example, you can lease a SONET link from a phone company and send
whatever you want over it, including variable-length packets. Also, you can send ATM
cells over many other physical layers instead of SONET, and standards have been (or
are being) defined for these encapsulations. A notable early physical layer for ATM
was TAXI, the physical layer used in FDDI (Section 2.7). Wireless physical layers for
ATM are also being defined.
When you send ATM cells over some physical medium, the main issue is how to
find the boundaries of the ATM cells; this is exactly the framing problem described in
Chapter 2. With SONET, there are two easy ways to find the boundaries. One of the
overhead bytes in the SONET frame can be used as a pointer into the SONET payload
to the start of an ATM cell. Having found the start of one cell, it is known that the
next cell starts 53 bytes further on in the SONET payload, and so on. In theory, you
only need to read this pointer once, but in practice, it makes sense to read it every time
the SONET overhead goes by so that you can detect errors or resynchronize if needed.
The other way to find the boundaries of ATM cells takes advantage of the fact
that every cell has a CRC in the fifth byte of the cell. Thus, if you run a CRC calculation
over the last 5 bytes received and the answer comes out to indicate no errors, then it is
probably true that you have just read an ATM header. If this happens several times in
a row at 53-byte intervals, you can be pretty sure you have found the cell boundary.
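The CRC-based search can be sketched as follows. This is an illustration rather than a description of any particular adaptor: it assumes the header check is a CRC-8 with generator x^8 + x^2 + x + 1 computed over the first four header bytes and XORed with 0x55, and the number of consecutive matches required (six here) is an arbitrary choice.

```python
# Sketch of CRC-based cell delineation.
# Assumptions: CRC-8 with polynomial 0x07 over header bytes 1-4, XORed with 0x55;
# six consecutive matches are taken as "pretty sure".

CELL_SIZE = 53

def hec(header4: bytes) -> int:
    """Header check byte for the first four bytes of a cell header."""
    crc = 0
    for byte in header4:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc ^ 0x55

def looks_like_header(five: bytes) -> bool:
    """True if the fifth byte is a valid check over the first four."""
    return hec(five[:4]) == five[4]

def find_cell_boundary(stream: bytes, confirmations: int = 6):
    """Return the first offset at which `confirmations` candidates spaced 53 bytes
    apart all pass the header check, or None if there is no such offset."""
    last_start = len(stream) - CELL_SIZE * (confirmations - 1) - 5
    for start in range(last_start + 1):
        if all(
            looks_like_header(stream[start + k * CELL_SIZE: start + k * CELL_SIZE + 5])
            for k in range(confirmations)
        ):
            return start
    return None
```

Requiring several matches in a row matters because a single 5-byte window inside a payload passes the check by chance about once in 256 tries.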
ATM in the LAN
As we mentioned above, ATM grew out of the telephony community, who envisioned
it as a way to build large public networks that could transport voice, video, and
data traffic. However, it was subsequently embraced by segments of the computer and
data communications industries as a technology to be used in LANs—a replacement
for Ethernet and 802.5. Its popularity in this realm at a particular point in time can
be attributed to two main factors:
■ ATM is a switched technology, whereas Ethernet and 802.5 were originally
envisioned as shared-media technologies.
■ ATM was designed to operate on links with speeds of 155 Mbps and above,
compared to the original 10 Mbps of Ethernet and 4 or 16 Mbps of token rings.
When ATM switches first became available, these were significant advantages
over the existing solutions. In particular, switched networks have a big performance
advantage over shared-media networks: A single shared-media network has a fixed
total bandwidth that must be shared among all hosts, whereas each host gets its own
dedicated link to the switch in a switched network. Thus the performance of switched
networks scales better than that of shared-media networks.
However, it should be apparent that the distinction between shared-media and
switched networks is not all that clear-cut. A bridge that connects a number of shared-media networks together is also a switch, and it is possible (and quite common) to
connect only one host to each segment, giving it dedicated access to that bandwidth.
At the same time as ATM switches were appearing on the scene, high-performance
Ethernet switches became available. These devices have large numbers of ports and
high total throughput. The 100-Mbps Ethernet standard was defined, and so the link
speed of Ethernet, which could be achieved over copper, began to approach that of ATM.
All this was not enough to kill off ATM in the LAN. One advantage of ATM
over Ethernet that remains is the lack of distance limitation for ATM links. Also,
higher-speed ATM links (e.g., 622 Mbps) soon became available. This made ATM
fairly popular for the high-performance “backbone” of larger LANs. One common
configuration was to connect hosts to Ethernet switches, which in turn could be interconnected by ATM switches, as depicted in Figure 3.24. High-performance servers
might also be connected directly to the ATM switch, as with host H7 in this example.
More recently, the technology that has probably overtaken ATM for LAN backbones and server connections is Gigabit Ethernet. Gigabit Ethernet links use the same
framing as lower-speed Ethernets but are usually point-to-point fiber links and can run
over relatively long distances (up to several kilometers). And the same basic approach
is now scaling up to provide 10-Gbps links.
Figure 3.24 ATM used as a LAN backbone. (Figure labels: Ethernet links, ATM links, Ethernet switch, ATM switch, ATM-attached host H7.)
One significant problem with running ATM in a LAN is that it doesn’t look like
a “traditional” LAN. Because most LANs (i.e., Ethernets and token rings) are shared-media networks (i.e., every node on the LAN is connected to the same link), it is easy
to implement broadcast (sending to everybody) and multicast (sending to a group).
Thus, many of the protocols that people depend on in their LANs—for example,
the Address Resolution Protocol (ARP) described in Section 4.1.5—depend in turn
on the ability of the LAN to support multicast and broadcast. However, because of
its connection-oriented and switched nature, ATM behaves rather differently than a
shared-media LAN. For example, how can you broadcast to all nodes on an ATM
LAN if you don’t know all their addresses and set up VCs to all of them?
There are two possible solutions to this problem, and both of them have been
explored. One is to redesign the protocols that make assumptions about LANs that
are not in fact true of ATM. Thus, for example, there is a new protocol called
ATMARP that, unlike traditional ARP, does not depend on broadcast. We discuss
this in Section 4.1.5. The alternative is to make ATM behave more like a shared-media
LAN—in the sense of supporting multicast and broadcast—without losing the performance advantages of a switched network. This approach has been specified by the
ATM Forum as “LAN emulation” or LANE (which might be more correctly called
“shared-media emulation”). This approach aims to add functionality to ATM LANs
so that anything that runs over a shared-media LAN can operate over an ATM LAN.
While LANE might now be considered something of a historical curiosity, it does provide an interesting case study in how layering can work in a network. By making the
“ATM layer” look more like an Ethernet, higher-layer protocols that worked well over
Ethernet continue to work without modification.
One aspect of LAN emulation that can be confusing is the variety of different
addresses and identifiers that are used. All ATM devices must have an ATM address,
which is used when signalling to establish a VC. As noted above, these addresses are
different from the standard IEEE 802 MAC addresses used in Ethernets, token rings,
and so on. If we want to emulate the behavior of these types of LANs, each device will
also need to have a standard (48-bit, globally unique) MAC address. And finally, recall
that a virtual circuit identifier is very different from an address. It is the shorthand that
is used to get cells along an established connection, but you must first establish a
connection, and to do that you need an ATM address.
LAN emulation does not actually change the functionality of ATM switches,
but adds functionality to the network through the addition of a number of servers.
Devices that connect to the ATM network—hosts, bridges, routers—are referred to
as LAN emulation clients (LECs). The interactions between LECs and the various
servers result in network behavior that, from the point of view of any higher-layer
protocol, is indistinguishable from that of an Ethernet or token ring network.
Figure 3.25 Protocol layers in LAN emulation.
Figure 3.25 illustrates the protocol layers in the case where a pair of hosts communicate across an ATM network that is emulating a LAN. By “Ethernet-like interface,”
we mean that the services offered up to higher layers are like those of an Ethernet:
Frames can be delivered to any MAC address on the LAN, frames can be broadcast
to all destinations on the LAN, and so on.
The servers that are required to build an emulated LAN are
■ the LAN emulation configuration server (LECS)
■ the LAN emulation server (LES)
■ the broadcast and unknown server (BUS)
These servers can be physically located in one or more devices, perhaps in one of the
hosts or other devices connected to the ATM network. The LECS and LES primarily
perform configuration functions, while the BUS has a central role in making data
transfer in an ATM network resemble that of a shared-media LAN.
The LECS enables a newly attached or rebooted LAN emulation client (e.g., a
host) to get some essential information. First, the client must find the LECS, which
it may do by using a well-known, predefined VC that is always set up; alternatively,
the client must have prior knowledge of the ATM address of the LECS so it can set
up a VC to it. Once connected to the LECS, the client provides the LECS with its
ATM address, and the LECS responds by telling the client what type of LAN is being
emulated (Ethernet or token ring), what the maximum packet size is, and the ATM
address of the LES. One LECS might support many separate emulated LANs.
Figure 3.26 Servers and clients in an emulated LAN. (Figure legend: point-to-point VC, point-to-multipoint VC.)
The client now signals for a connection to the LES whose ATM address it just
learned. Once connected to the LES, the client registers its MAC and ATM addresses
with the LES. Among other things, the LES provides the client with the ATM address
of the BUS.
The BUS maintains a single point-to-multipoint VC that connects it to all registered clients. It should be apparent that the BUS and this multipoint VC are crucial to
LAN emulation: They enable the broadcast capability of traditional LANs to be emulated in a virtual circuit environment. Once a LEC has the ATM address of the BUS,
it signals for a connection to the BUS. The BUS in turn adds the LEC to the point-to-multipoint VC. At this point, everything is ready for the LEC to participate in data
transfer. The arrangement where two hosts have connected to the LES and the BUS,
and the BUS has formed the point-to-multipoint VC to both of them, is shown in
Figure 3.26. The LECS is not shown.
This might seem like a lot of work to get the LEC connected to the BUS, but
the separation of functions among servers is helpful from a network management
standpoint. For example, a great deal of information can be centralized in a single
LECS rather than having to be distributed to many LESs, and the amount of special
configuration needed in each host is kept to a bare minimum.
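The three-step start-up just described (LECS, then LES, then BUS) can be summarized in a small model. Everything here is hypothetical: the class and method names are invented, Python object references stand in for ATM addresses and signalled VCs, and the real exchanges carry more information than is shown.

```python
# Illustrative model of LEC start-up. All names are invented for this sketch;
# object references stand in for ATM addresses and the VCs set up to them.

class BUS:
    def __init__(self):
        self.leaves = set()  # clients on the point-to-multipoint VC

    def join(self, atm_addr):
        self.leaves.add(atm_addr)

class LES:
    def __init__(self, bus):
        self.bus = bus
        self.mac_to_atm = {}  # registrations from clients

    def register(self, mac, atm_addr):
        self.mac_to_atm[mac] = atm_addr
        return self.bus  # among other things, the client learns where the BUS is

    def resolve(self, mac):
        return self.mac_to_atm.get(mac)

class LECS:
    def __init__(self, lan_type, max_frame, les):
        self.lan_type, self.max_frame, self.les = lan_type, max_frame, les

    def configure(self, atm_addr):
        return {"lan_type": self.lan_type, "max_frame": self.max_frame, "les": self.les}

def join_emulated_lan(lecs, mac, atm_addr):
    config = lecs.configure(atm_addr)   # step 1: LAN type, maximum size, LES address
    les = config["les"]
    bus = les.register(mac, atm_addr)   # step 2: register MAC and ATM addresses
    bus.join(atm_addr)                  # step 3: become a leaf of the multipoint VC
    return config, les, bus
```

Note that the only thing the client needs to know in advance is how to reach the LECS; everything else is learned along the way.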
It should be clear that the BUS is the place to send any packet that needs to be
broadcast to all clients on the LAN. While it could also be used for delivery of unicast
packets, this would be inefficient. Delivery of unicast packets operates as follows.
Assume that a host has a packet that it wants to deliver to a particular MAC address.
In a traditional LAN, the packet could be placed on the wire and would be picked up
by the intended recipient. In an emulated LAN, the packet needs to be delivered to
the recipient over a virtual circuit. But a newly attached host would only have a VC
to the LES and the BUS, not the recipient. To make matters worse, it would not even
know the ATM address of the recipient, which is required to set up a VC. Thus, the
host performs the following steps:
■ It sends the packet to the BUS, which it knows can deliver the packet to the
destination using its point-to-multipoint VC.
■ It sends an “address resolution” request to the LES, of the form “What ATM
address corresponds to this MAC address?”
Since all clients should have registered their MAC and ATM addresses with the
LES, the LES should be able to answer the query and provide an ATM address to the
client. The client can now signal for a VC to the recipient, which it may use to forward
subsequent frames to the destination. The reason for using the BUS to send the first
packet is to minimize delay, since it may take some time to get a response from the
LES and establish a VC.
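A client's sending logic might be sketched as follows. The class and the callbacks are hypothetical, and address resolution is shown as an immediate dictionary lookup, whereas in practice the reply from the LES arrives some time after the request is sent.

```python
# Hypothetical sketch of a LEC's unicast path. Resolution is modeled as an
# immediate lookup; a real client would wait for the reply from the LES.

class Lec:
    def __init__(self, les_table, send_to_bus, open_vc):
        self.les_table = les_table      # stands in for queries to the LES
        self.send_to_bus = send_to_bus  # callable(frame)
        self.open_vc = open_vc          # callable(atm_addr) returning callable(frame)
        self.direct_vcs = {}            # destination MAC -> send function of a direct VC

    def send(self, dst_mac, frame):
        vc = self.direct_vcs.get(dst_mac)
        if vc is not None:
            vc(frame)                   # a direct VC already exists
            return "direct"
        self.send_to_bus(frame)         # first frame goes via the BUS to avoid delay
        atm_addr = self.les_table.get(dst_mac)  # "What ATM address has this MAC address?"
        if atm_addr is not None:
            self.direct_vcs[dst_mac] = self.open_vc(atm_addr)
        return "bus"
```

The first frame to a new destination travels via the BUS; later frames use the direct VC once it exists.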
One detail in this process is that LANs are not supposed to deliver frames out
of order, and an emulated LAN should be no different. But if some frames are sent
via the BUS and then later frames are sent on a direct connection, misordering may
occur. LAN emulation procedures include a “flush” mechanism to ensure that the last
packet sent down one path has arrived before another one is sent on a new path, thus
ensuring in-order delivery.
With the above process, a client would eventually end up with direct VCs to
all destinations that it has ever sent data to. This might be an excessive number of
VCs, and so a client may use a caching algorithm to dispose of VCs that are no longer
carrying traffic. A “cache miss” (i.e., the arrival of a packet that needs to be sent to a
destination for which no VC exists) will be handled by sending the packet to the BUS.
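One simple caching policy, shown here only as an example, is to release any VC that has been idle for longer than some limit. The class name and the idle limit are assumptions made for this sketch; the text does not prescribe a particular algorithm.

```python
# Hypothetical idle-timeout cache of direct VCs, keyed by destination MAC address.

class VcCache:
    def __init__(self, idle_limit):
        self.idle_limit = idle_limit
        self.last_used = {}  # destination MAC -> time the VC last carried a frame

    def touch(self, mac, now):
        """Record that a frame was just sent to `mac` on its direct VC."""
        self.last_used[mac] = now

    def has_vc(self, mac):
        return mac in self.last_used

    def expire(self, now):
        """Drop VCs idle for longer than the limit; return the MACs that were dropped."""
        stale = [m for m, t in self.last_used.items() if now - t > self.idle_limit]
        for mac in stale:
            del self.last_used[mac]  # a real client would also release the VC here
        return stale
```

After `expire` runs, a frame for a dropped destination is a cache miss and is handled via the BUS, as described above.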
3.4 Implementation and Performance
So far, we have talked about what a switch must do without discussing how to do it.
There is a very simple way to build a switch: Buy a general-purpose workstation and
equip it with a number of network interfaces. Such a device, running suitable software,
can receive packets on one of its interfaces, perform any of the switching functions
described above, and send packets out another of its interfaces. This is, in fact, a
popular way to build experimental switches when you want to be able to do things
like develop new routing protocols, because it offers extreme flexibility and a familiar
programming environment. It is also not too far removed from the architecture of many
low-end routers (which, as we will see in the next chapter, have much in common with
switches).
Figure 3.27 shows a workstation with three network interfaces used as a switch.
The figure shows a path that a packet might take from the time it arrives on interface
1 until it is output on interface 2. We have assumed here that the workstation has a
mechanism to move data directly from an interface to its main memory without the data
having to be copied by the CPU, that is, direct memory access (DMA) as described in
Section 2.9. Once the packet is in memory, the CPU examines its header to determine
which interface the packet should be sent out on. It then uses DMA to move the packet
out to the appropriate interface. Note that Figure 3.27 does not show the packet going
to the CPU because the CPU inspects only the header of the packet; it does not have
to read every byte of data in the packet.
Figure 3.27 A workstation used as a packet switch.
The main problem with using a workstation as a switch is that its performance
is limited by the fact that all packets must pass through a single point of contention:
In the example shown, each packet crosses the I/O bus twice and is written to and
read from main memory once. The upper bound on aggregate throughput of such a
device (the total sustainable data rate summed over all inputs) is, thus, either half the
main memory bandwidth or half the I/O bus bandwidth, whichever is less. (Usually,
it’s the I/O bus bandwidth.) For example, a workstation with a 33-MHz, 32-bit-wide
I/O bus can transmit data at a peak rate of a little over 1 Gbps. Since forwarding a
packet involves crossing the bus twice, the actual limit is roughly 500 Mbps, which is enough to
support five 100-Mbps Ethernet interface cards. In practice, the peak bus bandwidth
isn’t sustainable, so it’s more likely such a workstation would support only three or
four such interface cards.
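The arithmetic behind this bound is easy to check. The function name below is ours; the figures are the ones used in the example above.

```python
# Upper bound on forwarding rate when every packet crosses the I/O bus twice.

def forwarding_limit_bps(bus_mhz, bus_width_bits, crossings=2):
    peak_bps = bus_mhz * 1e6 * bus_width_bits  # one transfer of bus_width_bits per cycle
    return peak_bps / crossings

limit = forwarding_limit_bps(33, 32)  # 528e6: "a little over 1 Gbps", halved
ports = int(limit // 100e6)           # number of 100-Mbps interfaces that fit: 5
```

The exact peak is 1056 Mbps, so the halved figure is 528 Mbps, which the text rounds to 500 Mbps.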
Moreover, this upper bound also assumes that moving data is the only problem—
a fair approximation for long packets but a bad one when packets are short. In the latter
case, the cost of processing each packet—parsing its header and deciding which output
link to transmit it on—is likely to dominate. Suppose, for example, that a workstation can perform all the necessary processing to switch 500,000 packets each second.
This is sometimes called the packet per second (pps) rate. (This number is representative of what is achievable on today’s high-end PCs.) If the average packet is short, say,
64 bytes, this would imply
$$\text{Throughput} = \text{pps} \times \text{BitsPerPacket} = 500 \times 10^{3} \times 64 \times 8 = 256 \times 10^{6}$$
that is, a throughput of 256 Mbps—
substantially below the range that users are
demanding from their networks today. Bear
in mind that this 256 Mbps would be shared
by all users connected to the switch, just as
the 10 Mbps of an Ethernet is shared among
all users connected to the shared medium.
Thus, for example, a 10-port switch with
this aggregate throughput would only be
able to cope with an average data rate of
25.6 Mbps on each port.
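The same calculation, written so that other packet sizes and port counts can be tried. The function name is ours; the numbers are those of the example.

```python
# Throughput implied by a packet-per-second rate and a packet size.

def throughput_bps(pps, packet_bytes):
    return pps * packet_bytes * 8

total = throughput_bps(500_000, 64)  # 256_000_000 bits per second
per_port = total / 10                # shared by a 10-port switch: 25.6 Mbps each
```

Calling `throughput_bps(500_000, 1500)` instead shows how the same pps rate supports a far higher bit rate when packets are long.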
To address this problem, hardware designers have come up with a large array
of switch designs that reduce the amount
of contention and provide high aggregate
throughput. Note that some contention is
unavoidable: If every input has data to send
to a single output, then they cannot all send
it at once. However, if data destined for different outputs is arriving at different inputs,
a well-designed switch will be able to move
data from inputs to outputs in parallel, thus
increasing the aggregate throughput.
Defining Throughput
It turns out to be difficult to define precisely the throughput of a switch. Intuitively,
we might think that if a switch has n inputs that each support a link speed of $s_i$, then
the throughput would just be the sum of all the $s_i$. This is actually the best possible
throughput that such a switch could provide, but in practice almost no real switch
can guarantee that level of performance. One reason for this is simple to understand.
Suppose that, for some period of time, all the traffic arriving at the switch needed
to be sent to the same output. As long as the bandwidth of that output is less than the
sum of the input bandwidths, then some of the traffic will need to be either buffered
or dropped. With this particular traffic pattern, the switch could not provide a
sustained throughput higher than the link speed of that one output. However, a switch
might be able to handle traffic arriving at the full link speed on all inputs if it is
distributed across all the outputs evenly; this would be considered optimal.
Another factor that affects the performance of switches is the size of packets arriving
on the inputs. For an ATM switch, this is normally not an issue because all “packets”
(cells) are the same length. But for Ethernet switches or IP routers, packets of widely
varying sizes are possible. Some of the operations that a switch must perform have a
constant overhead per packet, so a switch is likely to perform differently depending
on whether all arriving packets are very short, very long, or mixed. For this reason,
routers or switches that forward variable-length packets are often characterized by a
packet per second (pps) rate as well as a throughput in bits per second. The pps rate is
usually measured with minimum-sized packets.
The first thing to notice about this discussion is that the throughput of the switch is a
function of the traffic to which it is subjected. One of the things that switch designers
spend a lot of their time doing is trying to come up with traffic models that approximate
the behavior of real data traffic. It turns out that it is extremely difficult to achieve
accurate models. A traffic model attempts to answer several important questions:
(1) When do packets arrive? (2) What outputs are they destined for? And (3) how big
are they?
Traffic modeling is a well-established science that has been extremely successful in the
world of telephony, enabling telephone companies to engineer their networks to carry
expected loads quite efficiently. This is partly because the way people use the phone
network does not change that much over time: The frequency with which calls are
placed, the amount of time taken for a call, and the tendency of everyone to make calls
on Mother’s Day have stayed fairly constant for many years.4 By contrast, the rapid
evolution of computer communications, where a new application like Napster can
change the traffic patterns almost overnight, has made effective modeling of computer
networks much more difficult. Nevertheless, there are some excellent books and
articles on the subject that we list at the end of the chapter.
To give you a sense of the range of throughputs that designers need to be concerned
about, a high-end router used in the Internet at the time of writing might support 10
OC-192 links for a throughput of approximately 100 Gbps. A 100-Gbps switch, if called
upon to handle a steady stream of 64-byte packets, would need a packet per second
rate of
$$100 \times 10^{9} \div (64 \times 8) \approx 195 \times 10^{6} \text{ pps}$$
4 This statement has recently become less true with the advent of fax machines and modem connections to
the Internet.
Figure 3.28 A 4 × 4 switch.
Most switches look conceptually similar to the one shown in Figure 3.28. They consist
of a number of input ports and output ports and a fabric. There is usually at least
one control processor in charge of the whole switch that communicates with the ports
either directly or, as shown here, via the switch fabric. The ports communicate with
the outside world. They may contain fiber optic receivers and lasers, buffers to hold
packets that are waiting to be switched or transmitted, and often a significant amount
of other circuitry that enables the switch to function. The fabric has a very simple
and well-defined job: When presented with a packet, deliver it to the right output port.
One of the jobs of the ports, then, is to deal with the complexity of the real world
in such a way that the fabric can do its relatively simple job. For example, suppose
that this switch is supporting a virtual circuit model of communication. In general,
the virtual circuit mapping tables described in Section 3.1.2 are located in the ports.
The ports maintain lists of virtual circuit identifiers that are currently in use, with
information about what output a packet should be sent out on for each VCI and
how the VCI needs to be remapped to ensure uniqueness on the outgoing link. Similarly,
the ports of an Ethernet switch store tables that map between Ethernet addresses and
output ports (bridge forwarding tables as described in Section 3.2). In general, when
a packet is handed from an input port to the fabric, the port has figured out where
the packet needs to go, and either the port sets up the fabric accordingly by
communicating some control information to it, or it attaches enough information to the
packet itself (e.g., an output port number) to allow the fabric to do its job automatically.
Fabrics that switch packets by looking only at the information in the packet are referred
to as “self-routing,” since they require no external control to route packets. An example
of a self-routing fabric is discussed below.
The input port is the first place to look for performance bottlenecks. The input port
has to receive a steady stream of packets, analyze information in the header of each
one to determine which output port (or ports) the packet must be sent to, and pass
the packet on to the fabric. The type of header analysis that it performs can range
from a simple table lookup on a VCI to complex matching algorithms that examine
many fields in the header. This is the type of operation that sometimes becomes a
problem when the average packet size is very small. For example, a port connected to
an OC-48 link on which 64-byte packets are arriving has to process packets at a rate of
$$2.48 \times 10^{9} \div (64 \times 8) \approx 4.84 \times 10^{6} \text{ pps}$$
In other words, when small packets are arriving as fast as possible on this link (the
worst-case scenario that most ports are engineered to handle), the input port has
approximately 200 nanoseconds to process each packet.
Another key function of ports is buffering. Observe that buffering can happen in either
the input or the output port; it can also happen within the fabric (sometimes called
internal buffering). Simple input buffering has some serious limitations. Consider an
input buffer implemented as a FIFO. As packets arrive at the switch, they are placed
in the input buffer. The switch then tries to forward the packets at the front of each
FIFO to their appropriate output port. However, if the packets at the front of several
different input ports are destined for the same output port at the same time, then only
one of them can be forwarded;5 the rest must stay in their input buffers.
Figure 3.29 Simple illustration of head-of-line blocking. (Figure labels: Port 1, Port 2.)
The drawback of this feature is that those packets left at the front of the input
buffer prevent other packets further back in the buffer from getting a chance to go
to their chosen outputs, even though there may be no contention for those outputs.
This phenomenon is called head-of-line blocking. A simple example of head-of-line
blocking is given in Figure 3.29, where we see a packet destined for port 1 blocked
behind a packet contending for port 2. It can be shown that when traffic is uniformly
distributed among outputs, head-of-line blocking limits the throughput of an
input-buffered switch to 59% of the theoretical maximum (which is the sum of the
link bandwidths for the switch). Thus, the majority of switches use either pure output
buffering or a mixture of internal and output buffering. Those that do rely on
input buffers use sophisticated buffer management schemes to avoid head-of-line
blocking.
5 For a simple input-buffered switch, exactly one packet at a time can be sent to a given output port. It is possible
to design switches that can forward more than one packet to the same output at once, at a cost of higher switch
complexity, but there is always some upper limit on the number.
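A small simulation makes the effect of head-of-line blocking visible. The model below is a sketch under stated assumptions: every input FIFO always has a packet waiting, destinations are chosen uniformly at random, time is divided into slots, and each output accepts one packet per slot from a randomly chosen contender. The function name, port count, slot count, and seed are all arbitrary.

```python
import random

# Saturated input-queued switch with FIFO inputs (illustrative model only).

def hol_throughput(n_ports=16, slots=5000, seed=1):
    rng = random.Random(seed)
    # Destination of the packet at the head of each input FIFO.
    head = [rng.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(slots):
        contenders = {}
        for inp, out in enumerate(head):
            contenders.setdefault(out, []).append(inp)
        for inputs in contenders.values():
            winner = rng.choice(inputs)            # one packet per output per slot
            head[winner] = rng.randrange(n_ports)  # the next packet moves to the head
            delivered += 1
    return delivered / (n_ports * slots)           # fraction of the maximum
```

With 16 ports the result comes out at roughly 0.6; larger port counts bring it closer to the 59% figure quoted above.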
Buffers actually perform a more complex task than just holding onto packets
that are waiting to be transmitted. Buffers are the main source of delay in a switch,
and also the place where packets are most likely to get dropped due to lack of space
to store them. The buffers therefore are the main place where the quality of service
characteristics of a switch are determined. For example, if a certain packet has been
sent along a VC that has a guaranteed delay, it cannot afford to sit in a buffer for very
long. This means that the buffers, in general, must be managed using packet scheduling
and discard algorithms that meet a wide range of QoS requirements. We talk more
about these issues in Chapter 6.
While there has been an abundance of impressive research conducted on the design
of efficient and scalable fabrics, it is sufficient for our purposes here to understand
only the high-level properties of a switch fabric. A switch fabric should be able to
move packets from input ports to output ports with minimal delay and in a way that
meets the throughput goals of the switch. That usually means that fabrics display
some degree of parallelism. A high-performance fabric with n ports can often move
one packet from each of its n ports to one of the output ports at the same time. A
sample of fabric types includes the following:
■ Shared-bus: This is the type of “fabric” found in a conventional workstation
used as a switch, as described above. Because the bus bandwidth determines
the throughput of the switch, high-performance switches usually have specially designed busses rather than the standard busses found in PCs.
■ Shared-memory: In a shared-memory switch, packets are written into a memory location by an input port and then read from memory by the output ports.
Here it is the memory bandwidth that determines switch throughput, so wide
and fast memory is typically used in this sort of design. A shared-memory
switch is similar in principle to the shared-bus switch, except it usually uses a
specially designed, high-speed memory bus rather than an I/O bus.
■ Crossbar: A crossbar switch is a matrix of pathways that can be configured
to connect any input port to any output port. Figure 3.30 shows a 4 × 4
crossbar switch. The main problem with crossbars is that, in their simplest
form, they require each output port to be able to accept packets from all inputs
at once, implying that each port would have a memory bandwidth equal to the
total switch throughput. In reality, more complex designs are typically used
to address this issue (see, for example, the Knockout switch and McKeown’s
virtual output-buffered approach in the “Further Reading” section at the end
of the chapter).
Figure 3.30 A 4 × 4 crossbar switch.
■ Self-routing: As noted above, self-routing fabrics rely on some information
in the packet header to direct each packet to its correct output. Usually, a
special “self-routing header” is appended to the packet by the input port after it has determined which output the packet needs to go to, as illustrated
in Figure 3.31; this extra header is removed before the packet leaves the
switch. Self-routing fabrics are often built from large numbers of very simple 2 × 2 switching elements interconnected in regular patterns, such as the
banyan switching fabric shown in Figure 3.32. For some examples of self-routing
fabric designs, see the “Further Reading” section at the end of this chapter.
Self-routing fabrics are among the most scalable approaches to fabric design, and
there has been a wealth of research on the topic, some of which is listed in the “Further
Reading” section. Many self-routing fabrics resemble the one shown in Figure 3.32,
consisting of regularly interconnected 2 × 2 switching elements. For example, the 2 × 2
switches in the banyan network perform a simple task: They look at 1 bit in each
self-routing header and route packets toward the upper output if it is 0 or toward the
lower output if it is 1. Obviously, if two packets arrive at a banyan element at the
same time and both have the bit set to the same value, then they want to be routed
to the same output and a collision will occur. Either preventing or dealing with these
collisions is a main challenge for self-routing switch design. The banyan network is a
clever arrangement of 2 × 2 switching elements that routes all packets to the correct
output without collisions if the packets are presented in ascending order.
Figure 3.31 A self-routing header is applied to a packet at input to enable the fabric to
send the packet to the correct output, where it is removed. (a) Packet arrives at input
port. (b) Input port attaches self-routing header to direct packet to correct output. (c)
Self-routing header is removed at output port before packet leaves switch.
Figure 3.32 Routing packets through a banyan network. The 3-bit numbers represent
values in the self-routing headers of four arriving packets.
We can see how this works in an example, as shown in Figure 3.32, where the
self-routing header contains the output port number encoded in binary. The switch
elements in the first column look at the most significant bit of the output port number
and route packets to the top if that bit is a 0 or the bottom if it is a 1. Switch elements
in the second column look at the second bit in the header, and those in the last column
look at the least significant bit. You can see from this example that the packets are
routed to the correct destination port without collisions. Notice how the top outputs
from the first column of switches all lead to the top half of the network, thus getting
packets with port numbers 0–3 into the right half of the network. The next column
gets packets to the right quarter of the network, and the final column gets them to the
right output port. The clever part is the way switches are arranged to avoid collisions.
Part of the arrangement includes the “perfect shuffle” wiring pattern at the start of
the network. To build a complete switch fabric around a banyan network would require
additional components to sort packets before they are presented to the banyan. The
Batcher-banyan switch design is a notable example of such an approach. The Batcher
network, which is also built from a regular interconnection of 2×2 switching elements,
3 Packet Switching
sorts packets into descending order. On leaving the Batcher network, the packets are
then ready to be directed to the correct output, with no risk of collisions, by the banyan
network.
One of the interesting things about switch design is the wide range of different
types of switches that can be built using the same basic technology. For example,
the Ethernet switches and ATM switches discussed in this chapter, as well as Internet
routers discussed in the next chapter, are all built using designs such as those outlined
in this section.
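The self-routing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fabric is modeled as an omega (shuffle-exchange) network, a member of the banyan family whose wiring may differ from Figure 3.32; packets are assumed to arrive on consecutive inputs starting at input 0; and the name route_omega and its return format are ours, not part of any switch design in the text.

```python
def route_omega(dests, n_bits=3):
    """Route packets through a 2**n_bits-port omega network of 2x2 elements.

    dests[i] is the output port requested by the packet on input line i
    (hypothetical input format; packets occupy inputs 0, 1, 2, ...).
    Returns (final_positions, collisions), where collisions is a list of
    (stage, element) pairs at which two packets wanted the same output.
    """
    size = 1 << n_bits
    mask = size - 1
    positions = list(range(len(dests)))
    collisions = []
    for stage in range(n_bits):
        bit_index = n_bits - 1 - stage        # most significant bit first
        claimed = set()
        new_positions = []
        for pos, dest in zip(positions, dests):
            # Perfect shuffle: rotate the line number left by one bit.
            shuffled = ((pos << 1) | (pos >> (n_bits - 1))) & mask
            # The 2x2 element sends the packet up (0) or down (1)
            # according to the header bit examined at this stage.
            bit = (dest >> bit_index) & 1
            out_line = (shuffled & ~1) | bit
            if out_line in claimed:
                # Two packets want the same element output. We only record
                # the conflict; a real switch would have to buffer or drop.
                collisions.append((stage, shuffled >> 1))
            claimed.add(out_line)
            new_positions.append(out_line)
        positions = new_positions
    return positions, collisions


print(route_omega([1, 3, 4, 6]))   # ascending order
print(route_omega([0, 4, 1, 5]))   # unsorted order
```

With destinations 1, 3, 4, 6 presented in ascending order, the sketch delivers every packet to its requested port with no collisions, whereas the order 0, 4, 1, 5 makes two pairs of packets contend for the same element output in the second column.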
3.5 Summary
This chapter has started to look at some of the issues involved in building large scalable
networks by using switches, rather than just links, to interconnect hosts. There are
several different ways to decide how to switch packets; the two main ones are the
datagram (connectionless) model and the virtual circuit (connection-oriented) model.
An important application of switching is the interconnection of shared-media
LANs. LAN switches, or bridges, use techniques such as source address learning to
improve forwarding efficiency and spanning tree algorithms to avoid looping. These
switches are extensively used in data centers, campuses, and corporate networks.
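As a reminder of how source address learning works, here is a minimal sketch in Python. It is an illustration only: the class name LearningBridge and its interface are hypothetical, and it leaves out table-entry timeouts and the spanning tree algorithm that a real bridge would also need.

```python
class LearningBridge:
    """Toy learning bridge: remembers the port on which each source was seen."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.table = {}   # address -> port on which it last appeared as a source

    def receive(self, src, dst, in_port):
        """Return the list of ports on which the frame should be sent."""
        self.table[src] = in_port              # source address learning
        out_port = self.table.get(dst)
        if out_port is None:
            # Unknown destination: flood on every port except the arrival port.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            return []                          # destination is on the arrival segment
        return [out_port]


bridge = LearningBridge([1, 2, 3])
print(bridge.receive("A", "C", 1))   # C unknown, so the frame is flooded
print(bridge.receive("C", "A", 2))   # A was learned on port 1
```

The first frame is flooded because the bridge has not yet seen C; the reply is forwarded on a single port because the bridge learned A's location from the first frame.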
The most widespread uses of virtual circuit switching are in Frame Relay and
ATM switches. ATM introduces some particular challenges through the use of cells,
short fixed-length packets. The availability of relatively high-throughput ATM switches
has contributed to the acceptance of the technology, although it has certainly not swept
all other technologies aside as some predicted. One of the main uses of ATM today is
to interconnect widely separated sites in corporate networks.
Independent of the specifics of the switching technology, switches need to forward
packets from inputs to outputs at a high rate, and in some circumstances, switches
need to grow to a large size to accommodate hundreds or thousands of ports. Building
switches that both scale and offer high performance at acceptable cost is complicated
by the problem of contention, and as a consequence, switches often employ special-purpose hardware rather than being built from general-purpose workstations.
In addition to the issues of contention discussed here, we observe that the related
problem of congestion has come up throughout this chapter. We will postpone our
discussion of congestion control until Chapter 6, after we have seen more of the
network architecture. We do this because it is impossible to fully appreciate congestion
(both the problem and how to address it) without understanding both what happens
inside the network (the topic of this and the next chapter) and what happens at the
edges of the network (the topic of Chapter 5).
Open Issue: The Future of ATM
ATM was originally envisioned by many of its proponents as the foundation for the
“Broadband Integrated Services Digital Network,” and it was predicted in some
quarters that ATM would displace all other networking technologies. Hosts would acquire
ATM adaptors instead of Ethernet ports, enabling “ATM to the desktop.” Phone
companies everywhere would deploy ATM, and as the technology that supports all
media types (voice, video, and data), it would remove the need for any other type of
network.
It is now apparent that this scenario is unlikely to play out. The success of Ethernet switches in particular has killed off the ATM-to-the-desktop movement. Gigabit
Ethernet and 10-Gigabit Ethernet technologies have successfully addressed the need
for high-speed connections to servers where ATM might once have been used. In fact,
we now hear ATM referred to as a “legacy protocol,” a term that was much used in
the heyday of ATM to refer to older protocols.
Another factor that has limited the acceptance of ATM has been the success of
the Internet. It is now a fact of life that consumers are willing to pay for Internet
access, and that means selling a service that delivers IP packets. While ATM can be
used to help deliver that service (as it is in many DSL networks, for example), simply
selling ATM connections to consumers does not meet their data networking needs. The
notable exception is corporate customers looking to interconnect many sites, where
an ATM VC may be just the right thing to economically replace a leased line. In fact,
this is the primary niche for ATM today: it is used (in conjunction with Frame Relay)
to provide wide area virtual circuit services to corporate networking customers.
The future of ATM therefore seems to hinge on the future of wide area virtual
circuit-based services. These services are unlikely to go away any time soon, but ATM’s
role is to some extent being challenged by newer technologies, such as encrypted IP
tunnels and Multiprotocol Label Switching (MPLS). Both of these technologies will
be described in the next chapter.
Further Reading
The seminal paper on bridges, in particular the spanning tree algorithm, is the article
by Perlman below. There is a wealth of survey papers on ATM; the article by Turner,
an ATM pioneer, is one of the earliest to propose the use of a cell-based network for
integrated services. The third paper describes the Sunshine switch and is especially
interesting because it provides insights into the important role of traffic analysis in
switch design. In particular, the Sunshine designers were among the first to realize that
cells were unlikely to arrive at a switch in a totally uncorrelated way and thus were
able to factor these correlations into their design. Finally, McKeown’s paper describes
an approach to switch design that uses cells internally but has been used commercially
as the basis for high-performance routers forwarding variable-length packets.
■ Perlman, R. An algorithm for distributed computation of spanning trees in an
extended LAN. Proceedings of the Ninth Data Communications Symposium,
pages 44–53, September 1985.
■ Turner, J. S. Design of an integrated services packet network. Proceedings
of the Ninth Data Communications Symposium, pages 124–133, September
■ Giacopelli, J. N., et al. Sunshine: A high-performance self-routing broadband
packet-switched architecture. IEEE Journal of Selected Areas in Communications (JSAC) 9(8):1289–1298, October 1991.
■ McKeown, N. The iSLIP scheduling algorithm for input-queued switches.
IEEE/ACM Transactions on Networking 7(2):188–201, April 1999.
A good general overview of bridges can be found in another work by Perlman
[Per00]. For a detailed description of many aspects of ATM, with a focus on building
real networks, we recommend the book by Ginsburg [Gin99]. Also, as one of the key
ATM standards-setting bodies, the ATM Forum produces new specifications for ATM;
the User-Network Interface (UNI) specification, version 4.1, is the most recent at the
time of this writing. (See the live reference below.)
There have been literally thousands of papers published on switch architectures.
One early paper that explains Batcher networks well is, not surprisingly, one by Batcher
himself [Bat68]. Sorting networks are explained by Drysdale and Young [DY75], and
the knockout switch, an interesting form of crossbar switch, is described by Yeh et al.
[YHA87]. A survey of ATM switch architectures appears in Partridge [Par94], and
a good overview of the performance of different switching fabrics can be found in
Robertazzi [Rob93]. An example of the design of a switch based on variable-length
packets can be found in Gopal and Guerin [GG94].
Optical networking is a rich field in its own right, with its own journals, conferences, and so on. We recommend Ramaswami and Sivarajan [RS01] as a good
introductory text in that field.
An excellent text to read if you want to learn about the mathematical analysis of
network performance is by Kleinrock [Kle75], one of the pioneers of the ARPANET.
Many papers have been published on the applications of queuing theory to packet
switching. We recommend the article by Paxson and Floyd [PF94] as a significant
contribution focused on the Internet, and one by Leland et al. [LTWW94], a paper
that introduces the important concept of “long-range dependence” and shows the
inadequacy of many traditional approaches to traffic modeling.
Finally, we recommend the following live reference:
■ current activities of the ATM Forum
Exercises
1 Using the example network given in Figure 3.33, give the virtual circuit tables for
all the switches after each of the following connections is established. Assume that
the sequence of connections is cumulative; that is, the first connection is still up
when the second connection is established, and so on. Also assume that the VCI
assignment always picks the lowest unused VCI on each link, starting with 0.
(a) Host A connects to host B.
(b) Host C connects to host G.
(c) Host E connects to host I.
(d) Host D connects to host B.
(e) Host F connects to host J.
(f) Host H connects to host A.
2 Using the example network given in Figure 3.33, give the virtual circuit tables
for all the switches after each of the following connections is established. Assume
that the sequence of connections is cumulative; that is, the first connection is
still up when the second connection is established, and so on. Also assume that
the VCI assignment always picks the lowest unused VCI on each link, starting
with 0.
(a) Host D connects to host H.
(b) Host B connects to host G.
(c) Host F connects to host A.
(d) Host H connects to host C.
(e) Host I connects to host E.
(f) Host H connects to host J.
Figure 3.33 Example network for Exercises 1 and 2. (Hosts A through J attached to Switches 1 through 4.)
Figure 3.34 Network for Exercise 3.
3 For the network given in Figure 3.34, give the datagram forwarding table for each
node. The links are labelled with relative costs; your tables should forward each
packet via the lowest-cost path to its destination.
4 Give forwarding tables for switches S1–S4 in Figure 3.35. Each switch should
have a “default” routing entry, chosen to forward packets with unrecognized
destination addresses toward OUT. Any specific destination table entries duplicated by the default entry should then be eliminated.
Figure 3.35 Diagram for Exercise 4.
Figure 3.36 Diagram for Exercise 5.
Table 3.5 VCI tables for switches in Figure 3.36. Columns for each of Switch S1, Switch S2, and Switch S3: Port, VCI, Port, VCI.
5 Consider the virtual circuit switches in Figure 3.36. Table 3.5 lists, for each switch,
what ⟨port, VCI⟩ (or ⟨VCI, interface⟩) pairs are connected to which others. Connections are bidirectional. List all endpoint-to-endpoint connections.
6 In the source routing example of Section 3.1.3, the address received by B is not
reversible and doesn’t help B know how to reach A. Propose a modification to the
delivery mechanism that does allow for reversibility. Your mechanism should not
require giving all switches globally unique names.
7 Propose a mechanism that virtual circuit switches might use so that if one switch
loses all its state regarding connections, then a sender of packets along a path
through that switch is informed of the failure.
8 Propose a mechanism that might be used by datagram switches so that if one
switch loses all or part of its forwarding table, affected senders are informed of
the failure.
9 The virtual circuit mechanism described in Section 3.1.2 assumes that each link
is point-to-point. Extend the forwarding algorithm to work in the case that links
are shared-media connections, for example, Ethernet.
10 Suppose, in Figure 3.4, that a new link has been added, connecting switch 3
port 1 (where G is now) and switch 1 port 0 (where D is now); neither switch is
“informed” of this link. Furthermore, switch 3 mistakenly thinks that host B is
reached via port 1.
(a) What happens if host A attempts to send to host B, using datagram forwarding?
(b) What happens if host A attempts to connect to host B, using the virtual circuit
setup mechanism discussed in the text?
11 Give an example of a working virtual circuit whose path traverses some link twice.
Packets sent along this path should not, however, circulate indefinitely.
12 In Section 3.1.2, each switch chose the VCI value for the incoming link. Show that
it is also possible for each switch to choose the VCI value for the outbound link,
and that the same VCI values will be chosen by each approach. If each switch
chooses the outbound VCI, is it still necessary to wait one RTT before data is
sent?
13 Given the extended LAN shown in Figure 3.37, indicate which ports are not
selected by the spanning tree algorithm.
14 Given the extended LAN shown in Figure 3.37, assume that bridge B1 suffers
catastrophic failure. Indicate which ports are not selected by the spanning tree
algorithm after the recovery process and a new tree has been formed.
Figure 3.37 Network for Exercises 13 and 14.
Figure 3.38 Network for Exercises 15 and 16.
15 Consider the arrangement of learning bridges shown in Figure 3.38. Assuming all
are initially empty, give the forwarding tables for each of the bridges B1–B4 after
the following transmissions:
■ A sends to C.
■ C sends to A.
■ D sends to C.
Identify ports with the unique neighbor reached directly from that port; that is,
the ports for B1 are to be labelled “A” and “B2.”
16 As in the previous problem, consider the arrangement of learning bridges shown
in Figure 3.38. Assuming all are initially empty, give the forwarding tables for
each of the bridges B1–B4 after the following transmissions:
■ D sends to C.
■ C sends to D.
■ A sends to C.
Figure 3.39 Diagram for Exercise 17.
Figure 3.40 Extended LAN for Exercise 18.
17 Consider hosts X, Y, Z, W and learning bridges B1, B2, B3, with initially empty
forwarding tables, as in Figure 3.39.
(a) Suppose X sends to Z. Which bridges learn where X is? Does Y’s network
interface see this packet?
(b) Suppose Z now sends to X. Which bridges learn where Z is? Does Y’s network
interface see this packet?
(c) Suppose Y now sends to X. Which bridges learn where Y is? Does Z’s network
interface see this packet?
(d) Finally, suppose Z sends to Y. Which bridges learn where Z is? Does W’s
network interface see this packet?
18 Give the spanning tree generated for the extended LAN shown in Figure 3.40, and
discuss how any ties are resolved.
Figure 3.41 Loop for Exercises 19 and 20.
19 Suppose two learning bridges B1 and B2 form a loop as shown in Figure 3.41,
and do not implement the spanning tree algorithm. Each bridge maintains a single
table of ⟨address, interface⟩ pairs.
(a) What will happen if M sends to L?
(b) Suppose a short while later L replies to M. Give a sequence of events that leads
to one packet from M and one packet from L circling the loop in opposite
directions.
20 Suppose that M in Figure 3.41 sends to itself (this normally would never happen).
State what would happen, assuming
(a) the bridges’ learning algorithm is to install (or update) the new ⟨source address,
interface⟩ entry before searching the table for the destination address
(b) the new source address was installed after destination address lookup
21 Consider the extended LAN of Figure 3.12. What happens in the spanning tree
algorithm if bridge B1 does not participate and
(a) simply forwards all spanning tree algorithm messages?
(b) drops all spanning tree messages?
22 Suppose some repeaters (hubs), rather than bridges, are connected into a loop.
(a) What will happen when somebody transmits?
(b) Why would the spanning tree mechanism be difficult or impossible to implement for repeaters?
(c) Propose a mechanism by which repeaters might detect loops and shut down
some ports to break the loop. Your solution is not required to work 100% of
the time.
23 Suppose a bridge has two of its ports on the same network. How might the bridge
detect and correct this?
24 What percentage of an ATM link’s total bandwidth is consumed by the ATM cell
headers? What percentage of the total bandwidth is consumed by all nonpayload
bits in AAL3/4 and AAL5, when the user data is 512 bytes long?
25 Explain why AAL3/4 will not detect the loss of 16 consecutive cells of a single
PDU.
26 The IP datagram for a TCP ACK message is 40 bytes long: It contains 20 bytes of
TCP header and 20 bytes of IP header. Assume that this ACK is traversing an ATM
network that uses AAL5 to encapsulate IP packets. How many ATM packets will
it take to carry the ACK? What if AAL3/4 is used instead?
27 The CS-PDU for AAL5 contains up to 47 bytes of padding, while the AAL3/4 CS-PDU only contains up to 3 bytes of padding. Explain why the effective bandwidth
of AAL5 is always the same as, or higher than, that of AAL3/4, given a PDU of a
particular size.
28 How reliable does an ATM connection have to be in order to maintain a loss
rate of less than one per million for a higher-level PDU of size 20 cells? Assume
AAL5.
29 Assuming the 20-cell AAL5 packet from the previous problem, suppose a final cell
is tacked on the end of the PDU, and that this cell is the XOR of all the previous
cells in the PDU. This allows recovery from any one lost cell. What cell loss rate
now would yield a net one-per-million loss rate for 20-data-cell PDUs?
30 Recall that AAL3/4 has a CRC-10 checksum at the end of each cell, while AAL5
has a single CRC-32 checksum at the end of the PDU. If a PDU is carried in
12 AAL3/4 cells, then AAL3/4 devotes nearly four times as many bits to error
detection as AAL5.
(a) Suppose errors are known to come in bursts, where each burst is small enough
to be confined to a single cell. Find the probability that AAL3/4 fails to detect
an error, given that it is known that exactly two cells are affected. Do the same
for three cells. Under these conditions, is AAL3/4 more or less reliable than
AAL5? Assume that an N-bit CRC fails to detect an error with probability
$1/2^N$ (which is strictly true only when all errors are equally likely).
(b) Can you think of any error distribution in which AAL3/4 would be more likely
than AAL5 to detect an error? Do you think such circumstances are likely?
31 Cell switching methods essentially always use virtual circuit routing rather than
datagram routing. Give a specific argument why this is so.
32 Suppose a workstation has an I/O bus speed of 800 Mbps and memory bandwidth
of 2 Gbps. Assuming DMA in and out of main memory, how many interfaces to
45-Mbps T3 links could a switch based on this workstation handle?
33 Suppose a workstation has an I/O bus speed of 1 Gbps and memory bandwidth
of 2 Gbps. Assuming DMA in and out of main memory, how many interfaces to
45-Mbps T3 links could a switch based on this workstation handle?
34 Suppose a switch can forward packets at a rate of 100,000 per second, regardless
(within limits) of size. Assuming the workstation parameters described in the previous problem, at what packet size would the bus bandwidth become the limiting
factor?
35 Suppose that a switch is designed to have both input and output FIFO buffering.
As packets arrive on an input port they are inserted at the tail of the FIFO. The
switch then tries to forward the packets at the head of each FIFO to the tail of the
appropriate output FIFO.
(a) Explain under what circumstances such a switch can lose a packet destined
for an output port whose FIFO is empty.
(b) What is this behavior called?
(c) Assuming the FIFO buffering memory can be redistributed freely, suggest a
reshuffling of the buffers that avoids the above problem, and explain why it
does so.
36 A stage of an n × n banyan network consists of (n/2) 2 × 2 switching elements.
The first stage directs packets to the correct half of the network, the next stage
to the correct quarter, and so on, until the packet is routed to the correct output.
Derive an expression for the number of 2 × 2 switching elements needed to make
an n × n banyan network. Verify your answer for n = 8.
37 Describe how a Batcher network works. (See the “Further Reading” section.)
Explain how a Batcher network can be used in combination with a banyan network
to build a switching fabric.
38 An Ethernet switch is simply a bridge that has the ability to forward some number of packets in parallel, assuming the input and output ports are all distinct.
Suppose two such N-port switches, for a large value of N, are each able to forward individually up to three packets in parallel. They are then connected to one
another in series by joining a pair of ports, one from each switch; the joining link
is the bottleneck as it can, of course, carry only one packet at a time.
(a) Suppose we choose two connections through this combined switch at random.
What is the probability that both connections can be forwarded in parallel?
Hint: This is the probability that at most one of the connections crosses the
joining link.
(b) What if three connections are chosen at random?
39 Suppose a 10-Mbps Ethernet hub (repeater) is replaced by a 10-Mbps switch,
in an environment where all traffic is between a single server and N “clients.”
Because all traffic must still traverse the server-switch link, nominally there is no
improvement in bandwidth.
(a) Would you expect any improvement in bandwidth? If so, why?
(b) What would your answer be if the original hub were token ring rather than
Ethernet?
(c) What other advantages and drawbacks might a switch offer versus a
hub?
Every seeming equality conceals a hierarchy.
—Mason Cooley
There Is More Than One Network
We have now seen how to build a single network using point-to-point links,
shared media, and switches. The problem is that lots of people have built
networks with these various technologies and they all want to be able to
communicate with each other, not just with the other users of a single network. This
chapter is about the problem of interconnecting different networks.
There are two important problems that must be addressed when connecting networks:
heterogeneity and scale. Simply stated, the problem of heterogeneity is that users
on one type of network want to be able to communicate with users on other types of
networks. To further complicate matters, establishing connectivity between
hosts on two different networks may require traversing several other networks in
between, each of which may be of yet another type. These different networks may be
Ethernets, token rings, point-to-point links, or switched networks of various kinds,
and each of them is likely to have its own addressing scheme, media access protocols,
service model, and so on. The challenge of heterogeneity is to provide a useful and
fairly predictable host-to-host service over this hodgepodge of different networks. To
understand the problem of scaling, it is worth considering the growth of the Internet,
which has roughly doubled in size each year for 20 years. This sort of growth forces us
to face a number of challenges. One of these is routing: How can you find an efficient
path through a network with millions, or perhaps billions, of nodes? Closely related
to this is the problem of addressing, the task of providing suitable identifiers for all
those nodes.
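To get a feel for the growth rate quoted above, note that doubling once a year for 20 years compounds to roughly a millionfold increase. This is simple arithmetic on the stated figures, not an additional measurement:

```latex
\[
2^{20} = 1{,}048{,}576 \approx 10^{6}
\]
```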
This chapter looks at a series of approaches to interconnecting networks and
the problems that must be solved. In doing so, we trace the evolution of the TCP/IP
Internet in an effort to understand the problems of heterogeneity and scale in detail, along with the general techniques that can be applied to them.
The first section introduces the Internet Protocol (IP)
and shows how it can be used to build a scalable, heterogeneous internetwork. This section includes a discussion of
the Internet’s service model, which is the key to its ability
to handle heterogeneity. It also describes how the Internet’s
hierarchical addressing scheme has helped the Internet to
scale to a modestly large size.
A central aspect of building large heterogeneous internetworks is the problem of finding efficient, loop-free
paths through the constituent networks. The second section introduces the principles of routing and explores the
scaling issues of routing protocols, using some of the Internet’s routing protocols as examples.
The third section discusses several of the problems
(growing pains) that the Internet has experienced over the
past several years and introduces a variety of techniques
that have been employed to address these problems. The
experience gained from using these techniques has led to
the design of a new version of IP, which is IP version 6
(IPv6). Throughout all these discussions, we see the importance of hierarchy in building scalable networks.
The chapter concludes by considering a pair of significant enhancements to the Internet’s capabilities. The
first, multicast, is an enhancement of the basic service
model. We show how multicast—the ability to deliver
packets efficiently to a set of receivers—can be incorporated into an internet, and we describe several of the routing protocols that have been developed to support multicast. The second enhancement, MPLS (Multiprotocol
Label Switching), modifies the forwarding mechanism of
IP networks. This modification has enabled some changes
in the way IP routing is performed and in the services offered by IP networks.
4 Internetworking
4.1 Simple Internetworking (IP)
In the previous chapter, we saw that it was possible to build reasonably large LANs
using bridges and LAN switches, but that such approaches were limited in their ability to scale and to handle heterogeneity. In this chapter, we explore some ways to go
beyond the limitations of bridged networks, enabling us to build large, highly heterogeneous networks with reasonably efficient routing. We refer to such networks
as internetworks. In the following sections, we make a steady progression toward
larger and larger internetworks. We start with the basic functionality of the currently
deployed version of the Internet Protocol (IP), and then we examine various techniques
that have been developed to extend the scalability of the Internet in Section 4.3. This
discussion culminates with a description of IP version 6 (IPv6), also known as the
“next-generation” IP. Before delving into the details of an internetworking protocol,
however, let’s consider more carefully what the word “internetwork” means.
What Is an Internetwork?
We use the term “internetwork,” or sometimes just “internet” with a lowercase i, to
refer to an arbitrary collection of networks interconnected to provide some sort of host-to-host packet delivery service. For example, a corporation with many sites might construct a private internetwork by interconnecting the LANs at its different sites with
point-to-point links leased from the phone company. When we are talking about the
widely used, global internetwork to which a large percentage of networks are now connected, we call it the “Internet” with a capital I. In keeping with the first-principles approach of this book, we mainly want you to learn about the principles of “lowercase i”
internetworking, but we illustrate these ideas with real-world examples from the
“big I” Internet.
Another piece of terminology that can be confusing is the difference between
networks, subnetworks, and internetworks. We are going to avoid subnetworks (or
subnets) altogether until Section 4.3. For now, we use network to mean either a directly
connected or a switched network of the kind that was discussed in the last two chapters.
Such a network uses one technology, such as 802.5, Ethernet, or ATM. An internetwork
is an interconnected collection of such networks. Sometimes, to avoid ambiguity, we
refer to the underlying networks that we are interconnecting as physical networks.
An internet is a logical network built out of a collection of physical networks. In
this context, a collection of Ethernets connected by bridges or switches would still be
viewed as a single network.
Figure 4.1 shows an example internetwork. An internetwork is often referred
to as a “network of networks” because it is made up of lots of smaller networks. In
this figure, we see Ethernets, an FDDI ring, and a point-to-point link. Each of these
is a single-technology network. The nodes that interconnect the networks are called
routers. They are also sometimes called gateways, but since this term has several other
connotations, we restrict our usage to router.
Figure 4.1 A simple internetwork. Hn = host; Rn = router. Networks shown: Network 1 (Ethernet), Network 2 (Ethernet), Network 3 (FDDI), and Network 4.
The Internet Protocol is the key tool used today to build scalable, heterogeneous
internetworks. It was originally known as the Kahn-Cerf protocol after its inventors.
One way to think of IP is that it runs on all the nodes (both hosts and routers) in
a collection of networks and defines the infrastructure that allows these nodes and
networks to function as a single logical internetwork. For example, Figure 4.2 shows
how hosts H1 and H8 are logically connected by the internet in Figure 4.1, including
the protocol graph running on each node. Note that higher-level protocols, such as
TCP and UDP, typically run on top of IP on the hosts.
Most of the rest of this chapter is about various aspects of IP. While it is certainly
possible to build an internetwork that does not use IP—for example, Novell created
an internetworking protocol called IPX, which was in turn based on the XNS internet
designed by Xerox—IP is the most interesting case to study simply because of the
size of the Internet. Said another way, it is only the IP Internet that has really faced
the issue of scale. Thus it provides the best case study of a scalable internetworking
protocol.
Figure 4.2 A simple internetwork, showing the protocol layers used to connect H1 to
H8 in Figure 4.1. ETH is the protocol that runs over Ethernet.
Service Model
A good place to start when you build an internetwork is to define its service model,
that is, the host-to-host services you want to provide. The main concern in defining a
service model for an internetwork is that we can provide a host-to-host service only if
this service can somehow be provided over each of the underlying physical networks.
For example, it would be no good deciding that our internetwork service model was
going to provide guaranteed delivery of every packet in 1 ms or less if there were
underlying network technologies that could arbitrarily delay packets. The philosophy
used in defining the IP service model, therefore, was to make it undemanding enough
that just about any network technology that might turn up in an internetwork would
be able to provide the necessary service.
The IP service model can be thought of as having two parts: an addressing scheme,
which provides a way to identify all hosts in the internetwork, and a datagram (connectionless) model of data delivery. This service model is sometimes called best effort
because, although IP makes every effort to deliver datagrams, it makes no guarantees.
We postpone a discussion of the addressing scheme for now and look first at the data
delivery model.
Datagram Delivery
The IP datagram is fundamental to the Internet Protocol. Recall from Section 3.1.1 that
a datagram is a type of packet that happens to be sent in a connectionless manner over
a network. Every datagram carries enough information to let the network forward the
packet to its correct destination; there is no need for any advance setup mechanism to
tell the network what to do when the packet arrives. You just send it, and the network
makes its best effort to get it to the desired destination. The “best-effort” part means
that if something goes wrong and the packet gets lost, corrupted, misdelivered, or in
any way fails to reach its intended destination, the network does nothing—it made its
best effort, and that is all it has to do. It does not make any attempt to recover from
the failure. This is sometimes called an unreliable service.
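The following Python sketch illustrates the two properties just described: each datagram carries its own destination, so a node needs no per-connection state, and a datagram that cannot be forwarded is simply dropped. The function name forward_datagram, the dictionary-based table, and the link names are hypothetical choices made for this illustration; they are not part of IP.

```python
def forward_datagram(forwarding_table, datagram, link_up):
    """Best-effort forwarding of one datagram at one node.

    forwarding_table maps a destination address to an outgoing link name.
    link_up maps a link name to True or False.
    Returns the chosen link, or None when the datagram is dropped.
    No state is kept between calls and no attempt is made to recover.
    """
    link = forwarding_table.get(datagram["dst"])
    if link is None or not link_up.get(link, False):
        return None   # best effort: drop the datagram and do nothing further
    return link


table = {"H8": "link-to-R1"}            # hypothetical table at one node
packet = {"dst": "H8", "data": b"hello"}
print(forward_datagram(table, packet, {"link-to-R1": True}))
print(forward_datagram(table, packet, {"link-to-R1": False}))
```

Note that the function consults only the datagram itself and the local table; nothing had to be set up in advance for this particular sender and receiver.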
Best-effort, connectionless service is about the simplest service you could ask for
from an internetwork, and this is a great strength. For example, if you provide best-effort service over a network that provides a reliable service, then that’s fine—you end
up with a best-effort service that just happens to always deliver the packets. If, on the
other hand, you had a reliable service model over an unreliable network, you would
have to put lots of extra functionality into the routers to make up for the deficiencies
of the underlying network. Keeping the routers as simple as possible was one of the
original design goals of IP.
The ability of IP to “run over anything” is frequently cited as one of its most
important characteristics. It is noteworthy that many of the technologies over which
IP runs today did not exist when IP was invented. So far, no networking technology
has been invented that has proven too bizarre for IP; it has even been claimed that IP
can run over a network that transports messages using carrier pigeons.
Best-effort delivery does not just mean that packets can get lost. Sometimes they
can get delivered out of order, and sometimes the same packet can get delivered more
than once. The higher-level protocols or applications that run above IP need to be
aware of all these possible failure modes.
Packet Format
Clearly, a key part of the IP service model is the type of packets that can be carried.
The IP datagram, like most packets, consists of a header followed by a number of bytes
of data. The format of the header is shown in Figure 4.3. Note that we have adopted
a different style of representing packets than the one we used in previous chapters.
This is because packet formats at the internetworking layer and above, where we will
be focusing our attention for the next few chapters, are almost invariably designed to
align on 32-bit boundaries to simplify the task of processing them in software. Thus,
the common way of representing them (used in Internet Requests for Comments, for
example) is to draw them as a succession of 32-bit words. The top word is the one
transmitted first, and the leftmost byte of each word is the one transmitted first. In
this representation, you can easily recognize fields that are a multiple of 8 bits long.
On the odd occasion when fields are not an even multiple of 8 bits, you can determine
the field lengths by looking at the bit positions marked at the top of the packet.
Looking at each field in the IP header, we see that the “simple” model of best-effort datagram delivery still has some subtle features. The Version field specifies the
4 Internetworking
Figure 4.3 IPv4 packet header.
version of IP. The current version of IP is 4, and it is sometimes called IPv4.¹ Observe
that putting this field right at the start of the datagram makes it easy for everything
else in the packet format to be redefined in subsequent versions; the header processing
software starts off by looking at the version and then branches off to process the rest
of the packet according to the appropriate format. The next field, HLen, specifies the
length of the header in 32-bit words. When there are no options, which is most of the
time, the header is 5 words (20 bytes) long. The 8-bit TOS (type of service) field has
had a number of different definitions over the years, but its basic function is to allow
packets to be treated differently based on application needs. For example, the TOS
value might determine whether or not a packet should be placed in a special queue
that receives low delay. We discuss the use of this field (and a new name for it) in more
detail in Section 6.5.3.
The next 16 bits of the header contain the Length of the datagram, including the
header. Unlike the HLen field, the Length field counts bytes rather than words. Thus,
the maximum size of an IP datagram is 65,535 bytes. The physical network over
which IP is running, however, may not support such long packets. For this reason,
IP supports a fragmentation and reassembly process. The second word of the header
contains information about fragmentation, and the details of its use are presented
under “Fragmentation and Reassembly” below.
¹ The next major version of IP, which is discussed later in this chapter, has a new version number 6 and is known
as IPv6. The version number 5 was used for an experimental protocol called ST-II that was not widely used.
Moving on to the third word of the header, the next byte is the TTL (time to live)
field. Its name reflects its historical meaning rather than the way it is commonly used
today. The intent of the field is to catch packets that have been going around in routing
loops and discard them, rather than let them consume resources indefinitely. Originally,
TTL was set to a specific number of seconds that the packet would be allowed to live,
and routers along the path would decrement this field until it reached 0. However,
since it was rare for a packet to sit for as long as 1 second in a router, and routers did
not all have access to a common clock, most routers just decremented the TTL by 1 as
they forwarded the packet. Thus, it became more of a hop count than a timer, which is
still a perfectly good way to catch packets that are stuck in routing loops. One subtlety
is in the initial setting of this field by the sending host: Set it too high and packets could
circulate rather a lot before getting dropped; set it too low and they may not reach
their destination. The value 64 is the current default.
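The hop-count behavior described above can be sketched in a few lines of C. This is only an illustration: the function name ttl_decrement is our own, and a real router would also have to update the header checksum after changing the TTL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hop-count style TTL handling: decrement by 1 at each router.
 * Returns true if the packet may still be forwarded, false if the
 * TTL has run out and the packet must be discarded.
 * (ttl_decrement is a hypothetical name, not part of any real API.) */
bool ttl_decrement(uint8_t *ttl)
{
    if (*ttl <= 1) {
        *ttl = 0;          /* TTL reached 0: drop the packet */
        return false;
    }
    (*ttl)--;
    return true;
}
```

With an initial TTL of 64, a packet caught in a routing loop is forwarded 63 times and discarded by the 64th router that handles it.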
The Protocol field is simply a demultiplexing key that identifies the higher-level
protocol to which this IP packet should be passed. There are values defined for
TCP (6), UDP (17), and many other protocols that may sit above IP in the protocol stack.
The Checksum is calculated by considering the entire IP header as a sequence of
16-bit words, adding them up using ones complement arithmetic, and taking the ones
complement of the result. This is the IP checksum algorithm described in Section 2.4.
Thus, if any bit in the header is corrupted in transit, the checksum will not contain
the correct value upon receipt of the packet. Since a corrupted header may contain
an error in the destination address—and, as a result, may have been misdelivered—it
makes sense to discard any packet that fails the checksum. It should be noted that this
type of checksum does not have the same strong error detection properties as a CRC,
but it is much easier to calculate in software.
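As a minimal sketch of the calculation just described, the following C function sums a header as 16-bit words in ones complement arithmetic and returns the complement of the result. The function name ip_checksum and the byte-array interface are our own choices for illustration, and the sketch assumes the header length is an even number of bytes (which HLen × 4 always is).

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum over a header held as raw bytes in network order.
 * len is the header length in bytes (assumed even). */
uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    /* Add the header up as a sequence of 16-bit big-endian words. */
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];

    /* Ones complement addition: fold any carries back into 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    /* The checksum is the ones complement of the folded sum. */
    return (uint16_t)~sum;
}
```

A sender would call this with the Checksum field set to 0 and store the result in that field. A receiver can run the same function over the header as received, checksum included; a result of 0 means the header passed the check.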
The last two required fields in the header are the SourceAddr and the
DestinationAddr for the packet. The latter is the key to datagram delivery: Every packet
contains a full address for its intended destination so that forwarding decisions can be
made at each router. The source address is required to allow recipients to decide if they
want to accept the packet and to enable them to reply. IP addresses are discussed in
Section 4.1.3—for now, the important thing to know is that IP defines its own global
address space, independent of whatever physical networks it runs over. As we will see,
this is one of the keys to supporting heterogeneity.
Finally, there may be a number of options at the end of the header. The presence
or absence of options may be determined by examining the header length (HLen)
field. While options are used fairly rarely, a complete IP implementation must handle
them all.
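To tie the field descriptions together, here is a hedged sketch of how software might pull the fixed fields out of a received header. The structure Ipv4Header and the function ipv4_parse are hypothetical names of our own; the bit positions (4-bit Version and HLen, 3 flag bits followed by a 13-bit Offset) follow the standard IPv4 layout of Figure 4.3.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of the fixed part of the IPv4 header. */
typedef struct {
    uint8_t  version;   /* Version: 4 for IPv4 */
    uint8_t  hlen;      /* HLen: header length in 32-bit words */
    uint8_t  tos;       /* TOS */
    uint16_t length;    /* Length: whole datagram, in bytes */
    uint16_t ident;     /* Ident */
    uint8_t  flags;     /* Flags (3 bits); 0x1 is the M bit */
    uint16_t offset;    /* Offset, in 8-byte units */
    uint8_t  ttl;       /* TTL */
    uint8_t  protocol;  /* Protocol: 6 = TCP, 17 = UDP */
    uint16_t checksum;  /* Checksum */
    uint32_t src;       /* SourceAddr */
    uint32_t dst;       /* DestinationAddr */
} Ipv4Header;

/* Read big-endian (network order) values from a byte buffer. */
static uint16_t get16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t get32(const uint8_t *p)
{
    return ((uint32_t)get16(p) << 16) | get16(p + 2);
}

/* Returns 0 on success, -1 if buf cannot hold a valid IPv4 header. */
int ipv4_parse(const uint8_t *buf, size_t n, Ipv4Header *h)
{
    if (n < 20)
        return -1;
    h->version = buf[0] >> 4;        /* look at the version first */
    h->hlen    = buf[0] & 0x0F;
    if (h->version != 4 || h->hlen < 5 || (size_t)h->hlen * 4 > n)
        return -1;
    h->tos      = buf[1];
    h->length   = get16(buf + 2);
    h->ident    = get16(buf + 4);
    h->flags    = buf[6] >> 5;
    h->offset   = get16(buf + 6) & 0x1FFF;
    h->ttl      = buf[8];
    h->protocol = buf[9];
    h->checksum = get16(buf + 10);
    h->src      = get32(buf + 12);
    h->dst      = get32(buf + 16);
    return 0;   /* any options occupy bytes 20 .. hlen*4 - 1 */
}
```

Note that an HLen greater than 5 signals the presence of options, which this sketch leaves unparsed.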
Fragmentation and Reassembly
One of the problems of providing a uniform host-to-host service model over a heterogeneous collection of networks is that each network technology tends to have its
own idea of how large a packet can be. For example, an Ethernet can accept packets
up to 1500 bytes long, while FDDI packets may be 4500 bytes long. This leaves two
choices for the IP service model: make sure that all IP datagrams are small enough to
fit inside one packet on any network technology, or provide a means by which packets
can be fragmented and reassembled when they are too big to go over a given network
technology. The latter turns out to be a good choice, especially when you consider the
fact that new network technologies are always turning up, and IP needs to run over all
of them; this would make it hard to pick a suitably small bound on datagram size. This
also means that a host will not send needlessly small packets, which wastes bandwidth
and consumes processing resources by requiring more headers per byte of data sent.
For example, two hosts connected to FDDI networks that are interconnected by a
point-to-point link would not need to send packets small enough to fit on an Ethernet.
The central idea here is that every network type has a maximum transmission
unit (MTU), which is the largest IP datagram that it can carry in a frame. Note that this
value is smaller than the largest packet size on that network because the IP datagram
needs to fit in the payload of the link-layer frame. Also, note that in ATM networks,
the “frame” is the CS-PDU, not the ATM cell; the fact that CS-PDUs get segmented
into cells is not visible to IP.
When a host sends an IP datagram, therefore, it can choose any size that it
wants. A reasonable choice is the MTU of the network to which the host is directly
attached. Then fragmentation will only be necessary if the path to the destination
includes a network with a smaller MTU. Should the transport protocol that sits on
top of IP give IP a packet larger than the local MTU, however, then the source host
must fragment it.
Fragmentation typically occurs in a router when it receives a datagram that it
wants to forward over a network that has an MTU that is smaller than the received
datagram. To enable these fragments to be reassembled at the receiving host, they
all carry the same identifier in the Ident field. This identifier is chosen by the sending
host and is intended to be unique among all the datagrams that might arrive at the
destination from this source over some reasonable time period. Since all fragments
of the original datagram contain this identifier, the reassembling host will be able to
recognize those fragments that go together. Should all the fragments not arrive at the
receiving host, the host gives up on the reassembly process and discards the fragments
that did arrive. IP does not attempt to recover from missing fragments.
To see what this all means, consider what happens when host H1 sends a datagram to host H8 in the example internet shown in Figure 4.1. Assuming that the MTU
Figure 4.4 IP datagrams traversing the sequence of physical networks graphed in
Figure 4.1.
    H1 to R1: ETH | IP (1400)
    R1 to R2: FDDI | IP (1400)
    R2 to R3: PPP | IP (512), PPP | IP (512), PPP | IP (376)
    R3 to H8: ETH | IP (512), ETH | IP (512), ETH | IP (376)
is 1500 bytes for the two Ethernets, 4500 bytes for the FDDI network, and 532 bytes
for the point-to-point network, then a 1420-byte datagram (20-byte IP header plus
1400 bytes of data) sent from H1 makes it across the first Ethernet and the FDDI network without fragmentation but must be fragmented into three datagrams at router
R2. These three fragments are then forwarded by router R3 across the second Ethernet
to the destination host. This situation is illustrated in Figure 4.4. This figure also serves
to reinforce two important points:
1 Each fragment is itself a self-contained IP datagram that is transmitted over a
sequence of physical networks, independent of the other fragments.
2 Each IP datagram is reencapsulated for each physical network over which it
The fragmentation process can be understood in detail by looking at the header
fields of each datagram, as is done in Figure 4.5. The unfragmented packet, shown at
the top, has 1400 bytes of data and a 20-byte IP header. When the packet arrives at
router R2, which must forward it over the point-to-point link with its 532-byte MTU, it has to be fragmented. A 532-byte MTU
leaves 512 bytes for data after the 20-byte IP header, so the first fragment contains
512 bytes of data. The router sets the M bit in the Flags field (see Figure 4.3), meaning
that there are more fragments to follow, and it sets the Offset to 0, since this fragment
contains the first part of the original datagram. The data carried in the second fragment
starts with the 513th byte of the original data, so the Offset field in this header is set
to 64, which is 512 ÷ 8. Why the division by 8? Because the designers of IP decided
that fragmentation should always happen on 8-byte boundaries, which means that
the Offset field counts 8-byte chunks, not bytes. (We leave it as an exercise for you to
figure out why this design decision was made.) The third fragment contains the last
Figure 4.5 Header fields used in IP fragmentation. (a) Unfragmented packet; (b) fragmented packets.
    Packet           Ident   M   Offset   Data bytes
    (a) original     x       0   0        1400
    (b) fragment 1   x       1   0        512
    (b) fragment 2   x       1   64       512
    (b) fragment 3   x       0   128      376
376 bytes of data, and the offset is now 2 × 512 ÷ 8 = 128. Since this is the last
fragment, the M bit is not set.
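The arithmetic in this example can be checked with a short C sketch. The names Fragment and fragment_plan are our own, and the sketch makes two simplifying assumptions: the header is always 20 bytes (no options), and the datagram being split is itself unfragmented (Offset 0, M bit clear).

```c
#include <stdint.h>

typedef struct {
    uint16_t offset;    /* value of the Offset field (8-byte units) */
    uint16_t data_len;  /* bytes of data carried by this fragment */
    int      more;      /* M bit: 1 if more fragments follow */
} Fragment;

/* Plan the fragments needed to carry data_len bytes of data across a
 * network with the given MTU. Returns the number of fragments written
 * to out, or -1 if the MTU is too small or out is too short. */
int fragment_plan(uint16_t data_len, uint16_t mtu,
                  Fragment *out, int max_frags)
{
    const uint16_t hdr_len = 20;
    if (mtu < hdr_len + 8)
        return -1;

    /* Data per fragment: what fits after the header, rounded down
     * to a multiple of 8 so every Offset is a whole number. */
    uint16_t chunk = (uint16_t)((mtu - hdr_len) & ~7u);
    uint16_t pos = 0;
    int n = 0;

    while (pos < data_len) {
        if (n == max_frags)
            return -1;
        uint16_t remaining = (uint16_t)(data_len - pos);
        uint16_t take = remaining > chunk ? chunk : remaining;
        out[n].offset   = pos / 8;
        out[n].data_len = take;
        out[n].more     = (pos + take < data_len);
        pos = (uint16_t)(pos + take);
        n++;
    }
    return n;
}
```

Calling fragment_plan(1400, 532, frags, 8) yields three fragments with Offset values 0, 64, and 128, carrying 512, 512, and 376 bytes, with the M bit set on all but the last, which matches Figure 4.5.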
Observe that the fragmentation process is done in such a way that it could
be repeated if a fragment arrived at another network with an even smaller MTU.
Fragmentation produces smaller, valid IP datagrams that can be readily reassembled
into the original datagram upon receipt, independent of the order of their arrival.
Reassembly is done at the receiving host and not at each router.
We conclude this discussion of IP fragmentation and reassembly by giving a fragment
of code that performs reassembly. One reason we give this particular piece of code is
that it is representative of a large proportion of networking software—it does little
more than tedious and unglamorous bookkeeping.
First, we define the key data structure (FragList) that is used to hold the individual
fragments that arrive at the destination. Incoming fragments are saved in this data
structure until all the fragments in the original datagram have arrived, at which time
they are reassembled into a complete datagram and passed up to some higher-level
protocol. Note that each element in FragList contains either a fragment or a hole.
    #define FRAGOFFMASK     0x1fff   /* low 13 bits hold the offset */
    #define FRAGOFFSET(fragflag)  ((fragflag) & FRAGOFFMASK)
    #define INFINITE_OFFSET 0xffff

    /* structure to hold the fields that uniquely identify fragments
       of the same IP datagram */
    typedef struct fid {
        IpHost  source;
        IpHost  dest;
        u_char  prot;
        u_char  pad;
        u_short ident;
    } FragId;

    /* a hole covers the byte range [first, last) not yet received */
    typedef struct hole {
        u_int first;
        u_int last;
    } Hole;

    #define HOLE 1
    #define FRAG 2

    /* structure to hold a fragment or a hole */
    typedef struct fragif {
        u_char type;
        union {
            Hole hole;
            Msg  frag;
        } u;
        struct fragif *next, *prev;
    } FragInfo;

    /* structure to hold all the fragments and holes for a
       single IP datagram being reassembled */
    typedef struct FragList {
        u_short  nholes;
        FragInfo head;     /* dummy header node */
        Binding  binding;
        bool     gcMark;   /* garbage collection flag */
    } FragList;
The reassembly routine, ipReassemble, takes an incoming datagram (dg) and
the IP header for that datagram (hdr) as arguments. The third argument, fragMap,
is a Map structure (which supports mapBind, mapRemove, and mapResolve operations) used to efficiently map the incoming datagram into the appropriate FragList.
(Recall that the group of fragments that are being reassembled together are uniquely
identified by several fields in the IP header, as defined by structure FragId given above.)
The actual work done in ipReassemble is straightforward; as stated above, it
is mostly bookkeeping. First, the routine extracts the fields from the IP header that
uniquely identify the datagram to be reassembled, constructs a key from these fields,
and looks this key up in fragMap to find the appropriate FragList. If this is the first
fragment for the datagram, a new FragList must be created and initialized. Next, the
routine inserts the new fragment into this FragList. This involves comparing the sum
of the offset and length of this fragment with the offset of the next fragment in the
list. Some of this work is done in subroutine hole_create, which is given below. Finally,
ipReassemble checks to see if all the holes are filled. If all the fragments are present,
it calls the routine msgReassemble to actually reassemble the fragments into a whole
datagram and then calls deliver to pass this datagram up the protocol graph to some
high-level protocol identified as HLP.
    int
    ipReassemble(Msg *dg, IpHdr *hdr, Map fragMap)
    {
        FragId   fragid;
        FragList *list;
        FragInfo *fi, *prev;
        Hole     *hole;
        u_short  offset, len;

        /* extract fragmentation info from header
           (offset and fragment length) */
        offset = FRAGOFFSET(hdr->frag)*8;
        len = hdr->dlen - GET_HLEN(hdr) * 4;

        /* Create the unique id for this fragment */
        bzero((char *)&fragid, sizeof(FragId));
        fragid.source = hdr->source;
        fragid.dest = hdr->dest;
        fragid.prot = hdr->prot;
        fragid.ident = hdr->ident;

        /* find reassembly list for this frag; create one if none exists */
        if (mapResolve( fragMap, &fragid, (void **)&list) == FALSE)
        {
            /* first fragment of datagram - need new FragList */
            list = NEW(FragList);
            /* insert it into the Map structure */
            list->binding = mapBind( fragMap, &fragid, list );
            /* initialize list with a single hole spanning the
               whole datagram */
            list->nholes = 1;
            list->head.next = fi = NEW(FragInfo);
            fi->next = 0;
            fi->type = HOLE;
            fi->u.hole.first = 0;
            fi->u.hole.last = INFINITE_OFFSET;
        }

        /* mark the current FragList as ineligible for garbage
           collection */
        list->gcMark = FALSE;

        /* walk through the FragList to find the right hole for
           this frag */
        prev = &list->head;
        for ( fi = prev->next; fi != 0; prev = fi, fi = fi->next )
        {
            if ( fi->type == FRAG )
            {
                continue;   /* only holes can accept new data */
            }
            hole = &fi->u.hole;
            if ( (offset < hole->last) && ((offset + len) >
                hole->first) )
            {
                /* check to see if frag overlaps previously
                   received frags */
                if ( offset < hole->first )
                {
                    /* truncate message from left */
                    msgStripHdr(dg, hole->first - offset);
                    /* keep len in step with the stripped data */
                    len -= hole->first - offset;
                    offset = hole->first;
                }
                if ( (offset + len) > hole->last )
                {
                    /* truncate message from right */
                    msgTruncate(dg, hole->last - offset);
                    len = hole->last - offset;
                }
                /* now check to see if new hole(s) need to be made */
                if (((offset + len) < hole->last) &&
                    (hdr->frag & MOREFRAGMENTS))
                {
                    /* creating new hole above */
                    hole_create(prev, fi, (offset+len), hole->last);
                    list->nholes++;
                }
                if ( offset > hole->first )
                {
                    /* creating new hole below */
                    hole_create(fi, fi->next, hole->first, (offset));
                    list->nholes++;
                }
                /* change this FragInfo structure to be FRAG */
                list->nholes--;
                fi->type = FRAG;
                msgSaveCopy(&fi->u.frag, dg);
                break;
            } /* if found a hole */
        } /* for loop */

        /* check to see if we're done, and if so, pass datagram up */
        if ( list->nholes == 0 )
        {
            Msg fullMsg;

            /* now have a full datagram */
            for( fi = list->head.next; fi != 0; fi = fi->next )
            {
                msgReassemble(&fullMsg, &fi->u.frag, &fullMsg);
            }
            /* get rid of FragList and its Map entry */
            mapRemove(fragMap, list->binding);
            deliver(HLP, &fullMsg);
        }
        return SUCCESS;
    }
Subroutine hole_create creates a new hole in the fragment list that begins at offset
first and continues to offset last. It makes use of the utility NEW, which creates an
instance of the given structure.
    static int
    hole_create(FragInfo *prev, FragInfo *next, u_int first, u_int last)
    {
        FragInfo *fi;

        /* creating new hole from first to last */
        fi = NEW(FragInfo);
        fi->type = HOLE;
        fi->u.hole.first = first;
        fi->u.hole.last = last;
        /* link the new hole in between prev and next */
        fi->next = next;
        prev->next = fi;
        return 0;
    }
Finally, note that these routines do not capture the entire picture of reassembly.
What is not shown is a background process that periodically checks to see if there
has been any recent activity on this datagram (it looks at field gcMark), and if not, it
deletes the corresponding FragList. IP does not attempt to recover from the situation
in which one or more of the fragments does not arrive; it simply gives up and reclaims
the memory that was being used for reassembly.
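The background process is not shown in the text, but one plausible form of it is a periodic mark-and-sweep pass, sketched below. Everything here is an assumption made for illustration: ReasmEntry is a simplified stand-in for FragList that keeps only the gcMark flag and a link, and the rule that the sweep sets the flag while fragment arrival clears it is our reading of how gcMark is used in ipReassemble.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for FragList: just the fields
 * that the sweep needs. */
typedef struct ReasmEntry {
    bool gcMark;               /* cleared whenever a fragment arrives */
    struct ReasmEntry *next;
} ReasmEntry;

/* One pass of the periodic sweep over all partial datagrams.
 * An entry still marked from the previous pass has seen no fragment
 * since then, so it is reclaimed; every other entry is marked so
 * that the next pass can make the same test.
 * Returns the number of entries reclaimed. */
int reasm_sweep(ReasmEntry **head)
{
    int reclaimed = 0;
    ReasmEntry **link = head;

    while (*link != NULL) {
        ReasmEntry *e = *link;
        if (e->gcMark) {
            *link = e->next;   /* unlink and give up on this datagram */
            free(e);
            reclaimed++;
        } else {
            e->gcMark = true;  /* eligible for collection next time */
            link = &e->next;
        }
    }
    return reclaimed;
}
```

Under this scheme a partially reassembled datagram survives between one and two sweep intervals after its last fragment arrives, so the interval sets the reassembly timeout.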
One thing to notice from this code is that IP reassembly is far from a simple
process. Note, for example, that if a single fragment is lost, the receiver will still
attempt to reassemble the datagram, and it will eventually give up and have to garbage-collect the resources that were used to perform the failed reassembly. For this reason,
among others, IP fragmentation is generally considered a good thing to avoid. Hosts
are now strongly encouraged to perform “path MTU discovery,” a process by which
fragmentation is avoided by sending packets that are small enough to traverse the link
with the smallest MTU in the path from sender to receiver.
Global Addresses
In the above discussion of the IP service model, we mentioned that one of the things
that it provides is an addressing scheme. After all, if you want to be able to send data to
any host on any network, there needs to be a way of identifying all the hosts. Thus, we
need a global addressing scheme—one in which no two hosts have the same address.
Global uniqueness is the first property that should be provided in an addressing scheme.
Ethernet addresses are globally unique, but that alone does not suffice for an
addressing scheme in a large internetwork. Ethernet addresses are also flat, which
means that they have no structure and provide very few clues to routing protocols.²
In contrast, IP addresses are hierarchical, by which we mean that they are made up of
several parts that correspond to some sort of hierarchy in the internetwork. Specifically,
IP addresses consist of two parts, a network part and a host part. This is a fairly logical
structure for an internetwork, which is made up of many interconnected networks.
The network part of an IP address identifies the network to which the host is attached;
all hosts attached to the same network have the same network part in their IP address.
The host part then identifies each host uniquely on that particular network. Thus, in
the simple internetwork of Figure 4.1, the addresses of the hosts on network 1, for
example, would all have the same network part and different host parts.
Note that the routers in Figure 4.1 are attached to two networks. They need
to have an address on each network, one for each interface. For example, router R1,
which sits between network 2 and network 3, has an IP address on the interface to
network 2 that has the same network part as the hosts on network 2, and it has an
IP address on the interface to network 3 that has the same network part as the hosts
on network 3. Thus, bearing in mind that a router might be implemented as a host
with two network interfaces, it is more precise to think of IP addresses as belonging
to interfaces than to hosts.
Now, what do these hierarchical addresses look like? Unlike some other forms
of hierarchical address, the sizes of the two parts are not the same for all addresses.
Instead, IP addresses are divided into three different classes, as shown in Figure 4.6,
each of which defines different-sized network and host parts. (There are also class D
addresses that specify a multicast group, discussed in Section 4.4, and class E addresses
that are currently unused.) In all cases, the address is 32 bits long.
² In fact, as we noted, Ethernet addresses do have a structure for the purposes of assignment (the first 24 bits
identify the manufacturer), but this provides no useful information to routing protocols since this structure has
nothing to do with network topology.
Figure 4.6 IP addresses: (a) class A; (b) class B; (c) class C.
    (a) 0     | Network (7 bits)  | Host (24 bits)
    (b) 1 0   | Network (14 bits) | Host (16 bits)
    (c) 1 1 0 | Network (21 bits) | Host (8 bits)
The class of an IP address is identified in the most significant few bits. If the
first bit is 0, it is a class A address. If the first bit is 1 and the second is 0, it is a
class B address. If the first two bits are 1 and the third is 0, it is a class C address.
Thus, of the approximately 4 billion possible IP addresses, half are class A, one-quarter
are class B, and one-eighth are class C. Each class allocates a certain number
of bits for the network part of the address and the rest for the host part. Class A
networks have 7 bits for the network part and 24 bits for the host part, meaning
that there can be only 126 class A networks (the values 0 and 127 are reserved),
but each of them can accommodate up to 2^24 − 2 (about 16 million) hosts (again,
there are two reserved values). Class B addresses allocate 14 bits for the network and
16 bits for the host, meaning that each class B network has room for 65,534 hosts.
Finally, class C addresses have only 8 bits for the host and 21 for the network part.
Therefore, a class C network can have only 256 unique host identifiers, which means
only 254 attached hosts (one host identifier, 255, is reserved for broadcast, and 0
is not a valid host number). However, the addressing scheme supports 2^21 class C
networks.
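The class rules above translate directly into a few bit tests. The following C sketch is illustrative only: the names IpClass, ip_class, network_number, and host_part are our own, addresses are held as 32-bit integers in host byte order, and the leading bit pattern 1110 for class D is taken from the standard IPv4 definition rather than from the discussion above.

```c
#include <stdint.h>

typedef enum { CLASS_A, CLASS_B, CLASS_C, CLASS_D, CLASS_E } IpClass;

/* Identify the class from the most significant bits. */
IpClass ip_class(uint32_t addr)
{
    if ((addr & 0x80000000u) == 0)           return CLASS_A; /* 0...  */
    if ((addr & 0xC0000000u) == 0x80000000u) return CLASS_B; /* 10..  */
    if ((addr & 0xE0000000u) == 0xC0000000u) return CLASS_C; /* 110.  */
    if ((addr & 0xF0000000u) == 0xE0000000u) return CLASS_D; /* 1110  */
    return CLASS_E;
}

/* Mask covering the class bits plus the network part.
 * Classes D and E have no network/host split, so the whole
 * address is kept. */
static uint32_t network_mask(uint32_t addr)
{
    switch (ip_class(addr)) {
    case CLASS_A: return 0xFF000000u;   /* 24 host bits */
    case CLASS_B: return 0xFFFF0000u;   /* 16 host bits */
    case CLASS_C: return 0xFFFFFF00u;   /*  8 host bits */
    default:      return 0xFFFFFFFFu;
    }
}

/* The address with its host part set to zero. */
uint32_t network_number(uint32_t addr)
{
    return addr & network_mask(addr);
}

/* The host part alone. */
uint32_t host_part(uint32_t addr)
{
    return addr & ~network_mask(addr);
}
```

For example, the address with bytes 128, 96, 0, 1 (0x80600001) begins with the bits 10, so it is class B; its network number is 0x80600000 and its host part is 1.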
On the face of it, this addressing scheme has a lot of flexibility, allowing networks
of vastly different sizes to be accommodated fairly efficiently. The original idea was
that the Internet would consist of a small number of wide area networks (these would
be class A networks), a modest number of site- (campus-) sized networks (these would
be class B networks), and a large number of LANs (these would be class C networks).
However, as we shall see in Section 4.3, additional flexibility has been needed, and
some innovative ways to provide it are now in use. Because one of these techniques
actually removes the distinction between address classes, the addressing scheme just
described is now known as “classful” addressing to distinguish it from the newer
“classless” approach.
Before we look at how IP addresses get used, it is helpful to look at some practical
matters, such as how you write them down. By convention, IP addresses are written
as four decimal integers separated by dots. Each integer represents the decimal value
contained in 1 byte of the address, starting at the most significant. For example, the
address of the computer on which this sentence was typed is
It is important not to confuse IP addresses with Internet domain names, which
are also hierarchical. Domain names tend to be ASCII strings separated by dots, such
as We will be talking about those in Section 9.1. The important thing
about IP addresses is that they are what is carried in the headers of IP packets, and it
is those addresses that are used in IP routers to make forwarding decisions.
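The dotted notation is easy to produce from the 32-bit value that is actually carried in the packet header. The C sketch below shows one way to do so; the function name ip_to_dotted is our own, and the address is assumed to be held as a 32-bit integer in host byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write addr in dotted decimal notation into buf, most significant
 * byte first. A buffer of 16 bytes always suffices
 * ("255.255.255.255" plus the terminating NUL). Returns buf. */
char *ip_to_dotted(uint32_t addr, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%u.%u.%u.%u",
             (unsigned)((addr >> 24) & 0xFF),
             (unsigned)((addr >> 16) & 0xFF),
             (unsigned)((addr >> 8) & 0xFF),
             (unsigned)(addr & 0xFF));
    return buf;
}
```

Given the value 0x80600001, this produces the string 128.96.0.1.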
Datagram Forwarding in IP
We are now ready to look at the basic mechanism by which IP routers forward datagrams in an internetwork. Recall from Chapter 3 that forwarding is the process of
taking a packet from an input and sending it out on the appropriate output, while
routing is the process of building up the tables that allow the correct output for a
packet to be determined. The discussion here focuses on forwarding; we take up routing in Section 4.2.
The main points to bear in mind as we discuss the forwarding of IP datagrams
are the following:
■ Every IP datagram contains the IP address of the destination host.
■ The “network part” of an IP address uniquely identifies a single physical
network that is part of the larger Internet.
■ All hosts and routers that share the same network part of their address are
connected to the same physical network and can thus communicate with each
other by sending frames over that network.
■ Every physical network that is part of the Internet has at least one router that,
by definition, is also connected to at least one other physical network; this
router can exchange packets with hosts or routers on either network.
Forwarding IP datagrams can therefore be handled in the following way. A datagram is sent from a source host to a destination host, possibly passing through several routers along the way. Any node, whether it is a host or a router, first tries to
establish whether it is connected to the same physical network as the destination.
To do this, it compares the network part of the destination address with the network part of the address of each of its network interfaces. (Hosts normally have
only one interface, while routers normally have two or more, since they are typically
connected to two or more networks.) If a match occurs, then that means that the
destination lies on the same physical network as the interface, and the packet can be
directly delivered over that network. Section 4.1.5 explains some of the details of this
process.
If the node is not connected to the same physical network as the destination node,
then it needs to send the datagram to a router. In general, each node will have a choice of
several routers, and so it needs to pick the best one, or at least one that has a reasonable
chance of getting the datagram closer to its destination. The router that it chooses is
known as the next hop router. The router finds the correct next hop by consulting its
forwarding table. The forwarding table is conceptually just a list of (NetworkNum,
NextHop) pairs. (As we will see below, forwarding tables in practice often contain
some additional information related to the next hop.) Normally, there is also a default
router that is used if none of the entries in the table match the destination’s network
number. For a host, it may be quite acceptable to have a default router and nothing
else—this means that all datagrams destined for hosts not on the physical network to
which the sending host is attached will be sent out through the default router.
We can describe the datagram forwarding algorithm in the following way:
    if (NetworkNum of destination = NetworkNum of one of my interfaces) then
        deliver packet to destination over that interface
    else
        if (NetworkNum of destination is in my forwarding table) then
            deliver packet to NextHop router
        else
            deliver packet to default router
For a host with only one interface and only a default router in its forwarding
table, this simplifies to
    if (NetworkNum of destination = my NetworkNum) then
        deliver packet to destination directly
    else
        deliver packet to default router
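The forwarding algorithm can also be written as a short C function. This is a sketch under stated assumptions, not an excerpt from any real router: the types Route, Node, and Decision and the function forward are hypothetical, network numbers are small integers as in Figure 4.1, routers are identified by an integer id, and the forwarding table is searched linearly.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t network_num;   /* NetworkNum */
    int      next_hop;      /* NextHop, as a hypothetical router id */
} Route;

typedef struct {
    const uint32_t *iface_nets;     /* NetworkNum of each interface */
    size_t          n_ifaces;
    const Route    *routes;         /* the forwarding table */
    size_t          n_routes;
    int             default_router; /* used when nothing matches */
} Node;

typedef enum { DELIVER_DIRECT, SEND_TO_ROUTER } Action;

typedef struct {
    Action action;
    int    target;  /* interface index if direct, else router id */
} Decision;

Decision forward(const Node *node, uint32_t dest_net)
{
    Decision d;

    /* Is the destination on a network we are attached to? */
    for (size_t i = 0; i < node->n_ifaces; i++) {
        if (node->iface_nets[i] == dest_net) {
            d.action = DELIVER_DIRECT;
            d.target = (int)i;
            return d;
        }
    }
    /* Otherwise, is its network in the forwarding table? */
    for (size_t i = 0; i < node->n_routes; i++) {
        if (node->routes[i].network_num == dest_net) {
            d.action = SEND_TO_ROUTER;
            d.target = node->routes[i].next_hop;
            return d;
        }
    }
    /* No match: fall back on the default router. */
    d.action = SEND_TO_ROUTER;
    d.target = node->default_router;
    return d;
}
```

A host with one interface and an empty forwarding table reduces to the simpler two-way test shown above: deliver directly if the network numbers match, and otherwise hand the packet to the default router.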
Let’s see how this works in the example internetwork of Figure 4.1. First, suppose that H1 wants to send a datagram to H2. Since they are on the same physical
network, H1 and H2 have the same network number in their IP address. Thus, H1
deduces that it can deliver the datagram directly to H2 over the Ethernet. The one
issue that needs to be resolved is how H1 finds out the correct Ethernet address for
H2—this is the address resolution mechanism described in Section 4.1.5.
Now suppose H1 wants to send a datagram to H8. Since these hosts are on
different physical networks, they have different network numbers, so H1 deduces that
it needs to send the datagram to a router. R1 is the only choice (the default router),
so H1 sends the datagram over the Ethernet to R1. Similarly, R1 knows that it cannot
deliver a datagram directly to H8 because neither of R1’s interfaces is on the same
network as H8. Suppose R1’s default router is R2; R1 then sends the datagram to R2
over the token ring network. Assuming R2 has the forwarding table shown in Table 4.1,
it looks up H8’s network number (network 1) and forwards the datagram to R3. Finally,
R3, since it is on the same network as H8, forwards the datagram directly to H8.
Table 4.1 Example forwarding table for router R2 in Figure 4.1.
    NetworkNum   NextHop
    1            R3
    2            R1
Bridges, Switches, and Routers
It is easy to become confused about the distinction between bridges, switches, and
routers. There is good reason for such confusion, since at some level, they all forward
messages from one link to another. One distinction people make is based on layering:
Bridges are link-level nodes (they forward frames from one link to another to implement
an extended LAN), switches are network-level nodes (they forward packets from one link
to another to implement a packet-switched network), and routers are internet-level nodes
(they forward datagrams from one network to another to implement an internet). In some
sense, however, this is an artificial distinction. It is certainly the case that
networking companies do not ask the layering police for permission to sell new products
that do not fit neatly into one layer or another.
For example, we have already seen that a multiport bridge is usually called an Ethernet
switch or LAN switch. Thus the distinction between bridges and switches has now been
largely eroded. For this reason, bridges and switches are often grouped together as
“layer 2 devices,” where layer 2 in this context means “above the physical layer, below
the internet layer.”
There is, however, an important distinction between LAN switches (or bridges) and ATM
switches (and other switches that are used in WANs, such as Frame Relay and X.25
switches). LAN switches and bridges depend on the spanning tree algorithm, while WAN
switches generally run routing protocols that allow each switch to learn the topology
of the whole network. This is an important distinction because knowing the whole
network topology allows
Note that it is possible to include the information about directly connected networks
in the forwarding table. For example, we would label the network interfaces of router
R2 as interface 0 for the point-to-point link (network 4) and interface 1 for the token
ring (network 3). Then R2 would have the forwarding table shown in Table 4.2.
Table 4.2 Complete forwarding table for router R2 in Figure 4.1.
    NetworkNum   NextHop
    1            R3
    2            R1
    3            Interface 1
    4            Interface 0
Thus, for any network number that R2 encounters in a packet, it knows what to do.
Either that network is directly connected to R2, in which case the packet can be
delivered to its destination over that network, or the network is reachable via some
next hop router that R2 can reach over a network to which it is connected. In either
case, R2 will use ARP, described below, to find the MAC address of the node to which
the packet is to be sent next.
The forwarding table used by R2 is
simple enough that it could be manually
configured. Usually, however, these tables
are more complex and would be built up
by running a routing protocol such as one
of those described in Section 4.2. Also note
that, in practice, the network numbers are
usually longer (e.g., 128.96).
We can now see how hierarchical addressing—splitting the address into
network and host parts—has improved the
scalability of a large network. Routers now
contain forwarding tables that list only a
set of network numbers, rather than all
the nodes in the network. In our simple example, that meant that R2 could
store the information needed to reach all
the hosts in the network (of which there
were eight) in a four-entry table. Even if
there were 100 hosts on each physical network, R2 would still only need those same
four entries. This is a good first step (although by no means the last) in achieving
This illustrates one of the most important principles of building scalable networks: To achieve scalability, you need to
reduce the amount of information that is
stored in each node and that is exchanged
between nodes. The most common way to
do that is hierarchical aggregation. IP introduces a two-level hierarchy, with networks
at the top level and nodes at the bottom
level. We have aggregated information by
letting routers deal only with reaching the
right network; the information that a router
needs to deliver a datagram to any node on
a given network is represented by a single
aggregated piece of information.
Router Implementation
In Section 3.4 we saw a variety of ways
to build a switch, ranging from a generalpurpose workstation with a suitable number of network interfaces to some sophisticated hardware designs. In general, the same
range of options is available for building
routers, most of which look something like
Figure 4.7. The control processor is responsible for running the routing protocols (discussed in Section 4.2) and generally acts as
the central point of control of the router.
The switching fabric transfers packets from
one port to another, just as in a switch, and
the ports provide a range of functionality
to allow the router to interface to links of
various types (e.g., Ethernet, SONET, etc.).
A few points are worth noting about
router design and how it differs from switch
design. First, routers must be designed to
handle variable-length packets, a constraint
that does not apply to ATM switches but
4 Internetworking
the switches to discriminate among
different routes, while in contrast,
the spanning tree algorithm locks in
a single tree over which messages
are forwarded. It is also the case
that the spanning tree approach
does not scale as well.
What about switches and
routers? Are they fundamentally
the same thing, or are they different
in some important way? Here, the
distinction is much less clear. For
starters, since a single point-topoint link is itself a legitimate
network, a router can be used to
connect a set of such links. In such
a situation, a router looks just like
a switch. It just happens to be a
switch that forwards IP packets using a datagram forwarding model
and IP routing protocols. We’ll see
more of this similarity when we
consider router implementation at
the end of this section.
One big difference between
an ATM network built from
switches and the Internet built
from routers is that the Internet is
able to accommodate heterogeneity, whereas ATM consists of homogeneous links. This support for
heterogeneity is one of the key reasons why the Internet is so widely
4.1 Simple Internetworking (IP)
Figure 4.7
Block diagram of a router.
is certainly applicable to Ethernet or Frame Relay switches. It turns out that many
high-performance routers are designed using a switching fabric that is cell based. In
such cases the ports must be able to convert variable-length packets into cells and back
again. This is very much like the standard ATM segmentation and reassembly (SAR)
problem described in Section 3.3.2.
Another consequence of the variable length of IP datagrams is that it can be
harder to characterize the performance of a router than a switch that forwards only
cells. Routers can usually forward a certain number of packets per second, and this
implies that the total throughput in bits per second depends on packet size. Router
designers generally have to make a choice as to what packet length they will support
at line rate. That is, if pps (packets per second) is the rate at which packets arriving
on a particular port can be forwarded, and linerate is the physical speed of the port in
bits per second, then there will be some packetsize in bits such that
packetsize × pps = linerate
This is the packet size at which the router can forward at line rate; it is likely to be
able to sustain line rate for longer packets but not for shorter packets. Sometimes
a designer might decide that the right packet size to support is 40 bytes, since that
is the minimum size of an IP packet that has a TCP header attached. Another choice
might be the expected average packet size, which can be determined by studying traces
of network traffic. For example, measurements of the Internet backbone suggest that
the average IP packet is around 300 bytes long. However, such a router would fall
behind and perhaps start dropping packets when faced with a long sequence of short
packets, which is statistically likely from time to time and also very possible if the
router is subject to an active attack (see Chapter 8). Design decisions of this type
depend heavily on cost considerations and the intended application of the router.
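The relation packetsize × pps = linerate is easy to explore numerically. The sketch below is illustrative only: the 1-Gbps line rate is a hypothetical value chosen for round numbers, and the function names are assumptions of this example. The 40-byte and 300-byte packet sizes are the two design points mentioned above.

```python
def pps_needed(linerate_bps, packet_bytes):
    """Packets per second a port must forward to sustain line rate."""
    return linerate_bps / (packet_bytes * 8)


def throughput_bps(pps, packet_bytes):
    """Bits per second achieved when forwarding pps packets of a given size."""
    return pps * packet_bytes * 8


LINERATE = 1_000_000_000  # hypothetical 1-Gbps port

# Forwarding rate required for each of the two design choices.
print(f"{pps_needed(LINERATE, 40):,.0f} pps for 40-byte packets")
print(f"{pps_needed(LINERATE, 300):,.0f} pps for 300-byte packets")

# A router sized for 300-byte packets, faced with a burst of 40-byte packets,
# can only fill a fraction of the link.
design_pps = pps_needed(LINERATE, 300)
fraction = throughput_bps(design_pps, 40) / LINERATE
print(f"{fraction:.0%} of line rate on 40-byte packets")
```

The last figure is simply the ratio 40/300, which shows why a design based on the average packet size falls well short of line rate during runs of minimum-size packets.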
When it comes to the task of forwarding IP packets, routers can be broadly
characterized as having either a centralized or distributed forwarding model. In the
centralized model, the IP forwarding algorithm, outlined earlier in this section, is done
in a single processing engine that handles the traffic from all ports. In the distributed
model, there are several processing engines, perhaps one per port, or more often one
per line card, where a line card may serve one or more physical ports. Each model
has advantages and disadvantages. All things being equal, a distributed forwarding
model should be able to forward more packets per second through the router as a
whole because there is more processing power in total. But a distributed model also
complicates the software architecture because each forwarding engine typically needs
its own copy of the forwarding table, and thus it is necessary for the control processor
to ensure that the forwarding tables are updated consistently and in a timely manner.
In recent years, there has been considerable interest in the possibility of creating
network processors that could be used in the design of routers and other networking hardware. A network processor is intended to be a device that is just about as
programmable as a standard workstation or PC processor, but that is more highly
optimized for networking tasks. For example, a network processor might have instructions that are particularly well suited to performing lookups on IP addresses or
calculating checksums on IP datagrams.
One of the interesting and ongoing debates about network processors is whether
they can do a better job than the alternatives. For example, given the continuous and
remarkable improvements in performance of conventional processors, and the huge
industry that drives those improvements, can network processors keep up? And can a
device that strives for generality do as good a job as a custom-designed chip that does
nothing except, say, IP forwarding? Part of the answer to questions like these depends
on what you mean by “do a better job.” For example, there will always be trade-offs to
be made between cost of hardware, time to market, performance, and flexibility—the
ability to change the features supported by a router after it is built. We will see in the
rest of this chapter and in later chapters just how diverse the requirements for router
functionality can be. It is safe to assume that a wide range of router designs will exist
for the foreseeable future and that network processors will have some role to play.
Address Translation (ARP)
In the previous section we talked about how to get IP datagrams to the right physical
network, but glossed over the issue of how to get a datagram to a particular host or
router on that network. The main issue is that IP datagrams contain IP addresses, but
the physical interface hardware on the host or router to which you want to send the
datagram only understands the addressing scheme of that particular network. Thus, we
need to translate the IP address to a link-level address that makes sense on this network
(e.g., a 48-bit Ethernet address). We can then encapsulate the IP datagram inside a
frame that contains that link-level address and send it either to the ultimate destination
or to a router that promises to forward the datagram toward the ultimate destination.
One simple way to map an IP address into a physical network address is to
encode a host’s physical address in the host part of its IP address. For example, a host
with physical address 00100001 01010001 (which has the decimal value 33 in the upper
byte and 81 in the lower byte) might be given an IP address whose host part is 33.81. While this
solution has been used on some networks, it is limited in that the network’s physical
addresses can be no more than 16 bits long in this example; they can be only 8 bits
long on a class C network. This clearly will not work for 48-bit Ethernet addresses.
A more general solution would be for each host to maintain a table of address
pairs; that is, the table would map IP addresses into physical addresses. While this table
could be centrally managed by a system administrator and then copied to each host
on the network, a better approach would be for each host to dynamically learn the
contents of the table using the network. This can be accomplished using the Address
Resolution Protocol (ARP). The goal of ARP is to enable each host on a network to
build up a table of mappings between IP addresses and link-level addresses. Since these
mappings may change over time (e.g., because an Ethernet card in a host breaks and
is replaced by a new one with a new address), the entries are timed out periodically
and removed. This happens on the order of every 15 minutes. The set of mappings
currently stored in a host is known as the ARP cache or ARP table.
ARP takes advantage of the fact that many link-level network technologies, such
as Ethernet and token ring, support broadcast. If a host wants to send an IP datagram
to a host (or router) that it knows to be on the same network (i.e., the sending and
receiving node have the same IP network number), it first checks for a mapping in
the cache. If no mapping is found, it needs to invoke the Address Resolution Protocol
over the network. It does this by broadcasting an ARP query onto the network. This
query contains the IP address in question (the “target IP address”). Each host receives
the query and checks to see if it matches its IP address. If it does match, the host
sends a response message that contains its link-layer address back to the originator of
the query. The originator adds the information contained in this response to its ARP
table.
Figure 4.8
ARP packet format for mapping IP addresses into Ethernet addresses. Each row below is
one 32-bit word:
  HardwareType = 1 | ProtocolType = 0x0800
  HLen = 48 | PLen = 32 | Operation
  SourceHardwareAddr (bytes 0–3)
  SourceHardwareAddr (bytes 4–5) | SourceProtocolAddr (bytes 0–1)
  SourceProtocolAddr (bytes 2–3) | TargetHardwareAddr (bytes 0–1)
  TargetHardwareAddr (bytes 2–5)
  TargetProtocolAddr (bytes 0–3)
The query message also includes the IP address and link-layer address of the
sending host. Thus, when a host broadcasts a query message, each host on the network
can learn the sender’s link-level and IP addresses and place that information in its
ARP table. However, not every host adds this information to its ARP table. If the host
already has an entry for that host in its table, it “refreshes” this entry; that is, it resets
the length of time until it discards the entry. If that host is the target of the query, then
it adds the information about the sender to its table, even if it did not already have an
entry for that host. This is because there is a good chance that the source host is about
to send it an application-level message, and it may eventually have to send a response
or ACK back to the source; it will need the source’s physical address to do this. If a host
is not the target and does not already have an entry for the source in its ARP table, then
it does not add an entry for the source. This is because there is no reason to believe
that this host will ever need the source’s link-level address; there is no need to clutter
its ARP table with this information.
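The rules a host applies when it hears an ARP query can be summarized in code. The following is a simplified sketch, not an implementation of any particular operating system's ARP module: the class name, method names, and the use of plain strings for addresses are all assumptions made for illustration. The 15-minute timeout comes from the text.

```python
ARP_TIMEOUT = 15 * 60  # seconds; entries time out after roughly 15 minutes


class ArpCache:
    def __init__(self, my_ip, my_mac):
        self.my_ip = my_ip
        self.my_mac = my_mac
        self.entries = {}  # IP address -> (link-level address, expiry time)

    def lookup(self, ip, now):
        """Return the cached link-level address, or None if absent or expired."""
        entry = self.entries.get(ip)
        if entry is None or entry[1] <= now:
            self.entries.pop(ip, None)
            return None
        return entry[0]

    def handle_query(self, sender_ip, sender_mac, target_ip, now):
        """Process a broadcast query; return our address if we are the target."""
        is_target = target_ip == self.my_ip
        already_known = self.lookup(sender_ip, now) is not None
        # Refresh an existing entry, or add one if the query is aimed at us.
        # A bystander with no entry for the sender learns nothing.
        if already_known or is_target:
            self.entries[sender_ip] = (sender_mac, now + ARP_TIMEOUT)
        return self.my_mac if is_target else None


target = ArpCache("10.0.0.2", "mac-of-target")
bystander = ArpCache("10.0.0.3", "mac-of-bystander")
for host in (target, bystander):
    host.handle_query("10.0.0.1", "mac-of-sender", "10.0.0.2", now=0)
print(target.lookup("10.0.0.1", now=1), bystander.lookup("10.0.0.1", now=1))
```

The addresses here are placeholders; only the target ends up with an entry for the sender, which matches the behavior described above.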
Figure 4.8 shows the ARP packet format for IP-to-Ethernet address mappings.
In fact, ARP can be used for lots of other kinds of mappings—the major differences
are in the address sizes. In addition to the IP and link-layer addresses of both sender
and target, the packet contains
■ a HardwareType field, which specifies the type of physical network (e.g., Ethernet)
■ a ProtocolType field, which specifies the higher-layer protocol (e.g., IP)
■ HLen (“hardware” address length) and PLen (“protocol” address length) fields,
which specify the length of the link-layer address and higher-layer protocol
address, respectively
■ an Operation field, which specifies whether this is a request or a response
■ the source and target hardware (Ethernet) and protocol (IP) addresses
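To make the field list concrete, the sketch below packs an ARP request for the IP-over-Ethernet case using Python's `struct` module. It is a hedged illustration: the function name and the example addresses are made up, and the length fields are written as byte counts (6 and 4), which is how RFC 826 encodes them on the wire, whereas Figure 4.8 shows the same lengths in bits (48 and 32).

```python
import socket
import struct

ARP_REQUEST = 1  # Operation code for a request in RFC 826 (a reply is 2)


def build_arp_request(src_mac: bytes, src_ip: str, target_ip: str) -> bytes:
    """Pack an ARP request asking for the Ethernet address of target_ip."""
    return struct.pack(
        "!HHBBH6s4s6s4s",             # network byte order, no padding
        1,                            # HardwareType: Ethernet
        0x0800,                       # ProtocolType: IP
        6,                            # HLen: Ethernet address length in bytes
        4,                            # PLen: IP address length in bytes
        ARP_REQUEST,                  # Operation
        src_mac,                      # SourceHardwareAddr
        socket.inet_aton(src_ip),     # SourceProtocolAddr
        b"\x00" * 6,                  # TargetHardwareAddr: unknown, left zero
        socket.inet_aton(target_ip),  # TargetProtocolAddr
    )


# Hypothetical addresses, chosen only for the example.
packet = build_arp_request(bytes.fromhex("0800200a0b0c"), "10.0.0.1", "10.0.0.2")
print(len(packet), packet.hex())
```

The packed message is 28 bytes long, which corresponds to the seven 32-bit words of the figure.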
Note that the results of the ARP process can be added as an extra column in
a forwarding table like the one in Table 4.1. Thus, for example, when R2 needs to
forward a packet to network 2, it not only finds that the next hop is R1, but also finds
the MAC address to place on the packet to send it to R1.
It should be clear that if an ATM network is to operate as part of an IP internetwork,
then it too must provide a form of ARP. However, the procedure just described will
clearly not work on a simple ATM network, because it depends on the fact that ARP
packets can be broadcast to all hosts on a single network. One solution to this problem
is to use the LAN emulation procedures described in Section 3.3.5. Since the goal of
these procedures is to make an ATM network behave just like a shared-media LAN,
which includes support for broadcast, the effect is to reduce ARP to a previously solved
problem.
There are, however, situations where it may not be desirable to treat an ATM
network as an emulated LAN. In particular, LAN emulation can be quite inefficient in
a large, wide area ATM network. Recall that in an emulated LAN many packets may
need to be sent to the broadcast and unknown server, which then floods those packets
to all nodes on the emulated LAN. Clearly, there are limits to how far this can scale.
The problem here is that adding broadcast capabilities to an intrinsically nonbroadcast
network, while useful in some circumstances, is really overkill if the only reason you
need broadcast is to enable address resolution.
For this reason, there is a different ARP procedure that may be used in an ATM
network and that does not depend on broadcast or LAN emulation. This procedure
is known as ATMARP and is part of the Classical IP over ATM model. The reason
for calling the model “classical” will become apparent shortly. Like LAN emulation,
ATMARP relies on the use of a server to resolve addresses—in this case, it is called an
ARP server, and its behavior is described below.
A key concept in the Classical IP over ATM model is the logical IP subnet (LIS).
The LIS abstraction allows us to take one large ATM network and subdivide it into
several smaller subnets. (We define “subnet” precisely in Section 4.3.1, but in this case
a subnet behaves much like a single network.) All nodes on the same subnet have the
same IP network number. And just as in “classical” IP, two nodes (hosts or routers)
that are on the same subnet can communicate directly over the ATM network, whereas
two nodes that are on different subnets will have to communicate via one or more
routers. An example of an ATM network divided into two LISs appears in Figure 4.9.
Note that the IP address of host H1 has a network number of 10, as does the router
interface that connects to the left-hand LIS, while H2 has a network number of 12, as
does the right-hand interface on the router. That is, H1 and the router connect to the
same LIS (LIS 10) while H2 is on a different subnet (LIS 12) to which the router also
connects.
Figure 4.9
Logical IP subnets. (Diagram labels: LIS 10, LIS 12, ATM network.)
An advantage of the LIS model is that we can connect a large number of hosts and
routers to a big ATM network without necessarily giving them all addresses from the
same IP network. This may make it easier to manage address assignment, for example,
in the case where not all nodes connected to the ATM network are under the control
of the same administrative entity. The division of the ATM network into a number of
LISs also improves scalability by limiting the number of nodes that must be supported
by a single ARP server.
The basic job of an ARP server is to enable nodes on a LIS to resolve IP addresses
to ATM addresses without using broadcast. Each node in the LIS must be configured
with the ATM address of the ARP server, so that it can establish a VC to the server
when it boots. Once it has a VC to the server, the node sends a registration message
to the ARP server that contains both the IP and ATM addresses of the registering
node. Thus the ARP server builds up a complete database of all the (IP address, ATM
address) pairs. Once this is in place, any node that wants to send a packet to some
IP address can ask the ARP server to provide the corresponding ATM address. Once
this is received, the sending node can use ATM signalling to set up a VC to that ATM
address, and then send the packet. Just like conventional ARP, a cache of IP-to-ATM
address mappings can be maintained. In addition, the node can keep a VC established
to that ATM destination as long as there is enough traffic flowing to justify it, thus
avoiding the delay of setting up the VC again when the next packet arrives.
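The registration and resolution steps just described can be sketched as follows. This is a toy model under stated assumptions: the class and method names are invented for the example, ATM addresses are represented by placeholder strings, and VC setup is not modeled at all.

```python
class AtmArpServer:
    """Holds the (IP address, ATM address) pairs for one LIS."""

    def __init__(self):
        self.database = {}

    def register(self, ip, atm_addr):
        self.database[ip] = atm_addr

    def resolve(self, ip):
        return self.database.get(ip)  # None if the node never registered


class LisNode:
    def __init__(self, ip, atm_addr, server):
        self.server = server
        self.cache = {}  # local cache of IP-to-ATM mappings
        # On boot, the node registers its own address pair with the server.
        server.register(ip, atm_addr)

    def atm_address_for(self, dest_ip):
        """Consult the local cache first, then ask the ARP server."""
        if dest_ip not in self.cache:
            atm_addr = self.server.resolve(dest_ip)
            if atm_addr is None:
                return None
            self.cache[dest_ip] = atm_addr
        return self.cache[dest_ip]


server = AtmArpServer()
h1 = LisNode("10.0.0.1", "atm-addr-of-h1", server)
router = LisNode("10.0.0.2", "atm-addr-of-router", server)
print(h1.atm_address_for("10.0.0.2"))
```

No broadcast is involved at any point: every lookup is a unicast exchange with a server whose ATM address the node was configured with.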
An interesting consequence of the Classical IP over ATM model is that two nodes
on the same ATM network cannot establish a direct VC between themselves if they
are on different subnets. This would violate the rule that communication from one
subnet to another must pass through a router. For example, host H1 and host H2 in
Figure 4.9 cannot establish a direct VC under the classical model. Instead, each needs
to have a VC to router R. The simple explanation for this rule is that IP routing is
known to work well when that rule is obeyed, as it is in non-ATM networks. New
techniques to work around that rule have been developed, but they have introduced
considerable complexity and problems of robustness.
We have now seen the basic mechanisms that IP provides for dealing with both
heterogeneity and scale. On the issue of heterogeneity, IP begins by defining a best-effort service model that makes minimal assumptions about the underlying networks;
most notably, this service model is based on unreliable datagrams. IP then makes two
important additions to this starting point: (1) a common packet format (fragmentation/reassembly is the mechanism that makes this format work over networks with
different MTUs) and (2) a global address space for identifying all hosts (ARP is the
mechanism that makes this global address space work over networks with different
physical addressing schemes). On the issue of scale, IP uses hierarchical aggregation to
reduce the amount of information needed to forward packets. Specifically, IP addresses
are partitioned into network and host components, with packets first routed toward
the destination network and then delivered to the correct host on that network.
Host Configuration (DHCP)
In Section 2.6 we observed that Ethernet addresses are configured into the network
adaptor by the manufacturer, and this process is managed in such a way to ensure that
these addresses are globally unique. This is clearly a sufficient condition to ensure
that any collection of hosts connected to a single Ethernet (including an extended
LAN) will have unique addresses. Furthermore, uniqueness is all we ask of Ethernet
addresses.
IP addresses, by contrast, not only must be unique on a given internetwork, but
also must reflect the structure of the internetwork. As noted above, they contain a
network part and a host part, and the network part must be the same for all hosts
on the same network. Thus, it is not possible for the IP address to be configured once
into a host when it is manufactured, since that would imply that the manufacturer
knew which hosts were going to end up on which networks, and it would mean that
a host, once connected to one network, could never move to another. For this reason,
IP addresses need to be reconfigurable.
In addition to an IP address, there are some other pieces of information a host
needs to have before it can start sending packets. The most notable of these is the
address of a default router—the place to which it can send packets whose destination
address is not on the same network as the sending host.
Most host operating systems provide a way for a system administrator, or even
a user, to manually configure the IP information needed by a host. However, there
are some obvious drawbacks to such manual configuration. One is that it is simply
a lot of work to configure all the hosts in a large network directly, especially when
you consider that such hosts are not reachable over a network until they are configured. Even more importantly, the configuration process is very error-prone, since it is
necessary to ensure that every host gets the correct network number and that no two
hosts receive the same IP address. For these reasons, automated configuration methods are required. The primary method uses a protocol known as the Dynamic Host
Configuration Protocol (DHCP).
DHCP relies on the existence of a DHCP server that is responsible for providing
configuration information to hosts. There is at least one DHCP server for an administrative domain. At the simplest level, the DHCP server can function just as a centralized
repository for host configuration information. Consider, for example, the problem of
administering addresses in the internetwork of a large company. DHCP saves the network administrators from having to walk around to every host in the company with a
list of addresses and network map in hand and configuring each host manually. Instead,
the configuration information for each host could be stored in the DHCP server and
automatically retrieved by each host when it is booted or connected to the network.
However, the administrator would still pick the address that each host is to receive
and would simply store it in the server. In this model, the configuration information for
each host is stored in a table that is indexed by some form of unique client identifier,
typically the “hardware address” (e.g., the Ethernet address of its network adaptor).
A more sophisticated use of DHCP saves the network administrator from even
having to assign addresses to individual hosts. In this model, the DHCP server maintains a pool of available addresses that it hands out to hosts on demand. This considerably reduces the amount of configuration an administrator must do, since now it is
only necessary to allocate a range of IP addresses (all with the same network number)
to each network.
Since the goal of DHCP is to minimize the amount of manual configuration
required for a host to function, it would rather defeat the purpose if each host had
to be configured with the address of a DHCP server. Thus, the first problem faced by
DHCP is that of server discovery.
To contact a DHCP server, a newly booted or attached host sends a
DHCPDISCOVER message to a special IP address that is an IP broadcast address. This means it will be received by all hosts and routers on that network.
(Routers do not forward such packets onto other networks, preventing broadcast to
the entire Internet.) In the simplest case, one of these nodes is the DHCP server for the
network. The server would then reply to the host that generated the discovery message
(all the other nodes would ignore it). However, it is not really desirable to require one
DHCP server on every network because this still creates a potentially large number
of servers that need to be correctly and consistently configured. Thus, DHCP uses the
concept of a relay agent. There is at least one relay agent on each network, and it is
configured with just one piece of information: the IP address of the DHCP server. When
a relay agent receives a DHCPDISCOVER message, it unicasts it to the DHCP server and
awaits the response, which it will then send back to the requesting client. The process
of relaying a message from a host to a remote DHCP server is shown in Figure 4.10.
Figure 4.10 A DHCP relay agent receives a broadcast DHCPDISCOVER message from
a host and sends a unicast DHCPDISCOVER message to the DHCP server.
Figure 4.11 DHCP packet format. (Fields shown include chaddr (16 bytes),
sname (64 bytes), and file (128 bytes).)
Figure 4.11 shows the format of a DHCP message. The message is actually sent
using a protocol called UDP (the User Datagram Protocol) that runs over IP. UDP is
discussed in detail in the next chapter, but the only interesting thing it does in this
context is to provide a demultiplexing key that says, “This is a DHCP packet.”
DHCP is derived from an earlier protocol called BOOTP, and some of the packet
fields are thus not strictly relevant to host configuration. When trying to obtain configuration information, the client puts its hardware address (e.g., its Ethernet address) in
the chaddr field. The DHCP server replies by filling in the yiaddr (“your” IP address)
field and sending it to the client. Other information such as the default router to be
used by this client can be included in the options field.
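As an illustration of how the chaddr and yiaddr fields are used, the sketch below packs and unpacks the fixed-length part of a DHCP message. It is a simplified example rather than a working client: the sizes of chaddr, sname, and file match Figure 4.11, but the remaining field names and sizes are assumed from the BOOTP-derived layout in RFC 2131, and the variable-length options field is omitted.

```python
import socket
import struct

# op, htype, hlen, hops, xid, secs, flags,
# ciaddr, yiaddr, siaddr, giaddr, chaddr, sname, file
FIXED_HEADER = struct.Struct("!BBBBIHH4s4s4s4s16s64s128s")
NO_ADDR = b"\x00" * 4


def build_request(chaddr: bytes, xid: int) -> bytes:
    """Client request: hardware address in chaddr, all IP fields zero."""
    return FIXED_HEADER.pack(
        1,            # op: 1 marks a request from a client
        1,            # htype: Ethernet
        len(chaddr),  # hlen: hardware address length in bytes
        0, xid, 0, 0,
        NO_ADDR, NO_ADDR, NO_ADDR, NO_ADDR,
        chaddr,       # padded with zero bytes to 16 bytes by struct
        b"", b"",     # sname and file left empty
    )


def yiaddr_of(message: bytes) -> str:
    """Extract the "your IP address" field that a server fills in."""
    fields = FIXED_HEADER.unpack(message[:FIXED_HEADER.size])
    return socket.inet_ntoa(fields[8])


# Hypothetical Ethernet address and transaction ID.
request = build_request(bytes.fromhex("0800200a0b0c"), xid=0x12345678)
print(FIXED_HEADER.size, yiaddr_of(request))
```

In a request the yiaddr field is still all zeros; the server's reply carries the assigned address in the same position.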
In the case where DHCP dynamically assigns IP addresses to hosts, it is clear
that hosts cannot keep addresses indefinitely, as this would eventually cause the server
to exhaust its address pool. At the same time, a host cannot be depended upon to
give back its address, since it might have crashed, been unplugged from the network,
or been turned off. Thus, DHCP allows addresses to be “leased” for some period of
time. Once the lease expires, the server is free to return that address to its pool. A host
with a leased address clearly needs to renew the lease periodically if in fact it is still
connected to the network and functioning correctly.
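The leasing idea can be sketched with a small address pool. This is a toy model, not the DHCP state machine: the class name, the one-hour lease length, and the example addresses are assumptions made for illustration, and a renewal is modeled simply as a repeated request from the same client.

```python
class LeasePool:
    def __init__(self, addresses, lease_seconds):
        self.free = list(addresses)
        self.lease_seconds = lease_seconds
        self.leases = {}  # client identifier -> (address, expiry time)

    def _reclaim_expired(self, now):
        """Return addresses whose leases have run out to the free pool."""
        for client, (addr, expiry) in list(self.leases.items()):
            if expiry <= now:
                del self.leases[client]
                self.free.append(addr)

    def request(self, client_id, now):
        """Grant or renew a lease; return None if the pool is exhausted."""
        self._reclaim_expired(now)
        if client_id in self.leases:
            addr = self.leases[client_id][0]  # renewal keeps the same address
        elif self.free:
            addr = self.free.pop(0)
        else:
            return None
        self.leases[client_id] = (addr, now + self.lease_seconds)
        return addr


pool = LeasePool(["192.0.2.10", "192.0.2.11"], lease_seconds=3600)
print(pool.request("host-a", now=0))
print(pool.request("host-b", now=0))
print(pool.request("host-c", now=10))    # pool exhausted
print(pool.request("host-a", now=3000))  # host-a renews in time
print(pool.request("host-c", now=3600))  # host-b's lease has expired
```

A host that crashes or is unplugged never has to hand its address back; the server recovers it when the lease runs out.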
DHCP illustrates an important aspect of scaling: the scaling of network management. While discussions of scaling often focus on keeping the state in network
devices from growing too rapidly, it is important to pay attention to growth of network management complexity. By allowing network managers to configure a range
of IP addresses per network rather than one IP address per host, DHCP improves the
manageability of a network.
Note that DHCP may also introduce some more complexity into network management, since it makes the binding between physical hosts and IP addresses much
more dynamic. This may make the network manager’s job more difficult if, for example, it becomes necessary to locate a malfunctioning host.
Error Reporting (ICMP)
The next issue is how the Internet treats errors. While IP is perfectly willing to drop
datagrams when the going gets tough—for example, when a router does not know
how to forward the datagram or when one fragment of a datagram fails to arrive
at the destination—it does not necessarily fail silently. IP is always configured with a
companion protocol, known as the Internet Control Message Protocol (ICMP), that
defines a collection of error messages that are sent back to the source host whenever
a router or host is unable to process an IP datagram successfully. For example, ICMP
defines error messages indicating that the destination host is unreachable (perhaps due
to a link failure), that the reassembly process failed, that the TTL had reached 0, that
the IP header checksum failed, and so on.
ICMP also defines a handful of control messages that a router can send back
to a source host. One of the most useful control messages, called an ICMP-Redirect,
tells the source host that there is a better route to the destination. ICMP-Redirects are
used in the following situation. Suppose a host is connected to a network that has two
routers attached to it, called R1 and R2, where the host uses R1 as its default router.
Should R1 ever receive a datagram from the host, where based on its forwarding
table it knows that R2 would have been a better choice for a particular destination
address, it sends an ICMP-Redirect back to the host, instructing it to use R2 for all
future datagrams addressed to that destination. The host then adds this new route to
its forwarding table.
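The effect of an ICMP-Redirect on the host can be sketched as follows. This is a minimal illustration under stated assumptions: the class and method names are invented, destinations are plain strings, and no validation of the redirect is modeled.

```python
class HostRoutes:
    """A host's forwarding state: one default router plus learned routes."""

    def __init__(self, default_router):
        self.default_router = default_router
        self.by_destination = {}  # destination address -> router to use

    def next_hop(self, destination):
        return self.by_destination.get(destination, self.default_router)

    def handle_redirect(self, destination, better_router):
        """Record the route suggested by an ICMP-Redirect."""
        self.by_destination[destination] = better_router


host = HostRoutes(default_router="R1")
print(host.next_hop("dest-x"))         # first datagram goes via R1
host.handle_redirect("dest-x", "R2")   # R1 reports that R2 is the better choice
print(host.next_hop("dest-x"))         # later datagrams go via R2
print(host.next_hop("dest-y"))         # other destinations still use R1
```

The redirect changes the route for that one destination only; the default router is unchanged.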
Virtual Networks and Tunnels
We conclude our introduction to IP by considering an issue you might not have anticipated, but one that is becoming increasingly important. Our discussion up to this point
has focused on making it possible for nodes on different networks to communicate with
each other in an unrestricted way. This is usually the goal in the Internet—everybody
wants to be able to send email to everybody, and the creator of a new Web site wants
to reach the widest possible audience. However, there are many situations where more
controlled connectivity is required. An important example of such a situation is the
virtual private network (VPN).
The term “VPN” is heavily overused and definitions vary, but intuitively we can
define a VPN by considering first the idea of a private network. Corporations with
many sites often build private networks by leasing transmission lines from the phone
companies and using those lines to interconnect sites. In such a network, communication is restricted to take place only among the sites of that corporation, which is often
desirable for security reasons. To make a private network virtual, the leased transmission lines—which are not shared with any other corporations—would be replaced by
some sort of shared network. A virtual circuit is a very reasonable replacement for
a leased line because it still provides a logical point-to-point connection between the
corporation’s sites. For example, if corporation X has a VC from site A to site B, then
clearly it can send packets between sites A and B. But there is no way that corporation Y can get its packets delivered to site B without first establishing its own virtual
circuit to site B, and the establishment of such a VC can be administratively prevented,
thus preventing unwanted connectivity between corporation X and corporation Y.
Figure 4.12(a) shows two private networks for two separate corporations. In
Figure 4.12(b) they are both migrated to a virtual circuit network. The limited connectivity of a real private network is maintained, but since the private networks now
share the same transmission facilities and switches we say that two virtual private
networks have been created.
Figure 4.12 An example of virtual private networks: (a) two separate private networks;
(b) two virtual private networks sharing common switches.
In Figure 4.12, a Frame Relay or ATM network is used to provide the controlled connectivity among sites. It is also possible to provide a similar function
using an IP network—an internetwork—to provide the connectivity. However, we
cannot just connect the various corporations’ sites to a single internetwork because
that would provide connectivity between corporation X and corporation Y, which
we wish to avoid. To solve this problem, we need to introduce a new concept, the
IP tunnel.
We can think of an IP tunnel as a virtual point-to-point link between a pair of
nodes that are actually separated by an arbitrary number of networks. The virtual link
is created within the router at the entrance to the tunnel by providing it with the IP
address of the router at the far end of the tunnel. Whenever the router at the entrance
of the tunnel wants to send a packet over this virtual link, it encapsulates the packet
inside an IP datagram. The destination address in the IP header is the address of the
router at the far end of the tunnel, while the source address is that of the encapsulating
router.
Figure 4.13 A tunnel through an internetwork.
Network Number    Next Hop
1                 Interface 0
2                 Virtual interface 0
Default           Interface 1
Table 4.3 Forwarding table for router R1 in Figure 4.13.
In the forwarding table of the router at the entrance to the tunnel, this virtual
link looks much like a normal link. Consider, for example, the network in Figure 4.13.
A tunnel has been configured from R1 to R2 and assigned a virtual interface number
of 0. The forwarding table in R1 might therefore look like Table 4.3.
R1 has two physical interfaces. Interface 0 connects to network 1; interface 1
connects to a large internetwork and is thus the default for all traffic that does not
match something more specific in the forwarding table. In addition, R1 has a virtual
interface, which is the interface to the tunnel. Suppose R1 receives a packet from
network 1 that contains an address in network 2. The forwarding table says this packet
should be sent out virtual interface 0. In order to send a packet out this interface, the
router takes the packet, adds an IP header addressed to R2, and then proceeds to
forward the packet as if it had just been received. Since the network number of R2's
address is 10, not 1 or 2, a packet destined for R2 will be
forwarded out the default interface into the internetwork.
Once the packet leaves R1, it looks to the rest of the world like a normal IP packet
destined to R2, and it is forwarded accordingly. All the routers in the internetwork
forward it using normal means, until it arrives at R2. When R2 receives the packet,
it finds that it carries its own address, so it removes the IP header and looks at the
payload of the packet. What it finds is an inner IP packet whose destination address is
in network 2. R2 now processes this packet like any other IP packet it receives. Since
R2 is directly connected to network 2, it forwards the packet on to that network.
Figure 4.13 shows the change in encapsulation of the packet as it moves across the network.
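The encapsulation and decapsulation steps just described can be sketched as follows. This is a simplified model rather than a real IP implementation: a packet is a small stack of headers instead of a byte buffer, the host parts of the router addresses (10.1 for R1, 10.2 for R2) are hypothetical since the text gives only R2's network number, and all names are ours.

```c
#include <assert.h>

typedef struct { int net, host; } Addr;    /* simplified IP address */
typedef struct { Addr src, dst; } Header;  /* simplified IP header */

#define MAX_DEPTH 2                        /* inner header plus one tunnel header */
typedef struct {
    Header hdr[MAX_DEPTH];                 /* hdr[depth - 1] is the outermost header */
    int    depth;
} Packet;

enum { IF0, IF1, VIF0 };                   /* VIF0 is the virtual (tunnel) interface */

static const Addr R1_ADDR = { 10, 1 };     /* assumed address of R1 */
static const Addr R2_ADDR = { 10, 2 };     /* network 10 per the text; host part assumed */

/* R1's forwarding table: network 1 and network 2 have specific
 * entries; everything else takes the default interface. */
static int lookupR1(int net)
{
    switch (net) {
    case 1:  return IF0;
    case 2:  return VIF0;
    default: return IF1;
    }
}

/* Forward at R1; returns the physical interface the packet leaves on. */
int forwardAtR1(Packet *p)
{
    int out = lookupR1(p->hdr[p->depth - 1].dst.net);
    if (out == VIF0) {
        assert(p->depth < MAX_DEPTH);
        p->hdr[p->depth].src = R1_ADDR;    /* encapsulating router */
        p->hdr[p->depth].dst = R2_ADDR;    /* far end of the tunnel */
        ++p->depth;
        out = lookupR1(R2_ADDR.net);       /* forward as if just received */
    }
    return out;
}

/* At R2: strip the tunnel header if the packet carries R2's own address.
 * Returns 1 if a header was removed. */
int receiveAtR2(Packet *p)
{
    const Addr *d = &p->hdr[p->depth - 1].dst;
    if (p->depth > 1 && d->net == R2_ADDR.net && d->host == R2_ADDR.host) {
        --p->depth;
        return 1;
    }
    return 0;
}
```

A packet from network 1 addressed to network 2 leaves R1 on interface 1 carrying two headers; after receiveAtR2 removes the outer one, R2 sees the original packet and forwards it normally.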
While R2 is acting as the endpoint of the tunnel, there is nothing to prevent it
from performing the normal functions of a router. For example, it might receive some
packets that are not tunneled, but that are addressed to networks it knows how to
reach, and it would forward them in the normal way.
You might wonder why anyone would want to go to all the trouble of creating a
tunnel and changing the encapsulation of a packet as it goes across an internetwork.
One reason is security, which we will discuss in more detail in Chapter 8. Supplemented
with encryption, a tunnel can become a very private sort of link across a public network. Another reason may be that R1 and R2 have some capabilities that are not
widely available in the intervening networks, such as multicast routing. By connecting
these routers with a tunnel, we can build a virtual network in which all the routers
with this capability appear to be directly connected. This in fact is how the MBone
(multicast backbone) is built, as we will see in Section 4.4. A third reason to build
tunnels is to carry packets from protocols other than IP across an IP network. As long
as the routers at either end of the tunnel know how to handle these other protocols,
the IP tunnel looks to them like a point-to-point link over which they can send non-IP
packets. Tunnels also provide a mechanism by which we can force a packet to be delivered to a particular place even if its original header—the one that gets encapsulated
inside the tunnel header—might suggest that it should go somewhere else. We will see
an application of this when we consider mobile hosts in Section 4.2.5. Thus, we see that
tunneling is a powerful and quite general technique for building virtual links across internetworks.
Tunneling does have its downsides. One is that it increases the length of packets;
this might represent a significant waste of bandwidth for short packets. There may also
be performance implications for the routers at either end of the tunnel, since they need
to do more work than normal forwarding as they add and remove the tunnel header.
Finally, there is a management cost for the administrative entity that is responsible
for setting up the tunnels and making sure they are correctly handled by the routing protocols.
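To put a number on the bandwidth cost for short packets, the following sketch computes what fraction of the bytes on the wire is tunnel header. It assumes the tunnel adds exactly one IPv4 header without options (20 bytes); other tunneling schemes add more, and the function name is ours.

```c
#include <assert.h>

#define IPV4_MIN_HEADER 20   /* bytes in an IPv4 header without options */

/* Percentage (rounded down) of the tunneled packet's bytes that are
 * tunnel header, for an inner packet of the given total length. */
int tunnelOverheadPercent(int innerPacketLen)
{
    assert(innerPacketLen > 0);
    return (100 * IPV4_MIN_HEADER) / (innerPacketLen + IPV4_MIN_HEADER);
}
```

Under this assumption a 40-byte inner packet spends 33 percent of its tunneled length on the extra header, while a 1480-byte packet spends about 1 percent.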
4.2 Routing
In both this and the previous chapter we have assumed that the switches and routers
have enough knowledge of the network topology so they can choose the right port onto
which each packet should be output. In the case of virtual circuits, routing is an issue
only for the connection request packet; all subsequent packets follow the same path
as the request. In datagram networks, including IP networks, routing is an issue for
every packet. In either case, a switch or router needs to be able to look at the packet’s
destination address and then to determine which of the output ports is the best choice
to get the packet to that address. As we saw in Section 3.1.1, the switch makes this
decision by consulting a forwarding table. The fundamental problem of routing is,
How do switches and routers acquire the information in their forwarding tables?
We restate an important distinction, which is often neglected, between forwarding and routing. Forwarding consists of taking a packet, looking at its destination
address, consulting a table, and sending the packet in a direction determined by that
table. We saw several examples of forwarding in the preceding section. Routing is the
process by which forwarding tables are built. We also note that forwarding is a relatively simple and well-defined process performed locally at a node, whereas routing
depends on complex distributed algorithms that have continued to evolve throughout
the history of networking.
While the terms forwarding table and routing table are sometimes used interchangeably, we will make a distinction between them here. The forwarding table is
used when a packet is being forwarded and so must contain enough information to
accomplish the forwarding function. This means that a row in the forwarding table
contains the mapping from a network number to an outgoing interface and some MAC
information, such as the Ethernet address of the next hop. The routing table, on the
other hand, is the table that is built up by the routing algorithms as a precursor to
building the forwarding table. It generally contains mappings from network numbers
to next hops. It may also contain information about how this information was learned,
so that the router will be able to decide when it should discard some information.
Whether the routing table and forwarding table are actually separate data structures is something of an implementation choice, but there are numerous reasons to
keep them separate. For example, the forwarding table needs to be structured to optimize the process of looking up a network number when forwarding a packet, while the
routing table needs to be optimized for the purpose of calculating changes in topology.
In some cases, the forwarding table may even be implemented in specialized hardware,
whereas this is rarely if ever done for the routing table. Table 4.4 provides an example
of a row from each sort of table. In this case, the routing table tells us that network
number 10 is to be reached by a next hop router, identified by its IP address,
while the forwarding table contains the information about exactly how to forward
a packet to that next hop: Send it out interface number 0 with a MAC address of
8:0:2b:e4:b:1:2. Note that the last piece of information is provided by the Address
Resolution Protocol.
Table 4.4 Example rows from (a) routing and (b) forwarding tables.
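One way to picture the relationship between the two tables is as two record types, with a forwarding row derived from a routing row once the next hop has been resolved. The sketch below is only an illustration under our own assumptions: the field names, the learnedFrom code, and the ArpEntry type (standing in for whatever the Address Resolution Protocol has cached) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t netNum;       /* destination network number */
    uint32_t nextHopIp;    /* IP address of the next hop router */
    int      learnedFrom;  /* how the route was learned (hypothetical code) */
} RoutingEntry;

typedef struct {
    uint32_t netNum;       /* destination network number */
    int      ifNum;        /* outgoing interface */
    uint8_t  mac[6];       /* MAC address of the next hop */
} ForwardingEntry;

typedef struct {           /* one resolved neighbor, as ARP would supply it */
    uint32_t ip;
    int      ifNum;
    uint8_t  mac[6];
} ArpEntry;

/* Derive a forwarding row from a routing row. Returns 1 on success,
 * 0 if the next hop has not been resolved yet. */
int buildForwardingEntry(const RoutingEntry *r, const ArpEntry *arp,
                         int numArp, ForwardingEntry *out)
{
    assert(numArp >= 0);
    for (int i = 0; i < numArp; ++i) {
        if (arp[i].ip == r->nextHopIp) {
            out->netNum = r->netNum;
            out->ifNum = arp[i].ifNum;
            memcpy(out->mac, arp[i].mac, sizeof out->mac);
            return 1;
        }
    }
    return 0;
}
```

Keeping the types separate mirrors the argument in the text: the routing row carries what the routing algorithm needs, while the forwarding row carries only what is needed to send a packet.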
Before getting into the details of routing, we need to remind ourselves of the key
question we should be asking anytime we try to build a mechanism for the Internet:
“Does this solution scale?” The answer for the algorithms and protocols described
in this section is no. They are designed for networks of fairly modest size—fewer
than a hundred nodes, in practice. However, the solutions we describe do serve as
a building block for a hierarchical routing infrastructure that is used in the Internet
today. Specifically, the protocols described in this section are collectively known as
intradomain routing protocols, or interior gateway protocols (IGPs). To understand
these terms, we need to define a routing domain: A good working definition is an
internetwork in which all the routers are under the same administrative control (e.g.,
a single university campus or the network of a single Internet service provider). The
relevance of this definition will become apparent in the next section when we look at
interdomain routing protocols. For now, the important thing to keep in mind is that we
are considering the problem of routing in the context of small to midsized networks,
not for a network the size of the Internet.
Network as a Graph
Routing is, in essence, a problem of graph theory. Figure 4.14 shows a graph representing a network. The nodes of the graph, labeled A through F, may be hosts,
switches, routers, or networks. For our initial discussion, we will focus on the case
where the nodes are routers. The edges of the graph correspond to the network links.
Each edge has an associated cost, which gives some indication of the desirability of
sending traffic over that link. A discussion of how edge costs are assigned is given in
Section 4.2.4.
Figure 4.14 Network represented as a graph.
The basic problem of routing is to find the lowest-cost path between any two
nodes, where the cost of a path equals the sum of the costs of all the edges that make
up the path. For a simple network like the one in Figure 4.14, you could imagine just
calculating all the shortest paths and loading them into some nonvolatile storage on
each node. Such a static approach has several shortcomings:
■ It does not deal with node or link failures.
■ It does not consider the addition of new nodes or links.
■ It implies that edge costs cannot change, even though we might reasonably
wish to temporarily assign a high cost to a link that is heavily loaded.
For these reasons, routing is achieved in most practical networks by running
routing protocols among the nodes. These protocols provide a distributed, dynamic
way to solve the problem of finding the lowest-cost path in the presence of link and node
failures and changing edge costs. Note the word “distributed” in the last sentence: It is
difficult to make centralized solutions scalable, so all the widely used routing protocols
use distributed algorithms.
The distributed nature of routing algorithms is one of the main reasons why this
has been such a rich field of research and development—there are a lot of challenges in
making distributed algorithms work well. For example, distributed algorithms raise the
possibility that two routers will at one instant have different ideas about the shortest
path to some destination. In fact, each one may think that the other one is closer to
the destination, and decide to send packets to the other one. Clearly, such packets will
be stuck in a loop until the discrepancy between the two routers is resolved, and it
would be good to resolve it as soon as possible. This is just one example of the type
of problem routing protocols must address.
Footnote: In the example networks (graphs) used throughout this chapter, we use undirected edges and assign each edge a
single cost. This is actually a slight simplification. It is more accurate to make the edges directed, which typically
means that there would be a pair of edges between each pair of nodes, one flowing in each direction, and each with its
own edge cost.
Figure 4.15 Distance-vector routing: an example network.
To begin our analysis, we assume that the edge costs in the network are known.
We will examine the two main classes of routing protocols: distance vector and link
state. In Section 4.2.4 we return to the problem of calculating edge costs in a meaningful way.
Distance Vector (RIP)
The idea behind the distance-vector algorithm is suggested by its name: Each node
constructs a one-dimensional array (a vector) containing the “distances” (costs) to
all other nodes and distributes that vector to its immediate neighbors. The starting
assumption for distance-vector routing is that each node knows the cost of the link to
each of its directly connected neighbors. A link that is down is assigned an infinite cost.
To see how a distance-vector routing algorithm works, it is easiest to consider
an example like the one depicted in Figure 4.15. In this example, the cost of each link
is set to 1, so that a least-cost path is simply the one with the fewest hops. (Since all
edges have the same cost, we do not show the costs in the graph.) We can represent
each node’s knowledge about the distances to all other nodes as a table like the one
given in Table 4.5. Note that each node only knows the information in one row of the
table (the one that bears its name in the left column). The global view that is presented
here is not available at any single point in the network.
Footnote: The other common name for this class of algorithm is Bellman-Ford, after its inventors.
Destination    Cost    Next Hop
B              1       B
C              1       C
D              ∞       -
E              1       E
F              1       F
G              ∞       -
Table 4.6 Initial routing table at node A.
Table 4.5 Initial distances stored at each node (global view).
We may consider each row in Table 4.5 as a list of distances from one node to
all other nodes, representing the current beliefs of that node. Initially, each node sets a
cost of 1 to its directly connected neighbors and ∞ to all other nodes. Thus, A initially
believes that it can reach B in one hop and that D is unreachable. The routing table
stored at A reflects this set of beliefs and includes the name of the next hop that A
would use to reach any reachable node. Initially, then, A’s routing table would look
like Table 4.6.
The next step in distance-vector routing is that every node sends a message to
its directly connected neighbors containing its personal list of distances. For example,
node F tells node A that it can reach node G at a cost of 1; A also knows it can reach
F at a cost of 1, so it adds these costs to get the cost of reaching G by means of F. This
total cost of 2 is less than the current cost of infinity, so A records that it can reach G
at a cost of 2 by going through F. Similarly, A learns from C that D can be reached
from C at a cost of 1; it adds this to the cost of reaching C (1) and decides that D can
be reached via C at a cost of 2, which is better than the old cost of infinity. At the same
time, A learns from C that B can be reached from C at a cost of 1, so it concludes that
the cost of reaching B via C is 2. Since this is worse than the current cost of reaching
B (1), this new information is ignored.
At this point, A can update its routing table with costs and next hops for all
nodes in the network. The result is shown in Table 4.7.
Destination    Cost    Next Hop
B              1       B
C              1       C
D              2       C
E              1       E
F              1       F
G              2       F
Table 4.7 Final routing table at node A.
In the absence of any topology changes, it only takes a few exchanges of information between neighbors before each node has a complete routing table. The process of getting consistent routing information to all the nodes is called convergence.
Table 4.8 shows the final set of costs from each node to all other nodes when routing
has converged. We must stress that there is no one node in the network that has all
the information in this table—each node only knows about the contents of its own
routing table. The beauty of a distributed algorithm like this is that it enables all
nodes to achieve a consistent view of the network in the absence of any centralized authority.
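The exchange of distance vectors can be simulated in a few dozen lines. The sketch below is an illustration under stated assumptions, not a protocol implementation. The link list is our reconstruction of the example network from the walk-through (A-B, A-C, A-E, A-F, B-C, C-D, D-G, F-G). Updates are exchanged in synchronous rounds, whereas real nodes send them independently. Infinity is represented by 16. The global arrays stand in for state that in practice is spread across the nodes.

```c
#include <assert.h>

#define N   7     /* nodes A..G */
#define INF 16    /* stand-in for an infinite cost (assumed value) */

enum { A, B, C, D, E, F, G };

/* Assumed topology, inferred from the walk-through in the text. */
static const int links[][2] = {
    {A, B}, {A, C}, {A, E}, {A, F}, {B, C}, {C, D}, {D, G}, {F, G}
};
#define NUM_LINKS ((int)(sizeof links / sizeof links[0]))

int dist[N][N];      /* dist[x][y]: x's current cost to reach y */
int nextHop[N][N];   /* nextHop[x][y]: neighbor x uses to reach y, or -1 */

/* Each node starts out knowing only itself and its direct neighbors. */
void dvInit(void)
{
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < N; ++y) {
            dist[x][y] = (x == y) ? 0 : INF;
            nextHop[x][y] = (x == y) ? x : -1;
        }
    for (int k = 0; k < NUM_LINKS; ++k) {
        int a = links[k][0], b = links[k][1];
        dist[a][b] = dist[b][a] = 1;
        nextHop[a][b] = b;
        nextHop[b][a] = a;
    }
}

/* Node `to` merges the vector advertised by neighbor `from`.
 * Returns the number of entries that improved. */
static int merge(int to, const int *vec, int from)
{
    int changed = 0;
    for (int y = 0; y < N; ++y) {
        if (vec[y] + 1 < dist[to][y]) {
            dist[to][y] = vec[y] + 1;
            nextHop[to][y] = from;
            ++changed;
        }
    }
    return changed;
}

/* One round: every node sends a snapshot of its vector to each neighbor. */
int dvRound(void)
{
    int snapshot[N][N], changed = 0;
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < N; ++y)
            snapshot[x][y] = dist[x][y];
    for (int k = 0; k < NUM_LINKS; ++k) {
        int a = links[k][0], b = links[k][1];
        changed += merge(a, snapshot[b], b);
        changed += merge(b, snapshot[a], a);
    }
    return changed;
}

/* Repeat rounds until no table changes; returns the number of rounds
 * in which at least one entry changed. */
int dvConverge(void)
{
    int rounds = 0;
    while (dvRound() > 0) {
        ++rounds;
        assert(rounds <= N);   /* a shortest path has at most N - 1 hops */
    }
    return rounds;
}
```

With this topology, calling dvInit and then dvConverge settles after two rounds of changes, and node A ends up reaching G at cost 2 via F and D at cost 2 via C, in agreement with the walk-through above.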
Table 4.8 Final distances stored at each node (global view).
There are a few details to fill in before our discussion of distance-vector routing is
complete. First we note that there are two different circumstances under which a given
node decides to send a routing update to its neighbors. One of these circumstances is the
periodic update. In this case, each node automatically sends an update message every
so often, even if nothing has changed. This serves to let the other nodes know that this
node is still running. It also makes sure that they keep getting information that they may
need if their current routes become unviable. The frequency of these periodic updates
varies from protocol to protocol, but it is typically on the order of several seconds to
several minutes. The second mechanism, sometimes called a triggered update, happens
whenever a node receives an update from one of its neighbors that causes it to change
one of the routes in its routing table. That is, whenever a node’s routing table changes,
it sends an update to its neighbors, which may lead to a change in their tables, causing
them to send an update to their neighbors.
Now consider what happens when a link or node fails. The nodes that notice first
send new lists of distances to their neighbors, and normally the system settles down
fairly quickly to a new state. As to the question of how a node detects a failure, there
are a couple of different answers. In one approach, a node continually tests the link to
another node by sending a control packet and seeing if it receives an acknowledgment.
In another approach, a node determines that the link (or the node at the other end of
the link) is down if it does not receive the expected periodic routing update for the last
few update cycles.
To understand what happens when a node detects a link failure, consider what
happens when F detects that its link to G has failed. First, F sets its new distance to
G to infinity and passes that information along to A. Since A knows that its 2-hop
path to G is through F, A would also set its distance to G to infinity. However, with
the next update from C, A would learn that C has a 2-hop path to G. Thus A would
know that it could reach G in 3 hops through C, which is less than infinity, and so A
would update its table accordingly. When it advertises this to F, node F would learn
that it can reach G at a cost of 4 through A, which is less than infinity, and the system
would again become stable.
Unfortunately, slightly different circumstances can prevent the network from
stabilizing. Suppose, for example, that the link from A to E goes down. In the next
round of updates, A advertises a distance of infinity to E, but B and C advertise a
distance of 2 to E. Depending on the exact timing of events, the following might
happen: Node B, upon hearing that E can be reached in 2 hops from C, concludes that
it can reach E in 3 hops and advertises this to A; node A concludes that it can reach
E in 4 hops and advertises this to C; node C concludes that it can reach E in 5 hops;
and so on. This cycle stops only when the distances reach some number that is large
enough to be considered infinite. In the meantime, none of the nodes actually knows
that E is unreachable, and the routing tables for the network do not stabilize. This
situation is known as the count-to-infinity problem.
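The unlucky timing described above can be replayed in code. This sketch tracks only the route to E at nodes A, B, and C, uses 16 as infinity, and applies the usual distance-vector rule: accept an advertisement if it is better, or if it comes from the current next hop. The message order (a stale advertisement from C reaching B, then B to A, A to C, C to B, repeating) is one possible interleaving chosen to match the text, and all names are ours.

```c
#include <assert.h>

#define INF 16                    /* value treated as infinity (assumed) */

enum { A, B, C, NONE = -1 };

typedef struct { int cost; int nextHop; } RouteToE;

/* Node `to` hears from neighbor `from` that E is `advertised` hops away. */
static void deliver(RouteToE r[], int from, int to, int advertised)
{
    int viaFrom = (advertised + 1 < INF) ? advertised + 1 : INF;
    if (viaFrom < r[to].cost || r[to].nextHop == from) {
        r[to].cost = viaFrom;
        r[to].nextHop = from;
    }
}

/* Replays the scenario and returns how many advertisements circulate
 * before A, B, and C all consider E unreachable. trace[k] records the
 * cost adopted by the receiver of the k-th advertisement. */
int countToInfinity(int trace[], int maxTrace)
{
    /* State just after A notices that its link to E is down. */
    RouteToE r[3] = { { INF, NONE }, { 2, A }, { 2, A } };
    static const int order[3][2] = { { B, A }, { A, C }, { C, B } };
    int n = 0;

    assert(maxTrace >= 0);
    deliver(r, A, B, r[A].cost);  /* A tells B that E is unreachable */
    deliver(r, C, B, 2);          /* C's stale "E in 2 hops" reaches B */

    while (r[A].cost < INF || r[B].cost < INF || r[C].cost < INF) {
        int from = order[n % 3][0], to = order[n % 3][1];
        deliver(r, from, to, r[from].cost);
        if (n < maxTrace)
            trace[n] = r[to].cost;
        ++n;
    }
    return n;
}
```

In this replay B first settles on 3 hops, then the trace climbs 4, 5, 6, and so on, and it takes 15 further advertisements before all three nodes agree that E is unreachable.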
There are several partial solutions to this problem. The first one is to use some
relatively small number as an approximation of infinity. For example, we might decide
that the maximum number of hops to get across a certain network is never going to be
more than 16, and so we could pick 16 as the value that represents infinity. This at least
bounds the amount of time that it takes to count to infinity. Of course, it could also
present a problem if our network grew to a point where some nodes were separated
by more than 16 hops.
One technique to improve the time to stabilize routing is called split horizon.
The idea is that when a node sends a routing update to its neighbors, it does not send
those routes it learned from each neighbor back to that neighbor. For example, if B
has the route (E, 2, A) in its table, then it knows it must have learned this route from A,
and so whenever B sends a routing update to A, it does not include the route (E, 2) in
that update. In a stronger variation of split horizon, called split horizon with poison
reverse, B actually sends that route back to A, but it puts negative information in the
route to ensure that A will not eventually use B to get to E. For example, B sends
the route (E, ∞) to A. The problem with both of these techniques is that they only
work for routing loops that involve two nodes. For larger routing loops, more drastic measures are called for. Continuing the above example, if B and C had waited
for a while after hearing of the link failure from A before advertising routes to
E, they would have found that neither of them really had a route to E. Unfortunately, this approach delays the convergence of the protocol; speed of convergence
is one of the key advantages of its competitor, link-state routing, the subject of
Section 4.2.3.
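The difference between the two variants shows up in how a node builds the update it sends to one particular neighbor. The following sketch is illustrative only: the type names, the Mode enumeration, and the use of small integers as node identifiers are our own assumptions, and 16 stands in for infinity.

```c
#include <assert.h>

#define INF 16   /* value advertised for "unreachable" (assumed) */

typedef enum { PLAIN, SPLIT_HORIZON, POISON_REVERSE } Mode;

typedef struct { int dest; int cost; int nextHop; } DvRoute;  /* table row */
typedef struct { int dest; int cost; } Advert;                /* update row */

/* Build the update to send to `neighbor`; `out` must have room for
 * numRoutes entries. Returns the number of entries written. */
int buildUpdate(const DvRoute *table, int numRoutes, int neighbor,
                Mode mode, Advert *out)
{
    int n = 0;
    assert(numRoutes >= 0);
    for (int i = 0; i < numRoutes; ++i) {
        int learnedFromNeighbor = (table[i].nextHop == neighbor);
        if (learnedFromNeighbor && mode == SPLIT_HORIZON)
            continue;                 /* leave the route out entirely */
        out[n].dest = table[i].dest;
        out[n].cost = (learnedFromNeighbor && mode == POISON_REVERSE)
                          ? INF       /* send it back as unreachable */
                          : table[i].cost;
        ++n;
    }
    return n;
}
```

For B's table entry (E, 2, A), the update sent to A omits E under split horizon and carries (E, 16) under poison reverse, while the update sent to any other neighbor carries (E, 2) in both cases.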
The code that implements this algorithm is very straightforward; we give only some of
the basics here. Structure Route defines each entry in the routing table, and constant
MAX_TTL specifies how long an entry is kept in the table before it is discarded.
#define MAX_ROUTES 128  /* maximum size of routing table */
#define MAX_TTL    120  /* time (in seconds) until route expires; 120 is an assumed value */
typedef struct {
    NodeAddr Destination;  /* address of destination */
    NodeAddr NextHop;      /* address of next hop */
    int      Cost;         /* distance metric */
    u_short  TTL;          /* time to live */
} Route;
int   numRoutes = 0;
Route routingTable[MAX_ROUTES];
The routine that updates the local node’s routing table based on a new route is
given by mergeRoute. Although not shown, a timer function periodically scans the list
of routes in the node’s routing table, decrements the TTL (time to live) field of each
route, and discards any routes that have a time to live of 0. Notice, however, that the
TTL field is reset to MAX_TTL any time the route is reconfirmed by an update message
from a neighboring node.
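The timer function itself is not shown in the text, but one plausible shape for it is sketched below. Treat it as an assumption-laden illustration: the name ageRoutes, the choice to compact the table in place, the one-tick-per-second timing, and the definition of NodeAddr as an unsigned integer are all ours.

```c
#include <assert.h>

#define MAX_ROUTES 128

typedef unsigned int NodeAddr;     /* assumed representation */

typedef struct {
    NodeAddr       Destination;    /* address of destination */
    NodeAddr       NextHop;        /* address of next hop */
    int            Cost;           /* distance metric */
    unsigned short TTL;            /* time to live */
} Route;

int   numRoutes = 0;
Route routingTable[MAX_ROUTES];

/* Called once per timer tick (assumed to be one second): decrement each
 * route's TTL and drop routes whose TTL has reached 0, compacting the
 * table so that the live routes stay contiguous. */
void ageRoutes(void)
{
    int kept = 0;
    assert(numRoutes >= 0 && numRoutes <= MAX_ROUTES);
    for (int i = 0; i < numRoutes; ++i) {
        if (routingTable[i].TTL > 0)
            --routingTable[i].TTL;
        if (routingTable[i].TTL > 0)
            routingTable[kept++] = routingTable[i];
    }
    numRoutes = kept;
}
```

Compacting in place keeps the table scannable with a simple loop over the first numRoutes entries, which is what a table-merging routine that searches linearly by destination expects.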
void
mergeRoute (Route *new)
{
    int i;
    for (i = 0; i < numRoutes; ++i)
    {
        if (new->Destination == routingTable[i].Destination)
        {
            /* compare metric for this route */
            if (new->Cost + 1 < routingTable[i].Cost)
            {
                /* found a better route: */
                break;
            } else if (new->NextHop == routingTable[i].NextHop) {
                /* metric for current next hop may have changed: */
                break;
            } else {
                /* route is uninteresting---just ignore it */
                return;
            }
        }
    }
    if (i == numRoutes)
    {
        /* this is a completely new route; is there room for it? */
        if (numRoutes < MAX_ROUTES)
        {
            ++numRoutes;
        } else {
            /* can't fit this route in table so give up */
            return;
        }
    }
    routingTable[i] = *new;
    /* reset TTL */
    routingTable[i].TTL = MAX_TTL;
    /* account for hop to get to next node */
    ++routingTable[i].Cost;
}
Finally, the procedure updateRoutingTable is the main routine that calls mergeRoute to incorporate all the routes contained in a routing update that is received from
a neighboring node.
void updateRoutingTable (Route *newRoute, int numNewRoutes)
{
    int i;
    for (i = 0; i < numNewRoutes; ++i)
        mergeRoute(&newRoute[i]);
}
Routing Information Protocol (RIP)
One of the most widely used routing protocols in IP networks is the Routing Information Protocol (RIP). Its widespread use is due in no small part to the fact that it was
distributed along with the popular Berkeley Software Distribution (BSD) version of
Unix, from which many commercial versions of Unix were derived. It is also extremely
simple. RIP is the canonical example of a routing protocol built on the distance-vector
algorithm just described.
Routing protocols in internetworks differ very slightly from the idealized graph
model described above. In an internetwork, the goal of the routers is to learn how to
forward packets to various networks. Thus, rather than advertising the cost of reaching
other routers, the routers advertise the cost of reaching networks. For example, in
Figure 4.16, router C would advertise to router A the fact that it can reach networks
2 and 3 (to which it is directly connected) at a cost of 0; networks 5 and 6 at cost 1;
and network 4 at cost 2.
Figure 4.16 Example network running RIP.
Figure 4.17 RIP packet format (fields shown per advertised network: family, address, distance).
We can see evidence of this in the RIP packet format in Figure 4.17. The majority
of the packet is taken up with (network-address, distance) pairs. However, the principles of the routing algorithm are just the same. For example, if router A learns from
router B that network X can be reached at a lower cost via B than via the existing next
hop in the routing table, A updates the cost and next hop information for the network
number accordingly.
RIP is in fact a fairly straightforward implementation of distance-vector routing.
Routers running RIP send their advertisements every 30 seconds; a router also sends
an update message whenever an update from another router causes it to change its
routing table. One point of interest is that it supports multiple address families, not just
IP. The network-address part of the advertisements is actually represented as a (family,
address) pair. RIP version 2 (RIPv2) also has some features related to scalability that
we will discuss in the next section.
As we will see below, it is possible to use a range of different metrics or costs
for the links in a routing protocol. RIP takes the simplest approach, with all link
costs being equal to 1, just as in our example above. Thus it always tries to find the
minimum hop route. Valid distances are 1 through 15, with 16 representing infinity.
This also limits RIP to running on fairly small networks—those with no paths longer
than 15 hops.
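The arithmetic a RIP router performs on a received distance is small enough to show in full. This is a minimal sketch with names of our own choosing: one hop is added to the advertised distance and the result is capped at 16, so that "unreachable" stays unreachable however far it propagates.

```c
#include <assert.h>

#define RIP_INFINITY 16   /* distance that means "unreachable" */

/* Distance to a network via a neighbor that advertised `advertised`. */
int ripMetricVia(int advertised)
{
    assert(advertised >= 0 && advertised <= RIP_INFINITY);
    int metric = advertised + 1;           /* every link costs 1 */
    return (metric < RIP_INFINITY) ? metric : RIP_INFINITY;
}

/* A distance below 16 denotes a usable route. */
int ripReachable(int metric)
{
    return metric < RIP_INFINITY;
}
```

A network advertised at distance 14 is still usable at distance 15, but one advertised at 15 becomes 16 at the receiver and is treated as unreachable, which is the 15-hop limit mentioned above.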
Link State (OSPF)
Link-state routing is the second major class of intradomain routing protocol. The
starting assumptions for link-state routing are rather similar to those for distance-vector routing. Each node is assumed to be capable of finding out the state of the link
to its neighbors (up or down) and the cost of each link. Again, we want to provide each
node with enough information to enable it to find the least-cost path to any destination.
The basic idea behind link-state protocols is very simple: Every node knows how to
reach its directly connected neighbors, and if we make sure that the totality of this
knowledge is disseminated to every node, then every node will have enough knowledge
of the network to build a complete map of the network. This is clearly a sufficient
condition (although not a necessary one) for finding the shortest path to any point
in the network. Thus, link-state routing protocols rely on two mechanisms: reliable
dissemination of link-state information, and the calculation of routes from the sum of
all the accumulated link-state knowledge.
Reliable Flooding
Reliable flooding is the process of making sure that all the nodes participating in the
routing protocol get a copy of the link-state information from all the other nodes.
As the term “flooding” suggests, the basic idea is for a node to send its link-state
information out on all of its directly connected links, with each node that receives
this information forwarding it out on all of its links. This process continues until the
information has reached all the nodes in the network.
More precisely, each node creates an update packet, also called a link-state packet
(LSP), that contains the following information:
■ the ID of the node that created the LSP
■ a list of directly connected neighbors of that node, with the cost of the link to
each one
■ a sequence number
■ a time to live for this packet
The first two items are needed to enable route calculation; the last two are used to make
the process of flooding the packet to all nodes reliable. Reliability includes making sure
that you have the most recent copy of the information, since there may be multiple,
contradictory LSPs from one node traversing the network. Making the flooding reliable
has proven to be quite difficult. (For example, an early version of link-state routing
used in the ARPANET caused that network to fail in 1981.)
Flooding works in the following way. First, the transmission of LSPs between
adjacent routers is made reliable using acknowledgments and retransmissions just as
in the reliable link-layer protocol described in Section 2.5. However, there are several
more steps needed to reliably flood an LSP to all nodes in a network.
Consider a node X that receives a copy of an LSP that originated at some other
node Y. Note that Y may be any other router in the same routing domain as X. X
checks to see if it has already stored a copy of an LSP from Y. If not, it stores the LSP.
If it already has a copy, it compares the sequence numbers; if the new LSP has a larger
sequence number, it is assumed to be the more recent, and that LSP is stored, replacing
the old one. A smaller (or equal) sequence number would imply an LSP older (or not
newer) than the one stored, so it would be discarded and no further action would be
needed. If the received LSP was the newer one, X then sends a copy of that LSP to all
of its neighbors except the neighbor from which the LSP was just received. The fact
that the LSP is not sent back to the node from which it was received helps to bring
an end to the flooding of an LSP. Since X passes the LSP on to all its neighbors, who
then turn around and do the same thing, the most recent copy of the LSP eventually
reaches all nodes.
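The decision a node makes on receiving an LSP can be sketched as follows. This is a simplified illustration rather than the logic of any real protocol: the type names and table sizes are hypothetical, the neighbor list and time to live carried in the LSP are left out, sequence-number wraparound is ignored, and the per-link acknowledgments and retransmissions are assumed to happen elsewhere.

```c
#include <assert.h>

#define MAX_NODES 8   /* illustrative limits (assumed) */
#define MAX_NBRS  4

typedef struct {
    int      origin;  /* ID of the node that created the LSP */
    unsigned seq;     /* sequence number */
} LSP;

typedef struct {
    int haveLsp[MAX_NODES];   /* 1 if an LSP from that origin is stored */
    LSP stored[MAX_NODES];    /* most recent LSP seen from each origin */
    int nbrs[MAX_NBRS];       /* IDs of directly connected neighbors */
    int numNbrs;
} Node;

/* Process an LSP that arrived from neighbor `fromNbr`. If it is newer
 * than the stored copy, store it and list in forwardTo[] the neighbors
 * that should receive a copy. Returns the number of such neighbors
 * (0 when the LSP is a duplicate or is older than the stored copy). */
int receiveLsp(Node *x, const LSP *lsp, int fromNbr, int forwardTo[])
{
    assert(lsp->origin >= 0 && lsp->origin < MAX_NODES);
    if (x->haveLsp[lsp->origin] && lsp->seq <= x->stored[lsp->origin].seq)
        return 0;                          /* not newer: discard */
    x->stored[lsp->origin] = *lsp;
    x->haveLsp[lsp->origin] = 1;
    int n = 0;
    for (int i = 0; i < x->numNbrs; ++i)
        if (x->nbrs[i] != fromNbr)         /* never send it straight back */
            forwardTo[n++] = x->nbrs[i];
    return n;
}
```

A second copy of the same LSP arriving from a different neighbor returns 0 and is not forwarded again, which is what eventually brings the flooding to an end.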
Figure 4.18 shows an LSP being flooded in a small network. Each node becomes
shaded as it stores the new LSP. In Figure 4.18(a) the LSP arrives at node X, which
sends it to neighbors A and C in Figure 4.18(b). A and C do not send it back to X, but
send it on to B. Since B receives two identical copies of the LSP, it will accept whichever
arrived first and ignore the second as a duplicate. It then passes the LSP on to D, who
has no neighbors to flood it to, and the process is complete.
Figure 4.18 Flooding of link-state packets. (a) LSP arrives at node X; (b) X floods LSP
to A and C; (c) A and C flood LSP to B (but not X); (d) flooding is complete.
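The flooding rule can be sketched in a few lines of Python. This is only an illustration, not the ARPANET or OSPF implementation: the topology is read off the description of Figure 4.18, the originating node Y and the names flood and database are hypothetical, and the link-level acknowledgments and retransmissions are assumed to happen underneath.

```python
from collections import deque

# Topology as described for Figure 4.18 (an assumption where the figure is unavailable).
NEIGHBORS = {
    "X": ["A", "C"],
    "A": ["X", "B"],
    "C": ["X", "B"],
    "B": ["A", "C", "D"],
    "D": ["B"],
}

def flood(origin, seq, first_receiver, database):
    """Flood the LSP (origin, seq), starting at first_receiver.

    database maps node -> {origin: highest sequence number stored}.
    Returns the list of (sender, receiver) transmissions that were made.
    """
    sent = []
    queue = deque([(None, first_receiver)])  # (neighbor it came from, receiving node)
    while queue:
        sender, node = queue.popleft()
        stored = database[node].get(origin)
        if stored is not None and seq <= stored:
            continue  # older or duplicate copy: discard, no further action
        database[node][origin] = seq  # store the newer LSP
        for nbr in NEIGHBORS[node]:
            if nbr != sender:  # never send it back where it came from
                sent.append((node, nbr))
                queue.append((node, nbr))
    return sent

database = {node: {} for node in NEIGHBORS}
transmissions = flood("Y", 1, "X", database)
```

Flooding the same LSP a second time produces no transmissions at all, because X already holds sequence number 1 and discards the copy; this is what brings the flood to an end.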
Just as in RIP, each node generates LSPs under two circumstances. Either the
expiry of a periodic timer or a change in topology can cause a node to generate a new
LSP. However, the only topology-based reason for a node to generate an LSP is if one
of its directly connected links or immediate neighbors has gone down. The failure of a
link can be detected in some cases by the link-layer protocol. The demise of a neighbor
or loss of connectivity to that neighbor can be detected using periodic “hello” packets.
Each node sends these to its immediate neighbors at defined intervals. If a sufficiently
long time passes without receipt of a “hello” from a neighbor, the link to that neighbor
will be declared down, and a new LSP will be generated to reflect this fact.
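A minimal sketch of the hello-based check follows. The class name, the method names, and the 40-second dead interval are all hypothetical; real protocols define their own timer values.

```python
class NeighborTable:
    """Tracks when each neighbor was last heard from (times in seconds)."""

    def __init__(self, dead_interval):
        self.dead_interval = dead_interval
        self.last_hello = {}

    def hello_received(self, neighbor, now):
        self.last_hello[neighbor] = now

    def expire(self, now):
        """Return the neighbors whose links should be declared down, and forget them."""
        down = sorted(n for n, t in self.last_hello.items()
                      if now - t > self.dead_interval)
        for n in down:
            del self.last_hello[n]
        return down

table = NeighborTable(dead_interval=40.0)  # illustrative value only
table.hello_received("A", now=0.0)
table.hello_received("B", now=0.0)
table.hello_received("A", now=30.0)
down = table.expire(now=45.0)  # B has been silent for 45 seconds
```

A nonempty result from expire is the event that would cause the node to generate a new LSP.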
One of the important design goals of a link-state protocol’s flooding mechanism
is that the newest information must be flooded to all nodes as quickly as possible, while
old information must be removed from the network and not allowed to circulate. In
addition, it is clearly desirable to minimize the total amount of routing traffic that is
sent around the network; after all, this is just “overhead” from the perspective of those
who actually use the network for their applications. The next few paragraphs describe
some of the ways that these goals are accomplished.
One easy way to reduce overhead is to avoid generating LSPs unless absolutely
necessary. This can be done by using very long timers—often on the order of hours—for
the periodic generation of LSPs. Given that the flooding protocol is truly reliable when
topology changes, it is safe to assume that messages saying “nothing has changed” do
not need to be sent very often.
To make sure that old information is replaced by newer information, LSPs carry
sequence numbers. Each time a node generates a new LSP, it increments the sequence
number by 1. Unlike most sequence numbers used in protocols, these sequence numbers
are not expected to wrap, so the field needs to be quite large (say, 64 bits). If a node
goes down and then comes back up, it starts with a sequence number of 0. If the
node was down for a long time, all the old LSPs for that node will have timed out
(as described below); otherwise, this node will eventually receive a copy of its own
LSP with a higher sequence number, which it can then increment and use as its own
sequence number. This will ensure that its new LSP replaces any of its old LSPs left
over from before the node went down.
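The restart case can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that the first LSP generated after a restart carries sequence number 1 (the counter starts at 0 and is incremented before use).

```python
class LspOriginator:
    def __init__(self, node_id):
        self.node_id = node_id
        self.seq = 0  # a restarted node begins again at 0

    def next_lsp(self, links):
        """Generate a new LSP, incrementing the sequence number by 1."""
        self.seq += 1
        return {"origin": self.node_id, "seq": self.seq, "links": dict(links)}

    def saw_own_lsp(self, lsp):
        """Handle a flooded LSP that names this node as its origin.

        Returns True if a fresh LSP must be issued to supersede a stale copy.
        """
        if lsp["origin"] == self.node_id and lsp["seq"] > self.seq:
            self.seq = lsp["seq"]  # next_lsp() will then use seq + 1
            return True
        return False

node = LspOriginator("A")
first = node.next_lsp({"B": 5})
stale = {"origin": "A", "seq": 41, "links": {"B": 5, "C": 10}}  # left over from before the restart
must_reissue = node.saw_own_lsp(stale)
replacement = node.next_lsp({"B": 5})
```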
LSPs also carry a time to live. This is used to ensure that old link-state information
is eventually removed from the network. A node always decrements the TTL of a newly
received LSP before flooding it to its neighbors. It also “ages” the LSP while it is stored
in the node. When the TTL reaches 0, the node refloods the LSP with a TTL of 0, which
is interpreted by all the nodes in the network as a signal to delete that LSP.
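The two uses of the TTL might look like this in outline. The function names and the dictionary representation of an LSP are assumptions made for illustration; the unit of aging is left abstract.

```python
def on_receive(lsp):
    """Decrement the TTL of a newly received LSP before it is flooded onward."""
    aged = dict(lsp)
    aged["ttl"] = max(0, lsp["ttl"] - 1)
    return aged

def age_database(database, elapsed):
    """Age every stored LSP by `elapsed` ticks.

    Returns the LSPs whose TTL reached 0. They are removed locally and should be
    reflooded with a TTL of 0 so that every other node deletes them as well.
    """
    expired = []
    for origin in list(database):
        lsp = database[origin]
        lsp["ttl"] = max(0, lsp["ttl"] - elapsed)
        if lsp["ttl"] == 0:
            expired.append(database.pop(origin))
    return expired

database = {
    "A": {"origin": "A", "seq": 7, "ttl": 3},
    "B": {"origin": "B", "seq": 2, "ttl": 10},
}
expired = age_database(database, elapsed=3)
```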
Route Calculation
Once a given node has a copy of the LSP from every other node, it is able to compute
a complete map for the topology of the network, and from this map it is able to decide
the best route to each destination. The question, then, is exactly how it calculates
routes from this information. The solution is based on a well-known algorithm from
graph theory—Dijkstra’s shortest-path algorithm.
We first define Dijkstra’s algorithm in graph-theoretic terms. Imagine that a node
takes all the LSPs it has received and constructs a graphical representation of the network, in which N denotes the set of nodes in the graph, l(i, j) denotes the nonnegative
cost (weight) associated with the edge between nodes i, j ∈ N, and l(i, j) = ∞ if no
edge connects i and j. In the following description, we let s ∈ N denote this node, that
is, the node executing the algorithm to find the shortest path to all the other nodes
in N. Also, the algorithm maintains the following two variables: M denotes the set
of nodes incorporated so far by the algorithm, and C(n) denotes the cost of the path
from s to each node n. Given these definitions, the algorithm is defined as follows:
M = {s}
for each n in N − {s}
    C(n) = l(s, n)
while (N ≠ M)
    M = M ∪ {w} such that C(w) is the minimum for all w in (N − M)
    for each n in (N − M)
        C(n) = MIN(C(n), C(w) + l(w, n))
Basically, the algorithm works as follows. We start with M containing this node s and
then initialize the table of costs (the C(n)s) to other nodes using the known costs to
directly connected nodes. We then look for the node that is reachable at the lowest
cost (w) and add it to M. Finally, we update the table of costs by considering the cost
of reaching nodes through w. In the last line of the algorithm, we choose a new route
to node n that goes through node w if the total cost of going from the source to w
and then following the link from w to n is less than the old route we had to n. This
procedure is repeated until all nodes are incorporated in M.
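The set-based definition translates almost line for line into Python. In this sketch the variable names follow the definition above; the edge costs in EDGES are an assumed example graph, and the entry C[s] = 0 is added only for convenience.

```python
import math

# Assumed example graph: undirected edges with nonnegative costs.
EDGES = {("A", "B"): 5, ("A", "C"): 10, ("B", "C"): 3, ("B", "D"): 11, ("C", "D"): 2}

def l(i, j):
    """Cost of the edge between i and j, or infinity if no edge connects them."""
    return EDGES.get((i, j), EDGES.get((j, i), math.inf))

def dijkstra(N, l, s):
    """Return C, where C[n] is the least cost of a path from s to n."""
    M = {s}
    C = {n: l(s, n) for n in N if n != s}
    C[s] = 0  # convenience entry, not part of the definition above
    while M != set(N):
        # Add the node w outside M with the minimum C(w).
        w = min((n for n in N if n not in M), key=lambda n: C[n])
        M.add(w)
        # Consider reaching each remaining node through w.
        for n in N:
            if n not in M:
                C[n] = min(C[n], C[w] + l(w, n))
    return C

costs = dijkstra({"A", "B", "C", "D"}, l, "D")
```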
In practice, each switch computes its routing table directly from the LSPs it has
collected using a realization of Dijkstra’s algorithm called the forward search algorithm. Specifically, each switch maintains two lists, known as Tentative and Confirmed.
Each of these lists contains a set of entries of the form (Destination, Cost, NextHop).
The algorithm works as follows:
1 Initialize the Confirmed list with an entry for myself; this entry has a cost of 0.
2 For the node just added to the Confirmed list in the previous step, call it node
Next, select its LSP.
3 For each neighbor (Neighbor) of Next, calculate the cost (Cost) to reach this
Neighbor as the sum of the cost from myself to Next and from Next to Neighbor.
(a) If Neighbor is currently on neither the Confirmed nor the Tentative list, then
add (Neighbor, Cost, NextHop) to the Tentative list, where NextHop is the
direction I go to reach Next.
(b) If Neighbor is currently on the Tentative list, and the Cost is less than the currently listed cost for Neighbor, then replace the current entry with (Neighbor,
Cost, NextHop), where NextHop is the direction I go to reach Next.
4 If the Tentative list is empty, stop. Otherwise, pick the entry from the Tentative
list with the lowest cost, move it to the Confirmed list, and return to step 2.
This will become a lot easier to understand when we look at an example. Consider
the network depicted in Figure 4.19. Note that, unlike our previous example, this
network has a range of different edge costs. Table 4.9 traces the steps for building the
routing table for node D. We denote the two outputs of D by using the names of the
nodes to which they connect, B and C. Note the way the algorithm seems to head off
on false leads (like the 11-unit cost path to B that was the first addition to the Tentative
list) but ends up with the least-cost paths to all nodes.
Figure 4.19 Link-state routing: an example network.
Table 4.9 Steps for building routing table for node D (Figure 4.19).
Step  Comments
1     Since D is the only new member of the confirmed list, look at its LSP.
2     D’s LSP says we can reach B through B at cost 11, which is better than
      anything else on either list, so put it on Tentative list; same for C.
3     Put lowest-cost member of Tentative (C) onto Confirmed list. Next,
      examine LSP of newly confirmed member (C).
4     Cost to reach B through C is 5, so replace (B,11,B). C’s LSP tells us
      that we can reach A at cost 12.
5     Move lowest-cost member of Tentative (B) to Confirmed, then look at its LSP.
6     Since we can reach A at cost 5 through B, replace the Tentative entry.
7     Move lowest-cost member of Tentative (A) to Confirmed, and we
      are all done.
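The forward search steps can be sketched as follows. The function name and the dictionary representation of the two lists are hypothetical. The link costs are assumptions chosen to agree with the costs quoted in the Table 4.9 comments (B directly at 11, B through C at 5, A through C at 12, A through B at a further 5); in particular, the cost of 2 for the link between D and C is assumed rather than stated in the text.

```python
def forward_search(me, lsps):
    """Build the routing table for node `me`.

    lsps maps node -> {neighbor: link cost}, the contents of each node's LSP.
    Returns the Confirmed list as {destination: (cost, next_hop)}.
    """
    confirmed = {me: (0, None)}  # step 1
    tentative = {}
    nxt = me
    while True:
        cost_to_next, hop_to_next = confirmed[nxt]
        for neighbor, link_cost in lsps[nxt].items():  # steps 2 and 3
            if neighbor in confirmed:
                continue
            cost = cost_to_next + link_cost
            # NextHop is the direction I go to reach Next; for my own
            # neighbors that is the neighbor itself.
            next_hop = neighbor if nxt == me else hop_to_next
            if neighbor not in tentative or cost < tentative[neighbor][0]:
                tentative[neighbor] = (cost, next_hop)  # step 3(a) or 3(b)
        if not tentative:  # step 4
            return confirmed
        nxt = min(tentative, key=lambda dest: tentative[dest][0])
        confirmed[nxt] = tentative.pop(nxt)

# Assumed link costs, consistent with the trace in Table 4.9.
lsps = {
    "A": {"B": 5, "C": 10},
    "B": {"A": 5, "C": 3, "D": 11},
    "C": {"A": 10, "B": 3, "D": 2},
    "D": {"B": 11, "C": 2},
}
table = forward_search("D", lsps)
```

With these costs every destination ends up with C as its next hop, even though B is a directly connected neighbor of D.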
The link-state routing algorithm has many nice properties: It has been proven to
stabilize quickly, it does not generate much traffic, and it responds rapidly to topology
changes or node failures. On the downside, the amount of information stored at each
node (one LSP for every other node in the network) can be quite large. This is one
of the fundamental problems of routing and is an instance of the more general problem of scalability. Some solutions to both the specific problem (the amount of storage
potentially required at each node) and the general problem (scalability) will be discussed in the next section.
Thus, the difference between the distance-vector and link-state algorithms can be
summarized as follows. In distance vector, each node talks only to its directly connected
neighbors, but it tells them everything it has learned (i.e., distance to all nodes). In link
state, each node talks to all other nodes, but it tells them only what it knows for sure
(i.e., only the state of its directly connected links).
The Open Shortest Path First Protocol (OSPF)
One of the most widely used link-state routing protocols is OSPF. The first word,
“Open,” refers to the fact that it is an open, nonproprietary standard, created under
the auspices of the IETF. The “SPF” part comes from an alternative name for link-state
routing. OSPF adds quite a number of features to the basic link-state algorithm
described above, including the following:
■ Authentication of routing messages: This is a nice feature, since it is all too
common for some misconfigured host to decide that it can reach every host
in the universe at a cost of 0. When the host advertises this fact, every router
in the surrounding neighborhood updates its forwarding tables to point to
that host, and said host receives a vast amount of data that, in reality, it has
no idea what to do with. It typically drops it all, bringing the network to a
halt. Such disasters can be averted in many cases by requiring routing updates
to be authenticated. Early versions of OSPF used a simple 8-byte password
for authentication. This is not a strong enough form of authentication to
stop dedicated malicious users, but it alleviates many problems caused
by misconfiguration. (A similar form of authentication was added to RIP
in version 2.) Strong cryptographic authentication of the sort discussed in
Section 8.2.1 was later added.
■ Additional hierarchy: Hierarchy is one of the fundamental tools used to make
systems more scalable. OSPF introduces another layer of hierarchy into routing
by allowing a domain to be partitioned into areas. This means that a
router within a domain does not necessarily need to know how to reach every
network within that domain; it may be sufficient for it to know only how to
get to the right area. Thus, there is a reduction in the amount of information
that must be transmitted to and stored in each node. We examine areas in
detail in Section 4.3.4.
■ Load balancing: OSPF allows multiple routes to the same place to be assigned
the same cost and will cause traffic to be distributed evenly over those routes.
Figure 4.20 OSPF header format. Fields shown include Message length and Authentication type.
There are several different types of OSPF messages, but all begin with the same
header, as shown in Figure 4.20. The Version field is currently set to 2, and the Type
field may take the values 1 through 5. The SourceAddr identifies the sender of the
message, and the AreaId is a 32-bit identifier of the area in which the node is located.
The entire packet, except the authentication data, is protected by a 16-bit checksum
using the same algorithm as the IP header (see Section 2.4). The Authentication type
is 0 if no authentication is used; otherwise it may be 1, implying a simple password is
used, or 2, which indicates that a cryptographic authentication checksum, of the sort
described in Section 8.2.1, is used. In the latter cases the Authentication field carries
the password or cryptographic checksum.
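As an illustration, the common header can be unpacked with Python's struct module. The text gives widths only for the AreaId (32 bits), the checksum (16 bits), and the password (8 bytes); the remaining widths and the field order below are assumptions based on the OSPF version 2 layout, and the dictionary keys are hypothetical names.

```python
import ipaddress
import struct

# Assumed layout: Version (1 byte), Type (1), Message length (2), SourceAddr (4),
# AreaId (4), Checksum (2), Authentication type (2), Authentication (8).
OSPF_HEADER = struct.Struct("!BBHIIHH8s")

def parse_ospf_header(data):
    """Decode the fixed-size header at the start of an OSPF message."""
    (version, msg_type, length, source, area, checksum,
     auth_type, auth) = OSPF_HEADER.unpack_from(data)
    return {
        "Version": version,
        "Type": msg_type,
        "MessageLength": length,
        "SourceAddr": str(ipaddress.IPv4Address(source)),
        "AreaId": area,
        "Checksum": checksum,
        "AuthenticationType": auth_type,
        "Authentication": auth,
    }

# A hello (type 1) header using simple password authentication (type 1).
raw = OSPF_HEADER.pack(2, 1, 44, int(ipaddress.IPv4Address("10.0.0.1")),
                       0, 0, 1, b"secret\x00\x00")
header = parse_ospf_header(raw)
```

The checksum is left at 0 here; computing it is outside the scope of this sketch.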
Of the five OSPF message types, type 1 is the “hello” message, which a router
sends to its peers to notify them that it is still alive and connected as described above.
The remaining types are used to request, send, and acknowledge the receipt of link-state
messages. The basic building block of link-state messages in OSPF is known as
the link-state advertisement (LSA). One message may contain many LSAs. We provide
a few details of the LSA here.
Like any internetwork routing protocol, OSPF must provide information about
how to reach networks. Thus, OSPF must provide a little more information than the
simple graph-based protocol described above. Specifically, a router running OSPF may
generate link-state packets that advertise one or more of the networks that are directly
connected to that router. In addition, a router that is connected to another router by
some link must advertise the cost of reaching that router over the link. These two types
of advertisements are necessary to enable all the routers in a domain to determine the
cost of reaching all networks in that domain and the appropriate next hop for each
network.
Figure 4.21 OSPF link-state advertisement. Fields shown: LS Age, Link-state ID, Advertising router, LS sequence number, LS checksum, Number of links, Link ID, Link data, Link type, Optional TOS information, More links.
Figure 4.21 shows the packet format for a “type 1” link-state advertisement.
Type 1 LSAs advertise the cost of links between routers. Type 2 LSAs are used to
advertise networks to which the advertising router is connected, while other types are
used to support additional hierarchy as described in the next section. Many fields in
the LSA should be familiar from the preceding discussion. The LS Age is the equivalent
of a time to live, except that it counts up and the LSA expires when the age reaches a
defined maximum value. The Type field tells us that this is a type 1 LSA.
In a type 1 LSA, the Link-state ID and the Advertising router field are identical.
Each carries a 32-bit identifier for the router that created this LSA. While a number of
assignment strategies may be used to assign this ID, it is essential that it be unique in
the routing domain and that a given router consistently uses the same router ID. One
way to pick a router ID that meets these requirements would be to pick the lowest IP
address among all the IP addresses assigned to that router. (Recall that a router may
have a different IP address on each of its interfaces.)
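The lowest-address rule is easy to get wrong if addresses are compared as text, so a short sketch may help. The function name is hypothetical; the comparison uses the numeric value of each address.

```python
import ipaddress

def pick_router_id(interface_addresses):
    """Return the numerically lowest IPv4 address among a router's interfaces."""
    return str(min(ipaddress.IPv4Address(a) for a in interface_addresses))

router_id = pick_router_id(["192.168.1.1", "10.0.0.5", "172.16.0.1"])
```

Comparing the dotted-decimal strings directly would rank 10.0.0.5 below 9.255.0.1, which is why the addresses are converted first.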
The LS sequence number is used exactly as described above, to detect old or
duplicate LSAs. The LS checksum is similar to others we have seen in Section 2.4 and
in other protocols; it is of course used to verify that data has not been corrupted. It
covers all fields in the packet except LS Age, so that it is not necessary to recompute
a checksum every time LS Age is incremented. Length is the length in bytes of the
complete LSA.
Now we get to the actual link-state information. This is made a little complicated
by the presence of TOS (type of service) information. Ignoring that for a moment, each
link in the LSA is represented by a Link ID, some Link Data, and a metric. The first two of
these fields identify the link; a common way to do this would be to use the router ID of
the router at the far end of the link as the Link ID, and then use the Link Data to disambiguate among multiple parallel links if necessary. The metric is of course the cost of the
link. Type tells us something about the link, for example, if it is a point-to-point link.
The TOS information is present to allow OSPF to choose different routes for IP
packets based on the value in their TOS field. Instead of assigning a single metric to a
link, it is possible to assign different metrics depending on the TOS value of the data.
For example, if we had a link in our network that was very good for delay-sensitive
traffic, we could give it a low metric for the TOS value representing low delay and
a high metric for everything else. OSPF would then pick a different shortest path for
those packets that had their TOS field set to that value. It is worth noting that, at the
time of writing, this capability has not been widely deployed.[5]
The preceding discussion assumes that link costs, or metrics, are known when we
execute the routing algorithm. In this section, we look at some ways to calculate link
costs that have proven effective in practice. One example that we have seen already,
which is quite reasonable and very simple, is to assign a cost of 1 to all links—the
least-cost route will then be the one with the fewest hops. Such an approach has
several drawbacks, however. First, it does not distinguish between links on a latency
basis. Thus, a satellite link with 250-ms latency looks just as attractive to the routing
protocol as a terrestrial link with 1-ms latency. Second, it does not distinguish between
routes on a capacity basis, making a 9.6-Kbps link look just as good as a 45-Mbps
link. Finally, it does not distinguish between links based on their current load, making
it impossible to route around overloaded links. It turns out that this last problem is
the hardest because you are trying to capture the complex and dynamic characteristics
of a link in a single scalar cost.
The ARPANET was the testing ground for a number of different approaches
to link-cost calculation. (It was also the place where the superior stability of link
state over distance-vector routing was demonstrated; the original mechanism used
distance vector while the later version used link state.) The following discussion traces
the evolution of the ARPANET routing metric and, in so doing, explores the subtle
aspects of the problem.
[5] Note also that the meaning of the TOS field has changed since the OSPF specification
was written. This topic is discussed in Section 6.5.3.
The original ARPANET routing metric measured the number of packets that
were queued waiting to be transmitted on each link, meaning that a link with 10
packets queued waiting to be transmitted was assigned a larger cost weight than a link
with 5 packets queued for transmission. Using queue length as a routing metric did
not work well, however, since queue length is an artificial measure of load—it moves
packets toward the shortest queue rather than toward the destination, a situation all
too familiar to those of us who hop from line to line at the grocery store. Stated more
precisely, the original ARPANET routing mechanism suffered from the fact that it did
not take either the bandwidth or the latency of the link into consideration.
A second version of the ARPANET routing algorithm, sometimes called the
“new routing mechanism,” took both link bandwidth and latency into consideration
and used delay, rather than just queue length, as a measure of load. This was done
as follows. First, each incoming packet was timestamped with its time of arrival at
the router (ArrivalTime); its departure time from the router (DepartTime) was also
recorded. Second, when the link-level ACK was received from the other side, the node
computed the delay for that packet as
Delay = (DepartTime − ArrivalTime) + TransmissionTime + Latency
where TransmissionTime and Latency were statically defined for the link and captured
the link’s bandwidth and latency, respectively. Notice that in this case, DepartTime −
ArrivalTime represents the amount of time the packet was delayed (queued) in the node
due to load. If the ACK did not arrive, but instead the packet timed out, then
DepartTime was reset to the time the packet was retransmitted. In this case, DepartTime −
ArrivalTime captures the reliability of the link: the more frequent the retransmission
of packets, the less reliable the link, and the more we want to avoid it. Finally, the
weight assigned to each link was derived from the average delay experienced by the
packets recently sent over that link.
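The delay formula and the averaging step can be written out directly. The function names are hypothetical, the figures are made-up values in milliseconds, and a plain arithmetic mean stands in for whatever averaging the ARPANET actually used.

```python
def packet_delay(arrival_time, depart_time, transmission_time, latency):
    """Delay = (DepartTime - ArrivalTime) + TransmissionTime + Latency."""
    return (depart_time - arrival_time) + transmission_time + latency

def link_cost(recent_delays):
    """Weight for a link: the average delay of packets recently sent over it."""
    return sum(recent_delays) / len(recent_delays)

# A packet queued for 4 ms on a link with 2 ms transmission time and 10 ms latency.
normal = packet_delay(arrival_time=0, depart_time=4, transmission_time=2, latency=10)

# The same packet after a timeout: DepartTime is reset to the retransmission time.
retransmitted = packet_delay(arrival_time=0, depart_time=54, transmission_time=2, latency=10)

cost = link_cost([normal, retransmitted])
```

The retransmitted packet inflates the average, which is how an unreliable link comes to look expensive.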
Although an improvement over the original mechanism, this approach also had
a lot of problems. Under light load, it worked reasonably well, since the two static
factors of delay dominated the cost. Under heavy load, however, a congested link
would start to advertise a very high cost. This caused all the traffic to move off that
link, leaving it idle, so then it would advertise a low cost, thereby attracting back all
the traffic, and so on. The effect of this instability was that, under heavy load, many
links would in fact spend a great deal of time being idle, which is the last thing you
want under heavy load.
Another problem was that the range of link values was much too large. For
example, a heavily loaded 9.6-Kbps link could look 127 times more costly than a lightly
loaded 56-Kbps link. This means that the routing algorithm would choose a path with
126 hops of lightly loaded 56-Kbps links in preference to a 1-hop 9.6-Kbps path.
While shedding some traffic from an overloaded line is a good idea, making it look so
unattractive that it loses all its traffic is excessive. Using 126 hops when 1 hop will do is
in general a bad use of network resources. Also, satellite links were unduly penalized,
so that an idle 56-Kbps satellite link looked considerably more costly than an idle
9.6-Kbps terrestrial link, even though the former would give better performance for
high-bandwidth applications.
A third approach, called the “revised ARPANET routing metric,” addressed
these problems. The major changes were to compress the dynamic range of the metric
considerably, to account for the link type, and to smooth the variation of the metric
with time.
The smoothing was achieved by several mechanisms. First, the delay measurement was transformed to a link utilization, and this number was averaged with the
last reported utilization to suppress sudden changes. Second, there was a hard limit
on how much the metric could change from one measurement cycle to the next. By
smoothing the changes in the cost, the likelihood that all nodes would abandon a route
at once is greatly reduced.
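The two smoothing mechanisms can be sketched as one function. Everything specific here is an assumption: the function name, the linear metric_of mapping from utilization to routing units, and the per-cycle limit of 10 units are illustrative stand-ins, not the tuned ARPANET values.

```python
def smoothed_metric(prev_utilization, measured_utilization, prev_metric,
                    metric_of, max_step):
    """Return (reported utilization, new metric) for one measurement cycle.

    metric_of maps a utilization in [0, 1] to a cost (hypothetical function).
    max_step is the hard limit on how far the metric may move per cycle.
    """
    # First mechanism: average with the last reported utilization.
    utilization = (prev_utilization + measured_utilization) / 2
    target = metric_of(utilization)
    # Second mechanism: clamp the change from one cycle to the next.
    step = max(-max_step, min(max_step, target - prev_metric))
    return utilization, prev_metric + step

def example_metric(utilization):
    return 30 + 60 * utilization  # illustrative mapping only

utilization, metric = smoothed_metric(0.25, 0.75, 36, example_metric, max_step=10)
```

The measured utilization jumps from 0.25 to 0.75, yet the advertised metric rises by only 10 units in this cycle, so neighbors see a gradual change rather than a sudden one.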
The compression of the dynamic range was achieved by feeding the measured
utilization, the link type, and the link speed into a function that is shown graphically
in Figure 4.22. Observe the following:
■ A highly loaded link never shows a cost of more than three times its cost when idle.
■ The most expensive link is only seven times the cost of the least expensive.
■ A high-speed satellite link is more attractive than a low-speed terrestrial link.
■ Cost is a function of link utilization only at moderate to high loads.
Figure 4.22 Revised ARPANET routing metric versus link utilization. Vertical axis: new metric (routing units). Curves: 9.6-Kbps satellite link, 9.6-Kbps terrestrial link, 56-Kbps satellite link, 56-Kbps terrestrial link.
All these factors mean that a link is much
less likely to be universally abandoned, since
a threefold increase in cost is likely to make
the link unattractive for some paths while
letting it remain the best choice for others.
The slopes, offsets, and breakpoints for
the curves in Figure 4.22 were arrived at
by a great deal of trial and error, and they
were carefully tuned to provide good performance.
There is one final issue related to calculating edge weights—the frequency with
which each node calculates the weights on
its links. There are two things to keep in
mind. First, none of the metrics are instantaneous. That is, whether a node is measuring queue length, delay, or utilization, it is
actually computing an average over a period of time. Second, just because a metric changes does not mean that the node
sends out an update message. In practice,
updates are sent only when the change to an
edge weight is larger than some threshold.
Monitoring Routing
Given the complexity of routing
packets through a network of the
scale of the Internet, we might wonder how well the system works. We
know it works some of the time
because we are able to connect to
sites all over the world. We suspect
it doesn’t work all the time, though,
because sometimes we are unable
to connect to certain sites. The real
problem is determining what part
of the system is at fault when our
connections fail: Has some routing
machinery failed to work properly,
is the remote server too busy, or has
some link or machine simply gone down?
This is really an issue of network management, and while there
are tools that system administrators use to keep tabs on their
own networks—for example, see
the Simple Network Management
Protocol (SNMP) described in Section 9.2.3—it is a largely unresolved problem for the Internet as
a whole. In fact, the Internet has
grown so large and complex that,
even though it is constructed from a
collection of man-made, largely deterministic parts, we have come to
view it almost as a living organism
or natural phenomenon that is to
be studied. That is, we try to understand the Internet’s dynamic behavior by performing experiments on it
and proposing models that explain
our observations.
An excellent example of this
kind of study has been conducted
by Vern Paxson. Paxson used
the Unix traceroute tool to study
40,000 end-to-end routes between
37 Internet sites in 1995. He was
attempting to answer questions
about how routes fail, how stable
routes are over time, and whether
or not they are symmetric. Among
other things, Paxson found that the
likelihood of a user encountering a
serious end-to-end routing problem
was 1 in 30, and that such problems
usually lasted about 30 seconds. He
also found that two-thirds of the
Internet’s routes persisted for days
or weeks, and that about one-third
of the time the route used to get
from host A to host B included at
least one different routing domain
than the route used to get from
host B to host A. Paxson’s overall conclusion was that Internet
routing was becoming less and less
predictable over time.
Routing for Mobile Hosts
Looking back over the preceding discussion
of how IP addressing and routing work,
you might notice that there is an implicit
assumption about the mobility of hosts, or
rather the lack of it. A host’s address consists of a network number and a host part,
and the network number tells us which network the host is attached to. IP routing algorithms tell the routers how to get packets
to the correct network, thus enhancing the
scalability of the routing system by keeping
host-specific information out of the routers.
So what would happen if a host were disconnected from one network and connected
to another? If we didn’t change the IP
address of the host, then it would become
unreachable. Any packet destined for this
host would be sent to the network that has
the appropriate network number, but when
the router(s) on that network tried to deliver
the packet to the host, the host would not
be there to receive it.
The obvious solution to this problem
is to provide the host with a new address
when it attaches to a new network. Techniques such as DHCP (described in Section 4.1.6) can make this a relatively simple
process. In many situations this solution is
adequate, but in others it is not. For example, suppose that a user of a PC equipped
with a wireless network interface is running
some application while she roams the countryside. The PC might detach itself from
one network and attach to another with
some frequency, but the user would want
to be oblivious to this. In particular, the
applications that were running when the PC was attached to network A should continue to run without interruption when it attaches to network B. If the PC simply
changes its IP address in the middle of running the application, the application cannot
simply keep working, because the remote end has no way of knowing that it must
now send the packets to a new IP address. Ideally, we want the movement of the PC to
be transparent to the remote application. The procedures that are designed to address
this problem are usually referred to as “Mobile IP” (which is also the name of the
IETF working group that defined them).
The Mobile IP working group made some important design decisions at the
outset. In particular, it was a requirement that the solution would work without any
changes to the software of nonmobile hosts or the majority of routers in the Internet.
This sort of approach is frequently adopted in the Internet. Any new technology that
requires a majority of routers or hosts to be modified before it can work is likely to
face an uphill battle for acceptance.
While the majority of routers remain unchanged, mobility support does require
some new functionality in at least one router, known as the home agent of the mobile node. This router is located on the “home” network of the mobile host. The mobile
host is assumed to have a permanent IP address, called its home address, which has a
network number equal to that of the home network, and thus of the home agent. This is
the address that will be used by other hosts when they send packets to the mobile host;
since it does not change, it can be used by long-lived applications as the host roams.
In many cases, a second router with enhanced functionality, the foreign agent, is
also required. This router is located on a network to which the mobile node attaches
itself when it is away from its home network. We will consider first the operation of
Mobile IP when a foreign agent is used. An example network with both home and
foreign agents is shown in Figure 4.23.
Figure 4.23 Mobile host and mobility agents. Labels shown: sending host, home agent, home network (network 10), foreign agent, mobile host.
Both home and foreign agents periodically announce their presence on the networks
to which they are attached using agent advertisement messages. A mobile host
may also solicit an advertisement when it attaches to a new network. The advertisement
by the home agent enables a mobile host to learn the address of its home agent
before it leaves its home network. When the mobile host attaches to a foreign network,
it hears an advertisement from a foreign agent and registers with the agent, providing
the address of its home agent. The foreign agent then contacts the home agent,
providing a care-of address. This is usually the IP address of the foreign agent.
At this point, we can see that any host that tries to send a packet to the mobile
host will send it with a destination address equal to the home address of that node.
Normal IP forwarding will cause that packet to arrive on the home network of the
mobile node, on which the home agent is sitting. Thus, we can divide the problem of
delivering the packet to the mobile node into three parts:
1 How does the home agent intercept a packet that is destined for the mobile node?
2 How does the home agent then deliver the packet to the foreign agent?
3 How does the foreign agent deliver the packet to the mobile node?
The first problem might look easy if you just look at Figure 4.23, in which the
home agent is clearly the only path between the sending host and the home network,
and thus must receive packets that are destined to the mobile node. But what if the
sending node were on network 10, or what if there were another router connected
to network 10 that tried to deliver the packet without its passing through the home
agent? To address this problem, the home agent actually impersonates the mobile
node, using a technique called “proxy ARP.” This works just like ARP as described
in Section 4.1.5, except that the home agent inserts the IP address of the mobile node,
rather than its own, in the ARP messages. It uses its own hardware address, so that
all the nodes on the same network learn to associate the hardware address of the
home agent with the IP address of the mobile node. One subtle aspect of this process
is the fact that ARP information may be cached in other nodes on the network. To
make sure that these caches are invalidated in a timely way, the home agent issues an
ARP message as soon as the mobile node registers with a foreign agent. Because the
ARP message is not a response to a normal ARP request, it is termed a “gratuitous ARP.”
The second problem is the delivery of the intercepted packet to the foreign agent.
Here we use the tunneling technique described in Section 4.1.8. The home agent simply
“wraps” the packet inside an IP header that is destined for the foreign agent and
transmits it into the internetwork. All the intervening routers just see an IP packet
destined for the IP address of the foreign agent. Another way of looking at this is that
an IP tunnel is established between the home agent and the foreign agent, and the
home agent just drops packets destined for the mobile node into that tunnel.
4 Internetworking
When a packet finally arrives at the foreign agent, it strips the extra IP header
and finds inside an IP packet destined for the mobile node. Clearly, the foreign agent
cannot treat this like any old IP packet because this would cause it to send it back
to the home network. Instead, it has to recognize the address as that of a registered mobile node. It then delivers the packet to the hardware address of the mobile
node (e.g., its Ethernet address), which was learned as part of the registration process.
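The tunneling and final-delivery steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Mobile IP wire format: packets are simple objects, and the class names (`Packet`, `HomeAgent`, `ForeignAgent`), the addresses, and the hardware address in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    src: str
    dst: str
    payload: object  # application data, or an inner Packet when tunneled


@dataclass
class HomeAgent:
    address: str
    # home address of a mobile node -> care-of address learned at registration
    bindings: dict = field(default_factory=dict)

    def intercept(self, pkt):
        """Wrap a packet for a registered mobile node in an outer IP header."""
        care_of = self.bindings.get(pkt.dst)
        if care_of is None:
            return pkt  # node is at home (or unknown): forward normally
        return Packet(src=self.address, dst=care_of, payload=pkt)


@dataclass
class ForeignAgent:
    address: str
    # home address of a visiting mobile node -> its hardware address
    visitors: dict = field(default_factory=dict)

    def receive(self, pkt):
        """Strip the outer header and choose the link-layer destination."""
        inner = pkt.payload
        if not isinstance(inner, Packet):
            raise ValueError("not a tunneled packet")
        hw = self.visitors.get(inner.dst)
        if hw is None:
            raise LookupError("mobile node not registered here")
        return hw, inner
```

With a home agent created as `HomeAgent("10.0.0.3", {"10.0.0.11": "12.0.0.6"})`, a packet addressed to 10.0.0.11 comes out of `intercept` with an outer destination of 12.0.0.6; intervening routers would look only at that outer header.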
One observation that can be made about these procedures is that it is possible
for the foreign agent and the mobile node to be in the same box; that is, a mobile node
can perform the foreign agent function itself. To make this work, however, the mobile
node must be able to dynamically acquire an IP address that is located in the address
space of the foreign network. This address will then be used as the care-of address.
In our example, this would have to be an address with a network number of 12.
We have already seen one way in which a host can dynamically acquire a correct
IP address, using DHCP (Section 4.1.6). This approach has the desirable feature of
allowing mobile nodes to attach to networks that don’t have foreign agents; thus,
mobility can be achieved with only the addition of a home agent and some new software
on the mobile node (assuming DHCP is used on the foreign network).
What about traffic in the other direction (i.e., from mobile node to fixed node)?
This turns out to be much easier. The mobile node just puts the IP address of the fixed
node in the destination field of its IP packets, while putting its permanent address in
the source field, and the packets are forwarded to the fixed node using normal means.
Of course, if both nodes in a conversation are mobile, then the procedures described
above are used in each direction.
Route Optimization in Mobile IP
There is one significant drawback to the above approach, which may be familiar to
users of cellular telephones. The route from sending node to mobile node can be significantly suboptimal. One of the most extreme examples is when a mobile node and
the sending node are on the same network, but the home network for the mobile node
is on the far side of the Internet. The sending node addresses all packets to the home
network; they traverse the Internet to reach the home agent, which then tunnels them
back across the Internet to reach the foreign agent. Clearly, it would be nice if the
sending node could find out that the mobile node is actually on the same network
and deliver the packet directly. In the more general case, the goal is to deliver packets
as directly as possible from sending node to mobile node without passing through a
home agent. This is sometimes referred to as the “triangle routing problem” since the
path from sender to mobile node via home agent takes two sides of a triangle, rather
than the third side that is the direct path.
The basic idea behind the solution to triangle routing is to let the sending node
know the care-of address of the mobile node. The sending node can then create its
own tunnel to the foreign agent. This is treated as an optimization of the process just
described. If the sender has been equipped with the necessary software to learn the
care-of address and create its own tunnel, then the route can be optimized; if not,
packets just follow the suboptimal route.
When a home agent sees a packet destined for one of the mobile nodes that it
supports, it can deduce that the sender is not using the optimal route. Therefore, it
sends a “binding update” message back to the source, in addition to forwarding the
data packet to the foreign agent. The source, if capable, uses this binding update to
create an entry in a “binding cache,” which consists of a list of mappings from mobile
node addresses to care-of addresses. The next time this source has a data packet to
send to that mobile node, it will find the binding in the cache and can tunnel the packet
directly to the foreign agent.
There is an obvious problem with this scheme, which is that the binding cache
may become out-of-date if the mobile host moves to a new network. If an out-of-date
cache entry is used, the foreign agent will receive tunneled packets for a mobile node
that is no longer registered on its network. In this case, it sends a “binding warning”
message back to the sender to tell it to stop using this cache entry. This scheme works
only in the case where the foreign agent is not the mobile node itself, however. For this
reason, cache entries need to be deleted after some period of time; the exact amount
is specified in the binding update message.
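A binding cache with expiring entries might look like the following sketch. It assumes the sender keeps one entry per mobile node and that the lifetime arrives as a number of seconds; the class and method names are illustrative and are not taken from any Mobile IP specification.

```python
import time


class BindingCache:
    """Maps a mobile node's home address to (care-of address, expiry time)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the cache can be tested without waiting
        self._entries = {}

    def binding_update(self, home_addr, care_of_addr, lifetime_s):
        # The lifetime is carried in the binding update message.
        self._entries[home_addr] = (care_of_addr, self._clock() + lifetime_s)

    def binding_warning(self, home_addr):
        # A foreign agent reports that the node is no longer registered there.
        self._entries.pop(home_addr, None)

    def lookup(self, home_addr):
        """Return the care-of address, or None to fall back to the home address."""
        entry = self._entries.get(home_addr)
        if entry is None:
            return None
        care_of_addr, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[home_addr]  # stale entry: use the normal route
            return None
        return care_of_addr
```

When `lookup` returns `None`, the sender simply addresses the packet to the home address, so a missing or expired entry costs only the suboptimal route and never breaks delivery.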
Mobile routing provides some interesting security challenges. For example, an
attacker wishing to intercept the packets destined to some other node in an internetwork could contact the home agent for that node and announce itself as the new
foreign agent for the node. Thus it is clear that some authentication mechanisms are
required. We discuss such mechanisms in Chapter 8.
Finally, we note that there are many open issues in mobile networking. For
example, the security and performance aspects of mobile networks might require routing algorithms to take account of several factors when finding a route to a mobile host;
for example, it might be desirable to find a route that doesn’t pass through some untrusted network. There is also the problem of “ad hoc” mobile networks—enabling
a group of mobile nodes to form a network in the absence of any fixed nodes. These
continue to be areas of active research.
4.3 Global Internet
At this point, we have seen how to connect a heterogeneous collection of networks to
create an internetwork and how to use the simple hierarchy of the IP address to make
routing in an internet somewhat scalable. We say “somewhat” scalable because even
though each router does not need to know about all the hosts connected to the internet,
it does, in the model described so far, need to know about all the networks connected
to the internet. Today’s Internet has tens of thousands of networks connected to it.
Routing protocols such as those we have just discussed do not scale to those kinds of
numbers. This section looks at a variety of techniques that greatly improve scalability
and that have enabled the Internet to grow as far as it has.
Figure 4.24 The tree structure of the Internet in 1990. (Figure label: NSFNET backbone.)
Before getting to these techniques, we need to have a general picture in our
heads of what the global Internet looks like. It is not just a random interconnection of
Ethernets, but instead it takes on a shape that reflects the fact that it interconnects many
different organizations. Figure 4.24 gives a simple depiction of the state of the Internet
in 1990. Since that time, the Internet’s topology has grown much more complex than
this figure suggests—we present a more accurate picture of the current Internet in
Section 4.3.3 and Figure 4.29—but this picture will do for now.
One of the salient features of this topology is that it consists of “end user” sites
(e.g., Stanford University) that connect to “service provider” networks (e.g., BARRNET was a provider network that served sites in the San Francisco Bay Area). In 1990,
many providers served a limited geographic region and were thus known as regional
networks. The regional networks were, in turn, connected by a nationwide backbone.
In 1990, this backbone was funded by the National Science Foundation (NSF) and
was therefore called the NSFNET backbone. Although the detail is not shown in this
figure, the provider networks are typically built from a large number of point-to-point
links (e.g., DS3 or OC-3 links) that connect to routers; similarly, each end user site
is typically not a single network, but instead consists of multiple physical networks
connected by routers and bridges.
Notice in Figure 4.24 that each provider and end user is likely to be an administratively independent entity. This has some significant consequences for routing.
For example, it is quite likely that different providers will have different ideas about
the best routing protocol to use within their network, and about how metrics should
be assigned to links in their network. Because of this independence, each provider’s
network is usually a single autonomous system (AS). We will define this term more
precisely in Section 4.3.3, but for now it is adequate to think of an AS as a network
that is administered independently of other ASs.
The fact that the Internet has a discernible structure can be used to our advantage
as we tackle the problem of scalability. In fact, we need to deal with two related scaling
issues. The first is the scalability of routing. We need to find ways to minimize the
number of network numbers that get carried around in routing protocols and stored
in the routing tables of routers. The second is address utilization—that is, making sure
that the IP address space does not get consumed too quickly.
Throughout this section, we will see the principle of hierarchy used again and
again to improve scalability. We begin with subnetting, which primarily deals with
address space utilization. Next we introduce classless routing or supernetting, which
tackles both address utilization and routing scalability. We then look at how hierarchy can be used to improve the scalability of routing, both through interdomain
routing and within a single domain. Our final subsection looks at the emerging standards for IP version 6, the invention of which was largely the result of scalability concerns.
Subnetting
The original intent of IP addresses was that the network part would uniquely identify
exactly one physical network. It turns out that this approach has a couple of drawbacks. Imagine a large campus that has lots of internal networks and that decides to
connect to the Internet. For every network, no matter how small, the site needs at least
a class C network address. Even worse, for any network with more than 255 hosts,
they need a class B address. This may not seem like a big deal, and indeed it wasn’t
when the Internet was first envisioned, but there are only a finite number of network
numbers, and there are far fewer class B addresses than class Cs. Class B addresses
tend to be in particularly high demand because you never know if your network might
expand beyond 255 nodes, so it is easier to use a class B address from the start than
to have to renumber every host when you run out of room on a class C network.
The problem we observe here is address assignment inefficiency: A network with two
nodes uses an entire class C network address, thereby wasting 253 perfectly useful
addresses; a class B network with slightly more than 255 hosts wastes over 64,000 addresses.
Assigning one network number per physical network, therefore, uses up the
IP address space potentially much faster than we would like. While we would need
to connect over 4 billion hosts to use up all the valid addresses, we only need to
connect 2^14 (about 16,000) class B networks before that part of the address space runs
out. Therefore, we would like to find some way to use the network numbers more efficiently.
Assigning many network numbers has another drawback that becomes apparent
when you think about routing. Recall that the amount of state that is stored in a node
participating in a routing protocol is proportional to the number of other nodes, and
that routing in an internet consists of building up forwarding tables that tell a router
how to reach different networks. Thus, the more network numbers there are in use,
the bigger the forwarding tables get. Big forwarding tables add cost to routers, and
they are potentially slower to search than smaller tables for a given technology, so they
degrade router performance. This provides another motivation for assigning network
numbers carefully.
Subnetting provides an elegantly simple way to reduce the total number of network numbers that are assigned. The idea is to take a single IP network number
and allocate the IP addresses with that network number to several physical networks,
which are now referred to as subnets. Several things need to be done to make this
work. First, the subnets should be close to each other. This is because at a distant
point in the Internet, they will all look like a single network, having only one network
number between them. This means that a router will only be able to select one route
to reach any of the subnets, so they had better all be in the same general direction. A
perfect situation in which to use subnetting is a large campus or corporation that has
many physical networks. From outside the campus, all you need to know to reach any
subnet inside the campus is where the campus connects to the rest of the Internet. This
is often at a single point, so one entry in your forwarding table will suffice. Even if
there are multiple points at which the campus is connected to the rest of the Internet,
knowing how to get to one point in the campus network is still a good start.
The mechanism by which a single network number can be shared among multiple
networks involves configuring all the nodes on each subnet with a subnet mask. With
simple IP addresses, all hosts on the same network must have the same network number.
The subnet mask enables us to introduce a subnet number; all hosts on the same
physical network will have the same subnet number, which means that hosts may be
on different physical networks but share a single network number.
What the subnet mask effectively does is introduce another level of hierarchy into
the IP address. For example, suppose that we want to share a single class B address
among several physical networks. We could use a subnet mask of 255.255.255.0.
(Subnet masks are written down just like IP addresses; this mask is therefore all 1s in
the upper 24 bits and 0s in the lower 8 bits.) In effect, this means that the top 24 bits
(where the mask has 1s) are now defined to be the network number, and the lower
8 bits (where the mask has 0s) are the host number. Since the top 16 bits identify the
network in a class B address, we may now think of the address as having not two parts
but three: a network part, a subnet part, and a host part. That is, we have divided what
used to be the host part into a subnet part and a host part. This is shown in Figure 4.25.
Figure 4.25 Subnet addressing. (A class B address has a network number and a host number; applying the subnet mask yields a subnetted address with a network number, a subnet ID, and a host ID.)
Figure 4.26 An example of subnetting. (Each of the three subnets in the figure is labeled with its subnet mask and subnet number.)
What subnetting means to a host is that it is now configured with both an IP
address and a subnet mask for the subnet to which it is attached. For example, host
H1 in Figure 4.26 is configured with an address of 128.96.34.15 and a subnet mask
of 255.255.255.128. (All hosts on a given subnet are configured with the same mask;
i.e., there is exactly one subnet mask per subnet.) The bitwise AND of these two
numbers defines the subnet number of the host and of all other hosts on the same
subnet. In this case, 128.96.34.15 AND 255.255.255.128 equals 128.96.34.0, so this
is the subnet number for the topmost subnet in the figure.
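The AND operation is easy to check with Python's standard `ipaddress` module. The helper name `subnet_number` is ours, and the address and mask are sample values chosen for illustration; any dotted-quad pair would do.

```python
import ipaddress


def subnet_number(address: str, mask: str) -> str:
    """Bitwise AND of a dotted-quad address and a dotted-quad subnet mask."""
    addr = int(ipaddress.IPv4Address(address))
    m = int(ipaddress.IPv4Address(mask))
    return str(ipaddress.IPv4Address(addr & m))


# 15 is 00001111 in binary and 128 is 10000000, so the last byte ANDs to 0.
print(subnet_number("128.96.34.15", "255.255.255.128"))  # 128.96.34.0
```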
When the host wants to send a packet to a certain IP address, the first thing it
does is to perform a bitwise AND between its own subnet mask and the destination
IP address. If the result equals the subnet number of the sending host, then it knows
that the destination host is on the same subnet and the packet can be delivered directly
over the subnet. If the results are not equal, the packet needs to be sent to a router
to be forwarded to another subnet. For example, if H1 is sending to H2, then H1
ANDs its subnet mask (255.255.255.128) with the address for H2 (128.96.34.139) to
obtain 128.96.34.128. This does not match the subnet number for H1 (128.96.34.0),
so H1 knows that H2 is on a different subnet. Since H1 cannot deliver the packet to
H2 directly over the subnet, it sends the packet to its default router R1.
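The sending host's decision can be written as a short function. This is a sketch under the assumption that the host knows only its own address, its own mask, and a default router; the function name and all addresses in the usage note are illustrative.

```python
import ipaddress


def next_step(own_addr: str, own_mask: str, dest_addr: str, default_router: str) -> str:
    """Decide whether dest_addr is on the sender's subnet or must go via a router."""
    mask = int(ipaddress.IPv4Address(own_mask))
    own_subnet = int(ipaddress.IPv4Address(own_addr)) & mask
    # Note that the sender applies its OWN mask to the destination address.
    dest_subnet = int(ipaddress.IPv4Address(dest_addr)) & mask
    if dest_subnet == own_subnet:
        return "deliver directly to " + dest_addr
    return "send to default router " + default_router
```

For a host at 128.96.34.15 with mask 255.255.255.128, a destination of 128.96.34.139 yields a different subnet number (128.96.34.128 rather than 128.96.34.0), so the function chooses the default router.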
Note that ARP is largely unaffected by the change in address structure. Once a
host or router figures out which node it needs to deliver a packet to on one of the
networks to which it is attached, it performs ARP to find the MAC address for that
node if necessary.
The job of a router also changes when we introduce subnetting. Recall that,
for simple IP, a router has a forwarding table that consists of entries of the form
⟨NetworkNum, NextHop⟩. To support subnetting, the table must now hold entries of
the form ⟨SubnetNumber, SubnetMask, NextHop⟩. To find the right entry in the table,
the router ANDs the packet’s destination address with the SubnetMask for each entry
in turn; if the result matches the SubnetNumber of the entry, then this is the right entry
to use, and it forwards the packet to the next hop router indicated. In the example
network of Figure 4.26, router R1 would have the entries shown in Table 4.10.
Continuing with the example of a datagram from H1 being sent to H2, R1
would AND H2’s address (128.96.34.139) with the subnet mask of the first entry
(255.255.255.128) and compare the result (128.96.34.128) with the network number
for that entry (128.96.34.0). Since this is not a match, it proceeds to the next entry.
This time a match does occur, so R1 delivers the datagram to H2 using interface 1,
which is the interface connected to the same network as H2.
Table 4.10 Example forwarding table with subnetting for Figure 4.26.
SubnetNumber     SubnetMask         NextHop
128.96.34.0      255.255.255.128    Interface 0
128.96.34.128    255.255.255.128    Interface 1
128.96.33.0      255.255.255.0      R2
We can now describe the datagram forwarding algorithm in the following way:
D = destination IP address
for each forwarding table entry ⟨SubnetNumber, SubnetMask, NextHop⟩
    D1 = SubnetMask & D
    if D1 = SubnetNumber
        if NextHop is an interface
            deliver datagram directly to destination
        else
            deliver datagram to NextHop (a router)
Although not shown in this example, a default router would usually be included in
the table and would be used if no explicit matches were found. We note in passing
that a naive implementation of this algorithm, one that repeatedly ANDs the
destination address with subnet masks that are often the same from one entry to the
next and that searches the table linearly, would be very inefficient.
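For concreteness, here is the naive linear-search version of the forwarding algorithm in Python. It is a sketch, not how a production router works; the table contents are sample values in the 128.96 network, and the `default` parameter stands in for the default router mentioned above.

```python
import ipaddress


def to_int(dotted: str) -> int:
    return int(ipaddress.IPv4Address(dotted))


def forward(dest: str, table, default=None):
    """Return the NextHop of the first entry whose subnet matches dest."""
    d = to_int(dest)
    for subnet_number, subnet_mask, next_hop in table:
        if (d & to_int(subnet_mask)) == to_int(subnet_number):
            return next_hop
    return default  # no explicit match: fall back to the default router, if any


# Assumed example entries: (SubnetNumber, SubnetMask, NextHop).
TABLE = [
    ("128.96.34.0", "255.255.255.128", "interface 0"),
    ("128.96.34.128", "255.255.255.128", "interface 1"),
    ("128.96.33.0", "255.255.255.0", "R2"),
]

print(forward("128.96.34.139", TABLE))  # interface 1
```

Every lookup re-parses and re-applies each mask, which is exactly the inefficiency noted above; a practical implementation would at least store the masks and subnet numbers as integers.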
A few fine points about subnetting need to be mentioned. We have already seen
that the subnet mask does not need to align with a byte boundary, with the example
mask of 255.255.255.128 (25 1s followed by 7 0s) used above. More confusingly, it
is not even necessary for all the 1s in a subnet mask to be contiguous. For example, it
would be quite possible to use a subnet mask of 255.255.1.0. All of the mechanisms
described above should continue to work, but now you can’t look at a contiguous part
of the IP address and say, “That is the subnet number.” This makes administration
more difficult. It may also fail to work with implementations that assume that no one
would use noncontiguous masks, and so it is not recommended in practice.
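An implementation that wants to reject noncontiguous masks can test for them with a little bit arithmetic. The function below is a sketch of one such check; its name is ours.

```python
import ipaddress


def is_contiguous_mask(mask: str) -> bool:
    """True if the mask is a run of 1s followed only by 0s."""
    m = int(ipaddress.IPv4Address(mask))
    inverted = ~m & 0xFFFFFFFF  # the host bits become 1s
    # A contiguous mask inverts to 2**k - 1, and x & (x + 1) == 0 only for such x.
    return (inverted & (inverted + 1)) == 0


print(is_contiguous_mask("255.255.255.128"))  # True
print(is_contiguous_mask("255.255.1.0"))      # False
```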
We can also put multiple subnets on a single physical network. The effect of this
would be to force hosts on the same network to talk to each other through a router,
which might be useful for administrative purposes; for example, to provide isolation
between different departments sharing a LAN.
A third point to which we have alluded is that different parts of the internet see the
world differently. From outside our hypothetical campus, routers see a single network.
In the example above, routers outside the campus see the collection of networks in
Figure 4.26 as just the network 128.96, and they keep one entry in their forwarding
tables to tell them how to reach it. Routers within the campus, however, need to be able
to route packets to the right subnet. Thus, not all parts of the internet see exactly the
same routing information. The next section takes a closer look at how the propagation
of routing information is done in the Internet.
The bottom line is that subnetting helps solve our scalability problems in two
ways. First, it improves our address assignment efficiency by letting us not use up an
entire class C or class B address every time we add a new physical network. Second, it
helps us aggregate information. From a reasonable distance, a complex collection of
physical networks can be made to look like a single network, so that the amount of
information that routers need to store to deliver datagrams to those networks can be reduced.
Classless Routing (CIDR)
Classless interdomain routing (CIDR, pronounced “cider”) is a technique that
addresses two scaling concerns in the Internet: the growth of backbone routing tables as more and more network numbers need to be stored in them, and the potential
for the 32-bit IP address space to be exhausted well before the four-billionth host is
attached to the Internet. We have already mentioned the problem that would cause
this address space exhaustion: address assignment inefficiency. The inefficiency arises
because the IP address structure, with class A, B, and C addresses, forces us to hand
out network address space in fixed-sized chunks of three very different sizes. A network with two hosts needs a class C address, giving an address assignment efficiency
of 2/255 = 0.78%; a network with 256 hosts needs a class B address, for an efficiency of only 256/65,535 = 0.39%. Even though subnetting can help us to assign
addresses carefully, it does not get around the fact that any autonomous system with
more than 255 hosts, or an expectation of eventually having that many, wants a class B address.
As it turns out, exhaustion of the IP address space centers on exhaustion of the
class B network numbers. One way to deal with that would seem to be saying no to
any AS that requests a class B address unless they can show a need for something
close to 64K addresses, and instead giving them an appropriate number of class C
addresses to cover the expected number of hosts. Since we would now be handing out
address space in chunks of 256 addresses at a time, we could more accurately match
the amount of address space consumed to the size of the AS. For any AS with at least
256 hosts (which means the majority of ASs), we can guarantee an address utilization
of at least 50%, and typically much more.
This solution, however, raises a problem that is at least as serious: excessive
storage requirements at the routers. If a single AS has, say, 16 class C network numbers
assigned to it, that means every Internet backbone router needs 16 entries in its routing
tables for that AS. This is true even if the path to every one of those networks is the
same. If we had assigned a class B address to the AS, the same routing information
could be stored in one table entry. However, our address assignment efficiency would
then be only 16 × 255/65,536 = 6.2%.
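The efficiency figures quoted in this discussion are simple ratios, reproduced here as a quick check. The helper name is ours, and the block sizes follow the text's own conventions (255 or 65,535 addresses, and 65,536 in the last case).

```python
def efficiency(hosts: int, block_size: int) -> float:
    """Percentage of an assigned address block that is actually used."""
    return 100.0 * hosts / block_size


print(f"{efficiency(2, 255):.2f}%")           # two hosts in a class C: 0.78%
print(f"{efficiency(256, 65535):.2f}%")       # 256 hosts in a class B: 0.39%
print(f"{efficiency(16 * 255, 65536):.1f}%")  # 16 class Cs' worth in a class B: 6.2%
```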
CIDR, therefore, tries to balance the desire to minimize the number of routes that
a router needs to know against the need to hand out addresses efficiently. To do this,
CIDR helps us to aggregate routes. That is, it lets us use a single entry in a forwarding
table to tell us how to reach a lot of different networks. As you may have guessed
from the name, it does this by breaking the rigid boundaries between address classes.
To understand how this works, consider our hypothetical AS with 16 class C network
numbers. Instead of handing out 16 addresses at random, we can hand out a block of
contiguous class C addresses. Suppose we assign the class C network numbers from
192.4.16 through 192.4.31. Observe that the top 20 bits of all the addresses in this
range are the same (11000000 00000100 0001). Thus, what we have effectively created
is a 20-bit network number—something that is between a class B network number and
a class C number in terms of the number of hosts that it can support. In other words,
we get both the high address efficiency of handing out addresses in chunks smaller
than a class B network and a single network prefix that can be used in forwarding
tables. Observe that for this scheme to work, we need to hand out blocks of class C
addresses that share a common prefix, which means that each block must contain a
number of class C networks that is a power of two.
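We can confirm the shared 20-bit prefix with Python's standard `ipaddress` module. This is only a check on the example above; writing each class C network as a /24 is our notation, not the text's.

```python
import ipaddress

# The 16 contiguous class C networks 192.4.16 through 192.4.31, written as /24s.
class_c_blocks = [ipaddress.ip_network(f"192.4.{n}.0/24") for n in range(16, 32)]

# collapse_addresses merges adjacent networks into the fewest covering prefixes.
aggregated = list(ipaddress.collapse_addresses(class_c_blocks))
print(aggregated)  # [IPv4Network('192.4.16.0/20')]

# The top 20 bits of every address in the range are identical.
top20 = {int(net.network_address) >> 12 for net in class_c_blocks}
print(format(top20.pop(), "020b"))  # 11000000000001000001
```

Had the block started at 192.4.17 instead, the 16 networks would no longer collapse into a single prefix, which is why a block must be aligned as well as contiguous.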
All we need now to make CIDR solve our problems is a routing protocol that
can deal with these “classless” addresses, which means that it must understand that
a network number may be of any length. Modern routing protocols (such as BGP-4,
described below) do exactly that. The network numbers that are carried in such a
routing protocol are represented simply by ⟨length, value⟩ pairs, where the length
gives the number of bits in the network prefix (20 in the above example). Note that
representing a network address in this way is similar to the ⟨mask, value⟩ approach
used in subnetting, as long as masks consist of contiguous bits starting from the most
significant bit. Also note that we used subnetting to share one address among multiple
physical networks, while CIDR aims to collapse the multiple addresses that would be
assigned to a single AS onto one address. The similarity between the two approaches
is reflected in the original name for CIDR—supernetting.
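The correspondence between a prefix length and a contiguous mask can be made explicit. The two helper functions below are illustrative names of our own, and the second assumes its input mask is contiguous.

```python
import ipaddress


def prefix_len_to_mask(length: int) -> str:
    """Dotted-quad mask with `length` leading 1 bits."""
    if not 0 <= length <= 32:
        raise ValueError("prefix length must be between 0 and 32")
    mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(mask))


def mask_to_prefix_len(mask: str) -> int:
    """Number of 1 bits in the mask; meaningful only for contiguous masks."""
    return bin(int(ipaddress.IPv4Address(mask))).count("1")


print(prefix_len_to_mask(20))               # 255.255.240.0
print(mask_to_prefix_len("255.255.240.0"))  # 20
```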
In fact, the ability to aggregate routes in the way that we have just shown is
only the first step. Imagine an Internet service provider network, whose primary job
is to provide Internet connectivity to a large number of corporations and campuses.
If we assign network numbers to the corporations in such a way that all the different
corporations connected to the provider network share a common address prefix, then
we can get even greater aggregation of routes. Consider the example in Figure 4.27.
The two corporations served by the provider network have been assigned adjacent 20-bit network prefixes. Since both of the corporations are reachable through the same
provider network, it can advertise a single route to both of them by just advertising the
common 19-bit prefix they share. In general, it is possible to aggregate routes repeatedly
if addresses are assigned carefully. This means that we need to pay attention to which
provider a corporation is attached to before assigning it an address if this scheme is to
work. One way to accomplish that is to assign a portion of address space to the provider
and then to let the network provider assign addresses from that space to its customers.
Figure 4.27 Route aggregation with CIDR. (Corporation X and Corporation Y attach to a regional network, whose border gateway advertises a single path covering both.)
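The provider-level aggregation can be illustrated the same way. The two customer prefixes below are assumed values picked so that they share a 19-bit prefix; they are not read from the figure.

```python
import ipaddress

corp_x = ipaddress.ip_network("192.4.16.0/20")  # assumed prefix for Corporation X
corp_y = ipaddress.ip_network("192.4.0.0/20")   # assumed prefix for Corporation Y

# Two 20-bit prefixes that differ only in their last bit merge into one 19-bit prefix.
advertised = list(ipaddress.collapse_addresses([corp_x, corp_y]))
print(advertised)  # [IPv4Network('192.4.0.0/19')]
assert all(c.subnet_of(advertised[0]) for c in (corp_x, corp_y))

# Numerically adjacent prefixes that do not share 19 bits cannot be merged.
not_siblings = list(ipaddress.collapse_addresses(
    [ipaddress.ip_network("192.4.16.0/20"), ipaddress.ip_network("192.4.32.0/20")]))
print(len(not_siblings))  # 2
```

The second case shows why careful assignment matters: two blocks can sit side by side in the address space and still require two separate routes.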
IP Forwarding Revisited
In all our discussion of IP forwarding so far, we have assumed that we could find the
network number in a packet and then look up that number in a forwarding table.
However, now that we have introduced CIDR, we need to reexamine this assumption.
CIDR means that prefixes may be of any length, from 2 to 32 bits. Furthermore, it is
sometimes possible to have prefixes in the forwarding table that “overlap,” in the sense
that some addresses may match more than one prefix. For example, we might find both
171.69 (a 16-bit prefix) and 171.69.10 (a 24-bit prefix) in the forwarding table of a
single router. In this case, a packet destined to, say, 171.69.10.5 clearly matches both
prefixes. The rule in this case is based on the principle of “longest match”; that is, the
packet matches the longest prefix, which would be 171.69.10 in this example. On the
other hand, a packet destined to 171.69.20.5 would match 171.69 and not 171.69.10,
and in the absence of any other matching entry in the routing table, 171.69 would be
the longest match.
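The longest-match rule can be stated directly as a linear scan over the table. This is a sketch for clarity rather than speed; the next-hop labels "A" and "B" and the destination addresses in the usage note are made up for the example.

```python
import ipaddress


def longest_match(dest: str, table):
    """Return the next hop of the longest prefix containing dest, or None."""
    addr = ipaddress.ip_address(dest)
    best = None
    for prefix, next_hop in table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None


# The 16-bit prefix 171.69 and the 24-bit prefix 171.69.10 in prefix/length form.
TABLE = [("171.69.0.0/16", "A"), ("171.69.10.0/24", "B")]
```

A destination such as 171.69.10.5 falls inside both prefixes and is sent to "B", while 171.69.20.5 falls only inside the 16-bit prefix and is sent to "A".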
The task of efficiently finding the longest match between an IP address and the
variable-length prefixes in a forwarding table has been a fruitful field of research
in recent years, and the “Further Reading” section of this chapter provides some
references. The best-known algorithm uses an approach known as a PATRICIA
tree, which was actually developed well in advance of CIDR.
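To give a flavor of tree-based lookup, the sketch below uses a plain binary trie with one address bit per level. It is deliberately simpler than a PATRICIA tree, which additionally compresses chains of single-child nodes; the class and method names are ours.

```python
import ipaddress


class PrefixTrie:
    """Uncompressed binary trie keyed on address bits, most significant first."""

    def __init__(self):
        self.root = {}  # each node is a dict: {"0": child, "1": child, "hop": next_hop}

    def insert(self, prefix: str, next_hop):
        net = ipaddress.ip_network(prefix)
        bits = format(int(net.network_address), "032b")[: net.prefixlen]
        node = self.root
        for b in bits:
            node = node.setdefault(b, {})
        node["hop"] = next_hop

    def lookup(self, dest: str):
        """Walk down the trie, remembering the last next hop seen on the way."""
        bits = format(int(ipaddress.ip_address(dest)), "032b")
        node, best = self.root, self.root.get("hop")
        for b in bits:
            node = node.get(b)
            if node is None:
                break
            best = node.get("hop", best)
        return best
```

A lookup inspects at most 32 bits regardless of how many prefixes are stored, in contrast to a linear scan whose cost grows with the table size.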
Interdomain Routing (BGP)
At the beginning of this section we introduced the notion that the Internet is organized
as autonomous systems, each of which is under the control of a single administrative
entity. A corporation’s complex internal network might be a single AS, as may the
network of a single Internet service provider. Figure 4.28 shows a simple network
with two autonomous systems.
Figure 4.28 A network with two autonomous systems. (The figure labels them autonomous system 1 and autonomous system 2.)
The basic idea behind autonomous systems is to provide an additional way to
hierarchically aggregate routing information in a large internet, thus improving scalability. We now divide the routing problem into two parts: routing within a single
autonomous system and routing between autonomous systems. Since another name
for autonomous systems in the Internet is routing domains, we refer to the two parts
of the routing problem as interdomain routing and intradomain routing. In addition
to improving scalability, the AS model decouples the intradomain routing that takes
place in one AS from that taking place in another. Thus, each AS can run whatever
intradomain routing protocols it chooses. It can even use static routes or multiple protocols if desired. The interdomain routing problem is then one of having different ASs
share reachability information with each other.
One feature of the autonomous system idea is that it enables some ASs to dramatically reduce the amount of routing information they need to care about by using
default routes. For example, if a corporate network is connected to the rest of the
Internet by a single router (this router is typically called a border router since it sits
at the boundary between the AS and the rest of the Internet), then it is pretty easy
for a host or router inside the autonomous system to figure out where it should send
packets that are headed for a destination outside of this AS—they first go to the AS’s
border router. This is the default route. Similarly, a regional Internet service provider
can keep track of how to reach the networks of all its directly connected customers and
can have a default route to some other provider (typically a backbone provider) for
everyone else. Of course, this passing of the buck has to stop at some point; eventually
the packet should reach a router connected to a backbone network that knows how
to reach everything. Managing the amount of routing information in the backbones is
an important issue that we discuss below.
There have been two major interdomain routing protocols in the recent history of
the Internet. The first was the Exterior Gateway Protocol (EGP). EGP had a number of
limitations, perhaps the most severe of which was that it constrained the topology of the
Internet rather significantly. EGP basically forced a treelike topology onto the Internet,
or to be more precise, it was designed when the Internet had a treelike topology, such
as that illustrated in Figure 4.24. EGP did not allow for the topology to become more
general. Note that in this simple treelike structure, there is a single backbone, and
autonomous systems are connected only as parents and children and not as peers.
The replacement for EGP is the Border Gateway Protocol (BGP), which is in its
fourth version at the time of this writing (BGP-4). BGP is also known for being rather
complex. This section presents the highlights of BGP-4.
As a starting position, BGP assumes that the Internet is an arbitrarily interconnected set of ASs. This model is clearly general enough to accommodate non-tree-structured internetworks, like the simplified picture of today’s multibackbone Internet
shown in Figure 4.29.
Unlike the simple tree-structured Internet shown in Figure 4.24, today’s Internet
consists of an interconnection of multiple backbone networks (they are usually called
service provider networks, and they are operated by private companies rather than
the government), and sites are connected to each other in arbitrary ways. Some large
corporations connect directly to one or more of the backbones, while others connect
to smaller, nonbackbone service providers. Many service providers exist mainly to
provide service to “consumers” (i.e., individuals with PCs in their homes), and these
providers must also connect to the backbone providers. Often many providers arrange
to interconnect with each other at a single “peering point.” In short, it is hard to discern
much structure at all in today’s Internet.
(In an interesting stretch of metaphor, the Internet now has multiple backbones, having had only one for most of
its early life. The authors know of no other animal that has this characteristic.)
Figure 4.29 Today’s multibackbone Internet. (Diagram labels: large corporation, “consumer” ISP, backbone service provider.)
Given this rough sketch of the Internet, if we define local traffic as traffic that
originates at or terminates on nodes within an AS, and transit traffic as traffic that
passes through an AS, we can classify ASs into three types:
■ Stub AS: an AS that has only a single connection to one other AS; such an
AS will only carry local traffic. The small corporation in Figure 4.29 is an
example of a stub AS.
■ Multihomed AS: an AS that has connections to more than one other AS but
that refuses to carry transit traffic; for example, the large corporation at the
top of Figure 4.29.
■ Transit AS: an AS that has connections to more than one other AS and that is
designed to carry both transit and local traffic, such as the backbone providers
in Figure 4.29.
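The three-way classification above amounts to a small decision rule. The following is a minimal sketch of that rule only; the function name `classify_as` and its two inputs are illustrative assumptions and are not part of any routing protocol.

```python
def classify_as(num_neighbor_ases: int, carries_transit: bool) -> str:
    """Return 'stub', 'multihomed', or 'transit' for an AS.

    num_neighbor_ases: how many other ASs this AS connects to (assumed input).
    carries_transit:   whether the AS is willing to carry transit traffic.
    """
    if num_neighbor_ases < 1:
        raise ValueError("an AS must connect to at least one other AS")
    if num_neighbor_ases == 1:
        # A single connection leaves nothing to carry traffic between.
        return "stub"
    return "transit" if carries_transit else "multihomed"


print(classify_as(1, False))  # stub
print(classify_as(2, False))  # multihomed
print(classify_as(4, True))   # transit
```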
Whereas the discussion of routing in Section 4.2 focused on finding optimal paths
based on minimizing some sort of link metric, the problem of interdomain routing turns
out to be so difficult that the goals are more modest. First and foremost, the goal is
to find any path to the intended destination that is loop-free. That is, we are more
concerned with reachability than optimality. Finding a path that is anywhere close to
optimal is considered a great achievement. We will see why this is so as we look at the
details of BGP.
There are a few reasons why interdomain routing is hard. The first is simply
a matter of scale. An Internet backbone router must be able to forward any packet
destined anywhere in the Internet. That means having a routing table that will provide
a match for any valid IP address. While CIDR has helped to control the number of
distinct prefixes that are carried in the Internet’s backbone routing, there is inevitably
a lot of routing information to pass around—on the order of 140,000 prefixes at the
time of writing.
The second challenge in interdomain routing arises from the autonomous nature
of the domains. Note that each domain may run its own interior routing protocols and
use any scheme it chooses to assign metrics to paths. This means that it is impossible
to calculate meaningful path costs for a path that crosses multiple ASs. A cost of 1000
across one provider might imply a great path, but it might mean an unacceptably bad
one from another provider. As a result, interdomain routing advertises only “reachability.” The concept of reachability is basically a statement that “you can reach this
network through this AS.” This means that for interdomain routing to pick an optimal
path is essentially impossible.
The third challenge involves the issue of trust. Provider A might be unwilling to
believe certain advertisements from provider B for fear that provider B will advertise
erroneous routing information. For example, trusting provider B when he advertises
a great route to anywhere in the Internet can be a disastrous choice if provider B turns
out to have made a mistake configuring his routers or to have insufficient capacity to
carry the traffic.
Closely related to this issue is the need to support very flexible policies in interdomain routing. One common policy is the prevention of transit traffic. For example,
the multihomed corporation in Figure 4.29 may not wish to carry any traffic between
the two providers to whom it connects. As a more complex example, provider A might
wish to implement policies that say, “Use provider B only to reach these addresses,”
“Use the path that crosses the fewest number of ASs,” or “Use AS x in preference to
AS y.” The goal is to specify policies that lead to “good” paths, if not to optimal ones.
When configuring BGP, the administrator of each AS picks at least one node to
be a “BGP speaker,” which is essentially a spokesperson for the entire AS. That BGP
speaker establishes BGP sessions to other BGP speakers in other ASs. These sessions
are used to exchange reachability information among ASs.
In addition to the BGP speakers, the AS has one or more border “gateways,”
which need not be the same as the speakers. The border gateways are the routers
through which packets enter and leave the AS. In our simple example in Figure 4.28,
routers R2 and R4 would be border gateways. Note that we have avoided using the
word “gateway” until this point because it tends to be confusing. We can’t avoid it here,
given the name of the protocol we are describing. The important point to understand
here is that, in the context of interdomain routing, a border gateway is simply an IP
router that is charged with the task of forwarding packets between ASs.
BGP does not belong to either of the two main classes of routing protocols
(distance-vector and link-state protocols) described in Section 4.2. Unlike these
protocols, BGP advertises complete paths as an enumerated list of ASs to reach a
particular network. This is necessary to enable the sorts of policy decisions described
above to be made in accordance with the wishes of a particular AS. It also enables
routing loops to be readily detected.
Figure 4.30 Example of a network running BGP. (Diagram labels: backbone network, AS 1; regional provider A, AS 2; regional provider B, AS 3; customer P, AS 4; customer Q, AS 5; customer R, AS 6; customer S, AS 7.)
To see how this works, consider the example network in Figure 4.30. Assume
that the providers are transit networks, while the customer networks are stubs. A BGP
speaker for the AS of provider A (AS 2) would be able to advertise reachability information for each of the network numbers assigned to customers P and Q. Thus, it
would say, in effect, “The networks 128.96, 192.4.153, 192.4.32, and 192.4.3 can
be reached directly from AS 2.” The backbone network, on receiving this advertisement, can advertise, “The networks 128.96, 192.4.153, 192.4.32, and 192.4.3 can
be reached along the path AS 1, AS 2.” Similarly, it could advertise, “The networks
192.12.69, 192.4.54, and 192.4.23 can be reached along the path AS 1, AS 3.”
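The way an advertisement accumulates its AS path as it moves from AS 2 to AS 1 can be sketched in a few lines. This is only an illustration of the idea; the `Advertisement` tuple and the helper names `originate` and `readvertise` are assumptions made for the example and do not reflect the actual BGP message layout.

```python
from typing import List, Tuple

# (networks, AS path): a simplified stand-in for a BGP advertisement
Advertisement = Tuple[List[str], List[int]]


def originate(own_as: int, networks: List[str]) -> Advertisement:
    """Advertise networks that are reachable directly from own_as."""
    return (list(networks), [own_as])


def readvertise(own_as: int, advert: Advertisement) -> Advertisement:
    """Pass an advertisement on, placing own_as at the front of the AS path."""
    networks, path = advert
    return (list(networks), [own_as] + path)


from_as2 = originate(2, ["128.96", "192.4.153", "192.4.32", "192.4.3"])
from_as1 = readvertise(1, from_as2)
print(from_as1[1])  # [1, 2]
```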
An important job of BGP is to prevent the establishment of looping paths. For
example, consider three interconnected ASs, 1, 2, and 3. Suppose AS 1 learns that
it can reach network 10.0.1 through AS 2, so it advertises this fact to AS 3, who in
turn advertises it back to AS 2. AS 2 could now decide that AS 3 was the place to
send packets destined for 10.0.1; AS 3 sends them to AS 1; AS 1 sends them back to
AS 2; and they would loop forever. This is prevented by carrying the complete AS path
in the routing messages. In this case, the advertisement received by AS 2 from AS 3
would contain an AS path of AS 3, AS 1, AS 2. AS 2 sees itself in this path, and thus
concludes that this is not a useful path for it to use.
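The loop check described above reduces to a membership test on the AS path. The sketch below assumes the path is available as a simple list of AS numbers; the function name `is_loop_free` is hypothetical.

```python
from typing import Sequence


def is_loop_free(own_as: int, as_path: Sequence[int]) -> bool:
    """A received path is usable only if our own AS number is not already in it."""
    return own_as not in as_path


# AS 2 receives the path AS 3, AS 1, AS 2 from AS 3 and rejects it.
print(is_loop_free(2, [3, 1, 2]))  # False
# AS 3 receives the path AS 1, AS 2 from AS 1 and may use it.
print(is_loop_free(3, [1, 2]))     # True
```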
It should be apparent that the AS numbers carried in BGP need to be unique.
For example, AS 2 can only recognize itself in the AS path in the above example if
no other AS identifies itself in the same way. AS numbers are 16-bit numbers assigned
by a central authority to assure uniqueness. While 16 bits only allows about 65,000
ASs, which might not seem like a lot, we note that stub ASs do not need a unique AS
number, and this covers the overwhelming majority of nonprovider networks.
We should note that a given AS will only advertise routes that it considers good
enough for itself. That is, if a BGP speaker has a choice of several different routes to a
destination, it will choose the best one according to its own local policies, and then that
will be the route it advertises. Furthermore, a BGP speaker is under no obligation to
advertise any route to a destination, even if it has one. This is how an AS can implement
a policy of not providing transit—by refusing to advertise routes to prefixes that are
not contained within that AS, even if it knows how to reach them.
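As a rough illustration of the two points above (advertise only the route chosen by local policy, and withhold routes to avoid providing transit), consider the following sketch. The local policy used here, preferring the path with the fewest ASs, is just one assumed choice; the function name, the AS number 8, and the prefixes are hypothetical.

```python
from typing import Dict, List


def routes_to_advertise(own_as: int,
                        own_prefixes: List[str],
                        learned: Dict[str, List[List[int]]],
                        provide_transit: bool) -> Dict[str, List[int]]:
    """Return prefix -> AS path for everything this AS chooses to advertise.

    learned maps each external prefix to the candidate AS paths heard for it.
    """
    adverts: Dict[str, List[int]] = {}
    if provide_transit:
        for prefix, paths in learned.items():
            if paths:
                best = min(paths, key=len)  # assumed local policy: fewest ASs
                adverts[prefix] = [own_as] + best
    # Prefixes inside this AS are always advertised.
    for prefix in own_prefixes:
        adverts[prefix] = [own_as]
    return adverts


learned = {"128.96/16": [[3, 1, 2], [2]]}
print(routes_to_advertise(8, ["192.4.18/24"], learned, provide_transit=False))
# {'192.4.18/24': [8]}
print(routes_to_advertise(8, ["192.4.18/24"], learned, provide_transit=True))
# {'128.96/16': [8, 2], '192.4.18/24': [8]}
```

With `provide_transit=False` the AS still knows how to reach 128.96/16 but says nothing about it, so its neighbors never send it traffic bound for that prefix.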
In addition to advertising paths, BGP speakers need to be able to cancel previously
advertised paths if a critical link or node on a path goes down. This is done with a form
of negative advertisement known as a withdrawn route. Both positive and negative
reachability information are carried in a BGP update message, the format of which is
shown in Figure 4.31. (Note that the fields in this figure are multiples of 16 bits, unlike
other packet formats in this chapter.)
One point to note about BGP-4 is that it was designed to cope with the classless
addresses described in Section 4.3.2. This means that the “networks” that are advertised in BGP are actually prefixes of any length. Thus, the updates contain both the
prefix itself and its length in bits. When writing these down, it is common to write
prefix/length. For example, a CIDR prefix that begins 192.4.16 and is 20 bits long
would be written as 192.4.16/20.
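The prefix/length notation can be explored with Python's standard `ipaddress` module. This is a small sketch; the prefix is written out in full dotted form (192.4.16.0/20), and the two test addresses are chosen purely for illustration.

```python
import ipaddress

net = ipaddress.ip_network("192.4.16.0/20")
print(net.prefixlen)        # 20
print(net.network_address)  # 192.4.16.0
print(net.num_addresses)    # 4096

# The 20-bit prefix covers third-byte values 16 through 31.
print(ipaddress.ip_address("192.4.31.7") in net)  # True
print(ipaddress.ip_address("192.4.32.1") in net)  # False
```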
Figure 4.31 BGP-4 update packet format. (Field labels, in order: unfeasible routes; withdrawn routes; total path attribute length; path attributes; network layer reachability info.)
A final point to note is that BGP is defined to run on top of TCP, the reliable
transport protocol described in Section 5.2. Because BGP speakers can count on TCP
to be reliable, this means that any information that has been sent from one speaker to
another does not need to be sent again. Thus, as long as nothing has changed, a BGP
speaker can simply send an occasional “keepalive” message that says, in effect, “I’m
still here and nothing has changed.” If that router were to crash, it would stop sending
the keepalives, and the other routers that had learned routes from it would know that
those routes were no longer valid.
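The effect of keepalives on route validity can be sketched with a simple timer table. This is an illustration of the idea only, not the BGP state machine; the class name `NeighborRoutes`, the 90-second hold time, and the neighbor name are assumptions made for the example.

```python
from typing import Dict, List, Set


class NeighborRoutes:
    """Track routes per neighbor and drop them when the neighbor goes silent."""

    def __init__(self, hold_time: float) -> None:
        self.hold_time = hold_time
        self.last_heard: Dict[str, float] = {}
        self.routes: Dict[str, Set[str]] = {}

    def update(self, neighbor: str, prefix: str, now: float) -> None:
        """Record a route learned from a neighbor; sent once, never repeated."""
        self.routes.setdefault(neighbor, set()).add(prefix)
        self.last_heard[neighbor] = now

    def keepalive(self, neighbor: str, now: float) -> None:
        """'I'm still here and nothing has changed.'"""
        self.last_heard[neighbor] = now

    def expire(self, now: float) -> List[str]:
        """Return prefixes that became invalid because a neighbor went silent."""
        dead = [n for n, t in self.last_heard.items() if now - t > self.hold_time]
        invalid: List[str] = []
        for n in dead:
            invalid.extend(sorted(self.routes.pop(n, set())))
            del self.last_heard[n]
        return invalid


table = NeighborRoutes(hold_time=90.0)
table.update("peer-A", "192.4.54/24", now=0.0)
table.keepalive("peer-A", now=60.0)
print(table.expire(now=120.0))  # []
print(table.expire(now=200.0))  # ['192.4.54/24']
```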
We will not delve further into the details of BGP-4, except to point out that all
the protocol does is specify how reachability information should be exchanged among
autonomous systems. BGP speakers obtain enough information by this exchange to
calculate loop-free routes to all reachable networks, but how they choose the “best”
routes is largely left to the policies of the AS.
Let’s return to the real question: How does all this help us to build scalable
networks? First, the number of nodes participating in BGP is on the order of the
number of ASs, which is much smaller than the number of networks. Second, finding
a good interdomain route is only a matter of finding a path to the right border router,
of which there are only a few per AS. Thus, we have neatly subdivided the routing
problem into manageable parts, once again using a new level of hierarchy to increase
scalability. The complexity of interdomain routing is now on the order of the number
of ASs, and the complexity of intradomain routing is on the order of the number of
networks in a single AS.
Integrating Interdomain and Intradomain Routing
While the preceding discussion illustrates how a BGP speaker learns interdomain routing information, the question still remains as to how all the other routers in a domain
get this information. There are several ways this problem can be addressed.
We have already alluded to one very simple situation, which is also very common.
In the case of a stub AS that only connects to other ASs at a single point, the border
router is clearly the only choice for all routes that are outside the AS. Such a router
can “inject” a default route into the intradomain routing protocol. In effect, this is a
statement that any network that has not been explicitly advertised in the intradomain
protocol is reachable through the border router. Recall from the discussion of IP forwarding in Section 4.1 that the default entry in the forwarding table comes after all
the more specific entries, and it matches anything that failed to match a specific entry.
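A default route behaves like a zero-length prefix that matches every address but loses to any more specific match. The sketch below illustrates this with a longest-prefix lookup; the table contents and next-hop names are hypothetical.

```python
import ipaddress
from typing import List, Optional, Tuple

Table = List[Tuple[str, str]]  # (prefix in CIDR form, next hop)


def lookup(table: Table, destination: str) -> Optional[str]:
    """Return the next hop of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    best_len, best_hop = -1, None
    for prefix, hop in table:
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_hop = net.prefixlen, hop
    return best_hop


table = [
    ("192.4.54.0/24", "R3"),
    ("128.96.0.0/16", "R2"),
    ("0.0.0.0/0", "border router"),  # the injected default route
]
print(lookup(table, "192.4.54.9"))  # R3
print(lookup(table, "18.26.0.1"))   # border router
```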
The next step up in complexity is to have the border routers inject specific routes
they have learned from outside the AS. Consider, for example, the border router of a
provider AS that connects to a customer AS. That router could learn that the network
prefix 192.4.54/24 is located inside the customer AS, either through BGP or because
the information is configured into the border router. It could inject a route to that
prefix into the routing protocol running inside the provider AS. This would be an
advertisement of the sort “I have a link to 192.4.54/24 of cost X.” This would cause
other routers in the provider AS to learn that this border router is the place to send
packets destined for that prefix.
The final level of complexity comes in backbone networks, which learn so much
routing information from BGP that it becomes too costly to inject it into the intradomain protocol. For example, if a border router wants to inject 10,000 prefixes that
it learned about from another AS, it will have to send very big link-state packets to
the other routers in that AS, and their shortest-path calculations are going to become
very complex. For this reason, the routers in a backbone network use a variant of BGP
called interior BGP (IBGP) to effectively redistribute the information that is learned
by the BGP speakers at the edges of the AS to all the other routers in the AS. IBGP
enables any router in the AS to learn the best border router to use when sending a
packet to any address. At the same time, each router in the AS keeps track of how to
get to each border router using a conventional intradomain protocol with no injected
information. By combining these two sets of information, each router in the AS is able
to determine the appropriate next hop for all prefixes.
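Combining the two sets of information is a two-step lookup: first the prefix is mapped to a border router (learned through IBGP), then the border router is mapped to a next hop (learned through the intradomain protocol). The following sketch assumes both tables are plain dictionaries; all router names and prefixes are hypothetical.

```python
from typing import Dict


def next_hop_for(prefix: str,
                 best_border: Dict[str, str],
                 igp_next_hop: Dict[str, str]) -> str:
    """Resolve a prefix to a next hop inside the AS."""
    border = best_border[prefix]   # from IBGP: which border router to exit through
    return igp_next_hop[border]    # from the intradomain protocol: how to get there


best_border = {"128.96.0.0/16": "B1", "192.4.54.0/24": "B2"}
igp_next_hop = {"B1": "R5", "B2": "R7"}
print(next_hop_for("192.4.54.0/24", best_border, igp_next_hop))  # R7
```

The intradomain table stays small because it holds one entry per border router rather than one per external prefix.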
Routing Areas
As if we didn’t already have enough hierarchy, link-state intradomain routing protocols
provide a means to partition a routing domain into subdomains called areas. (The
terminology varies somewhat among protocols; we use the OSPF terminology here.)
By adding this extra level of hierarchy, we enable single domains to grow larger without
overburdening the intradomain routing protocols.
An area is a set of routers that are administratively configured to exchange link-state information with each other. There is one special area, the backbone area, also
known as area 0. An example of a routing domain divided into areas is shown in
Figure 4.32. Routers R1, R2, and R3 are members of the backbone area. They are
also members of at least one nonbackbone area; R1 is actually a member of both area 1
and area 2. A router that is a member of both the backbone area and a nonbackbone
area is an area border router (ABR). Note that these are distinct from the routers
that are at the edge of an AS, which are referred to as AS border routers for clarity.
Routing within a single area is exactly as described in Section 4.2.3. All the
routers in the area send link-state advertisements to each other, and thus develop a
complete, consistent map of the area. However, the link-state advertisements of routers
that are not area border routers do not leave the area in which they originated. This has
the effect of making the flooding and route calculation processes considerably more
scalable. For example, router R4 in area 3 will never see a link-state advertisement
from router R8 in area 1. As a consequence, it will know nothing about the detailed
topology of areas other than its own.
Figure 4.32 A domain divided into areas. (Diagram labels: area 0, area 1, area 2, area 3.)
How, then, does a router in one area determine the right next hop for a packet
destined to a network in another area? The answer to this becomes clear if we imagine
the path of a packet that has to travel from one nonbackbone area to another as being
split into three parts. First, it travels from its source network to the backbone area,
then it crosses the backbone, then it travels from backbone to destination network.
To make this work, the area border routers summarize routing information that they
have learned from one area and make it available in their advertisements to other
areas. For example, R1 receives link-state advertisements from all the routers in area 1
and can thus determine the cost of reaching any network in area 1. When R1 sends
link-state advertisements into area 0, it advertises the costs of reaching the networks
in area 1 much as if all those networks were directly connected to R1. This enables all
the area 0 routers to learn the cost to reach all networks in area 1. The area border
routers then summarize this information and advertise it into the nonbackbone areas.
Thus, all routers learn how to reach all networks in the domain.
Note that in the case of area 2, there are two ABRs, and that routers in area 2
will thus have to make a choice as to which one they use to reach the backbone. This
is easy enough, since both R1 and R2 will be advertising costs to various networks,
so that it will become clear which is the better choice as the routers in area 2 run their
shortest-path algorithm. For example, it is pretty clear that R1 is going to be a better
choice than R2 for destinations in area 1.
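The choice between two ABRs can be sketched as adding the cost of reaching each ABR to the cost that ABR advertises, and keeping the smaller total. The costs and network names below are invented for illustration, and `choose_abr` is a hypothetical helper rather than part of OSPF.

```python
from typing import Dict, Tuple


def choose_abr(cost_to_abr: Dict[str, int],
               advertised: Dict[str, Dict[str, int]]) -> Dict[str, Tuple[str, int]]:
    """For each network, pick the ABR giving the lowest total cost.

    cost_to_abr: ABR -> cost of reaching it inside our own area.
    advertised:  ABR -> {network: cost that ABR advertises for it}.
    """
    best: Dict[str, Tuple[str, int]] = {}
    for abr, summaries in advertised.items():
        for network, cost in summaries.items():
            total = cost_to_abr[abr] + cost
            if network not in best or total < best[network][1]:
                best[network] = (abr, total)
    return best


cost_to_abr = {"R1": 2, "R2": 2}
advertised = {
    "R1": {"net-in-area-1": 1, "net-in-area-3": 6},
    "R2": {"net-in-area-1": 5, "net-in-area-3": 3},
}
print(choose_abr(cost_to_abr, advertised))
# {'net-in-area-1': ('R1', 3), 'net-in-area-3': ('R2', 5)}
```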
When dividing a domain into areas, the network administrator makes a trade-off between scalability and optimality of routing. The use of areas forces all packets
traveling from one area to another to go via the backbone area, even if a shorter path
might have been available. For example, even if R4 and R5 were directly connected,
packets would not flow between them because they are in different nonbackbone areas.
It turns out that the need for scalability is often more important than the need to use
the absolute shortest path.
This illustrates an important principle in network design. There is frequently
a trade-off between some sort of optimality and scalability. When hierarchy is introduced, information is hidden from some nodes in the network, hindering their
ability to make perfectly optimal decisions. However, information hiding is essential
to scalability, since it saves all nodes from having global knowledge. It is invariably
true in large networks that scalability is a more pressing design goal than perfect optimality.
Finally, we note that there is a trick by which network administrators can more
flexibly decide which routers go in area 0. This trick uses the idea of a “virtual link”
between routers. Such a virtual link is obtained by configuring a router that is not
directly connected to area 0 to exchange backbone routing information with a router
that is. For example, a virtual link could be configured from R8 to R1, thus making R8
part of the backbone. R8 would now participate in link-state advertisement flooding
with the other routers in area 0. The cost of the virtual link from R8 to R1 is determined
by the exchange of routing information that takes place in area 1. This technique can
help to improve the optimality of routing.
IP Version 6 (IPv6)
In many respects, the motivation for a new version of IP is the same as the motivation
for the techniques described in the last section: to deal with scaling problems caused
by the Internet’s massive growth. Subnetting and CIDR have helped to contain the rate
at which the Internet address space is being consumed (the address depletion problem)
and have also helped to control the growth of routing table information needed in
the Internet’s routers (the routing information problem). However, there will come a
point at which these techniques are no longer adequate. In particular, it is virtually
impossible to achieve 100% address utilization efficiency, so the address space will
be exhausted well before the four-billionth host is connected to the Internet. Even if
we were able to use all four billion addresses, it’s not too hard to imagine ways that
number could be exhausted, such as the assignment of IP addresses to set-top boxes for
cable TV or to electricity meters. All of these possibilities argue that a bigger address
space than that provided by 32 bits will eventually be needed.
Historical Perspective
The IETF began looking at the problem of expanding the IP address space in 1991,
and several alternatives were proposed. Since the IP address is carried in the header
of every IP packet, increasing the size of the address dictates a change in the packet
header. This means a new version of the Internet Protocol, and as a consequence, a
need for new software for every host and router in the Internet. This is clearly not a
trivial matter—it is a major change that needs to be thought about very carefully.
The effort to define a new version of IP was known as IP Next Generation, or
IPng. As the work progressed, an official IP version number was assigned, so IPng
is now known as IPv6. Note that the version of IP discussed so far in this chapter
is version 4 (IPv4). The apparent discontinuity in numbering is the result of version
number 5 being used for an experimental protocol some years ago.
The significance of the change to a new version of IP caused a snowball effect.
The general feeling among network designers was that if you are going to make a
change of this magnitude, you might as well fix as many other things in IP as possible
at the same time. Consequently, the IETF solicited white papers from anyone who
cared to write one, asking for input on the features that might be desired in a new
version of IP. In addition to the need to accommodate scalable routing and addressing,
some of the other wish list items for IPng were
■ support for real-time services
■ security support
■ autoconfiguration (i.e., the ability of hosts to automatically configure themselves with such information as their own IP address and domain name)
■ enhanced routing functionality, including support for mobile hosts
It is interesting to note that while many of these features were absent from IPv4 at the
time IPv6 was being designed, support for all of them has made its way into IPv4 in
recent years.
In addition to the wish list, one absolutely nonnegotiable feature for IPng was
that there must be a transition plan to move from the current version of IP (version 4)
to the new version. With the Internet being so large and having no centralized control,
it would be completely impossible to have a “flag day” on which everyone shut down
their hosts and routers and installed a new version of IP. Thus, there will probably be
a long transition period in which some hosts and routers will run IPv4 only, some will
run IPv4 and IPv6, and some will run IPv6 only.
The IETF appointed a committee called the IPng Directorate to collect all the
inputs on IPng requirements and to evaluate proposals for a protocol to become IPng.
Over the life of this committee there were a number of proposals, some of which
merged with other proposals, and eventually one was chosen by the Directorate to be
the basis for IPng. That proposal was called SIPP (Simple Internet Protocol Plus). SIPP
originally called for a doubling of the IP address size to 64 bits. When the Directorate
selected SIPP, they stipulated several changes, one of which was another doubling of
the address to 128 bits (16 bytes). It was around this time that the version number 6
was assigned. The rest of this section describes some of the main features of IPv6. At
the time of this writing, most of the key specifications for IPv6 are Proposed or Draft
Standards in the IETF.
Addresses and Routing
First and foremost, IPv6 provides a 128-bit address space, as opposed to the 32 bits of
version 4. Thus, while version 4 can potentially address four billion nodes if address
assignment efficiency reaches 100%, IPv6 can address 3.4 × 10^38 nodes, again assuming
100% efficiency. As we have seen, though, 100% efficiency in address assignment is
not likely. Some analysis of other addressing schemes, such as those of the French
and U.S. telephone networks, as well as that of IPv4, have turned up some empirical
numbers for address assignment efficiency. Based on the most pessimistic estimates of
efficiency drawn from this study, the IPv6 address space is predicted to provide over
1500 addresses per square foot of the earth’s surface, which certainly seems like it
should serve us well even when toasters on Venus have IP addresses.
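The raw sizes of the two address spaces are easy to check with integer arithmetic. The short sketch below assumes 100% assignment efficiency, as the text does for these figures; it does not attempt to reproduce the per-square-foot estimate, which depends on efficiency numbers not given here.

```python
ipv4_space = 2 ** 32
ipv6_space = 2 ** 128

print(ipv4_space)             # 4294967296 (about four billion)
print(f"{ipv6_space:.1e}")    # 3.4e+38

# The IPv6 space is 2^96 times larger than the IPv4 space.
print(ipv6_space // ipv4_space == 2 ** 96)  # True
```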
Address Space Allocation
IPv6 addresses do not have classes, but the address space is still subdivided in various ways based on the leading bits. Rather than specifying different address classes,
the leading bits specify different uses of the IPv6 address. The current assignment of
prefixes is listed in Table 4.11.
This allocation of the address space turns out to be easier to explain than it
looks. First, the entire functionality of IPv4’s three main address classes (A, B, and C)
is contained inside the 001 prefix. Aggregatable Global Unicast Addresses, as we will
see shortly, are a lot like classless IPv4 addresses, only much longer. These are the main
ones of interest at this point, with one-eighth of the address space allocated to this
important form of address. Obviously, large chunks of address space have been left
unassigned to allow for future growth and new features. Two portions of the address
space (0000 001 and 0000 010) have been reserved for encoding of other (non-IP)
address schemes. NSAP addresses are used by the ISO protocols, and IPX addresses
are used by Novell’s network-layer protocol.
The idea behind “link local use” addresses is to enable a host to construct an
address that will work on the network to which it is connected without being concerned
about global uniqueness of the address. This may be useful for autoconfiguration, as
we will see below. Similarly, the “site local use” addresses are intended to allow valid
addresses to be constructed on a site (e.g., a private corporate network) that is not
connected to the larger Internet; again, global uniqueness need not be an issue.
Prefix          Use
0000 0000       Reserved
0000 0001       Unassigned
0000 001        Reserved for NSAP allocation
0000 010        Reserved for IPX allocation
0000 011        Unassigned
0000 1          Unassigned
001             Aggregatable Global Unicast Addresses
1111 0          Unassigned
1111 10         Unassigned
1111 110        Unassigned
1111 1110 0     Unassigned
1111 1110 10    Link local use addresses
1111 1110 11    Site local use addresses
1111 1111       Multicast addresses
Table 4.11 Address prefix assignments for IPv6.
Finally, the multicast address space is
for multicast, thereby serving the same role
as class D addresses in IPv4. Note that multicast addresses are easy to distinguish—
they start with a byte of all 1s. We will see
how these addresses are used in Section 4.4.
Within the reserved address space (addresses beginning with a byte of 0s) are some
important special types of addresses. A node
may be assigned an “IPv4-compatible IPv6
address” by zero-extending a 32-bit IPv4
address to 128 bits. A node that is only capable of understanding IPv4 can be assigned
an “IPv4-mapped IPv6 address” by prefixing the 32-bit IPv4 address with 2 bytes of
all 1s and then zero-extending the result to
128 bits. These two special address types
have uses in the IPv4-to-IPv6 transition (see
the sidebar on this topic).
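The two embedded-address constructions can be sketched with Python's standard `ipaddress` module. The helper names are hypothetical, and 192.0.2.1 is simply an address reserved for documentation; the 128-bit values are shown as raw hexadecimal bytes to make the layout visible.

```python
import ipaddress


def ipv4_compatible(v4: str) -> ipaddress.IPv6Address:
    """Zero-extend a 32-bit IPv4 address to 128 bits."""
    return ipaddress.IPv6Address(int(ipaddress.IPv4Address(v4)))


def ipv4_mapped(v4: str) -> ipaddress.IPv6Address:
    """Prefix the IPv4 address with 2 bytes of all 1s, then zero-extend."""
    return ipaddress.IPv6Address((0xFFFF << 32) | int(ipaddress.IPv4Address(v4)))


print(ipv4_compatible("192.0.2.1").packed.hex())
# 000000000000000000000000c0000201
print(ipv4_mapped("192.0.2.1").packed.hex())
# 00000000000000000000ffffc0000201

# The embedded IPv4 address can be recovered from the mapped form.
print(ipv4_mapped("192.0.2.1").ipv4_mapped)  # 192.0.2.1
```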
Transition from IPv4 to IPv6
The most important idea behind the transition from IPv4 to IPv6 is that the Internet is
far too big and decentralized to have a “flag day,” that is, one specified day on which
every host and router is upgraded from IPv4 to IPv6. Thus, IPv6 needs to be deployed
incrementally in such a way that hosts and routers that only understand IPv4 can continue
to function for as long as possible. Ideally, IPv4 nodes should be able to talk to other
IPv4 nodes and some set of other IPv6-capable nodes indefinitely. Also, IPv6 hosts should
be capable of talking to other IPv6 nodes even when some of the infrastructure between
them may only support IPv4. Two major mechanisms have been defined to help this
transition: dual-stack operation and tunneling.
The idea of dual stacks is fairly straightforward: IPv6 nodes run both IPv6 and IPv4 and
use the Version field to decide which stack should process an arriving packet. In this
case, the IPv6 address could be unrelated to the IPv4 address, or it could be the
“IPv4-mapped IPv6 address” described earlier in this section.
The basic tunneling technique, in which an IP packet is sent as the payload of another
IP packet, was described in Section 4.1. For IPv6 transition, tunneling is used to send
an IPv6 packet over a piece of the network that only understands IPv4. This means that
the IPv6 packet is encapsulated within an IPv4 header that has the address of the tunnel
endpoint in its header, is transmitted across the IPv4-only piece of network, and then is
decapsulated at the endpoint. The endpoint could be either a router or a host; in either
case, it must be IPv6-capable to be able to process the IPv6 packet after decapsulation.
If the endpoint is a host with an IPv4-mapped IPv6 address, then tunneling can be done
automatically, by extracting the IPv4 address from the IPv6 address and using it to form
the IPv4 header. Otherwise, the tunnel must be configured manually. In this case, the
encapsulating node needs to know the IPv4 address of the other end of the tunnel, since
it cannot be extracted from the IPv6 header. From the perspective of IPv6, the other end
of the tunnel looks like a regular IPv6 node that is just one hop away, even though there
may be many hops of IPv4 infrastructure between the tunnel endpoints.
Address Notation
Just as with IPv4, there is some special notation for writing down IPv6 addresses. The
standard representation is x:x:x:x:x:x:x:x, where each “x” is a hexadecimal
representation of a 16-bit piece of the address. An example would be
Any IPv6 address can be written using this notation. Since there are a few special
types of IPv6 addresses, there are some special notations that may be helpful in
certain circumstances. For example, an address with a large number of contiguous 0s
can be written more compactly by omitting all the 0 fields. Thus
could be written
Clearly, this form of shorthand can only be used for one set of contiguous 0s in an
address to avoid ambiguity.
Since there are two types of IPv6
addresses that contain an embedded IPv4
address, these have their own special notation that makes extraction of the IPv4
address easier. For example, the “IPv4-mapped IPv6 address” of a host whose IPv4
address was could be written
That is, the last 32 bits are written in IPv4
notation, rather than as a pair of hexadecimal numbers separated by a colon. Note
that the double colon at the front indicates
the leading 0s.
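The notations described in this section can be tried out with Python's standard `ipaddress` module. The addresses below are hypothetical values drawn from ranges reserved for documentation; they are not the examples from the original text.

```python
import ipaddress

# Full form: eight 16-bit pieces written in hexadecimal.
addr = ipaddress.IPv6Address("2001:0db8:0000:0000:0000:0000:0000:0042")
print(addr.exploded)    # 2001:0db8:0000:0000:0000:0000:0000:0042
print(addr.compressed)  # 2001:db8::42

# The last 32 bits may be written in IPv4 notation; both spellings
# below denote the same 128-bit value.
mixed = ipaddress.IPv6Address("::ffff:192.0.2.1")
plain = ipaddress.IPv6Address("::ffff:c000:201")
print(mixed == plain)   # True

# The "::" shorthand may appear only once, otherwise the address is ambiguous.
try:
    ipaddress.IPv6Address("2001::db8::42")
except ValueError:
    print("rejected")   # rejected
```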
Aggregatable Global Unicast
By far the most important thing that IPv6
must provide when it is deployed is plain
old unicast addressing. It must do this in a
way that supports the rapid rate of addition
of new hosts to the Internet and that allows
routing to be done in a scalable way as the
number of physical networks in the Internet grows. Thus, at the heart of IPv6 is the
unicast address allocation plan that determines how addresses beginning with the 001
prefix will be assigned to service providers, autonomous systems, networks, hosts, and
In fact, the address allocation plan that is proposed for IPv6 unicast addresses
is extremely similar to that being deployed with CIDR in IPv4. To understand how
it works and how it provides scalability, it is helpful to define some new terms. We
may think of a nontransit AS (i.e., a stub or multihomed AS) as a subscriber, and we
may think of a transit AS as a provider. Furthermore, we may subdivide providers into
direct and indirect. The former are directly connected to subscribers. The latter primarily connect other providers, are not connected directly to subscribers, and are often
known as backbone networks.
With this set of definitions, we can see that the Internet is not just an arbitrarily
interconnected set of ASs; it has some intrinsic hierarchy. The difficulty is in making
use of this hierarchy without inventing mechanisms that fail when the hierarchy is not
strictly observed, as happened with EGP. For example, the distinction between direct
and indirect providers becomes blurred when a subscriber connects to a backbone or
when a direct provider starts connecting to many other providers.
As with CIDR, the goal of the IPv6 address allocation plan is to provide aggregation of routing information to reduce the burden on intradomain routers. Again,
the key idea is to use an address prefix—a set of contiguous bits at the most significant
end of the address—to aggregate reachability information to a large number of networks and even to a large number of ASs. The main way to achieve this is to assign an
address prefix to a direct provider and then for that direct provider to assign longer
prefixes that begin with that prefix to its subscribers. This is exactly what we observed
in Figure 4.27. Thus, a provider can advertise a single prefix for all of its subscribers.
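The prefix assignment just described can be sketched with Python's ipaddress module. This is only an illustration under stated assumptions: the provider prefix (taken from the documentation range), the /48 subscriber length, and the helper name advertised_prefix are all hypothetical, not part of any allocation plan in the text.

```python
import ipaddress
from itertools import islice

# Hypothetical prefix assigned to a direct provider.
provider = ipaddress.ip_network("2001:db8::/32")

# The provider hands each subscriber a longer prefix that begins with its own.
subscribers = list(islice(provider.subnets(new_prefix=48), 3))

def advertised_prefix(provider_prefix, subscriber_prefixes):
    """Return the single prefix the provider advertises, after checking
    that every subscriber prefix really falls inside it."""
    for prefix in subscriber_prefixes:
        if not prefix.subnet_of(provider_prefix):
            raise ValueError(f"{prefix} cannot be aggregated under {provider_prefix}")
    return provider_prefix

print([str(s) for s in subscribers])
print(advertised_prefix(provider, subscribers))
```

A subscriber whose prefix came from a different provider fails the check, which mirrors why such a prefix would have to be advertised separately.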
Of course, the drawback is that if a site decides to change providers, it will
need to obtain a new address prefix and renumber all the nodes in the site. This
could be a colossal undertaking, enough to dissuade most people from ever changing
providers. For this reason, there is ongoing research on other addressing schemes, such
as geographic addressing, in which a site’s address is a function of its location rather
than the provider to which it attaches. At present, however, provider-based addressing
is necessary to make routing work efficiently.
Note that while IPv6 address assignment is essentially equivalent to the way
address assignment has happened in IPv4 since the introduction of CIDR, IPv6 has
the significant advantage of not having a large installed base of assigned addresses to
fit into its plans.
One question is whether it makes sense for hierarchical aggregation to take place
at other levels in the hierarchy. For example, should all providers obtain their address
prefixes from within a prefix allocated to the backbone to which they connect? Given
that most providers connect to multiple backbones, this probably doesn’t make sense.
Figure 4.33 An IPv6 provider-based unicast address.
Also, since the number of providers is much smaller than the number of sites, the
benefits of aggregating at this level are much less.
One place where aggregation may make sense is at the national or continental
level. Continental boundaries form natural divisions in the Internet topology, and
if all addresses in Europe, for example, had a common prefix, then a great deal of
aggregation could be done, so that most routers in other continents would only need
one routing table entry for all networks with the Europe prefix. Providers in Europe
would all select their prefixes such that they began with the European prefix. Using
this scheme, an IPv6 address might look like Figure 4.33. The RegistryID might be an
identifier assigned to a European address registry, with different IDs assigned to other
continents or countries. Note that prefixes would be of different lengths under this
scenario. For example, a provider with few customers could have a longer prefix (and
thus less total address space available) than one with many customers.
One tricky situation could occur when a subscriber is connected to more than
one provider. Which prefix should the subscriber use for his or her site? There is no
perfect solution to the problem. For example, suppose a subscriber is connected to two
providers X and Y. If the subscriber takes his prefix from X, then Y has to advertise a
prefix that has no relationship to its other subscribers and that as a consequence cannot
be aggregated. If the subscriber numbers part of his AS with the prefix of X and part
with the prefix of Y, he runs the risk of having half his site become unreachable if the
connection to one provider goes down. One solution that works fairly well if X and Y
have a lot of subscribers in common is for them to have three prefixes between them:
one for subscribers of X only, one for subscribers of Y only, and one for the sites that
are subscribers of both X and Y.
Packet Format
Despite the fact that IPv6 extends IPv4 in several ways, its header format is actually
simpler. This simplicity is due to a concerted effort to remove unnecessary functionality
from the protocol. Figure 4.34 shows the result. (For comparison with IPv4, see the
header format shown in Figure 4.3.)
As with many headers, this one starts with a Version field, which is set to 6 for
IPv6. The Version field is in the same place relative to the start of the header as IPv4’s
Version field so that header-processing software can immediately decide which header
format to look for. The TrafficClass and FlowLabel fields both relate to quality of
service issues, as discussed in Section 6.5.
Figure 4.34 IPv6 packet header.
The PayloadLen field gives the length of the packet, excluding the IPv6 header,
measured in bytes. The NextHeader field cleverly replaces both the IP options and
the Protocol field of IPv4. If options are required, then they are carried in one or
more special headers following the IP header, and this is indicated by the value of the
NextHeader field. If there are no special headers, the NextHeader field is the demux
key identifying the higher-level protocol running over IP (e.g., TCP or UDP); that is, it
serves the same purpose as the IPv4 Protocol field. Also, fragmentation is now handled
as an optional header, which means that the fragmentation-related fields of IPv4 are
not included in the IPv6 header. The HopLimit field is simply the TTL of IPv4, renamed
to reflect the way it is actually used.
Finally, the bulk of the header is taken up with the source and destination addresses, each of which is 16 bytes (128 bits) long. Thus, the IPv6 header is always
40 bytes long. Considering that IPv6 addresses are four times longer than those of
IPv4, this compares quite well with the IPv4 header, which is 20 bytes long in the
absence of options.
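As a rough illustration of how simple the fixed header is to process, the following sketch packs and unpacks the 40 bytes in Python. The field widths used here (4-bit Version, 8-bit TrafficClass, 20-bit FlowLabel, 16-bit PayloadLen, 8-bit NextHeader, 8-bit HopLimit) are taken from the IPv6 specification rather than from the text above, and the function names are hypothetical.

```python
import ipaddress
import struct

IPV6_HEADER_LEN = 40  # fixed size in bytes

def build_ipv6_header(payload_len, next_header, hop_limit, src, dst,
                      traffic_class=0, flow_label=0):
    """Pack the fields into a 40-byte header (Version is always 6)."""
    word = (6 << 28) | (traffic_class << 20) | flow_label
    return (struct.pack("!IHBB", word, payload_len, next_header, hop_limit)
            + ipaddress.IPv6Address(src).packed
            + ipaddress.IPv6Address(dst).packed)

def parse_ipv6_header(packet):
    """Split the fixed 40-byte IPv6 header into its fields."""
    if len(packet) < IPV6_HEADER_LEN:
        raise ValueError("packet shorter than the IPv6 header")
    # First 32 bits: Version (4), TrafficClass (8), FlowLabel (20).
    word, payload_len, next_header, hop_limit = struct.unpack("!IHBB", packet[:8])
    return {
        "Version": word >> 28,
        "TrafficClass": (word >> 20) & 0xFF,
        "FlowLabel": word & 0xFFFFF,
        "PayloadLen": payload_len,
        "NextHeader": next_header,
        "HopLimit": hop_limit,
        "SourceAddress": ipaddress.IPv6Address(packet[8:24]),
        "DestinationAddress": ipaddress.IPv6Address(packet[24:40]),
    }
```

Because there are no variable-length fields, the parser needs no loop and no length calculation; every field sits at a fixed offset.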
The way that IPv6 handles options is quite an improvement over IPv4. In IPv4,
if any options were present, every router had to parse the entire options field to see if
any of the options were relevant. This is because the options were all buried at the end
of the IP header, as an unordered collection of type, length, value tuples. In contrast,
IPv6 treats options as extension headers that must, if present, appear in a specific
order. This means that each router can quickly determine if any of the options are
relevant to it; in most cases, they will not be. Usually, this can be determined by just
looking at the NextHeader field. The end result is that option processing is much more
efficient in IPv6, which is an important factor in router performance. In addition, the
new formatting of options as extension headers means that they can be of arbitrary
length, whereas in IPv4 they were limited to 40 bytes at most. We will see how some
of the options are used below.
Each option has its own type of extension header. The type of each extension
header is identified by the value of the NextHeader field in the header that precedes it,
and each extension header contains a NextHeader field to identify the header following
it. The last extension header will be followed by a transport-layer header (e.g., TCP),
and in this case the value of the NextHeader field is the same as the value of the
Protocol field in an IPv4 header. Thus, the NextHeader field does double duty; either
it may identify the type of extension header to follow, or, in the last extension header,
it serves as a demux key to identify the higher-layer protocol running over IPv6.
Consider the example of the fragmentation header, shown in Figure 4.35. This
header provides functionality similar to the fragmentation fields in the IPv4 header described in Section 4.1.2, but it is only present if fragmentation is necessary. Assuming it
is the only extension header present, then the NextHeader field of the IPv6 header would
contain the value 44, which is the value assigned to indicate the fragmentation header.
The NextHeader field of the fragmentation header itself contains a value describing the
header that follows it. Again, assuming no other extension headers are present, then
the next header might be the TCP header, which results in NextHeader containing
the value 6, just as the Protocol field would in IPv4. If the fragmentation header were
followed by, say, an authentication header, then the fragmentation header’s NextHeader
field would contain the value 51.
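The chaining of NextHeader values in this example can be sketched as follows. The sketch understands only the fragmentation header and assumes, beyond what the text states, that this header is a fixed 8 bytes whose first byte is its own NextHeader field; any other value simply ends the walk. The function name header_chain is hypothetical.

```python
FRAGMENT = 44        # fragmentation header (value given in the text)
AUTHENTICATION = 51  # authentication header (value given in the text)
TCP = 6              # same value as the IPv4 Protocol field

# Assumption for this sketch: the fragmentation header is a fixed 8 bytes
# and its first byte is its own NextHeader field.
FRAGMENT_HEADER_LEN = 8

def header_chain(first_next_header, payload):
    """Follow NextHeader values, starting from the IPv6 header's own field.

    Only the fragmentation header is parsed here; any other value is
    treated as the end of the walk.
    """
    chain = [first_next_header]
    offset = 0
    current = first_next_header
    while current == FRAGMENT:
        if len(payload) < offset + FRAGMENT_HEADER_LEN:
            raise ValueError("truncated fragmentation header")
        current = payload[offset]  # NextHeader of this extension header
        chain.append(current)
        offset += FRAGMENT_HEADER_LEN
    return chain

# A fragmentation header whose NextHeader says "TCP follows".
fragment_header = bytes([TCP, 0, 0, 0, 0, 0, 0, 1])
print(header_chain(FRAGMENT, fragment_header + b"tcp segment"))
```

A fuller parser would also need the length rules for the other extension headers, which are not covered here.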
Figure 4.35 IPv6 fragmentation extension header.
While the Internet’s growth has been impressive, one factor that has inhibited faster
acceptance of the technology is the fact that
getting connected to the Internet has typically required a fair amount of system administration expertise. In particular, every
host that is connected to the Internet needs
to be configured with a certain minimum
amount of information, such as a valid IP
address, a subnet mask for the link to which
it attaches, and the address of a name server.
Thus, it has not been possible to unpack a
new computer and connect it to the Internet
without some preconfiguration. One goal of
IPv6, therefore, is to provide support for
autoconfiguration, sometimes referred to as
“plug-and-play” operation.
As we saw in Section 4.1.6, autoconfiguration is possible for IPv4, but
it depends on the existence of a server
that is configured to hand out addresses
and other configuration information to
DHCP clients. The longer address format
in IPv6 helps provide a useful, new form
of autoconfiguration called stateless autoconfiguration, which does not require a server.
Recall that IPv6 unicast addresses are
hierarchical, and that the least significant
portion is the interface ID. Thus, we can
subdivide the autoconfiguration problem
into two parts:
1 Obtain an interface ID that is unique
on the link to which the host is attached.
2 Obtain the correct address prefix for
this subnet.
Network Address Translation
While IPv6 was motivated by a
concern that increased usage of IP
would lead to exhaustion of the
address space, another technology
has become popular as a way
to conserve IP address space. That
technology is network address
translation (NAT), and it is possible that its widespread use will
significantly delay the need to deploy IPv6. NAT is often viewed as
“architecturally impure,” but it is
also a fact of networking life that
cannot be ignored.
The basic idea behind NAT
is that all the hosts that might
communicate with each other over
the Internet do not need to have
globally unique addresses. Instead,
a host could be assigned a “private address” that is not necessarily globally unique, but is unique
within some more limited scope; for
example, within the corporate network where the host resides. The
class A network number 10 is often
used for this purpose, since that
network number was assigned to
the ARPANET and is no longer in
use as a globally unique address.
As long as the host communicates
only with other hosts in the corporate network, a locally unique
address is sufficient. If it should
want to communicate with a host
outside the corporate network, it
does so via a “NAT box”—a device that is able to translate from
the private address used by the host
to some globally unique address
that is assigned to the NAT box.
Since it’s likely that a small subset of the hosts in the corporation need the services of the NAT
box at any one time, the NAT box
might be able to get by with a small
pool of globally unique addresses,
much smaller than the number of
addresses that would be needed if
every host in the corporation had a
globally unique address.
So, we can imagine a NAT box
receiving IP packets from a host
inside the corporation and translating the IP source address from
some private address
to a globally unique address (say, 171.69.210.246). When packets
come back from the remote host
addressed to 171.69.210.246, the
NAT box translates the destination address back to the private address and forwards the packet on toward the host.
The chief drawback of NAT
is that it breaks a key assumption of the IP service model—that
all nodes have globally unique addresses. It turns out that lots of
applications and protocols rely on
The first part turns out to be rather
easy, since every host on a link must have a
unique link-level address. For example, all
hosts on an Ethernet have a unique 48-bit
Ethernet address. This can be turned into a
valid link local use address by adding the
appropriate prefix from Table 4.11 (1111
1110 10) followed by enough 0s to make
up 128 bits. For some devices—for example, printers or hosts on a small routerless network that do not connect to any
other networks—this address may be perfectly adequate. Those devices that need a
globally valid address depend on a router
on the same link to periodically advertise
the appropriate prefix for the link. Clearly,
this requires that the router be configured
with the correct address prefix, and that this
prefix be chosen in such a way that there is
enough space at the end (e.g., 48 bits) to
attach an appropriate link-level address.
The ability to embed link-level addresses as long as 48 bits into IPv6 addresses
was one of the reasons for choosing such a
large address size. Not only does 128 bits
allow the embedding, but it leaves plenty
of space for the multilevel hierarchy of
addressing that we discussed above.
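The construction of a link local use address described above can be sketched in a few lines of Python. This follows the scheme exactly as the text states it (10-bit prefix, zero padding, 48-bit Ethernet address in the low-order bits); deployed IPv6 implementations may derive the interface ID differently. The function name and the sample Ethernet address are hypothetical.

```python
import ipaddress

LINK_LOCAL_PREFIX = 0b1111111010  # the 10-bit prefix quoted in the text
PREFIX_BITS = 10

def link_local_from_ethernet(mac):
    """Prefix + zero padding + 48-bit Ethernet address = 128 bits."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit Ethernet address")
    interface_id = int.from_bytes(octets, "big")
    # Shift the prefix to the most significant end; the zeros in between
    # are implicit, and the link-level address fills the low 48 bits.
    value = (LINK_LOCAL_PREFIX << (128 - PREFIX_BITS)) | interface_id
    return ipaddress.IPv6Address(value)

print(link_local_from_ethernet("00:11:22:33:44:55"))
```

No server or router is consulted at any point, which is what makes this half of the problem easy.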
Advanced Routing Capabilities
Another of IPv6’s extension headers is the
routing header. In the absence of this header,
routing for IPv6 differs very little from that
of IPv4 under CIDR. The routing header
contains a list of IPv6 addresses that represent nodes or topological areas that the
packet should visit en route to its destination. A topological area may be, for
example, a backbone provider’s network.
Specifying that packets must visit this network would be a way of implementing
provider selection on a packet-by-packet
basis. Thus, a host could say that it wants
some packets to go through a provider that
is cheap, others through a provider that
provides high reliability, and still others
through a provider that the host trusts to
provide security.
To provide the ability to specify topological entities rather than individual nodes,
IPv6 defines an anycast address. An anycast
address is assigned to a set of interfaces, and
packets sent to that address will go to the
“nearest” of those interfaces, with nearest
being determined by the routing protocols.
For example, all the routers of a backbone
provider could be assigned a single anycast
address, which would be used in the routing
The anycast address and the routing
header are also expected to be used to provide enhanced routing support to mobile
hosts. The detailed mechanisms for providing this support are still being defined.
this assumption. In particular,
many protocols that run over
IP (e.g., application protocols)
carry IP addresses in their messages. These addresses also need
to be translated by a NAT box
if the higher-layer protocol is to
work properly, and thus NAT
boxes become much more complex than simple IP header translators. They potentially need to understand an ever-growing number
of higher-layer protocols. This in
turn presents an obstacle to deployment of new applications. It is
probably safe to say that networks
would be better off without NAT,
but its disappearance seems unlikely. Widespread deployment of
IPv6 would almost certainly help.
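The basic address translation performed by a NAT box can be sketched as a pair of lookup tables. This is a deliberately minimal model under stated assumptions: it rewrites only the IP addresses, ignores the higher-layer protocols discussed above, and never returns a global address to the pool. The class name, the private address 10.0.0.5, and the remote address 192.0.2.7 are hypothetical.

```python
class NatBox:
    """Maps private source addresses to a small pool of global addresses."""

    def __init__(self, global_pool):
        self.free = list(global_pool)
        self.private_to_global = {}
        self.global_to_private = {}

    def outbound(self, src, dst):
        """Rewrite the source address of a packet leaving the private network."""
        if src not in self.private_to_global:
            if not self.free:
                raise RuntimeError("no global address available")
            global_addr = self.free.pop(0)
            self.private_to_global[src] = global_addr
            self.global_to_private[global_addr] = src
        return self.private_to_global[src], dst

    def inbound(self, src, dst):
        """Rewrite the destination address of a returning packet.

        Raises KeyError if no host is currently using that global address.
        """
        return src, self.global_to_private[dst]

nat = NatBox(["171.69.210.246"])
print(nat.outbound("10.0.0.5", "192.0.2.7"))
print(nat.inbound("192.0.2.7", "171.69.210.246"))
```

With a one-address pool, a second private host cannot get out until the model is extended to release addresses, which illustrates why the pool size matters.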
Other Features
As mentioned at the beginning of this section, the primary motivation behind the development of IPv6 was to support the continued growth of the Internet. Once the IP header
had to be changed for the sake of the addresses, however, the door was open for a wide
variety of other changes, two of which we have just described—autoconfiguration and
source-directed routing. IPv6 includes several additional features, most of which are
covered elsewhere in this book—mobility is discussed in Section 4.2.5, network security is the topic of Chapter 8, and a new service model proposed for the Internet is
described in Section 6.5. It is interesting to note that, in most of these areas, the IPv4
and IPv6 capabilities have become virtually indistinguishable, so that the main driver
for IPv6 remains the need for larger addresses.
4.4 Multicast
As we saw in Chapter 2, multiaccess networks like Ethernet and token rings implement
multicast in hardware. This section describes how to extend multicast, in software,
across an internetwork of such networks. The approach described in this section is
based on an implementation of multicast used in the current Internet (IPv4). Multicast
will also be supported in the next generation of IP (IPv6), with the major differences
being restricted to the address format.
The motivation for developing multicast is that there are applications that want
to send a packet to more than one destination host. Instead of forcing the source host
to send a separate packet to each of the destination hosts, we want the source to be
able to send a single packet to a multicast address, and for the network—or internet,
in this case—to deliver a copy of that packet to each of a group of hosts. Hosts can
then choose to join or leave this group at will, without synchronizing or negotiating
with other members of the group. Also, a host may belong to more than one group at
a time.
Internet multicast can be implemented on top of a collection of networks that
support hardware multicast (or broadcast) by extending the routing and forwarding
functions implemented by the routers that connect these networks. This section describes three such extensions: The first is based on distance-vector routing as described
in Section 4.2.2; the second is based on link-state routing, described in Section 4.2.3;
the third can build on any underlying routing protocol and is thus called Protocol
Independent Multicast (PIM).
Before looking at any of the multicast routing protocols, however, we need to
look at the service model for IP multicast. We could imagine that a host wishing to
send a packet to some number of internet hosts could enumerate all their addresses,
but this would quickly become unscalable for large numbers of receivers—consider
using the Internet to distribute a pay-per-view movie, for example. For this reason, IP
multicast uses the idea of a multicast group that receivers may join. Each group has a
specially assigned address, and senders to the group use that address as the destination
address for their packets. In IPv4, these addresses are assigned in the class D address
space, and IPv6 also has a portion of its address space reserved for multicast group addresses.
Hosts join multicast groups using a protocol called Internet Group Management
Protocol (IGMP). They use this to notify a router on their local network of their desire
to receive packets sent to a certain multicast group. The protocols described below
are concerned with how packets are distributed to the appropriate routers. Delivery of
packets from the “last hop” router to the host is handled by the underlying multicast
capability of the network, as described in Section 2.6.
One perplexing question is how senders and receivers learn about multicast
addresses. This is normally handled by out-of-band means, and there are some quite
sophisticated tools to enable group addresses to be advertised on the Internet.
Link-State Multicast
Adding multicast to a link-state routing algorithm is fairly straightforward, so we describe it first. Recall that in link-state routing, each router monitors the state of its
directly connected links and sends an update message to all of the other routers whenever the state changes. Since each router receives enough information to reconstruct
the entire topology of the network, it is able to use Dijkstra’s algorithm to compute
the shortest-path spanning tree rooted at itself and reaching all possible destinations.
The router uses this tree to determine the best next hop for each packet it forwards.
All we have to do to extend this algorithm to support multicast is to add the set of
groups that have members on a particular link (LAN) to the “state” for that link. The
only question is how each router determines which groups have members on which
links. As suggested in Section 3.2.3, the solution is to have each host periodically
announce to the LAN the groups to which it belongs. The router simply monitors
the LAN for such announcements. Should such announcements stop arriving after a
period of time, the router then assumes that the host has left the group.
Given full knowledge of which groups have members on which links, each router
is able to compute the shortest-path multicast tree from any source to any group,
again using Dijkstra’s algorithm. For example, given the internet illustrated in Figure
4.36, where the colored hosts belong to group G, the routers would compute the
shortest-path multicast trees given in Figure 4.37 for sources A, B, and C. The routers
would use these trees to decide how to forward packets addressed to multicast group
G. For example, router R3 would forward a packet going from host A to group
G to R6.
Keep in mind that each router must potentially keep a separate shortest-path
multicast tree from every source to every group. This is obviously very expensive,
so instead the router just computes and stores a cache of these trees—one for each
source/group pair that is currently active.
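The computation each router performs can be sketched as Dijkstra's algorithm followed by a pruning step that keeps only the branches leading to group members. The topology below is a small hypothetical one, not the internet of Figure 4.36, and the function names are assumptions made for this sketch.

```python
import heapq

def shortest_path_parents(graph, source):
    """Dijkstra: return each reachable node's parent on the shortest-path tree."""
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale queue entry
        for neighbor, cost in graph[node].items():
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                parent[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    return parent

def multicast_tree(graph, source, members):
    """Keep only the tree edges that lead from source toward group members."""
    parent = shortest_path_parents(graph, source)
    edges = set()
    for member in members:
        node = member
        while node in parent and parent[node] is not None:
            edges.add((parent[node], node))
            node = parent[node]
    return edges

# Hypothetical topology: router -> {neighbor: link cost}.
graph = {
    "R1": {"R2": 1, "R3": 1},
    "R2": {"R1": 1, "R4": 1},
    "R3": {"R1": 1, "R4": 3},
    "R4": {"R2": 1, "R3": 3},
}
print(sorted(multicast_tree(graph, "R1", {"R4"})))
```

A router following the caching strategy above would store the result keyed by the source/group pair and recompute it only when the link state or the membership changes.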
Distance-Vector Multicast
Adding multicast to the distance-vector algorithm is a bit trickier because the routers do
not know the entire topology of the internet. Instead, recall that each router maintains
a table of (Destination, Cost, NextHop) tuples, and exchanges a list of (Destination,
Cost) pairs with its directly connected neighbors. Extending this algorithm to support
multicast is a two-stage process. First, we need to design a broadcast mechanism that
allows a packet to be forwarded to all the networks on the internet. Second, we need
to refine this mechanism so that it prunes back networks that do not have hosts that
belong to the multicast group.
Figure 4.36 Example internet with members of group G in color.
Reverse-Path Broadcast (RPB)
Each router knows that the current shortest path to a given destination goes through
NextHop. Thus, whenever it receives a multicast packet from source S, the router
forwards the packet on all outgoing links (except the one on which the packet arrived)
if and only if the packet arrived over the link that is on the shortest path to S (i.e., the
packet came from the NextHop associated with S in the routing table). This strategy
effectively floods packets outward from S, but does not loop packets back toward S.
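The forwarding rule just stated amounts to a single comparison against the routing table. In the sketch below, the table is reduced to a mapping from destination to the link toward its NextHop; that representation, the link names, and the function name are assumptions made for illustration.

```python
def rpb_forward(next_hop_link, source, arrival_link, links):
    """Return the links on which to forward a multicast packet from `source`.

    next_hop_link maps each destination to the link on this router's
    current shortest path toward it.
    """
    if next_hop_link.get(source) != arrival_link:
        return []  # did not arrive via the shortest path to the source: drop
    return [link for link in links if link != arrival_link]

table = {"S": "link0"}
links = ["link0", "link1", "link2"]
print(rpb_forward(table, "S", "link0", links))  # arrived on the reverse path
print(rpb_forward(table, "S", "link1", links))  # arrived elsewhere
```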
There are two major shortcomings to this approach. The first is that it truly
floods the network; it has no provision for avoiding LANs that have no members
in the multicast group. We address this problem in the next subsection. The second
limitation is that a given packet will be forwarded over a LAN by each of the routers
connected to that LAN. This is due to the forwarding strategy of flooding packets on
all links other than the one on which the packet arrived, without regard to whether or
not those links are part of the shortest-path tree rooted at the source.
The solution to this second limitation is to eliminate the duplicate broadcast
packets that are generated when more than one router is connected to a given LAN.
Figure 4.37 Example shortest-path multicast trees.
One way to do this is to designate one router as the “parent” router for each link,
relative to the source, where only the parent router is allowed to forward multicast
packets from that source over the LAN. The router that has the shortest path to source
S is selected as the parent; a tie between two routers would be broken according to
which router has the smallest address. A given router can learn if it is the parent for the
LAN (again relative to each possible source) based upon the distance-vector messages
it exchanges with its neighbors.
Notice that this refinement requires that each router keep, for each source,
a bit for each of its incident links indicating whether or not it is the parent for
that source/link pair. Keep in mind that in an internet setting, a “source” is a network, not a host, since an internet router is only interested in forwarding packets between networks. The resulting mechanism is sometimes called reverse-path broadcast (RPB).
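The parent selection rule (shortest path to the source, ties broken by smallest address) can be sketched as a single minimum over the routers attached to a LAN. The router addresses and distances below are hypothetical, and the sketch assumes each router already knows its neighbors' distances from the distance-vector messages.

```python
import ipaddress

def elect_parent(routers_on_lan):
    """Pick the parent router for one LAN, relative to one source.

    routers_on_lan maps each router's address to its distance to the source.
    The smallest distance wins; a tie goes to the numerically smallest address.
    """
    return min(
        routers_on_lan,
        key=lambda addr: (routers_on_lan[addr], ipaddress.ip_address(addr)),
    )

# Two routers tie at distance 2; the smaller address becomes the parent.
lan = {"192.0.2.10": 2, "192.0.2.9": 2, "192.0.2.1": 3}
parent = elect_parent(lan)
print(parent)

# The per-source, per-link bit mentioned above, for a hypothetical source "S".
is_parent = {("S", "lan0"): parent == "192.0.2.9"}
```

Comparing addresses numerically rather than as strings matters here, since "192.0.2.10" sorts before "192.0.2.9" as text.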
Reverse-Path Multicast (RPM)
RPB implements shortest-path broadcast. We now want to prune the set of networks
that receives each packet addressed to group G to exclude those that have no hosts
that are members of G. This can be accomplished in two stages. First, we need to
recognize when a leaf network has no group members. Determining that a network
is a leaf is easy—if the parent router as described in RPB is the only router on the
network, then the network is a leaf. Determining if any group members reside on the
network is accomplished by having each host that is a member of group G periodically
announce this fact over the network, as described in our earlier description of link-state
multicast. The router then uses this information to decide whether or not to forward
a multicast packet addressed to G over this LAN.
The second stage is to propagate this “no members of G here” information up
the shortest-path tree. This is done by having the router augment the (Destination,
Cost) pairs it sends to its neighbors with the set of groups for which the leaf network
is interested in receiving multicast packets. This information can then be propagated
from router to router, so that for each of its links, a given router knows for what
groups it should forward multicast packets.
Note that including all of this information in the routing update is a fairly expensive thing to do. In practice, therefore, this information is exchanged only when some
source starts sending packets to that group. In other words, the strategy is to use
RPB, which adds a small amount of overhead to the basic distance-vector algorithm,
until a particular multicast address becomes active. At that time, routers that are not
interested in receiving packets addressed to that group speak up, and that information
is propagated to the other routers.
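The effect of propagating membership information up the tree can be sketched with a small recursive function. This is a centralized illustration of the end result, not the distributed exchange of routing updates itself; the tree, the membership table, and the function names are all hypothetical.

```python
def groups_wanted(tree, local_groups, router):
    """Groups that have members at or below `router` on the shortest-path tree."""
    wanted = set(local_groups.get(router, ()))
    for child in tree.get(router, ()):
        wanted |= groups_wanted(tree, local_groups, child)
    return wanted

def forwarding_neighbors(tree, local_groups, router, group):
    """Downstream neighbors to which `router` should forward packets for `group`."""
    return [child for child in tree.get(router, ())
            if group in groups_wanted(tree, local_groups, child)]

# Hypothetical shortest-path tree rooted at R1: router -> downstream routers.
tree = {"R1": ["R2", "R3"], "R2": ["R4"], "R3": [], "R4": []}
# Groups with members on each router's leaf networks.
local_groups = {"R4": {"G"}, "R3": set()}

print(forwarding_neighbors(tree, local_groups, "R1", "G"))
```

Here R3 has no members of G below it, so R1 stops forwarding G toward R3 while continuing to forward toward R2.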
Protocol Independent Multicast (PIM)
PIM was developed in response to the scaling problems of existing multicast routing
protocols. In particular, it was recognized that the existing protocols did not scale well
in environments where a relatively small proportion of routers want to receive traffic
for a certain group. For example, broadcasting traffic to all routers until they explicitly
ask to be removed from the distribution is not a good design choice if most routers
don’t want to receive the traffic in the first place. This situation is sufficiently common
that PIM divides the problem space into “sparse mode” and “dense mode.” Because
the existing protocols were so poorly suited to the sparse environment, PIM sparse
mode has received the most attention and is the focus of our discussion here.
In PIM sparse mode (PIM-SM), routers explicitly join and leave the multicast
group using PIM protocol messages known as Join and Prune messages. The question
that arises is where to send those messages. To address this, PIM assigns a rendezvous
point (RP) to each group. In general, a number of routers in a domain are configured
to be candidate RPs, and PIM defines a set of procedures by which all the routers in a
domain can agree on the router to use as the RP for a given group. These procedures
are rather complex, as they must deal with a wide variety of scenarios, such as the
failure of a candidate RP and the partitioning of a domain into two separate networks
due to a number of link or node failures. For the rest of this discussion, we assume
that all routers in a domain know the unicast IP address of the RP for a given group.
A multicast forwarding tree is built as a result of routers sending Join messages
to the RP. PIM-SM allows two types of trees to be constructed: a shared tree, which
may be used by all senders, and a source-specific tree, which may be used only by
a specific sending host. The normal mode of operation creates the shared tree first,
followed by one or more source-specific trees if there is enough traffic to warrant
it. Because building trees installs state in the routers along the tree, it is important
that the default is to have only one tree for a group, not one for every sender to a group.
When a router sends a Join message toward the RP for a group G, it is sent using
normal IP unicast transmission. This is illustrated in Figure 4.38(a), in which router
R4 is sending a Join to the rendezvous point for some group. The initial Join message
is “wildcarded”; that is, it applies to all senders. A Join message clearly must pass
through some sequence of routers before reaching the RP (e.g., R2). Each router along
the path looks at the Join and creates a forwarding table entry for the shared tree,
called a (*, G) entry (* meaning “all senders”). To create the forwarding table entry,
it looks at the interface on which the Join arrived and marks that interface as one on
which it should forward data packets for this group. It then determines which interface
it will use to forward the Join toward the RP. This will be the only acceptable interface
Figure 4.38 PIM operation: (a) R4 sends Join to RP and joins shared tree. (b) R5 joins
shared tree. (c) RP builds source-specific tree to R1 by sending Join to R1. (d) R4 and
R5 build source-specific tree to R1 by sending Joins to R1.
for incoming packets sent to this group. It then forwards the Join toward the RP.
Eventually, the message arrives at the RP, completing the construction of the tree
branch. The shared tree thus constructed is shown as a colored line from the RP to R4
in Figure 4.38(a).
As more routers send Joins toward the RP, they cause new branches to be added
to the tree, as illustrated in Figure 4.38(b). Note that in this case, the Join only needs
to travel to R2, which can add the new branch to the tree simply by adding a new
outgoing interface to the forwarding table entry created for this group. R2 need not
forward the Join on to the RP. Note also that the end result of this process is to build
a tree whose root is the RP.
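The way a router handles a wildcard Join can be sketched as follows. The class, its field names, and the interface labels are hypothetical, and the sketch leaves out Prune messages, timers, and source-specific state; it only shows how the (*, G) entry is created and how a later Join adds a branch without being forwarded further.

```python
class PimRouter:
    def __init__(self, name, iface_toward_rp):
        self.name = name
        self.iface_toward_rp = iface_toward_rp  # taken from unicast routing
        self.entries = {}  # ("*", group) -> {"incoming": iface, "outgoing": set}

    def receive_join(self, group, arrival_iface):
        """Process a wildcard Join.

        Returns True if the Join must be forwarded on toward the RP, which is
        the case only when this router had no (*, G) entry yet.
        """
        key = ("*", group)
        is_new = key not in self.entries
        if is_new:
            # Data for this group is accepted only from the RP's direction.
            self.entries[key] = {"incoming": self.iface_toward_rp,
                                 "outgoing": set()}
        # Data packets for the group go out the interface the Join came in on.
        self.entries[key]["outgoing"].add(arrival_iface)
        return is_new

r2 = PimRouter("R2", iface_toward_rp="to-RP")
print(r2.receive_join("G", "to-R4"))  # first Join: forwarded toward the RP
print(r2.receive_join("G", "to-R5"))  # later Join: new branch only
```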
Figure 4.39 Delivery of a packet along a shared tree. R1 tunnels the packet to the RP,
which forwards it along the shared tree to R4 and R5.
At this point, suppose a host wishes to send a message to the group. To do so,
it constructs a packet with the appropriate multicast group address as its destination
and sends it to a router on its local network known as the designated router (DR).
Suppose the DR is R1 in Figure 4.38. There is no state for this multicast group between
R1 and the RP at this point, so instead of simply forwarding the multicast packet, R1
“tunnels” it to the RP. That is, R1 encapsulates the multicast packet inside a unicast
IP packet that it sends to the unicast IP address of the RP. Just like a tunnel endpoint
of the sort described in Section 4.1.8, the RP receives the packet addressed to it, looks
at the payload of the unicast packet, and finds inside an IP packet addressed to the
multicast address of this group. The RP, of course, does know what to do with such a
packet—it sends it out onto the shared tree of which the RP is the root. In the example
of Figure 4.38, this means that the RP sends the packet on to R2, which is able to
forward it to R4 and R5. The complete delivery of a packet from R1 to R4 and R5 is
shown in Figure 4.39. We see the tunneled packet travel from R1 to the RP with an
extra IP header containing the unicast address of RP, and then the multicast packet
addressed to G making its way along the shared tree to R4 and R5.
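As a rough sketch of the tunneling step, the following Python models packets as dictionaries. The field names, the symbolic addresses (`R1`, `RP`, `G`), and the `shared_tree` mapping are illustrative assumptions, not a real packet format.

```python
def dr_tunnel_to_rp(multicast_packet, dr_address, rp_address):
    """DR side: wrap the multicast packet in a unicast packet addressed to the RP."""
    return {"src": dr_address, "dst": rp_address, "payload": multicast_packet}


def rp_receive(unicast_packet, rp_address, shared_tree):
    """RP side: strip the outer header and replicate the inner packet onto the shared tree."""
    if unicast_packet["dst"] != rp_address:
        raise ValueError("packet is not addressed to this RP")
    inner = unicast_packet["payload"]
    # shared_tree maps a group address to the RP's outgoing interfaces for that group.
    out_ifaces = shared_tree.get(inner["dst"], set())
    return [(iface, inner) for iface in sorted(out_ifaces)]


inner = {"src": "host", "dst": "G", "data": "hello"}
outer = dr_tunnel_to_rp(inner, dr_address="R1", rp_address="RP")
for iface, packet in rp_receive(outer, "RP", {"G": {"to_R2"}}):
    print(iface, packet["dst"])
```

The inner packet is untouched by the tunnel; only the outer unicast header is added at R1 and removed at the RP.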
At this point, we might be tempted to declare success, since all hosts can send to
all receivers this way. However, there is some bandwidth inefficiency and processing
cost in the encapsulation and decapsulation of packets on the way to the RP, so the
RP has the option of forcing knowledge about this group into the intervening routers
so that tunneling can be avoided. Its decision to exercise this option is based on the
data rate of packets coming from a given source; only if this rate is high enough to
warrant the effort will the RP take action. If it does, it sends a Join message toward
the sending host (Figure 4.38(c)). As this Join travels toward the host, it causes the
routers along the path (R3) to learn about the group, so that it will be possible for
the DR to send the packet to the group as “native” (i.e., not tunneled) multicast packets.
An important detail to note at this stage is that the Join message sent by the RP
to the sending host is specific to that sender, whereas the previous ones sent by R4 and
R5 applied to all senders. Thus the effect of the new Join is to create sender-specific
state in the routers between the identified source and the RP. This is referred to as
(S, G) state, since it applies to one sender to one group and contrasts with the (*, G)
state that was installed between the receivers and the RP that applies to all senders.
Thus, in Figure 4.38(c), we see a source-specific route from R1 to the RP (indicated
by the dashed line) and a tree that is valid for all senders from the RP to the receivers
(indicated by the colored line).
The next possible optimization is to replace the entire shared tree with a source-specific tree. This is desirable because the path from sender to receiver via the RP
might be significantly longer than the shortest possible path. This again is likely to be
triggered by a high data rate being observed from some sender. In this case, the router
at the downstream end of the tree—say, R4 in our example—sends a source-specific
Join toward the source. As it follows the shortest path toward the source, the routers
along the way create (S, G) state for this tree, and the result is a tree that has its root
at the source, rather than the RP. Assuming both R4 and R5 made the switch to the
source-specific tree, we would end up with the tree shown in Figure 4.38(d). Note that
this tree no longer involves the RP at all. We have removed the shared tree from this
picture to simplify the diagram, but in reality all routers with receivers for a group
must stay on the shared tree in case new senders show up.
We can now see why PIM is “protocol independent.” All of its mechanisms
for building and maintaining trees depend on whatever unicast routing protocol is
used in the domain. The formation of trees is entirely determined by the paths that
Join messages follow, which is determined by the choice of shortest paths made by
unicast routing. Thus, to be precise, PIM is “unicast routing protocol independent,”
as compared to the other multicast routing protocols in this section, which are derived
from either link-state or distance-vector routing. Note that PIM is very much bound up
with the Internet Protocol; it is not protocol independent in terms of network-layer protocols.
The design of PIM again illustrates the challenges in building scalable networks,
and how scalability is sometimes pitted against some sort of optimality. The shared
tree is certainly more scalable than a source-specific tree, in the sense that it reduces
the total state in routers to be on the order of the number of groups rather than the
number of senders times the number of groups. However, the source-specific tree is
likely to be necessary to achieve efficient routing.
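The scaling argument can be made concrete with a little arithmetic. The figures below (1,000 groups, 20 senders per group) are invented purely for illustration, and the functions give order-of-magnitude bounds on per-router state rather than exact counts.

```python
def shared_tree_state(num_groups):
    """One (*, G) entry per group."""
    return num_groups


def source_specific_state(num_groups, senders_per_group):
    """One (S, G) entry per sender per group."""
    return num_groups * senders_per_group


groups, senders = 1_000, 20
print(shared_tree_state(groups))                # 1000
print(source_specific_state(groups, senders))   # 20000
```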
4.5 Multiprotocol Label Switching (MPLS)
We conclude our discussion of IP by describing an idea that was originally viewed
as a way to improve the performance of the Internet. The idea, called Multiprotocol
Label Switching (MPLS), tries to combine some of the properties of virtual circuits
with the flexibility and robustness of datagrams. On the one hand, MPLS is very much
associated with the Internet Protocol’s datagram-based architecture—it relies on IP
addresses and IP routing protocols to do its job. On the other hand, MPLS-enabled
routers also forward packets by examining relatively short, fixed-length labels, and
these labels have local scope, just like in a virtual circuit network. It is perhaps this
marriage of two seemingly opposed technologies that has caused MPLS to have a
somewhat mixed reception in the Internet engineering community.
Before looking at how MPLS works, it is reasonable to ask, “What is it good
for?” Many claims have been made for MPLS, but there are three main things that it
is used for today:
■ To enable IP capabilities on devices that do not have the capability to forward
IP datagrams in the normal manner
■ To forward IP packets along “explicit routes”—precalculated routes that don’t
necessarily match those that normal IP routing protocols would select
■ To support certain types of virtual private network services
It is worth noting that one of the original goals, improving performance, is not on
the list. This has a lot to do with the advances that have been made in forwarding
algorithms for IP routers in recent years, and with the complex set of factors beyond
header processing that determine performance.
The best way to understand how MPLS works is to look at some examples
of its use. In the next three sections we will look at examples to illustrate the three
applications of MPLS mentioned above.
Destination-Based Forwarding
One of the earliest publications to introduce the idea of attaching labels to IP packets
was a paper by Chandranmenon and Varghese that described an idea called “threaded
indices.” A very similar idea is now implemented in MPLS-enabled routers. The following example shows how this idea works.
Figure 4.40 Routing tables in example network.
Consider the network in Figure 4.40. Each of the two routers on the far right
(R3 and R4) has one connected network, with prefixes 10.1.1/24 and 10.3.3/24. The
remaining routers (R1 and R2) have routing tables that indicate which outgoing interface each router would use when forwarding packets to one of those two networks.
When MPLS is enabled on a router, the router allocates a label for each prefix
in its routing table and advertises both the label and the prefix that it represents
to its neighboring routers. This advertisement is carried in the “Label Distribution
Protocol.” This is illustrated in Figure 4.41. Router R2 has allocated the label value
15 for the prefix 10.1.1 and the label value 16 for the prefix 10.3.3. These labels can
be chosen at the convenience of the allocating router and can be thought of as indices
into the routing table. After allocating the labels, R2 advertises the label bindings to its
neighbors; in this case, we see R2 advertising a binding between the label 15 and the
prefix 10.1.1 to R1. The meaning of such an advertisement is that R2 has said, in effect,
“Please attach the label 15 to all packets sent to me that are destined to prefix 10.1.1.”
R1 stores the label in a table alongside the prefix that it represents as the “remote” or
“outgoing” label for any packets that it sends to that prefix.
In Figure 4.41(c), we see another label advertisement from router R3 to R2 for
the prefix 10.1.1, and R2 places the “remote” label that it learned from R3 in the
appropriate place in its table.
At this point, we can look at what happens when a packet is forwarded in this
network. Suppose a packet destined to an IP address within the prefix 10.1.1 arrives from the left at
router R1. R1 in this case is referred to as a label edge router (LER); an LER performs
a complete IP lookup on arriving IP packets, and then applies labels to them as a result
of the lookup. In this case, R1 would see that the destination address matches the prefix 10.1.1 in its
Figure 4.41 (a) R2 allocates labels and advertises bindings to R1. (b) R1 stores the
received labels in a table. (c) R3 advertises another binding, and R2 stores the received
label in a table.
forwarding table, and that this entry contains both an outgoing interface and a remote
label value. R1 therefore attaches the remote label 15 to the packet before sending it.
When the packet arrives at R2, R2 looks at the label in the packet. The forwarding
table at R2 indicates that packets arriving with a label value of 15 should be sent out
interface 1, and that they should carry the label value 24, as advertised by router R3.
R2 therefore rewrites, or swaps, the label and forwards it to R3.
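The label application at R1 and the swap at R2 can be sketched as two table lookups. The label values 15, 16, and 24 and R2's outgoing interface 1 come from the example above; the table layout, R1's outgoing interface 0, and the destination address 10.1.1.5 are assumptions made for this illustration.

```python
import ipaddress

# R1 (label edge router): prefix -> outgoing interface and the label R2 asked for.
R1_TABLE = {
    ipaddress.ip_network("10.1.1.0/24"): {"interface": 0, "remote_label": 15},
    ipaddress.ip_network("10.3.3.0/24"): {"interface": 0, "remote_label": 16},
}

# R2 (label switching router): incoming label -> outgoing interface and label from R3.
R2_LABEL_TABLE = {15: {"interface": 1, "remote_label": 24}}


def ler_forward(table, dst):
    """IP lookup at the edge, then attach the remote label.

    The two prefixes here do not overlap, so a simple containment test is
    enough; a real router would perform a longest-prefix match.
    """
    addr = ipaddress.ip_address(dst)
    net = next((n for n in table if addr in n), None)
    if net is None:
        raise LookupError(f"no route to {dst}")
    entry = table[net]
    return entry["interface"], {"label": entry["remote_label"], "dst": dst}


def lsr_forward(label_table, packet):
    """Label swap: only the label is examined, never the IP destination."""
    entry = label_table[packet["label"]]
    return entry["interface"], {**packet, "label": entry["remote_label"]}


iface, pkt = ler_forward(R1_TABLE, "10.1.1.5")
print(iface, pkt["label"])    # 0 15
iface, pkt = lsr_forward(R2_LABEL_TABLE, pkt)
print(iface, pkt["label"])    # 1 24
```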
What has been accomplished by all this application and swapping of labels?
Observe that when R2 forwarded the packet in this example, it never actually needed
to examine the IP address. Instead, R2 looked only at the incoming label. Thus, we have
replaced the normal IP destination address lookup with a label lookup. To understand
why this is significant, it helps to recall that although IP addresses are always the
same length, IP prefixes are of variable length, and the IP destination address lookup
algorithm needs to find the longest match, that is, the longest prefix that matches the high-order bits in the IP address of the packet being forwarded. By contrast, the label
forwarding mechanism just described is an exact match algorithm. It is possible to
implement a very simple exact match algorithm, for example, by using the label as
an index into an array, where each element in the array is one line in the forwarding table.
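To make the contrast concrete, here is a small sketch of both lookups. The linear scan is a deliberately naive longest-match implementation (production routers use tries or hardware), and the route list and array size are invented for the example.

```python
import ipaddress


def longest_prefix_match(routes, dst):
    """Scan every route and keep the matching prefix with the most bits."""
    addr = ipaddress.ip_address(dst)
    best_net, best_hop = None, None
    for prefix, next_hop in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_hop = net, next_hop
    return best_hop


def label_lookup(label_table, label):
    """Exact match: the label is simply an index into an array."""
    return label_table[label]


routes = [("10.0.0.0/8", "A"), ("10.1.0.0/16", "B"), ("10.1.1.0/24", "C")]
print(longest_prefix_match(routes, "10.1.1.5"))   # C  (all three match; /24 wins)
print(longest_prefix_match(routes, "10.1.2.1"))   # B  (/8 and /16 match; /16 wins)

label_table = [None] * 32
label_table[15] = (1, 24)        # (outgoing interface, outgoing label)
print(label_lookup(label_table, 15))              # (1, 24)
```

The first lookup must consider several candidate prefixes of different lengths; the second is a single array access.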
Note that while the forwarding algorithm has been changed from longest match
to exact match, the routing algorithm can be any standard IP routing algorithm (e.g.,
OSPF). The path that a packet will follow in this environment is the exact same path
that it would have followed if MPLS were not involved—the path chosen by the IP
routing algorithms. All that has changed is the forwarding algorithm.
The major effect of changing the forwarding algorithm is that devices that normally don’t know how to forward IP packets can be used in an MPLS network. The
most notable early application of this result was to ATM switches, which can support
MPLS without any changes to their forwarding hardware. ATM switches support the
label swapping forwarding algorithm just described, and by providing these switches
with IP routing protocols and a method to distribute label bindings, they could be
turned into label switching routers (LSRs)—devices that run IP control protocols but
use the label switching forwarding algorithm. More recently, the same idea has been
applied to optical switches of the sort described in Section 3.1.2.
Before we consider the purported benefits of turning an ATM switch into an LSR,
we should tie up some loose ends. We have said that labels are “attached” to packets,
but where exactly are they attached? The answer depends on the type of link on which
packets are carried. Two common methods for carrying labels on packets are shown in
Figure 4.42. When IP packets are carried as complete frames, as they are on most link
types including Ethernet, token ring, and PPP, the label is inserted as a “shim” between
the layer 2 header and the IP (or other layer 3) header, as shown in Figure 4.42(b).
However, if an ATM switch is to function as an MPLS LSR, then the label needs to
be in a place where the switch can use it, and that means it needs to be in the ATM
cell header, exactly where we would normally find the VCI and VPI fields, as shown in
Figure 4.42(a).
What Layer Is MPLS?
There have been many debates about where MPLS belongs in the layered protocol
architectures presented in Section 1.3. Since the MPLS header is normally found
between the layer 3 and the layer 2 headers in a packet, it is sometimes referred to
as a layer 2.5 protocol. Some people argue that, since IP packets are encapsulated
inside MPLS headers, MPLS must be “below” IP, making it a layer 2 protocol. Others
argue that, since the control protocols for MPLS are, in large part, the same protocols
as IP (MPLS uses IP routing protocols and IP addressing), MPLS must be at the
same layer as IP (i.e., layer 3). As we noted in Section 1.3, layered architectures are
useful tools but they may not always exactly describe the real world, and MPLS is a
good example of where strictly layerist views may be difficult to reconcile with reality.
Having now devised a scheme by which an ATM switch can function as an
LSR, what have we gained? One thing to note is that we could now build a network
that used a mixture of conventional IP routers, label edge routers, and ATM
switches functioning as LSRs, and they would all use the same routing protocols.
To understand the benefits of using the same protocols, consider the alternative. In Figure
4.43(a) we see a set of routers interconnected by virtual circuits over an ATM network,
a configuration called an “overlay” network. At one point in time, networks of
this type were often built because commercially available ATM switches supported
higher total throughput than routers. Today, networks like this are less common because
routers have caught up with and even surpassed ATM switches. However, these networks
still exist because of the significant installed base of ATM switches in network
backbones, which in turn is partly a result of ATM’s ability to support a range of
capabilities such as circuit emulation and virtual circuit services.
In an overlay network, each router would potentially be connected to each of
the other routers by a virtual circuit, but in this case for clarity we have just shown the
circuits from R1 to all of its peer routers. R1 has five routing neighbors and needs to
exchange routing protocol messages with all of them—we say that R1 has five routing
adjacencies. By contrast, in Figure 4.43(b), the ATM switches have been replaced with
LSRs. There are no longer virtual circuits interconnecting the routers. Thus R1 has
only one adjacency, with LSR1. In large networks, running MPLS on the switches leads
to a significant reduction in the number of adjacencies that each router must maintain
and can greatly reduce the amount of work that the routers have to do to keep each
other informed of topology changes.
Figure 4.42 (a) Label on an ATM-encapsulated packet. (b) Label on a frame-encapsulated packet.
Figure 4.43 (a) Routers connect to each other using an “overlay” of virtual circuits. (b)
Routers peer directly with LSRs.
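A quick calculation shows how the overlay's adjacency count grows. It assumes the fully meshed case described above, in which every router has a virtual circuit to every other router; the function names are made up for this sketch.

```python
def overlay_adjacencies_per_router(num_routers):
    """In a full mesh of virtual circuits, each router peers with all the others."""
    return num_routers - 1


def overlay_virtual_circuits(num_routers):
    """Total virtual circuits needed to fully mesh the routers."""
    return num_routers * (num_routers - 1) // 2


# Six routers, as in the example where R1 has five routing neighbors.
print(overlay_adjacencies_per_router(6))   # 5
print(overlay_virtual_circuits(6))         # 15
# For comparison, a much larger overlay:
print(overlay_adjacencies_per_router(100), overlay_virtual_circuits(100))   # 99 4950
```

With LSRs in place of the ATM switches, a router's adjacency count depends only on the number of LSRs it attaches to directly (one, for R1 in the example), not on the total number of routers.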
A second benefit of running the same routing protocols on edge routers and on
the LSRs is that the edge routers now have a full view of the topology of the network.
This means that if some link or node fails inside the network, the edge routers will
have a better chance of picking a good new path than if the ATM switches rerouted
the affected VCs without the knowledge of the edge routers.
Note that the step of “replacing” ATM switches with LSRs is actually achieved
by changing the protocols running on the switches, but typically no change to the
forwarding hardware is needed. That is, an ATM switch can often be converted to an
MPLS LSR by upgrading only its software. Furthermore, an MPLS LSR might continue
to support standard ATM capabilities at the same time as it runs the MPLS control protocols.
More recently, the idea of running IP control protocols on devices that are unable
to forward IP packets natively has been extended to optical switches and TDM devices
such as SONET multiplexers. This is known as generalized MPLS (GMPLS). Part of
the motivation for GMPLS was to provide routers with topological knowledge of an
optical network, just as in the ATM case. Even more important was the fact that there
were no standard protocols for controlling optical devices, and so MPLS seemed like
a natural fit for that job.
Explicit Routing
In Section 3.1.3 we introduced the concept of source routing. IP has a source routing
option, but it is not widely used for several reasons, including the fact that only a
limited number of hops can be specified, and because it is usually processed outside
the “fast path” on most routers.
MPLS provides a convenient way to add capabilities similar to source routing
to IP networks, although the capability is more often called “explicit routing” rather
than “source routing.” One reason for the distinction is that it usually isn’t the real
source of the packet that picks the route. More often it is one of the routers inside a
service provider’s network. Figure 4.44 shows an example of how the explicit routing
capability of MPLS might be applied. This sort of network is often called a “fish”
network because of its shape (the routers R1 and R2 form the tail; R7 is at the head).
Suppose that the operator of the network in Figure 4.44 has determined that any
traffic flowing from R1 to R7 should follow the path R1-R3-R6-R7, and that any traffic
going from R2 to R7 should follow the path R2-R3-R4-R7. One reason for such a
choice would be to make good use of the capacity available along the two distinct paths
from R3 to R7. This cannot easily be accomplished with normal IP routing because
R3 doesn’t look at where traffic came from in making its forwarding decisions.
Figure 4.44 A network requiring explicit routing.
Because MPLS uses label swapping to forward packets, it is easy enough to
achieve the desired routing if the routers are MPLS-enabled. If R1 and R2 attach
distinct labels to packets before sending them to R3, then R3 can forward packets
from R1 and R2 along different paths. The question that then arises is, How do all
the routers in the network agree on what labels to use and how to forward packets
with particular labels? Clearly, we can’t use the same procedures as described in the
preceding section to distribute labels because those procedures establish labels that
cause packets to follow the normal paths picked by IP routing, which is exactly what
we are trying to avoid. Instead, a new mechanism is needed. It turns out that the
protocol used for this task is the Resource Reservation Protocol (RSVP). We’ll talk
more about this protocol in Section 6.5.2, but for now it suffices to say that it is possible
to send an RSVP message along an explicitly specified path (e.g., R1-R3-R6-R7) and
use it to set up label forwarding table entries all along that path. This is very similar
to the process of establishing a virtual circuit described in Section 3.1.2.
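Once the label forwarding entries are in place, forwarding along the two explicit routes reduces to table lookups. In the sketch below the tables are written by hand as a stand-in for what signaling would install, and every label value is hypothetical; only the two paths come from the example.

```python
# Ingress routers: which neighbor and label to use for traffic headed to R7.
INGRESS = {"R1": ("R3", 100), "R2": ("R3", 200)}

# (router, incoming label) -> (next router, outgoing label)
LABEL_TABLE = {
    ("R3", 100): ("R6", 101),
    ("R6", 101): ("R7", 102),
    ("R3", 200): ("R4", 201),
    ("R4", 201): ("R7", 202),
}


def trace(ingress, egress="R7"):
    """Follow the label-switched path from an ingress router to the egress."""
    path = [ingress]
    router, label = INGRESS[ingress]
    path.append(router)
    while router != egress:
        router, label = LABEL_TABLE[(router, label)]
        path.append(router)
    return path


print(trace("R1"))   # ['R1', 'R3', 'R6', 'R7']
print(trace("R2"))   # ['R2', 'R3', 'R4', 'R7']
```

R3 sends the two flows in different directions even though both are destined to R7, because its decision is keyed on the incoming label rather than on the destination address.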
One of the applications of explicit routing is “traffic engineering,” which refers
to the task of ensuring that sufficient resources are available in a network to meet
the demands placed on it. Controlling exactly which paths the traffic flows on is an
important part of traffic engineering. Explicit routing can also help to make networks
more resilient in the face of failure, using a capability called fast reroute. For example,
it is possible to precalculate a path from router A to router B that explicitly avoids
a certain link L. In the event that link L fails, router A could send all traffic destined
to B down the precalculated path. The combination of precalculation of the “backup
path” and the explicit routing of packets along the path means that A doesn’t need
to wait for routing protocol packets to make their way across the network or for
routing algorithms to be executed by various other nodes in the network. In certain
circumstances, this can significantly reduce the time taken to reroute packets around
a point of failure.
One final point to note about explicit routing is that explicit routes need not
be calculated by a network operator as in the above example. There is a range of
algorithms that routers can use to calculate explicit routes automatically. The most
common of these is called constrained shortest path first (CSPF), which is like the
link-state algorithms described in Section 4.2.3, but which also takes “constraints”
into account. For example, if it was required to find a path from R1 to R7 that could
carry an offered load of 100 Mbps, we could say that the “constraint” is that each
link must have at least 100 Mbps of available capacity. CSPF addresses this sort of
problem. More details on CSPF, and the applications of explicit routing, are provided
in the “Further Reading” section at the end of the chapter.
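One simple way to realize the idea is to discard every link that fails the constraint and then run an ordinary shortest-path computation on what remains. The sketch below does exactly that with Dijkstra's algorithm. The link list loosely follows the fish network, but the costs and available capacities are invented, and real CSPF implementations handle more kinds of constraints than this.

```python
import heapq


def cspf(links, src, dst, demand_mbps):
    """Shortest path using only links with enough available capacity.

    links: iterable of (node_a, node_b, cost, available_mbps), treated as bidirectional.
    Returns (total_cost, path) or None if no feasible path exists.
    """
    graph = {}
    for a, b, cost, available in links:
        if available >= demand_mbps:          # prune links that violate the constraint
            graph.setdefault(a, []).append((b, cost))
            graph.setdefault(b, []).append((a, cost))

    heap = [(0, src, [src])]
    visited = set()
    while heap:
        dist, node, path = heapq.heappop(heap)
        if node == dst:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(heap, (dist + cost, neighbor, path + [neighbor]))
    return None


# Hypothetical costs and capacities: the R3-R4 link has only 50 Mbps available.
LINKS = [
    ("R1", "R3", 1, 200), ("R2", "R3", 1, 200),
    ("R3", "R4", 1, 50),  ("R4", "R7", 1, 200),
    ("R3", "R6", 2, 200), ("R6", "R7", 1, 200),
]

print(cspf(LINKS, "R1", "R7", 10))    # (3, ['R1', 'R3', 'R4', 'R7'])
print(cspf(LINKS, "R1", "R7", 100))   # (4, ['R1', 'R3', 'R6', 'R7'])
```

With a 10 Mbps demand the cheapest path goes through R4; at 100 Mbps the R3-R4 link is pruned and the computation falls back to the path through R6.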
Virtual Private Networks and Tunnels
We first talked about virtual private networks (VPNs) in Section 4.1.8, and we noted
that one way to build them was using tunnels. It turns out that MPLS can be thought
of as a way to build tunnels, and this makes it suitable for building VPNs of various types.
The simplest form of MPLS VPN to understand is a “layer 2” VPN. In this type
of VPN, MPLS is used to tunnel layer 2 data (such as Ethernet frames or ATM cells)
across a network of MPLS-enabled routers. Recall from Section 4.1.8 that one reason
for tunnels is to provide some sort of network service (such as multicast) that is not
supported by some routers in the network. The same logic applies here: IP routers
are not ATM switches, so you cannot provide an ATM virtual circuit service across a
network of conventional routers. However, if you had a pair of routers interconnected
by a tunnel, they could send ATM cells across the tunnel and emulate an ATM circuit.
The term for this technique within the IETF is pseudowire emulation. Figure 4.45
illustrates the idea.
Figure 4.45 An ATM circuit is emulated by a tunnel.
We have already seen how IP tunnels are built: The router at the entrance of the
tunnel wraps the data to be tunneled in an IP header (the “tunnel header”), which
represents the address of the router at the far end of the tunnel, and sends the data like
any other IP packet. The receiving router receives the packet with its own address in
the header, strips the tunnel header, and finds the data that was tunneled, which it then
processes. Exactly what it does with that data depends on what it is. For example, if it
were another IP packet, it would then be forwarded like a normal IP packet. However,
it need not be an IP packet, as long as the receiving router knows what to do with
non-IP packets. We’ll return to the issue of how to handle non-IP data in a moment.
An MPLS tunnel is not too different from an IP tunnel, except that the “tunnel
header” consists of an MPLS header rather than an IP header. Looking back to our
first example, in Figure 4.41, we saw that router R1 attached a label (15) to every
packet that it sent toward prefix 10.1.1. Such a packet would then follow the path
R1-R2-R3, with each router in the path examining only the MPLS label. Thus, we
observe that there was no requirement that R1 only send IP packets along this path—
any data could be wrapped up in the MPLS header, and it would follow the same path
because the intervening routers never look beyond the MPLS header. In this regard,
an MPLS header is just like an IP tunnel header (see footnote 7). The only issue with sending non-IP
traffic along a tunnel, MPLS or otherwise, is this: What do we do with non-IP traffic
when it reaches the end of the tunnel? The general solution is to carry some sort of
demultiplexing identifier in the tunnel payload that tells the router at the end of the
tunnel what to do. It turns out that an MPLS label is a perfect fit for such an identifier.
An example will make this clear.
Let’s assume we want to tunnel ATM cells from one router to another across
a network of MPLS-enabled routers, as in Figure 4.45. Further, we assume that the
goal is to emulate an ATM virtual circuit; that is, cells arrive at the entrance, or head,
of the tunnel on a certain input port with a certain VCI and should leave the tail
end of the tunnel on a certain output port and potentially different VCI. This can be
accomplished by configuring the “head” and “tail” routers as follows:
■ The head router needs to be configured with the incoming port, the incoming
VCI, the “demultiplexing label” for this emulated circuit, and the address of
the tunnel end router.
■ The tail end router needs to be configured with the outgoing port, the outgoing
VCI, and the demultiplexing label.
Once the routers are provided with this information, we can see how an ATM cell
would be forwarded. Figure 4.46 illustrates the steps.
Footnote 7: Note, however, that an MPLS header is only 4 bytes long, compared to 20 for an IP header, which implies a
bandwidth savings when MPLS is used.
Figure 4.46 Forwarding ATM cells along a tunnel.
1 An ATM cell arrives on the designated input port with the appropriate VCI value
(101 in this example).
2 The head router attaches the demultiplexing label that identifies the emulated circuit.
3 The head router then attaches a second label, which is the tunnel label that will
get the packet to the tail router. This label is learned by mechanisms just like
those described in Section 4.5.1.
4 Routers between the head and tail forward the packet using only the tunnel label.
5 The tail router removes the tunnel label, finds the demultiplexing label, and
recognizes the emulated circuit.
6 The tail router modifies the ATM VCI to the correct value (202 in this case) and
sends it out the correct port.
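The six steps above can be traced with a short sketch. The VCI values 101 and 202 come from the example; the port numbers, the demultiplexing label 500, the tunnel labels 30 and 31, and the single transit router are hypothetical values chosen for illustration.

```python
# Head router: (input port, incoming VCI) -> labels for this emulated circuit.
HEAD = {(1, 101): {"demux_label": 500, "tunnel_label": 30}}
# A router between head and tail: swaps the tunnel label only.
TRANSIT = {30: 31}
# Tail router: demultiplexing label -> outgoing port and VCI.
TAIL = {500: {"out_port": 2, "out_vci": 202}}


def head_router(port, cell):
    """Steps 1-3: match the circuit, push the demux label, then the tunnel label."""
    cfg = HEAD[(port, cell["vci"])]
    # The label stack is stored top-first: tunnel label on top of the demux label.
    return {"labels": [cfg["tunnel_label"], cfg["demux_label"]], "cell": cell}


def transit_router(packet):
    """Step 4: forward on the top (tunnel) label alone; the rest is untouched."""
    top, *rest = packet["labels"]
    return {"labels": [TRANSIT[top], *rest], "cell": packet["cell"]}


def tail_router(packet):
    """Steps 5-6: drop the tunnel label, use the demux label, rewrite the VCI."""
    _tunnel_label, demux_label = packet["labels"]
    circuit = TAIL[demux_label]
    return circuit["out_port"], {**packet["cell"], "vci": circuit["out_vci"]}


cell = {"vci": 101, "payload": b"data"}
packet = transit_router(head_router(1, cell))
print(packet["labels"])          # [31, 500]
print(tail_router(packet))       # (2, {'vci': 202, 'payload': b'data'})
```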
One item in this example that might be surprising is that the packet has two
labels attached to it. This is one of the interesting features of MPLS—labels may be
“stacked” on a packet to any depth. This provides some useful scaling capabilities. In
this example, it enables a single tunnel to carry a potentially large number of emulated circuits.
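To show what a stack of labels looks like on the wire, the sketch below packs and unpacks 4-byte label entries. The field layout used (20-bit label, 3-bit traffic class, 1-bit bottom-of-stack flag, 8-bit TTL) is the standard MPLS shim encoding from RFC 3032 and is not spelled out in the text above; the label values and default TTL are arbitrary.

```python
import struct


def pack_entry(label, bottom_of_stack, ttl=64, traffic_class=0):
    """Encode one 4-byte label stack entry."""
    if not 0 <= label < 2 ** 20:
        raise ValueError("label must fit in 20 bits")
    word = (label << 12) | (traffic_class << 9) | (int(bottom_of_stack) << 8) | ttl
    return struct.pack("!I", word)


def pack_stack(labels, ttl=64):
    """Encode a stack, top label first; only the last entry has the bottom flag set."""
    last = len(labels) - 1
    return b"".join(pack_entry(lbl, i == last, ttl) for i, lbl in enumerate(labels))


def unpack_stack(data):
    """Read entries until the bottom-of-stack flag is seen; return the labels."""
    labels = []
    for offset in range(0, len(data), 4):
        (word,) = struct.unpack_from("!I", data, offset)
        labels.append(word >> 12)
        if (word >> 8) & 1:
            break
    return labels


stack = pack_stack([30, 500])      # tunnel label on top, demultiplexing label below
print(len(stack))                  # 8
print(unpack_stack(stack))         # [30, 500]
```

The bottom-of-stack flag is what lets a receiver tell where the labels end and the payload begins, which is what makes stacking to arbitrary depth possible.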
The same techniques described here can be applied to emulate many other layer 2
services, including Frame Relay and Ethernet. It is worth noting that virtually identical
capabilities can be provided using IP tunnels; the main advantage of MPLS here is the
shorter tunnel header.
Figure 4.47 Example of a layer 3 VPN. Customers A and B each obtain a virtually
private IP service from a single provider.
Before MPLS was used to tunnel layer 2 services, it was also being used to support
layer 3 VPNs. We won’t go into the details of layer 3 VPNs, which are quite complex
(see the “Further Reading” section for some good sources of more information), but
we will note that they represent one of the most popular uses of MPLS today. Layer 3
VPNs also use stacks of MPLS labels to tunnel packets across an IP network. However,
the packets that are tunneled are themselves IP packets—hence the name “layer 3
VPNs.” In a layer 3 VPN, a single service provider operates a network of MPLS-enabled routers and provides a “virtually private” IP network service to any number
of distinct customers. That is, each customer of the provider has some number of
sites, and the service provider creates the illusion for each customer that there are no
other customers on the network. The customer sees an IP network interconnecting his
own sites, and no other sites. This means that each customer is isolated from all other
customers in terms of both routing and addressing. Customer A can’t send packets
directly to customer B, and vice versa (see footnote 8). Customer A can even use IP addresses that
have also been used by customer B. The basic idea is illustrated in Figure 4.47. As in
layer 2 VPNs, MPLS is used to tunnel packets from one site to another. However, the
configuration of the tunnels is performed automatically by some fairly elaborate use
of BGP, which is beyond the scope of this book.
Footnote 8: Customer A in fact usually can send data to customer B in some restricted way. Most likely, both customer A
and customer B have some connection to the global Internet, and thus it is probably possible for customer A to
send email messages, for example, to the mail server inside customer B’s network. The “privacy” offered by a
VPN prevents customer A from having unrestricted access to all the machines and subnets inside customer B’s
network.
In summary, MPLS is a rather versatile tool that has been applied to a wide
range of different networking problems. It combines the label swapping forwarding
mechanism that is normally associated with virtual circuit networks with the routing
and control protocols of IP datagram networks to produce a class of network that is
somewhere between the two conventional extremes. This extends the capabilities of
IP networks to enable, among other things, more precise control of routing and the
support of a range of VPN services.
4.6 Summary
The main theme of this chapter was how to build big networks by interconnecting
smaller networks. We looked at bridging in the last chapter, but it is a technique
that is mostly used to interconnect a small to moderate number of similar networks.
What bridging does not do well is tackle the two closely related problems of building
very large networks: heterogeneity and scale. The Internet Protocol is the key tool for
dealing with these problems, and it provided most of the examples for this chapter.
IP tackles heterogeneity by defining a simple, common service model for an
internetwork, which is based on the best-effort delivery of IP datagrams. An important
part of the service model is the global addressing scheme, which enables any two nodes
in an internetwork to uniquely identify each other for the purposes of exchanging data.
The IP service model is simple enough to be supported by any known networking
technology, and the ARP mechanism is used to translate global IP addresses into local
link-layer addresses.
A crucial aspect of the operation of an internetwork is the determination of
efficient routes to any destination in the internet. Internet routing algorithms solve
this problem in a distributed fashion; this chapter introduced the two major classes of
algorithms, link-state and distance-vector, along with examples of their application
(OSPF and RIP, respectively). We also examined the extensions to IP routing that will support
mobile hosts.
We then saw a succession of scaling problems and the ways that IP deals with
them. The major scaling issues are the efficient use of address space and the growth
of routing tables as the Internet grows. The hierarchical IP address format, with its
network and host parts, gives us one level of hierarchy to manage scale. Subnetting
lets us make more efficient use of network numbers and helps consolidate routing
information; in effect, it adds one more level of hierarchy to the address. Classless
routing (CIDR) lets us introduce more levels of hierarchy and achieve further routing
aggregation. Autonomous systems allow us to partition the routing problem into two
parts, interdomain and intradomain routing, each of which is much smaller than the
total routing problem would be. These mechanisms have enabled today’s Internet to
sustain remarkable growth.
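The two directions of hierarchy described above, aggregation and subnetting, can be tried out with Python's standard ipaddress module. The prefixes below are hypothetical and were chosen only because they are contiguous; this is a sketch of the idea, not of how any router implements it.

```python
import ipaddress

# Hypothetical customer blocks that happen to be contiguous.
customers = [
    ipaddress.ip_network("192.4.16.0/24"),
    ipaddress.ip_network("192.4.17.0/24"),
    ipaddress.ip_network("192.4.18.0/23"),
]

# Aggregation: adjacent prefixes merge into the fewest covering prefixes,
# so a single routing table entry can stand for all three customers.
aggregate = list(ipaddress.collapse_addresses(customers))
print(aggregate)

# Subnetting goes the other way: one prefix is split into longer ones.
subnets = list(aggregate[0].subnets(prefixlen_diff=2))
print(subnets)
```

Here the three blocks collapse into the single prefix 192.4.16.0/22, which is all that needs to be advertised outside the provider.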
Eventually, all of these mechanisms will be unable to keep up with the Internet’s
growth, and a new address format will be needed. This will require a new IP datagram
format and a new version of the protocol. Originally known as IP Next Generation
(IPng), this new protocol is now known as IPv6, and it provides a 128-bit address with
CIDR-like addressing and routing. While many new capabilities have been claimed for
IPv6, its main advantage remains its ability to support an extremely large number of
addressable devices.
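To get a feel for the difference between 32-bit and 128-bit addresses, the short sketch below compares the two address spaces and checks membership in a prefix, using Python's standard ipaddress module. The prefix 2001:db8::/32 is the range reserved for documentation and is used here purely as an example.

```python
import ipaddress

v4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
v6_total = ipaddress.ip_network("::/0").num_addresses

print(v4_total == 2**32)               # True
print(v6_total == 2**128)              # True
print(v6_total // v4_total == 2**96)   # True: 2^96 IPv6 addresses per IPv4 address

# Prefix-based (CIDR-like) matching works the same way in both versions.
prefix = ipaddress.ip_network("2001:db8::/32")
print(ipaddress.ip_address("2001:db8::1") in prefix)  # True
```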
Open Issue: Deployment of IPv6
More than 10 years have elapsed since the shortage of IPv4 address space became
serious enough to warrant proposals for a new version of IP. The original IPv6
specification is now more than 7 years old. IPv6-capable host operating systems are
now widely available and the major router vendors offer varying degrees of support
for IPv6 in their products. Yet the deployment of IPv6 in the Internet has not, at the
time of writing, begun in any meaningful way. It is worth wondering when deployment
is likely to begin in earnest, and what will cause it.
One reason IPv6 has not been needed sooner is the extensive use
of NAT (network address translation, described earlier in this chapter). As providers
viewed IPv4 addresses as a scarce resource, they handed out fewer of them to their
customers or charged for the number of addresses used; customers responded by hiding
many of their devices behind a NAT box and a single IPv4 address. For example, it
is likely that most home networks with more than one IP-capable device have some
sort of NAT in the network to conserve addresses. So one factor that might drive IPv6
deployment would be applications that don’t work well with NAT. While client/server
applications work reasonably well when the client’s address is “hidden” behind a NAT
box, peer-to-peer applications fare less well. Examples of applications that would work
better without NAT and would therefore benefit from more liberal address allocation
policies are multiplayer gaming and IP telephony.
Obtaining blocks of IPv4 addresses has been getting more difficult for years, and
this is particularly noticeable in countries outside the United States. As the difficulty
increases, the incentive for providers to start offering IPv6 addresses to their customers
also rises. At the same time, for existing providers, offering IPv6 is a substantial additional cost because they don’t get to stop supporting IPv4 when they start to offer IPv6.
This means, for example, that the size of a provider’s routing tables can only increase
initially because they need to carry all the existing IPv4 prefixes plus new IPv6 prefixes.
At the moment, IPv6 deployment is happening almost exclusively in research
networks. A few service providers are starting to offer it (often with some incentive
from national governments). It seems hard to imagine that the Internet can continue to
grow indefinitely without IPv6 seeing some more significant deployments, but it also
seems likely that the overwhelming majority of hosts and networks will be IPv4-only
for the foreseeable future.
Not surprisingly, there have been countless papers written on various aspects of the
Internet. Of these, we recommend two as must reading: The paper by Cerf and Kahn
is the one that originally introduced the TCP/IP architecture and is worth reading just
for its historical perspective; the paper by Bradner and Mankin gives an informative
overview of how the rapidly growing Internet has stressed the scalability of the original architecture, ultimately resulting in the next-generation IP. The paper by Paxson
describes a study of how routers behave in the Internet. It also happens to be a good
example of how researchers are now studying the dynamic behavior of the Internet.
The final paper discusses multicast, presenting the approach to multicast originally
used on the MBone.
■ Cerf, V., and R. Kahn. A protocol for packet network intercommunication. IEEE Transactions on Communications COM-22(5):637–648, May 1974.
■ Bradner, S., and A. Mankin. The recommendation for the next generation IP
protocol. Request for Comments 1752, January 1995.
■ Paxson, V. End-to-end routing behavior in the Internet. SIGCOMM ’96,
pages 25–38, August 1996.
■ Deering, S., and D. Cheriton. Multicast routing in datagram internetworks
and extended LANs. ACM Transactions on Computer Systems 8(2):85–110,
May 1990.
Beyond these papers, Perlman gives an excellent explanation of routing in an
internet, including coverage of both bridges and routers [Per00]. Also, the book by
Lynch and Rose gives general information on the scalability of the Internet [Cha93].
Some interesting experimental studies of the behavior of Internet routing are presented
in Labovitz et al. [LAAJ00].
Many of the techniques and protocols developed to help the Internet scale are
described in RFCs: Subnetting is described in Mogul and Postel [MP85], CIDR is
described in Fuller et al. [FLYV93], RIP is defined in Hedrick [Hed88] and Malkin
[Mal93], OSPF is defined in Moy [Moy98], and BGP-4 is defined in Rekhter and Li
[RL95]. The OSPF specification, at over 200 pages, is one of the longer RFCs around,
but also contains an unusual wealth of detail about how to implement a protocol. A
collection of RFCs related to IPv6 can be found in Bradner and Mankin [BM95], and
the most recent IPv6 spec is by Deering and Hinden [DH98]. The reasons to avoid IP
fragmentation are examined in Kent and Mogul [KM87], and the path MTU discovery
technique is described in Mogul and Deering [MD90]. Protocol Independent Multicast
(PIM) is described in Deering et al. [DEF+96] and [EFH+98].
There has been a lot of work developing algorithms that can be used by routers to
do fast lookup of IP addresses. (Recall that the problem is that the router needs to match
the longest prefix in the forwarding table.) PATRICIA trees are one of the first algorithms applied to this problem [Mor68]. More recent work is reported in [DBCP97],
[WVTP97], [LS98], and [SVSW98]. For an overview of how these algorithms can be
used to build a high-speed router, see Partridge et al. [Par98].
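To make the longest-prefix-match problem concrete, here is a minimal sketch of a plain binary trie, with one node per prefix bit. It is not a PATRICIA tree (which compresses chains of single-child nodes) and not any of the faster schemes cited above; the class name PrefixTrie, the prefixes, and the next-hop labels are all assumptions made for illustration.

```python
import ipaddress


class PrefixTrie:
    """Uncompressed binary trie keyed on the leading bits of an address."""

    def __init__(self):
        # Each node is a dict: keys 0 and 1 lead to children, and the
        # key "hop" holds the next hop if a prefix ends at this node.
        self.root = {}

    def insert(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (bits >> (net.max_prefixlen - 1 - i)) & 1
            node = node.setdefault(bit, {})
        node["hop"] = next_hop

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        bits = int(addr)
        node = self.root
        best = node.get("hop")  # a /0 entry acts as the default route
        for i in range(addr.max_prefixlen):
            bit = (bits >> (addr.max_prefixlen - 1 - i)) & 1
            node = node.get(bit)
            if node is None:
                break
            if "hop" in node:
                best = node["hop"]  # remember the longest match so far
        return best


table = PrefixTrie()
table.insert("0.0.0.0/0", "default")
table.insert("10.0.0.0/8", "A")
table.insert("10.1.0.0/16", "B")
table.insert("10.1.2.0/24", "C")

print(table.lookup("10.1.2.3"))    # C: the /24 is the longest match
print(table.lookup("10.1.9.9"))    # B
print(table.lookup("10.200.0.1"))  # A
print(table.lookup("192.0.2.1"))   # default
```

A lookup visits at most one node per address bit, regardless of how many prefixes the table holds, which is the property the more sophisticated algorithms then improve on.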
MPLS and the related protocols that fed its development are described in Chandranmenon and Varghese [CV95], Rekhter et al. [RDR+97], and Davie and Rekhter
[DR00]. The latter reference describes many applications of MPLS such as traffic
engineering, fast recovery from network failures, and virtual private networks. [RR99]
provides the specification of MPLS/BGP VPNs, a form of layer 3 VPN that can be provided over MPLS networks.
Finally, we recommend the following live references:
■ the IETF home page, from which you can get RFCs, Internet drafts, and working group charters
■ current state of IPv6
1 What aspect of IP addresses makes it necessary to have one address per network
interface, rather than just one per host? In light of your answer, why does IP
tolerate point-to-point interfaces that have nonunique addresses or no addresses?
2 Why does the Offset field in the IP header measure the offset in 8-byte units? (Hint:
Recall that the Offset field is 13 bits long.)
3 Some signalling errors can cause entire ranges of bits in a packet to be overwritten
by all 0s or all 1s. Suppose all the bits in the packet including the Internet checksum
are overwritten. Could a packet with all 0s or all 1s be a legal IPv4 packet? Will
the Internet checksum catch that error? Why or why not?
4 Suppose a TCP message that contains 2048 bytes of data and 20 bytes of TCP
header is passed to IP for delivery across two networks of the Internet (i.e., from
the source host to a router to the destination host). The first network uses 14-byte
headers and has an MTU of 1024 bytes; the second uses 8-byte headers with an
MTU of 512 bytes. Each network’s MTU gives the size of the largest IP datagram
that can be carried in a link-layer frame. Give the sizes and offsets of the sequence
of fragments delivered to the network layer at the destination host. Assume all IP
headers are 20 bytes.
5 Path MTU is the smallest MTU of any link on the current path (route) between two
hosts. Assume we could discover the path MTU of the path used in the previous
exercise, and that we use this value as the MTU for all the path segments. Give
the sizes and offsets of the sequence of fragments delivered to the network layer
at the destination host.
6 Suppose an IP packet is fragmented into 10 fragments, each with a 1% (independent) probability of loss. To a reasonable approximation, this means there
is a 10% chance of losing the whole packet due to loss of a fragment. What
is the probability of net loss of the whole packet if the packet is transmitted twice,
(a) assuming all fragments received must have been part of the same transmission?
(b) assuming any given fragment may have been part of either transmission?
(c) Explain how use of the Ident field might be applicable here.
7 Suppose the fragments of Figure 4.5(b) all pass through another router onto a
link with an MTU of 380 bytes, not counting the link header. Show the fragments
produced. If the packet were originally fragmented for this MTU, how many
fragments would be produced?
8 What is the maximum bandwidth at which an IP host can send 576-byte packets
without having the Ident field wrap around within 60 seconds? Suppose IP’s maximum segment lifetime (MSL) is 60 seconds; that is, delayed packets can arrive
up to 60 seconds late but no later. What might happen if this bandwidth were
exceeded?
9 ATM AAL3/4 uses fields Btag/Etag, BASize/Len, Type, SEQ, MID, Length, and
CRC-10 to implement fragmentation into cells. IPv4 uses Ident, Offset, and the M
bit in Flags, among others. What is the IP analog, if any, for each AAL3/4 field?
Does each IP field listed here have an AAL3/4 analog? How well do these fields
correspond?
10 Why do you think IPv4 has fragment reassembly done at the endpoint, rather than
at the next router? Why do you think IPv6 abandoned fragmentation entirely?
Hint: Think about the differences between IP-layer fragmentation and link-layer
fragmentation.
11 Having ARP table entries time out after 10–15 minutes is an attempt at a reasonable compromise. Describe the problems that can occur if the timeout value is too
small or too large.
12 IP currently uses 32-bit addresses. If we could redesign IP to use the 6-byte MAC
address instead of the 32-bit address, would we be able to eliminate the need for
ARP? Explain why or why not.
13 Suppose hosts A and B have been assigned the same IP address on the same Ethernet, on which ARP is used. B starts up after A. What will happen to A’s existing
connections? Explain how “self-ARP” (querying the network on startup for one’s
own IP address) might help with this problem.
14 Suppose an IP implementation adheres literally to the following algorithm on
receipt of a packet, P, destined for IP address D:
    if (Ethernet address for D is in ARP cache)
        send P
    else
        send out an ARP query for D
        put P into a queue until the response comes back
(a) If the IP layer receives a burst of packets destined for D, how might this
algorithm waste resources unnecessarily?
(b) Sketch an improved version.
(c) Suppose we simply drop P, after sending out a query, when cache lookup
fails. How would this behave? (Some early ARP implementations allegedly did
this.)
Figure 4.48 Network for Exercises 15, 17, and 20.
Figure 4.49 Network for Exercise 16.
15 For the network given in Figure 4.48, give global distance-vector tables like those
of Tables 4.5 and 4.8 when
(a) each node knows only the distances to its immediate neighbors.
(b) each node has reported the information it had in the preceding step to its
immediate neighbors.
(c) step (b) happens a second time.
16 For the network given in Figure 4.49, give global distance-vector tables like those
of Tables 4.5 and 4.8 when
(a) each node knows only the distances to its immediate neighbors.
(b) each node has reported the information it had in the preceding step to its
immediate neighbors.
(c) step (b) happens a second time.
17 For the network given in Figure 4.48, show how the link-state algorithm builds
the routing table for node D.
Table 4.12 Forwarding tables for Exercise 18.
Table 4.13 Forwarding tables for Exercise 19.
18 Suppose we have the forwarding tables shown in Table 4.12 for nodes A and F,
in a network where all links have cost 1. Give a diagram of the smallest network
consistent with these tables.
19 Suppose we have the forwarding tables shown in Table 4.13 for nodes A and F,
in a network where all links have cost 1. Give a diagram of the smallest network
consistent with these tables.
20 For the network in Figure 4.48, suppose the forwarding tables are all established
as in Exercise 15 and then the C–E link fails. Give
(a) the tables of A, B, D, and F after C and E have reported the news.
(b) the tables of A and D after their next mutual exchange.
(c) the table of C after A exchanges with it.
Table 4.14 Routing table for Exercise 21.
Table 4.15 Routing table for Exercise 22.
21 Suppose a router has built up the routing table shown in Table 4.14. The router
can deliver packets directly over interfaces 0 and 1, or it can forward packets to
routers R2, R3, or R4. Describe what the router does with a packet addressed to
each of the following destinations:
22 Suppose a router has built up the routing table shown in Table 4.15. The router
can deliver packets directly over interfaces 0 and 1, or it can forward packets to
routers R2, R3, or R4. Assume the router does the longest prefix match. Describe
what the router does with a packet addressed to each of the following destinations:
Figure 4.50 Simple network for Exercise 23.
23 Consider the simple network in Figure 4.50, in which A and B exchange distance-vector routing information. All links have cost 1. Suppose the A–E link fails.
(a) Give a sequence of routing table updates that leads to a routing loop between
A and B.
(b) Estimate the probability of the scenario in (a), assuming A and B send out
routing updates at random times, each at the same average rate.
(c) Estimate the probability of a loop forming if A broadcasts an updated report
within 1 second of discovering the A–E failure, and B broadcasts every 60
seconds uniformly.
24 Consider the situation involving the creation of a routing loop in the network
of Figure 4.15 when the A–E link goes down. List all sequences of table updates
among A, B, and C, pertaining to destination E, that lead to the loop. Assume that
table updates are done one at a time, that the split horizon technique is observed
by all participants, and that A sends its initial report of E’s unreachability to B
before C. You may ignore updates that don’t result in changes.
25 Suppose a set of routers all use the split horizon technique; we consider here under
what circumstances it makes a difference if they use poison reverse in addition.
(a) Show that poison reverse makes no difference in the evolution of the routing
loop in the two examples described in Section 4.2.2, given that the hosts
involved use split horizon.
(b) Suppose split horizon routers A and B somehow reach a state in which they
forward traffic for a given destination X toward each other. Describe how this
situation will evolve with and without the use of poison reverse.
(c) Give a sequence of events that leads A and B to a looped state as in (b), even
if poison reverse is used. Hint: Suppose B and A connect through a very slow
link. They each reach X through a third node, C, and simultaneously advertise
their routes to each other.
Figure 4.51 Networks for Exercise 26.
Figure 4.52 Network for Exercise 27.
26 Hold-down is another distance-vector loop-avoidance technique, whereby hosts
ignore updates for a period of time until link failure news has had a chance to
propagate. Consider the networks in Figure 4.51, where all links have cost 1,
except E–D with cost 10. Suppose that the E–A link breaks and B reports its loop-forming E route to A immediately afterward (this is the false route, via A). Specify
the details of a hold-down interpretation, and use this to describe the evolution
of the routing loop in both networks. To what extent can hold-down prevent the
loop in the EAB network without delaying the discovery of the alternative route
in the EABD network?
27 Consider the network in Figure 4.52, using link-state routing. Suppose the B–F
link fails, and the following then occur in sequence:
(a) Node H is added to the right side with a connection to G.
(b) Node D is added to the left side with a connection to C.
(c) A new link D–A is added.
The failed B–F link is now restored. Describe what link-state packets will flood
back and forth. Assume that the initial sequence number at all nodes is 1, and
that no packets time out, and that both ends of a link use the same sequence
number in their LSP for that link, greater than any sequence number either used
before.
Figure 4.53 Network for Exercise 28.
Figure 4.54 Network for Exercise 29.
Figure 4.55 Network for Exercise 30.
routing database for node A in the network shown in Figure 4.53.
29 Give the steps as in Table 4.9 in the forward search algorithm as it builds the
routing database for node A in the network shown in Figure 4.54.
30 Suppose that nodes in the network shown in Figure 4.55 participate in link-state
routing, and C receives contradictory LSPs: One from A arrives claiming the A–B
link is down, but one from B arrives claiming the A–B link is up.
(a) How could this happen?
(b) What should C do? What can C expect?
Do not assume that LSPs contain any synchronized timestamp.
Figure 4.56 Network for Exercise 31.
31 Consider the network shown in Figure 4.56, in which horizontal lines represent
transit providers and numbered vertical lines are interprovider links.
(a) How many routes to P could provider Q’s BGP speakers receive?
(b) Suppose Q and P adopt the policy that outbound traffic is routed to the closest
link to the destination’s provider, thus minimizing their own cost. What paths
will traffic from host A to host B and from host B to host A take?
(c) What could Q do to have the B→A traffic use the closer link 1?
(d) What could Q do to have the B→A traffic pass through R?
32 Give an example of an arrangement of routers grouped into autonomous systems
so that the path with the fewest hops from a point A to another point B crosses
the same AS twice. Explain what BGP would do with this situation.
33 Let A be the number of autonomous systems on the Internet, and let D (for
diameter) be the maximum AS path length.
(a) Give a connectivity model for which D is of order log A and another for which
D is of order A.
(b) Assuming each AS number is 2 bytes and each network number is 4 bytes,
give an estimate for the amount of data a BGP speaker must receive to keep
track of the AS path to every network. Express your answer in terms of A, D,
and the number of networks N.
34 Suppose IP routers learned about IP networks and subnets the way Ethernet learning bridges learn about hosts: by noting the appearance of new ones and the interface by which they arrive. Compare this with existing distance-vector router practice
(a) for a leaf site with a single attachment to the Internet, and
(b) for internal use at an organization that did not connect to the Internet.
Assume that routers only receive new-network notices from other routers, and that
the originating routers receive their IP network information via configuration.
Figure 4.57 Site for Exercise 39.
35 IP hosts that are not designated routers are required to drop packets misaddressed
to them, even if they would otherwise be able to forward them correctly. In the
absence of this requirement, what would happen if a packet addressed to IP address
A were inadvertently broadcast at the link layer? What other justifications for this
requirement can you think of?
36 Read the man page or other documentation for the Unix/Windows utility netstat.
Use netstat to display the current IP routing table on your host. Explain the purpose
of each entry. What is the practical minimum number of entries?
37 Use the Unix utility traceroute (Windows tracert) to determine how many
hops it is from your host to other hosts in the Internet. How many routers do you traverse just to get out of your local
site? Read the man page or other documentation for traceroute and explain how
it is implemented.
38 What will happen if traceroute is used to find the path to an unassigned address?
Does it matter if the network portion or only the host portion is unassigned?
39 A site is shown in Figure 4.57. R1 and R2 are routers; R2 connects to the outside
world. Individual LANs are Ethernets. RB is a bridge router; it routes traffic
addressed to it and acts as a bridge for other traffic. Subnetting is used inside the
site; ARP is used on each subnet. Unfortunately, host A has been misconfigured
and doesn’t use subnets. Which of B, C, D can A reach?
Figure 4.58 Network for Exercise 41.
40 An organization has a class C network 200.1.1 and wants to form subnets for
four departments, with hosts as follows:
A 72 hosts
B 35 hosts
C 20 hosts
D 18 hosts
There are 145 hosts in all.
(a) Give a possible arrangement of subnet masks to make this possible.
(b) Suggest what the organization might do if department D grows to 34 hosts.
41 Suppose hosts A and B are on an Ethernet LAN with class C IP network address
200.0.0. It is desired to attach a host C to the network via a direct connection
to B (see Figure 4.58). Explain how to do this with subnets; give sample subnet
assignments. Assume that an additional network address is not available. What
does this do to the size of the Ethernet LAN?
42 An alternative method for connecting host C in Exercise 41 is to use proxy ARP
and routing: B agrees to route traffic to and from C and also answers ARP queries
for C received over the Ethernet.
(a) Give all packets sent, with physical addresses, as A uses ARP to locate and
then send one packet to C.
(b) Give B’s routing table. What peculiarity must it contain?
43 Propose a plausible addressing plan for IPv6 that runs out of bits. Specifically,
provide a diagram such as Figure 4.33, perhaps with additional ID fields, that
adds up to more than 128 bits, together with plausible justifications for the size
of each field. You may assume fields are divided on byte boundaries and that
the InterfaceID is 64 bits. Hint: Consider fields that would approach maximum
allocation only under unusual circumstances. Can you do this if the InterfaceID is
48 bits?
Table 4.16 Routing table for Exercise 45.
44 Suppose two subnets share the same physical LAN; hosts on each subnet will see
the other subnet’s broadcast packets.
(a) How will DHCP fare if two servers, one for each subnet, coexist on the shared
LAN? What problems might [do!] arise?
(b) Will ARP be affected by such sharing?
45 Table 4.16 is a routing table using CIDR. Address bytes are in hexadecimal.
The notation “/12” in C4.50.0.0/12 denotes a netmask with 12 leading 1 bits,
that is, FF.F0.0.0. Note that the last three entries cover every address and thus
serve in lieu of a default route. State to what next hop the following will be
delivered:
(a) C4.5E.13.87
(b) C4.5E.22.09
(c) C3.41.80.02
(d) 5E.43.91.12
(e) C4.6D.31.2E
(f) C4.6B.31.2E
Table 4.17 Routing table for Exercise 46.
46 Table 4.17 is a routing table using CIDR. Address bytes are in hexadecimal. The
notation “/12” in C4.50.0.0/12 denotes a netmask with 12 leading 1 bits, that is,
FF.F0.0.0. State to what next hop the following will be delivered:
(a) C4.4B.31.2E
(b) C4.5E.05.09
(c) C4.4D.31.2E
(d) C4.5E.03.87
(e) C4.5E.7F.12
(f) C4.5E.D1.02
47 Suppose P, Q, and R are network service providers, with respective CIDR
address allocations (using the notation of Exercise 45) C1.0.0.0/8, C2.0.0.0/8,
and C3.0.0.0/8. Each provider’s customers initially receive address allocations
that are a subset of the provider’s. P has the following customers:
PA, with allocation C1.A3.0.0/16, and
PB, with allocation C1.B0.0.0/12.
Q has the following customers:
QA, with allocation C2.0A.10.0/20, and
QB, with allocation C2.0B.0.0/16.
Assume there are no other providers or customers.
(a) Give routing tables for P, Q, and R, assuming each provider connects to both
of the others.
(b) Now assume P is connected to Q and Q is connected to R, but P and R are
not directly connected. Give tables for P and R.
(c) Suppose customer PA acquires a direct link to Q, and QA acquires a direct
link to P, in addition to existing links. Give tables for P and Q, ignoring R.
48 In the previous problem, assume each provider connects to both others. Suppose
customer PA switches to provider Q and customer QB switches to provider R.
Use the CIDR longest match rule to give routing tables for all three providers that
allow PA and QB to switch without renumbering.
49 Suppose most of the Internet uses some form of geographical addressing, but that
a large international organization has a single IP network address and routes its
internal traffic over its own links.
(a) Explain the routing inefficiency for the organization’s inbound traffic inherent
in this situation.
(b) Explain how the organization might solve this problem for outbound traffic.
(c) For your method above to work for inbound traffic, what would have to
happen?
(d) Suppose the large organization now changes its addressing to separate geographical addresses for each office. What will its internal routing structure
have to look like if internal traffic is still to be routed internally?
50 The telephone system uses geographical addressing. Why do you think this wasn’t
adopted as a matter of course by the Internet?
51 Suppose a site A is multihomed, in that it has two Internet connections from two
different providers, P and Q. Provider-based addressing as in Exercise 47 is used,
and A takes its address assignment from P. Q has a CIDR longest match routing
entry for A.
(a) Describe what inbound traffic might flow on the A–Q connection. Consider
cases where Q does and does not advertise A to the world using BGP.
(b) What is the minimum advertising of its route to A that Q must do in order for
all inbound traffic to reach A via Q if the P–A link breaks?
(c) What problems must be overcome if A is to use both links for its outbound
traffic?
52 An ISP with a class B address is working with a new company to allocate it a
portion of address space based on CIDR. The new company needs IP addresses
for machines in three divisions of its corporate network: Engineering, Marketing,
and Sales. These divisions plan to grow as follows: Engineering has 5 machines
as of the start of year 1 and intends to add 1 machine every week; Marketing
will never need more than 16 machines; and Sales needs 1 machine for every
two clients. As of the start of year 1, the company has no clients, but the sales
model indicates that by the start of year 2, the company will have six clients
and each week thereafter gets one new client with probability 60%, loses one
client with probability 20%, or maintains the same number with probability
20%.
(a) What address range would be required to support the company’s growth plans
for at least seven years if marketing uses all 16 of its addresses and the sales
and engineering plans behave as expected?
(b) How long would this address assignment last? At the time when the company
runs out of address space, how would the addresses be assigned to the three
divisions?
(c) If CIDR addressing were not available for the seven-year plan, what options
would the new company have in terms of getting address space?
53 Propose a lookup algorithm for a CIDR forwarding table that does not require a
linear search of the entire table to find the longest match.
54 Suppose a network N within a larger organization A acquires its own direct connection to an Internet service provider, in addition to an existing connection via
A. Let R1 be the router connecting N to its own provider, and let R2 be the router
connecting N to the rest of A.
(a) Assuming N remains a subnet of A, how should R1 and R2 be configured?
What limitations would still exist with N’s use of its separate connection?
Would A be prevented from using N’s connection? Specify your configuration
in terms of what R1 and R2 should advertise, and with what paths. Assume
a BGP-like mechanism is available.
(b) Now suppose N gets its own network number; how does this change your
answer in (a)?
(c) Describe a router configuration that would allow A to use N’s link when its
own link is down.
Figure 4.59 Example internet for Exercise 55.
55 Consider the example internet shown in Figure 4.59, in which sources D and E
send packets to multicast group G, whose members are shaded in gray. Show the
shortest-path multicast trees for each source.
56 Consider the example internet shown in Figure 4.60 in which sources S1 and S2
send packets to multicast group G, whose members are shaded in gray. Show the
shortest-path multicast trees for each source.
57 Suppose host A is sending to a multicast group; the recipients are leaf nodes of
a tree rooted at A with depth N and with each nonleaf node having k children;
there are thus k^N recipients.
(a) How many individual link transmissions are involved if A sends a multicast
message to all recipients?
Figure 4.60 Example network for Exercise 56.
(b) How many individual link transmissions are involved if A sends unicast messages to each individual recipient?
(c) Suppose A sends to all recipients, but some messages are lost and retransmission is necessary. Unicast retransmissions to what fraction of the recipients is
equivalent, in terms of individual link transmissions, to a multicast retransmission to all recipients?
58 Determine whether or not the following IPv6 address notations are correct.
(a) ::0F53:6382:AB00:67DB:BB27:7332
(b) 7803:42F2:::88EC:D4BA:B75D:11CD
(c) ::4BA8:95CC::DB97:4EAB
(d) 74DC::02BA
(e) ::00FF:
59 Determine if your site is connected to the MBone. If so, investigate and experiment
with any MBone tools, such as sdr, vat, and vic.
60 MPLS labels are usually 20 bits long. Explain why this provides enough labels
when MPLS is used for destination-based forwarding.
61 MPLS has sometimes been claimed to improve router performance. Explain why
this might be true, and suggest reasons why in practice this may not be the case.
62 Assume that it takes 32 bits to carry each MPLS label that is added to a packet
when the “shim” header of Figure 4.42(b) is used.
(a) How many additional bytes are needed to tunnel a packet using the MPLS
techniques described in Section 4.5.3?
(b) How many additional bytes are needed, at a minimum, to tunnel a packet
using an additional IP header as described in Section 4.1.8?
(c) Calculate the efficiency of bandwidth usage for each of the two tunneling
approaches when the average packet size is 300 bytes. Repeat for 64-byte
packets. Bandwidth efficiency is defined as (payload bytes carried) ÷ (total
bytes carried).
63 RFC 791 describes the Internet Protocol and includes two options for source
routing. Describe three disadvantages of using IP source route options compared
to using MPLS for explicit routing. (Hint: The IP header including options may
be at most 15 words long.)
End-to-End Protocols
Victory is the beautiful, bright coloured flower. Transport is the
stem without which it could never have blossomed.
—Winston Churchill
Problem: Getting Processes to Communicate
The previous three chapters have described various technologies that can be
used to connect together a collection of computers: direct links (including
LAN technologies like Ethernet and token ring), packet-switched networks
(including cell-based networks like ATM), and internetworks. The next problem is to
turn this host-to-host packet delivery service into a process-to-process communication
channel. This is the role played by the transport level of the network architecture,
which, because it supports communication between the end application programs, is
sometimes called the end-to-end protocol.
Two forces shape the end-to-end
protocol. From above, the application-level processes that use its services have certain requirements. The following list itemizes some of the common properties that a
transport protocol can be expected to provide:
■ guarantees message delivery
■ delivers messages in the same order they are sent
■ delivers at most one copy of each message
■ supports arbitrarily large messages
■ supports synchronization between the sender and the receiver
■ allows the receiver to apply flow control to the sender
■ supports multiple application processes on each host
Note that this list does not include all the functionality
that application processes might want from the network.
For example, it does not include security, which is typically
provided by protocols that sit above the transport level.
From below, the underlying network upon which the
transport protocol operates has certain limitations in the
level of service it can provide. Some of the more typical
limitations of the network are that it may
■ drop messages
■ reorder messages
■ deliver duplicate copies of a given message
■ limit messages to some finite size
■ deliver messages after an arbitrarily long delay
Such a network is said to provide a best-effort level of
service, as exemplified by the Internet.
The challenge, therefore, is to develop algorithms
that turn the less-than-desirable properties of the underlying network into the high level of service required by application programs. Different transport protocols employ
different combinations of these algorithms. This chapter
looks at these algorithms in the context of three representative services—a simple asynchronous demultiplexing
service, a reliable byte-stream service, and a request/reply service.
In the case of the demultiplexing and byte-stream
services, we use the Internet’s UDP and TCP protocols,
respectively, to illustrate how these services are provided
in practice. In the third case, we first give a collection of
algorithms that implement the request/reply (plus other related) services and then show how these algorithms can be
combined to implement a Remote Procedure Call (RPC)
protocol. This discussion is capped off with a description
of two widely used RPC protocols, SunRPC and DCE-RPC, in terms of these component algorithms. Finally,
the chapter concludes with a section that discusses the
performance of the different transport protocols.
5 End-to-End Protocols
5.1 Simple Demultiplexer (UDP)
The simplest possible transport protocol is one that extends the host-to-host delivery
service of the underlying network into a process-to-process communication service.
There are likely to be many processes running on any given host, so the protocol needs
to add a level of demultiplexing, thereby allowing multiple application processes on
each host to share the network. Aside from this requirement, the transport protocol
adds no other functionality to the best-effort service provided by the underlying network. The Internet’s User Datagram Protocol (UDP) is an example of such a transport protocol.
The only interesting issue in such a protocol is the form of the address used to
identify the target process. Although it is possible for processes to directly identify
each other with an OS-assigned process id (pid), such an approach is only practical
in a closed distributed system in which a single OS runs on all hosts and assigns each
process a unique id. A more common approach, and the one used by UDP, is for
processes to indirectly identify each other using an abstract locator, often called a port
or mailbox. The basic idea is for a source process to send a message to a port and for
the destination process to receive the message from a port.
The header for an end-to-end protocol that implements this demultiplexing function typically contains an identifier (port) for both the sender (source) and the receiver
(destination) of the message. For example, the UDP header is given in Figure 5.1. Notice
that the UDP port field is only 16 bits long. This means that there are up to 64K possible ports, clearly not enough to identify all the processes on all the hosts in the Internet.
Fortunately, ports are not interpreted across the entire Internet, but only on a single
host. That is, a process is really identified by a port on some particular host: a
⟨port, host⟩ pair. In fact, this pair constitutes the demultiplexing key for the UDP protocol.
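As a rough illustration of the header just described, the following Python sketch packs and parses the four 16-bit UDP header fields (source port, destination port, length, checksum, as laid out in RFC 768) and demultiplexes on the destination port. The `ports` table and the function names are hypothetical, the checksum is simply left at zero, and the host half of the demultiplexing key is implicit because the table is assumed to live on a single host.

```python
import struct

# SrcPort, DstPort, Length, Checksum: four 16-bit fields in network byte order.
UDP_HEADER = struct.Struct("!HHHH")

def build_datagram(src_port, dst_port, payload, checksum=0):
    """Prepend an 8-byte UDP header; Length covers header plus data."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload

def parse_datagram(datagram):
    """Split a datagram into its header fields and its message body."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    return src_port, dst_port, checksum, datagram[UDP_HEADER.size:length]

# Hypothetical per-host table: one message queue per bound port.
ports = {53: [], 517: []}

def demux(datagram):
    """Deliver the body to the queue for the destination port, if any."""
    src_port, dst_port, _checksum, body = parse_datagram(datagram)
    queue = ports.get(dst_port)
    if queue is None:
        return False              # no process is bound to this port
    queue.append((src_port, body))
    return True

demux(build_datagram(40000, 53, b"query"))
```

Because the source port travels in the header, the receiving process has what it needs to address a reply.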
Figure 5.1 Format for UDP header.
The next issue is how a process learns the port for the process to which it wants
to send a message. Typically, a client process initiates a message exchange with a server
process. Once a client has contacted a server, the server knows the client’s port (it was
contained in the message header) and can reply to it. The real problem, therefore, is how
the client learns the server’s port in the first place. A common approach is for the server
to accept messages at a well-known port. That is, each server receives its messages at
some fixed port that is widely published, much like the emergency telephone service
available at the well-known phone number 911. In the Internet, for example, the
Domain Name System (DNS) server receives messages at well-known port 53 on each host,
the mail service listens for messages at port 25, the Unix talk program accepts
messages at well-known port 517, and so on. This mapping is published periodically
in an RFC and is available on most Unix systems in file /etc/services. Sometimes a
well-known port is just the starting point for communication: The client and server
use the well-known port to agree on some other port that they will use for subsequent
communication, leaving the well-known port free for other clients.
An alternative strategy is to generalize this idea, so that there is only a single
well-known port—the one at which the “Port Mapper” service accepts messages. A
client would send a message to the Port Mapper’s well-known port asking for the
port it should use to talk to the “whatever” service, and the Port Mapper returns
the appropriate port. This strategy makes it easy to change the port associated with
different services over time, and for each host to use a different port for the same service.
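The two strategies can be contrasted with a small Python sketch. The `WELL_KNOWN` table holds the three ports named above, keyed by the service names conventionally used in /etc/services. The `PortMapper` class is purely hypothetical: it stands in for a service reachable at one fixed port and is not the interface of any real port mapper.

```python
# Strategy 1: a fixed, published mapping from service name to port.
WELL_KNOWN = {"domain": 53, "smtp": 25, "talk": 517}

class PortMapper:
    """Strategy 2 (toy model): servers register a port, clients look it up."""

    def __init__(self):
        self._table = {}

    def register(self, service, port):
        """A server announces (or changes) the port it currently uses."""
        self._table[service] = port

    def lookup(self, service):
        """A client asks which port to use; None means the service is unknown."""
        return self._table.get(service)

mapper = PortMapper()
mapper.register("whatever", 6000)
first = mapper.lookup("whatever")
mapper.register("whatever", 7000)     # the service moves; clients just ask again
second = mapper.lookup("whatever")
```

Only the mapper's own port has to be fixed in advance, which is what makes it easy to change the ports of the other services.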
As just mentioned, a port is purely an abstraction. Exactly how it is implemented
differs from system to system, or more precisely, from OS to OS. For example, the
socket API described in Chapter 1 is an implementation of ports. Typically, a port is
implemented by a message queue, as illustrated in Figure 5.2. When a message arrives,
the protocol (e.g., UDP) appends the message to the end of the queue. Should the
queue be full, the message is discarded. There is no flow-control mechanism that tells
the sender to slow down. When an application process wants to receive a message,
one is removed from the front of the queue. If the queue is empty, the process blocks
until a message becomes available.
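The queue behavior described above can be sketched in a few lines of Python. This is only a model under stated assumptions: the class name `Port`, the `capacity` parameter, and the method names are hypothetical, and a real operating system implements the queue inside the kernel rather than with a condition variable.

```python
import collections
import threading

class Port:
    """A port modeled as a bounded FIFO message queue."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._queue = collections.deque()
        self._cond = threading.Condition()

    def deliver(self, message):
        """Called by the protocol on arrival; drops the message if the queue is full."""
        with self._cond:
            if len(self._queue) >= self._capacity:
                return False          # discarded; the sender is never told
            self._queue.append(message)
            self._cond.notify()       # wake one blocked receiver, if any
            return True

    def receive(self):
        """Called by the application; blocks while the queue is empty."""
        with self._cond:
            while not self._queue:
                self._cond.wait()
            return self._queue.popleft()

port = Port(capacity=2)
results = [port.deliver(b"a"), port.deliver(b"b"), port.deliver(b"c")]
```

The third delivery is dropped silently, which is exactly the absence of flow control noted in the text.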
Finally, although UDP does not implement flow control or reliable/ordered delivery, it does a little more work than to simply demultiplex messages to some application
process—it also ensures the correctness of the message by the use of a checksum. (The
UDP checksum is optional in the current Internet, but it will become mandatory with
IPv6.) UDP computes its checksum over the UDP header, the contents of the message
body, and something called the pseudoheader. The pseudoheader consists of three fields
from the IP header—protocol number, source IP address, and destination IP address—
plus the UDP length field. (Yes, the UDP length field is included twice in the checksum
calculation.) UDP uses the same checksum algorithm as IP, as defined in Section 2.4.2.
The motivation behind having the pseudoheader is to verify that this message has been
delivered between the correct two endpoints. For example, if the destination IP address
was modified while the packet was in transit, causing the packet to be misdelivered,
this fact would be detected by the UDP checksum.
Figure 5.2 UDP message queue.
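The checksum calculation can be sketched as follows, assuming IPv4 addresses and the standard Internet checksum (the 16-bit ones-complement of the ones-complement sum). The function names are hypothetical, and the pseudoheader layout (source address, destination address, a zero byte, the protocol number 17, and the UDP length) follows RFC 768.

```python
import socket
import struct

def internet_checksum(data):
    """16-bit ones-complement of the ones-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole number of words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                        # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def pseudoheader(src_ip, dst_ip, udp_length, protocol=17):
    """Fields borrowed from the IP header, plus the UDP length."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, protocol, udp_length))

def udp_segment(src_ip, dst_ip, src_port, dst_port, payload):
    """Build header plus payload with the checksum field filled in."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    csum = internet_checksum(pseudoheader(src_ip, dst_ip, length) + header + payload)
    csum = csum or 0xFFFF                     # RFC 768: a computed zero is sent as all ones
    return struct.pack("!HHHH", src_port, dst_port, length, csum) + payload

def verify(src_ip, dst_ip, segment):
    """Recompute over pseudoheader and segment; a valid segment yields zero."""
    return internet_checksum(pseudoheader(src_ip, dst_ip, len(segment)) + segment) == 0

segment = udp_segment("10.0.0.1", "10.0.0.2", 40000, 53, b"hello")
```

Verifying the same bytes against a different destination address fails, which is the misdelivery case described above.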
5.2 Reliable Byte Stream (TCP)
In contrast to a simple demultiplexing protocol like UDP, a more sophisticated transport protocol is one that offers a reliable, connection-oriented, byte-stream service.
Such a service has proven useful to a wide assortment of applications because it frees
the application from having to worry about missing or reordered data. The Internet’s
Transmission Control Protocol (TCP) is probably the most widely used protocol of
this type; it is also the most carefully tuned. It is for these two reasons that this section
studies TCP in detail, although we identify and discuss alternative design choices at
the end of the section.
In terms of the properties of transport protocols given in the problem statement
at the start of this chapter, TCP guarantees the reliable, in-order delivery of a stream
of bytes. It is a full-duplex protocol, meaning that each TCP connection supports a
pair of byte streams, one flowing in each direction. It also includes a flow-control
mechanism for each of these byte streams that allows the receiver to limit how much
data the sender can transmit at a given time. Finally, like UDP, TCP supports a demultiplexing mechanism that allows multiple application programs on any given host
to simultaneously carry on a conversation with their peers. In addition to the above
features, TCP also implements a highly tuned congestion-control mechanism. The idea
of this mechanism is to throttle how fast TCP sends data, not for the sake of keeping
the sender from overrunning the receiver, but to keep the sender from overloading
the network. A description of TCP’s congestion-control mechanism is postponed until
Chapter 6, where we discuss it in the larger context of how network resources are
fairly allocated.
Since many people confuse congestion control and flow control, we restate the
difference. Flow control involves preventing senders from overrunning the capacity of
receivers. Congestion control involves preventing too much data from being injected
into the network, thereby causing switches or links to become overloaded. Thus, flow
control is an end-to-end issue, while congestion control is concerned with how hosts
and networks interact.
5.2.1 End-to-End Issues
At the heart of TCP is the sliding window algorithm. Even though this is the same basic
algorithm we saw in Section 2.5.2, because TCP runs over the Internet rather than a
point-to-point link, there are many important differences. This subsection identifies
these differences and explains how they complicate TCP. The following subsections
then describe how TCP addresses these and other complications.
First, whereas the sliding window algorithm presented in Section 2.5.2 runs over a
single physical link that always connects the same two computers, TCP supports logical
connections between processes that are running on any two computers in the Internet.
This means that TCP needs an explicit connection establishment phase during which
the two sides of the connection agree to exchange data with each other. This difference
is analogous to having to dial up the other party, rather than having a dedicated phone
line. TCP also has an explicit connection teardown phase. One of the things that
happens during connection establishment is that the two parties establish some shared
state to enable the sliding window algorithm to begin. Connection teardown is needed
so each host knows it is OK to free this state.
Second, whereas a single physical link that always connects the same two computers has a fixed RTT, TCP connections are likely to have widely different round-trip
times. For example, a TCP connection between a host in San Francisco and a host
in Boston, which are separated by several thousand kilometers, might have an RTT
of 100 ms, while a TCP connection between two hosts in the same room, only a few
meters apart, might have an RTT of only 1 ms. The same TCP protocol must be able
to support both of these connections. To make matters worse, the TCP connection
between hosts in San Francisco and Boston might have an RTT of 100 ms at 3 a.m.,
but an RTT of 500 ms at 3 p.m. Variations in the RTT are even possible during a
single TCP connection that lasts only a few minutes. What this means to the sliding
window algorithm is that the timeout mechanism that triggers retransmissions must be
adaptive. (Certainly, the timeout for a point-to-point link must be a settable parameter,
but it is not necessary to adapt this timer for a particular pair of nodes.)
A third difference is that packets may be reordered as they cross the Internet,
but this is not possible on a point-to-point link where the first packet put into one
end of the link must be the first to appear at the other end. Packets that are slightly
out of order do not cause a problem since the sliding window algorithm can reorder
packets correctly using the sequence number. The real issue is how far out-of-order
packets can get, or said another way, how late a packet can arrive at the destination.
In the worst case, a packet can be delayed in the Internet until IP’s time to live (TTL)
field expires, at which time the packet is discarded (and hence there is no danger of
it arriving late). Knowing that IP throws packets away after their TTL expires, TCP
assumes that each packet has a maximum lifetime. The exact lifetime, known as the
maximum segment lifetime (MSL), is an engineering choice. The current recommended
setting is 120 seconds. Keep in mind that IP does not directly enforce this 120-second
value; it is simply a conservative estimate that TCP makes of how long a packet might
live in the Internet. The implication is significant—TCP has to be prepared for very old
packets to suddenly show up at the receiver, potentially confusing the sliding window algorithm.
Fourth, the computers connected to a point-to-point link are generally engineered
to support the link. For example, if a link’s delay × bandwidth product is computed
to be 8 KB—meaning that a window size is selected to allow up to 8 KB of data to be
unacknowledged at a given time—then it is likely that the computers at either end of
the link have the ability to buffer up to 8 KB of data. Designing the system otherwise
would be silly. On the other hand, almost any kind of computer can be connected to the
Internet, making the amount of resources dedicated to any one TCP connection highly
variable, especially considering that any one host can potentially support hundreds of
TCP connections at the same time. This means that TCP must include a mechanism
that each side uses to “learn” what resources (e.g., how much buffer space) the other
side is able to apply to the connection. This is the flow-control issue.
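A short calculation shows why the amount of buffer space matters. The sketch below assumes that "delay" means the round-trip time and that at most one window of data can be delivered per round trip. The 8 KB buffer comes from the example above, while the 100-Mbps and 100-ms figures are illustrative assumptions.

```python
def delay_bandwidth_product(bandwidth_bps, rtt_seconds):
    """Bytes that can be unacknowledged (in flight) during one round trip."""
    return bandwidth_bps * rtt_seconds / 8

def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on throughput when one window is delivered per round trip."""
    return window_bytes * 8 / rtt_seconds

# Illustrative path: 100 Mbps with a 100-ms RTT needs roughly 1.25 MB in flight.
pipe_bytes = delay_bandwidth_product(100e6, 0.100)

# A receiver that can buffer only 8 KB caps the same path well below 1 Mbps.
limited_bps = max_throughput_bps(8 * 1024, 0.100)
```

The gap between the two numbers is what the sender has to learn about, since it cannot assume the receiver was engineered to match the path.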
Fifth, because the transmitting side of a directly connected link cannot send any
faster than the bandwidth of the link allows, and only one host is pumping data into
the link, it is not possible to unknowingly congest the link. Said another way, the load
on the link is visible in the form of a queue of packets at the sender. In contrast, the
sending side of a TCP connection has no idea what links will be traversed to reach
the destination. For example, the sending machine might be directly connected to a
relatively fast Ethernet—and so, capable of sending data at a rate of 100 Mbps—but
somewhere out in the middle of the network, a 1.5-Mbps T1 link must be traversed.
And to make matters worse, data being generated by many different sources might be
trying to traverse this same slow link. This leads to the problem of network congestion.
Discussion of this topic is delayed until Chapter 6.
We conclude this discussion of end-to-end issues by comparing TCP’s approach to
providing a reliable/ordered delivery service with the approach used by X.25 networks.
In TCP, the underlying IP network is assumed to be unreliable and to deliver messages
out of order; TCP uses the sliding window algorithm on an end-to-end basis to provide
reliable/ordered delivery. In contrast, X.25 networks use the sliding window protocol
within the network, on a hop-by-hop basis. The assumption behind this approach is
that if messages are delivered reliably and in order between each pair of nodes along
the path between the source host and the destination host, then the end-to-end service
also guarantees reliable/ordered delivery.
The problem with this latter approach is that a sequence of hop-by-hop guarantees does not necessarily add up to an end-to-end guarantee. First, if a heterogeneous
link (say, an Ethernet) is added to one end of the path, then there is no guarantee
that this hop will preserve the same service as the other hops. Second, just because
the sliding window protocol guarantees that messages are delivered correctly from
node A to node B, and then from node B to node C, it does not guarantee that node B
behaves perfectly. For example, network nodes have been known to introduce errors
into messages while transferring them from an input buffer to an output buffer. They
have also been known to accidentally reorder messages. As a consequence of these
small windows of vulnerability, it is still necessary to provide true end-to-end checks
to guarantee reliable/ordered service, even though the lower levels of the system also
implement that functionality.
This discussion serves to illustrate one of the most important principles in system
design—the end-to-end argument. In a nutshell, the end-to-end argument says that a
function (in our example, providing reliable/ordered delivery) should not be provided
in the lower levels of the system unless it can be completely and correctly implemented
at that level. Therefore, this rule argues in favor of the TCP/IP approach. This rule is
not absolute, however. It does allow for functions to be incompletely provided at a
low level as a performance optimization. This is why it is perfectly consistent with the
end-to-end argument to perform error detection (e.g., CRC) on a hop-by-hop basis;
detecting and retransmitting a single corrupt packet across one hop is preferable to
having to retransmit an entire file end-to-end.
Figure 5.3 How TCP manages a byte stream.
5.2.2 Segment Format
TCP is a byte-oriented protocol, which means that the sender writes bytes into a TCP
connection and the receiver reads bytes out of the TCP connection. Although “byte
stream” describes the service TCP offers to application processes, TCP does not, itself,
transmit individual bytes over the Internet. Instead, TCP on the source host buffers
enough bytes from the sending process to fill a reasonably sized packet and then sends
this packet to its peer on the destination host. TCP on the destination host then empties
the contents of the packet into a receive buffer, and the receiving process reads from
this buffer at its leisure. This situation is illustrated in Figure 5.3, which, for simplicity,
shows data flowing in only one direction. Remember that, in general, a single TCP
connection supports byte streams flowing in both directions.
The packets exchanged between TCP peers in Figure 5.3 are called segments,
since each one carries a segment of the byte stream. Each TCP segment contains the
header schematically depicted in Figure 5.4. The relevance of most of these fields will
become apparent throughout this section. For now, we simply introduce them.
The SrcPort and DstPort fields identify the source and destination ports, respectively, just as in UDP. These two fields, plus the source and destination IP addresses,
combine to uniquely identify each TCP connection. That is, TCP’s demux key is given
by the 4-tuple
⟨SrcPort, SrcIPAddr, DstPort, DstIPAddr⟩
Note that because TCP connections come and go, it is possible for a connection between a particular pair of ports to be established, used to send and receive data, and
closed, and then at a later time for the same pair of ports to be involved in a second
connection. We sometimes refer to this situation as two different incarnations of the
same connection.
Figure 5.4 TCP header format.
Figure 5.5 Simplified illustration (showing only one direction) of the TCP process,
with data flow in one direction and ACKs in the other.
The Acknowledgment, SequenceNum, and AdvertisedWindow fields are all involved in TCP’s sliding window algorithm. Because TCP is a byte-oriented protocol,
each byte of data has a sequence number; the SequenceNum field contains the sequence
number for the first byte of data carried in that segment. The Acknowledgment and
AdvertisedWindow fields carry information about the flow of data going in the other
direction. To simplify our discussion, we ignore the fact that data can flow in both
directions, and we concentrate on data that has a particular SequenceNum flowing
in one direction and Acknowledgment and AdvertisedWindow values flowing in the
opposite direction, as illustrated in Figure 5.5. The use of these three fields is described
more fully in Section 5.2.4.
The 6-bit Flags field is used to relay control information between TCP peers. The
possible flags include SYN, FIN, RESET, PUSH, URG, and ACK. The SYN and FIN flags
are used when establishing and terminating a TCP connection, respectively. Their use
is described in Section 5.2.3. The ACK flag is set any time the Acknowledgment field is
valid, implying that the receiver should pay attention to it. The URG flag signifies that
this segment contains urgent data. When this flag is set, the UrgPtr field indicates where
the nonurgent data contained in this segment begins. The urgent data is contained at
the front of the segment body, up to and including a value of UrgPtr bytes into the
segment. The PUSH flag signifies that the sender invoked the push operation, which
indicates to the receiving side of TCP that it should notify the receiving process of
this fact. We discuss these last two features more in Section 5.2.7. Finally, the RESET
flag signifies that the receiver has become confused—for example, because it received
a segment it did not expect to receive—and so wants to abort the connection.
The Checksum field is used in exactly the same way as for UDP: it is
computed over the TCP header, the TCP data, and the pseudoheader, which is made
up of the source address, destination address, and length fields from the IP header. The
checksum is required for TCP in both IPv4 and IPv6. Also, since the TCP header is of
variable length (options can be attached after the mandatory fields), a HdrLen field is
included that gives the length of the header in 32-bit words. This field is also known
as the Offset field, since it measures the offset from the start of the packet to the start
of the data.
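The fields introduced in this section can be made concrete with a small Python sketch that packs and parses the fixed 20-byte part of the header. The bit positions follow the standard TCP layout (HdrLen in the top 4 bits of the 16-bit word that also carries the 6 flag bits). The function names and the dictionary representation are hypothetical, options are not handled, and the checksum is left at zero.

```python
import struct

# SrcPort, DstPort, SequenceNum, Acknowledgment, HdrLen+Flags,
# AdvertisedWindow, Checksum, UrgPtr
TCP_HEADER = struct.Struct("!HHIIHHHH")

FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RESET": 0x04,
             "PUSH": 0x08, "ACK": 0x10, "URG": 0x20}

def pack_header(src_port, dst_port, seq, ack, flags, window,
                checksum=0, urg_ptr=0):
    """Build a 20-byte header with no options (HdrLen = 5 words)."""
    bits = 0
    for name in flags:
        bits |= FLAG_BITS[name]
    hdr_len_and_flags = (5 << 12) | bits      # HdrLen occupies the top 4 bits
    return TCP_HEADER.pack(src_port, dst_port, seq, ack,
                           hdr_len_and_flags, window, checksum, urg_ptr)

def parse_segment(segment):
    """Return the header fields as a dict, plus the data after the header."""
    (src_port, dst_port, seq, ack, hdr_len_and_flags,
     window, checksum, urg_ptr) = TCP_HEADER.unpack_from(segment)
    hdr_len = hdr_len_and_flags >> 12         # measured in 32-bit words
    flags = {n for n, bit in FLAG_BITS.items() if hdr_len_and_flags & bit}
    return {"SrcPort": src_port, "DstPort": dst_port, "SequenceNum": seq,
            "Acknowledgment": ack, "HdrLen": hdr_len, "Flags": flags,
            "AdvertisedWindow": window, "Checksum": checksum,
            "UrgPtr": urg_ptr, "Data": segment[hdr_len * 4:]}

def demux_key(src_ip, dst_ip, fields):
    """The 4-tuple that identifies the connection."""
    return (fields["SrcPort"], src_ip, fields["DstPort"], dst_ip)

fields = parse_segment(pack_header(40000, 80, 1000, 2000, {"ACK", "PUSH"}, 8192) + b"hi")
key = demux_key("10.0.0.1", "10.0.0.2", fields)
```

Multiplying HdrLen by 4 gives the byte offset of the data, which is why the field is also called the Offset field.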
5.2.3 Connection Establishment and Termination
A TCP connection begins with a client (caller) doing an active open to a server (callee).
Assuming that the server had earlier done a passive open, the two sides engage in
an exchange of messages to establish the connection. (Recall from Chapter 1 that a
party wanting to initiate a connection performs an active open, while a party willing to accept a connection does a passive open.) Only after this connection establishment phase is over do the two sides begin sending data. Likewise, as soon as
a participant is done sending data, it closes one direction of the connection, which
causes TCP to initiate a round of connection termination messages. Notice that while
connection setup is an asymmetric activity (one side does a passive open and the
other side does an active open), connection teardown is symmetric (each side has to
close the connection independently). (To be more precise, connection setup can be
symmetric, with both sides trying to open the connection at the same time, but the
common case is for one side to do an active open and the other side to do a passive
open.) Therefore, it is possible for one side to have
done a close, meaning that it can no longer send data, but for the other side to
keep the other half of the bidirectional connection open and to continue sending
data.
Figure 5.6 Timeline for three-way handshake algorithm.
Three-Way Handshake
The algorithm used by TCP to establish and terminate a connection is called a three-way handshake. We first describe the basic algorithm and then show how it is used by
TCP. The three-way handshake involves the exchange of three messages between the
client and the server, as illustrated by the timeline given in Figure 5.6.
The idea is that two parties want to agree on a set of parameters, which, in the
case of opening a TCP connection, are the starting sequence numbers the two sides plan
to use for their respective byte streams. In general, the parameters might be any facts
that each side wants the other to know about. First, the client (the active participant)
sends a segment to the server (the passive participant) stating the initial sequence
number it plans to use (Flags = SYN, SequenceNum = x). The server then responds
with a single segment that both acknowledges the client’s sequence number (Flags =
ACK, Ack = x + 1) and states its own beginning sequence number (Flags = SYN,
SequenceNum = y). That is, both the SYN and ACK bits are set in the Flags field of this
second message. Finally, the client responds with a third segment that acknowledges
the server’s sequence number (Flags = ACK, Ack = y + 1). The reason that each
side acknowledges a sequence number that is one larger than the one sent is that
the Acknowledgment field actually identifies the “next sequence number expected,”
thereby implicitly acknowledging all earlier sequence numbers. Although not shown
in this timeline, a timer is scheduled for each of the first two segments, and if the
expected response is not received, the segment is retransmitted.
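The three messages can be traced with a minimal Python sketch. It models only the Flags, SequenceNum, and Acknowledgment values named above. The `Segment` class and function names are hypothetical, sequence numbers are assumed to be 32 bits wide, the starting values are drawn at random for illustration, and timers and retransmission are left out.

```python
import random
from dataclasses import dataclass

SEQ_SPACE = 2 ** 32                   # sequence numbers wrap at 32 bits

@dataclass
class Segment:
    flags: frozenset
    seq: int = 0
    ack: int = 0

def client_syn(x):
    """Message 1: the active participant states its starting sequence number."""
    return Segment(frozenset({"SYN"}), seq=x)

def server_syn_ack(syn, y):
    """Message 2: acknowledge x + 1 and state the server's own starting number."""
    if "SYN" not in syn.flags:
        raise ValueError("expected a SYN segment")
    return Segment(frozenset({"SYN", "ACK"}), seq=y, ack=(syn.seq + 1) % SEQ_SPACE)

def client_ack(syn_ack, x):
    """Message 3: check the server's acknowledgment, then acknowledge y + 1."""
    if "ACK" not in syn_ack.flags or syn_ack.ack != (x + 1) % SEQ_SPACE:
        raise ValueError("unexpected acknowledgment")
    return Segment(frozenset({"ACK"}), ack=(syn_ack.seq + 1) % SEQ_SPACE)

x, y = random.randrange(SEQ_SPACE), random.randrange(SEQ_SPACE)
first = client_syn(x)
second = server_syn_ack(first, y)
third = client_ack(second, x)
```

Each acknowledgment carries the next sequence number expected, so both sides end the exchange knowing where the other's byte stream begins.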
You may be asking yourself why the client and server have to exchange starting
sequence numbers with each other at connection setup time. It would be simpler if
each side simply started at some “well-known” sequence number, such as 0. In fact,
the TCP specification requires that each side of a connection select an initial starting
sequence number at random. The reason for this is to protect against two incarnations
of the same connection reusing the same sequence numbers too soon, that is, while
there is still a chance that a segment from an earlier incarnation of a connection might
interfere with a later incarnation of the connection.
State Transition Diagram
TCP is complex enough that its specification includes a state transition diagram. A
copy of this diagram is given in Figure 5.7. This diagram shows only the states involved in opening a connection (everything above ESTABLISHED) and in closing a
connection (everything below ESTABLISHED). Everything that goes on while a connection is open—that is, the operation of the sliding window algorithm—is hidden in
the ESTABLISHED state.
Figure 5.7 TCP state transition diagram.
TCP’s state transition diagram is fairly easy to understand. Each circle denotes
a state that one end of a TCP connection can find itself in. All connections start in the
CLOSED state. As the connection progresses, the connection moves from state to state
according to the arcs. Each arc is labelled with a tag of the form event/action. Thus, if
a connection is in the LISTEN state and a SYN segment arrives (i.e., a segment with
the SYN flag set), the connection makes a transition to the SYN RCVD state and takes
the action of replying with an ACK + SYN segment.
Notice that two kinds of events trigger a state transition: (1) a segment arrives
from the peer (e.g., the event on the arc from LISTEN to SYN RCVD), or (2) the local
application process invokes an operation on TCP (e.g., the active open event on the arc
from CLOSED to SYN SENT). In other words, TCP’s state transition diagram effectively
defines the semantics of both its peer-to-peer interface and its service interface, as
defined in Section 1.3.1. The syntax of these two interfaces is given by the segment
format (as illustrated in Figure 5.4) and by some application programming interface
(an example of which is given in Section 1.4.1), respectively.
Now let’s trace the typical transitions taken through the diagram in Figure 5.7.
Keep in mind that at each end of the connection, TCP makes different transitions
from state to state. When opening a connection, the server first invokes a passive open
operation on TCP, which causes TCP to move to the LISTEN state. At some later time,
the client does an active open, which causes its end of the connection to send a SYN
segment to the server and to move to the SYN SENT state. When the SYN segment
arrives at the server, it moves to the SYN RCVD state and responds with a SYN+ACK
segment. The arrival of this segment causes the client to move to the ESTABLISHED
state and to send an ACK back to the server. When this ACK arrives, the server finally
moves to the ESTABLISHED state. In other words, we have just traced the three-way handshake.
There are three things to notice about the connection establishment half of the
state transition diagram. First, if the client’s ACK to the server is lost, corresponding to
the third leg of the three-way handshake, then the connection still functions correctly.
This is because the client side is already in the ESTABLISHED state, so the local
application process can start sending data to the other end. Each of these data segments
will have the ACK flag set, and the correct value in the Acknowledgment field, so the
server will move to the ESTABLISHED state when the first data segment arrives.
This is actually an important point about TCP—every segment reports what sequence
number the sender is expecting to see next, even if this repeats the same sequence
number contained in one or more previous segments.
The second thing to notice about the state transition diagram is that there is a
funny transition out of the LISTEN state whenever the local process invokes a send
operation on TCP. That is, it is possible for a passive participant to identify both ends
of the connection (i.e., itself and the remote participant that it is willing to have connect
to it), and then to change its mind about waiting for the other side and instead actively
establish the connection. To the best of our knowledge, this is a feature of TCP that
no application process actually takes advantage of.
The final thing to notice about the diagram is the arcs that are not shown. Specifically, most of the states that involve sending a segment to the other side also schedule
a timeout that eventually causes the segment to be resent if the expected response does
not happen. These retransmissions are not depicted in the state transition diagram. If
after several tries the expected response does not arrive, TCP gives up and returns to
the CLOSED state.
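The establishment arcs discussed above can be sketched as a small transition function. This is only an illustration: the type, event, and function names are hypothetical, just the arcs mentioned in the text are modeled, and the retransmission timeouts are left out, as they are in the diagram.

```c
/* Hypothetical names; only the connection-establishment arcs are modeled. */
typedef enum { CLOSED, LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED } tcp_state;
typedef enum { ACTIVE_OPEN, PASSIVE_OPEN, SEND,
               RECV_SYN, RECV_SYN_ACK, RECV_ACK } tcp_event;

/* Return the next state; unmodeled (state, event) pairs leave it unchanged. */
tcp_state establish_step(tcp_state s, tcp_event e)
{
    switch (s) {
    case CLOSED:
        if (e == ACTIVE_OPEN)  return SYN_SENT;    /* client sends SYN */
        if (e == PASSIVE_OPEN) return LISTEN;
        break;
    case LISTEN:
        if (e == RECV_SYN) return SYN_RCVD;        /* server sends SYN+ACK */
        if (e == SEND)     return SYN_SENT;        /* passive side turns active */
        break;
    case SYN_SENT:
        if (e == RECV_SYN_ACK) return ESTABLISHED; /* client sends ACK */
        break;
    case SYN_RCVD:
        if (e == RECV_ACK) return ESTABLISHED;     /* ACK may ride on a data segment */
        break;
    case ESTABLISHED:
        break;
    }
    return s;
}
```

Because the server only needs a segment carrying a valid acknowledgment to leave SYN RCVD, the same RECV_ACK event covers both the third leg of the handshake and the first data segment that arrives when that leg is lost.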
Turning our attention now to the process of terminating a connection, the important thing to keep in mind is that the application process on both sides of the
connection must independently close its half of the connection. If only one side closes
the connection, then this means it has no more data to send, but it is still available
to receive data from the other side. This complicates the state transition diagram because it must account for the possibility that the two sides invoke the close operator
at the same time, as well as the possibility that first one side invokes close and then,
at some later time, the other side invokes close. Thus, on any one side there are three
combinations of transitions that get a connection from the ESTABLISHED state to the
CLOSED state:
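■ This side closes first: ESTABLISHED → FIN WAIT 1 → FIN WAIT 2 → TIME WAIT → CLOSED.
■ The other side closes first: ESTABLISHED → CLOSE WAIT → LAST ACK → CLOSED.
■ Both sides close at the same time: ESTABLISHED → FIN WAIT 1 → CLOSING → TIME WAIT → CLOSED.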
There is actually a fourth, although rare, sequence of transitions that leads to the
CLOSED state; it follows the arc from FIN WAIT 1 to TIME WAIT. We leave it as an
exercise for you to figure out what combination of circumstances leads to this fourth possibility.
The main thing to recognize about connection teardown is that a connection in
the TIME WAIT state cannot move to the CLOSED state until it has waited for two
times the maximum amount of time an IP datagram might live in the Internet (i.e.,
120 seconds). The reason for this is that while the local side of the connection has
sent an ACK in response to the other side’s FIN segment, it does not know that the
ACK was successfully delivered. As a consequence, the other side might retransmit its
FIN segment, and this second FIN segment might be delayed in the network. If the
connection were allowed to move directly to the CLOSED state, then another pair of
application processes might come along and open the same connection (i.e., use the
same pair of port numbers), and the delayed FIN segment from the earlier incarnation
of the connection would immediately initiate the termination of the later incarnation
of that connection.
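As a rough sketch, the three common teardown sequences can be written as a transition function. The names below are hypothetical, the rare arc from FIN WAIT 1 directly to TIME WAIT is deliberately omitted, and MSL_SECONDS is simply half of the 120-second wait quoted above.

```c
/* Hypothetical names; models only the three common teardown sequences. */
typedef enum { ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT,
               CLOSE_WAIT, LAST_ACK, CLOSED } td_state;
typedef enum { APP_CLOSE, RECV_FIN, RECV_ACK, TIMEOUT_2MSL } td_event;

enum { MSL_SECONDS = 60, TIME_WAIT_SECONDS = 2 * MSL_SECONDS };

/* Return the next state; unmodeled (state, event) pairs leave it unchanged. */
td_state teardown_step(td_state s, td_event e)
{
    switch (s) {
    case ESTABLISHED:
        if (e == APP_CLOSE) return FIN_WAIT_1;   /* send FIN */
        if (e == RECV_FIN)  return CLOSE_WAIT;   /* send ACK */
        break;
    case FIN_WAIT_1:
        if (e == RECV_ACK) return FIN_WAIT_2;    /* our FIN was acknowledged */
        if (e == RECV_FIN) return CLOSING;       /* simultaneous close; send ACK */
        break;
    case FIN_WAIT_2:
        if (e == RECV_FIN) return TIME_WAIT;     /* send ACK */
        break;
    case CLOSING:
        if (e == RECV_ACK) return TIME_WAIT;
        break;
    case TIME_WAIT:
        if (e == TIMEOUT_2MSL) return CLOSED;    /* after TIME_WAIT_SECONDS */
        break;
    case CLOSE_WAIT:
        if (e == APP_CLOSE) return LAST_ACK;     /* send FIN */
        break;
    case LAST_ACK:
        if (e == RECV_ACK) return CLOSED;
        break;
    case CLOSED:
        break;
    }
    return s;
}
```

Note that only the side that passes through TIME WAIT pays the 2 × MSL delay; the side that closes second goes from LAST ACK straight to CLOSED.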
Sliding Window Revisited
We are now ready to discuss TCP’s variant of the sliding window algorithm, which
serves several purposes: (1) it guarantees the reliable delivery of data, (2) it ensures
that data is delivered in order, and (3) it enforces flow control between the sender
and the receiver. TCP’s use of the sliding window algorithm is the same as we saw in
Section 2.5.2 in the case of the first two of these three functions. Where TCP differs
from the earlier algorithm is that it folds the flow-control function in as well. In
particular, rather than having a fixed-size sliding window, the receiver advertises a
window size to the sender. This is done using the AdvertisedWindow field in the TCP
header. The sender is then limited to having no more than a value of AdvertisedWindow
bytes of unacknowledged data at any given time. The receiver selects a suitable value
for AdvertisedWindow based on the amount of memory allocated to the connection
for the purpose of buffering data. The idea is to keep the sender from overrunning the
receiver’s buffer. We discuss this at greater length below.
Reliable and Ordered Delivery
To see how the sending and receiving sides of TCP interact with each other to implement reliable and ordered delivery, consider the situation illustrated in Figure 5.8.
TCP on the sending side maintains a send buffer. This buffer is used to store data
that has been sent but not yet acknowledged, as well as data that has been written by
the sending application, but not transmitted. On the receiving side, TCP maintains a
receive buffer. This buffer holds data that arrives out of order, as well as data that is
in the correct order (i.e., there are no missing bytes earlier in the stream) but that the
application process has not yet had the chance to read.
Figure 5.8 Relationship between TCP send buffer (a) and receive buffer (b).
To make the following discussion simpler to follow, we initially ignore the fact
that both the buffers and the sequence numbers are of some finite size and hence will
eventually wrap around. Also, we do not distinguish between a pointer into a buffer
where a particular byte of data is stored and the sequence number for that byte.
Looking first at the sending side, three pointers are maintained into the send buffer, each with an obvious meaning: LastByteAcked, LastByteSent, and LastByteWritten. Clearly,
LastByteAcked ≤ LastByteSent
since the receiver cannot have acknowledged a byte that has not yet been sent, and
LastByteSent ≤ LastByteWritten
since TCP cannot send a byte that the application process has not yet written. Also
note that none of the bytes to the left of LastByteAcked need to be saved in the buffer
because they have already been acknowledged, and none of the bytes to the right of
LastByteWritten need to be buffered because they have not yet been generated.
A similar set of pointers (sequence numbers) are maintained on the receiving side:
LastByteRead, NextByteExpected, and LastByteRcvd. The inequalities are a little less intuitive, however, because of the problem of out-of-order delivery. The first relationship
LastByteRead < NextByteExpected
is true because a byte cannot be read by the application until it is received and all preceding bytes have also been received. NextByteExpected points to the byte immediately
after the latest byte to meet this criterion. Second,
NextByteExpected ≤ LastByteRcvd + 1
since, if data has arrived in order, NextByteExpected points to the byte after LastByteRcvd, whereas if data has arrived out of order, NextByteExpected points to the start of
the first gap in the data, as in Figure 5.8. Note that bytes to the left of LastByteRead
need not be buffered because they have already been read by the local application
process, and bytes to the right of LastByteRcvd need not be buffered because they have
not yet arrived.
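The pointer relationships above can be checked mechanically. The sketch below is one way to write them down; the struct and function names are hypothetical, and 64-bit counters are used so that, as in the discussion, wraparound can be ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical containers for the three pointers on each side. */
typedef struct {
    uint64_t last_byte_acked, last_byte_sent, last_byte_written;
} send_ptrs;

typedef struct {
    uint64_t last_byte_read, next_byte_expected, last_byte_rcvd;
} recv_ptrs;

/* LastByteAcked <= LastByteSent <= LastByteWritten */
bool send_ptrs_valid(const send_ptrs *p)
{
    return p->last_byte_acked <= p->last_byte_sent &&
           p->last_byte_sent  <= p->last_byte_written;
}

/* LastByteRead < NextByteExpected <= LastByteRcvd + 1 */
bool recv_ptrs_valid(const recv_ptrs *p)
{
    return p->last_byte_read < p->next_byte_expected &&
           p->next_byte_expected <= p->last_byte_rcvd + 1;
}
```

With in-order arrival, next_byte_expected equals last_byte_rcvd + 1; with a gap in the data it is strictly smaller.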
Flow Control
Most of the above discussion is similar to that found in Section 2.5.2; the only real
difference is that this time we elaborated on the fact that the sending and receiving application processes are filling and emptying their local buffer, respectively. (The earlier
discussion glossed over the fact that data arriving from an upstream node was filling
the send buffer, and data being transmitted to a downstream node was emptying the
receive buffer.)
You should make sure you understand this much before proceeding because
now comes the point where the two algorithms differ more significantly. In what
follows, we reintroduce the fact that both buffers are of some finite size, denoted
MaxSendBuffer and MaxRcvBuffer, although we don’t worry about the details of how
they are implemented. In other words, we are only interested in the number of bytes
being buffered, not in where those bytes are actually stored.
Recall that in a sliding window protocol, the size of the window sets the amount
of data that can be sent without waiting for acknowledgment from the receiver. Thus,
the receiver throttles the sender by advertising a window that is no larger than the
amount of data that it can buffer. Observe that TCP on the receive side must keep
LastByteRcvd − LastByteRead ≤ MaxRcvBuffer
to avoid overflowing its buffer. It therefore advertises a window size of
AdvertisedWindow = MaxRcvBuffer − (( NextByteExpected − 1) − LastByteRead)
which represents the amount of free space remaining in its buffer. As data arrives,
the receiver acknowledges it as long as all the preceding bytes have also arrived. In
addition, LastByteRcvd moves to the right (is incremented), meaning that the advertised
window potentially shrinks. Whether or not it shrinks depends on how fast the local
application process is consuming data. If the local process is reading data just as fast as
it arrives (causing LastByteRead to be incremented at the same rate as LastByteRcvd),
then the advertised window stays open (i.e., AdvertisedWindow = MaxRcvBuffer).
If, however, the receiving process falls behind, perhaps because it performs a very
expensive operation on each byte of data that it reads, then the advertised window
grows smaller with every segment that arrives, until it eventually goes to 0.
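A minimal sketch of the receiver's calculation follows. The function name and the choice of integer widths are assumptions; the arithmetic is the AdvertisedWindow formula given above, clamped at 0.

```c
#include <stdint.h>

/* AdvertisedWindow = MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead) */
uint32_t advertised_window(uint32_t max_rcv_buffer,
                           uint64_t next_byte_expected,
                           uint64_t last_byte_read)
{
    /* In-order bytes that have arrived but have not yet been read. */
    uint64_t buffered = (next_byte_expected - 1) - last_byte_read;

    if (buffered >= max_rcv_buffer)
        return 0;                               /* buffer is full */
    return (uint32_t)(max_rcv_buffer - buffered);
}
```

For example, with a 4096-byte buffer, an application that has read everything that arrived sees the full 4096 bytes advertised, while one that is 2000 bytes behind sees 2096.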
TCP on the send side must then adhere to the advertised window it gets from
the receiver. This means that at any given time, it must ensure that
LastByteSent − LastByteAcked ≤ AdvertisedWindow
Said another way, the sender computes an effective window that limits how much data
it can send:
EffectiveWindow = AdvertisedWindow − ( LastByteSent − LastByteAcked)
Clearly, EffectiveWindow must be greater than 0 before the source can send more data.
It is possible, therefore, that a segment arrives acknowledging x bytes, thereby allowing
the sender to increment LastByteAcked by x, but because the receiving process was not
reading any data, the advertised window is now x bytes smaller than the time before.
In such a situation, the sender would be able to free buffer space, but not to send any
more data.
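The situation just described is easy to reproduce with a small helper. This is a sketch with hypothetical names; it computes EffectiveWindow exactly as defined above and clamps the result at 0.

```c
#include <stdint.h>

/* EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked) */
uint32_t effective_window(uint32_t advertised_window,
                          uint64_t last_byte_sent,
                          uint64_t last_byte_acked)
{
    uint64_t in_flight = last_byte_sent - last_byte_acked;

    if (in_flight >= advertised_window)
        return 0;                       /* no room to send anything new */
    return (uint32_t)(advertised_window - in_flight);
}
```

Suppose 1000 bytes are in flight against an advertised window of 1000, so the effective window is 0. If an ACK for x = 400 bytes arrives but the new advertised window is 600, then 600 bytes remain in flight against a window of 600, and the effective window is still 0.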
All the while this is going on, the send side must also make sure that the local
application process does not overflow the send buffer, that is, that
LastByteWritten − LastByteAcked ≤ MaxSendBuffer
If the sending process tries to write y bytes to TCP, but
( LastByteWritten − LastByteAcked) + y > MaxSendBuffer
then TCP blocks the sending process and does not allow it to generate more data.
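The send-side check can be written as a one-line predicate. The function name is hypothetical; a real implementation would put the calling process to sleep rather than just return a flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when writing y more bytes would overflow the send buffer, that is,
 * when (LastByteWritten - LastByteAcked) + y > MaxSendBuffer. */
bool write_would_block(uint64_t last_byte_written,
                       uint64_t last_byte_acked,
                       uint64_t y,
                       uint64_t max_send_buffer)
{
    return (last_byte_written - last_byte_acked) + y > max_send_buffer;
}
```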
It is now possible to understand how a slow receiving process ultimately stops
a fast sending process. First, the receive buffer fills up, which means the advertised
window shrinks to 0. An advertised window of 0 means that the sending side cannot
transmit any data, even though data it has previously sent has been successfully acknowledged. Finally, not being able to transmit any data means that the send buffer
fills up, which ultimately causes TCP to block the sending process. As soon as the
receiving process starts to read data again, the receive-side TCP is able to open its window back up, which allows the send-side TCP to transmit data out of its buffer. When
this data is eventually acknowledged, LastByteAcked is incremented, the buffer space
holding this acknowledged data becomes free, and the sending process is unblocked
and allowed to proceed.
There is only one remaining detail that must be resolved—how does the sending
side know that the advertised window is no longer 0? As mentioned above, TCP always
sends a segment in response to a received data segment, and this response contains the
latest values for the Acknowledgment and AdvertisedWindow fields, even if these values
have not changed since the last time they were sent. The problem is this. Once the
receive side has advertised a window size of 0, the sender is not permitted to send
any more data, which means it has no way to discover that the advertised window
is no longer 0 at some time in the future. TCP on the receive side does not spontaneously send nondata segments; it only sends them in response to an arriving data segment.
TCP deals with this situation as follows. Whenever the other side advertises a
window size of 0, the sending side persists in sending a segment with 1 byte of data
every so often. It knows that this data will probably not be accepted, but it tries
anyway, because each of these 1-byte segments triggers a response that contains the
current advertised window. Eventually, one of these 1-byte probes triggers a response
that reports a nonzero advertised window.
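One way to picture the probing behavior is as a function that decides how many bytes the next segment may carry. This is a simplified sketch: the names are hypothetical, the persist timer is reduced to a flag supplied by the caller, and segment-size limits are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes the sender may place in its next segment. */
size_t next_send_len(uint32_t advertised_window, size_t in_flight,
                     size_t pending, int persist_timer_fired)
{
    if (pending == 0)
        return 0;                             /* nothing to send or probe with */
    if (advertised_window == 0)
        return persist_timer_fired ? 1 : 0;   /* 1-byte window probe */
    if (in_flight >= advertised_window)
        return 0;                             /* effective window is 0 */

    size_t room = advertised_window - in_flight;
    return pending < room ? pending : room;
}
```

The probe byte is expected to be rejected while the window is still closed; its purpose is only to provoke a response carrying the current AdvertisedWindow.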
Note that the reason the sending side periodically sends this probe segment is
that TCP is designed to make the receive side as simple as possible—it simply responds
to segments from the sender, and it never initiates any activity on its own. This is
an example of a well-recognized (although not universally applied) protocol design
rule, which, for lack of a better name, we call the smart sender/dumb receiver rule.
Recall that we saw another example of this rule when we discussed the use of NAKs
in Section 2.5.2.
Protecting against Wraparound
This subsection and the next consider the size of the SequenceNum and AdvertisedWindow fields and the implications of their sizes on TCP’s correctness and performance.
TCP’s SequenceNum field is 32 bits long, and its AdvertisedWindow field is 16 bits
long, meaning that TCP has easily satisfied the requirement of the sliding window algorithm that the sequence number space be twice as big as the window size: 2^32 ≫ 2 × 2^16.
However, this requirement is not the interesting thing about these two fields. Consider
each field in turn.
The relevance of the 32-bit sequence number space is that the sequence number
used on a given connection might wrap around—a byte with sequence number x could
be sent at one time, and then at a later time a second byte with the same sequence
number x might be sent. Once again, we assume that packets cannot survive in the
Internet for longer than the recommended MSL. Thus, we currently need to make
sure that the sequence number does not wrap around within a 120-second period of
time. Whether or not this happens depends on how fast data can be transmitted over
the Internet, that is, how fast the 32-bit sequence number space can be consumed.
(This discussion assumes that we are trying to consume the sequence number space as
fast as possible, but of course we will be if we are doing our job of keeping the pipe
full.) Table 5.1 shows how long it takes for the sequence number to wrap around on
networks with various bandwidths.
As you can see, the 32-bit sequence number space is adequate for today’s networks, but given that OC-48 links currently exist in the Internet backbone, it won’t
be long until individual TCP connections want to run at 622-Mbps speeds or higher.
Fortunately, the IETF has already worked out an extension to TCP that effectively
extends the sequence number space to protect against the sequence number wrapping
around. This and related extensions are described in Section 5.2.8.
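The wraparound times in Table 5.1 follow from dividing the size of the sequence space (2^32 bytes) by the link rate. The sketch below uses a hypothetical function name and assumes the link is kept completely full.

```c
/* Seconds until a 32-bit byte sequence number wraps on a link that is
 * kept full at bandwidth_bps bits per second. */
double seconds_until_wrap(double bandwidth_bps)
{
    const double sequence_space_bytes = 4294967296.0;   /* 2^32 */
    return sequence_space_bytes * 8.0 / bandwidth_bps;
}
```

For a T1 link this gives about 22,900 seconds (roughly 6.4 hours), and for STS-12 about 55 seconds, which is already below the 120-second bound discussed above.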
Keeping the Pipe Full
The relevance of the 16-bit AdvertisedWindow field is that it must be big enough
to allow the sender to keep the pipe full. Clearly, the receiver is free not to open
the window as large as the AdvertisedWindow field allows; we are interested in the
situation in which the receiver has enough buffer space to handle as much data as the
largest possible AdvertisedWindow allows.
Bandwidth             Time until Wraparound
T1 (1.5 Mbps)         6.4 hours
Ethernet (10 Mbps)    57 minutes
T3 (45 Mbps)          13 minutes
FDDI (100 Mbps)       6 minutes
STS-3 (155 Mbps)      4 minutes
STS-12 (622 Mbps)     55 seconds
STS-24 (1.2 Gbps)     28 seconds
Table 5.1 Time until 32-bit sequence number space wraps around.
Bandwidth             Delay × Bandwidth Product
T1 (1.5 Mbps)         18 KB
Ethernet (10 Mbps)    122 KB
T3 (45 Mbps)          549 KB
FDDI (100 Mbps)       1.2 MB
STS-3 (155 Mbps)      1.8 MB
STS-12 (622 Mbps)     7.4 MB
STS-24 (1.2 Gbps)     14.8 MB
Table 5.2 Required window size for 100-ms RTT.
In this case, it is not just the network bandwidth but the delay × bandwidth
product that dictates how big the AdvertisedWindow field needs to be—the window
needs to be opened far enough to allow a full delay × bandwidth product’s worth of
data to be transmitted. Assuming an RTT of 100 ms (a typical number for a cross-country connection in the U.S.), Table 5.2 gives the delay × bandwidth product for
several network technologies.
As you can see, TCP’s AdvertisedWindow field is in even worse shape than its
SequenceNum field—it is not big enough to handle even a T3 connection across the
continental United States, since a 16-bit field allows us to advertise a window of only
64 KB. The very same TCP extension mentioned above (see Section 5.2.8) provides a
mechanism for effectively increasing the size of the advertised window.
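The entries in Table 5.2 can be reproduced with a short calculation. The function names below are hypothetical; the 65,535-byte limit is simply the largest value a 16-bit field can hold.

```c
/* Bytes needed to keep the pipe full: bandwidth x RTT, converted to bytes. */
double delay_bandwidth_bytes(double bandwidth_bps, double rtt_seconds)
{
    return bandwidth_bps * rtt_seconds / 8.0;
}

/* Nonzero when a 16-bit AdvertisedWindow can cover that many bytes. */
int fits_in_16bit_window(double bytes)
{
    return bytes <= 65535.0;
}
```

With a 100-ms RTT, a T1 link needs 18,750 bytes (about 18 KB), which fits, while a T3 link needs 562,500 bytes (about 549 KB), which does not.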
Triggering Transmission
We next consider a surprisingly subtle issue: how TCP decides to transmit a segment. As
described earlier, TCP supports a byte-stream abstraction, that is, application programs
write bytes into the stream, and it is up to TCP to decide that it has enough bytes to
send a segment. What factors govern this decision?
If we ignore the possibility of flow control—that is, we assume the window is
wide open, as would be the case when a connection first starts—then TCP has three
mechanisms to trigger the transmission of a segment. First, TCP maintains a variable,
typically called the maximum segment size (MSS), and it sends a segment as soon as it
has collected MSS bytes from the sending process. MSS is usually set to the size of the
largest segment TCP can send without causing the local IP to fragment. That is, MSS
is set to the MTU of the directly connected network, minus the size of the TCP and IP
headers. The second thing that triggers TCP to transmit a segment is that the sending
process has explicitly asked it to do so. Specifically, TCP supports a push operation,
and the sending process invokes this operation to effectively flush the buffer of unsent
bytes. The final trigger for transmitting a segment is that a timer fires; the resulting
segment contains as many bytes as are currently buffered for transmission. However,
as we will soon see, this “timer” isn’t exactly what you expect.
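As a rough sketch, the MSS calculation and the three triggers can be written as follows. The names are hypothetical, and the 20-byte header sizes assume IP and TCP headers that carry no options.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: IPv4 and TCP headers without options, 20 bytes each. */
enum { IP_HEADER_BYTES = 20, TCP_HEADER_BYTES = 20 };

/* MSS = MTU of the directly connected network minus the two headers. */
int mss_for_mtu(int mtu_bytes)
{
    return mtu_bytes - IP_HEADER_BYTES - TCP_HEADER_BYTES;
}

/* Ignoring flow control: send when a full MSS has been collected, when
 * the application pushes, or when the timer fires. */
bool should_transmit(size_t buffered, size_t mss,
                     bool push_requested, bool timer_fired)
{
    if (buffered == 0)
        return false;                   /* nothing to send */
    return buffered >= mss || push_requested || timer_fired;
}
```

On an Ethernet with a 1500-byte MTU, this gives an MSS of 1460 bytes.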
Silly Window Syndrome
Of course, we can’t just ignore flow control, which plays an obvious role in throttling
the sender. If the sender has MSS bytes of data to send and the window is open at least
that much, then the sender transmits a full segment. Suppose, however, that the sender
is accumulating bytes to send, but the window is currently closed. Now suppose an
ACK arrives that effectively opens the window enough for the sender to transmit, say,
MSS/2 bytes. Should the sender transmit a half-full segment or wait for the window
to open to a full MSS? The original specification was silent on this point, and early
implementations of TCP decided to go ahead and transmit a half-full segment. After
all, there is no telling how long it will be before the window opens further.
It turns out that the strategy of aggressively taking advantage of any available
window leads to a situation now known as the silly window syndrome. Figure 5.9
helps visualize what happens. If you think of a TCP stream as a conveyer belt with
“full” containers (data segments) going in one direction and empty containers (ACKs)
going in the reverse direction, then MSS-sized segments correspond to large containers
and 1-byte segments correspond to very small containers. If the sender aggressively fills
an empty container as soon as it arrives, then any small container introduced into the
system remains in the system indefinitely. That is, it is immediately filled and emptied
at each end, and never coalesced with adjacent containers to create larger containers.
This scenario was discovered when early implementations of TCP regularly found
themselves filling the network with tiny segments.
Figure 5.9 Silly window syndrome.
Note that the silly window syndrome is only a problem when either the sender
transmits a small segment or the receiver opens the window a small amount. If neither
of these happens, then the small container is never introduced into the stream. It’s
not possible to outlaw sending small segments; for example, the application might
do a push after sending a single byte. It is possible, however, to keep the receiver
from introducing a small container (i.e., a small open window). The rule is that after
advertising a zero window, the receiver must wait for space equal to an MSS before it
advertises an open window.
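The receiver-side rule can be sketched with one bit of state that remembers whether the last advertisement was zero. The names are hypothetical, and this models only the rule stated above, not everything a production TCP does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state for the receiver's advertisements. */
typedef struct {
    bool zero_advertised;   /* true once a zero window has been advertised */
} rcv_adv_state;

/* Decide what window to advertise given the current free buffer space. */
uint32_t window_to_advertise(rcv_adv_state *st, uint32_t free_space,
                             uint32_t mss)
{
    if (free_space == 0) {
        st->zero_advertised = true;
        return 0;
    }
    if (st->zero_advertised && free_space < mss)
        return 0;           /* keep the window closed until an MSS is free */

    st->zero_advertised = false;
    return free_space;
}
```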
Since we can’t eliminate the possibility of a small container being introduced into
the stream, we also need mechanisms to coalesce them. The receiver can do this by
delaying ACKs—sending one combined ACK rather than multiple smaller ones—but
this is only a partial solution because the receiver has no way of knowing how long it is
safe to delay waiting either for another segment to arrive or for the application to read
more data (thus opening the window). The ultimate solution falls to the sender, which
brings us back to our original issue: When does the TCP sender decide to transmit a segment?
Nagle’s Algorithm
Returning to the TCP sender, if there is data to send but the window is open less than
MSS, then we may want to wait some amount of time before sending the available
data, but the question is, how long? If we wait too long, then we hurt interactive
applications like Telnet. If we don’t wait long enough, then we risk sending a bunch
of tiny packets and falling into the silly window syndrome. The answer is to introduce
a timer and to transmit when the timer expires.
While we could use a clock-based timer—for example, one that fires every 100
ms—Nagle introduced an elegant self-clocking solution. The idea is that as long as TCP
has any data in flight, the sender will eventually receive an ACK. This ACK can be
treated like a timer firing, triggering the transmission of more data. Nagle’s algorithm
provides a simple, unified rule for deciding when to transmit:
When the application produces data to send
    if both the available data and the window ≥ MSS
        send a full segment
    else
        if there is unACKed data in flight
            buffer the new data until an ACK arrives
        else
            send all the new data now
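The same rule can be expressed as a small decision function in C. The enum and function names are hypothetical; the function only reports which branch of the rule applies and does not send anything.

```c
#include <stddef.h>

typedef enum {
    SEND_FULL_SEGMENT,      /* available data and window both >= MSS */
    BUFFER_UNTIL_ACK,       /* small data, something still in flight */
    SEND_ALL_NOW            /* small data, nothing in flight */
} nagle_action;

/* Decide what to do when the application produces data to send. */
nagle_action nagle_decide(size_t available, size_t window, size_t mss,
                          size_t unacked_in_flight)
{
    if (available >= mss && window >= mss)
        return SEND_FULL_SEGMENT;
    if (unacked_in_flight > 0)
        return BUFFER_UNTIL_ACK;
    return SEND_ALL_NOW;
}
```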
In other words, it’s always OK to send a full segment if the window allows.
It’s also OK to immediately send a small amount of data if there are currently no
segments in transit, but if there is anything in flight, the sender must wait for an ACK
before transmitting the next segment. Thus, an interactive application like Telnet that
continually writes one byte at a time will send data at a rate of one segment per RTT.
Some segments will contain a single byte, while others will contain as many bytes as
the user was able to type in one round-trip time. Because some applications cannot
afford such a delay for each write they do to a TCP connection, the socket interface
allows applications to turn off Nagle’s algorithm by setting the TCP_NODELAY option.
Setting this option means that data is transmitted as soon as possible.
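On systems with a POSIX-style socket interface, setting the option typically looks like the sketch below. The wrapper name disable_nagle is hypothetical, and the header locations are those of common Unix systems; other platforms may differ.

```c
#include <netinet/in.h>     /* IPPROTO_TCP */
#include <netinet/tcp.h>    /* TCP_NODELAY */
#include <sys/socket.h>     /* setsockopt */

/* Turn off Nagle's algorithm on a TCP socket.
 * Returns 0 on success and -1 on error, with errno set by setsockopt. */
int disable_nagle(int sockfd)
{
    int on = 1;
    return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
}
```

An application that makes many small writes and disables Nagle's algorithm should expect more small segments on the wire, which is the trade-off the algorithm was designed to avoid.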
Adaptive Retransmission
Because TCP guarantees the reliable delivery of data, it retransmits each segment if an
ACK is not received in a certain period of time. TCP sets this timeout as a function of
the RTT it expects between the two ends of the connection. Unfortunately, given the
range of possible RTTs between any pair of hosts in the Internet, as well as the variation in RTT between the same two hosts over time, choosing an appropriate timeout
value is not that easy. To address this problem, TCP uses an adaptive retransmission
mechanism. We now describe this mechanism and how it has evolved over time as the
Internet community has gained more experience using TCP.
Original Algorithm
We begin with a simple algorithm for computing a timeout value between a pair of
hosts. This is the algorithm that was originally described in the TCP specification—
and the following description presents it in those terms—but it could be used by any
end-to-end protocol.
The idea is to keep a running average of the RTT and then to compute the timeout
as a function of this RTT. Specifically, every time TCP sends a data segment, it records
the time. When an ACK for that segment arrives, TCP reads the time again and then
takes the difference between these two times as a SampleRTT. TCP then computes
an EstimatedRTT as a weighted average between the previous estimate and this new
sample. That is,
EstimatedRTT = α × EstimatedRTT + ( 1 − α) × SampleRTT
The parameter α is selected to smooth the EstimatedRTT. A small α tracks changes in
the RTT but is perhaps too heavily influenced by temporary fluctuations. On the other
hand, a large α is more stable but perhaps not quick enough to adapt to real changes.
The original TCP specification recommended a setting of α between 0.8 and 0.9. TCP
then uses EstimatedRTT to compute the timeout in a rather conservative way:
TimeOut = 2 × EstimatedRTT
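A floating-point sketch of this computation is shown below. The struct and function names are hypothetical, and times can be in any consistent unit (milliseconds in the example that follows).

```c
/* Hypothetical state for the original timeout computation. */
typedef struct {
    double estimated_rtt;
    double alpha;           /* smoothing factor, e.g., 0.8 to 0.9 */
} rtt_estimator;

/* Fold a new SampleRTT into EstimatedRTT and return the new TimeOut. */
double update_timeout(rtt_estimator *e, double sample_rtt)
{
    e->estimated_rtt = e->alpha * e->estimated_rtt
                     + (1.0 - e->alpha) * sample_rtt;
    return 2.0 * e->estimated_rtt;
}
```

With α = 0.875, an EstimatedRTT of 100 ms, and a SampleRTT of 200 ms, the new estimate is 0.875 × 100 + 0.125 × 200 = 112.5 ms, and the timeout is 225 ms.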
Karn/Partridge Algorithm
After several years of use on the Internet, a rather obvious flaw was discovered in
this simple algorithm. The problem was that an ACK does not really acknowledge a
transmission; it actually acknowledges the receipt of data. In other words, whenever
a segment is retransmitted and then an ACK arrives at the sender, it is impossible to
determine if this ACK should be associated with the first or the second transmission
of the segment for the purpose of measuring the sample RTT. It is necessary to know
which transmission to associate it with so as to compute an accurate SampleRTT. As
illustrated in Figure 5.10, if you assume that the ACK is for the original transmission
but it was really for the second, then the SampleRTT is too large (a), while if you
assume that the ACK is for the second transmission but it was actually for the first,
then the SampleRTT is too small (b).
Figure 5.10 Associating the ACK with (a) original transmission versus (b) retransmission.
The solution is surprisingly simple. Whenever TCP retransmits a segment, it
stops taking samples of the RTT; it only measures SampleRTT for segments that have
been sent only once. This solution is known as the Karn/Partridge algorithm, after its
inventors. Their proposed fix also includes a second small change to TCP’s timeout
mechanism. Each time TCP retransmits, it sets the next timeout to be twice the last
timeout, rather than basing it on the last EstimatedRTT. That is, Karn and Partridge
proposed that TCP use exponential backoff, similar to what the Ethernet does. The
motivation for using exponential backoff is simple: Congestion is the most likely cause
of lost segments, meaning that the TCP source should not react too aggressively to a
timeout. In fact, the more times the connection times out, the more cautious the source
should become. We will see this idea again, embodied in a much more sophisticated
mechanism, in Chapter 6.
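Both changes fit into a few lines. In the sketch below the names are hypothetical, and the caller is assumed to know whether the acknowledged segment was ever retransmitted.

```c
#include <stdbool.h>

/* Hypothetical state: the original estimator plus the current timeout. */
typedef struct {
    double estimated_rtt;
    double alpha;
    double timeout;
} kp_state;

/* An ACK arrived. Samples from retransmitted segments are ambiguous,
 * so they are ignored. */
void kp_on_ack(kp_state *s, double sample_rtt, bool was_retransmitted)
{
    if (was_retransmitted)
        return;
    s->estimated_rtt = s->alpha * s->estimated_rtt
                     + (1.0 - s->alpha) * sample_rtt;
    s->timeout = 2.0 * s->estimated_rtt;
}

/* A retransmission timer expired: back off exponentially. */
void kp_on_timeout(kp_state *s)
{
    s->timeout *= 2.0;
}
```

The doubled timeout stays in effect until an ACK for a segment that was sent only once supplies a fresh, unambiguous sample.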
Jacobson/Karels Algorithm
The Karn/Partridge algorithm was introduced at a time when the Internet was suffering
from high levels of network congestion. Their approach was designed to fix some of
the causes of that congestion, and although it was an improvement, the congestion was
not eliminated. A couple of years later, two other researchers—Jacobson and Karels—
proposed a more drastic change to TCP to battle congestion. The bulk of that proposed
change is described in Chapter 6. Here, we focus on the aspect of that proposal that
is related to deciding when to time out and retransmit a segment.
As an aside, it should be clear how the timeout mechanism is related to
congestion—if you time out too soon, you may unnecessarily retransmit a segment,
which only adds to the load on the network. As we will see in Chapter 6, the other
reason for needing an accurate timeout value is that a timeout is taken to imply congestion, which triggers a congestion-control mechanism. Finally, note that there is nothing
about the Jacobson/Karels timeout computation that is specific to TCP. It could be used
by any end-to-end protocol.
The main problem with the original computation is that it does not take the
variance of the sample RTTs into account. Intuitively, if the variation among samples
is small, then the EstimatedRTT can be better trusted and there is no reason for multiplying this estimate by 2 to compute the timeout. On the other hand, a large variance
in the samples suggests that the timeout value should not be too tightly coupled to the EstimatedRTT.
In the new approach, the sender measures a new SampleRTT as before. It then
folds this new sample into the timeout calculation as follows:
Difference = SampleRTT − EstimatedRTT
EstimatedRTT = EstimatedRTT + (δ × Difference)
Deviation = Deviation + δ(|Difference| − Deviation)
where δ is a fraction between 0 and 1. That is, we calculate both the mean RTT and
the variation in that mean.
TCP then computes the timeout value as a function of both EstimatedRTT and
Deviation as follows:
TimeOut = μ × EstimatedRTT + φ × Deviation
where based on experience, μ is typically set to 1 and φ is set to 4. Thus, when
the variance is small, TimeOut is close to EstimatedRTT; a large variance causes the
Deviation term to dominate the calculation.
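Written directly from the equations, a floating-point version might look like this. The struct and function names are hypothetical, and μ = 1 and φ = 4 are hard-coded as in the text.

```c
/* Hypothetical state for the Jacobson/Karels computation. */
typedef struct {
    double estimated_rtt;
    double deviation;
} jk_state;

/* Fold in a new SampleRTT and return TimeOut = EstimatedRTT + 4 x Deviation. */
double jk_update(jk_state *s, double sample_rtt, double delta)
{
    double difference = sample_rtt - s->estimated_rtt;
    double abs_difference = difference < 0.0 ? -difference : difference;

    s->estimated_rtt += delta * difference;
    s->deviation += delta * (abs_difference - s->deviation);
    return 1.0 * s->estimated_rtt + 4.0 * s->deviation;
}
```

For example, with δ = 1/8, EstimatedRTT = 100 ms, Deviation = 10 ms, and SampleRTT = 180 ms, we get Difference = 80, EstimatedRTT = 110, Deviation = 18.75, and TimeOut = 110 + 4 × 18.75 = 185 ms.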
There are two items of note regarding the implementation of timeouts in TCP. The
first is that it is possible to implement the calculation for EstimatedRTT and Deviation
without using floating-point arithmetic. Instead, the whole calculation is scaled by
2^n, with δ selected to be 1/2^n. This allows us to do integer arithmetic, implementing
multiplication and division using shifts, thereby achieving higher performance. The
resulting calculation is given by the following code fragment, where n = 3 (i.e., δ =
1/8). Note that EstimatedRTT and Deviation are stored in their scaled-up forms, while
the value of SampleRTT at the start of the code and of TimeOut at the end are real,
unscaled values. If you find the code hard to follow, you might want to try plugging
some real numbers into it and verifying that it gives the same results as the equations above.
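{
    SampleRTT -= (EstimatedRTT >> 3);   /* Difference = SampleRTT - EstimatedRTT */
    EstimatedRTT += SampleRTT;          /* scaled EstimatedRTT += Difference */
    if (SampleRTT < 0)
        SampleRTT = -SampleRTT;         /* |Difference| */
    SampleRTT -= (Deviation >> 3);      /* |Difference| - Deviation */
    Deviation += SampleRTT;             /* scaled Deviation update */
    TimeOut = (EstimatedRTT >> 3) + (Deviation >> 1);   /* EstimatedRTT + 4 x Deviation */
}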
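To plug in real numbers, it helps to wrap the fragment in a function. The wrapper below is a sketch with hypothetical names; the body is the same sequence of operations, with EstimatedRTT and Deviation stored scaled by 8.

```c
/* Hypothetical container; both fields are stored scaled up by 8. */
typedef struct {
    int estimated_rtt;
    int deviation;
} jk_scaled;

/* sample_rtt is an unscaled measurement; the return value is the
 * unscaled TimeOut. */
int jk_update_scaled(jk_scaled *s, int sample_rtt)
{
    sample_rtt -= (s->estimated_rtt >> 3);      /* Difference */
    s->estimated_rtt += sample_rtt;
    if (sample_rtt < 0)
        sample_rtt = -sample_rtt;               /* |Difference| */
    sample_rtt -= (s->deviation >> 3);
    s->deviation += sample_rtt;
    return (s->estimated_rtt >> 3) + (s->deviation >> 1);
}
```

Start with a real EstimatedRTT of 100 ms (stored as 800), a real Deviation of 10 ms (stored as 80), and a SampleRTT of 180 ms. The code leaves 880 and 150 in the two fields and returns 185. The equations with δ = 1/8 agree: EstimatedRTT = 100 + 80/8 = 110 = 880/8, Deviation = 10 + (80 − 10)/8 = 18.75 = 150/8, and TimeOut = 110 + 4 × 18.75 = 185.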
The second point of note is that the Jacobson/Karels algorithm is only as good
as the clock used to read the current time. On a typical Unix implementation, the
clock granularity is as large as 500 ms, which is significantly larger than the average
cross-country RTT of somewhere between 100 and 200 ms. To make matters worse,
the Unix implementation of TCP only checks to see if a timeout should happen every
time this 500-ms clock ticks, and it only takes a sample of the round-trip time once per
RTT. The combination of these two factors quite often means that a timeout happens
1 second after the segment was transmitted. Once again, the extensions to TCP include
a mechanism that makes this RTT calculation a bit more precise.
Record Boundaries
Since TCP is a byte-stream protocol, the number of bytes written by the sender is
not necessarily the same as the number of bytes read by the receiver. For example,
the application might write 8 bytes, then 2 bytes, then 20 bytes to a TCP connection,
while on the receiving side, the application reads 5 bytes at a time inside a loop that
iterates 6 times. TCP does not interject record boundaries between the 8th and 9th
bytes, nor between the 10th and 11th bytes. This is in contrast to a message-oriented
protocol, such as UDP, in which the message that is sent is exactly the same length as
the message that is received.
Even though TCP is a byte-stream protocol, it has two different features that
can be used by the sender to insert record boundaries into this byte stream, thereby
informing the receiver how to break the stream of bytes into records. (Being able to
mark record boundaries is useful, for example, in many database applications.) Both
of these features were originally included in TCP for completely different reasons; they
have only come to be used for this purpose over time.
The first mechanism is the urgent data feature, as implemented by the URG flag
and the UrgPtr field in the TCP header. Originally, the urgent data mechanism was
designed to allow the sending application to send out-of-band data to its peer. By “out
of band” we mean data that is separate from the normal flow of data (e.g., a command
to interrupt an operation already under way). This out-of-band data was identified in
the segment using the UrgPtr field and was to be delivered to the receiving process as
soon as it arrived, even if that meant delivering it before data with an earlier sequence
number. Over time, however, this feature has seen little use for its original purpose, so instead of signifying
“urgent” data, it has come to be used to signify “special” data, such as a record marker.
This use has developed because, as with the push operation, TCP on the receiving side
must inform the application that “urgent data” has arrived. That is, the urgent data
in itself is not important. It is the fact that the sending process can effectively send a
signal to the receiver that is important.
The second mechanism for inserting end-of-record markers into a byte stream is the push
operation. Originally, this mechanism was designed to allow the sending process to
tell TCP that it should send (flush) whatever bytes it had collected to its peer. The push
operation can be used to implement record boundaries because the specification says
that TCP must send whatever data it has buffered at the source when the application
says push, and optionally, TCP at the destination notifies the application whenever
an incoming segment has the PUSH flag set. If the receiving side supports this option
(the socket interface does not), then the push operation can be used to break the TCP
stream into records.
Of course, the application program is always free to insert record boundaries
without any assistance from TCP. For example, it can send a field that indicates the
length of a record that is to follow, or it can insert its own record boundary markers
into the data stream.
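As a minimal sketch of the first approach (sending a length field ahead of each record), the following assumes a 4-byte big-endian length prefix. The prefix size and the helper names `frame` and `unframe` are illustrative choices, not something TCP or the text prescribes.

```python
import struct

def frame(record: bytes) -> bytes:
    """Prefix a record with its length (assumed: 4-byte big-endian integer)."""
    return struct.pack("!I", len(record)) + record

def unframe(stream: bytes) -> list:
    """Split a received byte stream back into the records framed into it."""
    records = []
    offset = 0
    while offset + 4 <= len(stream):
        (length,) = struct.unpack_from("!I", stream, offset)
        offset += 4
        if offset + length > len(stream):
            break  # incomplete record; a real receiver would wait for more bytes
        records.append(stream[offset:offset + length])
        offset += length
    return records

# The receiver recovers the record boundaries no matter how TCP
# happened to segment the bytes in transit.
wire = frame(b"first record") + frame(b"") + frame(b"third")
print(unframe(wire))
```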
TCP Extensions
We have mentioned at three different points in this section that there are now extensions
to TCP that help to mitigate some problem that TCP is facing as the underlying
network gets faster. These extensions are designed to have as small an impact on TCP
as possible. In particular, they are realized as options that can be added to the TCP
header. (We glossed over this point earlier, but the reason that the TCP header has a
HdrLen field is that the header can be of variable length; the variable part of the TCP
header contains the options that have been added.) The significance of adding these
extensions as options rather than changing the core of the TCP header is that hosts
can still communicate using TCP even if they do not implement the options. Hosts that
do implement the optional extensions, however, can take advantage of them. The two
sides agree that they will use the options during TCP’s connection establishment phase.
The first extension helps to improve TCP’s timeout mechanism. Instead of measuring the RTT using a coarse-grained event, TCP can read the actual system clock
when it is about to send a segment, and put this time—think of it as a 32-bit timestamp—in the segment’s header. The receiver then echoes this timestamp back to the
sender in its acknowledgment, and the sender subtracts this timestamp from the current time to measure the RTT. In essence, the timestamp option provides a convenient
place for TCP to “store” the record of when a segment was transmitted; it stores
the time in the segment itself. Note that the endpoints in the connection do not need
synchronized clocks, since the timestamp is written and read at the same end of the
connection.
The second extension addresses the problem of TCP’s 32-bit SequenceNum field
wrapping around too soon on a high-speed network. Rather than define a new 64-bit
sequence number field, TCP uses the 32-bit timestamp just described to effectively
extend the sequence number space. In other words, TCP decides whether to accept or
reject a segment based on a 64-bit identifier that has the SequenceNum field in the
low-order 32 bits and the timestamp in the high-order 32 bits. Since the timestamp
is always increasing, it serves to distinguish between two different incarnations of the
same sequence number. Note that the timestamp is being used in this setting only to
protect against wraparound; it is not treated as part of the sequence number for the
purpose of ordering or acknowledging data.
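The 64-bit identifier described above can be sketched as follows. This follows the text's description only; the function name `extended_id` and the sample values are hypothetical, and a real TCP implementation may organize the check differently.

```python
def extended_id(timestamp: int, sequence_num: int) -> int:
    """Combine a 32-bit timestamp (high-order bits) with a 32-bit
    SequenceNum (low-order bits) into one 64-bit identifier."""
    return ((timestamp & 0xFFFFFFFF) << 32) | (sequence_num & 0xFFFFFFFF)

# Two incarnations of the same sequence number, sent at different times.
old_segment = extended_id(timestamp=1000, sequence_num=5000)
new_segment = extended_id(timestamp=2000, sequence_num=5000)

# The sequence numbers are identical, but the increasing timestamp
# makes the newer incarnation distinguishable from the older one.
print(new_segment > old_segment)
```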
The third extension allows TCP to advertise a larger window, thereby allowing it to fill larger delay × bandwidth pipes that are made possible by high-speed
networks. This extension involves an option that defines a scaling factor for the advertised window. That is, rather than interpreting the number that appears in the
AdvertisedWindow field as indicating how many bytes the sender is allowed to have
unacknowledged, this option allows the two sides of TCP to agree that the AdvertisedWindow field counts larger chunks (e.g., how many 16-byte units of data the sender can
have unacknowledged). In other words, the window scaling option specifies how many
bits each side should left-shift the AdvertisedWindow field before using its contents to
compute an effective window.
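A small sketch of the left-shift just described. The function name is hypothetical; the 16-bit width of the AdvertisedWindow field is the standard TCP header size, and a shift of 4 corresponds to the 16-byte units used as the example above.

```python
def effective_window(advertised_window: int, shift: int) -> int:
    """Left-shift the 16-bit AdvertisedWindow value by the agreed scale."""
    return (advertised_window & 0xFFFF) << shift

# Without scaling, the largest window is 65,535 bytes.
print(effective_window(0xFFFF, 0))   # 65535

# With a shift of 4, the field counts 16-byte units.
print(effective_window(0xFFFF, 4))   # 1048560
```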
Alternative Design Choices
Although TCP has proven to be a robust protocol that satisfies the needs of a wide
range of applications, the design space for transport protocols is quite large. TCP is,
by no means, the only valid point in that design space. We conclude our discussion of
TCP by considering alternative design choices. While we offer an explanation for why
TCP’s designers made the choices they did, we leave it to you to decide if there might
be a place for alternative transport protocols.
First, we have suggested from the very first chapter of this book that there are at
least two interesting classes of transport protocols: stream-oriented protocols like TCP
and request/reply protocols like RPC. In other words, we have implicitly divided the
design space in half and placed TCP squarely in the stream-oriented half of the world.
We could further divide the stream-oriented protocols into two groups—reliable and
unreliable—with the former containing TCP and the latter being more suitable for
interactive video applications that would rather drop a frame than incur the delay
associated with a retransmission.
This exercise in building a transport protocol taxonomy is interesting and could
be continued in greater and greater detail, but the world isn’t as black and white as we
might like. Consider the suitability of TCP as a transport protocol for request/reply
applications, for example. TCP is a full-duplex protocol, so it would be easy to open
a TCP connection between the client and server, send the request message in one
direction, and send the reply message in the other direction. There are two complications, however. The first is that TCP is a byte-oriented protocol rather than a
message-oriented protocol, and request/reply applications always deal with messages.
(We explore the issue of bytes versus messages in greater detail in a moment.) The
second complication is that in those situations where both the request message and
the reply message fit in a single network packet, a well-designed request/reply protocol
needs only two packets to implement the exchange, whereas TCP would need at least
nine: three to establish the connection, two for the message exchange, and four to tear
down the connection. Of course, if the request or reply messages are large enough to
require multiple network packets (e.g., it might take 100 packets to send a 100,000-byte reply message), then the overhead of setting up and tearing down the connection
is inconsequential. In other words, it isn’t always the case that a particular protocol
cannot support a certain functionality; it’s sometimes the case that one design is more
efficient than another under particular circumstances.
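The packet counts in this comparison can be worked through directly. The sketch below uses only the figures given in the text (three packets for setup, four for teardown). It deliberately ignores acknowledgments for the data packets, which is a simplifying assumption, and the function names are hypothetical.

```python
SETUP_PACKETS = 3      # connection establishment, per the text
TEARDOWN_PACKETS = 4   # connection teardown, per the text

def tcp_packets(request_packets: int, reply_packets: int) -> int:
    """Total packets for one request/reply exchange over a fresh connection."""
    return SETUP_PACKETS + request_packets + reply_packets + TEARDOWN_PACKETS

def overhead_fraction(request_packets: int, reply_packets: int) -> float:
    """Fraction of all packets spent on setup and teardown."""
    total = tcp_packets(request_packets, reply_packets)
    return (SETUP_PACKETS + TEARDOWN_PACKETS) / total

# Single-packet request and reply: 9 packets, most of them overhead.
print(tcp_packets(1, 1), round(overhead_fraction(1, 1), 2))

# Single-packet request, 100-packet reply: the overhead becomes minor.
print(tcp_packets(1, 100), round(overhead_fraction(1, 100), 3))
```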
Second, as just suggested, you might question why TCP chose to provide a reliable
byte-stream service rather than a reliable message-stream service; messages would be
the natural choice for a database application that wants to exchange records. There are
two answers to this question. The first is that a message-oriented protocol must, by definition, establish an upper bound on message sizes. After all, an infinitely long message
is a byte stream. For any message size that a protocol selects, there will be applications
that want to send larger messages, rendering the transport protocol useless and forcing
the application to implement its own transport-like services. The second reason is that
while message-oriented protocols are definitely more appropriate for applications that
want to send records to each other, you can easily insert record boundaries into a byte
stream to implement this functionality, as described in Section 5.2.7.
Third, TCP chose to implement explicit setup/teardown phases, but this is not
required. In the case of connection setup, it would certainly be possible to send all
necessary connection parameters along with the first data message. TCP elected to
take a more conservative approach that gives the receiver the opportunity to reject the
connection before any data arrives. In the case of teardown, we could quietly close a
connection that has been inactive for a long period of time, but this would complicate
applications like Telnet that want to keep a connection alive for weeks at a time; such
applications would be forced to send out-of-band “keepalive” messages to keep the
connection state at the other end from disappearing.
Finally, TCP is a window-based protocol, but this is not the only possibility.
The alternative is a rate-based design, in which the receiver tells the sender the rate—
expressed in either bytes or packets per second—at which it is willing to accept incoming data. For example, the receiver might inform the sender that it can accommodate
100 packets a second. There is an interesting duality between windows and rate, since
the number of packets (bytes) in the window, divided by the RTT, is exactly the rate.
For example, a window size of 10 packets and a 100-ms RTT implies that the sender is
allowed to transmit at a rate of 100 packets a second. It is by increasing or decreasing
the advertised window size that the receiver is effectively raising or lowering the rate
at which the sender can transmit. In TCP, this information is fed back to the sender
in the AdvertisedWindow field of the ACK for every segment. One of the key issues in
a rate-based protocol is how often the desired rate—which may change over time—is
relayed back to the source: Is it for every packet, once per RTT, or only when the
rate changes? While we have just now considered window versus rate in the context
of flow control, it is an even more hotly contested issue in the context of congestion
control, which we will discuss in Chapter 6.
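The window/rate duality mentioned above is simple arithmetic. As an illustrative sketch (the function names are hypothetical, and the RTT is given in milliseconds to keep the arithmetic exact):

```python
def rate_from_window(window_packets: int, rtt_ms: int) -> float:
    """Packets per second implied by a window of the given size."""
    return window_packets * 1000 / rtt_ms

def window_from_rate(packets_per_second: int, rtt_ms: int) -> float:
    """Window size (in packets) needed to sustain the given rate."""
    return packets_per_second * rtt_ms / 1000

# The text's example: a 10-packet window and a 100-ms RTT.
print(rate_from_window(10, 100))    # 100.0 packets per second
print(window_from_rate(100, 100))   # 10.0 packets
```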
5.3 Remote Procedure Call
As discussed in Chapter 1, a common pattern of communication used by application
programs is the request/reply paradigm, also called message transaction: A client sends
a request message to a server, the server responds with a reply message, and the client
blocks (suspends execution) waiting for this response. Figure 5.11 illustrates the basic
interaction between the client and server in such a message transaction.
A transport protocol that supports the request/reply paradigm is much more than
a UDP message going in one direction, followed by a UDP message going in the other
direction. It also involves overcoming all of the limitations of the underlying network
outlined in the problem statement at the beginning of this chapter. While TCP overcomes these limitations by providing a reliable byte-stream service, it doesn’t match
the request/reply paradigm very well either since going to the trouble of establishing a
TCP connection just to exchange a pair of messages seems like overkill. This section describes a third transport protocol—which we call Remote Procedure Call (RPC)—that
more closely matches the needs of an application involved in a request/reply message
exchange.
RPC is actually more than just a protocol—it is a popular mechanism for structuring distributed systems. RPC is popular because it is based on the semantics of a
local procedure call—the application program makes a call into a procedure without
regard for whether it is local or remote and blocks until the call returns. While this
may sound simple, there are two main problems that make RPC more complicated
than local procedure calls:
Figure 5.11 Timeline for RPC.
■ The network between the calling process and the called process has much
more complex properties than the backplane of a computer. For example, it is
likely to limit message sizes and has a tendency to lose and reorder messages.
■ The computers on which the calling and called processes run may have significantly different architectures and data representation formats.
Thus, a complete RPC mechanism actually involves two major components:
1 A protocol that manages the messages sent between the client and the server processes and that deals with the potentially undesirable properties of the underlying
network
2 Programming language and compiler support to package the arguments into a
request message on the client machine and then to translate this message back
into the arguments on the server machine, and likewise with the return value
(this piece of the RPC mechanism is usually called a stub compiler)
Figure 5.12 schematically depicts what happens when a client invokes a remote
procedure. First, the client calls a local stub for the procedure, passing it the arguments
required by the procedure. This stub hides the fact that the procedure is remote by
Figure 5.12 Complete RPC mechanism.
translating the arguments into a request message and then invoking an RPC protocol
to send the request message to the server machine. At the server, the RPC protocol
delivers the request message to the server stub, which translates it into the arguments to
the procedure and then calls the local procedure. After the server procedure completes,
it returns the answer to the server stub, which packages this return value in a reply
message that it hands off to the RPC protocol for transmission back to the client. The
RPC protocol on the client passes this message up to the client stub, which translates
it into a return value that it returns to the client program.
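The path a call takes through the stubs can be sketched in a single process. Everything here is a hypothetical illustration: JSON stands in for argument translation (which the text defers to Chapter 7), a direct function call stands in for the RPC protocol and the network, and the procedure `add` is an invented example.

```python
import json

def add(a, b):
    """The 'remote' procedure, which lives on the server."""
    return a + b

PROCEDURES = {"add": add}  # procedures the server is willing to run

def server_stub(request: bytes) -> bytes:
    """Translate a request message into arguments, call the local
    procedure, and package the return value in a reply message."""
    message = json.loads(request)
    result = PROCEDURES[message["proc"]](*message["args"])
    return json.dumps({"result": result}).encode()

def rpc_protocol(request: bytes) -> bytes:
    """Stand-in for the RPC protocol: a real one would carry the request
    across the network and bring the reply back."""
    return server_stub(request)

def client_stub(proc: str, *args):
    """Hide the fact that the procedure is remote from the caller."""
    request = json.dumps({"proc": proc, "args": list(args)}).encode()
    reply = rpc_protocol(request)
    return json.loads(reply)["result"]

# To the client program this looks like an ordinary procedure call.
print(client_stub("add", 2, 3))
```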
This section considers just the protocol-related aspects of an RPC mechanism.
That is, it ignores the stubs and focuses instead on the RPC protocol that transmits
messages between client and server; the transformation of arguments into messages
and vice versa is covered in Chapter 7. Furthermore, since RPC is a generic term—
rather than a specific standard like TCP—we are going to take a different approach
than we did in the previous section. Instead of organizing the discussion around an
existing standard (i.e., TCP) and then pointing out alternative designs at the end, we
are going to walk you through the thought process involved in designing an RPC
protocol. That is, we will design our own RPC protocol from scratch—considering
the design options at every step of the way—and then come back and describe some
widely used RPC protocols by comparing and contrasting them to the protocol we
just designed.
Before jumping in, however, we note that an RPC protocol performs a rather
complicated set of functions, and so instead of treating RPC as a single, monolithic
protocol, we develop it as a “stack” of three smaller protocols: BLAST, CHAN, and
SELECT. Each of these smaller protocols, which we sometimes call a microprotocol,
contains a single algorithm that addresses one of the problems outlined at the start of
this chapter. As a brief overview:
■ BLAST: fragments and reassembles large messages
■ CHAN: synchronizes request and reply messages
■ SELECT: dispatches request messages to the correct process
These microprotocols are complete, self-contained protocols that can be used in different combinations to provide different end-to-end services. Section 5.3.4 shows how
they can be combined to implement RPC.
Just to be clear, BLAST, CHAN, and SELECT are not standard protocols in the
sense that TCP, UDP, and IP are. They are simply protocols of our own invention, but
ones that demonstrate the algorithms needed to implement RPC. Because this section
is not constrained by the artifacts of what has been designed in the past, it provides a
particularly good opportunity to examine the principles of protocol design.
Bulk Transfer (BLAST)
The first problem we are going to tackle is
how to turn an underlying network that delivers messages of some small size (say, 1 KB)
into a service that delivers messages of a
much larger size (say, 32 KB). While 32 KB
does not qualify as “arbitrarily large,” it is
large enough to be of practical use for many
applications, including most distributed file
systems. Ultimately, a stream-based protocol like TCP (see Section 5.2) will be
needed to support an arbitrarily large message, since any message-oriented protocol
will necessarily have some upper limit to the
size of the message it can handle, and you
can always imagine needing to transmit a
message that is larger than this limit.
We have already examined the basic
technique that is used to transmit a large
message over a network that can accommodate only smaller messages—fragmentation
and reassembly. We now describe the
BLAST protocol, which uses this technique. One of the unique properties of
BLAST is how hard it tries to deliver all
the fragments of a message. Unlike the
AAL segmentation/reassembly mechanism
used with ATM (see Section 3.3) or the IP
fragmentation/reassembly mechanism (see
Section 4.1), BLAST attempts to recover
from dropped fragments by retransmitting
them. However, BLAST does not go so far
as to guarantee message delivery. The significance of this design choice will become
clear later in this section.
What Layer Is RPC?
Once again, the “What layer is this?” issue raises its ugly head. To many people,
especially those who adhere to the Internet architecture, RPC is implemented on top
of a transport protocol (usually UDP) and so cannot itself (by definition) be a
transport protocol. It is equally valid, however, to argue that the Internet should
have an RPC protocol, since it offers a process-to-process service that is
fundamentally different from that offered by TCP and UDP. The usual response to such
a suggestion, however, is that the Internet architecture does not prohibit network
designers from implementing their own RPC protocol on top of UDP. (In general, UDP is
viewed as the Internet architecture’s “escape hatch,” since effectively it just adds
a layer of demultiplexing to IP.) Whichever side of the issue of whether the Internet
should have an official RPC protocol you support, the important point is that the way
you implement RPC in the Internet architecture says nothing about whether RPC should
be considered a transport protocol or not.
Interestingly, there are other people who believe that RPC is the most interesting
protocol in the world and that TCP/IP is just what you do when you want to go “off
site.” This is the predominant view of the operating systems community, which has
built countless OS kernels for distributed systems that contain exactly one protocol
(you guessed it, RPC) running on top of a network device driver.
The water gets even muddier when you implement RPC as a combination of three
different microprotocols, as is the case in this section. In such a situation, which
of the three is the “transport” protocol? Our answer to this question is that any
protocol that offers process-to-process service, as opposed to node-to-node or
host-to-host service, qualifies as a transport protocol. Thus, RPC is a transport
protocol and, in fact, can be implemented from a combination of microprotocols that
are themselves valid transport protocols.
BLAST Algorithm
The basic idea of BLAST is for the sender to break a large message passed to it by
some high-level protocol into a set of smaller fragments, and then for it to transmit
these fragments back-to-back over the network. Hence the name BLAST: the protocol
does not wait for any of the fragments to be acknowledged before sending the next.
The receiver then sends a selective retransmission request (SRR) back to the sender,
indicating which fragments arrived and which did not. (The SRR message is sometimes
called a partial or selective acknowledgment.) Finally, the sender retransmits the
missing fragments. In the case in which all the fragments have arrived, the SRR
serves to fully acknowledge the message. Figure 5.13 gives a representative timeline
for the BLAST protocol.
We now consider the send and receive sides of BLAST in more detail. On the sending
side, after fragmenting the message and transmitting each of the fragments, the
sender sets a timer called DONE. Whenever an SRR arrives, the sender retransmits the
requested fragments and resets timer DONE. Should the SRR indicate that all the
fragments have arrived, the sender frees its copy of the message and cancels timer
DONE. If timer DONE ever expires, the sender frees its copy of the message; that is,
it gives up.
On the receiving side, whenever the first fragment of a message arrives, the
receiver initializes a data structure to hold the individual fragments as they arrive and sets
a timer LAST FRAG. This timer counts the
time that has elapsed since the last fragment
arrived. Each time a fragment for that message arrives, the receiver adds it to this data
structure, and should all the fragments then be present, it reassembles them into a
complete message and passes this message up to the higher-level protocol. There are
four exceptional conditions, however, that the receiver watches for:
Figure 5.13 Representative timeline for BLAST.
■ If the last fragment arrives (the last fragment is specially marked) but
the message is not complete, then the receiver determines which fragments
are missing and sends an SRR to the sender. It also sets a timer called
RETRY.
■ If timer LAST FRAG expires, then the receiver determines which fragments
are missing and sends an SRR to the sender. It also sets timer RETRY.
■ If timer RETRY expires for the first or second time, then the receiver determines which fragments are still missing and retransmits an SRR message.
■ If timer RETRY expires for the third time, then the receiver frees the fragments
that have arrived and cancels timer LAST FRAG; that is, it gives up.
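The receive-side rules above can be sketched as an event-driven class for a single message. This is a simplified illustration under stated assumptions: real timers are replaced by methods that a timer facility would call, an SRR is recorded as the list of missing fragment indices rather than sent anywhere, the fragment count is assumed known up front, and the class and method names are hypothetical.

```python
class BlastReceiver:
    """Sketch of BLAST's receive side for one message (no real timers)."""

    MAX_RETRY_EXPIRATIONS = 3  # give up when RETRY expires the third time

    def __init__(self, num_frags: int):
        self.num_frags = num_frags
        self.fragments = {}          # fragment index -> payload
        self.retry_expirations = 0
        self.sent_srrs = []          # each SRR recorded as the missing indices
        self.delivered = None        # reassembled message, once complete
        self.gave_up = False

    def missing(self):
        return [i for i in range(self.num_frags) if i not in self.fragments]

    def _send_srr(self):
        # A real receiver would transmit the SRR and (re)start timer RETRY.
        self.sent_srrs.append(self.missing())

    def on_fragment(self, index: int, payload: bytes):
        if self.delivered is not None or self.gave_up:
            return
        self.fragments[index] = payload
        if not self.missing():
            # All fragments present: reassemble and pass the message up.
            self.delivered = b"".join(
                self.fragments[i] for i in range(self.num_frags))
        elif index == self.num_frags - 1:
            # Last fragment arrived but the message is incomplete.
            self._send_srr()

    def on_last_frag_timeout(self):
        if self.delivered is None and not self.gave_up:
            self._send_srr()

    def on_retry_timeout(self):
        if self.delivered is not None or self.gave_up:
            return
        self.retry_expirations += 1
        if self.retry_expirations < self.MAX_RETRY_EXPIRATIONS:
            self._send_srr()         # first or second expiration: ask again
        else:
            self.fragments.clear()   # third expiration: free fragments, give up
            self.gave_up = True

receiver = BlastReceiver(num_frags=3)
receiver.on_fragment(0, b"a")
receiver.on_fragment(2, b"c")        # last fragment, but fragment 1 is missing
print(receiver.sent_srrs)            # [[1]]
receiver.on_fragment(1, b"b")        # the retransmitted fragment arrives
print(receiver.delivered)            # b'abc'
```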
There are three aspects of BLAST worth noting. First, two different events trigger
the initial transmission of an SRR: the arrival of the last fragment and the firing of the
LAST FRAG timer. In the case of the former, because the network may reorder packets,
the arrival of the last fragment does not necessarily imply that an earlier fragment is
missing (it may just be late in arriving), but since this is the most likely explanation,
BLAST aggressively sends an SRR message. In the latter case, we deduce that the last
fragment was either lost or seriously delayed.
Second, the performance of BLAST does not critically depend on how carefully
the timers are set. Timer DONE is used only to decide that it is time to give up
and delete the message that is currently being worked on. This timer can be set to a
fairly large value since its only purpose is to reclaim storage. Timer RETRY is only
used to retransmit an SRR message. Any time the situation is so bad that a protocol
is reexecuting a failure recovery process, performance is the last thing on its mind.
Finally, timer LAST FRAG has the potential to influence performance—it sometimes
triggers the sending by the receiver of an SRR message—but this is an unlikely event:
It only happens when the last fragment of the message happens to get dropped in the
network.
Third, while BLAST is persistent in asking for and retransmitting missing fragments, it does not guarantee that the complete message will be delivered. To understand
this, suppose that a message consists of only one or two fragments and that these fragments are lost. The receiver will never send an SRR, and the sender’s DONE timer
will eventually expire, causing the sender to release the message. To guarantee delivery, BLAST would need for the sender to time out if it does not receive an SRR and
then retransmit the last set of fragments it had transmitted. While BLAST certainly
could have been designed to do this, we chose not to because the purpose of BLAST is
to deliver large messages, not to guarantee message delivery. Other protocols can be
configured on top of BLAST to guarantee message delivery. You might wonder why
we put any retransmission capability at all into BLAST if we need to put a guaranteed delivery mechanism above it anyway. The reason is that we’d prefer to retransmit
only those fragments that were lost rather than having to retransmit the entire larger
message whenever one fragment is lost. So we get the guarantees from the higher-level
protocol but some improved efficiency by retransmitting fragments in BLAST.
BLAST Message Format
The BLAST header has to convey several pieces of information. First, it must contain
some sort of message identifier so that all the fragments that belong to the same
message can be identified. Second, there must be a way to identify where in the original
message the individual fragments fit, and likewise, an SRR must be able to indicate
which fragments have arrived and which are missing. Third, there must be a way to
distinguish the last fragment, so that the receiver knows when it is time to check to
see if all the fragments have arrived. Finally, it must be possible to distinguish a data
Figure 5.14 Format for BLAST message header.
message from an SRR message. Some of these items are encoded in a header field in an
obvious way, but others can be done in a variety of different ways. Figure 5.14 gives
the header format used by BLAST. The following discussion explains the various fields
and considers alternative designs.
The MID field uniquely identifies this message. All fragments that belong to the
same message have the same value in their MID field. The only question is how many
bits are needed for this field. This is similar to the question of how many bits are needed
in the SequenceNum field for TCP. The central issue in deciding how many bits to use
in the MID field has to do with how long it will take before this field wraps around
and the protocol starts using message ids over again. If this happens too soon—that
is, the MID field is only a few bits long—then it is possible for the protocol to become
confused by a message that was delayed in the network, so that an old incarnation of
some message id is mistaken for a new incarnation of that same id. So, how many bits
are enough to ensure that the amount of time it takes for the MID field to wrap around
is longer than the amount of time a message can potentially be delayed in the network?
In the worst-case scenario, each BLAST message contains a single fragment that is
1 byte long, which means that BLAST might need to generate a new MID for every byte
it sends. On a 10-Mbps Ethernet, this would mean generating a new MID roughly once
every microsecond, while on a 1.2-Gbps STS-24 link, a new MID would be required
once every 7 nanoseconds. Of course, this is a ridiculously conservative calculation—
the overhead involved in preparing a message is going to be more than a microsecond.
Thus, suppose a new MID is potentially needed once every microsecond, and a message may be delayed in the network for up to 60 seconds (our standard worst-case
assumption for the Internet); then we need to ensure that there are more than 60
million MID values. While a 26-bit field would be sufficient (2^26 = 67,108,864), it is
easier to deal with header fields that are even multiples of a byte, so we will settle on
a 32-bit MID field.
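The sizing argument above can be checked with a few lines of arithmetic. The variable names are illustrative; the inputs (one new MID per microsecond, 60 seconds of worst-case delay) are the text's own assumptions.

```python
mids_per_second = 1_000_000   # one new MID every microsecond
max_delay_seconds = 60        # worst-case lifetime of a message in the network

# MID values that could be generated while an old message is still in flight.
mids_in_flight = mids_per_second * max_delay_seconds

# Smallest field width b such that 2**b exceeds that count.
min_bits = mids_in_flight.bit_length()

# How long a 32-bit MID field lasts before wrapping at the same rate.
wrap_seconds_32 = 2**32 / mids_per_second

print(mids_in_flight, min_bits, 2**min_bits)
print(round(wrap_seconds_32 / 60, 1), "minutes until a 32-bit MID wraps")
```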
This conservative (you could say paranoid) analysis of the MID field illustrates an
important point. When designing a transport protocol, it is tempting to take shortcuts,
since not all networks suffer from all the problems listed in the problem statement at
the beginning of this chapter. For example, messages do not get stuck in an Ethernet
for 60 seconds, and similarly, it is physically impossible to reorder messages on an
Ethernet segment. The problem with this way of thinking, however, is that if you want
the transport protocol to work over any kind of network, then you have to design for
the worst case. This is because the real danger is that as soon as you assume that an
Ethernet does not reorder packets, someone will come along and put a bridge or a
router in the middle of it.
Let’s move on to the other fields in the BLAST header. The Type field indicates
whether this is a DATA message or an SRR message. Notice that while we certainly don’t
need 16 bits to represent these two types, as a general rule we like to keep the header
fields aligned on 32-bit (word) boundaries, so as to improve processing efficiency.
The ProtNum field identifies the high-level protocol that is configured on top of BLAST;
incoming messages are demultiplexed to this protocol. The Length field indicates how
many bytes of data are in this fragment; it has nothing to do with the length of the
entire message. The NumFrags field indicates how many fragments are in this message.
This field is used to determine when the last fragment has been received. An alternative
is to include a flag that is only set for the last fragment.
Finally, the FragMask field is used to distinguish among fragments. It is a 32-bit
field that is used as a bit mask. For messages of Type = DATA, the ith bit is 1 (all
others are 0) to indicate that this message carries the ith fragment. For messages of
Type = SRR, the ith bit is set to 1 to indicate that the ith fragment has arrived, and it
is set to 0 to indicate that the ith fragment is missing. Note that there are several ways
to identify fragments. For example, the header could have contained a simple “fragment ID” field, with this field set to i to denote the ith fragment. The tricky part with
this approach, as opposed to a bit-vector, is how the SRR specifies which fragments
have arrived and which have not. If it takes an n-bit number to identify each missing
fragment—as opposed to a single bit in a fixed-size bit-vector—then the SRR message
will be of variable length, depending on how many fragments are missing. Variable-length headers are allowed, but they are a little trickier to process. On the other hand,
one limitation of the BLAST header given above is that the length of the bit-vector
limits each message to only 32 fragments. If the underlying network has an MTU of
1 KB, then this is sufficient to send up to 32-KB messages.
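The two uses of FragMask can be sketched with ordinary bit operations. One assumption is made loudly here: bit i is taken to mean the bit with value `1 << i` (least significant bit first), which the text does not specify. The function names are hypothetical.

```python
MAX_FRAGS = 32  # the 32-bit FragMask limits a message to 32 fragments

def data_mask(i: int) -> int:
    """FragMask for a DATA message carrying the ith fragment:
    only bit i is set (assumed: bit i has value 1 << i)."""
    assert 0 <= i < MAX_FRAGS
    return 1 << i

def srr_mask(received) -> int:
    """FragMask for an SRR: bit i is 1 if fragment i has arrived."""
    mask = 0
    for i in received:
        mask |= data_mask(i)
    return mask

def missing_fragments(mask: int, num_frags: int) -> list:
    """Fragments the sender must retransmit, given an SRR's FragMask."""
    return [i for i in range(num_frags) if not mask & (1 << i)]

# Fragments 0, 1, and 3 of a 4-fragment message have arrived.
mask = srr_mask([0, 1, 3])
print(bin(mask))                       # 0b1011
print(missing_fragments(mask, 4))      # [2]
```

Because the SRR is a fixed-size bit-vector, its length does not depend on how many fragments are missing, which is the trade-off against a per-fragment ID discussed above.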
Request/Reply (CHAN)
The next microprotocol, CHAN, implements the request/reply algorithm that is at the
core of RPC. In terms of the common properties of transport protocols given in the
problem statement at the beginning of this chapter, CHAN guarantees message delivery, ensures that only one copy of each message is delivered, and allows the communicating processes to synchronize with each other. In the case of this last item,
the synchronization we are after mimics the behavior of a procedure call—the caller
(client) blocks while waiting for a reply from the callee (server).
At-Most-Once Semantics
The name CHAN comes from the fact that the protocol implements a logical request/
reply channel between a pair of participants. At any given time, there can be only one
message transaction active on a given channel. Like the concurrent logical channel
protocol described in Section 2.5.3, the application programs have to open multiple
channels if they want to have more than one request/reply transaction between them
at the same time.
The most important property of each channel is that it preserves a semantics
known as at-most-once. This means that for every request message that the client
sends, at most one copy of that message is delivered to the server. Stated in terms of
the RPC mechanism that CHAN is designed to support, for each time the client calls a
remote procedure, that procedure is invoked at most one time on the server machine.
We say “at-most-once” rather than “exactly once” because it is always possible that
either the network or the server machine has failed, making it impossible to deliver
even one copy of the request message.
As obvious as at-most-once sounds, not all RPC protocols support this behavior.
Some support a semantics that is facetiously called zero-or-more semantics, that is,
each invocation on a client results in the remote procedure being invoked zero or
more times. It is not hard to understand how this would cause problems for a remote
procedure that changed some local state variable (e.g., incremented a counter) or
that had some externally visible side effect (e.g., launched a missile) each time it was
invoked. On the other hand, if the remote procedure being invoked is idempotent—
multiple invocations have the same effect as just one—then the RPC mechanism need
not support at-most-once semantics; a simpler (possibly faster) implementation will suffice.
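The difference between an idempotent procedure and one with side effects can be made concrete with a small sketch. Everything here is hypothetical (ServerState, increment, set_level, deliver_copies); it simply models a request that the network delivers zero or more times.

```c
#include <assert.h>

/* Hypothetical server-side state touched by two remote procedures. */
typedef struct {
    int counter;   /* changed by increment() */
    int level;     /* changed by set_level() */
} ServerState;

/* Not idempotent: every extra delivery changes the result. */
void increment(ServerState *s) { s->counter += 1; }

/* Idempotent: repeating the same request leaves the same state. */
void set_level(ServerState *s, int level) { s->level = level; }

/* Model one client call whose request reaches the server `copies` times. */
void deliver_copies(ServerState *s, int copies, int level)
{
    assert(copies >= 0);
    for (int i = 0; i < copies; i++) {
        increment(s);
        set_level(s, level);
    }
}
```

Delivering the request three times leaves level exactly as one delivery would, but counter ends up at 3 instead of 1. Only the second kind of procedure needs the protection that at-most-once semantics provides.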
CHAN Algorithm
The request/reply algorithm has several subtle aspects; hence, we develop it in stages.
The basic algorithm is straightforward, as illustrated by the timeline given in Figure
5.15. The client sends a request message and the server acknowledges it. Then, after
executing the procedure, the server sends a reply message and the client acknowledges
the reply.
Figure 5.15 Simple timeline for CHAN.
Figure 5.16 Timeline for CHAN when using implicit ACKs.
Because the reply message often comes back with very little delay, and it is sometimes the case that the client turns around and makes a second request on the same
channel immediately after receiving the first reply, this basic scenario can be optimized
by using a technique called implicit acknowledgments. As illustrated in Figure 5.16,
the reply message serves to acknowledge the request message, and a subsequent request
acknowledges the preceding reply.
There are two factors that potentially complicate the rosy picture we have painted
so far. The first is that either a message carrying data (a request message or a reply
message) or the ACK sent to acknowledge that message may be lost in the network.
To account for this possibility, both client and server save a copy of each message
they send until an ACK for it has arrived. Each side also sets a RETRANSMIT
timer and resends the message should this timer expire. Both sides reset this timer
and try again some agreed-upon number of times before giving up and freeing the message.
Recall from Section 2.5.1 that this acknowledgment/timeout strategy means that
it is possible for duplicate copies of a message to arrive—the original message arrives,
the ACK is lost, and then the retransmission arrives. Thus, the receiver must remember
what messages it has seen and discard any duplicates. This is done through the use of a
MID field in the header. Any message whose MID field does not match the next expected
MID is discarded instead of being passed up to the high-level protocol configured on
top of CHAN.
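The duplicate check described above amounts to comparing each arriving MID with the next one expected. The following is a minimal sketch, not CHAN's actual code: the Receiver type and accept_mid function are hypothetical, and the sketch leaves out re-acknowledging a discarded duplicate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t next_mid;   /* MID the receiver expects next on this channel */
} Receiver;

/* Returns 1 if the message should be passed up to the high-level
   protocol, or 0 if it is a duplicate or stale and must be discarded. */
int accept_mid(Receiver *r, uint32_t mid)
{
    assert(r != NULL);
    if (mid != r->next_mid)
        return 0;
    r->next_mid++;       /* unsigned arithmetic wraps modulo 2^32 */
    return 1;
}
```

If the original message with MID 1 arrives, its ACK is lost, and the retransmission arrives, the second call to accept_mid with MID 1 returns 0, so the procedure is not invoked twice.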
The second complication is that the server may take an arbitrarily long time
to produce the result, and worse yet, it may crash (either the process or the entire
machine) before generating the reply. Keep in mind that we are talking about the period
of time after the server has acknowledged the request but before it has sent the reply.
To help the client distinguish between a slow server and a dead server, CHAN’s client
side periodically sends an “Are you alive?” message to the server, and CHAN’s server
side responds with an ACK. Alternatively, the server could send “I am still alive”
messages to the client without the client having first solicited them, but we prefer the
client-initiated approach because it keeps the server as simple as possible (i.e., it has
one less timer to manage).
CHAN Message Format
The CHAN message format is given in Figure 5.17. As with BLAST, the Type field
specifies the type of the message; in this case, the possible types are REQ, REP, ACK,
and PROBE. (PROBE is the “Are you alive?” message discussed above.) Similarly, the
ProtNum field identifies the high-level protocol that depends on CHAN.
The CID field uniquely identifies the logical channel to which this message belongs.
This is a 16-bit field, meaning that CHAN supports up to 64K concurrent request/reply
transactions between any pair of hosts. Of course, a given host can be participating in
channels with many other hosts at the same time.
The MID field uniquely identifies each request/reply pair; the reply message has
the same MID as the request. Note that because CHAN permits only one message
transaction at a time on a given channel, you might think that a 1-bit MID field is
sufficient, just as for the stop-and-wait algorithm presented in Section 2.5.1. However,
as with BLAST, we have to be concerned about messages that wander around the
network for an extended period of time and then suddenly appear at the destination,
confusing CHAN. Thus, using much the same reasoning as we used in Section 5.3.1,
CHAN uses a 32-bit MID field.
Figure 5.17 Format for CHAN message header.
Finally, the BID field gives the boot id for the host. A machine’s boot id is a
number that is incremented each time the machine reboots; this number is read from
disk, incremented, and written back to disk during the machine’s startup procedure.
This number is then put in every message sent by that host. The role played by the
BID field is much the same as the role played by the large MID field—it protects against
old messages suddenly appearing at the destination—although in this case, the old
message is due not to an arbitrary delay in the network but rather to a machine that
has crashed and rebooted.
To understand the use of the boot id, consider the following pathological situation. A client machine sends a request message with MID = 0, then crashes and
reboots, and then sends an unrelated request message, also with MID = 0. The server
may not have been aware that the client crashed and rebooted, and upon seeing a request message with MID = 0, acknowledges it and discards it as a duplicate. To protect
against this possibility, each side of CHAN makes sure that the (BID, MID) pair, not
just the MID, matches what it is expecting. BID is also a 32-bit field, which means that
if we assume that it takes at least 10 minutes to reboot a machine, it will wrap around
once every 40 billion minutes (approximately 80,000 years). In effect, the BID and
MID combine to form a unique 64-bit id for each transaction; the low-order 32 bits
are incremented for each transaction but reset to 0 when the machine reboots, and the
high-order 32 bits are incremented each time the machine reboots.
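The combined identifier and the wraparound estimate can be checked with a short sketch. The function names (transaction_id, same_transaction, bid_wrap_years) are hypothetical, and a 365-day year is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Combine boot id (high-order 32 bits) and message id (low-order 32 bits). */
uint64_t transaction_id(uint32_t bid, uint32_t mid)
{
    return ((uint64_t)bid << 32) | mid;
}

/* Two messages belong to the same transaction only if both halves match. */
int same_transaction(uint32_t seen_bid, uint32_t seen_mid,
                     uint32_t bid, uint32_t mid)
{
    return transaction_id(seen_bid, seen_mid) == transaction_id(bid, mid);
}

/* Whole years before a 32-bit boot id wraps, given minutes per reboot. */
uint64_t bid_wrap_years(uint64_t minutes_per_reboot)
{
    const uint64_t boots = (uint64_t)1 << 32;            /* 4,294,967,296 */
    const uint64_t minutes_per_year = 365u * 24u * 60u;  /* 525,600 */
    assert(minutes_per_reboot > 0);
    return boots * minutes_per_reboot / minutes_per_year;
}
```

In the pathological case above, the request sent before the crash and the one sent after it both carry MID = 0 but different boot ids, so same_transaction reports that they differ. At 10 minutes per reboot, bid_wrap_years returns 81,715, in line with the estimate of roughly 80,000 years.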
CHAN involves three different timers: There is a RETRANSMIT timer on both the
client and server, and the client also manages a PROBE timer. The PROBE timer is
not critical to performance and thus can be set to a conservatively large value—on
the order of several seconds. The RETRANSMIT timer, however, does influence the
performance of CHAN. If it is set too large, then CHAN might wait an unnecessarily
long time before retransmitting a message that was lost by the network. This clearly
hurts performance. If the RETRANSMIT timer is set too small, however, then CHAN
may load the network with unnecessary traffic.
If CHAN is designed to run on a local area network only, or even over a campus-size extended LAN, then RETRANSMIT can be set to a fixed value. Something on the
order of 20 milliseconds would be reasonable. This is because the RTT of a LAN is not
that variable. If CHAN is expected to run over the Internet, however, then selecting
a suitable value for RETRANSMIT is similar to the problem faced by TCP. Thus,
CHAN would calculate the RETRANSMIT timeout using a mechanism similar to the
one described in Section 5.2.6. The only difference is that CHAN has to take into
account the fact that the message it is sending ranges in size from 1 byte to 32 KB,
whereas TCP is always transmitting segments of approximately the same size.
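One way an adaptive RETRANSMIT value might be computed is with a running average of measured round-trip times. The sketch below is an assumption rather than CHAN's mechanism: it uses the classic weighted-average form (7/8 of the old estimate plus 1/8 of the new sample, with the timeout set to twice the estimate), and the names RttEstimator, rtt_update, and retransmit_timeout_us are hypothetical. It does not attempt to model how the RTT depends on message size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t est_rtt_us;   /* running RTT estimate, in microseconds */
} RttEstimator;

/* Fold one RTT sample into the estimate: est = 7/8 * est + 1/8 * sample. */
void rtt_update(RttEstimator *e, uint32_t sample_us)
{
    assert(e != NULL);
    e->est_rtt_us =
        (uint32_t)(((uint64_t)7 * e->est_rtt_us + sample_us) / 8);
}

/* The timeout is taken to be twice the current estimate. */
uint32_t retransmit_timeout_us(const RttEstimator *e)
{
    assert(e != NULL);
    return 2 * e->est_rtt_us;
}
```

Starting from an estimate of 20 ms, a single 28-ms sample moves the estimate to 21 ms and the timeout to 42 ms. A protocol that sends messages of widely varying size would probably need to keep separate estimates per size class or scale the sample by message length.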
Synchronous versus Asynchronous Protocols
One way to characterize a protocol is by whether it is synchronous or asynchronous.
These two terms have significantly different meanings, depending on where in the
protocol hierarchy you use them. At the transport layer, it is most accurate to think
of synchrony as a spectrum of possibilities rather than as two alternatives, where
the key attribute of any point along the spectrum is how much the sender knows,
after the operation to send a message returns. In other words, if we assume that
an application program invokes a send operation on a transport protocol, then the
question is, Exactly what does the application know about the success of the operation
when the send operation returns?
At the asynchronous end of the spectrum, the application knows absolutely nothing when send returns. It not only doesn’t know if the message was received by its peer,
but it doesn’t even know for sure that the message has successfully left the local machine. At the synchronous end of the spectrum, the send operation typically returns a
reply message. That is, the application not only knows that the message it sent was received by its peer, but it knows that the peer has returned an answer. Thus, synchronous
protocols implement the request/reply abstraction, while asynchronous protocols are
used if the sender wants to be able to transmit many messages without having to wait
for a response. Using this definition, CHAN is obviously a synchronous protocol.
Although we have not discussed them in this chapter, there are interesting points
between these two extremes. For example, the transport protocol might implement
send so that it blocks (does not return) until the message has been successfully received
at the remote machine, but returns before the sender’s peer on that machine has actually
processed and responded to it. This is sometimes called a reliable datagram protocol.
Implementation of CHAN
We conclude our discussion of CHAN by giving fragments of C code that implement its
client side. Reading code can be tedious, but if done judiciously, it can help to solidify
your understanding of how a system works. In the case of CHAN, it serves to illustrate
all the separate pieces that go into a protocol implementation—the function that sends
an outgoing message, the function that retransmits messages, and the function that
processes incoming messages—and how they interact with each other.
We begin with CHAN’s two key data structures: ChanHdr and ChanState. The
fields in ChanHdr have already been explained. The fields in ChanState will be explained
by the code that follows. Note that ChanState includes a hdr_template field, which is
a copy of the CHAN header. Many of the fields in the CHAN header remain the same
for all messages sent out over this channel. These fields are filled in when the channel
is created (not shown); only the fields that change are modified before a given message
is transmitted.
typedef struct {
    u_short    Type;          /* message type: REQ, REP, ACK, PROBE */
    u_short    CID;           /* unique channel id */
    int        MID;           /* unique message id */
    int        BID;           /* unique boot id */
    int        Length;        /* length of message */
    int        ProtNum;       /* high-level protocol number */
} ChanHdr;

typedef struct {
    u_char     type;          /* type of session: CLIENT or SERVER */
    u_char     status;        /* status of session: BUSY or IDLE */
    Event      event;         /* place to save timeout event */
    int        timeout;       /* timeout value */
    int        retries;       /* number of times retransmitted */
    int        ret_val;       /* place to save return value */
    Msg        *request;      /* place to save request message */
    Msg        *reply;        /* place to save reply message */
    Semaphore  reply_sem;     /* semaphore the client blocks on */
    int        mid;           /* message id for this channel */
    int        bid;           /* boot id for this channel */
    ChanHdr    hdr_template;  /* header template for this channel */
    BlastState blast;         /* pointer to BLAST protocol */
} ChanState;
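To show what turning a header like ChanHdr into bytes might involve, here is a self-contained sketch that packs the six fields in network byte order. The layout is an assumption: the text fixes CID at 16 bits and MID and BID at 32 bits, but the widths chosen for the other fields, the field order, and the names WireHdr, pack_hdr, and unpack_hdr are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { WIRE_HLEN = 20 };   /* assumed size: 2 + 2 + 4 + 4 + 4 + 4 bytes */

typedef struct {
    uint16_t type;      /* REQ, REP, ACK, or PROBE (width assumed) */
    uint16_t cid;       /* channel id, 16 bits per the text */
    uint32_t mid;       /* message id, 32 bits per the text */
    uint32_t bid;       /* boot id, 32 bits per the text */
    uint32_t length;    /* width assumed */
    uint32_t prot_num;  /* width assumed */
} WireHdr;

static void put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

static void put32(uint8_t *p, uint32_t v)
{
    put16(p, (uint16_t)(v >> 16));
    put16(p + 2, (uint16_t)(v & 0xffff));
}

static uint16_t get16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t get32(const uint8_t *p)
{
    return ((uint32_t)get16(p) << 16) | get16(p + 2);
}

/* Write the header into buf (at least WIRE_HLEN bytes), big-endian. */
void pack_hdr(const WireHdr *h, uint8_t *buf)
{
    assert(h != NULL && buf != NULL);
    put16(buf, h->type);
    put16(buf + 2, h->cid);
    put32(buf + 4, h->mid);
    put32(buf + 8, h->bid);
    put32(buf + 12, h->length);
    put32(buf + 16, h->prot_num);
}

/* Read the header back out of buf. */
void unpack_hdr(WireHdr *h, const uint8_t *buf)
{
    assert(h != NULL && buf != NULL);
    h->type = get16(buf);
    h->cid = get16(buf + 2);
    h->mid = get32(buf + 4);
    h->bid = get32(buf + 8);
    h->length = get32(buf + 12);
    h->prot_num = get32(buf + 16);
}
```

Packing byte by byte avoids depending on the host's byte order or on how the compiler pads the structure, which is one reason to keep separate routines for storing and loading a header rather than copying the struct directly.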
We now turn our attention to the function that sends request messages. Since
CHAN exports a synchronous interface to higher-level protocols—the caller blocks until a reply can be returned—the send operation we have been assuming since Chapter 1
is not going to work. Therefore, we introduce a new interface operation, which we
give the generic name call, that blocks until a reply message is available, and returns that reply message to the caller. The first argument identifies the channel being
used; it effectively encapsulates all the information needed to send the message to
the correct destination. The second and third arguments correspond to the abstract
data type (ADT) for messages, and represent the request and reply messages, respectively. We assume this ADT supports the obvious operations (e.g., msgSaveCopy and msgLength).
static int
callCHAN(ChanState *state, Msg *request, Msg *reply)
{
    ChanHdr *hdr;
    char    hbuf[HLEN];

    /* ensure only one transaction per channel */
    if (state->status != IDLE)
        return FAILURE;
    state->status = BUSY;

    /* save a copy of request msg and pointer to reply msg */
    msgSaveCopy(&state->request, request);
    state->reply = reply;

    /* fill out header fields */
    hdr = &state->hdr_template;
    hdr->Length = msgLength(request);
    if (state->mid == MAX_MID)
        state->mid = 0;
    hdr->MID = ++state->mid;

    /* attach header to msg and send it */
    store_chan_hdr(hdr, hbuf);
    msgAddHdr(request, hbuf, HLEN);

    /* schedule first timeout event */
    state->retries = 1;
    state->event = eventSchedule(retransmit, state, state->timeout);

    /* block waiting for the reply msg */
    semWait(&state->reply_sem);

    /* clean up state and return */
    state->status = IDLE;
    return state->ret_val;
}
The first thing to notice is that the ChanState passed as an argument to callCHAN
includes a field named status that indicates whether or not this channel is being used.
If the channel is currently in use, then callCHAN returns failure. An alternative design
would be to block the calling thread until the channel becomes idle. We have elected
to push responsibility for blocking threads that want to use busy channels onto the
higher-level protocol, in our case, SELECT.
The next thing to notice about call is that after filling out the message header
and transmitting the request message via BLAST, the calling process is blocked on a
semaphore (reply_sem). When the reply message eventually arrives, it is processed by
CHAN’s deliverCHAN routine (see below), which copies the reply message into state
variable reply and signals this blocked process. The process then returns. Should the
reply message not arrive, then timeout routine retransmit is called (see below). This
event is scheduled in the body of callCHAN.
The next routine, retransmit, is called whenever the retransmit timer fires. It is
scheduled for the first time in callCHAN, but each time it is called, it reschedules itself.
Once the request message has been retransmitted four times, CHAN gives up: It sets the
return value to FAILURE and wakes up the blocked client process. Finally, each time retransmit executes and sends another copy of the request message, it needs to resave the
message in state variable request. This is because we assume that each time a protocol
sends a message to the lower-level protocol, it loses its reference to the message.
static void
retransmit(Event ev, int *arg)
{
    ChanState *state = (ChanState *)arg;
    Msg       tmp;

    /* unblock the client process if we have retried 4 times */
    if (++state->retries > 4)
    {
        state->ret_val = FAILURE;
        semSignal(&state->reply_sem);
        return;
    }

    /* retransmit request message */
    msgSaveCopy(&tmp, &state->request);

    /* reschedule event with exponential backoff */
    state->timeout = 2*state->timeout;
    state->event = eventSchedule(retransmit, state, state->timeout);
}
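The effect of the retry counter and the doubling timeout can be traced with a small standalone simulation. This is a sketch only: simulate_no_reply and the GiveUp structure are hypothetical, and the loop mirrors the counter logic of the fragment above (retries starts at 1 and is incremented each time the timer fires) for a request that never receives a reply.

```c
#include <assert.h>

typedef struct {
    int transmissions;   /* original send plus retransmissions */
    int total_wait;      /* time from the first send until giving up */
} GiveUp;

/* Trace the timer firings for a request that is never answered. */
GiveUp simulate_no_reply(int initial_timeout, int max_tries)
{
    GiveUp r = { 1, 0 };               /* the original send */
    int retries = 1;
    int timeout = initial_timeout;

    assert(initial_timeout > 0 && max_tries >= 1);
    for (;;) {
        r.total_wait += timeout;       /* the timer fires */
        if (++retries > max_tries)
            return r;                  /* give up and report FAILURE */
        r.transmissions++;             /* resend the saved request */
        timeout *= 2;                  /* exponential backoff */
    }
}
```

With an initial timeout of 20 (say, milliseconds) and a limit of 4, the request is transmitted four times in total and the caller is unblocked with a failure after 20 + 40 + 80 + 160 = 300 time units.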
Finally, we consider CHAN’s deliver routine. The first thing we observe is that
CHAN is an asymmetric protocol: The code that implements CHAN on the client
machine is completely distinct from the code that implements CHAN on the server
machine. This fact is stored in the CHAN state variable (type). Thus, the first thing
CHAN’s deliver routine does is check to see whether it is running on a server (i.e., it
expects REQ messages) or on a client (i.e., it expects REP messages), and then it calls the
appropriate client- or server-specific routine. In this case, we show the client-specific
routine, deliverClient.
static int
deliverClient(ChanState *state, Msg *msg)
{
    ChanHdr hdr;
    char    *hbuf;

    /* strip header and verify correctness */
    hbuf = msgStripHdr(msg, HLEN);
    load_chan_hdr(&hdr, hbuf);
    if (!clnt_msg_ok(state, &hdr))
        return FAILURE;

    /* cancel retransmit timeout event */

    /* if this is an ACK, then schedule PROBE timer and exit */
    if (hdr.Type == ACK)
    {
        state->event = eventSchedule(probe, state, PROBE);
        return SUCCESS;
    }

    /* msg is a REP; save it and signal blocked client */
    msgSaveCopy(state->reply, msg);
    state->ret_val = SUCCESS;
    semSignal(&state->reply_sem);
    return SUCCESS;
}
Routine deliverClient first checks to see if it has received the expected message,
for example, that it has the right MID, the right BID, and that the message is of type
REP or ACK. This check is made in subroutine clnt_msg_ok (not shown). If it is a valid
acknowledgment message, then deliverClient cancels the RETRANSMIT timer and
schedules the PROBE timer. The PROBE timer is not shown, but would be similar to
the RETRANSMIT timer given above. If the message is a valid reply, then deliverClient
cancels the RETRANSMIT timer, saves a copy of the reply message in state variable
reply, and wakes up the blocked client process. It is this client process that actually
returns the reply message to the high-level protocol; the process that called deliverClient
simply returns back down the protocol stack.
Dispatcher (SELECT)
The final microprotocol, called SELECT, dispatches request messages to the appropriate procedure. It is the RPC protocol stack’s version of a demultiplexing protocol
like UDP; the main difference is that it is a synchronous protocol rather than an asynchronous protocol. What this means is that on the client side, SELECT is given a procedure number that the client wants to invoke, it puts this number in its header, and then
it invokes the call operation on a lower-level request/reply protocol like CHAN. When
this invocation returns, SELECT merely lets the return pass through to the client; it
has no real demultiplexing work to do. On the server side, SELECT uses the procedure
number it finds in its header to select the right local procedure to invoke. When this procedure returns, SELECT simply returns to the low-level protocol that just invoked it.
It may seem that SELECT is so simple that it is not worthy of being treated as a
separate protocol. After all, CHAN already has its own demultiplexing field that could
be used to dispatch incoming request messages to the appropriate procedure. There
are two reasons why we elected to separate SELECT into a self-contained protocol.
The first is that doing so makes it possible to change the address space with
which procedures are identified simply by configuring a different version of SELECT
into the protocol graph. In some settings, it is sufficient to define a flat address space for
procedures—for example, a 16-bit selector field allows you to identify 64K different
procedures. In other settings, however, a flat address space is hard to manage—who
decides which procedure gets which procedure number? In this case, it might be better
to have a hierarchical address space, that is, a two-part procedure number. First,
each program could be given a program number, where a program corresponds to
something like a “file server” or a “name server.” Next, each program could be given
the responsibility to assign unique procedure numbers to its own procedures. For
example, within the file server program, read might be procedure 1, write might be
procedure 2, seek might be procedure 3, and so on, whereas within the name server
program, insert might be procedure 1 and lookup might be procedure 2.
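A two-part procedure number maps naturally onto a table of programs, each holding its own table of procedures. The sketch below is illustrative only: the program numbers, the stub procedures, and the names select_procedure and invoke are hypothetical, and the stubs return arbitrary values so that the dispatch can be observed.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*Procedure)(int arg);

typedef struct {
    unsigned         program;   /* program number, e.g. the file server */
    size_t           nprocs;    /* how many procedures it exports */
    const Procedure *procs;     /* indexed by procedure number - 1 */
} Program;

/* Stub procedures standing in for real services. */
static int fs_read(int arg)   { return 100 + arg; }
static int fs_write(int arg)  { return 200 + arg; }
static int ns_insert(int arg) { return 300 + arg; }
static int ns_lookup(int arg) { return 400 + arg; }

static const Procedure file_procs[] = { fs_read, fs_write };
static const Procedure name_procs[] = { ns_insert, ns_lookup };

static const Program programs[] = {
    { 1, 2, file_procs },   /* hypothetical program 1: file server */
    { 2, 2, name_procs },   /* hypothetical program 2: name server */
};

/* Map (program, procedure) to a function, or NULL if there is none. */
Procedure select_procedure(unsigned program, unsigned proc)
{
    for (size_t i = 0; i < sizeof programs / sizeof programs[0]; i++) {
        if (programs[i].program != program)
            continue;
        if (proc == 0 || proc > programs[i].nprocs)
            return NULL;
        return programs[i].procs[proc - 1];
    }
    return NULL;
}

/* Dispatch a call that is known to name a valid procedure. */
int invoke(unsigned program, unsigned proc, int arg)
{
    Procedure p = select_procedure(program, proc);
    assert(p != NULL);
    return p(arg);
}
```

Procedure 1 means read in the file server and insert in the name server, so each program can number its procedures without coordinating with any other program.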
The second reason we implement SELECT as its own protocol is that it provides
a good place to manage concurrency. Recall that CHAN supports at-most-once channels. Suppose we want to allow applications running on this host to make multiple
outstanding calls to the same remote procedure. Since CHAN allows only one outstanding call at a time, the only way to do this is to open multiple channels to the same
server. Each time a calling process invokes SELECT, it sends the process out on an
idle channel. If all the channels are currently active, then SELECT blocks the calling
process until a channel becomes idle.
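The channel-selection step can be sketched as a small pool of channels, each either idle or busy. This is a simplified illustration, not SELECT's code: ChannelPool, acquire_channel, and release_channel are hypothetical names, the pool size is arbitrary, and where a real implementation would block the caller the sketch just returns -1.

```c
#include <assert.h>
#include <stddef.h>

enum { IDLE, BUSY };
enum { NUM_CHANNELS = 3 };   /* hypothetical pool size */

typedef struct { int status; } Channel;
typedef struct { Channel chan[NUM_CHANNELS]; } ChannelPool;

/* Claim an idle channel and return its index, or -1 if every channel
   is busy (the point at which the caller would be blocked). */
int acquire_channel(ChannelPool *p)
{
    assert(p != NULL);
    for (int i = 0; i < NUM_CHANNELS; i++) {
        if (p->chan[i].status == IDLE) {
            p->chan[i].status = BUSY;
            return i;
        }
    }
    return -1;
}

/* Return a channel to the pool once its call has completed. */
void release_channel(ChannelPool *p, int idx)
{
    assert(p != NULL && idx >= 0 && idx < NUM_CHANNELS);
    p->chan[idx].status = IDLE;
}
```

Three concurrent calls each get their own channel; a fourth has to wait until one of them completes and its channel is released.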
Putting It All Together (SunRPC, DCE)
We are now ready to construct an RPC stack from the microprotocols described in the
three previous subsections. This section also explains two widely used RPC protocols—
SunRPC and DCE-RPC—in terms of our three microprotocols.
A Simple RPC Stack
Figure 5.18 depicts a simple protocol stack that implements RPC. At the bottom are the
protocols that implement the underlying network. Although this stack could contain
protocols corresponding to any of the networking technologies discussed in the three
previous chapters, we use IP running on top of an Ethernet for illustrative purposes.
On top of IP is BLAST, which turns the small message size of the underlying
network into a communication service that supports messages of up to 32 KB in
length. Notice that it is not strictly true that the underlying network provides for
only small messages; IP can handle messages of up to 64 KB. However, because IP
has to fragment such large messages before sending them out over the Ethernet, and
BLAST’s fragmentation/reassembly algorithm is superior to IP’s (because it is able to
selectively retransmit missing fragments), we prefer to treat IP as though it supports exactly the same MTU as the underlying physical network. This puts the fragmentation/reassembly burden on BLAST, unless IP has to perform fragmentation out in the middle
of the network somewhere.
Figure 5.18 A simple RPC stack.
Next, CHAN implements the request/reply algorithm. Recall that we chose not
to implement reliable delivery in BLAST, but instead postponed solving this issue until
a higher-level protocol. In this case, CHAN’s timeout and acknowledgment mechanism makes sure messages are reliably delivered. Other protocols might use different
techniques to guarantee delivery or, for that matter, might choose not to implement
reliable delivery at all. This is an example of the end-to-end argument at work—do
not do at low levels of the system (e.g., BLAST) what has to be done at higher levels
(e.g., CHAN) anyway.
Finally, SELECT defines an address space for identifying remote procedures. As
suggested in Section 5.3.3, different versions of SELECT, each defining a different
method for identifying procedures, could be configured on top of CHAN. In fact, it
would even be possible to write a version of SELECT that mimics some existing RPC
package’s address space for procedures (such as SunRPC’s), and then to use CHAN
and BLAST underneath this new SELECT to implement the rest of the RPC stack.
This new stack would not interoperate with the original protocol, but it would allow
you to slide a new RPC system underneath an existing collection of remote procedures
without having to change the interface. SELECT also manages concurrency.
SELECT, CHAN, and BLAST, although complete and correctly functioning protocols,
have been neither standardized nor widely adopted. We now turn our discussion to a
widely used RPC protocol—SunRPC. Ironically, SunRPC has also not been approved
by any standardization body, but it has become a de facto standard, thanks to its wide
distribution with Sun workstations and to the central role it plays in Sun’s popular
Network File System (NFS). At the time of this writing, the IETF is considering officially
adopting SunRPC as a standard Internet protocol.
Fundamentally, any RPC protocol must worry about three issues: fragmenting large messages, synchronizing request and reply messages, and dispatching request messages to the appropriate procedure. SunRPC is no exception. Unlike the
SELECT/CHAN/BLAST stack, however, SunRPC addresses these three functions in
a different order and uses slightly different algorithms. The basic SunRPC protocol
graph is given in Figure 5.19.
Figure 5.19 Protocol graph for SunRPC.
First, SunRPC implements the core request/reply algorithm; it is CHAN’s counterpart. SunRPC differs from CHAN, however, in that it does not technically guarantee
at-most-once semantics; there are obscure circumstances under which a duplicate
copy of a request message is delivered to the server (see below). Second, the role of
SELECT is split between UDP and SunRPC—UDP dispatches to the correct program,
and SunRPC dispatches to the correct procedure within the program. (We discuss how
procedures are identified in more detail below.) Finally, the ability to send request and
reply messages that are larger than the network MTU, corresponding to the functionality implemented in BLAST, is handled by IP. Keep in mind, however, that IP is
not as persistent as BLAST is in implementing fragmentation; BLAST uses selective
retransmission, whereas IP does not.
As just mentioned, SunRPC uses two-tier addresses to identify remote procedures: a 32-bit program number and a 32-bit procedure number. (There is also a
32-bit version number, but we ignore that in the following discussion.) For example,
the NFS server has been assigned program number 100003, and within this program, getattr is procedure 1, setattr is procedure 2, read is procedure 6, write is
procedure 8, and so on. Each program is reachable by sending a message to some
UDP port. When a request message arrives at this port, SunRPC picks it up and calls
the appropriate procedure.
To determine which port corresponds to a particular SunRPC program number,
there is a separate SunRPC program, called the Port Mapper, that maps program numbers to port numbers. The Port Mapper itself also has a program number (100000)
that must be translated into some UDP port. Fortunately, the Port Mapper is always
present at a well-known UDP port (111). The Port Mapper program supports several
procedures, one of which (procedure number 3) is the one that performs the program-to-port number mapping.
Thus, to send a request message to NFS’s read procedure, a client first sends a
request message to the Port Mapper at well-known UDP port 111, asking that procedure 3 be invoked to map program number 100003 to the UDP port where the
NFS program currently resides. (In practice, NFS is such an important program that
it is given its own well-known UDP port, so the Port Mapper need not be involved in
finding it.) The client then sends a SunRPC request message with procedure number
6 to this UDP port, and the SunRPC module listening at that port calls the NFS read
procedure. The client also caches the program-to-port number mapping so that it need
not go back to the Port Mapper each time it wants to talk to the NFS program.
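The client-side caching of program-to-port mappings can be sketched as follows. The code is a stand-in under stated assumptions: ask_port_mapper fakes the network query with an arbitrary formula, and PortCache, lookup_port, and the cache size are hypothetical. The portmap_queries counter exists only so that the effect of the cache can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CACHE_SLOTS = 8 };   /* hypothetical cache size */

typedef struct {
    uint32_t program;
    uint16_t port;
    int      valid;
} Mapping;

typedef struct {
    Mapping slot[CACHE_SLOTS];
    int     portmap_queries;   /* how often the Port Mapper was asked */
} PortCache;

/* Stand-in for a request to the Port Mapper; the formula is made up. */
static uint16_t ask_port_mapper(uint32_t program)
{
    return (uint16_t)(2000 + program % 1000);
}

/* Return the port for a program, consulting the Port Mapper only on
   a cache miss. */
uint16_t lookup_port(PortCache *c, uint32_t program)
{
    int free_slot = -1;
    uint16_t port;

    assert(c != NULL);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->slot[i].valid && c->slot[i].program == program)
            return c->slot[i].port;            /* cache hit */
        if (!c->slot[i].valid && free_slot < 0)
            free_slot = i;
    }
    c->portmap_queries++;                      /* cache miss */
    port = ask_port_mapper(program);
    if (free_slot >= 0) {
        c->slot[free_slot].program = program;
        c->slot[free_slot].port = port;
        c->slot[free_slot].valid = 1;
    }
    return port;
}
```

The first lookup for a program costs one exchange with the Port Mapper; later lookups for the same program are answered locally. A fuller version would also need a way to invalidate an entry when the server restarts on a different port.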
The actual SunRPC header is defined by a complex nesting of data structures.
Figure 5.20 gives the essential details for the case in which the call completes without any problems. XID is a unique transaction id, much like CHAN’s MID field. The
reason that SunRPC cannot guarantee at-most-once semantics is that on the server
side, SunRPC does not remember that it has already seen a particular XID once it has
successfully completed the transaction. This is only a problem if the client retransmits
a request message as a result of a timeout and that request message is in transit at
exactly the same time as the reply to the original request is on its way from the server
back to the client. When the retransmitted request arrives at the server, it looks like a
new transaction, since the server thinks it has already completed the transaction with
this XID. Clearly, if the reply arrives at the client before the timeout, then the request
will not be retransmitted. Likewise, if the retransmitted request arrives at the server
before the reply has been generated, then the server will recognize that transaction XID
is already in progress, and it will discard the duplicate request message. So it is really
quite unlikely that this erroneous behavior will occur. Note that the server’s short-term
memory about XIDs also means that it cannot protect itself against messages that have
been delayed for a long time in the network. This has not been a serious problem with
SunRPC, however, because it was originally designed for use on a LAN.
Figure 5.20 SunRPC header formats: (a) request; (b) reply. Fields legible in the figure: MsgType = CALL, MsgType = REPLY, RPCVersion = 2, Credentials (variable), Verifier (variable).
Returning to the SunRPC header format, the request message contains variable-length Credentials and Verifier fields, both of which are used by the client to authenticate
itself to the server, that is, to give evidence that the client has the right to invoke the
server. How a client authenticates itself to a server is a general issue that must be
addressed by any protocol that wants to provide a reasonable level of security. This
topic is discussed in more detail in Chapter 8.
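To make the request header concrete, the following sketch encodes the fixed part of a call message as a sequence of 32-bit big-endian words. It assumes the simplest case, in which the Credentials and Verifier fields carry null authentication (a flavor of 0 and a zero-length body each), so the whole header occupies ten words. The function names put_u32 and encode_call are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    RPC_CALL       = 0,    /* MsgType value for a request */
    RPC_VERSION    = 2,    /* RPCVersion */
    AUTH_NONE      = 0,    /* null authentication flavor */
    CALL_HDR_BYTES = 40    /* ten 32-bit words */
};

/* Store one 32-bit word in big-endian order; return the next position. */
static uint8_t *put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
    return p + 4;
}

/* Encode a call header with null credentials and verifier.
   Returns the number of bytes written, or 0 if buf is too small. */
size_t encode_call(uint8_t *buf, size_t cap, uint32_t xid,
                   uint32_t prog, uint32_t vers, uint32_t proc)
{
    uint8_t *p = buf;

    assert(buf != NULL);
    if (cap < CALL_HDR_BYTES)
        return 0;
    p = put_u32(p, xid);
    p = put_u32(p, RPC_CALL);
    p = put_u32(p, RPC_VERSION);
    p = put_u32(p, prog);
    p = put_u32(p, vers);
    p = put_u32(p, proc);
    p = put_u32(p, AUTH_NONE);   /* credentials: flavor */
    p = put_u32(p, 0);           /* credentials: body length */
    p = put_u32(p, AUTH_NONE);   /* verifier: flavor */
    p = put_u32(p, 0);           /* verifier: body length */
    return (size_t)(p - buf);
}
```

With real credentials the two authentication fields grow, which is why they are shown as variable length; the six words that precede them keep the same positions.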
The Distributed Computing Environment (DCE) defines another widely used RPC
protocol, which we call DCE-RPC. DCE is a set of standards and software for building distributed systems. It was defined by the Open Software Foundation (OSF),
a consortium of computer companies that originally included IBM, Digital, and
Hewlett-Packard; today OSF goes by the name Open Group. DCE-RPC is the RPC
protocol at the core of the DCE system. It can be used with the Network Data
Representation (NDR) stub compiler described in Chapter 7, but it also serves as
the underlying RPC protocol for the Common Object Request Broker Architecture
(CORBA), which is an industrywide standard for building distributed, object-oriented systems.
DCE-RPC is designed to run on top of UDP. It is similar to SunRPC in that it
defines a two-level addressing scheme: UDP demultiplexes to the correct server, DCE-RPC dispatches to a particular procedure exported by that server, and clients consult an
“endpoint mapping service” (similar to SunRPC’s Port Mapper) to learn how to reach
a particular server. Unlike SunRPC, however, DCE-RPC implements at-most-once call
semantics. It does this in a single protocol that essentially combines the algorithms in
BLAST and CHAN. We focus our discussion on this aspect of DCE-RPC. (In truth,
DCE-RPC supports multiple call semantics, including an idempotent semantics similar
to SunRPC’s, but at-most-once is the default behavior.)
Figure 5.21 gives a timeline for the typical exchange of messages, where each
message is labelled by its DCE-RPC type. The pattern is similar to CHAN’s: The client
sends a Request message, the server eventually replies with a Response message, and
the client acknowledges (Ack) the response. Instead of the server acknowledging the
request messages, however, the client periodically sends a Ping message to the server,
which responds with a Working message to indicate that the remote procedure is still
in progress. Although not shown in the figure, other message types are also supported.
For example, the client can send a Quit message to the server, asking it to abort an earlier
call that is still in progress; the server responds with a Quack (quit acknowledgment)
message. Also, the server can respond to a Request message with a Reject message
(indicating that a call has been rejected), and it can respond to a Ping message with a
Nocall message (indicating that the server has never heard of the caller).
Figure 5.21 Typical DCE-RPC message exchange.
In addition to the message type, request and reply messages include four key
fields that are used to implement both the fragmentation/reassembly aspects of BLAST
and the message transaction aspects of CHAN. The four fields are ServerBoot, ActivityId,
SequenceNum, and FragmentNum.
The ServerBoot field serves the same purpose as CHAN’s BID (boot id) field: The
server records its boot time in a global variable each time it starts up, and it includes this
variable in each call it services. The ActivityId field is similar to CHAN’s CID (channel id)
field: It identifies a logical connection between the client and server on which a sequence
of calls can be made. The SequenceNum field then distinguishes between calls made
as part of the same activity; it serves the same purpose as CHAN’s MID (message id)
and SunRPC’s xid (transaction id) fields. Like CHAN (and unlike SunRPC), DCE-RPC
keeps track of the last sequence number used as part of a particular activity, so as to
ensure at-most-once semantics.
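The bookkeeping described above can be sketched as follows. This is a minimal illustration, not DCE-RPC's actual implementation: the `Server` class, the `handle_request` method, and the `"Duplicate"` return label are hypothetical names, and the sketch assumes each request carries the boot time the client last learned from the server. A real implementation would typically resend the saved reply for a duplicate rather than just flag it.

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    """Hypothetical server-side state for at-most-once call handling."""
    boot_id: int                                   # plays the role of ServerBoot
    last_seq: dict = field(default_factory=dict)   # ActivityId -> last SequenceNum run
    executed: list = field(default_factory=list)   # log of calls actually executed

    def handle_request(self, server_boot, activity_id, sequence_num):
        # A request stamped with a different boot time was meant for an
        # earlier incarnation of the server, so it is refused.
        if server_boot != self.boot_id:
            return "Reject"
        # A sequence number no greater than the last one seen on this
        # activity is a duplicate or stale retransmission: never run it twice.
        if sequence_num <= self.last_seq.get(activity_id, -1):
            return "Duplicate"
        self.last_seq[activity_id] = sequence_num
        self.executed.append((activity_id, sequence_num))
        return "Response"
```

The key point is that the server remembers one number per activity rather than one per call, which is what lets it recognize a retransmitted request without keeping an unbounded history.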
Because both request and response messages may be larger than the underlying
network packet size, they may be fragmented into multiple packets. The FragmentNum
field uniquely identifies each fragment that makes up a given request or reply message.
Unlike BLAST, which uses a bit-vector to identify fragments, each DCE-RPC fragment
is assigned a unique fragment number (e.g., 0, 1, 2, 3, and so on). Both the client and
server implement a selective acknowledgment mechanism, which works as follows.
(We describe the mechanism in terms of a client sending a fragmented request message
to the server; the same mechanism applies when a server sends a fragmented response to
the client.)
First, each fragment that makes up the request message contains both a unique
FragmentNum and a flag indicating whether this packet is a fragment of a call (frag) or
the last fragment of a call (last frag); request messages that fit in a single packet carry a
no frag flag. The server knows it has received the complete request message when it has
the last frag packet and there are no gaps in the fragment numbers. Second, in response
to each arriving fragment, the server sends a Fack (fragment acknowledgment) message
to the client. This acknowledgment identifies the highest fragment number up to which the
server has received every fragment without gaps. In other words, the acknowledgment is cumulative,
much like in TCP. In addition, however, the server selectively acknowledges any higher
fragment numbers it has received out of order. It does so with a bit-vector that identifies
these out-of-order fragments relative to the highest in-order fragment it has received.
Finally, the client responds by retransmitting the missing fragments.
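The selective acknowledgment just described can be sketched in a few lines. The function names `build_fack` and `missing_fragments` are hypothetical, and the bit layout is an assumption made for illustration: bit k-1 of the vector, counting from the least significant bit, stands for the fragment k positions above the highest in-order fragment.

```python
def build_fack(received):
    """Return (highest_in_order, sel_ack_bits) for a set of received fragment numbers.

    Assumed layout: bit k-1 of sel_ack_bits (least significant bit first)
    stands for fragment highest_in_order + k.
    """
    highest = -1
    while highest + 1 in received:        # cumulative part: longest gap-free prefix
        highest += 1
    bits = 0
    for frag in received:
        if frag > highest:                # selective part: out-of-order arrivals
            bits |= 1 << (frag - highest - 1)
    return highest, bits

def missing_fragments(highest, bits):
    """Fragments to retransmit: the gaps below the highest selectively acked fragment."""
    missing = []
    top = bits.bit_length()               # offset of the highest selectively acked fragment
    for offset in range(1, top):
        if not bits & (1 << (offset - 1)):
            missing.append(highest + offset)
    return missing
```

For a receiver holding fragments 0 through 20 plus 23, 25, and 26, `build_fack` reports 20 as the highest in-order fragment and `missing_fragments` yields 21, 22, and 24. Under this assumed numbering those three out-of-order fragments encode as 0x34 rather than the 0x36 printed in Figure 5.22, so the exact wire layout should be checked against the DCE-RPC specification.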
Figure 5.22 illustrates how this all works. Suppose the server has successfully
received fragments up through number 20, plus fragments 23, 25, and 26. The server
responds with a Fack that identifies fragment 20 as the highest in-order fragment,
plus a bit-vector (SelAck) with the third (23 = 20 + 3), fifth (25 = 20 + 5), and
sixth (26 = 20 + 6) bits turned on. So as to support an (almost) arbitrarily long
bit-vector, the size of the vector (measured in 32-bit words) is given in the SelAckLen
field.
Figure 5.22 Fragmentation with selective acknowledgments. (The Fack shown carries Type = Fack, FragmentNum = 20, WindowSize = 10, SelAckLen = 1, and SelAck[1] = 0x36, with the set bits annotated 3 + 20, 5 + 20, and 6 + 20.)
Given DCE-RPC’s support for very large messages (the FragmentNum field
is 16 bits long, meaning it can support 64K fragments), it is not appropriate for
the protocol to blast all the fragments that make up a message as fast as it can,
as BLAST does, since doing so might overrun the receiver. Instead, DCE-RPC implements a flow-control algorithm that is very similar to TCP’s. Specifically, each
Fack message not only acknowledges received fragments, but also informs the sender
of how many fragments it may now send. This is the purpose of the WindowSize
field in Figure 5.22, which serves exactly the same purpose as TCP’s AdvertisedWindow field except it counts fragments rather than bytes. DCE-RPC also implements a congestion-control mechanism that is similar to TCP’s, which we will see in
Chapter 6.
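To make the fragment-counting window concrete, here is a small sketch of how a sender might react to a Fack. The function `fragments_to_send` and its parameters are hypothetical, and the policy of filling reported gaps before sending new fragments is an assumption; the text above says only that WindowSize tells the sender how many fragments it may now send.

```python
def fragments_to_send(next_unsent, total, missing, window_size):
    """Pick which fragment numbers to transmit after receiving a Fack.

    next_unsent -- lowest fragment number that has never been transmitted
    total       -- number of fragments in the whole message
    missing     -- fragment numbers the receiver reported as gaps
    window_size -- WindowSize from the Fack, counted in fragments (not bytes)
    """
    # Assumed policy: retransmit gaps first, then continue with new fragments.
    candidates = sorted(missing) + list(range(next_unsent, total))
    # The window caps how many fragments may be sent in response to this Fack.
    return candidates[:window_size]
```

With a window of 10, three reported gaps, and fragment 27 next in line, the sender would retransmit the three gaps and then send seven new fragments. A window of zero stops it entirely.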
5.4 Performance
Recall that Chapter 1 introduced the two quantitative metrics by which network
performance is evaluated: latency and throughput. As mentioned in that discussion,
these metrics are influenced not only by the underlying hardware (e.g., propagation
delay and link bandwidth) but also by software overheads. Now that we have a
complete software-based protocol graph available to us that includes alternative
transport protocols, we can discuss how to meaningfully measure its performance.
Figure 5.23 Measured system: Two Pentium workstations running Linux connected by
a 100-Mbps Ethernet.
The importance of such measurements is that they represent the performance seen by
application programs.
We begin, as any report of experimental results should, by describing our experimental method. This includes the apparatus used in the experiments. We ran our
experiments on a pair of 733-MHz Pentium workstations connected by an isolated
100-Mbps Ethernet. The Ethernet spanned a single machine room so propagation is
not an issue, making this a measure of processor/software overheads. Each workstation was running the Linux operating system (2.4 kernel). A test program running on
top of the socket interface ping-pongs (reflects) messages between the two machines.
Figure 5.23 illustrates one round-trip between the two test programs.
Each experiment involved running three identical instances of the same test. Each
test, in turn, involved sending a message of some specified size back and forth between
the two machines 10,000 times. The system’s clock was read at the beginning and
end of each test, and the difference between these two times was divided by 10,000
to determine the time taken for each round-trip. The average of these three times
(the three runs of the test) is reported for each experiment below. Each experiment
involved a different-sized message. The latency numbers were for message sizes of
1 byte, 100 bytes, 200 bytes, . . . , 1000 bytes. The throughput results were for message
sizes of 1 KB, 2 KB, 4 KB, 8 KB, . . . , 32 KB.
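The method above can be approximated with a short test program. This is a sketch under stated assumptions, not the program used for the reported measurements: it runs both ends on one machine over the loopback interface, uses UDP only, and defaults to fewer round-trips than the 10,000 used in the experiments. The names `measure_rtt` and `experiment` are hypothetical.

```python
import socket
import threading
import time

def measure_rtt(msg_size, rounds=1000):
    """Average round-trip time (seconds) for msg_size-byte UDP messages on loopback."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    server.settimeout(5.0)
    addr = server.getsockname()

    def reflect():
        # Ping-pong: send every message straight back to whoever sent it.
        for _ in range(rounds):
            data, peer = server.recvfrom(65535)
            server.sendto(data, peer)

    reflector = threading.Thread(target=reflect, daemon=True)
    reflector.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(5.0)
    payload = b"x" * msg_size
    start = time.perf_counter()            # read the clock once before the test
    for _ in range(rounds):
        client.sendto(payload, addr)
        client.recvfrom(65535)
    elapsed = time.perf_counter() - start  # and once after it
    reflector.join()
    client.close()
    server.close()
    return elapsed / rounds                # time per round-trip

def experiment(msg_size, runs=3, rounds=1000):
    """Average of several identical runs, as in the method described above."""
    return sum(measure_rtt(msg_size, rounds) for _ in range(runs)) / runs
```

Reading the clock only at the start and end of each run, rather than around every round-trip, keeps the cost of the timing calls out of the measurement. Numbers obtained on loopback reflect software overhead only and are not comparable to the Ethernet results.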
Table 5.3 Measured round-trip latencies (µs) for various message sizes (bytes) and protocols.
Table 5.3 gives the results of the latency test. As you would expect, latency
increases with message size. Although there are sometimes special cases where you
might be interested in the latency of, say, a 200-byte message, typically the most
important latency number is the 1-byte case. This is because the 1-byte case represents
the overhead involved in processing each message that does not depend on the amount
of data contained in the message. It is typically the lower bound on latency, representing
factors like the speed-of-light delay and the time taken to process headers. Note that
there is a small difference between the latency experienced by the two different protocol
stacks, with UDP round-trip times a bit less than for TCP. This is to be expected since
TCP provides more functionality.
The results of the throughput test are given in Figure 5.24. Here, we show only
the results for UDP. The key thing to notice in this graph is that throughput improves
as the messages get larger. This makes sense—each message involves a certain amount
of overhead, so a larger message means that this overhead is amortized over more
bytes. The throughput curve flattens off above 16 KB, at which point the per-message
overhead becomes insignificant when compared to the large number of bytes that the
protocol stack has to process.
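The amortization argument can be illustrated with a simple model in which every message costs a fixed overhead plus the time to push its bits onto the link. The model and the function `throughput_mbps` are assumptions made for illustration, and any overhead value plugged in (for example 100 µs) is a made-up figure, not a measurement from this section.

```python
def throughput_mbps(msg_bytes, per_msg_overhead_us, link_mbps):
    """Throughput when each message costs a fixed overhead plus its transmit time."""
    bits = msg_bytes * 8
    transmit_us = bits / link_mbps         # 1 Mbps is one bit per microsecond
    return bits / (per_msg_overhead_us + transmit_us)

# Hypothetical inputs: 100 microseconds of overhead per message, 100-Mbps link.
for kb in (1, 2, 4, 8, 16, 32):
    print(kb, "KB:", round(throughput_mbps(kb * 1024, 100.0, 100.0), 1), "Mbps")
```

In this model the throughput rises with message size and approaches, but never reaches, the link rate. The larger the message, the smaller the share of the total time taken by the fixed overhead, which is why the curve flattens.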
A second thing to notice is that the throughput curve tops out before reaching
100 Mbps. Although it can’t be deduced from these measurements, it turns out that
the factor preventing our system from running at the full Ethernet speed is a limitation
of the network adaptor rather than the software.
Figure 5.24 Measured throughput (Mbps) using UDP, for various message sizes (KB).
5.5 Summary
This chapter has described three very different end-to-end protocols. The first protocol
we considered is a simple demultiplexer, as typified by UDP. All such a protocol does
is dispatch messages to the appropriate application process based on a port number.
It does not enhance the best-effort service model of the underlying network in any
way, or said another way, it offers an unreliable, connectionless datagram service to
application programs.
The second type is a reliable byte-stream protocol, and the specific example of
this type that we looked at is TCP. The challenges faced with such a protocol are to
recover from messages that may be lost by the network, to deliver messages in the same
order in which they are sent, and to allow the receiver to do flow control on the sender.
TCP uses the basic sliding window algorithm, enhanced with an advertised window, to
implement this functionality. The other item of note for this protocol is the importance
of an accurate timeout/retransmission mechanism. Interestingly, even though TCP is
a single protocol, we saw that it employs at least five different algorithms—sliding
window, Nagle, three-way handshake, Karn/Partridge, and Jacobson/Karels—all of
which can be of value to any end-to-end protocol.
The third transport protocol we looked at is a request/reply protocol that forms
the basis for RPC. In this case, a combination of three different algorithms is employed to implement the request/reply service: a selective retransmission algorithm that
is used to fragment and reassemble large messages, a synchronous channel algorithm
that pairs the request message with the reply message, and a dispatch algorithm that
causes the correct remote procedure to be invoked.
Open Issue: Application-Specific Protocols
What should be clear after reading this chapter is that transport protocol
design is a tricky business. As we have seen, getting a transport protocol
right in the first place is hard enough, but changing circumstances make
matters more complicated. The challenge is finding ways to adapt to these changes.
Our experience with using the protocol can change. As we saw with TCP’s timeout mechanism, experience led to a series of refinements in how TCP decides to retransmit a segment. None of these changes affected the format of the TCP header, however,
and so they could be incorporated into TCP one implementation at a time. That is,
there was no need for everyone to upgrade their version of TCP on the same day.
The characteristics of the underlying network can also change. For many years,
TCP’s 32-bit sequence number and 16-bit advertised window were more than adequate. Recently, however, higher-bandwidth networks have meant that the sequence
number is not large enough to protect against wraparound, and the advertised window
is too small to allow the sender to fill the network pipe. While an obvious solution
would have been to redefine the TCP header to include a 64-bit sequence number field
and a 32-bit advertised window field, this would have introduced the very serious problem of how several million Internet hosts would make the transition from the current
header to this new header. While such transitions have been performed on production
networks, including the telephone network, they are no trivial matter. It was decided,
therefore, to implement the necessary extensions as options and to allow hosts to negotiate with each other as to whether or not they will use the options for each connection.
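A little arithmetic shows why both fields run out of room as links get faster. The two functions below are a sketch; the 1-Gbps bandwidth and 100-ms round-trip time used in the example are assumed values chosen for illustration.

```python
def wraparound_seconds(seq_bits, bandwidth_bps):
    """Time to use up the sequence space when every byte consumes one sequence number."""
    return (2 ** seq_bits) / (bandwidth_bps / 8)

def window_bytes_needed(bandwidth_bps, rtt_seconds):
    """Delay x bandwidth product: bytes that must be in flight to keep the pipe full."""
    return bandwidth_bps / 8 * rtt_seconds

# Assumed link: 1 Gbps with a 100-ms round-trip time.
print(wraparound_seconds(32, 1e9))    # roughly 34 seconds until a 32-bit space wraps
print(window_bytes_needed(1e9, 0.1))  # about 12.5 million bytes needed in flight
print(2 ** 16 - 1)                    # the most a 16-bit window can advertise
```

On such a link the 32-bit sequence space wraps in about half a minute, and the pipe holds far more data than a 16-bit window can advertise, which is the pressure that motivated the extensions.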
This approach will not work indefinitely, however, since the TCP header has room
for only 40 bytes of options. (This is because the HdrLen field is 4 bits long, meaning that
the total TCP header length cannot exceed 15 × 32-bit words, or 60 bytes, of which 20 bytes are the fixed header.) Of course,
a TCP option that extends the space available for options is always a possibility, but
you have to wonder how far it is worth going for the sake of backward compatibility.
Perhaps the hardest changes to accommodate are the adaptations to the level of
service required by application programs. It is inevitable that some applications will
have a good reason for wanting a slight variation from the standard services. For example, some applications want RPC most of the time, but occasionally want to be
able to send a stream of request messages without waiting for any of the replies. While
this is no longer technically the semantics of RPC, a common scenario is to modify
an existing RPC protocol to allow this flexibility. As another example, because video
is a stream-oriented application, it is tempting to use TCP as the transport protocol.
Unfortunately, TCP guarantees reliability, which is not important to the video application. In fact, a video application would rather drop a frame (segment) than wait for it
to be retransmitted. Rather than invent a new transport protocol from scratch, however, some designers have proposed that TCP should support an option that effectively
turns off its reliability feature. It seems that such a protocol could hardly be called TCP
anymore, but we are talking about the pragmatics of getting an application to run.
How to develop transport protocols that can evolve to satisfy diverse applications, many of which have not yet been imagined, is a hard problem. It is possible
that the ultimate answer to this problem is the one-function-per-protocol style exemplified by the microprotocols we used to implement RPC, or some similar mechanism
by which the application programmer is allowed to program, configure, or otherwise
customize the transport protocol.
Further Reading
There is no doubt that TCP is a complex protocol and that in fact it has subtleties not
illuminated in this chapter. Therefore, the recommended reading list for this chapter
includes the original TCP specification. Our motivation for including this specification
is not so much to fill in the missing details as to expose you to what an honest-to-goodness protocol specification looks like. The other two papers in the recommended
reading list focus on RPC. The paper by Birrell and Nelson is the seminal paper on
the topic, while the article by O’Malley and Peterson describes the one-function-per-protocol design philosophy in more detail.
■ USC-ISI. Transmission Control Protocol. Request for Comments 793, September 1981.
■ Birrell, A., and B. Nelson. Implementing remote procedure calls. ACM Transactions on Computer Systems 2(1):39–59, February 1984.
■ O’Malley, S., and L. Peterson. A dynamic network architecture. ACM Transactions on Computer Systems 10(2):110–143, May 1992.
Beyond the protocol specification, the most complete description of TCP, including its implementation in Unix, can be found in [Ste94b] and [SW95]. Also, the third
volume of Comer and Stevens’s TCP/IP series of books describes how to write
client/server applications on top of TCP and UDP, using the Posix socket interface
[CS00], the Windows socket interface [CS97], and the System V Unix TLI interface
Several papers evaluate the performance of different transport protocols at a very
detailed level. For example, the article by Clark et al. [CJRS89] measures the processing
overheads of TCP, a paper by Mosberger et al. [MPBO96] explores the limitations of
protocol processing overheads, and Thekkath and Levy [TL93] and Schroeder and
Burrows [SB89] examine RPC’s performance in great detail.
The original TCP timeout calculation was described in the TCP specification (see
above), while the Karn/Partridge algorithm was described in [KP91] and the Jacobson/
Karels algorithm was proposed in [Jac88]. The TCP extensions are defined by Jacobson
et al. [JBB92], while O’Malley and Peterson [OP91] argue that extending TCP in this
way is not the right approach to solving the problem.
Finally, there are several distributed operating systems that have defined their
own RPC protocol. Notable examples include the V system, described by Cheriton and
Zwaenepoel [CZ85]; Sprite, described by Ousterhout et al. [OCD+ 88]; and Amoeba,
described by Mullender [Mul90]. The latest version of SunRPC, as defined by Srinivasan [Sri95a], is a proposed standard for the Internet.
Exercises
1 If a UDP datagram is sent from host A, port P to host B, port Q, but at host B there
is no process listening to port Q, then B is to send back an ICMP Port Unreachable
message to A. Like all ICMP messages, this is addressed to A as a whole, not to
port P on A.
(a) Give an example of when an application might want to receive such ICMP messages.
(b) Find out what an application has to do, on the operating system of your choice,
to receive such messages.
(c) Why might it not be a good idea to send such messages directly back to the
originating port P on A?
2 Consider a simple UDP-based protocol for requesting files (based somewhat
loosely on the Trivial File Transport Protocol, TFTP). The client sends an
initial file request, and the server answers (if the file can be sent) with the first
data packet. Client and server then continue with a stop-and-wait transmission
(a) Describe a scenario by which a client might request one file but get another;
you may allow the client application to exit abruptly and be restarted with
the same port.
(b) Propose a change in the protocol that will make this situation much less likely.
3 Design a simple UDP-based protocol for retrieving files from a server. No authentication is to be provided. Stop-and-wait transmission of the data may be used.
Your protocol should address the following issues:
(a) Duplication of the first packet should not duplicate the “connection.”
(b) Loss of the final ACK should not necessarily leave the server in doubt as to
whether the transfer succeeded.
(c) A late-arriving packet from a past connection shouldn’t be interpretable as
part of a current connection.
4 This chapter explains three sequences of state transitions during TCP connection teardown. There is a fourth possible sequence, which traverses an additional
arc (not shown in Figure 5.7) from FIN_WAIT_1 to TIME_WAIT and labelled
FIN + ACK/ACK. Explain the circumstances that result in this fourth teardown sequence.
5 When closing a TCP connection, why is the two-segment-lifetime timeout not
necessary on the transition from LAST_ACK to CLOSED?
6 A sender on a TCP connection that receives a 0 advertised window periodically
probes the receiver to discover when the window becomes nonzero. Why would the
receiver need an extra timer if it were responsible for reporting that its advertised
window had become nonzero (i.e., if the sender did not probe)?
7 Read the man page (or Windows equivalent) for the Unix/Windows utility netstat.
Use netstat to see the state of the local TCP connections. Find out how long closing
connections spend in TIME_WAIT.
8 The sequence number field in the TCP header is 32 bits long, which is big enough
to cover over 4 billion bytes of data. Even if this many bytes were never transferred
over a single connection, why might the sequence number still wrap around from
2^32 − 1 to 0?
9 You are hired to design a reliable byte-stream protocol that uses a sliding window
(like TCP). This protocol will run over a 100-Mbps network. The RTT of the
network is 100 ms, and the maximum segment lifetime is 60 seconds.
(a) How many bits would you include in the AdvertisedWindow and SequenceNum fields of your protocol header?
(b) How would you determine the numbers given above, and which values might
be less certain?
10 You are hired to design a reliable byte-stream protocol that uses a sliding window
(like TCP). This protocol will run over a 1-Gbps network. The RTT of the network
is 140 ms, and the maximum segm